Using MD5SUM to Validate the Integrity of (Downloaded) Files
This tip is nothing new in itself, but it is an important reminder. It also happens to make a perfect example for our continuing series on Cygwin, the Linux-like environment for Windows.
MD5 Codes: MD5 stands for Message-Digest Algorithm 5. The MD5 algorithm takes a file (the “message”) of any size, and reduces it down to a 32-character hexadecimal code that looks like this: “ac30ce5b07b0018d65203fbc680968f5” (the “digest”). The brilliant thing about the MD5 algorithm is that if the message changes by so much as a single byte, it will produce a completely different digest. Think of MD5s as fingerprints for files.
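To see the fingerprint effect for yourself, here is a minimal sketch. The scratch directory and the two file names are placeholders invented for this demonstration; the two messages differ by a single byte.

```shell
# Scratch directory and file names are placeholders for this demonstration.
workdir=$(mktemp -d)
printf 'The quick brown fox\n' > "$workdir/message1.txt"
printf 'The quick brown fax\n' > "$workdir/message2.txt"   # one byte changed

# Reading from standard input makes md5sum print "<digest>  -";
# cut keeps only the digest field.
digest1=$(md5sum < "$workdir/message1.txt" | cut -d' ' -f1)
digest2=$(md5sum < "$workdir/message2.txt" | cut -d' ' -f1)

echo "message1: $digest1"
echo "message2: $digest2"
```

The two printed digests should have nothing obvious in common, even though the inputs are nearly identical.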
The MD5 algorithm has many uses. Foremost is the ability to validate that when a data file is transmitted from point A to point B, it arrives intact, without distortion. This is done by calculating the MD5 string that corresponds to the original file (at point A), calculating the MD5 again using the copy of the file (at point B), and then comparing the two MD5 strings.
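The point A / point B comparison can be sketched as follows. The helper name same_md5 and the file names are hypothetical, and the “transmission” is simulated with a local copy plus one deliberately altered file.

```shell
# All names below are made up for illustration.
workdir=$(mktemp -d)
printf 'payload sent from point A\n' > "$workdir/original.dat"
cp "$workdir/original.dat" "$workdir/good_copy.dat"               # arrived intact
printf 'payload sent from point B\n' > "$workdir/bad_copy.dat"    # simulated damage

# Hypothetical helper: succeeds only if both files have the same MD5 digest.
same_md5() {
    a=$(md5sum < "$1" | cut -d' ' -f1)
    b=$(md5sum < "$2" | cut -d' ' -f1)
    [ "$a" = "$b" ]
}

if same_md5 "$workdir/original.dat" "$workdir/good_copy.dat"; then
    good_result="intact"
else
    good_result="damaged"
fi

if same_md5 "$workdir/original.dat" "$workdir/bad_copy.dat"; then
    bad_result="intact"
else
    bad_result="damaged"
fi

echo "good copy: $good_result"
echo "bad copy:  $bad_result"
```

In real use the two digests are calculated on different machines, and only the short digest string has to be communicated between them.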
For Example, Validating Files Downloaded From a Mirror Site: When downloading programs from mirror sites, it’s always a good idea to check the integrity of those files, to ensure that nobody has mucked with them (either accidentally or maliciously) before they get to you. Say, for example, that you intend to download the latest version of Apache Ant, which, if you dabble in Java programming at all, you may recognize as a tool that is commonly required for compiling and building Java projects. (Trivia tidbit: Ant stands for “Another Neat Tool.”) For the purposes of this demonstration, all that matters is that Ant is a very popular tool that is mirrored on file servers across the globe so as to spread the bandwidth load.
Step-by-Step: Downloading the Files: This first part of the example is just to walk you through what it means to download a file (from a mirrored site) and how to obtain the corresponding MD5 code (from the main site). If you are familiar with this process, then skip to the next section.
- Bring up the download page (http://ant.apache.org/bindownload.cgi) and you’ll see that you are asked to select a mirror. Do that.
- Below that is the opportunity to select which file(s) are to be downloaded. In this case, the choices are a ZIP version, a TAR GZ version, and a TAR BZ2 version. Do not click on any of them yet.
- Notice that next to each file there are three links: PGP, SHA1, MD5. These are offered as ways to validate the integrity of the file you’re about to download. Click on the MD5 link next to the selected file.
- This brings up a file in the web browser that has nothing but a short string of letters and numbers like this: “ac30ce5b07b0018d65203fbc680968f5”. Save this text file to your hard disk (e.g. as C:\downloads\ant.md5).
- Now, go ahead and download the file itself (e.g. as C:\downloads\apache-ant-1.7.0-bin.zip). The ZIP file will come from the mirror site, but the MD5 will have come directly from the main Apache server.
Validating the File Using the MD5 Code: So, we now have a file that we’ve downloaded, together with the MD5 that was calculated against the original file. What we need to do now is calculate the MD5 for our copy of the file and check it against the MD5 for the original. There are many programs available for the Windows platform that calculate MD5s, but we are going to use the Linux MD5SUM command (within Cygwin).
- Open a Cygwin command prompt window (or a true Linux command prompt window, for that matter).
- Navigate to the folder that contains the downloaded file and the file with the MD5 string (e.g. “cd c:/downloads”, which, as you may recall, in Cygwin is the equivalent of “cd /cygdrive/c/downloads”). Note the forward slash: the bash shell treats an unquoted backslash as an escape character.
- Use the LS command to be reminded of the names of the files in that folder.
- Use the CAT command to display the contents of the MD5 file (which is the MD5 string that we are using for comparison purposes).
- Finally, run the MD5SUM command to calculate the MD5 anew using the downloaded copy of the file. It should be identical to the MD5 code obtained from the main site (the one displayed by the CAT command).
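Put together, the session looks roughly like the sketch below. So that it can run anywhere, it works in a scratch directory and creates stand-in files: the “ZIP” is just a line of text, and ant.md5 is generated locally rather than saved from the main site. With a real download you would skip the setup and start from “cd /cygdrive/c/downloads”.

```shell
# Stand-in setup (assumption: not part of a real download session).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'pretend this is the Ant archive\n' > apache-ant-1.7.0-bin.zip
md5sum < apache-ant-1.7.0-bin.zip | cut -d' ' -f1 > ant.md5   # stands in for the published MD5

# The steps from the list above.
ls                                  # remind ourselves of the file names
cat ant.md5                         # the MD5 obtained from the main site
md5sum apache-ant-1.7.0-bin.zip     # the MD5 of our downloaded copy

# Optional: let the shell do the comparison instead of eyeballing it.
expected=$(tr -d '[:space:]' < ant.md5)
actual=$(md5sum < apache-ant-1.7.0-bin.zip | cut -d' ' -f1)
if [ "$expected" = "$actual" ]; then
    verdict="match"
else
    verdict="MISMATCH"
fi
echo "$verdict"
```

The final comparison assumes the saved MD5 file contains only the digest string, as described in the download steps.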
Tip #1: SHA1 is an alternative algorithm to MD5. It is used the same way, but produces a longer digest. The command to calculate an SHA1 digest is SHA1SUM. Everything in this article that talks about MD5SUM also applies to SHA1SUM.
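As a quick illustration, the sketch below runs both commands on the same throwaway file (created here just for the demonstration) and reports the length of each digest.

```shell
# Throwaway input file, created only for this comparison.
sample=$(mktemp)
printf 'same input for both algorithms\n' > "$sample"

md5_digest=$(md5sum < "$sample" | cut -d' ' -f1)
sha1_digest=$(sha1sum < "$sample" | cut -d' ' -f1)

# ${#var} expands to the number of characters in the variable.
echo "MD5  (${#md5_digest} hex digits): $md5_digest"
echo "SHA1 (${#sha1_digest} hex digits): $sha1_digest"
```

An MD5 digest is 32 hexadecimal digits (128 bits), while an SHA1 digest is 40 (160 bits).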
Tip #2: Use MAN MD5SUM to read the manual pages for the command.
Tip #3: Notice that in the MD5SUM output, there is an asterisk preceding the filename. This means that MD5SUM treated the file as binary, and calculated the MD5 using every byte in the file verbatim. MD5SUM also has a text mode, in which the file is read as text, so that on platforms that distinguish text files from binary files (such as Windows) end-of-line characters may be translated before the digest is calculated. On GNU/Linux systems the two modes produce the same digest. (When running in text mode, rather than an asterisk preceding the file name, there will just be an extra space before the file name.) So, if at first the MD5 strings do not match, don’t panic. Try running the calculation in the other mode first (by adding a -b command-line switch to force a binary calculation, or a -t switch to force a text calculation).
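The two switches can be compared side by side with a sketch like this one. The file notes.txt is a placeholder created for the demonstration, and the exact formatting of the output line may vary between md5sum versions.

```shell
# Placeholder text file in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'line one\nline two\n' > notes.txt

binary_line=$(md5sum -b notes.txt)   # -b: force binary mode
text_line=$(md5sum -t notes.txt)     # -t: force text mode

echo "$binary_line"
echo "$text_line"

# Strip everything from the first space onward to keep only the digest.
binary_digest=${binary_line%% *}
text_digest=${text_line%% *}
```

Compare the character just before the file name in the two output lines, and then the digests themselves. Because this sample file uses plain Unix line endings, the two digests should agree.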
Tip #4: MD5SUM has an option to automatically compare a calculated MD5 digest to an expected result. This comes in especially handy when there are multiple files to be checked. To take advantage of this, prepare an ASCII file that contains all of the MD5 codes to be checked together with the corresponding filenames. There should be one line per MD5/filename pair. There should be two spaces between the MD5 and the file name (for a text calculation), or one space and one asterisk (for a binary calculation). In other words, each line of this prepared file should look exactly like the normal output from an MD5SUM run. If the prepared file is called “allmd5s.txt”, then the command to check it is “md5sum -c allmd5s.txt” (typed in lowercase, since command names are case-sensitive on Linux).
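Here is a minimal sketch of the check mode. The files one.txt and two.txt are stand-ins created for the demonstration; allmd5s.txt is generated by md5sum itself, which guarantees that it has the expected layout.

```shell
# Stand-in files in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'first file\n' > one.txt
printf 'second file\n' > two.txt

# Build the list of expected digests (one "<digest>  <filename>" line per file).
md5sum one.txt two.txt > allmd5s.txt

# Check every file on the list; exit status 0 means all of them matched.
clean_status=0
md5sum -c allmd5s.txt || clean_status=$?

# Now alter one file and check again; md5sum reports it and exits non-zero.
printf 'tampered\n' > two.txt
tampered_status=0
md5sum -c allmd5s.txt || tampered_status=$?

echo "clean run status:    $clean_status"
echo "tampered run status: $tampered_status"
```

On the first run each file is reported as OK. On the second run two.txt is reported as FAILED, and the non-zero exit status makes the check easy to use from a script.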