hpr4657 :: UNIX Curio #8 - Comparing Files

Finding out what's the diff, without using "diff"

Hosted by Vance on Tuesday, 2026-06-09 is flagged as Clean and is released under a CC-BY-SA license.
unix curio, unix, diff, cmp, comm. (Be the first).

Listen in ogg, opus, or mp3 format. Play now:

Duration: 00:14:20
Download the transcription and subtitles.

general.

This series is dedicated to exploring little-known—and occasionally useful—trinkets lurking in the dusty corners of UNIX-like operating systems.

Most users of UNIX-like systems are probably familiar with the diff utility. It is widely used with source code to compare two files and see what the differences are between them. Non-programmers, like me, also use it to examine what has changed in different versions of scripts or configuration files. Quite a few pieces of newer software can compare different versions of data and express changes in a format either identical to or similar to diff output.

However, there are two other long-standing tools for this purpose that are far less known and deserve in my view to be termed UNIX Curios. The first of these is cmp 1 . While diff is primarily intended to be used on text files and compares them line by line, cmp compares files byte by byte. In my experience, its main use is to see whether two binary files are in fact identical—if they are, cmp outputs nothing and returns an exit status of 0. Back when methods of transferring files were not as reliable as they are today, this was a tool I would reach for sometimes. For example, you could use it to confirm that the data on a CD-ROM you burned was the same as the original.

If there is a difference between the files, cmp will return an exit status of 1. By default, it will also print the location (byte and line number) of the first differing byte. When used with the -l option, it will print the location and value of every byte that differs. There is one exception to these: if the files are the same except that one is shorter than the other, it will print a message to that effect. The exit status will still be 1 in that case.

Using the -s option with cmp will cause it to be totally silent and output nothing. Only the exit status will indicate whether the files are the same, different, or if the exit status is greater than 1, that an error occurred. This makes it useful for scripting, for example in case you wanted to confirm that a file copied to another location arrived fully intact.

It is worth noting that diff is also capable of comparing binary files—however, it is not required by POSIX to report what is actually different or where differences occur. The same exit status as in cmp is returned: 0 if the files are the same, 1 if they are different, or greater than 1 if an error occurred. While many implementations offer an option to suppress the output, this is not in the standard 2 so the most portable method would be to instead redirect output to /dev/null . On my system the diff utility is three times the size of cmp , so if you don't need its extra capabilities, it is a less efficient way of doing the job.

The other UNIX Curio for today is comm , and this utility 3 is also intended to compare two files to see what is common between them. Ken Fallon briefly talked about it a few years ago in HPR episode 3889 . Compared to the others, it has a much more specific use case. The two files are expected to be text files that are already sorted. What comm will do is print a tab-separated list of all the lines appearing in either or both files. Lines only in the first file will appear in the first column, lines only in the second file will be in the second column, and lines in both files will be in the third column.

Any combination of the options -1 , -2 , and -3 can be used with comm to suppress printing of the first, second, or third column respectively. Using all three options at the same time is supported but it results in no output, so that isn't very useful. Unlike the other utilities, the exit status of comm doesn't tell you anything about the two files. It will be 0 if the program ran successfully, and greater than 0 if it didn't.

I'm not sure if I have ever actually used comm for anything practical. I find its default output a bit difficult to meaningfully interpret, plus you need to ensure the two files are already sorted. It seems to be best suited to comparing lists, and one use case that Ken Fallon mentioned would be comparing two lists of files to see if any are missing. The command comm -3 listA listB would print files that only appear in listA in the first column and those only in listB in the second column. This would let you ignore all the filenames that appear in both and focus on those that were absent from one or the other. If on the other hand you only wanted to see the filenames that are on both lists, comm -12 listA listB would give you that.

Some more frivolous potential uses also come to mind. If for some reason the cat utility is broken on your system, you could use comm listA /dev/null to print the file listA instead. If you want to insert tab characters before every line of a file but have an aversion to using sed or awk , then comm /dev/null listA would output listA with one tab before each line, and comm listA listA would insert two tabs. A bit silly, but it would work. The GNU implementation of comm even lets you choose something other than a tab to separate the columns 4 , so you could go wild with that.

According to the POSIX specifications for cmp and comm , one of the two filenames given as arguments, but not both, can be a " - ", in which case standard input will be used for that "file" in the comparison. Also, the results are undefined if both arguments are the same FIFO special, character special, or block special file. Some implementations might not have these limitations, but you shouldn't rely on that everywhere.

All three of these were developed quite early. The cmp utility appeared in 1971's First Edition UNIX 5 , while comm and diff seem to have made their debut in Fourth Edition UNIX 6,7 from 1973. The original versions might not have behaved exactly like their modern counterparts, and newer implementations (especially of the diff utility) have acquired additional options and capabilities, but the basic operation of each has stayed the same.

The next time you need to compare files against each other, consider whether cmp or comm might be appropriate before automatically reaching for diff . They all have their uses in different situations.

References:

  1. Cmp specification https://pubs.opengroup.org/onlinepubs/009695399/utilities/cmp.html
  2. Diff specification https://pubs.opengroup.org/onlinepubs/009695399/utilities/diff.html
  3. Comm specification https://pubs.opengroup.org/onlinepubs/009695399/utilities/comm.html
  4. GNU coreutils manual: comm https://www.gnu.org/software/coreutils/manual/html_node/comm-invocation.html
  5. First Edition UNIX cmp manual page http://man.cat-v.org/unix-1st/1/cmp
  6. Fourth Edition UNIX comm manual page https://www.tuhs.org/cgi-bin/utree.pl?file=V4/usr/man/man1/comm.1
  7. Fourth Edition UNIX diff source https://www.tuhs.org/cgi-bin/utree.pl?file=V4/usr/source/s1/diff1.c


Comments

Subscribe to the comments RSS feed.

Leave Comment

Note to Verbose Commenters
If you can't fit everything you want to say in the comment below then you really should record a response show instead.

Note to Spammers
All comments are moderated. All links are checked by humans. We strip out all html. Feel free to record a show about yourself, or your industry, or any other topic we may find interesting. We also check shows for spam :).

Provide feedback
Your Name/Handle:
Title:
Comment:
Anti Spam Question: What does the letter P in HPR stand for?