https://leancrew.com/all-this/2021/08/checking-it-twice/

Checking it twice
August 29, 2021 at 11:53 AM by Dr. Drang

Comparing lists is something I have to do fairly often in my work. There are lots of ways to do it, and the techniques I've used have evolved considerably over the years. This post is about where I've been and where I am now.

First, let's talk about the kinds of lists I deal with. They are typically sets of alphanumeric strings--serial numbers of products would be a good example. The reason I have two lists that need to be compared varies, but a common example would be one list of devices that were supposed to be collected from inventory and tested and another list of devices that actually were tested. This latter list is usually part of a larger data set that includes the test results. Before I get into analyzing the test results, I have to see whether all the devices from the first list were tested and whether some of the devices that were tested weren't on the first list.

In the old days, the lists came on paper, and the comparison consisted of me sitting at my desk with a highlighter pen, marking off the items in each list as I found them. This is still a pretty good way of doing things when the lists are small. By having your eyes go over every item in each list, you get a good sense of the data you'll be dealing with later. But when the lists come to me in electronic form, the "by hand" method is less appealing, especially when the lists are dozens or hundreds of items long. This is where software tools come into play.

The first step is data cleaning, a topic I don't want to get into in this post other than to say that the goal is to get two files, which we'll call listA.txt and listB.txt. Each file has one item per line. How you get the lists into this state depends on the form they took when you received them, but it typically involves copying, pasting, searching, and replacing. This is where you develop your strongest love/hate relationships with spreadsheets and regular expressions.

Let's say these are the contents of your two files: three-digit serial numbers. I'm showing them in parallel to save space in the post, but they are two separate files.

```
listA.txt    listB.txt
   115          114
   119          115
   105          106
   101          119
   113          105
   116          125
   102          114
   106          120
   114          101
   108          117
   120          113
   103          111
   107          112
   109          123
   118          116
   112          107
   110          118
   104          114
                102
                105
                110
                121
                122
                109
                104
```

You'll note that neither of these lists is sorted. Or at least they're not sorted in any obvious way. They may be sorted by the date on which the device was tested, and that might be important in your later analysis, so it's a good idea, as you go through the data cleanup, to preserve the ordering of the data as you received it.

At this stage, I typically look at each list separately and see if there's anything weird about it. One of the most common weirdnesses is duplication, which can be found through this simple pipeline:

```
sort listB.txt | uniq -c | sort -r | head
```

The result for our data is

```
   3 114
   2 105
   1 125
   1 123
   1 122
   1 121
   1 120
   1 119
   1 118
   1 117
```

What this tells us is that item 114 is in list B three times, item 105 is in list B twice, and all the others are there just once. At this point, some investigation is needed. Were there actually three tests run on item 114? Did someone mistype the serial number? Was the data I was given concatenated from several redundant sources and not cleaned up before sending it to me?

The workings of the pipeline are pretty simple. The first sort alphabetizes the list, which puts all the duplicate lines adjacent to one another. The uniq command prints the list without the repeated lines, and its -c ("count") option adds a prefix of the number of times each line appears. The second sort sorts this augmented list in reverse (-r) order, so the lines that are repeated appear at the top. Finally, head prints just the top ten lines of this last sorted list. Although head can be given an option that changes the number of lines that are printed, I seldom need to use that option, as my lists tend to have only a handful of duplicates at most.
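If all you want is a quick look at which items repeat, a minimal variation on the same idea--just a sketch using the same tools, not the pipeline above--is uniq's -d option, which prints one copy of each line that appears more than once:

```
# -d keeps only the repeated lines; an empty result means no duplicates
sort listA.txt | uniq -d
sort listB.txt | uniq -d
```

On listB.txt this prints just 105 and 114. The -c pipeline is still the better choice when you also want to know how many times each item repeats.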
Let's say we've figured out why list B had those duplicates. And we also learned that list A has no duplicates. Now it's time to compare the two lists. I used to do this by making sorted versions of each list,

```
sort listA.txt > listA-sorted.txt
sort -u listB.txt > listB-sorted.txt
```

and then compared the -sorted files.^1 This worked, but it led to cluttered folders with files that were only used once. More recently, I've avoided the clutter by using process substitution, a feature of the bash shell (and zsh) that's kind of like a pipeline but can be used when you need files instead of a stream of data. We'll see how this works later.

There are two main ways to compare files in Unix: diff and comm. diff is the more programmery way of comparison. It tells you not only what the differences are between the files, but also how to edit one file to make it the same as the other. That's useful, and we certainly could use diff to do our comparison, but it's both more verbose and more cryptic than we need.

The comm command takes two arguments, the files we want to compare. By default, it prints three columns of output: the first column consists of lines in the first file that don't appear in the second; the second column consists of lines that appear in the second file but not in the first; and the third column is the lines that appear in both files. Like this:

```
comm listA-sorted.txt listB-sorted.txt
```

gives output of

```
                101
                102
103
                104
                105
                106
                107
108
                109
                110
        111
                112
                113
                114
                115
                116
        117
                118
                119
                120
        121
        122
        123
        125
```

(The columns are separated by tab characters, but I've converted them to spaces here to make the output look more like what you'd see in the Terminal.)

This is typically more information than I want. In particular, the third column is redundant. If we know which items are only in list A and which items are only in list B, we know that all the others (usually the most numerous by far) are in both lists. We can suppress the printing of column 3 with the -3 option:

```
comm -3 listA-sorted.txt listB-sorted.txt
```

gives the more compact output

```
103
108
        111
        117
        121
        122
        123
        125
```

You can suppress the printing of the other columns with -1 and -2. So to get the items that are unique to list A, we do

```
comm -23 listA-sorted.txt listB-sorted.txt
```

There are a couple of things I hate about comm:

* What's up with that second m? This is a comparison program, not a communication program.
* The options are given in what my dad would call "yes, we have no bananas" form.^2 You don't tell comm the columns you want, you tell it the columns you don't want. Very intuitive.

I said earlier that I don't like making -sorted files. Now that we see how comm works with files, we'll switch to process substitution. Here's how to get the items unique to each list without creating sorted files first:

```
comm -3 <(sort listA.txt) <(sort -u listB.txt)
```

What goes inside the <() construct is the command we'd use to create the sorted file. I like this syntax, as the less-than symbol is reminiscent of the symbol used for redirecting input.
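For the record, here's a sketch of both one-sided comparisons done this way, assuming the same listA.txt and listB.txt files as above:

```
# Only in list A: supposed to be tested, but no results
comm -23 <(sort listA.txt) <(sort -u listB.txt)

# Only in list B: tested, but not on the original list
comm -13 <(sort listA.txt) <(sort -u listB.txt)
```

These split the comm -3 output into its two columns: the first command prints the flush-left column, the second the indented one.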
You may prefer using your text editor's diffing system because it's more visual. Here's BBEdit's difference window:

[BBEdit diff window]

Here's Kaleidoscope, which gets good reviews as a file comparison tool. I don't own it but downloaded the free trial so I could get this screenshot.

[Kaleidoscope comparison]

These are both nice, but they don't give you output you can paste directly into a report or email, something like this:

> The following items were supposed to be tested, but there are no results for them:
>
> [paste]
>
> What happened?

Update 08/31/2021 6:05 PM

Reader Geoff Tench emailed to tell me that comm is short for common, and he pointed to the first line of the man page:

```
comm -- select or reject lines common to two files
```

This is what you see when you type man comm in the Terminal. The online man page I linked to, at ss64.com, is slightly different:

> Compare two sorted files line by line. Output the lines that are common, plus the lines that are unique.

I focused on the compare and missed the common completely. So comm is a good abbreviation, but it's a good abbreviation of a poor description. By default, comm outputs three columns of text, and only one of those columns has the common lines. So comm is really more about unique lines than it is about common lines. Of course, uniq is taken, as are diff and cmp. And comp would probably be a bad name, given that cmp exists. diff is such a good command name, I often use it by mistake when I really mean to use comm. Now that I've written this exegesis of comm, I'll probably find it easier to remember. Thanks, Geoff!

---------------------------------------------------------------------

1. The -u option to sort makes it act like sort | uniq. It strips out the duplicates, which we don't want in the file when doing our comparisons.

2. My father wasn't a programmer, but it's not unusual to be asked to provide negative information. Tax forms, for example: "All income not included in Lines 5, 9, and 10e that weren't in the years in which you weren't a resident of the state."