RSS Feed Share on Twitter

You can buy this tutorial to keep, as a Paperback or eBook from Amazon, or Buy this tutorial as a PDF (RRP $5)

All Shell Scripting Tips

6 May 2020

The relatively obscure 'join' command

There are some older *nix commands which have tricky syntax, and can be intimidating to learn

Unix, and therefore the Shell (whether under Unix, Linux, or any other operating system), is very much text-file driven. The Unix philsophy of "everything is a file", combined with the flexibility and easily-edited nature of text files means that a lot of information in a Unix / Linux system is stored in plain-text files. This makes for easy management of data, without bespoke tools, and it also means that the Unix and Linux userspace (the command-line tools available to the shell) includes a lot of tools for manipulating text files.

One such command is called join, and as its syntax is a bit opaque, not many people use it very much. However, it is quite powerful and flexible, as we shall see.

The join command will combine two text files, based on a common key. So - a bit like with a SQL JOIN command, if you have two files with a common key somewhere in them, you can use the join command to merge these two files together. And you can actually manage the results quite flexibly, as we shall see.

Sample input files

Let's say we have two text files, one called countries.txt which provides a list of two-letter country codes (DE for Germany, FR for France, etc), and another called distros.txt which lists GNU/Linux distributions, their package management type (RPM/DEB/Zypper) and their countries of origin. For example:

countries.txt:

DE:Germany
FR:France
PT:Portugal
UK:United Kingdom
US:United States of America

distros.txt:

Knoppix:DEB:DE
OpenSuSE:Zypper:DE
APODIO:DEB:FR
Debian:DEB:NONE
Ubuntu:DEB:UK
Kali:DEB:US
Red Hat:RPM:US

First Run

We can use the join command to combine these files, because they both have the ISO-3166 country code. Here, the "-t:" means that the delimiter in these files is the colon (:) character (these older *nix commands are missing a bit of consistency - it would be nice if they all used cut's "-d" for "delimiter", but join and sort use "-t" instead)

The syntax continues to be rather odd, in that we now need to tell join that the first file (countries.txt) is keyed on the first field, because "DE", "FR", etc are the first field in that file. To do this, we say "-1 1".
Then we say "-2 3" to tell join that the second file (distros.txt) is keyed on the third field, because those two-letter country codes are the third field in distros.txt.
Finally, we tell join the names of file1 and file2: countries.txt and distros.txt, respectively.

$ join -t: -1 1 -2 3 countries.txt distros.txt 
DE:Germany:Knoppix:DEB
DE:Germany:OpenSuSE:Zypper
FR:France:APODIO:DEB
UK:United Kingdom:Ubuntu:DEB
US:United States of America:Kali:DEB
US:United States of America:Red Hat:RPM

We now have a combination of the two files, merged on their common key of the two-letter country code.

Caveat

An important caveat to be aware of is that both files need to be sorted alphabetically before starting, and they need to be sorted based on their key. So a simple "sort distros.txt" is not sufficient. The sort command uses the same "-t:" syntax as join to say that the delimiter is the colon (:) character, but it uses the "-k 3" argument to sort the file on the third field with that delimiter.

So to make sure that the countries.txt file is correctly sorted, we need to run: "sort -o countries.txt countries.txt" - again, sort is ancient, it doesn't have the simplicity of "sed -i", so we have to tell it "sort countries.txt and write the output to countries.txt". This invocation will replace the unsorted countries.txt with a new, sorted, version of the file.

To sort the distros.txt file requires a bit more - the key is in field three. So we run "sort -t: -k3 -o distros.txt distros.txt" to sort the file on field 3, with a delimiter of ":", writing the output to distros.txt, and reading the input from the same distros.txt file. Again, using the distros.txt name twice is necessary, and replaces the unsorted file with a sorted version.

Reformatting

We can do a bit more than this with the join command - we can format the output as we wish. We have told join what the fields are, so we can tell it which fields, from which files we want, and in what order. To do this, we use the "-o" option, which again has a non-standard syntax. We use the format "-o FILE.FIELD" syntax, where "-o 1.x" means field "x" in the first filename listed, and "-o 2.y" means field "y" in the second filename listed on the command line.

So we can get field #2 (Country Name) from countries.txt, merged with field #1 (distro name) from distro.txt:

$ join -t: -1 1 -2 3 -o 1.2,2.1 countries.txt distros.txt
Germany:Knoppix
Germany:OpenSuSE
France:APODIO
United Kingdom:Ubuntu
United States of America:Kali
United States of America:Red Hat

Or we might want to append the package manager (which is file 2, field 2, so append ",2.2" to the "-o" option:

$ join -t: -1 1 -2 3 -o 1.2,2.1,2.2 countries.txt distros.txt
Germany:Knoppix:DEB
Germany:OpenSuSE:Zypper
France:APODIO:DEB
United Kingdom:Ubuntu:DEB
United States of America:Kali:DEB
United States of America:Red Hat:RPM

We can reverse this by changing the "-o" output options, to mix-and-match between the input files. Here, "-o 2.2,1.2,2.1"" selects field 2 from file 2, then field 2 from file 1, followed by field 2 from file 1:

$ join -t: -1 1 -2 3 -o 2.2,1.2,2.1 countries.txt distros.txt
DEB:Germany:Knoppix
Zypper:Germany:OpenSuSE
DEB:France:APODIO
DEB:United Kingdom:Ubuntu
DEB:United States of America:Kali
RPM:United States of America:Red Hat

Missing Lines

The eagle-eyed amongst you will have noticed that there is one country listed which does not have matching distributions, and one distribution which does not have any particular country associated with it.

We can find these with the "-v" option. We can ask for those in file 1, or those in file 2, which have no matches in the other file:

$ $ join -v1 -t: -1 1 -2 3 countries.txt distros.txt
PT:Portugal
$ join -v2 -t: -1 1 -2 3 countries.txt distros.txt
NONE:Debian:DEB

Portugal does of course have Linux Caixa Mágica and PaiPix to its name, it's just that these were not listed in these example files.

Conclusion

Hopefully this will have removed some of the confusion around the join command, and you will find renewed confidence in using this old, somewhat archaic, but very useful command to manage your field-delimited text files.

 

Appreciate this site? Please consider making a donation:

 


Books and eBooks

Contact

You can mail me with this form. If you expect a reply, please ensure that the address you specify is valid. Don't forget to include the simple addition question at the end of the form, to prove that you are a real person!