6 May 2020

The relatively obscure JOIN command

Unix, and therefore the Shell (whether under Unix, Linux, or any other operating system), is very much text-file driven. The Unix philsophy of "everything is a file", combined with the flexibility and easily-edited nature of text files means that a lot of information in a Unix / Linux system is stored in plain-text files. This makes for easy management of data, without bespoke tools, and it also means that the Unix and Linux userspace (the command-line tools available to the shell) includes a lot of tools for manipulating text files.

One such command is called join, and as its syntax is a bit opaque, not many people use it very much. However, it is quite powerful and flexible, as we shall see.

The join command will combine two text files, based on a common key. So - a bit like with a SQL JOIN command, if you have two files with a common key somewhere in them, you can use the join command to merge these two files together. And you can actually manage the results quite flexibly, as we shall see.

Sample input files

Let's say we have two text files, one called countries.txt which provides a list of two-letter country codes (DE for Germany, FR for France, etc), and another called distros.txt which lists GNU/Linux distributions, their package management type (RPM/DEB/Zypper) and their countries of origin. For example:


UK:United Kingdom
US:United States of America


Red Hat:RPM:US

First Run

We can use the join command to combine these files, because they both have the ISO-3166 country code. Here, the "-t:" means that the delimiter in these files is the colon (:) character (these older *nix commands are missing a bit of consistency - it would be nice if they all used cut's "-d" for "delimiter", but join and sort use "-t" instead)

The syntax continues to be rather odd, in that we now need to tell join that the first file (countries.txt) is keyed on the first field, because "DE", "FR", etc are the first field in that file. To do this, we say "-1 1".
Then we say "-2 3" to tell join that the second file (distros.txt) is keyed on the third field, because those two-letter country codes are the third field in distros.txt.
Finally, we tell join the names of file1 and file2: countries.txt and distros.txt, respectively.

$ join -t: -1 1 -2 3 countries.txt distros.txt 
UK:United Kingdom:Ubuntu:DEB
US:United States of America:Kali:DEB
US:United States of America:Red Hat:RPM

We now have a combination of the two files, merged on their common key of the two-letter country code.


An important caveat to be aware of is that both files need to be sorted alphabetically before starting, and they need to be sorted based on their key. So a simple "sort distros.txt" is not sufficient. The sort command uses the same "-t:" syntax as join to say that the delimiter is the colon (:) character, but it uses the "-k 3" argument to sort the file on the third field with that delimiter.

So to make sure that the countries.txt file is correctly sorted, we need to run: "sort -o countries.txt countries.txt" - again, sort is ancient, it doesn't have the simplicity of "sed -i", so we have to tell it "sort countries.txt and write the output to countries.txt". This invocation will replace the unsorted countries.txt with a new, sorted, version of the file.

To sort the distros.txt file requires a bit more - the key is in field three. So we run "sort -t: -k3 -o distros.txt distros.txt" to sort the file on field 3, with a delimiter of ":", writing the output to distros.txt, and reading the input from the same distros.txt file. Again, using the distros.txt name twice is necessary, and replaces the unsorted file with a sorted version.


We can do a bit more than this with the join command - we can format the output as we wish. We have told join what the fields are, so we can tell it which fields, from which files we want, and in what order. To do this, we use the "-o" option, which again has a non-standard syntax. We use the format "-o FILE.FIELD" syntax, where "-o 1.x" means field "x" in the first filename listed, and "-o 2.y" means field "y" in the second filename listed on the command line.

So we can get field #2 (Country Name) from countries.txt, merged with field #1 (distro name) from distro.txt:

$ join -t: -1 1 -2 3 -o 1.2,2.1 countries.txt distros.txt
United Kingdom:Ubuntu
United States of America:Kali
United States of America:Red Hat

Or we might want to append the package manager (which is file 2, field 2, so append ",2.2" to the "-o" option:

$ join -t: -1 1 -2 3 -o 1.2,2.1,2.2 countries.txt distros.txt
United Kingdom:Ubuntu:DEB
United States of America:Kali:DEB
United States of America:Red Hat:RPM

We can reverse this by changing the "-o" output options, to mix-and-match between the input files. Here, "-o 2.2,1.2,2.1"" selects field 2 from file 2, then field 2 from file 1, followed by field 2 from file 1:

$ join -t: -1 1 -2 3 -o 2.2,1.2,2.1 countries.txt distros.txt
DEB:United Kingdom:Ubuntu
DEB:United States of America:Kali
RPM:United States of America:Red Hat

Missing Lines

The eagle-eyed amongst you will have noticed that there is one country listed which does not have matching distributions, and one distribution which does not have any particular country associated with it.

We can find these with the "-v" option. We can ask for those in file 1, or those in file 2, which have no matches in the other file:

$ $ join -v1 -t: -1 1 -2 3 countries.txt distros.txt
$ join -v2 -t: -1 1 -2 3 countries.txt distros.txt

Portugal does of course have Linux Caixa Mágica and PaiPix to its name, it's just that these were not listed in these example files.


Hopefully this will have removed some of the confusion around the join command, and you will find renewed confidence in using this old, somewhat archaic, but very useful command to manage your field-delimited text files.

Shell Scripting Tips

Shell Scripting Tutorial