6 May 2020
The relatively obscure 'join' command
There are some older *nix commands which have tricky syntax, and can be intimidating to learn
Unix, and therefore the Shell (whether under Unix, Linux, or any other operating system), is very much text-file driven. The Unix philsophy of "everything is a file", combined with the flexibility and easily-edited nature of text files means that a lot of information in a Unix / Linux system is stored in plain-text files. This makes for easy management of data, without bespoke tools, and it also means that the Unix and Linux userspace (the command-line tools available to the shell) includes a lot of tools for manipulating text files.
One such command is called
join, and as its syntax is a bit opaque, not many people use it very
much. However, it is quite powerful and flexible, as we shall see.
join command will combine two text files, based on a common key. So - a bit like with a SQL
JOIN command, if you have two files with a common key somewhere in them, you can use the
join command to merge these two files together. And you can actually manage the results quite flexibly,
as we shall see.
Sample input files
Let's say we have two text files, one called countries.txt which provides a list of two-letter country codes (DE for Germany, FR for France, etc), and another called distros.txt which lists GNU/Linux distributions, their package management type (RPM/DEB/Zypper) and their countries of origin. For example:
DE:Germany FR:France PT:Portugal UK:United Kingdom US:United States of America
Knoppix:DEB:DE OpenSuSE:Zypper:DE APODIO:DEB:FR Debian:DEB:NONE Ubuntu:DEB:UK Kali:DEB:US Red Hat:RPM:US
We can use the
join command to combine these files, because they both have the ISO-3166 country code. Here, the "
-t:" means that the delimiter in these files is the colon (
:) character (these older *nix commands are missing a bit of consistency - it would be nice if they all used
-d" for "delimiter", but
sort use "
The syntax continues to be rather odd, in that we now need to tell
join that the first file (countries.txt) is keyed on the first field, because "DE", "FR", etc are the first field in that file. To do this, we say "
Then we say "
-2 3" to tell
join that the second file (distros.txt) is keyed on the third field, because those two-letter country codes are the third field in distros.txt.
Finally, we tell
join the names of file1 and file2: countries.txt and distros.txt, respectively.
$ join -t: -1 1 -2 3 countries.txt distros.txt DE:Germany:Knoppix:DEB DE:Germany:OpenSuSE:Zypper FR:France:APODIO:DEB UK:United Kingdom:Ubuntu:DEB US:United States of America:Kali:DEB US:United States of America:Red Hat:RPM
We now have a combination of the two files, merged on their common key of the two-letter country code.
An important caveat to be aware of is that both files need to be sorted alphabetically before starting, and they need to be sorted based on their key. So a simple "
sort distros.txt" is not sufficient. The
sort command uses the same "
-t:" syntax as
join to say that the delimiter is the colon (
:) character, but it uses the "
-k 3" argument to sort the file on the third field with that delimiter.
So to make sure that the countries.txt file is correctly sorted, we need to run: "
sort -o countries.txt countries.txt" - again,
sort is ancient, it doesn't have the simplicity of "
sed -i", so we have to tell it "sort countries.txt and write the output to countries.txt". This invocation will replace the unsorted countries.txt with a new, sorted, version of the file.
To sort the distros.txt file requires a bit more - the key is in field three. So we run "
sort -t: -k3 -o distros.txt distros.txt" to sort the file on field 3, with a delimiter of "
:", writing the output to distros.txt, and reading the input from the same distros.txt file. Again, using the distros.txt name twice is necessary, and replaces the unsorted file with a sorted version.
We can do a bit more than this with the
join command - we can format the output as we wish. We have
join what the fields are, so we can tell it which fields, from which files we want, and in what order.
To do this, we use the "
-o" option, which again has a non-standard syntax. We use the format "
-o FILE.FIELD" syntax, where "
-o 1.x" means field "x" in the first filename listed, and "
-o 2.y" means field "y" in the second filename listed on the command line.
So we can get field #2 (Country Name) from countries.txt, merged with field #1 (distro name) from distro.txt:
$ join -t: -1 1 -2 3 -o 1.2,2.1 countries.txt distros.txt Germany:Knoppix Germany:OpenSuSE France:APODIO United Kingdom:Ubuntu United States of America:Kali United States of America:Red Hat
Or we might want to append the package manager (which is file 2, field 2, so append "
,2.2" to the "
$ join -t: -1 1 -2 3 -o 1.2,2.1,2.2 countries.txt distros.txt Germany:Knoppix:DEB Germany:OpenSuSE:Zypper France:APODIO:DEB United Kingdom:Ubuntu:DEB United States of America:Kali:DEB United States of America:Red Hat:RPM
We can reverse this by changing the "
-o" output options, to mix-and-match between the input files. Here, "
-o 2.2,1.2,2.1"" selects field 2 from file 2, then field 2 from file 1, followed by field 2 from file 1:
$ join -t: -1 1 -2 3 -o 2.2,1.2,2.1 countries.txt distros.txt DEB:Germany:Knoppix Zypper:Germany:OpenSuSE DEB:France:APODIO DEB:United Kingdom:Ubuntu DEB:United States of America:Kali RPM:United States of America:Red Hat
The eagle-eyed amongst you will have noticed that there is one country listed which does not have matching distributions, and one distribution which does not have any particular country associated with it.
We can find these with the "
-v" option. We can ask for those in file 1, or those in file 2, which have no matches in the other file:
$ $ join -v1 -t: -1 1 -2 3 countries.txt distros.txt PT:Portugal $ join -v2 -t: -1 1 -2 3 countries.txt distros.txt NONE:Debian:DEB
Portugal does of course have Linux Caixa Mágica and PaiPix to its name, it's just that these were not listed in these example files.
Hopefully this will have removed some of the confusion around the
join command, and you will find
renewed confidence in using this old, somewhat archaic, but very useful command to manage your field-delimited text
Books and eBooks
You can mail me with this form. If you expect a reply, please ensure that the address you specify is valid. Don't forget to include the simple addition question at the end of the form, to prove that you are a real person!