To my eyes, for basic merging purposes (merging two sets of data into one) - this is the syntax I would use:
Which would yield this:
That would typically be the output that I'd expect from join when attempting to merge two data-sets together.
That way, all of the values for a, b, c and d are there in the merged file, and I don't care which file each one originally came from - all of the data has been collated into a single file.
But, from the looks of your requirements - you want to perform two things:
1. Merge the data
2. Be able to see which file each piece of data originally came from.
Sadly, from looking at the man and info pages for join - AND from having a play with join - I've been unable to find anything that could yield the output you want. At least, not using a single join command on its own!
But you could use two join commands and sort to get the output you require:
Code:
join -a 1 file1 file2 > outputfile
join -a2 -v2 file1 file2 | awk '/ / {gsub(" ", "   ");print}' >> outputfile
sort outputfile -o outputfile
In the above:
The first join command joins file1 and file2 - including lines that are only in file1 - and writes the output to a file called outputfile via output redirection.
So after the first join command, outputfile would contain:
The second join command outputs lines that are only in file2.
Each line output by the second join is piped through awk. On any line containing a space, each space is expanded to three spaces and the line is printed. The modified line is then appended to outputfile via output redirection.
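As a small illustration of the awk step on its own, here is a sketch that runs the same awk program on a single made-up line. The line 'd 5' is only an assumption about what the second join might emit for a key that exists only in file2. Note that gsub replaces every space on the line, so this relies on each line having just two fields.

```shell
# Hypothetical line, standing in for what the second join prints
# for a key that is only in file2
line='d 5'

# Same awk program as in the commands above: on lines containing a space,
# widen every space to three spaces, then print the line
padded=$(printf '%s\n' "$line" | awk '/ / {gsub(" ", "   "); print}')

# Brackets make the spacing visible on-screen
printf '[%s]\n' "$padded"
```

With that input, the value ends up separated from the key by three spaces, which leaves a blank where file1's value would normally sit.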
The reason we're piping through awk and adding spaces in the second join command is that otherwise the lines would come out looking like this:
And you want the values for fields that are only in file2 to be displayed in column 2. So we are using awk to expand the space in the middle of each line to three spaces, so that the value (5) ends up in the 2nd column of the output.
e.g.
Finally, we sort outputfile to ensure that all non-matching lines from file2 are in their correct place, overwriting outputfile with the sorted output.
Yielding an outputfile which looks like this:
NOTE: We don't NEED to sort the outputfile in the above example, because the only line that is in file2 and NOT file1 is the final line.
But if the two datasets were like this:-
file1:
file2:
After the two join commands, the output would look like this:
So it would need to be sorted to yield:
So although the final sort isn't required with the sample data you have provided - the sort at the end could be necessary if you run a different set of data through those commands. So it would be prudent to keep the final sort there.
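To see the effect of the final sort end to end, here is a minimal sketch that runs the same three commands in a temporary directory. The contents of file1 and file2 below are made up for illustration (they are not your data); the key b exists only in file2 and belongs in the middle of the output, so the appended line is out of place until the sort runs.

```shell
# Work in a throwaway directory so nothing real is overwritten
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: both files are already sorted on the first field
printf 'a 1\nc 3\nd 4\n' > file1
printf 'a 5\nb 6\nc 7\n' > file2

# 1. Matched lines, plus lines that are only in file1
join -a 1 file1 file2 > outputfile

# 2. Lines that are only in file2, padded so the value lands in the file2 column
join -a2 -v2 file1 file2 | awk '/ / {gsub(" ", "   ");print}' >> outputfile

# At this point the "b" line sits at the end of outputfile
unsorted=$(cat outputfile)

# 3. Sort in place so the "b" line moves up between "a" and "c"
sort outputfile -o outputfile
merged=$(cat outputfile)

printf 'before sort:\n%s\n\nafter sort:\n%s\n' "$unsorted" "$merged"
```

Before the sort, the b line is last; afterwards it sits between the a and c lines.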
And now that we know what we are doing - we can turn those few commands into a re-usable script that could be run on any two data-files that match the format of your example files.
The following script takes three parameters:
1. /path/to/file1
2. /path/to/file2
3. /path/to/outputfile
mergedata.sh (or you could call it whatever you want!)
Code:
#!/usr/bin/env bash

# ensure we have 3 parameters
if [[ $# -ne 3 ]]; then
    echo "Error - Requires exactly 3 parameters: "
    echo "usage: $0 /path/to/file1 /path/to/file2 /path/to/outputfile"
    exit 1
fi

# ensure file1 exists
if [[ ! -f "$1" ]]; then
    echo "Error: \"$1\" does not exist!"
    exit 1
fi

# ensure file2 exists
if [[ ! -f "$2" ]]; then
    echo "Error: \"$2\" does not exist!"
    exit 1
fi

# ensure the directory for the output file exists
if [[ ! -d "$(dirname "$3")" ]]; then
    echo "Error: Unable to create $3"
    echo " The directory \"$(dirname "$3")\" does not exist!"
    exit 1
fi

# join file1 and file2 including unmatched lines in file1
# and write to output-file
join -a1 "$1" "$2" > "$3"

# append unmatched lines from file2 to output file
# adding some extra spaces with awk
join -a2 -v2 "$1" "$2" | awk '/ / {gsub(" ", "   ");print}' >> "$3"

# sort the output file
sort "$3" -o "$3"

# display the content of the file on-screen
echo
cat "$3"
There's quite a bit more code in the above.
First up there are a few checks to ensure that:
1. We have the correct number of parameters
2. Each input file exists
3. The path to the output file is valid.
If any of the checks fail - we print an error message and quit with an error code.
Otherwise - if everything is OK, we perform our join operations on the two files, writing to the output file as we go. Then we sort the output file and display it on-screen.
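If you want to try the checks without saving the script first, here is a rough sketch that wraps the same logic in a shell function and calls it twice: once with a missing input file and once with valid inputs. The function name merge_data, the temporary directory and the sample data are all made up for this illustration, and the function returns instead of exiting so that it doesn't close your shell.

```shell
# Hypothetical helper mirroring the script's checks and join steps
merge_data() {
    [ "$#" -eq 3 ] || { echo "usage: merge_data file1 file2 outputfile" >&2; return 1; }
    [ -f "$1" ] || { echo "Error: \"$1\" does not exist!" >&2; return 1; }
    [ -f "$2" ] || { echo "Error: \"$2\" does not exist!" >&2; return 1; }
    [ -d "$(dirname "$3")" ] || { echo "Error: Unable to create $3" >&2; return 1; }
    join -a1 "$1" "$2" > "$3"
    join -a2 -v2 "$1" "$2" | awk '/ / {gsub(" ", "   ");print}' >> "$3"
    sort "$3" -o "$3"
}

# Made-up sample data in a throwaway directory
demo=$(mktemp -d)
printf 'a 1\nb 2\n' > "$demo/file1"
printf 'a 3\nc 4\n' > "$demo/file2"

# A missing input file should be rejected before anything is written
missing_status=0
merge_data "$demo/file1" "$demo/missing" "$demo/out" 2>/dev/null || missing_status=$?

# With valid inputs the merge should go through
ok_status=0
merge_data "$demo/file1" "$demo/file2" "$demo/out" || ok_status=$?
result=$(cat "$demo/out")

echo "missing input -> status $missing_status"
echo "valid inputs  -> status $ok_status"
printf '%s\n' "$result"
```

The first call should report a non-zero status and the second a zero status, followed by the merged lines.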