
Should we sort files before or after merging them?

This weekend I produced a very large database of chess positions. This database is useful to chess engine programmers and researchers. The database is a text file containing one position per line, encoded using the Forsyth-Edwards Notation.

At some point in the process of creating this file I had multiple files containing more than 600 million lines, with lots of duplicates (positions occurring in more than one game). I needed to use the GNU sort utility to merge those files and remove duplicates. I had two options: I could join the files into one single large file and sort it, or I could sort the files individually and use the sort utility to merge them together and remove duplicates at the same time. After some research on the internet I could not find any advice about which route might be best, so I decided to test it.
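
Concretely, the two routes look roughly like the sketch below; the file names are made up for illustration, and the -u flag (keep only unique lines) stands in for the duplicate removal mentioned above.

# Route 1: concatenate everything, then sort and deduplicate in one pass
cat part-*.fen > all.fen
sort -u -o all.fen all.fen

# Route 2: sort each part first, then merge the already-sorted parts,
# dropping duplicates during the merge
for f in part-*.fen; do sort -o "$f" "$f"; done
sort --merge -u -o all.fen part-*.fen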

I took a more manageable subset of my file containing 9.5 million lines, and I shuffled it with:

sort --random-sort -o testdata testdata

Then I split this file into roughly 10 equal parts:

split -n l/10 -d testdata testdata
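
With -d and the default two-character suffixes this leaves the original testdata plus ten pieces named testdata00 through testdata09, which can be checked with a quick line count:

wc -l testdata0*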

I ran the first scenario (join the files and sort the result) four times with the command below. The average time for the four runs is 69.4 seconds, with the slowest run taking 70.9 seconds.

cat testdata* > testdata; sort -o testdata testdata
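
The numbers above are wall-clock times; one simple way to collect them is to wrap the whole command in the shell's time keyword (a sketch, not necessarily the exact invocation used for the figures reported here):

time { cat testdata* > testdata; sort -o testdata testdata; }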

I ran the second scenario with the small bash script below. The four-run average is 70.9 seconds, with the slowest run at 71.1 seconds.

#!/bin/bash
# Sort each piece in place, then merge the already-sorted pieces
for f in testdata*
do
  sort -o "$f" "$f"
done
sort --merge -o testdata testdata*

While the second method is marginally slower, the difference is not really significant. It kind of makes sense: GNU sort implements a merge sort, which is already a recursive sort. By sorting the individual files beforehand we are doing exactly what GNU sort does internally, that is, merging sorted chunks of the data.
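
As a sanity check, the two routes should also produce byte-for-byte identical output. A quick way to confirm it on the test data, assuming the pieces have already been sorted as in the script above (check_concat and check_merge are just illustrative names):

cat testdata0* | sort > check_concat
sort --merge testdata0* > check_merge
cmp check_concat check_merge && echo "outputs are identical"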

This article is licensed under CC BY-NC-ND 4.0 by the author.
