by Andrew Charlton on 30 Jul 2018

How do we sort large data sets when we can’t guarantee that the whole data set will fit into memory? Can we do it in a performant way whilst guaranteeing an upper bound on our memory usage?

If you’ve been programming for more than 5 minutes, it’s fairly certain that you have sorted some data. It’s such a common operation that every modern programming language has at least one built-in sort method. Unfortunately, these algorithms often assume fairly small datasets that fit comfortably into memory. In this blog post, we look at the problem of what to do when we can’t be sure that our data set will fit into memory, and how to avert those Out-Of-Memory exceptions hiding round the corner.

TL;DR - implement an external sorting algorithm: partition your data into smaller files based on the start of the record (choosing how to partition requires some thought), then sort each file in turn.

Background

With the advent of GDPR, managing whether or not a business has legal permission to send an email to someone has become incredibly important. Here at SaleCycle, we offer a ‘Marketing Permissions Service’ to help our clients manage these permissions. We gather consent in multiple ways: by scraping customer interactions with a website, by showing a customer a popup to ask for their consent, and we also allow our clients to send us files containing consent records which we import into our system.

The files we receive from clients vary considerably so, to ensure their validity and consistency, we perform a series of steps on all of them: validation, obfuscation and de-duplication. To minimise the additional load on our datastores, we also diff each file with the previous upload for that client and only act on any changes between the files.

Many of the files we receive are large, maybe not quite big data size, but easily consisting of tens of millions of records. Deduplicating a file this size in memory if it is not sorted is a no-go, and diffing when the order of the two files is not guaranteed to be the same is also a no-go. We need to sort these files.

Problem

We have a file containing email addresses (plus other data) that we would like to sort, but we cannot guarantee that we will be able to load the whole thing into memory and sort it there.

It is important to note that, for our particular use case, we don’t actually need the records to be in alphabetical order based on the email; the sort order just needs to be consistent so that we can deduplicate and diff.

Solution

There are several approaches we can take to sorting large files like this, but all depend on partitioning the data in some way, sorting each partition and then recombining them. The major differences between algorithms come down to whether we can partition the data in sorted order to start with, and how complex it is to recombine the partitions afterwards. Wikipedia has a good article on external sorting if you’d like some more ideas.

One approach, and the one we used, is to implement a bucket sort using a temporary file for each bucket. This removes the complexity of having to recombine the partitioned files in some clever way afterwards.

Partitioning our data

When performing a bucket sort, we need to decide how to determine which bucket each record goes into. Sorting email addresses directly is pretty horrible: they have a massive range of permissible values - even something like "()<>[]:,;@\\\"!#$%&'-/=?^_{}| ~.a"@example.org is a valid email!

Thankfully, in our particular case we also needed to hash each email address at some point in the process (we obfuscate and hash all personal data within our systems, so storing raw email addresses or using them as keys within our data stores is a big no-no). Because we only needed a consistent order for deduplicating and diffing, sorting based on the hash of the email address is perfectly acceptable in our case.

The properties of a hash are well suited to sorting - only a small set of characters is possible in each string, and the chance of any particular character appearing in any position is fairly uniform.
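For illustration, hashing an email address to a hex string sort key might look something like the sketch below (using java.security.MessageDigest and java.nio.charset.StandardCharsets). The post doesn’t name the algorithm; SHA-256 is an assumption here, though it does produce the 64-character hex strings mentioned later.

private static String hashEmail(String email) {

    // Hash an email address to a lowercase hex string to use as the sort key.
    // SHA-256 is an assumption - we only know the real hashes are 64
    // characters long, which is consistent with it.
    try {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] bytes = digest.digest(email.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            hex.append(Character.forDigit((b >> 4) & 0xF, 16));
            hex.append(Character.forDigit(b & 0xF, 16));
        }
        return hex.toString();
    } catch (NoSuchAlgorithmException e) {
        throw new IllegalStateException("SHA-256 not available", e);
    }
}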

In our case, we decided to take the first two characters of the hash, and write the record to one of 256 temporary files based on those characters. After initialising writers for each temp file, we can do something like:

public void put(Record record) {

    // The email on the record has already been hashed to a hex string, so its
    // first two characters identify one of 256 (16 x 16) buckets.
    int firstChar = Character.digit(record.getEmail().charAt(0), 16);
    int secondChar = Character.digit(record.getEmail().charAt(1), 16);
    int index = firstChar * 16 + secondChar;

    writers[index].write(record);
}

So, all of the records that start ‘00’ get written to the first file, all of the records that start ‘01’ get written to the second file and so on.
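The writers array needs to be set up before any records arrive. A minimal sketch of that initialisation, assuming a hypothetical RecordWriter wrapper that serialises a Record as a single line to a BufferedWriter:

private static final int NUM_BUCKETS = 256;

private final File[] files = new File[NUM_BUCKETS];
private final RecordWriter[] writers = new RecordWriter[NUM_BUCKETS];

public void init() throws IOException {

    // One temporary file and one writer per bucket, named after the two hex
    // characters that route records into it.
    for (int i = 0; i < NUM_BUCKETS; i++) {
        files[i] = File.createTempFile(String.format("bucket-%02x-", i), ".tmp");
        writers[i] = new RecordWriter(new BufferedWriter(new FileWriter(files[i])));
    }
}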

Sorting and combining partitions

By partitioning our data so that all of the records in one bucket sort before the records in every later bucket, we can easily combine them to create a complete sorted list: we just iterate through each file in turn, sort the contents of that file and then output the result.

public void sort() throws IOException {

    // Buckets are processed in order, so their sorted contents can simply be
    // concatenated to produce the complete sorted output.
    for (File file : files) {
        sortAndPut(file);
        file.delete();
    }
}

private void sortAndPut(File file) throws IOException {

    // Each bucket is small enough to sort in memory on its own.
    try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
        reader.lines()
                .sorted(String::compareTo)
                .map(this::convertToRecord)
                .forEach(consumer::put);
    }
}

Simples!
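Tying it together, the overall flow looks roughly like this (a sketch only - convertToRecord, the downstream consumer and the closeWriters helper are assumed to exist elsewhere):

public void sortFile(File input) throws IOException {

    // Route every record in the input file to its bucket, then sort each
    // bucket in turn.
    init();
    try (BufferedReader reader = new BufferedReader(new FileReader(input))) {
        reader.lines()
                .map(this::convertToRecord)
                .forEach(this::put);
    }
    closeWriters(); // flush the bucket files before reading them back
    sort();
}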

Performance

To benchmark the performance of the external sorting algorithm, I generated a set of test files of varying sizes. Each test file had a set of records consisting of a random 32 character hex string and a UTC timestamp. I then sorted each file using:

  • A native Java in-memory sort
  • A Java implementation of the external bucket sort
  • A native Linux sort
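For reference, a test file like the ones described above can be generated with something like the sketch below (the comma separator and the helper name are my assumptions; the post only describes the two fields):

private static void generateTestFile(File target, int records) throws IOException {

    // One record per line: a random 32-character hex string, a comma and a
    // UTC timestamp.
    Random random = new Random();
    try (BufferedWriter writer = new BufferedWriter(new FileWriter(target))) {
        for (int i = 0; i < records; i++) {
            StringBuilder line = new StringBuilder(64);
            for (int j = 0; j < 32; j++) {
                line.append(Character.forDigit(random.nextInt(16), 16));
            }
            line.append(',').append(Instant.now());
            writer.write(line.toString());
            writer.newLine();
        }
    }
}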

Results

Records       In Memory (s)   External Bucket (s)   Linux (s)
500,000       0.886           0.861                 7.2
1,000,000     1.43            1.31                  15
2,000,000     3.42            2.34                  32
5,000,000     9.68            5.50                  90
10,000,000    25.25           11.03                 205
20,000,000    49.23           22.82                 414

The results were a little surprising: after we hit 500,000 records, the external bucket sort ended up being faster than the in-memory sort. Obviously this is just a quick and dirty test on my MacBook, but I would expect similar results from a more rigorous test. What is obvious, though, is that for a file of any significant size, the external sorting algorithm is a performant option when in-memory sorting is impractical.

The limiting factor in an external sort like this is likely to be IO performance. Each record is written to and read from a temporary file as part of the process, so the amount of IO required grows linearly as our files get bigger and is likely to account for a significant share of the overall runtime.

More importantly, with 20,000,000 records the Java process required over 3.6GB to sort the file in memory. The hashes we actually use are 64 characters long, so this could almost double in real-world use. The external sort required under 600MB. For a process that we run within AWS Batch (which has a hard memory limit), that is a huge advance in our ability to process large files reliably.