Posted on January 15, 2014
Interview questions explained: Sorting a terabyte of integers with a gigabyte of memory
Given a terabyte of unsorted 32-bit unsigned integers on disk and only one gigabyte of memory, sort the integers. At first glance, mergesort would be the first solution that comes to mind. However, the typical mergesort implementation assumes that all the elements can fit into main memory. It is still possible to modify mergesort into an external sort that merges runs on disk, but there is another alternative: counting sort. The basic idea is to keep track of how many times each integer appears. A hash table would be the ideal data structure for this purpose. From there, it's just a simple matter of iterating through the possible integer values in order, looking up how many times each appears, and writing them out to disk. The end result is a sorted list of integers. This sort has an O(n) runtime, as it requires one linear pass to count all the integers and a second one to write them out in sorted order. This sort can only be used in special cases where the set of possible values of the unsorted elements is either known ahead of time or finite. A counting sort would not work, for example, if we were dealing with arbitrary floats or strings.
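To make the idea concrete, here is a minimal in-memory sketch of a counting sort in Python, assuming for the moment that all the counts fit in memory (the `Counter` plays the role of the hash table described above):

```python
# Minimal in-memory counting sort sketch: assumes all counts fit in RAM.
from collections import Counter

def counting_sort(values):
    counts = Counter(values)               # one linear pass to count occurrences
    output = []
    for value in sorted(counts):           # visit each distinct value in order
        output.extend([value] * counts[value])
    return output

print(counting_sort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]
```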
However, because we are dealing with 32-bit integers, we can apply the counting sort in this scenario. Here are some quick numbers. There are 1,099,511,627,776 bytes in a terabyte. Each integer takes up four bytes of storage, which means there are a total of 274,877,906,944 unsorted integers. There are 1,073,741,824 bytes in a gigabyte. Assume we use some kind of hash table to keep track of the number of occurrences of each integer, and that each key/value pair consists of integer/integer. Without considering the additional overhead of storing the hash table metadata in memory, we would need eight bytes for each key/value pair. This means that at most, we could hope to store the counts for 134,217,728 integers in memory at a time. It quickly becomes clear that there is not enough memory to do the counting sort in one pass. One workaround would be to write out the integer counts to disk, reading and counting the integers in one-gigabyte chunks at a time (allowing for additional memory overhead used by the program itself). The tradeoff is that disk I/O is expensive. To make the system more robust, we can write out additional metadata to disk to keep track of what has been read so far. This would allow the program to resume from where it left off in the case of a fatal error.
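One way to structure that workaround is to make several passes over the input, where each pass counts only a slice of the 32-bit value space small enough to fit in memory. The sketch below illustrates the idea; the file names, slice width, and checkpoint placement are all assumptions for illustration, and a real implementation would tune the slice width against the number of passes:

```python
# Sketch: multiple passes over the input, each counting one slice of the
# 32-bit value space. 64M counters * 8 bytes = 512 MB per pass; the counters
# are 8 bytes because a single value could repeat more than 2^32 times.
import array
import struct

VALUES_PER_PASS = 1 << 26                      # 64M distinct values per pass

with open("sorted.bin", "wb") as out:          # hypothetical file names
    for base in range(0, 1 << 32, VALUES_PER_PASS):
        counts = array.array("Q", bytes(8 * VALUES_PER_PASS))  # zeroed counters
        with open("unsorted.bin", "rb") as f:
            while chunk := f.read(1 << 30):    # read roughly a gigabyte at a time
                for (value,) in struct.iter_unpack("<I", chunk):
                    if base <= value < base + VALUES_PER_PASS:
                        counts[value - base] += 1
                # (resume metadata, e.g. current pass and file offset,
                #  would be persisted here for crash recovery)
        for i, count in enumerate(counts):
            out.write(struct.pack("<I", base + i) * count)     # emit in order
```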
This solution can be further optimized based on the distribution of the unsorted integers, which is why asking for clarification during the interview process is so important. If the range of possible integer values is bounded by a known minimum and maximum, that reduces the memory required to store the keys. Likewise, if there is a maximum number of times each integer can appear in the unsorted list, that reduces the memory required to store the count values.
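As a hypothetical example of how much these constraints can buy: if the interviewer confirms that every value falls in [0, 1,000,000) and that no value repeats more than 65,535 times, the entire count table collapses into a 2 MB flat array, with the array index standing in for the key:

```python
# Hypothetical tightened constraints: values in [0, 1,000,000), each
# appearing at most 65,535 times, so 16-bit counters suffice.
import array

MIN_VALUE, MAX_VALUE = 0, 1_000_000
counts = array.array("H", bytes(2 * (MAX_VALUE - MIN_VALUE)))  # ~2 MB, zeroed

def record(value):
    counts[value - MIN_VALUE] += 1   # the index doubles as the key: no storage cost
```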
The first chapter of the book Programming Pearls provides one such example of a problem that could be solved entirely in memory with an optimized counting sort. The problem involves a list of unique nine-digit phone numbers that need to appear in sorted order. Because each number appears at most once in the unsorted set, only one bit of memory is needed to store the count for each one. Thus, a bit vector can be used to store the key/value pairs for the counting sort. Each bit in the vector corresponds to one of the possible nine-digit numbers between 0 and 999,999,999. If the bit corresponding to a number is flipped on, the number appears in the unsorted list; otherwise it does not. A bit vector with 1,000,000,000 bits requires only 125,000,000 bytes, or roughly 119 megabytes of memory. Not bad.
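Here is a sketch of that bit vector approach, using the nine-digit version of the problem described above (the numbers are assumed to arrive as plain integers):

```python
# Bit vector counting sort sketch: one bit per possible nine-digit number.
N = 1_000_000_000
bits = bytearray(N // 8)                 # 125,000,000 bytes, all bits off

def mark(number):
    bits[number >> 3] |= 1 << (number & 7)      # flip the number's bit on

def emit_sorted():
    for number in range(N):
        if bits[number >> 3] & (1 << (number & 7)):
            yield number                 # present in the unsorted list
```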
But even in a scenario where the set of key/value pairs cannot fit into main memory, this limitation can be bypassed altogether. The advantage of the counting sort is that it can scale with the size of the unsorted input by using a distributed system to do the counting. Here is a high-level overview of one possible design. The range of possible values is divided into buckets across various slave machines, and each machine stores the counts for its respective range of values. For example, machine A stores the counts for all the integers between 1 and 1,000,000, machine B stores the counts for the integers between 1,000,001 and 2,000,000, and so on. Each time the master reads an integer, a mapping function determines which machine's range the integer falls in. A notification service then alerts the appropriate machine, which increments the count for that integer accordingly. Once all the integers have been read and counted, each machine in the system can return its set of ordered key/value pairs by exposing it via a web service interface, or some functional equivalent. Keys in the range that have a count of zero can be omitted from the response. The master can then aggregate this data and print out the final result in sorted order.
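A sketch of the master's mapping step might look like the following; the host names, bucket width, and `notify` callback are all hypothetical stand-ins:

```python
# Sketch of the mapping function the master applies to route each integer.
BUCKET_WIDTH = 1_000_000
MACHINES = ["machine-a", "machine-b", "machine-c"]   # hypothetical hosts

def machine_for(value):
    # each machine owns a contiguous range of one million values
    bucket = value // BUCKET_WIDTH
    return MACHINES[bucket % len(MACHINES)]  # wraps around in this small example

def route(value, notify):
    # `notify` stands in for the notification service described above
    notify(machine_for(value), value)

route(1_234_567, lambda host, v: print(host, v))     # -> machine-b 1234567
```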
The potential downside of a distributed approach is the increased complexity. Each machine is now a potential point of failure. To address this, additional error-checking logic is needed to make the system more robust. For example, every time a machine increments a count, we want to notify the master that the work is done, and if an error is encountered, we want to retry the increment operation. Ideally, this should all be done asynchronously to improve performance, using some sort of event-driven notification system. The easiest solution here would be to simply use a third-party component. Message queues such as RabbitMQ, MSMQ, and IBM MQ were built from the ground up with these kinds of issues in mind. We can add redundancy and provide even more error checking by having clusters of machines be responsible for each range of numbers. Once all the unsorted integers have been counted, we can compute an MD5 hash of the data structure containing the key/value pairs on each machine. By checking that this value is identical across all the machines in a given cluster, we can help ensure that the counts were performed correctly. The end result of this additional complexity is a highly scalable system that can sort an arbitrary number of elements, given enough hardware.
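As a sketch of that verification step, each machine in a cluster could serialize its counts in a deterministic order and hash the result, with the master comparing digests across replicas; the serialization format here is an assumption:

```python
# Sketch: deterministic MD5 digest of a machine's key/value counts.
import hashlib

def counts_digest(counts):
    h = hashlib.md5()
    for value in sorted(counts):          # deterministic order matters
        h.update(f"{value}:{counts[value]};".encode())
    return h.hexdigest()

# two replicas agree if and only if their digests match
assert counts_digest({1: 2, 5: 1}) == counts_digest({5: 1, 1: 2})
```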
When given a sorting problem in an interview, it is important to ask lots of questions and try to break the problem down into something simpler. What exactly is being sorted? By identifying the various constraints on the possible set of values, an easier solution such as a counting sort may present itself. Typically, the interviewer will follow up by asking you to scale your solution. You'll want to think about how to divide the work in parallel across a large distributed system. The map/reduce class of algorithms was designed to do exactly that; a distributed counting sort is one such example. It may not be the right solution every time, but it's a good one to bring up during the interview process. The right mindset to have when asked a scalability question is to have a conversation with the interviewer, discussing the various possible solutions and their design tradeoffs. A good discussion goes a long way, and at the very least, you may come out of it having learned a lot.