Posted on January 21, 2014
Interview questions: Sorting a terabyte of integers with a gigabyte of memory (part 2)
In the previous article, I mentioned that the mergesort algorithm could be modified in order to solve this problem. The usual mergesort implementation assumes that all the unsorted elements can fit into main memory. The algorithm divides the unsorted elements into two halves, and then recursively calls mergesort on each half before merging them together. The merge operation requires a temporary array to hold the merged result, making it a non starter. Worse, the merging will also be horribly inefficient on disk, as swapping two elements on disk versus swapping two elements of an in-memory array is orders of magnitude slower. Disk I/O is optimized for sequential, not random access. Furthermore, due to the large size of the unsorted input, there will likely not be enough memory on the stack for all the recursive calls required.
We will need to use an external sorting algorithm to address these issues, and the basic idea behind mergesort can be used to come up with a workable solution. Let us define M as the set of unsorted integers residing on hard drive D1, and N as the total number of records that can be sorted in memory at any given time (N takes into account any overhead needed to perform the sort itself). Assume also that we have three additional hard drives in addition to D1 (or some other similar storage medium that is optimized for sequential read/write access, such as tape): D2, D3, and D4. We can sort the data in multiple passes. The initial pass generates N/M sorted subsets of size N. Subsequent passes will merge these subsets together until we are left with one final merged set of sorted integers.
Let’s use a simple example with N = 18 and M = 3. Here is the unsorted data before we apply our sorting algorithm:
D1 15 4 1 20 19 3 100 80 8 12 10 11 55 40 31 39 67 88
The first pass reads M elements at a time from D1 and sorts them in memory (any performant algorithm with a good asymptotic runtime such as quicksort will do). The resulting subsets will be written to disk, with the destination alternating between D3 and D4.
D3 1 4 15 | 8 80 100 | 31 40 55
D4 3 19 20 | 10 11 12 | 39 67 88
D3 and D4 now each contain three (half of N/M) sorted subsets of size M. The next step will be to combine two subsets at a time, one from D3 and one from D4, into a set of size 2*M. The output will be written to disk, this time with the destination alternating between D1 and D2.
D1 1 3 4 15 19 20 | 31 39 40 55 67 88
D2 8 10 11 12 80 100
This method of reading in the data from two disks and writing the merged output to the other two disks is repeated until we end up with one disk which contains all the data in sorted order.
D3 1 3 4 8 10 11 12 19 20 80 100
D4 31 39 40 55 67 88
D1 1 3 4 8 10 11 12 19 20 31 39 40 55 67 80 88 100
The final detail here is how we perform the actual merge operation itself given the limited amount of memory available. It turns out we can re-use the merge logic from mergesort. Recall that in the mergesort algorithm, the merge operation takes in two subarrays, S1 and S2 (since the sort happens in memory, these are typically passed in as start and end indexes from the array being sorted), which are then merged in sorted order into a temporary array T1. The contents of T1 are then written back to the array being sorted. The logic for the merge is fairly simple. Let A1, A2, and A3 be indexes into S1, S2, S3, respectively. One integer is read from each subarray at a time, with the smaller of the integers being written out to the temporary array. We then advance the appropriate index. For example, if S1[A1] < S2[A2], then S1[A1] is written to S3[A3], and A1 and A3 are incremented. If S1[A1] > S2[A2], then S2[A2] is written to S3[A3], and A2 and A3 are incremented.
The important takeaway here is that only two integers need to be compared at any given time, which means that the merge operation requires very little memory. Thus, the logic for the merge operation of our external sorting algorithm will be almost identical, only instead of reading/writing from an array, we will be dealing with Disk I/O instead. Instead of incrementing an index into an array, we advance the spindle on the hard drive instead. Because we are reading and writing from the disks in sequential order, we avoid being penalized by random disk I/O access.
As with mergesort, each pass of the data halves the number of sets. After the initial run, there are N/M subsets. Thus, log2(N/M) subsequent passes through the data are required. Note that if additional external storage devices were available, less runs would be required. For example, if there were six total disks, then each run would cut the number of sets by one third. Given 2k disks, the number of runs required would be logk(N/M). There is additional complexity in the merge logic however. Finding the smaller of two integers is trivial, since we can just compare them directly. Finding the smallest element in a set of k integers however, will require additional work. A data structure such as a priority queue could be used to return the minimum. Note that we also need to keep track of which disk the minimum came from, so that we can read in the next integer from that disk. Additional calls to remove the previous minimum from the priority queue and insert the next integer read from disk will also need to be made.