Interview questions: Sorting a terabyte of integers with a gigabyte of memory (part 2)

In the previous article, I mentioned that the mergesort algorithm could be modified to solve this problem. The usual mergesort implementation assumes that all the unsorted elements can fit into main memory. The algorithm divides the unsorted elements into two halves, and then recursively calls mergesort on each half before merging them together. The merge operation requires a temporary array to hold the merged result, making it a non-starter. Worse, the merging will also be horribly inefficient on disk: swapping two elements on disk is orders of magnitude slower than swapping two elements of an in-memory array, because disk I/O is optimized for sequential, not random, access. Furthermore, due to the large size of the unsorted input, there will likely not be enough stack space for all the recursive calls required.

We will need to use an external sorting algorithm to address these issues, and the basic idea behind mergesort can be used to come up with a workable solution. Let us define N as the number of unsorted integers residing on hard drive D1, and M as the number of records that can be sorted in memory at any given time (M takes into account any overhead needed to perform the sort itself). Assume also that we have three hard drives in addition to D1 (or some other similar storage medium that is optimized for sequential read/write access, such as tape): D2, D3, and D4. We can sort the data in multiple passes. The initial pass generates N/M sorted subsets of size M. Subsequent passes merge these subsets together until we are left with one final merged set of sorted integers.

Let’s use a simple example with N = 18 and M = 3. Here is the unsorted data before we apply our sorting algorithm:

D1    15 4 1 20 19 3 100 80 8 12 10 11 55 40 31 39 67 88           
D2
D3
D4

The first pass reads M elements at a time from D1 and sorts them in memory (any algorithm with a good asymptotic runtime, such as quicksort, will do). The resulting subsets are written to disk, with the destination alternating between D3 and D4.

D1              
D2
D3    1  4  15 | 8  80 100 | 31 40 55
D4    3  19 20 | 10 11 12  | 39 67 88

D3 and D4 now each contain three (half of N/M) sorted subsets of size M. The next step will be to combine two subsets at a time, one from D3 and one from D4, into a set of size 2*M. The output will be written to disk, this time with the destination alternating between D1 and D2.

D1    1  3  4  15 19 20  | 31 39 40 55 67 88           
D2    8  10 11 12 80 100 
D3    
D4    

This method of reading in the data from two disks and writing the merged output to the other two disks is repeated until we end up with one disk which contains all the data in sorted order.

D1               
D2     
D3    1  3  4  8  10 11 12 19 20 80 100
D4    31 39 40 55 67 88
D1    1 3 4 8 10 11 12 19 20 31 39 40 55 67 80 88 100            
D2     
D3   
D4   
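The passes illustrated above can be sketched in a few lines of Python. This is a simplification for illustration only: plain lists stand in for the four drives, and the function name `external_sort` is my own. A real implementation would stream runs to and from disk rather than hold them in memory.

```python
from heapq import merge  # lazily merges two sorted iterables

def external_sort(d1, m):
    # Initial pass: sort M elements at a time, alternating output "drives".
    runs_a, runs_b = [], []
    for i in range(0, len(d1), m):
        chunk = sorted(d1[i:i + m])
        (runs_a if (i // m) % 2 == 0 else runs_b).append(chunk)

    # Subsequent passes: merge one run from each side, alternating again,
    # until a single sorted run remains.
    while len(runs_a) + len(runs_b) > 1:
        out_a, out_b = [], []
        for i in range(max(len(runs_a), len(runs_b))):
            left = runs_a[i] if i < len(runs_a) else []
            right = runs_b[i] if i < len(runs_b) else []
            (out_a if i % 2 == 0 else out_b).append(list(merge(left, right)))
        runs_a, runs_b = out_a, out_b
    return (runs_a or runs_b)[0]

data = [15, 4, 1, 20, 19, 3, 100, 80, 8, 12, 10, 11, 55, 40, 31, 39, 67, 88]
print(external_sort(data, 3))
```

Running this on the example data reproduces the intermediate runs shown above, with the first pass producing three runs per side and each subsequent pass halving the run count.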

The final detail here is how we perform the actual merge operation itself given the limited amount of memory available. It turns out we can re-use the merge logic from mergesort. Recall that in the mergesort algorithm, the merge operation takes in two subarrays, S1 and S2 (since the sort happens in memory, these are typically passed in as start and end indexes into the array being sorted), which are then merged in sorted order into a temporary array T1. The contents of T1 are then written back to the array being sorted. The logic for the merge is fairly simple. Let A1, A2, and A3 be indexes into S1, S2, and T1, respectively. One integer is read from each subarray at a time, with the smaller of the two being written out to the temporary array, after which the appropriate indexes are advanced. For example, if S1[A1] < S2[A2], then S1[A1] is written to T1[A3], and A1 and A3 are incremented. Otherwise, S2[A2] is written to T1[A3], and A2 and A3 are incremented.

The important takeaway here is that only two integers need to be compared at any given time, which means that the merge operation requires very little memory. Thus, the logic for the merge operation of our external sorting algorithm is almost identical; instead of reading and writing from an array, we deal with disk I/O, and instead of incrementing an index into an array, we advance the read/write head on the drive. Because we read and write the disks in sequential order, we avoid the penalty of random disk access.
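The two-way merge described above can be written out explicitly. Here is a minimal Python sketch over two sorted lists, where `a1` and `a2` play the roles of the indexes A1 and A2 (the function name `merge_two` is my own):

```python
def merge_two(s1, s2):
    # Merge two sorted sequences, comparing only one element from each
    # at a time -- constant memory beyond the output itself.
    t1 = []
    a1 = a2 = 0
    while a1 < len(s1) and a2 < len(s2):
        if s1[a1] <= s2[a2]:
            t1.append(s1[a1])
            a1 += 1
        else:
            t1.append(s2[a2])
            a2 += 1
    # One input is exhausted; append the remainder of the other.
    t1.extend(s1[a1:])
    t1.extend(s2[a2:])
    return t1

print(merge_two([1, 4, 15], [3, 19, 20]))  # [1, 3, 4, 15, 19, 20]
```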

As with mergesort, each pass over the data halves the number of sets. After the initial run, there are N/M subsets, so log2(N/M) subsequent passes through the data are required. Note that if additional external storage devices were available, fewer passes would be needed. For example, if there were six total disks, each pass would cut the number of sets to one third; given 2k disks, the number of passes required would be logk(N/M). There is additional complexity in the merge logic, however. Finding the smaller of two integers is trivial, since we can just compare them directly. Finding the smallest element in a set of k integers requires additional work: a data structure such as a priority queue could be used to return the minimum. Note that we also need to keep track of which disk the minimum came from, so that we can read in the next integer from that disk. Additional calls to remove the previous minimum from the priority queue and insert the next integer read from disk will also need to be made.
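The k-way merge with a priority queue can be sketched as follows. Python's `heapq` serves as the priority queue, and each heap entry carries the run it came from so we know where to read the next integer (lists again stand in for disks; `k_way_merge` is an illustrative name):

```python
import heapq

def k_way_merge(runs):
    # Each heap entry is (value, run index, position within run), so popping
    # the minimum also tells us which "disk" to read the next integer from.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, run_idx, pos = heapq.heappop(heap)
        out.append(value)
        # Replace the popped minimum with the next integer from the same run.
        if pos + 1 < len(runs[run_idx]):
            heapq.heappush(heap, (runs[run_idx][pos + 1], run_idx, pos + 1))
    return out

print(k_way_merge([[1, 4, 15], [3, 19, 20], [8, 10, 100]]))
```

Each element is pushed and popped once, so merging k runs of n total elements costs O(n log k).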

Interview questions explained: Sorting a terabyte of integers with a gigabyte of memory

Given a terabyte of unsorted 32-bit unsigned integers on disk and only one gigabyte of memory, sort the integers. At first glance, mergesort would be the first solution that comes to mind. However, the typical mergesort implementation assumes that all the elements can fit into main memory. Indeed, it is still possible to modify the mergesort algorithm to solve this problem, but there is another alternative: counting sort. The basic idea is to keep track of how many times each integer appears. A hash table would be the ideal data structure for this purpose. From there, it's just a simple matter of iterating through each integer in the hash table in order, looking up how many times it appears, and writing the integers out to disk. The end result is a sorted list of integers. This sort has an O(n) runtime, as it requires one linear pass to count all the integers, and a second one to write them out in sorted order. This sort can only be used in special cases where the set of possible values of the unsorted elements is either known ahead of time or is finite. A counting sort would not work, for example, if we were dealing with arbitrary floats or strings.
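The counting sort idea fits in a few lines of Python. As a simplification, `Counter` plays the role of the hash table and we iterate the distinct keys in order; with a known, bounded key range you would iterate that range directly instead (`counting_sort` is an illustrative name):

```python
from collections import Counter

def counting_sort(nums):
    # First pass: count how many times each integer appears.
    counts = Counter(nums)
    # Second pass: walk the keys in order, emitting each one count times.
    out = []
    for value in sorted(counts):
        out.extend([value] * counts[value])
    return out

print(counting_sort([5, 1, 3, 1, 5, 5]))  # [1, 1, 3, 5, 5, 5]
```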

However, because we are dealing with integers, we can apply the counting sort in this scenario. Here are some quick numbers. There are 1,099,511,627,776 bytes in a terabyte. Each integer takes up four bytes of storage, which means there are a total of 274,877,906,944 unsorted integers. There are 1,073,741,824 bytes in a gigabyte. Assume we use some kind of hash table to keep track of the number of occurrences of each integer, and that each key/value pair consists of integer/integer. Without considering the additional overhead of storing the hash table metadata in memory, we would need eight bytes for each key/value pair. This means that, at most, we could hope to store the counts for 134,217,728 integers in memory at a time. It quickly becomes clear that there is not enough memory to do the counting sort in one pass. One workaround would be to write the integer counts out to disk, reading and counting the integers in one gigabyte chunks at a time (allowing for additional memory overhead used by the program itself). Disk I/O is expensive, however. To make the system more robust, we can write out additional metadata to disk to keep track of what has been read so far. This would allow the program to resume progress from where it left off in the case of a fatal error.

This solution can be further optimized based on the distribution of the unsorted integers. This is why asking for clarification during the interview process is so important. If the range of possible integer values is limited by a minimum and maximum value, then this frees up the amount of memory required for storing the keys. Likewise, if there is a maximum number of times each integer can appear in the unsorted list, this frees up the amount of memory required for storing the count values.

The first chapter of the book Programming Pearls provides one such example of a problem that can be solved entirely in memory with an optimized counting sort. The problem involves a list of unique nine digit phone numbers that need to appear in sorted order. Because each number appears at most once in the unsorted set, only one bit of memory is needed to store the count for each one. Thus, a bit vector can be used to store the key/value pairs for the counting sort. Each bit in the vector represents one of the possible nine digit numbers between 0 and 999,999,999. If the bit corresponding to a number is flipped on, the number appears in the unsorted list; otherwise it does not. A bit vector with 1,000,000,000 bits requires only 125,000,000 bytes, or roughly 119 megabytes of memory. Not bad. But even if we had a scenario where the set of key/value pairs cannot fit into main memory, this limitation can be bypassed altogether.
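A bit vector sort along these lines can be sketched with a `bytearray`, one bit per possible value. The function name `bitvector_sort` and the small `max_value` in the example are illustrative; the phone number problem would use `max_value = 999_999_999`:

```python
def bitvector_sort(numbers, max_value):
    # One bit per possible value between 0 and max_value; a set bit means
    # that number appears in the (duplicate-free) input.
    bits = bytearray(max_value // 8 + 1)
    for n in numbers:
        bits[n // 8] |= 1 << (n % 8)
    # Scan the vector in order, emitting every number whose bit is on.
    return [n for n in range(max_value + 1) if bits[n // 8] & (1 << (n % 8))]

print(bitvector_sort([42, 7, 99, 13], 99))  # [7, 13, 42, 99]
```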

The advantage of the counting sort is that it can scale with the size of the unsorted elements by using a distributed system to do the counting. Here is a high level overview of one possible design. The range of possible values can be divided into buckets across various slave machines; each machine stores the counts for its respective range of values. For example, machine A stores all the integers between 1 and 1,000,000, machine B stores all the integers between 1,000,001 and 2,000,000, and so on. Each time the master reads an integer from the unsorted set, a mapping function is applied to determine the machine whose range the integer falls in. A notification service then alerts the appropriate machine, which increments the count for that integer accordingly. Once all the integers have been read and counted, each machine in the system can return its set of ordered key/value pairs by exposing it via a web service interface, or some functional equivalent. Keys in the range that have a count of zero can be omitted from the response. The master can then aggregate this data and print out the final result in sorted order.
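The mapping function at the heart of this design is just integer division on the bucket size. Here is a toy sketch, with dictionaries standing in for the per-machine counters; `bucket_for` and the bucket size are my own illustrative choices:

```python
from collections import defaultdict

def bucket_for(value, bucket_size=1_000_000):
    # Machine 0 covers 1..1,000,000, machine 1 covers 1,000,001..2,000,000, ...
    return (value - 1) // bucket_size

# In-memory stand-in for the per-machine counters.
machines = defaultdict(lambda: defaultdict(int))
for n in [5, 1_500_000, 5, 2_000_001]:
    machines[bucket_for(n)][n] += 1

print(bucket_for(1_000_000))  # 0 (last value owned by machine 0)
print(bucket_for(1_000_001))  # 1 (first value owned by machine 1)
```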

The potential downside of a distributed approach is the increased complexity. Each machine is now a potential point of failure. To address this, additional error checking logic is needed to make the system more robust. For example, every time a machine increments a count, we want to notify the master that the work is done, and if an error is encountered, we want to retry the increment operation. Ideally, this should all be done asynchronously to improve performance, using some sort of event driven notification system. The easiest solution here would be to simply use a third party component. Messaging queues such as RabbitMQ, MSMQ, and IBM MQ were built from the ground up with these kinds of issues in mind. We can add redundancy and provide even more error checking by having clusters of machines be responsible for each range of numbers. Once all the unsorted integers have been counted, we can MD5 hash the data structure containing the key/value pairs on each machine. By checking that this hash is identical across all the machines in a given cluster, we can help ensure that the counts were performed correctly. The end result of this additional complexity is a highly scalable system that can sort an unlimited number of elements, given enough hardware.

When given a sorting problem in an interview, it is important to ask lots of questions and try to break the problem down into something simpler. What exactly is being sorted? By identifying the various constraints on the possible set of values, an easier solution such as a counting sort may present itself. Typically, the interviewer will follow up the question by asking you to scale your solution. You'll want to think about how to divide the work up in parallel across a large distributed system. The map/reduce class of algorithms was designed to do exactly that; a distributed counting sort is one such example. It may not be the right solution every time, but it's a good one to bring up during the interview process. The right mindset to have when asked a scalability question is to have a conversation with the interviewer, discussing the various possible solutions and their design tradeoffs. A good talk goes a long way, and at the very least, you may come out of it learning a lot.

Interview questions explained: Fun with multiplication!

Write a method that takes as its input an array of integers. For each element in the array, print the product of all the other elements in the array. For example, given the following array:
[1,2,3,4,5]
The output would be:
120, 60, 40, 30, 24

Let’s start with the most literal solution. For each element in the array, loop through all the other elements and multiply them together:

int[] j = { 1, 2, 3, 4, 5 };

for (int outerIndex = 0; outerIndex < j.Length; outerIndex++)
{
    int product = 1;
    for (int innerIndex = 0; innerIndex < j.Length; innerIndex++)
    {
        if (innerIndex != outerIndex)
        {
            product *= j[innerIndex];
        }
    }

    Console.WriteLine(product);
}

This solution requires an outer for loop to iterate through each element in the array and then an inner for loop to iterate through the other n-1 elements in the array. The asymptotic complexity of this is O(n^2).

Obviously, this isn’t a very good solution, so we can do better. The key observation to make here is that for any given element with value x in the array, the product of all the other elements in the array is simply the total product of all the elements divided by x. We only need to calculate the total product once.

This solution only requires two passes through the array. The first calculates the total product, and the second divides the total product by each element in the array:

int[] j = { 1, 2, 3, 4, 5 };

int totalProduct = 1;
for (int i = 0; i < j.Length; i++)
{
    totalProduct *= j[i];
}

for (int i = 0; i < j.Length; i++)
{
    //assume we check for zeroes beforehand to prevent divide by zero errors
    Console.WriteLine(totalProduct / j[i]);
}

This solution has O(n) complexity, since it requires only two linear scans of the array.

Now let’s make the problem a little bit more challenging. What if we cannot use division? Assume that the operation is too prohibitively expensive and that we need to find a workaround (a not uncommon scenario in the real world).

We can use an algorithm design technique known as dynamic programming: break the problem up into smaller sub-problems, solve and store the solution for each one, and then combine the solutions as necessary to arrive at the answer to the original problem. One difference between dynamic programming and the divide and conquer class of algorithms is that dynamic programming stores the solutions to the sub-problems, which are then retrieved at a later point in time. This is an optimization that prevents the same sub-problems from being solved multiple times. One example of this approach is calculating the nth element in the Fibonacci sequence. The Fibonacci sequence starts with 0, 1, 1, 2, and every subsequent number is the sum of the previous two numbers in the sequence. Typically, the solution involves using either recursion or iteration. However, we can use dynamic programming by precomputing the sequence up to some number j. Thus, for all n ≤ j, instead of having to use recursion or iteration, we can do a simple lookup. For n > j, we can still reduce the number of computations by summing forward from elements j and j-1 instead of from 0 and 1.
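The Fibonacci example can be sketched in a few lines of Python: precomputed values live in a cache, and any request beyond the cache extends it by summing forward from the last two stored elements (the `make_fib` closure is my own way of packaging the cache):

```python
def make_fib():
    # Cache of precomputed Fibonacci numbers; starts with the base cases.
    cache = [0, 1]

    def fib(n):
        # Extend the cache only as far as needed, reusing stored results;
        # requests within the cache are plain O(1) lookups.
        while len(cache) <= n:
            cache.append(cache[-1] + cache[-2])
        return cache[n]

    return fib

fib = make_fib()
print(fib(10))  # 55
```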

With respect to the problem of determining the products without the usage of division, we can use a similar approach. Note that the inefficiency of the first solution is due to the same multiplication operations being carried out multiple times. We can eliminate these redundant calculations by using dynamic programming. For any given element k in the array j of size n, we want to multiply all the numbers to the left of k with all the numbers to the right of k. By precomputing the running product of all the elements in the array in both directions (from left to right and vice versa), we now know the product of all the numbers to the left and right of any element k in the array. The problem can then be solved:

int[] j = {1,2,3,4,5};

int[] runningProductLeft = new int[j.Length];
int[] runningProductRight = new int[j.Length];  

int product = 1;
//there is no element to the left of the start of the array, so set it to 1
runningProductLeft[0] = 1;
            
//since we already set the first element of runningProductLeft
//start populating it from index 1
for (int i = 1; i < j.Length; i++)
{
    product = product * j[i - 1];
    runningProductLeft[i] = product;
}

//we want to populate runningProductRight in reverse by starting from the end of the array

//there is no element to the right of the end of the array, so set it to 1
runningProductRight[j.Length - 1] = 1;
product = 1;

//since we already populated the last element of runningProductRight, start populating from the second to last element in the array
for (int i = j.Length - 2; i >= 0; i--)
{ 
    product = product * j[i + 1];
    runningProductRight[i] = product;
}

//now that the running products have been precomputed, printing out the solution becomes trivial
for (int i = 0; i < j.Length; i++)
{
    product = runningProductLeft[i] * runningProductRight[i];
    Console.WriteLine(product);
}

This solution requires three linear scans through the array, so the runtime complexity is still O(n).

Reverse all the bits in a byte

A common question encountered in interviews is to reverse all the characters in a string. A fun little variant on this one is to reverse all the bits in a byte. It's a good way to test a candidate's basic knowledge of bit manipulation. Of course, there are quite a few solutions to these problems, some of them devised by low level programming gurus. You know, the kind of people who code in assembly and write drivers for fun. Those solutions involve an esoteric combination of multiplication and modulo operations that somehow end up with the correct answer. The solution presented here is the standard (boring) textbook solution for mere mortals.

First, we’ll need a way to figure out whether a given bit in a byte is on or off. To do so, we bitwise AND the byte with 2^(n-1), where n is the bit position we are interested in (ranging, of course, from one to eight), and then compare the result to zero. 2^(n-1) serves as a bit mask: every bit in it is off except the one at position n, so the AND operation zeroes out every other bit in the byte. If the result is zero, then bit n of the original byte must also be 0. If the result is anything else, then bit n must be 1.

Here is a concrete example. If we want to determine whether or not the third bit is on in a given byte x, we AND x with 2^(3-1), or 00000100 in binary. Let's take a look at the number 36: 36 AND 4 = 4, which means the third bit is on. How about 17? 17 AND 4 = 0, so the third bit is off. Opening up the trusty Windows Calculator in programmer mode can quickly verify this.
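The bit test is one line of code. Here it is as a small Python sketch (the function name `bit_is_on` is my own), reproducing the two examples above:

```python
def bit_is_on(x, n):
    # AND with 2^(n-1) isolates bit n (1-indexed from the least significant
    # bit); a nonzero result means the bit is on.
    return (x & (1 << (n - 1))) != 0

print(bit_is_on(36, 3))  # True:  36 AND 4 = 4
print(bit_is_on(17, 3))  # False: 17 AND 4 = 0
```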

Now, instead of writing out eight separate bit masks for each bit position, we can simply use one bit mask: 1. This makes life easier. We can then write a loop from one to eight that shifts the byte right by one on each iteration. By ANDing the byte with one before each shift, we get the value of the bit at position n, where n is the current loop index.

During each iteration, we’ll need to write this bit value to its correct location in an output byte, which we initialize to zero. If the bit in question is 0, then we don’t need to do anything, since the corresponding bit in the output byte is already zero. If the bit is equal to 1, then we need to write it at position 9-n in the output byte, where n is the current loop index. To do that, we bitwise OR the output byte with the bit mask 2^(8-n). Because every other bit in this mask is zero, we preserve all the existing values in the output byte, except the one that we want to flip on.

Instead of using eight different bit masks, we can again simplify things by using only one bit mask: 1. To do so, we’ll slightly modify the loop logic to left shift the output byte by 1 before ORing in the current bit. This writes the current bit to the least significant position. After every bit in the input byte has been iterated over, all the bits will have been shifted into their correct positions.

The code looks like this in C#:

static byte reverseByte(byte val)
{
    byte result = 0;

    int counter = 8;
    while (counter-- > 0)
    {
        result <<= 1;
        result |= (byte)(val & 1);
        val = (byte)(val >> 1);
    }

    return result;
}

Let’s reverse 182 (10110110 in binary). Here are the values of the output byte in binary after each iteration of our loop, so we can see what’s going on:

Bit 1 of 182 is 0.
0 OR 0 = 0.

Bit 2 of 182 is 1
0 << 1 = 00
00 OR 1 = 01

Bit 3 of 182 is 1
01 << 1 = 010
010 OR 1 = 011

Bit 4 of 182 is 0
011 << 1 = 0110
0110 OR 0 = 0110

Bit 5 of 182 is 1
0110 << 1 = 01100
01100 OR 1 = 01101

Bit 6 of 182 is 1
01101 << 1 = 011010
011010 OR 1 = 011011

Bit 7 of 182 is 0
011011 << 1 = 0110110
0110110 OR 0 = 0110110

Bit 8 of 182 is 1
0110110 << 1 = 01101100
01101100 OR 1 = 01101101

Whew, so there you have it. This is the standard answer you'd want to give in an interview. For bonus points, you'd want to point out that there's a simpler and faster solution that saves on CPU cycles, at the cost of a few extra bytes of memory: a lookup table. By storing each byte value and its reversed value as key/value pairs, we can do an instant O(1) lookup, at the expense of just 512 bytes of memory. As an added bonus, the CPU can potentially keep the table in its cache as well, so it would never even need to hit main memory. This is a pretty good tradeoff, and it's worth doing this kind of analysis in an interview. In the real world, applications are always bound by some bottleneck, whether it be the CPU, memory, disk I/O, or something else entirely, and a good developer needs to be aware of these issues when writing code, as they obviously play an important role when making architectural design decisions.
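The lookup table approach can be sketched in Python: compute the loop-based reversal once for all 256 byte values, then every subsequent reversal is a plain array index (`reverse_byte` and `REVERSED` are illustrative names):

```python
def reverse_byte(b):
    # The loop-based bit reversal from above, transliterated to Python.
    result = 0
    for _ in range(8):
        result = (result << 1) | (b & 1)
        b >>= 1
    return result

# Precompute all 256 reversed values once; lookups are then O(1).
REVERSED = [reverse_byte(b) for b in range(256)]

print(REVERSED[182])  # 109 (10110110 -> 01101101)
```

Note that byte reversal is its own inverse, so `REVERSED[REVERSED[b]]` always returns the original `b`, which makes the table easy to sanity check.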