I’m currently studying to get a Microsoft SQL Server 2008 Database Development Certificate and am going through their official 70$ 400 + page training kit. It covers a wide variety of topics, one of them being how to find the missing gaps in a table. This actually turns out to be a good interview question that can test whether or not a candidate understands how to write more complex queries that go beyond simple select statements.
The problem itself is relatively straightforward. Imagine for example, that you have an identity column on a table. You might expect to see a consecutive series of integers when you select that column from the table. However, on a production environment, you’d typically see “missing” values. This can be due to a number of reasons. The most obvious one being that rows can get deleted. A more subtle cause is that rollbacks do NOT reset the identity seed value on a column – so any failed inserts will cause the seed value column to increment. Regardless of the cause, let’s say you would like to identify all the gaps in the table.
Consider the following table:
CREATE TABLE TestTable (SomeID INT NOT NULL PRIMARY KEY)
INSERT INTO TestTable (SomeID) VALUES (1)
INSERT INTO TestTable (SomeID) VALUES (2)
INSERT INTO TestTable (SomeID) VALUES (3)
INSERT INTO TestTable (SomeID) VALUES (10)
INSERT INTO TestTable (SomeID) VALUES (11)
INSERT INTO TestTable (SomeID) VALUES (12)
INSERT INTO TestTable (SomeID) VALUES (90)
INSERT INTO TestTable (SomeID) VALUES (91)
INSERT INTO TestTable (SomeID) VALUES (92)
The missing gaps are 4-9, and 13-89. How can you write a query that correctly returns these values?
A good approach is to solve the problem by breaking it down into smaller pieces. The first piece of the puzzle is to figure out where each missing gap begins. We know that there is a gap if for a given value x in the table, there doesn’t exist an entry x + 1 in the table. To use a concrete example from our sample table: There is no gap after 1, because 2 exists in the table. However, there is a gap after 3, because 4 does not exist in the table. So we translate this into SQL: Select all values of SomeID where there does not exist another SomeID value that is equal to the current SomeID value plus one. This will require a correlated subquery where the inner query references the outer query’s SomeID value:
SELECT a.SomeID AS StartGap FROM TestTable a WHERE NOT EXISTS (SELECT * FROM TestTable b WHERE b.SomeID = a.SomeID + 1)
StartGap --------- 3 12 92
There are two things to note here. First, we need to add 1 to the Start value in order to get the actual starting value of the gap. The second thing to note is that 92 is returned because it is the largest entry in the table. This isn’t a gap, its just the last value in the table. So, we need to filter out the maximum value of SomeID from the table. The modified query looks like this:
SELECT a.SomeID + 1 AS StartGap FROM TestTable a WHERE NOT EXISTS (SELECT SomeID FROM TestTable b WHERE b.SomeID = a.SomeID + 1) AND a.SomeID < (SELECT MAX(SomeID) FROM TestTable)
This gives us
StartGap -------- 4 13
Now that we know where each gap begins, we can tackle the task of figuring out where the gap ends. Given the start gap value, the end gap value is equal to the next largest number in the table minus one. For example, there is a gap that begins at 4. The next largest number that exists in the table is 10, so the gap ends at 9. Translated into SQL, given the current value of SomeID, we want to select all the values from the table greater than SomeID, and then return the MINIMUM.
Again, we will use a correlated subquery to reference the current value of SomeID:
SELECT a.SomeID + 1 AS StartGap ,(SELECT MIN(SomeID) - 1 FROM TestTable c WHERE c.SomeID > a.SomeID) AS EndGap FROM TestTable a WHERE NOT EXISTS (SELECT * FROM TestTable b WHERE b.SomeID = a.SomeID + 1) AND a.SomeID < (SELECT MAX(SomeID) FROM TestTable )
This gives us the correct result of
StartGap EndGap ------------------ 4 9 13 89
As a side note, the Microsoft SQL Server 2008 – Database Development Training Kit provides two solutions to this problem, but it literally doesn’t bother explaining them at all. Quite disappointingly, it merely provides the answer. I had remembered reading about this problem before in a different book, Inside Microsoft SQL Server 2008: T-SQL Querying, by Itzik Ben-Gan, so I referred to it instead. The author does a deep dive on this topic and its related problems in his book, providing multiple solutions and detailed explanations. I was fortunate enough to attend a Microsoft DevConnections conference many years ago where Itzik Ben-Gan ran a number of T-SQL workshops, covering the missing gaps problem, in addition to many other tips and tricks. For me, the most impressive part about it was not the clear and concise way he managed to educate everyone, but that he could take a seemingly dry topic such as TSQL and actually make it interesting and even humorous. Naturally I bought his book and am glad I did so; after all I’m still referencing it years later. I’d highly recommend it for anyone else hoping to either get a SQL Certification, or just improve their TSQL skills. And no, I am not getting paid to plug the book (I wish). The Training Kit on the other hand, is quite comprehensive in the list of topics it covers, but it sacrifices depth for breadth. Its a good starting point and should be used more as a baseline checklist of SQL topics you should understand.