TSQL PIVOT

The TSQL PIVOT command is an operation that essentially rotates a set of table data. The end result is a transform that converts columns into rows. One common usage of PIVOT is to normalize denormalized data. This may seem rather abstract and confusing at first, so here’s a simple example.
ACME Corp has an optional end user survey that allows you to them to track customer satisfaction. The survey asks users to rate the company on a scale of 1-5 across a variety of criteria. This data is stored in the database as a set of denormalized key/value pairs:

CREATE TABLE SURVEY_RESULTS (
   RESULT_ID int PRIMARY KEY IDENTITY(1,1)
, [USER_ID] int
, QUESTION_ID int
, SURVEY_SCORE int
)

The SURVEY_RESULTS table has a foreign key, QUESTION_ID, that maps to the QUESTION_ID column in the table SURVEY_QUESTIONS. The SURVEY_QUESTIONS table stores the actual survey questions themselves (In a real world scenario, SURVEY_QUESTION would store a localization id that points to some resource file instead of holding the actual text of the question):

CREATE TABLE SURVEY_QUESTIONS (
QUESTION_ID int PRIMARY KEY
,SURVEY_QUESTION varchar(50))


ALTER TABLE SURVEY_RESULTS
	ADD CONSTRAINT fk_SURVEY_RESULTS_QUESTION_ID_SURVEY_QUESTIONS_QUESTION_ID FOREIGN KEY (QUESTION_ID)
REFERENCES SURVEY_QUESTIONS (QUESTION_ID)

INSERT 
  INTO SURVEY_QUESTIONS WITH (ROWLOCK)
       (QUESTION_ID, SURVEY_QUESTION)
VALUES (1, 'Rate our service')
      ,(2, 'Rate your customer satisfaction')
      ,(3, 'Rate our user interface')

INSERT
  INTO SURVEY_RESULTS WITH (ROWLOCK)
       ([USER_ID], QUESTION_ID, SURVEY_SCORE)
VALUES (1, 1, 5)
      ,(1, 2, 4)
      ,(1, 3, 5)
      ,(2, 1, 4)
      ,(2, 2, 3)
      ,(2, 3, 2)
      ,(3, 1, 3)
      ,(3, 2, 2)
      ,(3, 3, 1)

This design allows flexibility in adding and removing questions. Using a normalized table with a column for each question would mean that every time the survey changed, entire columns would need to be added or removed. Of course, a normalized table makes it much easier to visualize the survey results though, so it would be helpful to create a VIEW on the SURVEY_RESULTS table that converted the key/value rows into columns.

SELECT * FROM SURVEY_RESULTS

Gives us the following, which is hard to interpret at a glance due to its denormalized nature:

 
RESULT_ID   USER_ID     QUESTION_ID SURVEY_SCORE
----------- ----------- ----------- ------------
1           1           1           5
2           1           2           4
3           1           3           5
4           2           1           4
5           2           2           3
6           2           3           2
7           3           1           3
8           3           2           2
9           3           3           1 

What we’d like to see is one row for each user, with the score for each question as its own column in that row. This can be achieved using PIVOT (In the real world we’d create a view for easy at-a-glance access to this data):

SELECT  
USER_ID, [1] AS 'Service score', [2] AS 'Customer satisfaction', [3] AS 'User Interface'
FROM
(SELECT [USER_ID], QUESTION_ID, SURVEY_SCORE
    FROM SURVEY_RESULTS) AS InputTable
PIVOT
(
AVG(SURVEY_SCORE)
FOR QUESTION_ID IN ([1], [2], [3])
) AS PivotTable;

This gives us the desired result set:


USER_ID     Service score Customer satisfaction User Interface
----------- ------------- --------------------- --------------
1           5             4                     5
2           4             3                     2
3           3             2                     1

Let’s take a closer look at this query. The first thing that we need to do is specify what data we are pivoting. This is done in the FROM clause. As you can see, this is just a simple table expression that selects all three columns from the SURVEY_RESULTS table. For brevity’s sake, I will refer to this table expression by its alias, InputTable.

Now that we know what we are pivoting, we need to identity the pivot column and the set of values to pivot on. This is done in the FOR clause. In our example, we are pivoting on the QUESITON_ID column, and the set of values we are interested in are 1, 2, and 3, which correspond to the questions about service, customer satisfaction, and the user interface. For each value in this set, all rows in InputTable that have a matching value in the specified pivot column will be combined together based on an aggregation function. In our case, that would be the average survey score.

The astute reader will point out that applying an aggregation function implies that there is some sort of grouping being done. There is indeed an implicit grouping based on any column appearing inside InputTable that is not directly referenced inside of the PIVOT statement. In our case, USER_ID is mentioned inside the InputTable but it is not referenced inside the PIVOT statement, so the final result will be grouped based on USER_ID. The query will generate a row for each user in the final result set, along with three columns corresponding to the average score for each question. These are referenced in the outer SELECT statement and are based on the set of values provided to PIVOT. In our case, these are [1], [2], and [3]. Of course, since our sample data only has one answer from a given user per question, the aggregation function doesn’t really matter.

Let’s suppose we want to average the scores across all users. In that case, we simply modify the query slightly and remove USER_ID from the InputTable (note that we also cast SURVEY_SCORE as a decimal, so the average does not get rounded to the nearest integer). This causes the entire result set to be implicitly grouped, which is exactly what we want:

SELECT  
 [1] AS 'Service score', [2] AS 'Customer satisfaction', [3] AS 'User Interface'
FROM
(SELECT QUESTION_ID, CAST(SURVEY_SCORE AS DECIMAL) AS SURVEY_SCORE
    FROM SURVEY_RESULTS) AS InputTable
PIVOT
(
AVG(SURVEY_SCORE)
FOR QUESTION_ID IN ([1], [2], [3])
) AS PivotTable;

As expected, this query returns exactly one row with the average score for each question appearing in its own column:

 
Service score                           Customer satisfaction                   User Interface
--------------------------------------- --------------------------------------- ---------------------------------------
4.000000                                3.000000                                2.666666

 

And that covers the PIVOT statement in a nutshell. What about UNPIVOT? As you can probably guess, the UNPIVOT command essentially rotates the data in the opposite direction: It takes a set of normalized table data and denormalizes it, converting columns into rows. Its syntax is similar and its use is left as an exercise to the reader.

TSQL query the missing gaps in a table

I’m currently studying to get a Microsoft SQL Server 2008 Database Development Certificate and am going through their official 70$ 400 + page training kit. It covers a wide variety of topics, one of them being how to find the missing gaps in a table. This actually turns out to be a good interview question that can test whether or not a candidate understands how to write more complex queries that go beyond simple select statements.

The problem itself is relatively straightforward. Imagine for example, that you have an identity column on a table. You might expect to see a consecutive series of integers when you select that column from the table. However, on a production environment, you’d typically see “missing” values. This can be due to a number of reasons. The most obvious one being that rows can get deleted. A more subtle cause is that rollbacks do NOT reset the identity seed value on a column – so any failed inserts will cause the seed value column to increment. Regardless of the cause, let’s say you would like to identify all the gaps in the table.

Consider the following table:

CREATE TABLE TestTable (SomeID INT NOT NULL PRIMARY KEY)

INSERT INTO TestTable (SomeID) VALUES (1)
INSERT INTO TestTable (SomeID) VALUES (2)
INSERT INTO TestTable (SomeID) VALUES (3)
INSERT INTO TestTable (SomeID) VALUES (10)
INSERT INTO TestTable (SomeID) VALUES (11)
INSERT INTO TestTable (SomeID) VALUES (12)
INSERT INTO TestTable (SomeID) VALUES (90)
INSERT INTO TestTable (SomeID) VALUES (91)
INSERT INTO TestTable (SomeID) VALUES (92)

The missing gaps are 4-9, and 13-89. How can you write a query that correctly returns these values?

A good approach is to solve the problem by breaking it down into smaller pieces. The first piece of the puzzle is to figure out where each missing gap begins. We know that there is a gap if for a given value x in the table, there doesn’t exist an entry x + 1 in the table. To use a concrete example from our sample table: There is no gap after 1, because 2 exists in the table. However, there is a gap after 3, because 4 does not exist in the table. So we translate this into SQL: Select all values of SomeID where there does not exist another SomeID value that is equal to the current SomeID value plus one. This will require a correlated subquery where the inner query references the outer query’s SomeID value:

SELECT a.SomeID AS StartGap
  FROM TestTable a 
 WHERE NOT EXISTS (SELECT * FROM TestTable b WHERE b.SomeID = a.SomeID + 1)

This returns:

StartGap
---------
3
12
92

There are two things to note here. First, we need to add 1 to the Start value in order to get the actual starting value of the gap. The second thing to note is that 92 is returned because it is the largest entry in the table. This isn’t a gap, its just the last value in the table. So, we need to filter out the maximum value of SomeID from the table. The modified query looks like this:

SELECT a.SomeID + 1 AS StartGap
  FROM TestTable a 
WHERE NOT EXISTS (SELECT SomeID FROM TestTable b WHERE b.SomeID = a.SomeID + 1)
   AND a.SomeID < (SELECT MAX(SomeID) FROM TestTable)

This gives us

StartGap
--------
4
13

Now that we know where each gap begins, we can tackle the task of figuring out where the gap ends. Given the start gap value, the end gap value is equal to the next largest number in the table minus one. For example, there is a gap that begins at 4. The next largest number that exists in the table is 10, so the gap ends at 9. Translated into SQL, given the current value of SomeID, we want to select all the values from the table greater than SomeID, and then return the MINIMUM.

Again, we will use a correlated subquery to reference the current value of SomeID:

SELECT a.SomeID + 1 AS StartGap
	  ,(SELECT MIN(SomeID) - 1 FROM TestTable c WHERE c.SomeID > a.SomeID) AS EndGap
  FROM TestTable a 
WHERE NOT EXISTS (SELECT * FROM TestTable b WHERE b.SomeID = a.SomeID + 1)
   AND a.SomeID < (SELECT MAX(SomeID) FROM TestTable )

This gives us the correct result of

StartGap    EndGap
------------------
4	        9
13	        89

Tada!

As a side note, the Microsoft SQL Server 2008 – Database Development Training Kit provides two solutions to this problem, but it literally doesn’t bother explaining them at all. Quite disappointingly, it merely provides the answer. I had remembered reading about this problem before in a different book, Inside Microsoft SQL Server 2008: T-SQL Querying, by Itzik Ben-Gan, so I referred to it instead. The author does a deep dive on this topic and its related problems in his book, providing multiple solutions and detailed explanations. I was fortunate enough to attend a Microsoft DevConnections conference many years ago where Itzik Ben-Gan ran a number of T-SQL workshops, covering the missing gaps problem, in addition to many other tips and tricks. For me, the most impressive part about it was not the clear and concise way he managed to educate everyone, but that he could take a seemingly dry topic such as TSQL and actually make it interesting and even humorous. Naturally I bought his book and am glad I did so; after all I’m still referencing it years later. I’d highly recommend it for anyone else hoping to either get a SQL Certification, or just improve their TSQL skills. And no, I am not getting paid to plug the book (I wish). The Training Kit on the other hand, is quite comprehensive in the list of topics it covers, but it sacrifices depth for breadth. Its a good starting point and should be used more as a baseline checklist of SQL topics you should understand.