Finding crap in the SQL Server database

Anyone who has maintained an old legacy code base has likely found themselves working with a decade-old database schema that now spans hundreds of tables, each of which contains dozens of columns. Trying to locate a particular column or table in the db can be particularly difficult, especially if the only clue you have is a vague inkling of what said table/column might be named. Luckily, SQL Server provides several system catalog views that hold metadata that can help you find the proverbial needle in the haystack.

We can query sys.tables and sys.columns like so:

--look for a particular table
SELECT t.name AS table_name
FROM sys.tables AS t
WHERE t.name LIKE '%needle%'

--look for a particular column
SELECT t.name AS table_name,
       c.name AS column_name
FROM sys.tables AS t
JOIN sys.columns AS c
    ON t.object_id = c.object_id
WHERE c.name LIKE '%needle%'
ORDER BY table_name;

The INFORMATION_SCHEMA.COLUMNS view lets you accomplish the same thing. It shares a lot of the same information as sys.columns (though it does not contain the object_id), but it already includes the table name, so you don’t have to do a join. This is good if you are lazy and dislike typing more than necessary.

SELECT TABLE_NAME, COLUMN_NAME
FROM Information_Schema.Columns
WHERE COLUMN_NAME LIKE '%needle%'
ORDER BY TABLE_NAME;

These queries provide a good starting point for exploring the database and all the metadata available to you. Of course, if you’re too lazy to do any typing at all, you can simply open SQL Server Management Studio, click on “View” in the menu, and select Object Explorer Details. From there you can do a search. It’s important to note that this searches EVERYTHING: views, stored procedures, triggers, tables, columns, and everything else that is an object in SQL Server.

If you run the profiler, you can see that it is indeed querying all the system tables, including sys.database_principals, sys.asymmetric_keys, and sys.all_objects.

Oftentimes this UI search will get the job done, although it can produce a lot of noise.
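If the noise bothers you, you can get a rough approximation of that search in plain TSQL by querying sys.all_objects yourself. Here’s a minimal sketch (filter on type_desc if you only care about certain kinds of objects; note that this only matches object names, not column names):

--search all objects (tables, views, procs, triggers, etc.) by name
SELECT o.name AS object_name,
       o.type_desc
FROM sys.all_objects AS o
WHERE o.name LIKE '%needle%'
ORDER BY o.type_desc, o.name;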

TSQL PIVOT

The TSQL PIVOT command is an operation that essentially rotates a set of table data. The end result is a transform that converts rows into columns. One common usage of PIVOT is to normalize denormalized key/value data. This may seem rather abstract and confusing at first, so here’s a simple example.
ACME Corp has an optional end user survey that allows them to track customer satisfaction. The survey asks users to rate the company on a scale of 1-5 across a variety of criteria. This data is stored in the database as a set of denormalized key/value pairs:

CREATE TABLE SURVEY_RESULTS (
   RESULT_ID int PRIMARY KEY IDENTITY(1,1)
, [USER_ID] int
, QUESTION_ID int
, SURVEY_SCORE int
)

The SURVEY_RESULTS table has a foreign key, QUESTION_ID, that maps to the QUESTION_ID column in the table SURVEY_QUESTIONS. The SURVEY_QUESTIONS table stores the actual survey questions themselves (in a real-world scenario, SURVEY_QUESTION would store a localization id that points to some resource file instead of holding the actual text of the question):

CREATE TABLE SURVEY_QUESTIONS (
QUESTION_ID int PRIMARY KEY
,SURVEY_QUESTION varchar(50))


ALTER TABLE SURVEY_RESULTS
	ADD CONSTRAINT fk_SURVEY_RESULTS_QUESTION_ID_SURVEY_QUESTIONS_QUESTION_ID FOREIGN KEY (QUESTION_ID)
REFERENCES SURVEY_QUESTIONS (QUESTION_ID)

INSERT 
  INTO SURVEY_QUESTIONS WITH (ROWLOCK)
       (QUESTION_ID, SURVEY_QUESTION)
VALUES (1, 'Rate our service')
      ,(2, 'Rate your customer satisfaction')
      ,(3, 'Rate our user interface')

INSERT
  INTO SURVEY_RESULTS WITH (ROWLOCK)
       ([USER_ID], QUESTION_ID, SURVEY_SCORE)
VALUES (1, 1, 5)
      ,(1, 2, 4)
      ,(1, 3, 5)
      ,(2, 1, 4)
      ,(2, 2, 3)
      ,(2, 3, 2)
      ,(3, 1, 3)
      ,(3, 2, 2)
      ,(3, 3, 1)

This design allows flexibility in adding and removing questions. Using a normalized table with a column for each question would mean that every time the survey changed, entire columns would need to be added or removed. A normalized table does make it much easier to visualize the survey results, though, so it would be helpful to create a VIEW on the SURVEY_RESULTS table that converts the key/value rows into columns.

SELECT * FROM SURVEY_RESULTS

Gives us the following, which is hard to interpret at a glance due to its denormalized nature:

 
RESULT_ID   USER_ID     QUESTION_ID SURVEY_SCORE
----------- ----------- ----------- ------------
1           1           1           5
2           1           2           4
3           1           3           5
4           2           1           4
5           2           2           3
6           2           3           2
7           3           1           3
8           3           2           2
9           3           3           1 

What we’d like to see is one row for each user, with the score for each question as its own column in that row. This can be achieved using PIVOT (in the real world, we’d create a view for easy at-a-glance access to this data):

SELECT  
USER_ID, [1] AS 'Service score', [2] AS 'Customer satisfaction', [3] AS 'User Interface'
FROM
(SELECT [USER_ID], QUESTION_ID, SURVEY_SCORE
    FROM SURVEY_RESULTS) AS InputTable
PIVOT
(
AVG(SURVEY_SCORE)
FOR QUESTION_ID IN ([1], [2], [3])
) AS PivotTable;

This gives us the desired result set:


USER_ID     Service score Customer satisfaction User Interface
----------- ------------- --------------------- --------------
1           5             4                     5
2           4             3                     2
3           3             2                     1

Let’s take a closer look at this query. The first thing that we need to do is specify what data we are pivoting. This is done in the FROM clause. As you can see, this is just a simple table expression that selects the three columns we need from the SURVEY_RESULTS table (RESULT_ID is deliberately left out, since any extra column would end up in the implicit grouping discussed below). For brevity’s sake, I will refer to this table expression by its alias, InputTable.

Now that we know what we are pivoting, we need to identify the pivot column and the set of values to pivot on. This is done in the FOR clause. In our example, we are pivoting on the QUESTION_ID column, and the set of values we are interested in are 1, 2, and 3, which correspond to the questions about service, customer satisfaction, and the user interface. For each value in this set, all rows in InputTable that have a matching value in the specified pivot column will be combined together based on an aggregation function. In our case, that is the average survey score.

The astute reader will point out that applying an aggregation function implies that there is some sort of grouping being done. There is indeed an implicit grouping based on any column appearing inside InputTable that is not directly referenced inside of the PIVOT statement. In our case, USER_ID is mentioned inside the InputTable but it is not referenced inside the PIVOT statement, so the final result will be grouped based on USER_ID. The query will generate a row for each user in the final result set, along with three columns corresponding to the average score for each question. These are referenced in the outer SELECT statement and are based on the set of values provided to PIVOT. In our case, these are [1], [2], and [3]. Of course, since our sample data only has one answer from a given user per question, the aggregation function doesn’t really matter.
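If the implicit grouping still feels a little magical, it may help to see that the same per-user result can be produced with an explicit GROUP BY and CASE expressions. This equivalent rewrite isn’t part of the PIVOT syntax; it’s just here to make the grouping visible:

SELECT [USER_ID]
     , AVG(CASE WHEN QUESTION_ID = 1 THEN SURVEY_SCORE END) AS 'Service score'
     , AVG(CASE WHEN QUESTION_ID = 2 THEN SURVEY_SCORE END) AS 'Customer satisfaction'
     , AVG(CASE WHEN QUESTION_ID = 3 THEN SURVEY_SCORE END) AS 'User Interface'
FROM SURVEY_RESULTS
GROUP BY [USER_ID];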

Let’s suppose we want to average the scores across all users. In that case, we simply modify the query slightly and remove USER_ID from InputTable (note that we also cast SURVEY_SCORE as a decimal so the average does not get truncated to an integer). This causes the entire result set to be implicitly grouped, which is exactly what we want:

SELECT  
 [1] AS 'Service score', [2] AS 'Customer satisfaction', [3] AS 'User Interface'
FROM
(SELECT QUESTION_ID, CAST(SURVEY_SCORE AS DECIMAL) AS SURVEY_SCORE
    FROM SURVEY_RESULTS) AS InputTable
PIVOT
(
AVG(SURVEY_SCORE)
FOR QUESTION_ID IN ([1], [2], [3])
) AS PivotTable;

As expected, this query returns exactly one row with the average score for each question appearing in its own column:

 
Service score                           Customer satisfaction                   User Interface
--------------------------------------- --------------------------------------- ---------------------------------------
4.000000                                3.000000                                2.666666

 

And that covers the PIVOT statement in a nutshell. What about UNPIVOT? As you can probably guess, the UNPIVOT command essentially rotates the data in the opposite direction: It takes a set of normalized table data and denormalizes it, converting columns into rows. Its syntax is similar and its use is left as an exercise to the reader.
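That said, if you’d like a starting point for the exercise, here’s a minimal sketch of the syntax. It assumes a hypothetical wide table named SURVEY_WIDE with one integer score column per question ([1], [2], [3]); the table name is made up purely for illustration:

--rotate per-question columns back into key/value rows
SELECT [USER_ID]
     , QUESTION_ID
     , SURVEY_SCORE
FROM SURVEY_WIDE
UNPIVOT
(
SURVEY_SCORE FOR QUESTION_ID IN ([1], [2], [3])
) AS UnpivotTable;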

Quick and dirty table import hack for SQL Server

Every so often, I’ve needed to import data from a QA/Staging database table onto my local machine to help troubleshoot a hard-to-repro bug. Typically, I only need to import a few tables’ worth of data, so creating a backup of the QA database and restoring it on my local machine is overkill. Here’s a simple trick to do it quickly and painlessly, without needing to click through the tedious export/import wizard or open up a command line to run BCP (and invariably having to look up the BCP syntax).

Open up SQL Server Management Studio and connect to the QA database in question. Open up a new query, and run a SELECT * FROM [TableName] query. Click on the results pane, and then Ctrl-A Ctrl-C to select everything and copy the data. Shown as an example here is the AdventureWorks.Person.Address table (obviously truncated as it contains almost 20,000 rows):

Disconnect from the QA database and reconnect to your local database. Open up a new query window and run TRUNCATE TABLE [TableName] to wipe out the existing data. If necessary, you can back up this data first, but since this is data on your local machine, chances are you won’t really need to. If you have any columns with the IDENTITY property, things will get trickier. Your best bet is to right-click the table in the Object Explorer, then Script Table As -> CREATE To -> New Query Window. From there, edit the automatically generated script by removing any IDENTITY properties (along with any foreign key constraints). DROP the table, and then run the creation script to recreate the table sans any identity columns.
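To make that concrete, here’s roughly what the edited creation script might look like for a made-up two-column table (a sketch only; your generated script will contain your table’s real columns and constraints):

--drop the local copy and recreate it without the IDENTITY property
DROP TABLE dbo.SomeTable;

CREATE TABLE dbo.SomeTable (
    SomeID   int         NOT NULL,  --was: SomeID int IDENTITY(1,1) NOT NULL
    SomeName varchar(50) NULL,
    CONSTRAINT PK_SomeTable PRIMARY KEY CLUSTERED (SomeID)
);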

Now, navigate to the table using the Object Explorer and right click the table you want to import the data into. Select Edit Table:

Right click the arrow to the right of the first row and select Paste:

The rows will now show up. You may now close the window (the data is saved automatically):

Yes, it’s really that easy. SQL Server is even smart enough to preserve the NULL values in varchar columns (as opposed to treating them as a string with the value “NULL”). Note that you can also use this same method to back up and restore your data by using an Excel spreadsheet as an intermediary backup file. Select all the data in the table and copy it as before. This time, paste it into an Excel spreadsheet. Excel will automatically detect and preserve the columns:

Later when you want to restore your original data, select and copy all the data from the Excel spreadsheet and paste it back into the table using the “Edit Table” method outlined above.

Obviously, this operation gets slower the more data there is to copy. After a certain point, it’d make more sense to use a more robust method, such as BCP. However, when it comes to programming, laziness is a virtue, and this is one of the fastest and easiest ways to import data.

TSQL query the missing gaps in a table

I’m currently studying to get a Microsoft SQL Server 2008 Database Development Certificate and am going through the official $70, 400+ page training kit. It covers a wide variety of topics, one of them being how to find the missing gaps in a table. This actually turns out to be a good interview question that can test whether or not a candidate understands how to write more complex queries that go beyond simple SELECT statements.

The problem itself is relatively straightforward. Imagine, for example, that you have an identity column on a table. You might expect to see a consecutive series of integers when you select that column from the table. However, in a production environment, you’d typically see “missing” values. This can be due to a number of reasons, the most obvious one being that rows can get deleted. A more subtle cause is that rollbacks do NOT reset the identity seed value on a column – so a failed, rolled-back insert still consumes an identity value and leaves a hole. Regardless of the cause, let’s say you would like to identify all the gaps in the table.
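As a quick aside, the rollback behavior is easy to demonstrate for yourself (the table and column names here are made up for the demo):

CREATE TABLE IdentityGapDemo (Id int IDENTITY(1,1) PRIMARY KEY, Payload varchar(10))

INSERT INTO IdentityGapDemo (Payload) VALUES ('first')     --gets Id = 1

BEGIN TRANSACTION
    INSERT INTO IdentityGapDemo (Payload) VALUES ('oops')  --would have been Id = 2
ROLLBACK TRANSACTION

INSERT INTO IdentityGapDemo (Payload) VALUES ('second')    --gets Id = 3, leaving a gap at 2

SELECT Id, Payload FROM IdentityGapDemo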

Consider the following table:

CREATE TABLE TestTable (SomeID INT NOT NULL PRIMARY KEY)

INSERT INTO TestTable (SomeID) VALUES (1)
INSERT INTO TestTable (SomeID) VALUES (2)
INSERT INTO TestTable (SomeID) VALUES (3)
INSERT INTO TestTable (SomeID) VALUES (10)
INSERT INTO TestTable (SomeID) VALUES (11)
INSERT INTO TestTable (SomeID) VALUES (12)
INSERT INTO TestTable (SomeID) VALUES (90)
INSERT INTO TestTable (SomeID) VALUES (91)
INSERT INTO TestTable (SomeID) VALUES (92)

The missing gaps are 4-9, and 13-89. How can you write a query that correctly returns these values?

A good approach is to solve the problem by breaking it down into smaller pieces. The first piece of the puzzle is to figure out where each missing gap begins. We know that there is a gap if for a given value x in the table, there doesn’t exist an entry x + 1 in the table. To use a concrete example from our sample table: There is no gap after 1, because 2 exists in the table. However, there is a gap after 3, because 4 does not exist in the table. So we translate this into SQL: Select all values of SomeID where there does not exist another SomeID value that is equal to the current SomeID value plus one. This will require a correlated subquery where the inner query references the outer query’s SomeID value:

SELECT a.SomeID AS StartGap
  FROM TestTable a 
 WHERE NOT EXISTS (SELECT * FROM TestTable b WHERE b.SomeID = a.SomeID + 1)

This returns:

StartGap
---------
3
12
92

There are two things to note here. First, we need to add 1 to the returned value in order to get the actual starting value of the gap. Second, 92 is returned because it is the largest entry in the table. This isn’t a gap; it’s just the last value in the table. So, we need to filter out the maximum value of SomeID from the table. The modified query looks like this:

SELECT a.SomeID + 1 AS StartGap
  FROM TestTable a 
WHERE NOT EXISTS (SELECT SomeID FROM TestTable b WHERE b.SomeID = a.SomeID + 1)
   AND a.SomeID < (SELECT MAX(SomeID) FROM TestTable)

This gives us

StartGap
--------
4
13

Now that we know where each gap begins, we can tackle the task of figuring out where the gap ends. Given the start gap value, the end gap value is equal to the next largest number in the table minus one. For example, there is a gap that begins at 4. The next largest number that exists in the table is 10, so the gap ends at 9. Translated into SQL, given the current value of SomeID, we want to select all the values from the table greater than SomeID, and then return the MINIMUM.

Again, we will use a correlated subquery to reference the current value of SomeID:

SELECT a.SomeID + 1 AS StartGap
	  ,(SELECT MIN(SomeID) - 1 FROM TestTable c WHERE c.SomeID > a.SomeID) AS EndGap
  FROM TestTable a 
WHERE NOT EXISTS (SELECT * FROM TestTable b WHERE b.SomeID = a.SomeID + 1)
   AND a.SomeID < (SELECT MAX(SomeID) FROM TestTable )

This gives us the correct result of

StartGap    EndGap
------------------
4	        9
13	        89

Tada!
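The correlated subquery approach isn’t the only way to solve this, either. Here’s an alternative sketch using ROW_NUMBER (available since SQL Server 2005) that pairs each row with the next one and reports a gap whenever the two values aren’t consecutive; it returns the same result as above:

WITH Numbered AS
(
SELECT SomeID
     , ROW_NUMBER() OVER (ORDER BY SomeID) AS rn
  FROM TestTable
)
SELECT cur.SomeID + 1 AS StartGap
     , nxt.SomeID - 1 AS EndGap
  FROM Numbered AS cur
  JOIN Numbered AS nxt
    ON nxt.rn = cur.rn + 1
 WHERE nxt.SomeID - cur.SomeID > 1;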

As a side note, the Microsoft SQL Server 2008 – Database Development Training Kit provides two solutions to this problem, but it doesn’t bother explaining them at all. Quite disappointingly, it merely provides the answer. I remembered reading about this problem before in a different book, Inside Microsoft SQL Server 2008: T-SQL Querying by Itzik Ben-Gan, so I referred to it instead. The author does a deep dive on this topic and its related problems in his book, providing multiple solutions and detailed explanations. I was fortunate enough to attend a Microsoft DevConnections conference many years ago where Itzik Ben-Gan ran a number of T-SQL workshops, covering the missing gaps problem in addition to many other tips and tricks. For me, the most impressive part was not the clear and concise way he managed to educate everyone, but that he could take a seemingly dry topic such as TSQL and actually make it interesting and even humorous. Naturally I bought his book and am glad I did so; after all, I’m still referencing it years later. I’d highly recommend it for anyone else hoping to either get a SQL certification or just improve their TSQL skills. And no, I am not getting paid to plug the book (I wish). The Training Kit, on the other hand, is quite comprehensive in the list of topics it covers, but it sacrifices depth for breadth. It’s a good starting point and should be used more as a baseline checklist of SQL topics you should understand.

Database isolation levels

I remember a job interview where I was asked to rate my SQL Server experience on a scale of 1 to 10. Not wanting to put my foot in my mouth, I responded with a modest 6. So, the interviewer asked me what the different SQL Server 2008 isolation levels were. Yikes! That was not an easy question. Luckily, it just so happened that I had recently been reading High Performance MySQL, and even though it is an entirely different database, the concept of isolation levels remains mostly the same. I was able to wing it and get the job.

Although this question seems like it is just a “vocabulary” quiz that tests somebody’s ability to memorize, it can open up a good conversation and test a candidate’s understanding of databases, transactions, and concurrency. After the interview, I decided to brush up; after all, I don’t deal much with isolation levels on a daily basis at work. The Wikipedia article is a great place to start, but being Wikipedia, it’s not the easiest to understand. Here’s my attempt to dumb it down.

A database isolation level essentially dictates what data different transactions running concurrently on a database can see. For example, let’s say you have one transaction A running that does a SELECT * FROM dbo.Customers (never mind the fact that this is a horrible query to run). Let’s say transaction B is running simultaneously, and inserts a new customer into the database. Neither transaction has committed yet. Does transaction A return the newly inserted customer from transaction B? Well, the answer depends on the isolation level. Here are the ones defined in the ANSI SQL standard:

  • Serializable – All transactions are isolated from each other. The system appears to behave as if all transactions execute serially. This doesn’t necessarily mean that they run one after the other. If two transactions don’t impact one another, some database systems run them concurrently. In our example, transaction A and transaction B cannot execute at the same time. So if A runs first, it does not see the insert. If B runs first, A sees the insert.

    Needless to say, such a draconian isolation level is not good for performance. Isolation levels such as this one would probably need to be used in financial applications such as ATMs that require information to be as up to date and accurate as possible. When dealing with money, you don’t want race conditions.

  • Repeatable read – Consider a transaction that runs the same query multiple times. In the repeatable read isolation level, the query will return the same result every time. Essentially, this means that a read lock is acquired for every row touched by the transaction; other transactions can’t update or delete those rows until the current transaction is done. Ironically enough, despite being called the “repeatable read” isolation level, it can’t quite guarantee the same result set every time.

    This isolation level is susceptible to a phenomenon known as phantom reads. Let’s say that the query being executed in transaction A is SELECT * FROM dbo.Customer WHERE FirstName BETWEEN 'A' AND 'C'. Transaction B inserts a customer with a first name of “Bob”. Now transaction A runs the same query. The newly inserted “Bob” row shows up. The reason for this is that no range locks are acquired. So although the individual rows themselves are locked, the range in the WHERE clause is not. If another transaction inserts/updates/deletes a row that happens to satisfy the WHERE clause, the first transaction sees it. This is the main difference between the serializable and repeatable read isolation levels. With serializable, you never get phantom reads. But with repeatable read, you get slightly more flexibility and better performance.

  • Read committed – This is typically the default isolation level of most databases. It does exactly what it sounds like: any given transaction can only see committed changes from other transactions. Write locks are held until the transaction commits, but read locks are released immediately (these read locks are acquired to ensure only committed data is read). Needless to say, nonrepeatable reads can occur, but the results will always be “valid”, since they contain the latest rows from committed transactions. This is a good default level, as it prevents dirty reads and lost updates (two transactions updating the same row simultaneously), while not stifling performance.
  • Read uncommitted – Transactions can read uncommitted changes from other transactions. Everything is a dirty read. This is great for performance, as it results in far fewer locks, and it works well for things where race conditions don’t really matter, like tweets or Facebook updates. For example, let’s say John decides to break up with Mary on some social networking website that implements stalker feeds. Transaction A inserts this into the database. Mary happens to log on at the same time, and transaction B selects this event from the database and displays John’s updated status of being single on Mary’s home page. For whatever reason, transaction A fails and rolls back. Well, Mary sees the breakup, and not believing her eyes, refreshes her home page, and sees that the update is gone. It’s not the end of the world; she’ll get news of the breakup sooner or later, and move on to someone better 🙂

To sum up: the higher the isolation level, the more locking overhead and the greater the risk of deadlocks. The lower the level, the better the performance, but the higher the risk of reading “invalid” data. Typically, the default level is good enough.
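If you want to experiment with these yourself, SQL Server lets you switch the level per session with SET TRANSACTION ISOLATION LEVEL. Here’s a minimal sketch (dbo.Customers is just the made-up table from the example above):

--switch the current session to REPEATABLE READ and observe its behavior
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

BEGIN TRANSACTION;

    --first read: acquires shared locks on the rows it touches
    SELECT * FROM dbo.Customers WHERE FirstName BETWEEN 'A' AND 'C';

    --concurrent transactions can't UPDATE/DELETE the rows read above,
    --but they CAN still insert new rows that match the WHERE clause (phantoms)

    --second read: same rows as before, plus any phantoms inserted in the meantime
    SELECT * FROM dbo.Customers WHERE FirstName BETWEEN 'A' AND 'C';

COMMIT TRANSACTION;

--the other levels are set the same way:
--SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
--SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
--SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;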