Finding crap in the SQL Server database

Anyone who has maintained an old legacy code base has found themselves working with a decade-old database schema that now spans hundreds of tables, each of which contains dozens of columns. Trying to locate a particular column or table in the db can be particularly difficult, especially if the only clue you have is a vague inkling of what said table/column might be named. Luckily, SQL Server provides several system catalog views that hold metadata that can help you find the proverbial needle in the haystack.

We can query sys.tables and sys.columns like so:

--look for a particular table
SELECT t.name AS table_name
FROM sys.tables AS t
WHERE t.name LIKE '%needle%'

--look for a particular column
SELECT t.name AS table_name,
c.name AS column_name
FROM sys.tables AS t
JOIN sys.columns c
ON t.OBJECT_ID = c.OBJECT_ID
WHERE c.name LIKE '%needle%'
ORDER BY table_name;

The INFORMATION_SCHEMA.COLUMNS view lets you accomplish the same thing. It exposes much of the same information as sys.columns (though not the object_id), but it already contains the TABLE_NAME, so you don’t have to do a join. This is good if you are lazy and dislike typing more than necessary.

SELECT TABLE_NAME, COLUMN_NAME
FROM Information_Schema.Columns
WHERE COLUMN_NAME LIKE '%needle%'
ORDER BY TABLE_NAME;

These queries provide a good starting point for exploring the database and all the metadata available to you. Of course, if you’re too lazy to do any typing at all, you can simply open SQL Server Management Studio, click on “View” in the menu, and select Object Explorer Details. From there you can do a search. It’s important to note that this searches EVERYTHING: views, stored procedures, triggers, tables, columns, and everything else that is an object in SQL Server.

If you run SQL Server Profiler, you can see that it is indeed querying all the system catalog views, including sys.database_principals, sys.asymmetric_keys, and sys.all_objects.

Oftentimes this UI search will get the job done, although it can produce a lot of noise.
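
If you’d rather stay in a query window, a rough equivalent is to search sys.objects (or sys.all_objects, if you want system objects too) by name; the type_desc column tells you whether each match is a table, view, procedure, trigger, and so on. A minimal sketch:

--search every user object (tables, views, procedures, triggers, etc.) by name
SELECT o.name AS object_name,
       o.type_desc
FROM sys.objects AS o
WHERE o.name LIKE '%needle%'
ORDER BY o.type_desc, o.name;

Note that this only matches object names; to also catch column names you would still want the sys.columns query from above.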

TSQL PIVOT

The TSQL PIVOT operator essentially rotates a set of table data. The end result is a transform that converts rows into columns. One common usage of PIVOT is to turn generic key/value rows into a wide, column-per-attribute layout that is easier to read. This may seem rather abstract and confusing at first, so here’s a simple example.
ACME Corp has an optional end user survey that allows them to track customer satisfaction. The survey asks users to rate the company on a scale of 1-5 across a variety of criteria. This data is stored in the database as a set of key/value pairs:

CREATE TABLE SURVEY_RESULTS (
   RESULT_ID int PRIMARY KEY IDENTITY(1,1)
, [USER_ID] int
, QUESTION_ID int
, SURVEY_SCORE int
)

The SURVEY_RESULTS table has a foreign key, QUESTION_ID, that maps to the QUESTION_ID column in the SURVEY_QUESTIONS table. The SURVEY_QUESTIONS table stores the actual survey questions themselves (in a real-world scenario, SURVEY_QUESTION would store a localization id that points to some resource file instead of holding the actual text of the question):

CREATE TABLE SURVEY_QUESTIONS (
   QUESTION_ID int PRIMARY KEY
, SURVEY_QUESTION varchar(50)
)


ALTER TABLE SURVEY_RESULTS
	ADD CONSTRAINT fk_SURVEY_RESULTS_QUESTION_ID_SURVEY_QUESTIONS_QUESTION_ID FOREIGN KEY (QUESTION_ID)
REFERENCES SURVEY_QUESTIONS (QUESTION_ID)

INSERT 
  INTO SURVEY_QUESTIONS WITH (ROWLOCK)
       (QUESTION_ID, SURVEY_QUESTION)
VALUES (1, 'Rate our service')
      ,(2, 'Rate your customer satisfaction')
      ,(3, 'Rate our user interface')

INSERT
  INTO SURVEY_RESULTS WITH (ROWLOCK)
       ([USER_ID], QUESTION_ID, SURVEY_SCORE)
VALUES (1, 1, 5)
      ,(1, 2, 4)
      ,(1, 3, 5)
      ,(2, 1, 4)
      ,(2, 2, 3)
      ,(2, 3, 2)
      ,(3, 1, 3)
      ,(3, 2, 2)
      ,(3, 3, 1)

This design allows flexibility in adding and removing questions. Using a wide table with a column for each question would mean that every time the survey changed, entire columns would need to be added or removed. On the other hand, a wide table makes it much easier to visualize the survey results at a glance, so it would be helpful to create a VIEW on the SURVEY_RESULTS table that converts the key/value rows into columns.

SELECT * FROM SURVEY_RESULTS

Gives us the following, which is hard to interpret at a glance in its key/value form:

RESULT_ID   USER_ID     QUESTION_ID SURVEY_SCORE
----------- ----------- ----------- ------------
1           1           1           5
2           1           2           4
3           1           3           5
4           2           1           4
5           2           2           3
6           2           3           2
7           3           1           3
8           3           2           2
9           3           3           1 

What we’d like to see is one row for each user, with the score for each question as its own column in that row. This can be achieved using PIVOT (In the real world we’d create a view for easy at-a-glance access to this data):

SELECT  
USER_ID, [1] AS 'Service score', [2] AS 'Customer satisfaction', [3] AS 'User Interface'
FROM
(SELECT [USER_ID], QUESTION_ID, SURVEY_SCORE
    FROM SURVEY_RESULTS) AS InputTable
PIVOT
(
AVG(SURVEY_SCORE)
FOR QUESTION_ID IN ([1], [2], [3])
) AS PivotTable;

This gives us the desired result set:


USER_ID     Service score Customer satisfaction User Interface
----------- ------------- --------------------- --------------
1           5             4                     5
2           4             3                     2
3           3             2                     1
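
As mentioned above, in the real world we’d wrap this query in a view so the wide layout is always a plain SELECT away. A minimal sketch; the view name and the per-question column names below are placeholders I’m making up:

--CREATE VIEW must be the only statement in its batch
CREATE VIEW SURVEY_RESULTS_WIDE AS
SELECT [USER_ID],
       [1] AS SERVICE_SCORE,
       [2] AS CUSTOMER_SATISFACTION,
       [3] AS USER_INTERFACE
FROM
(SELECT [USER_ID], QUESTION_ID, SURVEY_SCORE
    FROM SURVEY_RESULTS) AS InputTable
PIVOT
(
AVG(SURVEY_SCORE)
FOR QUESTION_ID IN ([1], [2], [3])
) AS PivotTable;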

Let’s take a closer look at the PIVOT query itself. The first thing that we need to do is specify what data we are pivoting. This is done in the FROM clause. As you can see, this is just a simple table expression that selects three of the columns from the SURVEY_RESULTS table (RESULT_ID is deliberately left out; as we’ll see in a moment, any extra column would end up in the implicit grouping). For brevity’s sake, I will refer to this table expression by its alias, InputTable.

Now that we know what we are pivoting, we need to identify the pivot column and the set of values to pivot on. This is done in the FOR clause. In our example, we are pivoting on the QUESTION_ID column, and the set of values we are interested in are 1, 2, and 3, which correspond to the questions about service, customer satisfaction, and the user interface. For each value in this set, all rows in InputTable that have a matching value in the specified pivot column will be combined together based on an aggregation function. In our case, that is the average survey score.

The astute reader will point out that applying an aggregation function implies that there is some sort of grouping being done. There is indeed an implicit grouping based on any column appearing inside InputTable that is not directly referenced inside the PIVOT statement. In our case, USER_ID appears in InputTable but is not referenced inside the PIVOT statement, so the final result will be grouped by USER_ID. The query will generate a row for each user in the final result set, along with three columns corresponding to the average score for each question. These are referenced in the outer SELECT statement and are based on the set of values provided to PIVOT. In our case, these are [1], [2], and [3]. Of course, since our sample data only has one answer from a given user per question, the aggregation function doesn’t really matter.
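
To make the implicit grouping concrete, here’s a rough equivalent of the same query written with plain conditional aggregation; the GROUP BY that PIVOT keeps hidden is spelled out, and each CASE expression plays the part of one entry in the IN ([1], [2], [3]) list:

SELECT [USER_ID],
       AVG(CASE WHEN QUESTION_ID = 1 THEN SURVEY_SCORE END) AS 'Service score',
       AVG(CASE WHEN QUESTION_ID = 2 THEN SURVEY_SCORE END) AS 'Customer satisfaction',
       AVG(CASE WHEN QUESTION_ID = 3 THEN SURVEY_SCORE END) AS 'User Interface'
FROM SURVEY_RESULTS
GROUP BY [USER_ID];

Both versions return the same three rows.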

Let’s suppose we want to average the scores across all users. In that case, we simply modify the query slightly and remove USER_ID from InputTable (note that we also cast SURVEY_SCORE as a decimal, so the average does not get truncated to an integer). This causes the entire result set to be implicitly grouped, which is exactly what we want:

SELECT  
 [1] AS 'Service score', [2] AS 'Customer satisfaction', [3] AS 'User Interface'
FROM
(SELECT QUESTION_ID, CAST(SURVEY_SCORE AS DECIMAL) AS SURVEY_SCORE
    FROM SURVEY_RESULTS) AS InputTable
PIVOT
(
AVG(SURVEY_SCORE)
FOR QUESTION_ID IN ([1], [2], [3])
) AS PivotTable;

As expected, this query returns exactly one row with the average score for each question appearing in its own column:

 
Service score                           Customer satisfaction                   User Interface
--------------------------------------- --------------------------------------- ---------------------------------------
4.000000                                3.000000                                2.666666


And that covers the PIVOT statement in a nutshell. What about UNPIVOT? As you can probably guess, the UNPIVOT operator essentially rotates the data in the opposite direction: it takes a wide table and converts columns back into rows. Its syntax is similar, and its use is left as an exercise to the reader.
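
If you want a head start on that exercise, here’s a minimal sketch that runs against the SURVEY_RESULTS_WIDE view sketched earlier; it folds the per-question columns back into rows, although you get back the column names rather than the original QUESTION_ID values:

SELECT [USER_ID], QUESTION, SCORE
FROM SURVEY_RESULTS_WIDE
UNPIVOT
(
SCORE FOR QUESTION IN (SERVICE_SCORE, CUSTOMER_SATISFACTION, USER_INTERFACE)
) AS UnpivotTable;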

Database isolation levels

I remember a job interview where I was asked to rate my SQL Server experience on a scale of 1 to 10. Not wanting to put my foot in my mouth, I responded with a modest 6. So, the interviewer asked me what the different SQL Server 2008 isolation levels were. Yikes! That was not an easy question. Luckily, it just so happened that I had recently been reading High Performance MySQL, and even though it is an entirely different database, the concept of isolation levels remains mostly the same. I was able to wing it and get the job.

Although this question seems like it is just a “vocabulary” quiz that tests somebody’s ability to memorize, it can open up a good conversation and test a candidate’s understanding of databases, transactions, and concurrency. After the interview, I decided to brush up. After all, I don’t deal much with isolation levels on a daily basis at work. The Wikipedia article is a great place to start, but being Wikipedia, it is not the easiest to understand. Here’s my attempt to dumb it down.

A database isolation level essentially dictates what data different transactions running concurrently on a database can see. For example, let’s say you have one transaction A running that does a SELECT * FROM dbo.Customers (never mind the fact that this is a horrible query to run). Let’s say transaction B is running simultaneously, and inserts a new customer into the database. Neither transaction has committed yet. Does transaction A return the newly inserted customer from transaction B? Well, the answer depends on the isolation level. Here are the ones defined in the ANSI SQL standard:

  • Serializable – All transactions are isolated from each other. The system appears to behave as if all transactions execute serially. This doesn’t necessarily mean that they run one after the other; if two transactions don’t impact one another, some database systems run them concurrently. In our example, transaction A and transaction B cannot execute at the same time. So if A runs first, it does not see the insert. If B runs (and commits) first, A sees the insert.

    Needless to say, such a draconian isolation level is not good for performance. Isolation levels such as this one would probably need to be used in financial applications such as ATMs that require information to be as up to date and accurate as possible. When dealing with money, you don’t want race conditions.

  • Repeatable read – Consider a transaction that runs the same query multiple times. In the repeatable read isolation level, the query will return the same result every time. Essentially, this means that a read lock is acquired for every row touched by the transaction; other transactions can’t update or delete those rows until the current transaction is done. Ironically enough, despite being called the “repeatable read” isolation level, it can’t quite guarantee the same result set every time.

    This isolation level is susceptible to a phenomenon known as phantom reads. Let’s say that the query being executed in transaction A is “SELECT * FROM dbo.Customers WHERE FirstName BETWEEN 'A' AND 'C'”. Transaction B inserts a customer with a first name of “Bob” and commits. Now transaction A runs the same query. The newly inserted “Bob” row shows up. The reason for this is that no range locks are acquired: although the individual rows themselves are locked, the range in the WHERE clause is not. If another transaction inserts, updates, or deletes a row that happens to satisfy the WHERE clause, the first transaction sees it. This is the main difference between the serializable and repeatable read isolation levels. With serializable, you never get phantom reads. But with repeatable read, you get slightly more flexibility and better performance.

  • Read committed – This is typically the default isolation level of most databases. It does exactly what it sounds like: any given transaction can only see committed changes from other transactions. Write locks are held until the transaction commits, but read locks are released immediately (these read locks are acquired to ensure only committed data is read). Needless to say, nonrepeatable reads can occur, but the results will always be “valid”, since they contain the latest rows from committed transactions. This is a good default level, as it prevents dirty reads and lost updates (two transactions updating the same row simultaneously), while not stifling performance.
  • Read uncommitted – Transactions can read uncommitted changes from other transactions. Everything is a dirty read. This is great for performance, as it results in far fewer locks, and works well for things where race conditions don’t really matter, like tweets or Facebook updates. For example, let’s say John decides to break up with Mary on some social networking website that implements stalker feeds. Transaction A inserts this into the database. Mary happens to log on at the same time, and transaction B selects this event from the database and displays John’s updated status of being single on Mary’s home page. For whatever reason, transaction A fails and rolls back. Well, Mary sees the breakup, and not believing her eyes, refreshes her home page and sees that the update is gone. It’s not the end of the world; she’ll get news of the breakup sooner or later, and move on to someone better 🙂

To sum up, the higher the isolation level, the more locking overhead and the greater the risk of deadlocks. The lower the level, the better the performance, but the higher the risk of reading “invalid” data. Typically, the default level is good enough.
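
If you want to poke at these levels yourself, they are set per session with SET TRANSACTION ISOLATION LEVEL. Here’s a minimal sketch of a dirty read using the SURVEY_RESULTS table from the PIVOT example and two separate query windows; it assumes the default locking flavor of read committed (READ_COMMITTED_SNAPSHOT off):

-- Session 1: start a transaction, change a score, but do not commit yet
BEGIN TRANSACTION;
UPDATE SURVEY_RESULTS
   SET SURVEY_SCORE = 1
 WHERE RESULT_ID = 1;
-- ...leave this transaction open for now...

-- Session 2 (separate window): read uncommitted sees the dirty, uncommitted value
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT SURVEY_SCORE FROM SURVEY_RESULTS WHERE RESULT_ID = 1;  --returns 1

-- Session 2: read committed blocks until session 1 commits or rolls back
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT SURVEY_SCORE FROM SURVEY_RESULTS WHERE RESULT_ID = 1;  --waits

-- Session 1: throw the change away; the value session 2 just saw never officially existed
ROLLBACK TRANSACTION;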