So you have started running queries in SQL Server. You are becoming a master at joining tables and getting the data you want. But have you given any thought to where/how all this information is stored? Sure, its housed in tables on your server’s data drive. Yet, knowing exactly how this data is stored will help you better understand why running

SELECT {every field}
FROM SomeTable

takes just as long as

SELECT {single char field}
FROM SomeTable

And also knowing how to set up the right indexes can tremendously speed up queries.

Configure your storage

Data in SQL Server is stored on objects called pages. You can think of a page as equivalent to a physical piece of paper. Each of these pages is 8kb in size and holds all data within your table as well as some miscellaneous meta data about the pages and the data contained therein. SQL also groups every 8 pages into an extent; which means an extent is 64kb. SQL Server will read/write an extent at a time when it needs to get data or save data to disk. However, Microsoft’s default Bytes per Cluster (often referred to as Allocation Unit Size) is only 4k. This is great when you may have smaller files to save, but for SQL Server its best to match with what it’s actually working in — 64kb (MS Whitepaper Here). If you leave it at 4k, that means it will have to write 16 times for each time it wants to write out an extent (64 ÷ 4). While 64k is Microsoft’s recommended best practice, your mileage may vary depending on your actual storage subsystem. It is best to check your manufacturer’s documentation and also test using a benchmark tool like SQLIO.

To see if your storage subsystem is set up correctly:

Open Command Prompt in Admin
Run the following (replacing E with the drive in question):
FSUTIL FSINFO NTFSINFO E:
Review the value of Bytes per Cluster

If the value is not correct, you will need to reformat the drive with the correct value. This is sad trombones if you already have data on there. But if you are lucky enough to have another storage location to migrate the mdf/ldf files to you can then format this one, and then migrate back and format the other.

Let’s play pretend

Lets imagine we have an Employee table with the fields FirstName (CHAR(20)), LastName (CHAR(20)), Address (which is street, city, state, zip concatenated) (CHAR(100)), Gender (M/F) (CHAR(1)), and BirthDate (DATE) fields. (Yes, my sharp-eyed friend, I have the datatypes as fixed length rather than variable. In this case I am doing it to illustrate a point. You would definitely not want these to be fixed length in a real table.) This means each row will be 145 bytes. And if you recall earlier each page is 8kb. This means we can house 55 rows or employees per page. Finally for this example let’s say the company has 60,000 employees. This means the table will fit in 1091 pages.

Now export this all out nice in columns and rows in Excel and print it out. Go on. I’ll wait.

Ok now that you have it all printed in nice columns and rows (and have mourned over the loss of all those trees) let’s start querying this data. Give me back all data for all employees.

SELECT FirstName, LastName, Address, Gender, BirthDate
FROM dbo.Employees

You pick up all 1,091 pages and start reading all the columns on all the pages. This would take a very long time obviously — let’s assume for a point of comparison that it takes 1 full week of continuous reading.

Now that you have completed that task I give you a short 15 minute break and once you are back I tell you I have another query for you. I ask for a list of every LastName.

SELECT LastName
FROM dbo.Employees

You think to yourself, oh good, instead of 145 characters I only need to read 20 characters per employee. I can just scan down the LastName column in this print out and read it off. However, I explain to you that you have to read the entire page, but only say the LastName out loud to me. This is how SQL works. It still needs to read the entire page no matter how much data it needs on that page. Also, even though for this illustrative purpose we have all our data set up in nice and neat columns and rows, SQL actually stores all the data back to back crunched up together in one long string.

Grumbling you start to read through the whole table but only reading the LastName out loud. Again, this takes you 1 week to accomplish.

What if I just want to see all employees with the last name Johnson?

SELECT LastName, FirstName, Gender
FROM dbo.Employees
WHERE LastName = ‘Johnson’

Nope, still have to go through the whole 1091 pages to find all the Johnsons because the data is not stored in any particular order. Another week down the drain.

After you are done I can see you are fuming so I give you a full hour lunch break. When you get back, lets try another query. This time give me back just the FirstName and Gender of everyone born before 1/1/1952 so we know who will be eligible to retire this year. Luckily, you know that this company is mostly millennials so the population ready to retire will be less than 100.

SELECT FirstName, Gender
FROM dbo.Employees
WHERE BirthDate < ’01-Jan-1952′

You begin to scan down through the BirthDate field looking for older birthdates when you hear me clear my throat. You realize you this isn’t how SQL works either. So you once again begin to read every field of every page and only saying the FirstName and Gender for those that meet the criteria. Another week? Yep you got it.

Ok, lets try one more that has to be easier and quicker… I only want the 5 oldest employees’ first names and genders that are older than that near retirement date.

SELECT TOP 5 FirstName, Gender
FROM dbo.Employees
WHERE BirthDate < ’01-Jan-1952′
ORDER BY BirthDate ASC

Surely this will be easier, just find the first 5 and stop. Nope, because you need the 5 OLDEST, so you still need to go through the whole entire list. And to make it even worse you have to then order the rows by BirthDate and then find the 5 oldest. This would likely take even longer than 1 week. By the time you are done you are beyond exhaustion and bury me in the 1,091 pages for all the pain I’ve caused you over the last month and you promptly run for the hills before I ask for another request.

Indexes: The quicker picker upper

After finally digging my way out of all that paper and bandaging up my paper cuts, I decide to give you a break. How can we find all the Johnsons in our company? How can we get this near retirement data more easily? With indexes! There are two types of indexes: Clustered indexes and NonClustered indexes. And there are a couple great analogies to our printed paper example that tie into both types.

Clustered Indexes

When you printed out the whole table, it was printed in no particular order. That is because, by default, SQL does not gaurantee any order when it saves or returns results. To return the data in a specific order you can use the ORDER BY clause. To save it in a specific order you use a clustered index. In our example of printing off this list of employees, think of a physical object made of paper in your everyday life that is ordered by name. Right! A phonebook is ordered by last name and then first name to make it easier to find individuals. So let’s order our table, put it in Excel, print it out again (all 1091 pages, but this time in order of LastName then FirstName) and recycle the original mess you buried me in. (Yes, a clustered index like this is not advisable, but rather just used to correlate a non-tangible table to something in real life).

CREATE CLUSTERED INDEX cl_LastName_FirstName
ON dbo.Employees (LastName, FirstName)

You can think of a clustered index as a phone book, only even more efficient. With a phone book, you know where to go to find a last name because you know your alphabet and can kind of guess where about in the book that first letter will be and then adjust accordingly until you find the name you are looking for. With indexes they are sorted in what are called B-Trees (short for balanced search tree). You can picture these as upside down trees (or a pyramid) with all the leaves at the bottom in one massive row/line. Each leaf is a page from your print out. In other words, the leaf is where the actual data is stored.

So you line up all 1091 pieces of paper, side by side in portait mode, across about 2.5 football fields. This is the leaf level of our clustered index which again holds all our data. If you remember above it also contains some metadata about each page. This data includes the page number which will be valuable next.

In our index, above the leaf level is the intermediate layer. This layer contains some more metadata, some miscellaneous data, a pointer to the previous and next page for that layer, a page number, and the last clustered indexes’ value for that page. So in our example, lets say our first page on the leaf level is all names from Andrew Anderson to Kelly Ball. Our second page is Michael Ball to Chris Connors. So the first row of index data in our intermediate page would be “page 1 Kelly Ball” and the second row would be “page 2 Chris Connors”. Because we have a rather wide index (LastName, FirstName – 40 bytes) we can maybe fit 200 leaf page references on an intermediate page. Don’t forget, SQL works only in pages, which are 8k in size. This even goes for all the different parts of an index. You can see why it would be best practice to pick as small of a indexed column as possible. For instance, if we had picked an employee number (INT) we could fit about 2,000 references to a page. I say maybe/about because there is more information per index page than just the references and even the reference to the page takes up space.

With 200 leaf page references per intermediate page, we can fit all of our references on 6 pages (the last page would only be half full). In our example, we would only have one more layer. This final layer would hold one reference to each page in our intermediate layer (so 6 index rows). The final layer is always only one page and is called the root layer. SQL builds as many intermediate layers as it needs until there is only 1 root page.

Levels of a clustered index

(Image Source MSDN)

Now how is this more effecient than a phone book? Well, imagine my example above of looking for all Johnsons. You would have to guess where J is, and may hit N instead. Then you back up and accidentally hit G. Then try again and get I. Then finally get J but you are on Jude. You back up to Jackson. Then slowly go to Jetson, Jiou, Jobs, and finally Johnson but you are at the end of the Johnsons at Zebadiah and have to back up to the first one.

With a B-Tree you first go to the Root page and begin to read the index rows to see which index row references the intermediate page that ends just after Johnson. Lets say the row for reference Page 2 is Bob Garrison and the row for Page 3 is Steve Noon. You know Johnson has to be somewhere on Page 3 so you go to intermediate Page 3. Then you begin to scan through the index rows on intermediate Page 3 and find Page 467 has Clyde Jobs and Page 468 has Ben Johnson. You know therefore that leaf page 468 is where the Johnsons start. So you pull up 468 and scan through the page until you find the first reference to a Johnson and start sending me those results. If you get to the end of the page, you look at the pointer to the next page and continue reading until you no longer have a Johnson as the last name.

SELECT LastName, FirstName, Gender
FROM dbo.Employees
WHERE LastName = ‘Johnson’

Using the B-Tree it only took 2 pages read to figure out where the Johnsons were. Gone are the days of taking a week to scan through all 1091 pages to find this information. You can now find it in no time at all! But it is still not completely efficient. You are still reading all the other information from the leaf page that you do not need (Address and BirthDate). However, we have improved performance so much (now only minutes to get all Johnsons instead of a week) that you are happy with this improvement… for now.

You can quickly see the power and need for clustered indexes. Since clustered indexes actually sort the order of the table itself you can only ever have one clustered index per table. So make sure you pick something that makes sense for what this table is used for.

Nonclustered Indexes

However, you often need to get data based on something more than just the column you have the clustered index based on. In our example above to get those employees who are near retirement, we want to find anyone born before Jan 1 1952. It would be great if you had some way of knowing where in this long list of alphabetically ordered employees all those individuals are, right?

In real life, our analog to a nonclustered index is the index in the back of a text book. In school, if you need to figure out where in your textbook some topic is you could scan through the whole book or just go to the index in the back and find that the topic is on pages 5, 75, 348, and 780. A nonclustered index works the same way. So let’s create a nonclustered index on the BirthDate and print it out.

CREATE NONCLUSTERED INDEX ncl_BirthDate
ON dbo.Employees (BirthDate)

So where does this reside on our 2.5 football field length memorial to many fallen trees? It is actually separate and not tied directly to that object at all. It can be stored on the same file or a separate file in SQL. This also means you can have as many nonclustered indexes as you want (although it is not advisable to add more than you need due to the overhead of maintaining the indexes).

Nonclustered indexes are also built like a Clustered index with B-Trees, except the leaf layer does not contain the actual table data — It only contains index information. This includes references to the specific pages where the actual table data is located and some other index details (which we’ll discuss further in a bit). In our example, the reference to the row data would be LastName, FirstName. If we were looking for the Gender of the people born on 05/13/1949 we could traverse the B-Tree to find the index pages that has 05/13/1949 and see all the LastName, FirstName’s it belongs to and then we’d go to our clustered index to find the rows that have that LastName, FirstName combination to get the Genders.

You can see why it is not recommended to have a wide clustered index. Not only does it make the root/intermediate pages of the clustered index large, but also every nonclustered index as well. The index leaf pages in our example above would probably only fit 180 rows per page (44 bytes – 4 for BirthDate, 20 for LastName, 20 for FirstName). If we had picked a better index like INT we could fit 1k rows on a page (8 bytes – 4 for BirthDate, 4 for INT field).

So we go to the small park next to our two football fields and lay out about 350 pages for our index leaf pages and above that our root page.

SELECT FirstName, Gender
FROM dbo.Employees
WHERE BirthDate < ’01-Jan-1952′

Now when I ask you to give me the near retirement first names and genders you find the root page shows you need to go to the first index leaf. You start to read from the top down until you get to Jan 1 1952 and stop. The leaf page only contains the BirthDates and row references (LastName, FirstName). Unfortunately, you realize I also asked for the Gender for each employee so you have to take the references and look up those rows over in the clustered index (the main table on the 2.5 football fields). This takes a while but you are able to accomplish this much faster than the week it took to look them all up last time.

Luckily, you are only looking up 100 out of 60,000 employees. Imagine if the date range returned 10,000 employees? You might as well have read through the full 60,000 clustered index rather than find the 10,000 employees in the index, jot down the reference information, and then go look up those 10,000 employees that are scattered all throughout the two football fields. SQL does the same thing, you may find it just ignores the nonclustered index in favor of reading the clustered index if it is less costly.

SELECT TOP 5 FirstName, Gender
FROM dbo.Employees
WHERE BirthDate < ’01-Jan-1952′
ORDER BY BirthDate ASC

What about the TOP 5 request I gave you? You would do the same as you did above with going to the nonclustered index, but instead of jotting down all 100 employees to then look up their Gender you only jot down the first 5. This is true even if the result would have been 10,000! If the WHERE criteria (not including the TOP 5 clause) would have returned 10,000, SQL would likely not use the clustered index. I would likely use the nonclustered index instead since it could more easily get the oldest 5 birthdates and then look up the references to get the Gender.

With all these changes you are now extremely happy to be getting these reports for me all the time compared to the pain you went through before. However, after a month of doing these reports, even with the indexes, you start to loathe reading all that extra information (Address, Gender, Birthdate) when looking up the Johnsons and looking up the Gender in the clustered index for the near retirement employees. I can see the hatred filling your eyes one Monday morning and realize I better help you fix this as my paper cuts from your last explosion are just now finally starting to heal.

I’m givin ya all she’s got, cap’n!

I quickly run to the ice cream parlor and pick you up your favorite cone to calm you down and ask what the problem is. You tell me you are starting to hate reading the Address and BirthDate from the Johnsons report. There is no need for you to read more than twice the data you actually use. So you suggest creating an index that has all the fields you need in it:

CREATE NONCLUSTERED INDEX ncl_LastName_FirstName_Gender
ON dbo.Employees (LastName, FirstName, Gender)

I smile and ask if anything else is bothering you. You tell me you are getting tired of having to lookup the Gender for the near retirement employees and ask to alter our current index to include Gender:

CREATE NONCLUSTERED INDEX ncl_BirthDate_Gender
ON dbo.Employees (BirthDate, Gender)

I listen quietly until you are done ranting and then tell you that is a good first attempt. However, have you thought about what would happen to your indexes with this many columns indexed? You are making it wider, which means you’ll get fewer rows per page. Also you’ll have more to maintain should values change.

Instead of the two above how about the following:

CREATE NONCLUSTERED INDEX ncl_LastName_FirstName
ON dbo.Employees (LastName, FirstName)
INCLUDE (Gender)

CREATE NONCLUSTERED INDEX ncl_BirthDate
ON dbo.Employees (BirthDate)
INCLUDE (Gender)

The index root and intermediate pages remain small and only as wide as you need them for the criteria of the queries. But now you have included in the leaf layer additional index details. Gender only resides on the leaf level of the index. This means the index is covering for our particular queries and has all the data you need in order to provide the report. You no longer need to to read the Address or Birthdates to get just the LastName, FirstName, and Gender of the Johnsons. Likewise, you no longer need to go look up the Gender for the near retirement employees. Your life is blissful.

That is until about 2/3rds of the year rolls around and you realize you have been running around getting the same employees this whole time for near retirement and many of them have already retired. We go in and add a new field for Active which is a CHAR(1) field and update it accordingly. We now update the query

SELECT FirstName, Gender
FROM dbo.Employees
WHERE BirthDate < ’01-Jan-1952′
AND Active = ‘Y’

I ask for the report and you realize you still have to read through all 100 employees you have since the first part of the year but just not return to me all the ones who have retired. Before your eyes burn a hole through me, I quickly show you an update to your index:

CREATE NONCLUSTERED INDEX ncl_BirthDate
ON dbo.Employees (BirthDate)
INCLUDE (Gender)
WHERE Active = ‘Y’

Now you only have records in your index of those that are active. You still have all the BirthDates, but you could even go further and change the WHERE clause in the index to include only BirthDate before Jan 1 1952. However, then you would need to change this every year. You could also do the same with the Johnsons, but if we want to look at the Smiths that index would no longer be viable.

Conclusion

Making sure your storage is configured properly before you begin to use it is one of the quickest and easiest performance gains you can make.

Just like the Grail Knight said, be sure you “choose wisely, for while the true Grail will bring you life, the false Grail will take it from you.” The correctly chosen index will be a tremendous performance gain; a poorly chosen index will bring your performance to a grinding halt.

SQLCwby – Chris Woods

SQL Server Insights with a down-home feel

I want to be a SQL Engine when I grow up

Configure your storage

Let’s play pretend

Indexes: The quicker picker upper

Clustered Indexes

Nonclustered Indexes

I’m givin ya all she’s got, cap’n!

Conclusion

1 thought on “I want to be a SQL Engine when I grow up”

Leave a comment Cancel reply

Configure your storage

Let’s play pretend

Indexes: The quicker picker upper

Clustered Indexes

Nonclustered Indexes

I’m givin ya all she’s got, cap’n!

Conclusion

Share this:

Related

1 thought on “I want to be a SQL Engine when I grow up”

Leave a comment Cancel reply