Why Your SQL Query Runs Backwards (And Why It Matters)

“Order and simplification are the first steps toward mastery of a subject — the actual enemy is the unknown.”

Thomas Mann

As a new SQL developer, do you remember being frustrated that you couldn't use the alias of a COUNT(*) in your WHERE predicate, but you could in the ORDER BY? Or have you been confused about when you should put something in the ON clause vs. the WHERE clause vs. the HAVING clause? This article will clear all of that up, and it all comes down to how SQL Server processes your query.

Why SQL Isn’t Like Other Languages

Most developers are used to writing code top to bottom and expecting the computer to execute it in the same order. Unlike those procedural languages, however, SQL is a declarative language. In a procedural language, you give the computer a specific set of directions telling it how to do each task. In a declarative language, you tell the computer what you want returned, and it determines how to get it (using the Query Optimizer). This often confuses first-time SQL developers, especially when they believe SQL Server is doing all the tasks in the query from the top down. In fact, it may even surprise many seasoned SQL developers to learn that SQL Server actually starts in the middle of your query.


*Many thanks to Itzik Ben-Gan and his books for helping solidify this in my brain several years ago! I highly recommend reading them if you want to go even deeper into this.*


TL;DR:
The T-SQL Logical Processing Order:

  • FROM
  • JOIN
  • ON
  • WHERE
  • GROUP BY
  • HAVING
  • SELECT
  • DISTINCT
  • ORDER BY
  • TOP

Let's take this query as an example for our discussion:

SELECT TOP 50 e.EmployeeName Name, DATEDIFF(Year, e.StartDate, GETDATE()) YearsEmployed, SUM(s.SalesAmt) Total_SalesAmt
FROM dbo.EmpTable e
LEFT JOIN dbo.SalesTable s
ON e.EmpID = s.EmpID
AND s.Type = 'Software'
WHERE e.Region = 'North'
GROUP BY e.EmployeeName, DATEDIFF(Year, e.StartDate, GETDATE())
HAVING SUM(s.SalesAmt) >= 1000
ORDER BY YearsEmployed DESC, SUM(s.SalesAmt) DESC, e.EmployeeName

Most people write SQL from the top down, starting with SELECT (maybe leaving it as a * until you decide which fields you want), then moving on to FROM, the JOIN(s), a WHERE clause, a GROUP BY, the HAVING, and finally the ORDER BY.

The Logical Order vs. The Written Order

However, SQL Server actually reads/processes the query in this logical order: FROM, JOIN(s), ON, WHERE, GROUP BY, HAVING, SELECT, DISTINCT, ORDER BY, and finally TOP.

It is important to note that we are talking about the logical processing order here. This is how SQL Server reads your query to make sense of it. It then passes that information along to have a query plan created; the plan determines the physical processing order, which is how SQL Server will actually gather the data from the tables and return the results.

A Step-by-Step Walkthrough of Query Processing

SQL Server starts with the raw tables themselves and filters down the rows before it ever gets to the specific columns you want returned. This has a significant impact on what you can put in each section of the query.

Phase 1: FROM – Gathering the Raw Data

In the first phase, SQL Server pulls back all rows and all columns from the table in the FROM clause. Note: it does not limit to any specific rows (because at this point it does not know any predicate information), nor does it limit to any specific columns (because at this point it has not seen the columns you want returned in your SELECT clause).

(1) FROM dbo.EmpTable e

Below you see what we'll call ResultSet1 from Phase 1 (the … represents all the other fields in this table; you'll see the same for the next several phases). The FROM clause is one of the two mandatory clauses that every query must have.

Phase 2 & 3: JOIN and ON – Combining and Filtering Tables

The second phase does your join(s). At this point SQL Server performs a Cartesian join between ResultSet1 from above and the table in the join (in this case dbo.SalesTable). It is a Cartesian join (every row from ResultSet1 matched to every row from your table) because SQL Server has not read any of the ON criteria yet.
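
In other words, at this point in the logical model our query is equivalent to a pure CROSS JOIN of the two tables. A quick sketch of that logical equivalence (the optimizer will almost never physically build this full product):

SELECT *
FROM dbo.EmpTable e
CROSS JOIN dbo.SalesTable s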

As a side note: This phase also includes APPLY, PIVOT, and UNPIVOT.

(1) FROM dbo.EmpTable e
(2) LEFT JOIN dbo.SalesTable s

Let's say below is the first few rows of dbo.SalesTable:

Below is what the ResultSet2 would look like after Phase 2 with the cartesian join:

Next in Phase 3, we would apply the ON clause of the join. This would filter our cartesian join based on whatever criteria we may have in this clause.

(1) FROM dbo.EmpTable e
(2) LEFT JOIN dbo.SalesTable s
(3) ON e.EmpID = s.EmpID
AND s.Type = 'Software'

In our example, this removes the "extra rows" of the Cartesian join (where the EmpIDs don't match). It also keeps only the rows where the Type is Software. As you can see below, this means we keep both SalesTable rows for Fred since they are Software sales, but for Bob and Gloria we see nothing from SalesTable. We do still see their data from EmpTable, though, since we have a LEFT JOIN.

At this point, SQL Server would repeat Phases 2 and 3 for each table you are joining on. In our example, though, we have no more tables.

Phase 4: WHERE – Applying the First Filters

Now we move on to Phase 4, which is the WHERE criteria. This is where many developers get confused. With an INNER JOIN, the ON clause and the WHERE clause can be used interchangeably. With a LEFT or RIGHT JOIN, however, your WHERE clause can actually change it into an INNER JOIN. For instance, if you were to move the s.Type = 'Software' criteria to the WHERE clause, that would logically turn the query into an INNER JOIN.
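
To see this in action, here are two hedged sketches using our example tables; the only difference is where the Type criteria lives:

-- Keeps Bob: unmatched employees still return, with NULLs in the SalesTable columns
SELECT e.EmployeeName, s.SalesAmt
FROM dbo.EmpTable e
LEFT JOIN dbo.SalesTable s
ON e.EmpID = s.EmpID
AND s.Type = 'Software'

-- Drops Bob: his unmatched row has a NULL s.Type, and NULL = 'Software' is not true,
-- so the WHERE clause quietly filters him out, just like an INNER JOIN would
SELECT e.EmployeeName, s.SalesAmt
FROM dbo.EmpTable e
LEFT JOIN dbo.SalesTable s
ON e.EmpID = s.EmpID
WHERE s.Type = 'Software'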

In our example, we will lose Gloria because she was in the South region.

(1) FROM dbo.EmpTable e
(2) LEFT JOIN dbo.SalesTable s
(3) ON e.EmpID = s.EmpID
AND s.Type = 'Software'
(4) WHERE e.Region = 'North'

Slight detour: Why Can’t I Use a SELECT Alias in my WHERE Clause?

This is also a good place to stop and answer one of the more confusing things for new SQL developers: why can't I use an alias from my SELECT clause in the WHERE predicate? The phases so far show you why. All SQL Server has seen while logically processing your query to this point is the FROM, JOIN, ON, and WHERE clauses. It has not even looked at SELECT yet. This is why your query will fail if you try to use your column aliases in the WHERE predicate.
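
Here is a minimal sketch of the problem and two common workarounds, using our EmpTable:

-- Fails: WHERE (Phase 4) is processed long before SELECT (Phase 7),
-- so the alias YearsEmployed does not exist yet
SELECT DATEDIFF(Year, e.StartDate, GETDATE()) YearsEmployed
FROM dbo.EmpTable e
WHERE YearsEmployed >= 5

-- Works: repeat the expression in the WHERE clause
SELECT DATEDIFF(Year, e.StartDate, GETDATE()) YearsEmployed
FROM dbo.EmpTable e
WHERE DATEDIFF(Year, e.StartDate, GETDATE()) >= 5

-- Also works: a derived table, whose inner SELECT is fully processed first
SELECT x.YearsEmployed
FROM (
    SELECT DATEDIFF(Year, e.StartDate, GETDATE()) YearsEmployed
    FROM dbo.EmpTable e
) x
WHERE x.YearsEmployed >= 5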

Phase 5: GROUP BY – Aggregating the Data

Next, in Phase 5, we bring in the GROUP BY clause. This is where it gets a little weird, so stick with me. SQL Server takes the result set from Phase 4 and tacks onto it all the fields you are grouping by. But it still keeps the row-level data you have seen so far, including all the other columns (because SQL Server has not yet looked at which fields are needed by any of the other clauses, so it still needs to keep every field under consideration).

(1) FROM dbo.EmpTable e
(2) LEFT JOIN dbo.SalesTable s
(3) ON e.EmpID = s.EmpID
AND s.Type = 'Software'
(4) WHERE e.Region = 'North'
(5) GROUP BY e.EmployeeName, DATEDIFF(Year, e.StartDate, GETDATE())

Notice how the two grouping fields at the start of the rows for Fred Smith span the two "raw" data rows following them. This is starting to show how we will group values together in the final results. It is also starting to tell the subsequent phases what will and will not be allowed. For instance, it is telling SQL Server that you cannot put Region in the SELECT clause, because we are not grouping by it (unless you do an aggregation on Region like MIN/MAX, of course). All the other non-grouped fields will still be eligible (which is why this phase's result set still includes them), but only for certain parts of those phases.

This is another good place to stop and show how DISTINCT and GROUP BY are, for all intents and purposes, the same: you will often get the same execution plan if you group by the exact fields you would otherwise use the DISTINCT keyword on. The difference is in the intent. GROUP BY is meant to allow you to do some sort of aggregation in the next phases, while DISTINCT is meant to remove duplicates.
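
For example, these two queries ask for the same thing and will often compile to the same plan:

SELECT DISTINCT e.EmployeeName, e.Region
FROM dbo.EmpTable e

SELECT e.EmployeeName, e.Region
FROM dbo.EmpTable e
GROUP BY e.EmployeeName, e.Region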

Phase 6: HAVING – Filtering the Groups

We are over halfway done now with Phase 6, which is where SQL Server looks at the HAVING clause. This is typically where you put any aggregate predicates, but you can still include raw field-level criteria if you want or need to. Just remember that at this point you have already performed your GROUP BY, so any criteria you put in the HAVING clause will operate at that grouped level.

(1) FROM dbo.EmpTable e
(2) LEFT JOIN dbo.SalesTable s
(3) ON e.EmpID = s.EmpID
AND s.Type = 'Software'
(4) WHERE e.Region = 'North'
(5) GROUP BY e.EmployeeName, DATEDIFF(Year, e.StartDate, GETDATE())
(6) HAVING SUM(s.SalesAmt) >= 1000

In our example, Bob's record drops off because he has no sales info. This is obviously a very simplified example, but you can see how there could still be many other records at this point. Note that we do not see any column holding that summed value of 1100 for Fred. That is because SQL Server does not necessarily need it: you can have fields/calculations in your HAVING that never actually make it into what is presented to the user.

Phase 7: SELECT – Choosing the Final Columns

Now we finally get to what many new SQL developers think SQL Server considers first: Phase 7 brings in the SELECT clause. This is where SQL Server finally gets rid of any extra fields that are not needed.

The SELECT clause is where you can give any fields or calculations an alias. That is also why you can finally use those aliases in the ORDER BY but not in any earlier clause: before this phase, SQL Server has not seen the alias. Now it has, and you can use it in the subsequent phases.

As a side note, while our example does not include them, this is also where any window functions (ROW_NUMBER(), LEAD(), LAG(), etc.) are evaluated. This also explains why you can't include them in your WHERE clause: they haven't even been read yet at that point.
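
The standard workaround is to evaluate the window function in a CTE or derived table, whose SELECT phase completes first, and then filter on it in the outer query. A minimal sketch against our EmpTable:

WITH Ranked AS (
    SELECT e.EmployeeName,
           ROW_NUMBER() OVER (ORDER BY e.StartDate) AS rn
    FROM dbo.EmpTable e
)
SELECT EmployeeName
FROM Ranked
WHERE rn = 1 -- legal here: rn already exists by the time this WHERE is processed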

(7) SELECT TOP 50 e.EmployeeName Name, DATEDIFF(Year, e.StartDate, GETDATE()) YearsEmployed, SUM(s.SalesAmt) Total_SalesAmt
(1) FROM dbo.EmpTable e
(2) LEFT JOIN dbo.SalesTable s
(3) ON e.EmpID = s.EmpID
AND s.Type = 'Software'
(4) WHERE e.Region = 'North'
(5) GROUP BY e.EmployeeName, DATEDIFF(Year, e.StartDate, GETDATE())
(6) HAVING SUM(s.SalesAmt) >= 1000

Note above: the TOP 50 part of the query has no phase number yet, because it is part of another phase still to come.

Phase 8: DISTINCT – Removing Duplicates

We are almost done with the clauses as we come to Phase 8: the DISTINCT clause. Our example does not have a DISTINCT, but this is where duplicates are removed based on all the fields in your SELECT clause. So suppose Phase 7 had returned something like the result below, with two Jane Does who both started the same year and both had the same total sales amount:

If we had a DISTINCT on this, it would return only one row:

As an important aside: do not use DISTINCT as a quick fix to remove duplicates. If you do not expect duplicates, investigate why you are getting them. There may be something wrong with your joins causing a partial Cartesian join somewhere, duplicating rows by mistake. DISTINCT is very resource intensive, especially if there are a ton of rows to compare for duplicates. You are much better off fixing the query that is causing the duplicates.
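
One quick way to investigate instead of masking the problem is to count how many rows each join key produces; a sketch using our SalesTable:

-- If any EmpID shows up more times than you expect, your join is multiplying rows
SELECT s.EmpID, COUNT(*) AS RowsPerEmpID
FROM dbo.SalesTable s
GROUP BY s.EmpID
HAVING COUNT(*) > 1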

Phase 9: ORDER BY – Sorting the Results

We move back down to the bottom of the query in Phase 9 with the ORDER BY clause. SQL Server never guarantees the order of your results, even for a simple SELECT * FROM dbo.SomeTable where SomeTable has a clustered index. The only way to guarantee your results come back in a specific order is to use an ORDER BY clause.

(7) SELECT TOP 50 e.EmployeeName Name, DATEDIFF(Year, e.StartDate, GETDATE()) YearsEmployed, SUM(s.SalesAmt) Total_SalesAmt
(1) FROM dbo.EmpTable e
(2) LEFT JOIN dbo.SalesTable s
(3) ON e.EmpID = s.EmpID
AND s.Type = 'Software'
(4) WHERE e.Region = 'North'
(5) GROUP BY e.EmployeeName, DATEDIFF(Year, e.StartDate, GETDATE())
(6) HAVING SUM(s.SalesAmt) >= 1000
(9) ORDER BY YearsEmployed DESC, SUM(s.SalesAmt) DESC, e.EmployeeName

As you can see above, you can use the alias (YearsEmployed), the actual calculation (SUM(s.SalesAmt)), or the real field name even if it is aliased (e.EmployeeName). In fact, you could even use the ordinal position of the field in your SELECT statement (e.g., ORDER BY 2). WARNING: anyone having to review your query will likely be cursing your name in the future! Plus, if you change the order of your SELECT clause later, you will need to remember to change the ordinal value as well.

Phase 10: TOP – Limiting the Final Rows

Finally, we are at the last one, Phase 10: the TOP clause. This is what limits your query results to a maximum number of records. You can even add the keyword PERCENT after the number to get the top x percent of the results (e.g., TOP 10 PERCENT). If you have the result set ordered, as we do in this query, you'll get that top x number/percent back in that same order. If you did not use an ORDER BY clause, you will simply get the first x number/percent of records back.

(7 and 10) SELECT TOP 50 e.EmployeeName Name, DATEDIFF(Year, e.StartDate, GETDATE()) YearsEmployed, SUM(s.SalesAmt) Total_SalesAmt
(1) FROM dbo.EmpTable e
(2) LEFT JOIN dbo.SalesTable s
(3) ON e.EmpID = s.EmpID
AND s.Type = 'Software'
(4) WHERE e.Region = 'North'
(5) GROUP BY e.EmployeeName, DATEDIFF(Year, e.StartDate, GETDATE())
(6) HAVING SUM(s.SalesAmt) >= 1000
(9) ORDER BY YearsEmployed DESC, SUM(s.SalesAmt) DESC, e.EmployeeName
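
And for completeness, a small sketch of the PERCENT variation (swapping the TOP 50 in our query for a percentage):

SELECT TOP (10) PERCENT e.EmployeeName
FROM dbo.EmpTable e
ORDER BY e.EmployeeName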

Logical vs. Physical Processing: What’s Next?

Remember, this is just how SQL Server will logically process this query; this is not how it will physically process it. That will be determined by the query optimizer and will involve any number of operators depending on how SQL Server thinks it will best be able to get you the data.


SELECT (Transact-SQL) – Microsoft Learn: If you want to read the official word from the source, the Microsoft documentation for the SELECT statement covers this logical processing order in detail.

Itzik Ben-Gan's Simple Talk Articles: As I mentioned, Itzik Ben-Gan's books were a huge help for me. You can find more of his articles and resources over at his Simple Talk author page.


Before reading this, did you realize that SELECT happens so late in the process (Phase 7)? How does this change how you view your aliases?

An Underutilized Secret Weapon for Faster Queries: The SQL Filtered Index

“Efficiency is doing things right; effectiveness is doing the right things.”

Peter Drucker

First, a Quick Refresher: “Normal” Indexes

When storing and retrieving data in MS SQL Server, you have two options: clustered and nonclustered indexes. A clustered index is your table stored in a specific order (think of the old-timey phone books). A nonclustered index is typically a smaller subset of the table's fields, usually ordered on a different field than your clustered index (think of the index in the back of a textbook). By default, though, both kinds of index contain one record for each row in your table.
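
For instance (the index and column names here are just illustrative):

-- The clustered index: the table's rows themselves, ordered by EmpID
CREATE CLUSTERED INDEX CIX_EmpTable_EmpID
ON dbo.EmpTable (EmpID)

-- A nonclustered index: a smaller, separate structure ordered by EmployeeName
CREATE NONCLUSTERED INDEX IX_EmpTable_EmployeeName
ON dbo.EmpTable (EmployeeName)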

Continue reading

Fill Factor: The Hidden Culprit Behind Your Database’s Slowdown?

“An ounce of prevention is worth a pound of cure.”

Benjamin Franklin

Summary

Fill Factor is a database setting that determines the percentage of space to fill on each leaf-level page of an index. It directly impacts disk I/O, RAM utilization, and internal fragmentation, which makes changing this setting a decision not to be taken lightly if you want optimal SQL Server performance.

If you have a Fill Factor of 100 (or 0), each page is filled completely; once it is full, a new page is created. But if you have a fill factor of 60, SQL Server fills each page only to 60% when the index is built or rebuilt, leaving the rest of the page free for records added later.
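
Fill Factor is set per index when you create or rebuild it. A minimal sketch (the index and table names are hypothetical):

-- Rebuild the index, leaving 40% free space on each leaf-level page
ALTER INDEX IX_EmpTable_EmployeeName ON dbo.EmpTable
REBUILD WITH (FILLFACTOR = 60)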

Understanding how Fill Factor works is crucial to the health of your SQL Server. Too high and you may get a lot of unnecessary page splits. Too low and SQL Server wastes resources and takes longer to get you the data you want. It also needs to be considered alongside your index maintenance planning to properly handle internal fragmentation.

To Adjust Fill Factor or Not To Adjust?

Many newbie DBAs will automatically adjust the default fill factor to something less than 100 on all indexes to avoid page splits. This overreaction can actually cause more problems than it solves. Let's talk about the specifics of what those DBAs are trying to avoid when they do this…

If you remember the Defragmentation post from last week, you'll recall I talked about adding a new record into the middle of the table. This causes a page split if the page you are adding the record to is full. Each page split adds extra overhead: a new page is created, and half of the old page is moved to the new page. On top of that, you now have two pages that are half empty. Many page splits like this lead to excessive internal fragmentation, which is the worst type of fragmentation. Read the article above if you want to learn why. For the rest of this article, make sure you are thinking of INTERNAL, not external, fragmentation.

If you have an index on a table that only ever adds records at the end, like a log table, you do not need to adjust the fill factor. It never adds new records in the middle of the index, so it will never cause a page split; it is always adding new pages at the end of the table/index.

What is wrong with non-default Fill Factor values?

How Fill Factor impacts disk I/O: If you recall from the I want to be a SQL Engine when I grow up article, SQL Server stores (reads and writes) data in 8KB pages. Regardless of how full a page is, it still reads the full 8KB. Empty space just spreads your data across more pages, causing more reads and longer runtimes if the data isn't already in memory. For example, data that completely fills 600 pages at a fill factor of 100 will be spread across 1,000 pages at a fill factor of 60.

How Fill Factor impacts RAM utilization: Even if the data is already in memory, a non-default Fill Factor still causes problems. SQL Server not only reads and writes in 8KB pages, it also holds the entire 8KB page in RAM. This means that if your fill factor is 60%, you are wasting 40% of your RAM on empty space. In Standard Edition, which only allows 128GB of RAM, this is a huge waste of resources.

What should be our initial baseline?

The default setting for your databases and all indexes should be 100. If you find you are having issues, you can adjust from there. But you really need to understand your data. You need to know the patterns of the inserts, and you need to be aware of how much internal fragmentation you are seeing in your indexes. You should not be wildly adjusting this fine-tuning control without a good process in place.

When/If You Should Adjust

Newbie DBAs often misread the page splits perfmon counter. They see high counts and quickly adjust Fill Factor without any further investigation, hoping the adjustment will reduce page splits. The problem with relying on the page split counter alone is that it reflects more than one action. In perfmon, a page split is counted for more than just inserting data into the middle of a full page. Many DBAs believe the counter increases only when a page is split and half its data is moved toward the end of the index; in fact, it also increases during normal page additions. In our example above of a typical log table/index, a new page is added whenever the last page of the index fills up, and that completely blank page is recorded as a page split in perfmon! Really, this perfmon counter should just be called the "New Page" counter.
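
You can watch the raw counter from T-SQL as well as perfmon; a sketch using the performance counter DMV:

-- Cumulative count since instance startup; remember this includes
-- ordinary "new page at the end of the index" additions, too
SELECT cntr_value AS PageSplits
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page Splits/sec'
AND object_name LIKE '%Access Methods%'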

Ideally, you have create date and modified date columns on your tables. You can use these to find the cause of your page split counts: find out whether new records were added into the middle of your index or at the end.

How to Adjust Fill Factor properly?

  1. Review your indexes and find the ones with non-default fill factors. Change these to 100 and rebuild them, but only if you are sure there wasn't a valid reason for the current non-default value. Be careful here: if you do this, keep a close eye on internal fragmentation, and be ready to move through the remaining steps swiftly to add some of that empty space back.
  2. For any indexes already at the default fill factor of 100, do a rebuild to consolidate any internal fragmentation.
  3. Now your indexes will all be at a good starting point. Record today’s date as the start of your experiment.
  4. Every day (or as often as your workload warrants), pull a report of internal fragmentation (see the example query after this list).
  5. If the internal fragmentation is significant, then a) reduce the fill factor by 5 points and b) rebuild the index so the "clock" resets. Record in your log that you lowered the fill factor on this day; this will help you see how long it takes to become significant again.
  6. Repeat steps 4 and 5 until a “good amount” of time has gone by.
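
Here is one way to pull the internal fragmentation report referenced in step 4 (a sketch; SAMPLED mode is used because the default LIMITED mode leaves avg_page_space_used_in_percent NULL):

-- How full are the leaf pages of each index in the current database?
SELECT OBJECT_NAME(ips.object_id) AS TableName,
       i.name AS IndexName,
       ips.avg_page_space_used_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') ips
JOIN sys.indexes i
    ON i.object_id = ips.object_id
    AND i.index_id = ips.index_id
WHERE ips.page_count > 100 -- skip tiny indexes where fragmentation is just noise
ORDER BY ips.avg_page_space_used_in_percent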

How long is a good amount of time? It all depends on your workload and environment. If your data is inserted into the middle of your index, you will never get rid of internal fragmentation completely. You need to decide at what point lowering the Fill Factor further is more detrimental to your disk I/O and RAM utilization than doing a rebuild. My opinion is that you should aim for at least a week between rebuilds: that keeps internal fragmentation in check without rebuilding constantly to fix it. However, your mileage may vary.

Finally, review this plan on a recurring basis! You may find over time that you need to reduce the fill factor further, or that you can rebuild the index less often. You may even find you can increase the fill factor, saving you resources.

Conclusion

Remember, Fill Factor is not a "set it and forget it" setting. Monitor internal fragmentation on all your indexes regularly, and check it before you rebuild indexes if you do that on a recurring schedule. It is also a precision adjustment tool: do not make sweeping changes to all indexes, or even huge changes to a single index. Your mileage may vary; you need to look at your specific workload, because there is no good "one size fits all" suggestion.

What has been your experience with adjusting Fill Factor in your own SQL Server environments, and what results did you observe?

Order up!

Happy Birthday! What’d I get you for your birthday? Ummm I got…. a fun experiment. Yeah yeah, let’s do a thought experiment, shall we? Close your eyes… no really go ahead and close them, I’ll watch your cake so no one takes it… I promise.

Continue reading

Too many chefs

Development has come a long way over the last several decades. One of the biggest advancements is source control. But it's not just for developers.

Continue reading

What’s the best part of free?

One of the best parts of the SQL Server community is the abundance of people willing to help others. I don't know how many times I've posted on Slack, Stack Exchange, or other sites asking for assistance and received a prompt, well-informed reply. There are so many scripts and tools out there, written by some of the best of our brethren, that you would be surprised they are given away for free!

Continue reading

The Interesting Case of NULLs

NULL values can sometimes confuse SQL developers when they first start to write queries. In this article, we'll review some of the nuances of NULL values that you should be aware of in your queries.

Continue reading

I want to be a SQL Engine when I grow up

So you have started running queries in SQL Server. You are becoming a master at joining tables and getting the data you want. But have you given any thought to where and how all this information is stored? Sure, it's housed in tables on your server's data drive. Yet knowing exactly how this data is stored will help you better understand why running

SELECT {every field}
FROM SomeTable

takes just as long as

SELECT {single char field}
FROM SomeTable

It will also show you how setting up the right indexes can tremendously speed up your queries.

Continue reading