
Why Your SQL Query Runs Backwards (And Why It Matters)

“Order and simplification are the first steps toward mastery of a subject — the actual enemy is the unknown.”

Thomas Mann

As a new SQL developer, do you remember being frustrated that you couldn’t use the alias of a COUNT(*) in the predicate/WHERE clause, but you could in the ORDER BY? Or have you been confused about when you should put something in the ON clause vs. the WHERE clause vs. the HAVING clause? This article will help clear all of that up, and it all comes down to how SQL Server processes your query.

Why SQL Isn’t Like Other Languages

Most developers are used to writing code top to bottom and expecting the computer to execute it in the same order. Unlike those procedural languages, however, SQL is a declarative language. In a procedural language, you give the computer a specific set of directions telling it how to do a task. In a declarative language, you tell the computer what you want returned, and it determines how to produce it (using the Query Optimizer). This often confuses first-time SQL developers, especially when they believe SQL Server is working through the query from the top down. In fact, it may even surprise many seasoned SQL developers to learn that SQL Server actually starts in the middle of your query.


*Many thanks to Itzik Ben-Gan and his books for helping solidify this in my brain several years ago! I highly recommend reading them if you want to go even deeper into this.*


TL;DR:
The T-SQL Logical Processing Order:

  1. FROM
  2. JOIN
  3. ON
  4. WHERE
  5. GROUP BY
  6. HAVING
  7. SELECT
  8. DISTINCT
  9. ORDER BY
  10. TOP

Let’s take this query as an example for our discussion:

SELECT TOP 50 e.EmployeeName Name, DATEDIFF(Year, e.StartDate, GETDATE()) YearsEmployed, SUM(s.SalesAmt) Total_SalesAmt
FROM dbo.EmpTable e
LEFT JOIN dbo.SalesTable s
ON e.EmpID = s.EmpID
AND s.Type = 'Software'
WHERE e.Region = 'North'
GROUP BY e.EmployeeName, DATEDIFF(Year, e.StartDate, GETDATE())
HAVING SUM(s.SalesAmt) >= 1000
ORDER BY YearsEmployed DESC, SUM(s.SalesAmt) DESC, e.EmployeeName

Most people write SQL from the top down: starting with SELECT (maybe leaving it as * until you decide what fields to put in there), moving on to FROM, the JOIN(s), a WHERE clause, a GROUP BY, the HAVING, and finally the ORDER BY.

The Logical Order vs. The Written Order

However, SQL Server actually reads/processes the query in this logical order: FROM, JOIN(s), ON, WHERE, GROUP BY, HAVING, SELECT, DISTINCT, ORDER BY, and finally TOP.

It is important to note that we are talking about the logical processing order here. This is how SQL Server reads your query to make sense of it. It then passes that information along to have a query plan created; the plan represents the physical processing order, which is how SQL Server will actually gather the data from the tables and return the results.

A Step-by-Step Walkthrough of Query Processing

SQL Server starts with the raw tables themselves and filters down the rows before it ever gets to the specific columns you want returned. This has a significant impact on how and what you can put in the different sections of your query.

Phase 1: FROM – Gathering the Raw Data

In the first phase, SQL Server pulls back all rows and all columns from the table in the FROM clause. Note: it does not limit to any specific rows (because at this point it has not read any predicate information), nor does it limit to any specific columns (because at this point it has not read the columns you want returned in your SELECT clause).

(1) FROM dbo.EmpTable e

Below you see what we’ll call ResultSet1 from Phase 1 (the … represents all the other fields in this table; you’ll see the same for the next several phases). The FROM clause is one of the two mandatory clauses that every query must have.
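
If you want to follow along and build these result sets yourself, here is a hypothetical setup. All IDs, last names, dates, and amounts are invented to match the prose (Fred has two Software sales totaling 1,100, Bob has no sales, and Gloria is in the South region):

CREATE TABLE dbo.EmpTable (EmpID INT, EmployeeName VARCHAR(50), Region VARCHAR(20), StartDate DATE);
CREATE TABLE dbo.SalesTable (SaleID INT, EmpID INT, Type VARCHAR(20), SalesAmt DECIMAL(10,2));

INSERT INTO dbo.EmpTable VALUES
    (1, 'Fred Smith', 'North', '2015-03-01'),
    (2, 'Bob Jones', 'North', '2018-07-15'),
    (3, 'Gloria Lee', 'South', '2012-01-10');

INSERT INTO dbo.SalesTable VALUES
    (1, 1, 'Software', 600.00),  -- Fred's two Software rows sum to 1,100
    (2, 1, 'Software', 500.00),
    (3, 3, 'Software', 900.00);  -- Gloria's row, later removed by the WHERE clause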

Phases 2 & 3: JOIN and ON – Combining and Filtering Tables

The second phase performs your join(s). At this point SQL Server does a Cartesian join between ResultSet1 from above and the table(s) in your join (in this case dbo.SalesTable). It is a Cartesian join (every row from ResultSet1 is matched to every row from your table) because SQL Server has not read any of the ON criteria yet.

As a side note: This phase also includes APPLY, PIVOT, and UNPIVOT.

(1) FROM dbo.EmpTable e
(2) LEFT JOIN dbo.SalesTable s

Let’s say below are the first few rows of dbo.SalesTable:

Below is what ResultSet2 would look like after Phase 2 with the Cartesian join:

Next, in Phase 3, we apply the ON clause of the join. This filters our Cartesian join based on whatever criteria we have in that clause.

(1) FROM dbo.EmpTable e
(2) LEFT JOIN dbo.SalesTable s
(3) ON e.EmpID = s.EmpID
AND s.Type = 'Software'

In our example, this removes the “extra” rows of the Cartesian join (where the EmpIDs don’t match). It also keeps only rows where the type is Software. As you can see below, this means we keep both SalesTable rows for Fred since they are Software sales, while for Bob and Gloria we see nothing from SalesTable. We do still see their data from EmpTable, though, since we have a LEFT JOIN.

At this point, SQL Server would repeat Phases 2 and 3 for each table you are joining on. In our example, though, we have no more tables.

Phase 4: WHERE – Applying the First Filters

Now we move on to Phase 4, which is the WHERE criteria. This is where many developers get confused. With an INNER JOIN, the ON clause and the WHERE clause can be used interchangeably. With a LEFT or RIGHT JOIN, however, your WHERE clause can effectively turn it into an INNER JOIN. For instance, if you were to move the s.Type = 'Software' filter to the WHERE clause, the rows that failed the join would have NULL in s.Type, so the filter would discard them and logically make the query an INNER JOIN (see the sketch after the query below).

In our example, we will lose Gloria because she was in the South region.

(1) FROM dbo.EmpTable e
(2) LEFT JOIN dbo.SalesTable s
(3) ON e.EmpID = s.EmpID
AND s.Type = 'Software'
(4) WHERE e.Region = 'North'
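
To see the difference, here is a sketch of the same query fragment with the Type filter moved into the WHERE clause. NULL = 'Software' is never true, so the outer rows the LEFT JOIN preserved get filtered out, just as an INNER JOIN would have done:

FROM dbo.EmpTable e
LEFT JOIN dbo.SalesTable s
ON e.EmpID = s.EmpID
WHERE e.Region = 'North'
AND s.Type = 'Software'  -- discards the NULL-extended rows the LEFT JOIN kept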

Slight detour: Why Can’t I Use a SELECT Alias in my WHERE Clause?

This is a good place to stop and answer one of the more confusing things for new SQL developers: why can’t I use an alias from my SELECT clause in the WHERE predicate? The phases so far show you why. All SQL Server has seen while logically processing your query is the FROM, JOIN, ON, and WHERE clauses; it has not even looked at SELECT yet. This is why your query fails if you try to put a column alias in the WHERE predicate.
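
Here is a minimal sketch of the failure and the standard workaround, using our example table:

-- Fails with "Invalid column name 'YearsEmployed'":
SELECT e.EmployeeName, DATEDIFF(Year, e.StartDate, GETDATE()) YearsEmployed
FROM dbo.EmpTable e
WHERE YearsEmployed >= 5;

-- Works: repeat the expression (or wrap the query in a derived table or CTE):
SELECT e.EmployeeName, DATEDIFF(Year, e.StartDate, GETDATE()) YearsEmployed
FROM dbo.EmpTable e
WHERE DATEDIFF(Year, e.StartDate, GETDATE()) >= 5;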

Phase 5: GROUP BY – Aggregating the Data

Next, Phase 5 brings in the GROUP BY clause. This is where it gets kind of weird, so stick with me. SQL Server takes the result set from Phase 4 and tacks onto it all the fields you are grouping by. But it still keeps the row-level data you have seen so far, including all other columns (because SQL Server has not yet looked at which fields are needed by any of the other clauses, so it still needs to keep every field under consideration).

(1) FROM dbo.EmpTable e
(2) LEFT JOIN dbo.SalesTable s
(3) ON e.EmpID = s.EmpID
AND s.Type = 'Software'
(4) WHERE e.Region = 'North'
(5) GROUP BY e.EmployeeName, DATEDIFF(Year, e.StartDate, GETDATE())

Notice how the two grouping fields at the start of Fred Smith’s rows span the two “raw” data rows following them. This is starting to show how we will group values together in the final results. It is also starting to tell the subsequent phases what will and will not be allowed. For instance, it tells SQL Server that you cannot put Region in the SELECT clause, because we are not grouping by it (unless you do an aggregation on Region like MIN/MAX, of course). All the other non-grouped fields will still be eligible (which is why this phase’s result set still includes them), but only for certain parts of those phases.

This is another good place to stop and show how a DISTINCT and a GROUP BY are, for all intents and purposes, the same: you will often get the same execution plan if you group by all the same fields you are applying the DISTINCT keyword to. The difference is in the intent. GROUP BY is meant to allow you to do some sort of aggregation when you get to the next phases, while DISTINCT is meant to remove any duplicates.
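
For example, these two queries against our example table ask for the same thing and will typically produce the same execution plan:

SELECT DISTINCT e.Region
FROM dbo.EmpTable e;

SELECT e.Region
FROM dbo.EmpTable e
GROUP BY e.Region;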

Phase 6: HAVING – Filtering the Groups

We are over halfway done now with Phase 6, which is where SQL Server looks at the HAVING clause. This is typically where you put any aggregate predicates, though you can also filter on the grouped fields here if you want or need to. Just remember, at this point the GROUP BY has already been performed, so any criteria you put in the HAVING clause operates at that group level.

(1) FROM dbo.EmpTable e
(2) LEFT JOIN dbo.SalesTable s
(3) ON e.EmpID = s.EmpID
AND s.Type = 'Software'
(4) WHERE e.Region = 'North'
(5) GROUP BY e.EmployeeName, DATEDIFF(Year, e.StartDate, GETDATE())
(6) HAVING SUM(s.SalesAmt) >= 1000

In our example, Bob’s record drops off because he has no sales info. This is obviously a very simplified example, but you can see how there could still be many other records at this point. Note that we do not see any column with that summed value of 1100 for Fred. That is because SQL Server does not necessarily need it: you can have fields or calculations in your HAVING clause that never actually get presented to the user.

Phase 7: SELECT – Choosing the Final Columns

Now we finally get to what many new SQL developers think SQL Server considers first: Phase 7 brings in the SELECT clause. This is where SQL Server finally gets rid of any extra fields that are not needed.

The SELECT clause is where you can give any fields or calculations an alias. That is why you can finally use aliases in the ORDER BY but not in any earlier clause: before this phase, SQL Server had not seen the alias yet. Now it has, so you can use the alias in the subsequent phases.

As a side note, while our example does not include them, this is also where any window functions (ROW_NUMBER(), LEAD(), LAG(), etc.) are evaluated. That also explains why you can’t use them in your WHERE clause: they have not even been read yet at that point.
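
If you do need to filter on a window function, the usual workaround is to evaluate it one level down, in a CTE or derived table; a quick sketch:

WITH LongestTenured AS (
    SELECT e.EmployeeName,
           ROW_NUMBER() OVER (ORDER BY e.StartDate) RowNum
    FROM dbo.EmpTable e
)
SELECT EmployeeName
FROM LongestTenured
WHERE RowNum <= 10;  -- legal here: the window function was evaluated in the CTE's SELECT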

(7) SELECT TOP 50 e.EmployeeName Name, DATEDIFF(Year, e.StartDate, GETDATE()) YearsEmployed, SUM(s.SalesAmt) Total_SalesAmt
(1) FROM dbo.EmpTable e
(2) LEFT JOIN dbo.SalesTable s
(3) ON e.EmpID = s.EmpID
AND s.Type = 'Software'
(4) WHERE e.Region = 'North'
(5) GROUP BY e.EmployeeName, DATEDIFF(Year, e.StartDate, GETDATE())
(6) HAVING SUM(s.SalesAmt) >= 1000

Note above: although TOP 50 appears on the SELECT line, it is not marked as part of Phase 7, because it belongs to another phase yet to come.

Phase 8: DISTINCT – Removing Duplicates

We are almost done with the clauses as we come up to Phase 8: the DISTINCT clause. In our example we do not have a DISTINCT, but this is where duplicates are removed based on all fields in your SELECT clause. So suppose Phase 7 produced something like the result below, where there were two Jane Does, both of whom started the same year and had the same total sales amount:

If we had a DISTINCT on this, it would return only one row:

An important aside here: do not use DISTINCT as a quick fix to remove duplicates. If you do not expect duplicates, investigate why they appear. There may be something wrong with your joins causing a bit of a Cartesian join somewhere, which duplicates rows by mistake. DISTINCT is very resource intensive, especially if you have a ton of rows to compare for duplicates. You are much better off fixing the query that is causing the duplicates.

Phase 9: ORDER BY – Sorting the Results

We move back down to the bottom of the query in Phase 9 with the ORDER BY clause. SQL Server never guarantees the order of your results without it, even for a simple SELECT * FROM dbo.SomeTable where SomeTable has a clustered index. The only way to guarantee your results come back in a specific order is to use an ORDER BY clause.

(7) SELECT TOP 50 e.EmployeeName Name, DATEDIFF(Year, e.StartDate, GETDATE()) YearsEmployed, SUM(s.SalesAmt) Total_SalesAmt
(1) FROM dbo.EmpTable e
(2) LEFT JOIN dbo.SalesTable s
(3) ON e.EmpID = s.EmpID
AND s.Type = 'Software'
(4) WHERE e.Region = 'North'
(5) GROUP BY e.EmployeeName, DATEDIFF(Year, e.StartDate, GETDATE())
(6) HAVING SUM(s.SalesAmt) >= 1000
(9) ORDER BY YearsEmployed DESC, SUM(s.SalesAmt) DESC, e.EmployeeName

As you can see above, you can use the alias (YearsEmployed), the actual calculation (SUM(s.SalesAmt)), or the real field name even if it is aliased (e.EmployeeName). In fact, you could even use the ordinal position of the field in your SELECT statement (e.g., ORDER BY 2). WARNING: anyone having to review your query is likely going to be cursing your name in the future! Plus, if you change the order of your SELECT clause later, you will need to remember to change the ordinal value as well.

Phase 10: TOP – Limiting the Final Rows

Finally, we are at the last one, Phase 10: the TOP clause. This is what limits your query results to a maximum number of records. You can even add the keyword PERCENT after the number to get the top x percent of the results (e.g., TOP 10 PERCENT). If you have the result set ordered, as we do in this query, you get that top x number/percent back in that order. If you did not use an ORDER BY clause, you simply get the first x number/percent of records back.

(7 and 10) SELECT TOP 50 e.EmployeeName Name, DATEDIFF(Year, e.StartDate, GETDATE()) YearsEmployed, SUM(s.SalesAmt) Total_SalesAmt
(1) FROM dbo.EmpTable e
(2) LEFT JOIN dbo.SalesTable s
(3) ON e.EmpID = s.EmpID
AND s.Type = 'Software'
(4) WHERE e.Region = 'North'
(5) GROUP BY e.EmployeeName, DATEDIFF(Year, e.StartDate, GETDATE())
(6) HAVING SUM(s.SalesAmt) >= 1000
(9) ORDER BY YearsEmployed DESC, SUM(s.SalesAmt) DESC, e.EmployeeName

Logical vs. Physical Processing: What’s Next?

Remember, this is just how SQL Server will logically process this query; it is not how it will physically process it. That is determined by the Query Optimizer and can involve any number of operators, depending on how SQL Server thinks it can best get you the data.


SELECT (Transact-SQL) – Microsoft Learn: If you want to read the official word from the source, the Microsoft documentation for the SELECT statement covers this logical processing order in detail.

Itzik Ben-Gan’s Simple Talk Articles: As I mentioned, Itzik Ben-Gan’s books were a huge help for me. You can find more of his articles and resources over at his Simple Talk author page.


Before reading this, did you realize that SELECT happens so late in the process (Phase 7)? How does this change how you view your aliases?

The TOP (N) Trick: Short Circuit Your SQL Server Counts for Massive Speed Gains

“It is not enough to be busy… The question is: What are we busy about?”

Henry David Thoreau (Poet and Philosopher)

Have you ever had to write a query to check some threshold count of records and then do some action based on that?

For instance, let’s say your reporting website shows a red indicator when you have more than 10,000 orders of a specific category type in the last 30 days. Otherwise, that indicator is green. You don’t care if there is only one more than your limit or one million more.

The Obvious (and Slow) Solution: SELECT COUNT(*)

The obvious first answer is to do a full count of everything that matches that criteria:

SELECT COUNT(*) Cnt
FROM Sales.Orders
WHERE CategoryID = 5
  AND CreateDt >= GETDATE()-30;

Let’s say this gives us 156,000 records, which is well above the threshold of 10,000. But look at the cost: it produced 12,080 reads and took 380 ms (which you can see using SET STATISTICS IO ON and SET STATISTICS TIME ON). And this is a query that is already using an index and whose stats are up to date.
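
If you want to reproduce these measurements yourself (your numbers will vary by environment), turn on the statistics before running the query and check the Messages tab:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- ...run the query here...
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;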

This query is so slow because it must count every single matching row, even long after it has counted past our 10,000 record threshold. But how can we short circuit this counting and stop at our threshold?

The Fast Solution: Short-Circuiting with TOP (N)

I’m sure you have all used SELECT TOP (N) at some point to just get a sample of what a table looks like. This same logic can help us short circuit our counting. What if we just asked SQL for the top 10,000 records and then counted that result?

We can do this with an inline subquery:

SELECT COUNT(*) Cnt
FROM (
    SELECT TOP (10000) ID
    FROM Sales.Orders
    WHERE CategoryID = 5
      AND CreateDt >= GETDATE()-30
    ) AS r;

This query does only 54 reads (0.4% of the original) and completes in 12 ms (3% of the original).

It works because the inner query (SELECT TOP (10000)) stops as soon as it finds 10,000 matching records. The outer query then just counts the results of the inner query. If the count is 10,000, we know we have at least that many. If it’s less (for example 8,500), we know we’re under the threshold.
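
You can even fold the threshold check into the query itself; here is a sketch building on the same example that returns the indicator color directly:

SELECT CASE WHEN COUNT(*) >= 10000 THEN 'Red' ELSE 'Green' END AS Indicator
FROM (
    SELECT TOP (10000) ID
    FROM Sales.Orders
    WHERE CategoryID = 5
      AND CreateDt >= GETDATE()-30
    ) AS r;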

The Takeaway

So don’t forget: if you don’t really care about the total count, but just that it’s above some limit, don’t waste resources and time doing a full COUNT(*). Simply put a short circuit into your query with TOP (N) so it stops counting at your given threshold!


Do you know of any other methods in SQL Server to solve this “at least N rows” problem? Share them in the comments!

Is SQL Defragmentation Still Relevant Today?

Most DBAs have heard the term fragmentation. But do you really know what it means? Do you know there are two types of fragmentation? And in today’s world is it still an issue?

External Fragmentation

SQL Server stores data in what are called pages; each page is 8 KB in size. Let’s say we have a Users table that is 100 pages in size, with a clustered index on LastName, and every page is completely full of our application’s users. What happens when a new user named Joe Cool is added? If there is no space for Joe Cool on page 7 of our table, SQL Server splits page 7 into two parts: it moves the second part of that page to a new page 101, and then inserts Joe Cool on page 7.

Imagine this happening for a few thousand more users, each one potentially causing another page split. In the end, let’s say we have 150 pages, and all of our pages are now out of order. This is what we call External Fragmentation. In the golden days of databases, when we had spinning disks, this was a huge issue: those rust buckets read data much faster when it was sequential. With modern SSD storage, though, this is almost never a problem anymore, since an SSD does random reads quickly and can read the data from anywhere.

Internal Fragmentation

However, as you can imagine from our example above, there may be a lot of empty space on those split pages, depending on how they were filled with the new users. Also, imagine we needed to drop a bunch of those users out of our system; we would have even more empty space in our table. This is what we call Internal Fragmentation: empty space (gaps) in our pages.

Internal Fragmentation is going to be much more of a problem than External Fragmentation. Imagine printing off our list of users just as it exists in the table on disk. Those printed pages would have a lot of wasted white space wherever there are gaps from the splits and/or the deletes. Before SQL Server can do anything with the data, it needs the pages in memory, which means reading them from disk. That read takes more time than it otherwise would, because the data is spread across more pages than necessary with lots of empty space. And then even worse: once a page is in memory, the entire page is in memory, including the “white space”. As you can now see, this is a huge waste of not only disk IO but also precious RAM.
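
You can measure both kinds of fragmentation with sys.dm_db_index_physical_stats; here is a sketch against our hypothetical dbo.Users table (SAMPLED mode is used because LIMITED does not populate the page-fullness column):

SELECT ips.index_id,
       ips.avg_fragmentation_in_percent,    -- External Fragmentation (out-of-order pages)
       ips.avg_page_space_used_in_percent,  -- page fullness; low values mean Internal Fragmentation
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Users'), NULL, NULL, 'SAMPLED') ips;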

The Bad Fix

Many DBAs react to performance complaints about a query by checking External Fragmentation. When they see a high fragmentation percentage, they conclude they need to rebuild the index. Once they do, performance is suddenly better, so they pat themselves on the back and say, “Job well done!” But what they don’t realize is that most likely it wasn’t the actual defragmentation that solved the problem.

When a rebuild occurs, three things happen:

  1. The actual rebuild
  2. A full stats update
  3. Clearing any plans from cache that involved that index

Most likely what fixed your issue was #2 or #3.

The Good Fix

So it’s much better to look at and fix those two first:

Full stats update: Look to see if the index has out-of-date stats. If so, update the stats and check the query again. Stats updates are much less resource intensive than an index rebuild.
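
A sketch of how you might check stats age and then update them, again using the hypothetical dbo.Users table:

-- When were the stats last updated, and how many rows have changed since?
SELECT s.name, sp.last_updated, sp.rows, sp.modification_counter
FROM sys.stats s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
WHERE s.object_id = OBJECT_ID('dbo.Users');

-- Update all stats on the table with a full scan:
UPDATE STATISTICS dbo.Users WITH FULLSCAN;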

Plan cache clearing: The rebuild may have solved your problem by evicting a bad query plan. Instead of rebuilding, check the query plan in cache and see if it makes sense for what you are trying to do. If this was your issue, most likely you have a parameter sniffing problem that you need to look into.
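
One way to find and inspect the cached plan, sketched with the standard plan-cache DMVs:

SELECT TOP (10) qs.execution_count,
       qs.total_logical_reads,
       st.text,       -- the query text
       qp.query_plan  -- the cached plan XML
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
ORDER BY qs.total_logical_reads DESC;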

Internal Fragmentation: If neither of the above has solved your problem, then look at Internal Fragmentation. If you find a lot of it, go ahead and rebuild the index. You should also check your fill factor and make sure it’s at the default, to decrease the chances of this happening again. (I plan to have a future post about how to adjust fill factor if you believe it needs to be adjusted, so be sure to subscribe so you won’t miss it when it comes out!)
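
If a rebuild is warranted, a quick sketch (the index name IX_Users_LastName is hypothetical):

-- FILLFACTOR = 0 (the default) packs pages completely full:
ALTER INDEX IX_Users_LastName ON dbo.Users REBUILD;

-- Or specify a fill factor explicitly if you have a reason to leave free space on each page:
ALTER INDEX IX_Users_LastName ON dbo.Users REBUILD WITH (FILLFACTOR = 90);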

Obviously, if you are still having issues after this point, it’s time to dig deeper into the query plan and examine how you can improve it using good old query tuning techniques.

Have you found other ‘bad fixes’ that DBAs commonly rely on, and what’s your ‘good fix’ alternative?