How do SQL Indexes Work
Summary
TLDR: In this educational video, Venkat explains the functionality of SQL indexes, focusing on clustered and non-clustered types. He uses an Employees table example to illustrate how a clustered index on the primary key speeds up query performance by organizing data in a sorted, tree-like structure. Venkat demonstrates the inefficiency of searching without an index and shows how creating a non-clustered index on the 'Name' column improves search efficiency. The video includes a practical SQL script example and execution plan analysis, highlighting the significant performance impact of using indexes.
Takeaways
- 🔍 Indexes are crucial for improving SQL query performance by allowing the database engine to quickly locate data.
- 🌐 The video covers two types of indexes, clustered and non-clustered, which serve different purposes in data retrieval.
- 📚 A clustered index sorts and physically stores data rows in a tree-like structure based on the index key.
- 📈 The video works through a practical example: an Employees table with EmployeeId as the primary key, which by default creates a clustered index.
- 🔑 The root and intermediate nodes of a clustered index contain index rows, each holding a key value and a pointer to a lower-level page (an intermediate page or a leaf node).
- 📊 A non-clustered index, on the other hand, stores key values and row locators, but does not physically sort the data rows.
- 🚀 The video demonstrates the efficiency of using an index by showing how SQL Server quickly finds a specific employee row using a clustered index.
- 📉 Without an index on a column, SQL Server must perform a full table scan, which is inefficient and slow, especially with large datasets.
- 🛠️ The video includes a SQL script that creates an Employees table, inserts one million rows of test data, and demonstrates the use of indexes.
- 💡 SQL Server provides recommendations for missing indexes to improve query performance, as shown when searching by employee name without an index.
- 📈 The video concludes with a comparison of estimated subtree costs with and without an index, highlighting the significant performance benefits of using indexes.
Q & A
What is the main topic of the video by Venkat?
-The main topic of the video is explaining how indexes work and how they improve the performance of SQL queries, focusing on both clustered and non-clustered indexes.
What is a clustered index and how does it affect data storage?
-A clustered index determines the physical order of data in a table. In the example, EmployeeId is the primary key, and thus a clustered index is created on it, sorting and storing the employee data rows by EmployeeId.
How does the database engine use a clustered index to find a specific row?
-The database engine starts at the root node and follows pointers through intermediate nodes to the leaf nodes, which contain the actual data rows sorted by the key column, allowing quick data retrieval.
What is the difference between data pages and leaf nodes in the context of a clustered index?
-There is no difference: in a clustered index, "data pages" and "leaf nodes" are two names for the bottom nodes of the tree. They contain the actual data rows and are where the sorted data is physically stored.
How many operations does SQL Server need to find an employee with EmployeeId 1120, given the clustered index?
-SQL Server needs only 3 operations (reading the root node, an intermediate node, and a leaf node) to find the employee with EmployeeId 1120, thanks to the clustered index.
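The three-step lookup can be modeled with a small Python sketch. This is a toy in-memory model, not how SQL Server actually lays out pages. The 1200 rows and 200 rows per page come from the video; the three intermediate pages with two leaf pointers each, and the names `follow_pointer` and `find_employee`, are assumptions made for illustration.

```python
from bisect import bisect_left

ROWS_PER_PAGE = 200   # from the video's example
TOTAL_ROWS = 1200     # from the video's example

# Leaf level: six data pages, each holding 200 EmployeeIds in sorted order.
# (A real leaf page holds the full rows; ids alone are enough for this model.)
leaf_pages = [
    list(range(start, start + ROWS_PER_PAGE))
    for start in range(1, TOTAL_ROWS + 1, ROWS_PER_PAGE)
]

# Index rows are (highest key reachable through the pointer, pointer).
# Assumed layout: three intermediate pages, each pointing at two leaf pages.
intermediate_pages = [
    [(200, 0), (400, 1)],
    [(600, 2), (800, 3)],
    [(1000, 4), (1200, 5)],
]
root_page = [(400, 0), (800, 1), (1200, 2)]


def follow_pointer(index_page, key):
    """Return the pointer of the first index row that can contain key, or None."""
    max_keys = [max_key for max_key, _ in index_page]
    pos = bisect_left(max_keys, key)
    return index_page[pos][1] if pos < len(index_page) else None


def find_employee(employee_id):
    """Return (found, leaf page number, pages read) for one key lookup."""
    pages_read = 1                                    # read the root page
    inter_no = follow_pointer(root_page, employee_id)
    if inter_no is None:
        return False, None, pages_read
    pages_read += 1                                   # read one intermediate page
    leaf_no = follow_pointer(intermediate_pages[inter_no], employee_id)
    if leaf_no is None:
        return False, None, pages_read
    pages_read += 1                                   # read one leaf page
    leaf = leaf_pages[leaf_no]
    pos = bisect_left(leaf, employee_id)              # rows in a leaf are sorted
    found = pos < len(leaf) and leaf[pos] == employee_id
    return found, leaf_no, pages_read


result = find_employee(1120)
```

Under these assumptions, looking up id 1120 touches one page per level of the tree, and the page count stays at three no matter which of the 1200 ids is requested.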
What happens when a query is made on a column that does not have an index?
-Without an index, SQL Server has to perform a full table scan, reading every record, which is inefficient and slow, especially with large datasets.
Why is creating a non-clustered index on the 'Name' column suggested in the video?
-Creating a non-clustered index on the 'Name' column is suggested to improve the performance of queries searching by employee name, as it allows the database engine to quickly locate the name in the index and then use the cluster key to find the actual data row.
How does a non-clustered index physically store data in the database?
-A non-clustered index stores key values and row locators. The key values are sorted, and the row locators point to the actual data rows, which are stored in a different order due to the clustered index.
What is the role of the clustered index when a non-clustered index is used to find an employee by name?
-When using a non-clustered index to find an employee by name, the clustered index is used in a subsequent step to locate the actual data row using the cluster key (EmployeeId) retrieved from the non-clustered index.
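The way the two indexes cooperate can be sketched in Python. This is an illustrative model only: a sorted list of `(Name, EmployeeId)` pairs stands in for the non-clustered index, a dictionary keyed by EmployeeId stands in for the clustered index, and the names `name_index`, `rows_by_id` and `find_by_name` are hypothetical.

```python
from bisect import bisect_left

# Stand-in for the clustered index: the full row is reachable by EmployeeId.
rows_by_id = {emp_id: (emp_id, f"ABC {emp_id}") for emp_id in range(1, 1001)}

# Stand-in for the non-clustered index: names in alphabetical order,
# each stored together with the cluster key (EmployeeId), not the full row.
name_index = sorted((name, emp_id) for emp_id, name in rows_by_id.values())


def find_by_name(name):
    """Seek the name in the sorted index, then look each match up by its key."""
    results = []
    # Step 1: binary search for the first index entry with this name.
    pos = bisect_left(name_index, (name,))
    while pos < len(name_index) and name_index[pos][0] == name:
        cluster_key = name_index[pos][1]
        # Step 2: use the cluster key to fetch the actual row.
        results.append(rows_by_id[cluster_key])
        pos += 1
    return results


match = find_by_name("ABC 932")
```

The index entry never carries the whole row, which is why a second lookup by EmployeeId is needed to return the remaining columns.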
What is the impact of having an index on the 'Name' column as shown in the execution plan?
-Having an index on the 'Name' column changes the operation from a clustered index scan to an index seek, significantly reducing the estimated subtree cost and improving query performance.
What is the estimated subtree cost with and without an index on the 'Name' column?
-Without an index, the estimated subtree cost is 11.something, indicating a full table scan. With an index, it is 0.006, showing a dramatic improvement in performance.
Outlines
🔍 Understanding Clustered Indexes in SQL
In this segment, Venkat introduces the concept of indexes in SQL, specifically focusing on clustered indexes. He explains how a clustered index sorts and physically stores data in a tree-like (B-Tree) structure, using the 'EmployeeId' column as an example. The video demonstrates how SQL Server efficiently locates a specific employee row using the clustered index in just three operations, highlighting the performance benefits of using indexes.
🚀 Improving Query Performance with Non-Clustered Indexes
Venkat discusses the limitations of searching without an index, using the 'Name' column as an example. He shows how SQL Server performs a full table scan when no index is available, which is highly inefficient. The video then transitions into creating a non-clustered index on the 'Name' column to improve search performance. It explains how non-clustered indexes store key values and row locators, and how they work in conjunction with clustered indexes to retrieve data quickly.
📈 Comparing Performance with and without Indexes
In the final part, Venkat compares the performance impact of having an index versus not having one. He presents the 'Estimated Subtree Cost' for a query with and without an index, showing a significant improvement in performance with the index. The video concludes with a promise to delve deeper into SQL Server execution plans in future videos, providing viewers with a comprehensive understanding of how indexes enhance database query performance.
Keywords
💡Indexes
💡Clustered Index
💡Non-Clustered Index
💡Data Pages
💡Root Node
💡Intermediate Levels
💡Index Seek
💡Execution Plan
💡Index Scan
💡Missing Index
Highlights
Indexes improve the performance of SQL queries.
Clustered and non-clustered indexes have different functions.
A clustered index sorts and stores data physically in a tree-like structure.
Data pages or leaf nodes at the bottom of the tree contain actual data rows.
Root and intermediate nodes contain index rows with key values and pointers.
Clustered indexes enable quick data retrieval using a series of pointers.
A demonstration of how SQL Server uses a clustered index to find an employee row.
SQL Server can directly read a specific employee row with the help of an index.
Indexes allow SQL Server to find data quickly, even in large datasets.
Searching without an index requires reading every record, leading to inefficiency.
SQL Server suggests creating a non-clustered index when one is missing for a query.
Non-clustered indexes store key values and row locators, not actual table data.
Execution plans show the steps SQL Server takes to execute a query.
Non-clustered indexes work in conjunction with clustered indexes for data retrieval.
The impact of indexes on performance is significant, as shown by estimated subtree costs.
SQL Server Management Studio provides execution plan details for performance analysis.
Transcripts
Hey guys, I'm Venkat and in this video, we'll discuss how indexes actually work and help
improve the performance of our SQL queries. We'll discuss how both index types work - clustered
and non-clustered. If you're new to indexes, we've already covered all the basics you need in this
SQL Server tutorial for beginners course. Please check out the videos from parts 35 to 38. I'll
include the link in the description of this video. Now, consider this Employees table. EmployeeId is
the primary key, so by default a clustered index on the EmployeeId column is created. This means,
employee data is sorted by EmployeeId column and physically stored in a series of data pages
in a tree-like structure that looks like the following. The nodes at the bottom of
the tree are called data pages or leaf nodes and contain the actual data rows, in our case
employee rows. These employee rows are sorted by EmployeeId column because EmployeeId
is the primary key and by default, a clustered index on this column is created. For our example,
let's say in this Employees table we have 1200 rows and let's assume in each data page
we have 200 rows. So, in the first data page we have rows 1 to 200, in the second 201 to 400,
in the third 401 to 600, so on and so forth. The node at the top of the tree is called the root node.
The nodes between the root node and the leaf nodes are called intermediate levels. The root
and the intermediate level nodes contain index rows. Each index row contains a key value,
in our case EmployeeId and a pointer to either an intermediate level page in the B-Tree
or a data row in the leaf node. So, this tree-like structure has a series of pointers that helps the
query engine find data quickly. For example, let's say we want to find employee row with employee id
1120. So, the database engine starts at the root node and it picks the index node on the right
because the database engine knows it is this node that contains employee ids from 801 to 1200. From
there, it picks the leaf node that is present on the extreme right because employee data rows
from 1001 to 1200 are present in this leaf node. The data rows in the leaf node are sorted by
employee id so it's easy for the database engine to find the employee row with id equals 1120.
Notice, in just 3 operations SQL Server is able to find the data we are looking for. It's making use
of the clustered index we have on the table. Let's look at this in action. This piece of SQL script
at the top creates Employees table with these four columns - Id, Name, Email and Department. First,
let's create the table. This second block of code here inserts test data into Employees table. Let's
actually execute the script. It's going to take a few seconds to complete and that's because,
if you take a look at this code, notice we're using while loop to insert one million rows
into this table and if we click on the messages tab, in a few seconds we should see a message
saying 100,000 rows inserted, that's because for every hundred thousand rows that we insert,
we are logging the message. Let's give it a few seconds to complete.
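A scaled-down version of this setup can be sketched in Python. Several things here are assumptions: it uses an in-memory SQLite database rather than SQL Server, it inserts 10,000 rows instead of one million, and the Email and Department values are placeholders, since the video does not spell them out. The `ABC <id>` name pattern follows the name searched for later in the video.

```python
import sqlite3


def create_and_fill(conn, total_rows=10_000, log_every=1_000):
    """Create the Employees table and insert test rows, logging progress."""
    conn.execute(
        """
        CREATE TABLE Employees (
            Id INTEGER PRIMARY KEY,
            Name TEXT,
            Email TEXT,
            Department TEXT
        )
        """
    )
    messages = []
    for i in range(1, total_rows + 1):
        conn.execute(
            "INSERT INTO Employees (Id, Name, Email, Department) "
            "VALUES (?, ?, ?, ?)",
            # Email and Department are made-up placeholder values.
            (i, f"ABC {i}", f"abc{i}@example.com", f"Dept {i % 10}"),
        )
        if i % log_every == 0:
            messages.append(f"{i} rows inserted")   # progress message
    conn.commit()
    return messages


conn = sqlite3.connect(":memory:")
messages = create_and_fill(conn)
row_count = conn.execute("SELECT COUNT(*) FROM Employees").fetchone()[0]
```

In SQLite an `INTEGER PRIMARY KEY` column is the key of the table's own B-tree, which makes it a reasonable stand-in for the clustered index on EmployeeId described in the video.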
There we go, all the 1 million rows are inserted. Now, let's execute this select
query. We are trying to find employee whose id is 932000 and before we execute this query, click
on this icon right here which includes the actual execution plan. You can also use the keyboard
shortcut CTRL + M. There we go, we got the one row that we expected and when I click on the
execution plan and when I hover over this, notice the operation is clustered index seek, meaning the
database engine is using the clustered index on the EmployeeId column to find the employee
row we want. Number of rows read is 1, Actual number of rows for all executions is also 1. Now,
number of rows read is the number of rows SQL Server has to read to produce the query
result. In our case EmployeeId is unique, so we expect one row and that is represented by
actual number of rows for all executions. With the help of the index, SQL Server is able to directly
read that one specific employee row we want, hence both number of rows read and actual number of rows
for all executions are 1. So, the point is if there are thousands or even millions of records,
SQL Server can easily and quickly find the data we are looking for,
provided there is an index that can help the query find data.
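A rough equivalent of this check can be run outside SQL Server Management Studio. The sketch below assumes SQLite instead of SQL Server and a 10,000-row table instead of one million rows, so the plan text is SQLite's own wording and not the "Clustered Index Seek" operator shown in the video.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees "
    "(Id INTEGER PRIMARY KEY, Name TEXT, Email TEXT, Department TEXT)"
)
conn.executemany(
    "INSERT INTO Employees (Id, Name) VALUES (?, ?)",
    ((i, f"ABC {i}") for i in range(1, 10_001)),
)

# Ask SQLite how it would execute a lookup by primary key.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Employees WHERE Id = ?", (9320,)
).fetchall()
plan_detail = plan[0][-1]   # the last column holds the readable description

row = conn.execute(
    "SELECT Id, Name FROM Employees WHERE Id = ?", (9320,)
).fetchone()
```

The description in `plan_detail` should report a search on the integer primary key and not a scan, which is SQLite's counterpart to a seek on the table's own B-tree.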
Now, we have a clustered index on the EmployeeId column, so when we search by EmployeeId,
SQL Server can easily and quickly find the data we are looking for, but what if we search by
employee name? At the moment, there is no index on the "Name" column. So, there is no easy way for
SQL Server to find the data we are looking for. SQL Server has to read every record in the table
which is extremely inefficient from a performance standpoint. Let's actually look at this in action.
Here is the query, we are trying to find the employee by name. Let's execute it. There we go,
we have the one row that we expected and I click on the execution plan and hover over this. Notice,
the operation is clustered index scan. Since there is no proper index to help this query,
the database engine has no other choice than to read every record in the table. This is exactly
the reason why number of rows read is 1 million, that is every row in the table and if you take a
look at actual number of rows for all executions, the value is 1. How many rows are we expecting in
the result? Well, only one row, because there is only one employee whose name is "ABC 932000". So,
to produce this one row as the result, SQL Server has to read all the 1 million rows from the table,
because there is no index to help this query. This is called index scan and in general, index
scans are bad for performance. This is when we create a non-clustered index on the "Name" column.
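The gap between rows read and rows returned can be illustrated with a plain Python loop. This is only a model of a scan, not SQL Server's implementation; `scan_by_name` is a hypothetical helper, and the rows are generated on the fly with the `ABC <id>` naming pattern used in the video.

```python
def scan_by_name(rows, wanted_name):
    """Examine every row, as a scan must when no index covers Name."""
    rows_read = 0
    matches = []
    for row in rows:
        rows_read += 1
        # The scan cannot stop at the first match: without an index it
        # has no way to know that no further rows share this name.
        if row[1] == wanted_name:
            matches.append(row)
    return matches, rows_read


# One million (Id, Name) rows, produced lazily to keep memory use low.
employees = ((i, f"ABC {i}") for i in range(1, 1_000_001))
matches, rows_read = scan_by_name(employees, "ABC 932000")
```

Here `rows_read` ends up equal to the table size while `matches` holds a single row, mirroring the two numbers shown in the execution plan.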
Actually, SQL Server is helping us here. Notice, it's actually telling us there is a missing index.
To improve the performance of this select query, it's asking us to create a non-clustered index
on the "Name" column. Why on the "Name" column? Well, that's because we are looking up employees
by name. So, let's actually right click on this and select this option - "Missing Index Details".
We actually have the required code here to create non-clustered index. Let's uncomment this.
Create non-clustered index, we are creating on the Name column and let's give this index a name "IX"
for index, we are creating it on the Employees table and on the Name column. Let's execute this.
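The effect of creating the index can be sketched with SQLite, again as a stand-in for SQL Server. The index name `IX_Employees_Name` and the helper `plan_for` are assumptions, and SQLite has no clustered or non-clustered keyword: an ordinary `CREATE INDEX` there stores the key plus the row's primary key, which is similar in spirit to what the video describes.

```python
import sqlite3


def plan_for(conn, sql, params=()):
    """Return SQLite's query plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees "
    "(Id INTEGER PRIMARY KEY, Name TEXT, Email TEXT, Department TEXT)"
)
conn.executemany(
    "INSERT INTO Employees (Id, Name) VALUES (?, ?)",
    ((i, f"ABC {i}") for i in range(1, 10_001)),
)

query = "SELECT * FROM Employees WHERE Name = ?"

plan_before = plan_for(conn, query, ("ABC 9320",))   # no index on Name yet

# Hypothetical index name; the video only says the name starts with "IX".
conn.execute("CREATE INDEX IX_Employees_Name ON Employees (Name)")

plan_after = plan_for(conn, query, ("ABC 9320",))    # index is now available
```

Comparing `plan_before` with `plan_after` shows the access path switching from a scan of the table to a search that names the new index.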
Now, let's execute that same select query again. Click on the "Execution plan" tab
and we have several steps here. We'll discuss execution plans in detail in our upcoming videos.
For now, just understand, we read the execution plans from right to left and top to bottom. So,
we start here and when I hover over this, notice, now the operation is index seek
on the non-clustered index. Before we understand this execution plan, let's first understand how
non-clustered index is stored in the database. In a non-clustered index, we do not have table data.
We have key values and row locators. We created a non-clustered index on the "Name" column. So,
the key values, in this case employee names are sorted and stored in alphabetical order.
The row locators that are present at the bottom of the tree contain employee names and cluster key of
the row, in our example employee id is the cluster key. Why? because employee id is the primary key,
by default it is the cluster key. Now, if we look at one of the row locators, notice the names of
the employees are sorted in alphabetical order and we also have their respective employee id. Now,
if you remember, on the employee id we have the clustered index. Now, when we search employee
by name both these indexes, non-clustered index on the Name column and clustered index on the
EmployeeId column are going to work together to find the employee that we are looking for. Let's
look at the steps involved. First, SQL Server uses the non-clustered index on the Name column to
quickly find this employee entry in the index. In a non-clustered index along with the employee name
we also have the cluster key, in our case it's employee id. The database engine knows there is a
clustered index on employee id, so this clustered index is then used to find the respective employee
record. Now, let's relate these steps to the execution plan that we have in SQL Server
Management Studio. Remember, we read the execution plan from right to left and top to bottom. So,
we start on the top right here. Notice, the first step is index seek on the non-clustered index.
On the Name column, we have a non-clustered index and SQL Server is using it to find an entry for
this employee in the index and remember, in the index along with the employee name we also have
employee id which is the primary key. Next, this primary key is used to find an entry
in the clustered index, that's why we have the operation here as key lookup clustered.
The value from the clustered index, in our case employee id is then used in an inner join with
the Employees table to retrieve the respective employee record. If you're new to these execution
plans and wondering why this nested loop or inner join is required, we'll discuss these execution
plans in detail in our upcoming videos. Now, on this slide, I have "Estimated Subtree Cost"
with and without index. So, for this query - Select * from employees where name equals whatever
name we supply, estimated subtree cost without index is 11.something. With index, it is 0.006.
Just imagine the impact it can have on performance if we don't have index on the Name column.
If you're wondering, how did I get these statistics? Well, in SQL Server Management Studio,
on the "Execution plan" tab, if you hover over the "Select" operation, you get the total estimated
subtree cost for all these operations. In our upcoming videos in this series, we'll discuss SQL
Server execution plans in detail with examples. That's it in this video. Thank you for listening.