How do SQL Indexes Work

kudvenkat
30 Mar 2021 · 12:12

Summary

TLDR: In this educational video, Venkat explains how SQL indexes work, focusing on the clustered and non-clustered types. He uses an Employees table example to illustrate how a clustered index on the primary key speeds up queries by organizing data in a sorted, tree-like structure. Venkat demonstrates the inefficiency of searching without an index and shows how creating a non-clustered index on the 'Name' column improves search efficiency. The video includes a practical SQL script example and execution plan analysis, highlighting the significant performance impact of using indexes.

Takeaways

  • 🔍 Indexes are crucial for improving SQL query performance by allowing the database engine to quickly locate data.
  • 🌐 There are two types of indexes: clustered and non-clustered, each serving different purposes in data retrieval.
  • 📚 A clustered index sorts and physically stores data rows in a tree-like structure based on the index key.
  • 📈 The video provides a practical example using an Employees table with EmployeeId as the primary key, which by default creates a clustered index.
  • 🔑 The root and intermediate nodes of a clustered index contain index rows, each holding a key value and a pointer to a lower-level page or to the data in a leaf node.
  • 📊 A non-clustered index, on the other hand, stores key values and row locators, but does not physically sort the data rows.
  • 🚀 The video demonstrates the efficiency of using an index by showing how SQL Server quickly finds a specific employee row using a clustered index.
  • 📉 Without an index on a column, SQL Server must perform a full table scan, which is inefficient and slow, especially with large datasets.
  • 🛠️ The video includes a SQL script that creates an Employees table, inserts one million rows of test data, and demonstrates the use of indexes.
  • 💡 SQL Server provides recommendations for missing indexes to improve query performance, as shown when searching by employee name without an index.
  • 📈 The video concludes with a comparison of estimated subtree costs with and without an index, highlighting the significant performance benefits of using indexes.
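
The test-data step from the takeaways above can be sketched as follows. This is a minimal illustration, not the script shown in the video: it uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the Email and Department values are assumptions (only the "ABC <id>" name pattern appears in the video). The row count is reduced so the sketch runs quickly.

```python
import sqlite3


def build_employees(conn, total_rows=1_000_000, log_every=100_000):
    """Create an Employees table and fill it with predictable test rows."""
    conn.execute(
        "CREATE TABLE Employees ("
        " Id INTEGER PRIMARY KEY,"  # primary key, as in the video
        " Name TEXT, Email TEXT, Department TEXT)"
    )
    for i in range(1, total_rows + 1):
        conn.execute(
            "INSERT INTO Employees (Id, Name, Email, Department)"
            " VALUES (?, ?, ?, ?)",
            # Email and Department formats are made up for illustration.
            (i, f"ABC {i}", f"abc{i}@example.com", f"Dept {i % 10}"),
        )
        if i % log_every == 0:
            # Mirrors the progress message logged during the insert loop.
            print(f"{i} rows inserted")
    conn.commit()


conn = sqlite3.connect(":memory:")
# Smaller numbers than the video's one million rows, to keep the run short.
build_employees(conn, total_rows=10_000, log_every=2_000)
row_count = conn.execute("SELECT COUNT(*) FROM Employees").fetchone()[0]
print(row_count)
```

Calling `build_employees(conn)` with the defaults would insert one million rows and log every hundred thousand, matching the numbers quoted in the video.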

Q & A

  • What is the main topic of the video by Venkat?

    -The main topic of the video is explaining how indexes work and how they improve the performance of SQL queries, focusing on both clustered and non-clustered indexes.

  • What is a clustered index and how does it affect data storage?

    -A clustered index determines the physical order of data in a table. In the example, EmployeeId is the primary key, and thus a clustered index is created on it, sorting and storing the employee data rows by EmployeeId.

  • How does the database engine use a clustered index to find a specific row?

    -The database engine starts at the root node and follows pointers through intermediate nodes to the leaf nodes, which contain the actual data rows sorted by the key column, allowing quick data retrieval.

  • What are data pages (leaf nodes) in the context of a clustered index?

    -Data pages and leaf nodes are two names for the same thing: the bottom nodes of the tree structure in a clustered index. They contain the actual data rows and are where the sorted data is physically stored.

  • How many operations does SQL Server need to find an employee with EmployeeId 1120, given the clustered index?

    -SQL Server needs only 3 operations, one at each level of the tree (root node, intermediate node, and leaf node), to find the employee with EmployeeId 1120, thanks to the clustered index.

  • What happens when a query is made on a column that does not have an index?

    -Without an index, SQL Server has to perform a full table scan, reading every record, which is inefficient and slow, especially with large datasets.

  • Why is creating a non-clustered index on the 'Name' column suggested in the video?

    -Creating a non-clustered index on the 'Name' column is suggested to improve the performance of queries searching by employee name, as it allows the database engine to quickly locate the name in the index and then use the cluster key to find the actual data row.

  • How does a non-clustered index physically store data in the database?

    -A non-clustered index stores key values and row locators. The key values are sorted, and the row locators point to the actual data rows, which are stored in a different order due to the clustered index.

  • What is the role of the clustered index when a non-clustered index is used to find an employee by name?

    -When using a non-clustered index to find an employee by name, the clustered index is used in a subsequent step to locate the actual data row using the cluster key (EmployeeId) retrieved from the non-clustered index.

  • What is the impact of having an index on the 'Name' column as shown in the execution plan?

    -Having an index on the 'Name' column changes the operation from a clustered index scan to an index seek, significantly reducing the estimated subtree cost and improving query performance.

  • What is the estimated subtree cost with and without an index on the 'Name' column?

    -Without an index, the estimated subtree cost is a little over 11 (quoted in the video as "11.something"), reflecting a scan of every row in the table. With the index, it is 0.006, a dramatic reduction.

Outlines

00:00

🔍 Understanding Clustered Indexes in SQL

In this segment, Venkat introduces the concept of indexes in SQL, specifically focusing on clustered indexes. He explains how a clustered index sorts and physically stores data in a tree-like (B-tree) structure, using the 'EmployeeId' column as an example. The video demonstrates how SQL Server locates a specific employee row through the clustered index in just three operations, one per level of the tree, highlighting the performance benefits of using indexes.
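
The three-step walk described here can be sketched in plain Python. This is a simplified model under stated assumptions, not SQL Server's actual page format: it uses the video's numbers (1200 rows, 200 rows per data page, an intermediate node covering ids 801 to 1200), and it assumes each index row stores the highest key of the page it points to. The names `pick_child` and `find_employee` are hypothetical.

```python
from bisect import bisect_left

ROWS_PER_PAGE = 200
TOTAL_ROWS = 1200

# Leaf level: six data pages holding full rows sorted by EmployeeId.
leaf_pages = [
    [(emp_id, f"ABC {emp_id}") for emp_id in range(start, start + ROWS_PER_PAGE)]
    for start in range(1, TOTAL_ROWS + 1, ROWS_PER_PAGE)
]

# Intermediate level: three nodes, each with two index rows of the form
# (highest key on the leaf page, leaf page number). Assumed layout.
intermediate_nodes = [
    [(leaf_pages[i][-1][0], i) for i in (0, 1)],  # ids 1..400
    [(leaf_pages[i][-1][0], i) for i in (2, 3)],  # ids 401..800
    [(leaf_pages[i][-1][0], i) for i in (4, 5)],  # ids 801..1200
]

# Root: one index row per intermediate node.
root = [(node[-1][0], n) for n, node in enumerate(intermediate_nodes)]


def pick_child(index_rows, key):
    """Follow the pointer of the first index row whose highest key >= key."""
    highest_keys = [high for high, _ in index_rows]
    return index_rows[bisect_left(highest_keys, key)][1]


def find_employee(emp_id):
    """Return (row, pages_read) for emp_id, walking root -> intermediate -> leaf."""
    if not 1 <= emp_id <= TOTAL_ROWS:
        return None, 0  # out of range; this sketch skips the walk entirely
    node_no = pick_child(root, emp_id)                         # read 1: root
    leaf_no = pick_child(intermediate_nodes[node_no], emp_id)  # read 2: intermediate
    page = leaf_pages[leaf_no]                                 # read 3: leaf
    ids = [row_id for row_id, _ in page]
    pos = bisect_left(ids, emp_id)  # rows inside a page are sorted too
    return page[pos], 3


row, pages_read = find_employee(1120)
print(row, pages_read)
```

For EmployeeId 1120 the walk picks the right-most intermediate node (ids 801 to 1200) and then the right-most leaf page (ids 1001 to 1200), the same path described in the video.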

05:03

🚀 Improving Query Performance with Non-Clustered Indexes

Venkat discusses the limitations of searching without an index, using the 'Name' column as an example. He shows how SQL Server performs a full table scan when no index is available, which is highly inefficient. The video then transitions into creating a non-clustered index on the 'Name' column to improve search performance. It explains how non-clustered indexes store key values and row locators, and how they work in conjunction with clustered indexes to retrieve data quickly.
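
The before-and-after effect of the index can be reproduced in a rough way with Python's built-in sqlite3 module. This is only an analogue: SQLite is not SQL Server, its `EXPLAIN QUERY PLAN` output is much simpler than a SQL Server execution plan, and the index name `IX_Employees_Name` is an assumption (the video only says the name starts with "IX").

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees ("
    " Id INTEGER PRIMARY KEY, Name TEXT, Email TEXT, Department TEXT)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?)",
    [(i, f"ABC {i}", f"abc{i}@example.com", "IT") for i in range(1, 10_001)],
)


def plan_for(sql):
    """Return SQLite's query plan for `sql` as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text


query = "SELECT * FROM Employees WHERE Name = 'ABC 9320'"

# No index on Name yet: SQLite has to scan the whole table.
plan_before = plan_for(query)

# Assumed index name; plays the role of the non-clustered index on Name.
conn.execute("CREATE INDEX IX_Employees_Name ON Employees (Name)")

# With the index in place, the plan switches to an index search.
plan_after = plan_for(query)

print("before:", plan_before)
print("after: ", plan_after)
```

The first plan reports a scan of the table and the second a search using the new index, which parallels the change from clustered index scan to index seek shown in the video.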

10:08

📈 Comparing Performance with and without Indexes

In the final part, Venkat compares the performance impact of having an index versus not having one. He presents the 'Estimated Subtree Cost' for a query with and without an index, showing a significant improvement in performance with the index. The video concludes with a promise to delve deeper into SQL Server execution plans in future videos, providing viewers with a comprehensive understanding of how indexes enhance database query performance.
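
A back-of-the-envelope calculation gives some intuition for why the gap is so large. It is not the optimizer's cost formula, which the video does not describe: it only compares how many rows a scan reads with how many tree levels a seek visits, assuming 200 entries per page (borrowed from the video's 200-rows-per-page example) at every level. The function name `index_levels` is hypothetical.

```python
def index_levels(total_rows, entries_per_page):
    """Count the B-tree levels (leaf level included) needed to cover total_rows."""
    # Number of leaf pages, rounded up.
    pages = (total_rows + entries_per_page - 1) // entries_per_page
    levels = 1
    while pages > 1:
        # Each higher level needs one index row per page of the level below.
        pages = (pages + entries_per_page - 1) // entries_per_page
        levels += 1
    return levels


TOTAL_ROWS = 1_000_000
ENTRIES_PER_PAGE = 200  # assumption, applied uniformly to all levels

rows_read_by_scan = TOTAL_ROWS  # a scan reads every row
pages_read_by_seek = index_levels(TOTAL_ROWS, ENTRIES_PER_PAGE)

print(f"scan reads {rows_read_by_scan} rows")
print(f"seek visits {pages_read_by_seek} pages")
```

Under these assumptions a seek touches one page per level while a scan reads all one million rows, so the work grows with the depth of the tree rather than with the size of the table.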

Keywords

💡Indexes

Indexes in the context of the video refer to database structures that improve the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure. They are a key concept in the video as they are used to enhance the performance of SQL queries. For example, the video explains how a clustered index on the 'EmployeeId' column allows for efficient data retrieval by sorting and storing the data in a tree-like structure.

💡Clustered Index

A clustered index determines the physical order of data in a table. In the video, it is mentioned that a clustered index on the 'EmployeeId' column is created by default because it is the primary key. This means that the actual data rows are stored in a sorted order based on the 'EmployeeId', which is crucial for the efficiency of the database engine when searching for specific employee data.

💡Non-Clustered Index

A non-clustered index, as discussed in the video, is a separate structure that contains the key values and row locators, or pointers, to the actual data. Unlike a clustered index, it does not re-order the data in the table. The video uses the example of creating a non-clustered index on the 'Name' column to improve search performance when querying by employee names.
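
The two-step lookup can be modelled with two plain Python structures. This is a conceptual sketch, not SQL Server's storage format: a dict keyed by EmployeeId stands in for the clustered index, a sorted list of (Name, EmployeeId) pairs stands in for the non-clustered index, and the function name `find_by_name` and the Department value are made up for illustration.

```python
from bisect import bisect_left

# Stand-in for the clustered index: full rows, keyed by EmployeeId.
clustered = {
    emp_id: {"Id": emp_id, "Name": f"ABC {emp_id}", "Department": "IT"}
    for emp_id in range(1, 1201)
}

# Stand-in for the non-clustered index on Name: no table data, only the
# key value (Name) and the row locator (the cluster key, EmployeeId),
# kept in alphabetical order of Name.
name_index = sorted((row["Name"], emp_id) for emp_id, row in clustered.items())


def find_by_name(name):
    """Seek the name in the index, then look the row up by its cluster key."""
    # Step 1: index seek. (name,) sorts just before any (name, id) pair.
    pos = bisect_left(name_index, (name,))
    if pos == len(name_index) or name_index[pos][0] != name:
        return None  # no such name in the index
    emp_id = name_index[pos][1]
    # Step 2: key lookup in the clustered structure using the EmployeeId.
    return clustered[emp_id]


print(find_by_name("ABC 1120"))
```

The index entry carries only the name and the EmployeeId, so the full row still has to be fetched through the clustered structure, which is the key lookup step in the video's execution plan.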

💡Data Pages

Data pages, also known as leaf nodes, are the lowest level of a tree-like index structure where the actual data rows are stored. The video explains that in a clustered index, these data pages contain the sorted employee rows, with each page holding a specific range of rows, such as rows 1 to 200 in the first data page and rows 201 to 400 in the second.

💡Root Node

The root node is the topmost node in an index's B-tree structure. It is used as a starting point for searches. The video describes how the database engine starts at the root node and navigates through the index structure to find the desired data, such as an employee with a specific 'EmployeeId'.

💡Intermediate Levels

Intermediate levels are the nodes in an index's B-tree structure that lie between the root node and the leaf nodes (data pages). These nodes contain index rows with key values and pointers to other nodes or data pages. The video mentions that these levels help guide the database engine to the correct leaf node where the actual data is stored.

💡Index Seek

An index seek operation, as mentioned in the video, is a type of operation performed by the database engine when it uses an index to find a specific row of data. The video demonstrates an index seek on the 'EmployeeId' column, showing how the engine efficiently locates the data with minimal reads due to the use of the clustered index.

💡Execution Plan

An execution plan is a graphical representation of the steps the SQL Server query optimizer plans to use to execute a query. The video explains how to view the execution plan and how it shows that the database engine uses an index seek to quickly find data, as opposed to a table scan which would be less efficient.
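
A text-only analogue of inspecting a plan can be tried with Python's built-in sqlite3 module. SQLite's `EXPLAIN QUERY PLAN` is far less detailed than the graphical plans in SQL Server Management Studio, so treat this as a rough illustration of the idea rather than a view of SQL Server's output. The table contents are assumptions apart from the "ABC <id>" name pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees ("
    " Id INTEGER PRIMARY KEY, Name TEXT, Email TEXT, Department TEXT)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?)",
    [(i, f"ABC {i}", f"abc{i}@example.com", "IT") for i in range(1, 1001)],
)

# Ask SQLite how it would run a lookup by primary key.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Employees WHERE Id = 932"
).fetchall()

# The last column of each plan row is a human-readable description.
seek_plan = [row[-1] for row in plan_rows]
for step in seek_plan:
    print(step)
```

Because the table is stored in primary-key order, SQLite reports a search on the primary key rather than a scan, which is the closest counterpart here to the clustered index seek shown in the video.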

💡Index Scan

An index scan, as discussed in the video, occurs when the database engine has to read every record in the table because there is no index that can help the query. This is shown in the video when searching by employee name without an index, resulting in a full table scan and poor performance.

💡Missing Index

A missing index is a suggestion by SQL Server for an index that could improve query performance if it were created. The video describes how SQL Server suggests creating a non-clustered index on the 'Name' column to improve the performance of queries searching for employees by name.

Highlights

Indexes improve the performance of SQL queries.

Clustered and non-clustered indexes have different functions.

A clustered index sorts and stores data physically in a tree-like structure.

Data pages or leaf nodes at the bottom of the tree contain actual data rows.

Root and intermediate nodes contain index rows with key values and pointers.

Clustered indexes enable quick data retrieval using a series of pointers.

A demonstration of how SQL Server uses a clustered index to find an employee row.

SQL Server can directly read a specific employee row with the help of an index.

Indexes allow SQL Server to find data quickly, even in large datasets.

Searching without an index requires reading every record, leading to inefficiency.

SQL Server suggests creating a non-clustered index when one is missing for a query.

Non-clustered indexes store key values and row locators, not actual table data.

Execution plans show the steps SQL Server takes to execute a query.

Non-clustered indexes work in conjunction with clustered indexes for data retrieval.

The impact of indexes on performance is significant, as shown by estimated subtree costs.

SQL Server Management Studio provides execution plan details for performance analysis.

Transcripts

play00:02

Hey guys, I'm Venkat and in this video, we'll  discuss how indexes actually work and help  

play00:08

improve the performance of our sql queries. We'll  discuss how both the index types work - clustered  

play00:15

and non-clustered. If you're new to indexes, we've  already covered all the basics you need in this  

play00:21

sql server tutorial for beginners course. Please  check out the videos from parts 35 to 38. I'll  

play00:28

include the link in the description of this video.  Now, consider this Employees table. EmployeeId is  

play00:35

the primary key, so by default a clustered index  on the EmployeeId column is created. This means,  

play00:42

employee data is sorted by EmployeeId column  and physically stored in a series of data pages  

play00:48

in a tree-like structure that looks like  the following. The nodes at the bottom of  

play00:53

the tree are called data pages or leaf nodes  and contain the actual data rows, in our case  

play01:00

employee rows. These employee rows are sorted  by EmployeeId column because EmployeeId  

play01:06

is the primary key and by default, a clustered  index on this column is created. For our example,  

play01:13

let's say in this Employees table we have  1200 rows and let's assume in each data page  

play01:20

we have 200 rows. So, in the first data page we  have 1 to 200 rows, in the second 201 to 400,  

play01:28

in the third 401 to 600, so on and so forth. The  node at the top of the tree is called root node.  

play01:37

The nodes between the root node and the leaf  nodes are called intermediate levels. The root  

play01:43

and the intermediate level nodes contain index  rows. Each index row contains a key value,  

play01:49

in our case EmployeeId and a pointer to either  an intermediate level page in the B-Tree  

play01:56

or a data row in the leaf node. So, this tree-like  structure has a series of pointers that helps the  

play02:03

query engine find data quickly. For example, let's  say we want to find employee row with employee id  

play02:10

1120. So, the database engine starts at the root  node and it picks the index node on the right  

play02:18

because the database engine knows it is this node  that contains employee ids from 801 to 1200. From  

play02:27

there, it picks the leaf node that is present  on the extreme right because employee data rows  

play02:34

from 1001 to 1200 are present in this leaf node.  The data rows in the leaf node are sorted by  

play02:41

employee id so it's easy for the database engine  to find the employee row with id equals 1120.  

play02:49

Notice, in just 3 operations sql server is able to  find the data we are looking for. It's making use  

play02:55

of the clustered index we have on the table. Let's  look at this in action. This piece of sql script  

play03:01

at the top creates Employees table with these four  columns - Id, Name, Email and Department. First,  

play03:07

let's create the table. This second block of code  here inserts test data into Employees table. Let's  

play03:14

actually execute the script. It's going to take  a few seconds to complete and that's because,  

play03:20

if you take a look at this code, notice we're  using while loop to insert one million rows  

play03:26

into this table and if we click on the messages  tab, in a few seconds we should see a message  

play03:31

saying 100,000 rows inserted, that's because  for every hundred thousand rows that we insert,  

play03:36

we are logging the message. Let's  give it a few seconds to complete.

play03:48

There we go, all the 1 million rows are  inserted. Now, let's execute this select  

play03:53

query. We are trying to find employee whose id is  932000 and before we execute this query, click  

play04:01

on this icon right here which includes the actual  execution plan. You can also use the keyboard  

play04:07

shortcut CTRL + M. There we go, we got the one  row that we expected and when I click on the  

play04:15

execution plan and when I hover over this, notice  the operation is clustered index seek, meaning the  

play04:23

database engine is using the clustered index  on the EmployeeId column to find the employee  

play04:28

row we want. Number of rows read is 1, Actual  number of rows for all executions is also 1. Now,  

play04:36

number of rows read is the number of rows  sql server has to read to produce the query  

play04:42

result. In our case EmployeeId is unique, so  we expect one row and that is represented by  

play04:49

actual number of rows for all executions. With the  help of the index, sql server is able to directly  

play04:55

read that one specific employee row we want, hence  both number of rows read and actual number of rows  

play05:03

for all executions is 1. So, the point is if  there are thousands or even millions of records,  

play05:10

sql server can easily and quickly  find the data we are looking for,  

play05:14

provided there is an index that  can help the query find data.  

play05:19

Now, we have a clustered index on the EmployeeId  column, so when we search by EmployeeId,  

play05:24

sql server can easily and quickly find the data  we are looking for, but what if we search by  

play05:30

employee name? At the moment, there is no index  on the "Name" column. So, there is no easy way for  

play05:35

sql server to find the data we are looking for.  SQL server has to read every record in the table  

play05:41

which is extremely inefficient from performance  standpoint. Let's actually look at this in action.  

play05:47

Here is the query, we are trying to find the  employee by name. Let's execute it. There we go,  

play05:54

we have the one row that we expected and I click  on the execution plan and hover over this. Notice,  

play06:01

the operation is clustered index scan. Since  there is no proper index to help this query,  

play06:07

the database engine has no other choice than to  read every record in the table. This is exactly  

play06:13

the reason why number of rows read is 1 million,  that is every row in the table and if you take a  

play06:20

look at actual number of rows for all executions,  the value is 1. How many rows are we expecting in  

play06:26

the result? Well, only one row, because there is  only one employee whose name is "ABC 932000". So,  

play06:35

to produce this one row as the result, sql server  has to read all the 1 million rows from the table,  

play06:42

because there is no index to help this query.  This is called index scan and in general, index  

play06:48

scans are bad for performance. This is when we  create a non-clustered index on the "Name" column.  

play06:56

Actually, sql server is helping us here. Notice,  it's actually telling us there is a missing index.  

play07:02

To improve the performance of this select query,  it's asking us to create a non-clustered index  

play07:08

on the "Name" column. Why on the "Name" column?  Well, that's because we are looking up employees  

play07:14

by name. So, let's actually right click on this  and select this option - "Missing Index Details".  

play07:20

We actually have the required code here to  create non-clustered index. Let's uncomment this.  

play07:26

Create non-clustered index, we are creating on the  Name column and let's give this index a name "IX"  

play07:31

for index, we are creating it on the Employees  table and on the Name column. Let's execute this.

play07:40

Now, let's execute that same select query  again. Click on the "Execution plan" tab  

play07:47

and we have several steps here. We'll discuss  execution plans in detail in our upcoming videos.  

play07:53

For now, just understand, we read the execution  plans from right to left and top to bottom. So,  

play08:00

we start here and when I hover over this,  notice, now the operation is index seek  

play08:07

on the non-clustered index. Before we understand  this execution plan, let's first understand how  

play08:14

non-clustered index is stored in the database. In  a non-clustered index, we do not have table data.  

play08:21

We have key values and row locators. We created  a non-clustered index on the "Name" column. So,  

play08:29

the key values, in this case employee names  are sorted and stored in alphabetical order.  

play08:36

The row locators that are present at the bottom of  the tree contain employee names and cluster key of  

play08:43

the row, in our example employee id is the cluster  key. Why? because employee id is the primary key,  

play08:50

by default it is the cluster key. Now, if we look  at one of the row locators, notice the names of  

play08:56

the employees are sorted in alphabetical order and  we also have their respective employee id. Now,  

play09:04

if you remember, on the employee id we have the  clustered index. Now, when we search employee  

play09:10

by name both these indexes, non-clustered index  on the Name column and clustered index on the  

play09:17

EmployeeId column are going to work together to  find the employee that we are looking for. Let's  

play09:23

look at the steps involved. First, sql server  uses the non-clustered index on the Name column to  

play09:30

quickly find this employee entry in the index. In  a non-clustered index along with the employee name  

play09:36

we also have the cluster key, in our case it's  employee id. The database engine knows there is  

play09:42

clustered index on employee id, so this clustered  index is then used to find the respective employee  

play09:49

record. Now, let's relate these steps to the  execution plan that we have in sql server  

play09:54

management studio. Remember, we read the execution  plan from right to left and top to bottom. So,  

play10:01

we start on the top right here. Notice, the first  step is index seek on the non-clustered index.  

play10:08

On the Name column, we have non-clustered index  and sql server is using it to find an entry for  

play10:14

this employee in the index and remember, in the  index along with the employee name we also have  

play10:20

employee id which is the primary key. Next,  this primary key is used to find an entry  

play10:27

in the clustered index, that's why we have  the operation here as key lookup clustered.

play10:36

The value from the cluster index, in our case  employee id is then used in an inner join with  

play10:42

the Employees table to retrieve the respective  employee record. If you're new to these execution  

play10:48

plans and wondering why this nested loop or inner  join is required, we'll discuss these execution  

play10:54

plans in detail in our upcoming videos. Now,  on this slide, I have "Estimated Subtree Cost"  

play11:02

with and without index. So, for this query -  Select * from employees where name equals whatever  

play11:08

name we supply, estimated subtree cost without  index is 11.something. With index, it is 0.006.  

play11:18

Just imagine the impact it can have on performance  if we don't have index on the Name column.  

play11:25

If you're wondering, how did I get these  statistics? Well, in SQL server management studio,  

play11:30

on the "Execution plan" tab, if you hover over the  "Select" operation, you get the total estimated  

play11:37

subtree cost for all these operations. In our  upcoming videos in this series, we'll discuss sql  

play11:44

server execution plans in detail with examples.  That's it in this video. Thank you for listening.

Related Tags
SQL Indexing, Clustered Index, Non-Clustered Index, Database Performance, Employee Table, Index Seek, Index Scan, Data Pages, B-Tree Structure, Query Optimization