Things every developer absolutely, positively needs to know about database indexing - Kai Sassnowski
Summary
TLDRIn this insightful talk, the speaker, Keselowski, enlightens developers on the intricacies of database indexing, emphasizing its importance for application performance. He clarifies misconceptions, delves into data structures like B-trees, and discusses the impact of functions and inequality operators on index usability. Through a live coding session, he illustrates common pitfalls, such as the inefficiency of full table scans and the importance of index order in multi-column indexes. The talk underscores that indexing is a nuanced, developer-centric task, crucial for optimizing query performance.
Takeaways
- π£οΈ The speaker emphasizes the importance of database indexing for developers, highlighting that it's a crucial aspect of ensuring application performance.
- π Indexes are primarily used to improve read performance in databases, which can significantly enhance the speed of queries and overall application responsiveness.
- π The analogy of a phone book is used to explain what an index is, illustrating how an ordered representation of data can expedite searches.
- π³ The B-tree is introduced as the underlying data structure for database indexes, with a focus on its balanced nature to ensure efficient searching.
- π The index only contains the values of the columns it's created on, and uses a row ID to reference the original table for data not stored in the index.
- β‘ The benefits of indexing include fast search capabilities due to binary search-like operations, with logarithmic scalability.
- π However, indexing is not without trade-offs; it can slow down write operations such as inserts, updates, and deletes due to the need to update the index as well.
- π Understanding execution plans is vital for diagnosing how a database will use an index to execute a query, with different 'access types' indicating different strategies.
- π‘ The talk demonstrates practical scenarios where mismanagement of indexes can lead to suboptimal performance, emphasizing the need for developers to have a deep understanding of how to design effective indexes.
- π« Common pitfalls include the misuse of functions in WHERE clauses which can invalidate the use of indexes, and the importance of column order in multi-column indexes.
- π The order of columns in an index matters significantly, as does the presence of inequality operators which can limit the effectiveness of an index.
Q & A
What is the main topic of the speaker's presentation?
-The main topic of the speaker's presentation is database indexing and why every developer should understand its importance and nuances.
Why did the speaker change the title of Joel Spolsky's blog post in their talk?
-The speaker changed the title to focus on database indexing instead of Unicode because they wanted to emphasize the importance of understanding indexing in the context of their talk.
What is the speaker's profession and where is their workplace located?
-The speaker works for a software development agency located in Berlin.
What is the common mistake the speaker sees in developers' understanding of database indexing?
-The common mistake is that developers often think that adding an index to every column in the 'where' clause of a query will improve performance, without understanding the nuances and potential downsides of indexing.
What is the B-tree and why is it significant in the context of database indexing?
-The B-tree is a balanced tree data structure used in databases to store indexes. It is significant because it allows for efficient searching, insertion, deletion, and access to data in a sorted manner.
Why is the order of data important when creating an index?
-Order is important because it allows for more efficient searching, such as binary search, which is much faster on ordered data compared to unordered data.
What is the purpose of the doubly linked list in the leaf nodes of a B-tree?
-The doubly linked list in the leaf nodes allows for efficient sequential scanning of data without having to go back up the tree each time, thus improving performance during searches.
What additional piece of information does the database store along with the indexed values?
-The database stores a row ID, which is a database-internal identifier that points to a specific row in a table, allowing the database to retrieve the full row if needed.
What is the difference between a 'range scan' and a 'full index scan' according to the speaker?
-A 'range scan' uses the index to find the starting point of a range and then scans through the leaf nodes within that range. A 'full index scan' scans through every value in the index without using it for limiting or filtering the rows.
Why did the speaker's initial attempt to optimize a query with an index not improve performance?
-The initial attempt did not improve performance because the query involved a function on the indexed column (`YEAR` function on the `created_at` column), which prevented the database from using the index effectively.
What is the consequence of using functions on indexed columns in a query's WHERE clause?
-Using functions on indexed columns in a WHERE clause can prevent the database from using the index, as the function's result may not correlate with the index values, thus negating the benefits of indexing.
What is the 'force index' and why should it be used with caution?
-The 'force index' is a way to force the database to use a specific index for a query. It should be used with caution because it can lead to suboptimal query performance if the chosen index is not the most efficient for the query's needs.
What is the term used to describe an index that can be used to satisfy a query entirely from memory?
-An 'index only scan' is used to describe an index that contains all the data needed for a query, allowing the operation to be performed entirely in memory without additional disk reads.
Why did adding the 'total' column to the index improve the performance of the query?
-Adding the 'total' column to the index improved performance because it allowed the query to perform an 'index only scan', meaning all necessary data was available in the index, eliminating the need for additional disk reads.
What is the significance of the order of columns in a multi-column index?
-The order of columns in a multi-column index is significant because it determines which parts of the index can be used for filtering data in a query. The database can only use the index from left to right and cannot skip columns.
What is the impact of using inequality operators on the usage of a multi-column index?
-Using inequality operators on any of the columns in a multi-column index can limit the effectiveness of the index. The index can only be used up to the point where the inequality operation is applied, as if the index stops at that column.
What is the final recommendation the speaker gives regarding indexing and query optimization?
-The speaker recommends that developers should consider indexing as their concern, design indexes specifically for the queries they write, and understand that the context of data access is crucial for creating efficient indexes.
Outlines
π£οΈ Database Indexing Introduction
The speaker, Keselowski, opens the session with an energetic introduction, expressing excitement about the event and the topic of database indexing. He references Joel Spolsky's blog post on Unicode, changing the subject to indexing, and shares his contact information. Keselowski works for a software development agency in Berlin and hints at hiring opportunities. He discusses the importance of understanding what a database index is and its role in improving query performance, which is crucial for application performance. He emphasizes that while developers generally understand the purpose of indexes, many lack deep knowledge about how to use them effectively, which is the gap he aims to address in his talk.
π The Theory of Indexing
Keselowski dives into the theory behind database indexing, starting with the analogy of a phone book to explain how indexes work. He introduces the concept of a B-tree, a balanced tree data structure used in databases for indexing. The speaker explains the properties of B-trees, such as all leaf nodes being at the same level, which ensures consistent search time. He also discusses the doubly linked list nature of leaf nodes for efficient sequential scanning. The summary clarifies that indexes only contain the columns they are created on, and the database uses an additional row ID to retrieve the full row from the table when needed.
π Understanding Execution Plans
The speaker discusses the importance of understanding execution plans for optimizing database queries. He explains that execution plans show the steps a database takes to execute a query, and the 'type' column in the plan indicates how the database accesses the data. Keselowski outlines different access types like 'const', 'ref', 'range', 'index', and 'all', each with its implications for performance. He emphasizes that developers should aim for 'const' or 'ref' in their execution plans for efficient querying and avoid 'all', which indicates a full table scan.
π οΈ Live Coding and Indexing Pitfalls
Keselowski presents a live coding session to demonstrate the practical application of indexing. He creates a hypothetical scenario where a report is requested for all orders placed in 2013. Initially, he makes a common mistake by adding an index on the 'created_at' column without considering the query's full requirements. This leads to a full table scan despite the index, highlighting the importance of aligning indexes with query needs. The speaker also points out the pitfalls of using functions in the WHERE clause, which can prevent the use of indexes.
π Optimizing Queries with Indexes
The speaker revisits the previous query and explains how to optimize it by adjusting the index to include necessary columns for the query's SELECT and WHERE clauses. By creating a composite index that includes 'created_at', 'total', and 'user_id', Keselowski demonstrates how to perform an index-only scan, which can dramatically improve query performance by eliminating disk reads. He emphasizes the trade-off between creating specific indexes for particular queries and maintaining a balance to avoid excessive indexing.
π Adjusting Indexes for Changing Requirements
Keselowski addresses a new requirement to modify the report for a single user's orders. He shows how adding a 'user_id' to the query without adjusting the index leads to a full table scan. The solution is to include 'user_id' in the index to avoid unnecessary disk reads. However, he also explains the limitations of multi-column indexes, noting that the order of columns in an index is crucial and that skipping columns or using inequality operators can limit the index's effectiveness.
π The Nuances of Multi-Column Indexes
The speaker further explores the complexities of multi-column indexes, explaining that they can only be used from left to right and that inequality operators can restrict their use. He demonstrates how changing the order of columns in an index can affect its usability with different queries. Keselowski emphasizes the importance of designing indexes specifically for the queries they will support, rather than creating generic indexes.
π Conclusion: Indexing as a Developer Responsibility
In the conclusion, Keselowski reiterates that indexing is a critical concern for developers. He stresses that indexes should be designed in the context of specific queries and that developers, who understand their data access patterns, are best positioned to create effective indexes. The speaker encourages developers to take ownership of indexing and to make informed decisions based on the needs of their applications and queries.
Mindmap
Keywords
π‘Database Indexing
π‘Query Performance
π‘B-Tree
π‘Index Design
π‘Execution Plan
π‘Index Only Scan
π‘Full Table Scan
π‘Multi-Column Index
π‘Function-Based Index
π‘Index Pitfalls
Highlights
The importance of understanding database indexing for developers to improve application performance.
Indexing is not just about faster reads; it's a nuanced aspect of database management.
The analogy of a phone book to explain the concept of an index in a database.
Introduction of the B-tree as the fundamental data structure behind database indexes.
The significance of order in an index for efficient data retrieval.
How the B-tree maintains balance to ensure consistent search performance.
The role of leaf nodes and doubly linked lists in efficient index traversal.
Clarification that an index does not contain all table data, only the indexed columns and a row ID.
The impact of using functions on columns within a WHERE clause, negating the use of indexes.
The concept of an 'index-only scan' and its potential for maximizing query performance.
The pitfalls of designing indexes without considering the specific queries they need to serve.
The demonstration of how adding unnecessary columns to an index can degrade performance.
The importance of column order in multi-column indexes and how it affects query performance.
The issue with inequality operators in multi-column indexes and their effect on index utilization.
The practical example of optimizing a query for orders placed in 2013, illustrating the complexities of indexing.
The conclusion emphasizing that indexing is a critical concern for developers, not just database administrators.
Transcripts
[Music]
[Applause]
[Music]
[Applause]
welcome everyone let me start my timer
really quickly so welcome everyone I am
super excited to be here it's been an
awesome event so far I've really been
enjoying myself greatly talking to all
sorts of cool people here I'm slightly
mad at risky for hanging the bar so high
but anyways I am here to talk to you
about things that I believe every
developer absolutely positively needs to
know about database indexing which by
the way is a the title of an old blog
post by Joel Spolsky which was called
things I believe every developer
absolutely positively needs to know
about Unicode so I just changed the last
words because I thought that was cool my
name is Keselowski you can find me at my
blog where I post once every two years
on github or on Twitter where I post
once every three to four months I work
for this awesome group of people we are
a software development agency in Berlin
we are hiring so if you're interesting
come talk to me after the talk now when
I interview people I like to ask them
the question what is the database index
and what do we use it for and most
people are able to answer the question
something like this well a an index
makes reading faster improves query
performance and yes that is that is
actually correct that is the answer of
why we index our database now the
question is like why do we want fast
reads then well turns out slow queries
are one of the most common if not the
most common causes of poor application
performance I believe we have all
experiences at least once in our careers
where we just happens one query that
when it runs it just grinds everything
to a halt a slow query can be the
difference between a site being super
snappy and being literally unusable and
sometimes the difference between these
two extremes is just one good index away
so if most developers can already answer
the question of why do we use an index
to improve read performance and I
believe we can all agree on this that
slow queries are in fact
what am I doing here why am i talking to
you right now
the reason is that developers do not
know enough about indexing indexing is a
lot more nuanced than just throwing an
index at every column in your where
clause I'm hoping that one of them
sticks if you don't know what you're
doing a bad index can actually make
performance worse excuse me I got a
little emotional there so here's what's
gonna happen I won't be able to make you
into an expert there's simply not enough
time for that
my main goal for this talk is to show
you how much there actually is to this
topic if you leave this talk thinking
man I should really read up on this
indexing stuff then I'm happy to have
done my job nevertheless we are gonna
learn a few practical things but first
there's gonna be some theory we are
gonna talk about what an index actually
is and I'm talking data structures this
is to me the most important part of the
talk because everything else everything
about how an index behaves in certain
situation why it sometimes works and
sometimes doesn't can be deduced from
understanding how the underlying data
structure works you don't have to
memorize everything that's coming up in
the rest of the talk if you understand
this part of it you're in a very good
shape then points two and three are
understanding the execution plan and
common pitfalls when designing indices
and these are gonna happen roughly at
the same time we're gonna explore both
as we go through the example okay
so what's an index when I ask people
what an index is most of them like to
use the sort of mental image of a phone
book I don't know how much longer I will
be able to make that example until
people go a phone what case in point I
could not find a picture of a phone book
on the stock site I was using so I
improvised here's a phone in a book okay
if you think about a phone book for
simplicity's sake let's say it has three
pieces of information it has the first
name the last name and the phone number
and a phone book is almost always going
to be ordered by last name so if I give
you someone's last thing you can find
the person in the phone book pretty
quickly because the
book is ordered for exactly that type of
search query if I only give you
someone's first name you're gonna have
to look through every single entry and
you probably end up with a whole bunch
of results so you could say in a phone
book is like having an index on last
name more generally speaking an index is
an order representation of the index
data now why is order so important turns
out what we're doing when we're creating
our data is we are searching our data
for a subset of that data and it turns
out that searching an ordered input is a
lot more efficient than searching in an
unordered input think binary search
write very very fast assumes your input
is sorted okay enough with all the
mental models and helper images the
training wheels are about to come off we
are going to look at what an index
actually is introducing the b-tree what
does the B in B tree stand for it is not
binary thank you for saying binary it is
in fact balanced it is a balanced tree
we're gonna look at what exactly that
means and how it is balanced but it is a
balanced tree I was hoping I wouldn't I
wasn't sure how I was gonna react with
someone set balanced so thank you
whoever said binary okay so what is the
B tree looked like surprisingly it looks
a lot like a tree now if you look at the
very top note at 23 this is called the
root node you can see that it has two
sub trees it has the left sub tree and
has the right sub tree all the values in
the left sub tree are less than 23 and
all the values on the right sub tree are
greater than or equal to 23 and this is
true for every node in the tree the left
sub tree is always less than the right
sub tree is always greater than or equal
to this is how the tree is ordered what
we can also see is that the nodes at the
very bottom these are called the leaf
nodes they don't have any further sub
trees the leaf nodes are all at the same
level of the tree or the same depth of
the tree this is important because this
way we can guarantee that it takes the
same
number of steps to find every node or
any value in the tree if we had one side
of the tree there was just way deeper
right and it was unbalanced then
searching for a value in that part of
the tree would take longer which is not
what we want okay so this is why it's a
balanced tree the leaf nodes will always
be at the same level the leaf nodes also
form what's called a doubly linked list
which you can see by the tiny arrows I
added pointing back and forth so each
set of leaf nodes has a pointer to the
next set of leaf nodes and I pointed to
the previous set of leaf nodes so there
should actually be two more arrows on
either side pointing to nala now why is
this important
turns out that when we're searching for
data in an index we are gonna spend a
lot of our time down in those leave
notes scanning through them if we didn't
have a doubly linked list we would have
- as soon as we hit the end of one of
those leaf nodes we'd have to go back up
the tree find the next leaf node Skaar
scanning go up again and so on that's
not very efficient by having the W
linked list we can just scan through
them sequentially without having to go
back up the tree every time so an
interesting question now is what data is
actually being stored on the index if we
have a very simple table like this we
just have employee ID and first names
and say we put an index on the name
column the index you would end up with
looks like this the employee ID employee
ID is nowhere to be found on this index
it only contains the name in other words
an index only contains the values of the
columns you actually put the index on
now this sounds very dull but I've heard
people describe an index as well it's
like a copy of your table that is
ordered by a different column that is
incorrect that actually has consequences
and we're gonna see why it's important
to know what data is actually being
stored on the index this kind of begs
the question now what if I need data
that is not only index how do I get back
to the table if I actually want to find
out
what is the employee ID of Taylor so
this image is actually incomplete the
database stores one additional piece of
information on the index for every value
the so-called row ID it's not literally
a string row ID it is a database
internal identifier that identifies a
particular row in a table it is not your
primary key it's a database internal
value and so that the database stores
that bit of information for every value
in the index and that way we can go back
to the table and find the corresponding
row to which this value in the index
belongs to okay so much for the data
structure now what are the consequences
of choosing that data structure
first of all searching for a value is
very very fast because essentially what
you're doing is binary search right with
every step you go either left or right
and you eliminate a very large portion
of your remaining data and so you can
narrow down your search very quickly and
either find the value or find out that
there is no result and it scales really
really well because it has logarithmic
scalability which is a fancy way of
saying it pretty much doesn't matter how
large your input gets this will still
keep performing okay so if indexing
makes our queries faster then we should
just put an index on every single column
right chance to redeem yourself whoever
said binary right no okay of course not
there are no silver bullets we should
notice by now there's always only
trade-offs using an index improves
retime yes but it makes writing slower
with every insert update and delete you
perform you also have to insert update
and delete into the index which would
mean the index will have to be
potentially rebalance which can take a
while and if you have a whole bunch of
indices your rights are gonna be very
very slow so as a rule of thumb you want
to have as many indices as necessary but
as few as possible okay moving on
understanding the execution plan and in
tiny font it says my sequel version this
is because the output of the execution
plan looks different for each database
vendor and I've just made a gamble that
most of us here will be using my sequel
who here is not using my sequel or Maria
to be okay to people like okay so sorry
but you're gonna have to be a bit of
extra work the concepts I'm going to
talk about are gonna be the same across
database vendors it's just that they use
different terminology to speak about
them now what the key execution plan
actually is it's these steps that the
database needs to perform to execute
your query so here's how you get the
execution plan that's pretty uniform
across all database vendors you just
pretend and explain to your query so
instead of saying select star you say
explain select star and then on my
sequel you will get a result that looks
something like this do note that this is
actually one row and I have just
formatted it that way so it fits better
on a slide so it is one row and the
columns are ID select type table
partitions and so on now that's a lot of
rows we're not gonna look I'm sorry
columns we're not going to look at all
of them but we're going to look at the
most important one right now which is
the type column right here it says type
Const if you look up the column in the
documentation it'll call it the join
type which I think is not the best name
for it because like in this example
there is no joins I'm not really sure
what it's referring to right it's a bit
confusing to me I think a better name
for this would be access type because
what this really tells us is how the
database is going to access our data how
exactly it is going to use an index or
not use an index to execute our query so
we're gonna go through the possible
values you might encounter in the
explain output in that column and talk
about the characteristics of each one of
them really quickly don't worry if this
is all a bit much right now we will come
back to them as we go through the
examples is just so that you've seen
them and have a rough on
standing in how they differ so starting
with cons or a craft now these two
values are actually distinct values they
might appear on their own in the
explained output I have grouped them
together here because for our purposes
they behave the same way they access the
data in the same way so I'll treat them
as one
so what constant EKF do is the database
is gonna perform a bee tree traversal to
find a single value in the index tree so
basically you're doing binary search
right this can only get used if you can
somehow guarantee uniqueness of your
result set that means you can somehow
guarantee that that can be at most one
result there's two ways you can do this
one would be to have a primary key on
the column which is what I had in the
example I had a where clause on the ID
the other one is you have a unique
constraint on that column fun fact limit
does not guarantee uniqueness if you say
limit 1 it does not guarantee uniqueness
because you might still fetch more than
one row you're just discarding them
afterwards okay so in limit does not
guarantee uniqueness ok so basically
you're starting at the top of the tree
and then you go left or right depending
on if the value is less than or greater
then until you either find a value or
you find out that there is no value and
as you can imagine that is super fast if
you seek on story craft in your explain
output stop optimizing it's not gonna
get any faster moving on ref and range
again these are two distinct values but
we're grouping them together because for
our purposes they are the same they
behave the same way they are known what
they are known as an index range scan
what they do is they perform a b-tree
traversal but instead of finding a
single value they find the starting
point of a range and then they scan from
that point on so let's say we had a
query where we say where ID greater than
15 and less than 30 what this would do
is it would do a bee tree traversal to
find the first value that is greater
than 50
and from that point on it'll start
scanning through the leaf nodes remember
this is where the double linked list
comes in until it hits the first value
that is less than or greater than or
equal to 30 and these rows are the only
rules that database has eaten has to
look at okay so what a range or a ref
does is it limits the total number of
rows the database has to inspect to
perform your query which is a good thing
right less work for the database next
one index this is also known as a full
index scan we are still using the index
but we are not using it to limit the
number of rows like we just did we are
literally starting at the very first
leaf node and we're gonna scan through
all of them until we hit the very last
leaf node okay so there's no filtering
of anything going on you simply traverse
through the entire index but you're
still using the index and lastly there
is all which is a so-called full table
scan which is everything as bad as it
sounds you do not want this think of all
as avoid at all costs okay a full table
scan is pretty much never what you want
you do not use an index at all you are
gonna load every row and every column of
the table into memory go through them
one by one and then omit or discard them
depending on whether or not they fulfill
the filter criteria okay so to recap we
have cons or a craft basically binary
search to find a single value super fast
stop if you got that it's perfect next
we have referent we are using the index
to limit the number of rows to a certain
subset of rows that we even have to look
at next we have index we still use the
index but we're not using it for
limiting or filtering we're just
scanning through every value in the
index and lastly we have all avoided all
costs the full table scan where we just
load up everything and go through it
okay so this is where the live coding
part comes in this is where it gets
scary
let me see you don't see anything I'm
gonna have to mirror my displays for
that so I don't break my neck
there we go oh no I'm giving everything
away so let's see I have prepared a
example database that contains a single
table that is a formula unfortunately
contains a single table called orders it
has roughly 2.2 point five million rows
in it and it's just garbage data
basically it's randomly generated data
it has four columns the ID the total
which is the total number of cents of
the order the user ID there's not even a
users table and the created add date
when that order was placed now here's
what we need to do management comes to
us and says we want a report we want to
know the total sum of all orders that
were placed in 2013 so let's see sounds
easy enough so we're going to do
something like this we're going to say
select the sum of the total column from
the orders table we're okay we're what
orders placed in 2013 so you might do
something like this I have seen this so
there is a year function in my sequel
that basically extracts the Year part
from a date and returns it as either a
number or a string I'm not quite sure so
you can say what year of created add
equals 2013 you run that it returns a
value looks plausible enough you commit
your force push you go home okay next
day your boss comes to you and says yes
we got the report numbers look good but
oh my god is it slow
can you please optimize and so you say
okay I need to improve read performance
that means I need an index I only have
one value in my where class but just the
created adds so I'm just gonna put an
index on that let's do that I'm gonna
use the GUI for this because it's faster
so I
ad you see that add an index I don't
care what it's called but I want it to
be on the created at column of the table
so I save that it actually takes a few
seconds as you can see every because it
has to build up the index for the first
time and now we are going to rerun the
same query and it's still around 600
milliseconds it did not change at all
okay at this point you're thinking back
to Larrick only you 2018 there was this
guy talking about database indexing and
there was something about and explained
statements and maybe that can help so
let's just look at the query execution
plan I'm gonna prepend and explain to
this query and I get back this output
this time as a row not as a bunch of
columns like on the slide and right away
we can see here it says type all we are
doing a full table scan we are loading
all 2.5 million rows into memory and
then we're gonna go through them one by
one that sounds terrible what's weird is
the two columns next to the type which
are these two they are called I don't
think it can read this
they're called possible keys and key
possible keys would contain the names of
the indices that the database set could
potentially be used for this query and
then key contains the name of the index
it actually used now the interesting
thing is and this is kind of hard to see
visit for some reason it sort of cuts
off half of the text possible keys is
null so for some reason even though we
have an index on that column it is not
even being considered right so the
database things
this index cannot work at all so what is
going on which brings me to the first
pitfalls which is functions if you have
a query like this where in your where
clause you say where year off created at
something what a database essentially
sees is this it doesn't matter what goes
into the function because the column
will not be seen by the database at all
this is because you cannot guarantee
that the output of the function has
anything to do with the index values
right let's assume you have a function
instead of a year you have a function
that calculates the number of string
characters on a string so the output
would be a number but you have your
index on just the string field okay so
that can't possibly work because now
you're trying to use an index that's on
a string column with the result of this
function call which is going to return a
number so there's no connection between
these two values so if you use a
function on a column in a where Clause
you cannot use an index on that function
anymore by saying here of created ad we
have lost the ability to use an index on
the created add column so this sucks
what can we do there's a couple things
we can do in Postgres there are things
called function based indices which is
basically instead of putting in index on
the Year column you put the index on
year I sorry on the created add column
you put the index directly on year I've
created at it's like a computed index
and then this will behave correctly my
sequel doesn't have that even in my
sequel eight I checked this morning even
my C plate doesn't have function based
indices it has something that is a bit
similar call that generated column so
you basically add a new column to your
table and in its declaration you can say
the value of this field will be the
result of calling year of created ad on
that row so it's like I generated a
computed field if you want and then
since this is just a regular column in
your table you could have put an index
on that column and it would work but in
our case please do not do this this is
not how you work with dates in a
database do not use a year column a
month column add a column an hour column
right there are ways to deal with dates
in the database that are much better
suited to this there's a reason that
there's a date/time field in a database
so what we're gonna do we're gonna get
rid of this and let's think about what
we want to do we want to get all orders
that were placed in 2013 so why not just
define the explicit range we are looking
for why not say where
created at between between what the
first possible value in 2013 that is the
first of December ad midnight and the
last possible value in 2013 which would
be the 31st of December at one second
before midnight so this is how you deal
with dates because now you can do much
more interesting things you can say give
me all rows that are between the 3rd of
April at 7:00 in the morning
and the 28th of August at 3:00 the
afternoon right you can't do that with a
year column so looks like we just solved
our problem and we remove the function
and it should be blazingly fast the
place is gonna go wild and so let's just
lets us run this let's just look at the
explained statement for now oh boy still
a full table scan but something changed
the possible keys column now at least it
sees our index as something that could
potentially be used but for some reason
it is deciding not to use the index and
this is the point where if you don't
know what's going on you might be
tempted to make a mistake like this
where you say piece-of-crap database I
know better than you I just designed
this handcrafted index for exactly this
query you should really use it so in my
sequel and this is do not try at home
territory in my sequel you can do force
index and then give it the name of the
index like if it begins with force it's
probably a bad idea
so let's run the explain statement with
the force index
okay take a look at that we changed from
a full table scan to arrange scan so now
thinking back to the slide the previous
slides a range scan means we are now
using the index to find the beginning of
a range and then just scanning until we
hit the end of the range and don't look
at anything other we can actually double
check that my sequel the explained
output has this column back here called
rows
which is terribly named like so many
things in this output because Rose is
not the number of rows in your result
set it is the number of rows the
database estimates it has to look at to
execute your query so in this case it's
around 460 thousand so that's around the
fifth of our of our rows if I just
remove the force index again to see the
full table scan Rose says 2.4 million ok
so we have to look at everything and
with the range scan with the range scan
we only look have to look at 466
thousand so we've just proven the
database wrong because everything is
better now we are gonna remove the
explain from a query run it it's gonna
be amazing we're gonna get promoted and
take a look at this oh my god it is
still running it just took 4 seconds so
we clearly have just discovered a bug in
my sequel we should go and open a ticket
and flame the maintainer 'he's about
this piece of software don't do it
there's some people pulling out their
phones don't this is a joke ok but this
is confusing if you don't notice what's
going on this is super weird and this is
what I meant if you don't know what
you're doing you can make performance
worse it just went from 600 milliseconds
to 4 seconds I'm bad at math but that's
more than twice as slow I think so
what's going on is this comes back to
what data is actually being stored on
the index if you look at the query oh
sorry if you think back to the index we
have an index on only they created that
column the query however says select sum
of total we are also referencing the
total column the total column is not on
the index so what the database needs to
do is for all of these 466 thousand rows
it estimates it has to look at it will
have to take the row ID go back to the
table fetch the corresponding row that
is a read from disk right fetch the row
take the total column sum
and do that 466 thousand times that is
466 thousand is that's bad so you might
say well isn't a full table scan even
worse it has to do that 2.4 million
times
dramatic pause while I drink no it isn't
the database is actually smart enough to
know that if I have to do a full table
scan anyways I know from the get-go I
have to read everything anyways so it's
not gonna read them one by one it'll
actually batch read them and read a
couple thousand at a times and the
amount of disk IO is gonna be way less
and turns out in this example that is
actually way faster even though you have
to still look at five times as many rows
but you cut down on the number of disk
IO the disk reads you have to perform
and in total that is much faster than
using the index and having to read from
disk so the database rightly decided in
this case a full table scan is actually
faster than using the index so now we
know what's happening still sucks so
what can we do if we look at our where
clause we've kind of exhausted our
options there right we already have a an
index on the created add column but if
the issue is that not all data is
present on the index why not put the
data on the index why not put the total
column on the index as well so this is
what I'm gonna do I'm gonna take our
existing index and just gonna change it
to also include the total column gonna
save that it'll take a second or two or
three okay there we go
and now let's rerun this explained
statement there we go now it is
voluntarily choosing our index because
it can see that it doesn't have to read
from disk anymore the interesting thing
to note is that in this extra column
which I think is the worst column in
this entire output because it's not
quite clear what is extra and the values
that appear in this column are usually
really weird like using index
because what this really means is using
index only we have just put all the data
that this query needs on the index that
means this operation can now be
performed entirely in memory because my
sequel stores its indices in memory we
have put all the data that this query
needs on the index so there are no reads
from disk at all this is what's called
an index only scan it's one of the most
powerful tools you can use when you're
designing an index it is also one of the
most aggressive ones what do we mean by
aggressive we have just created a very
specific index and index terms probably
only works for this query because not
only have we indexed for the where
clause we have index for the Select
clause as well okay and thinking about
the whole you should try to keep the
number of indices to the minimum that
you need introducing an index like that
that can probably only be used by a
single query is a trade-off that you you
have to consider right but it can
dramatically improve performance so if
we now run this query without the
explain it just took 92 milliseconds so
from 600 milliseconds to below 100
milliseconds okay great
you got a home don't forget the force
push next day management comes to you
and says great job on the report we have
a feature request and the hair on the
back of your neck starts standing up and
they say we want the same report but now
we want to be able to do it for only a
single employee they have a sigh of
relief like that's easy okay so that's
just for illustration purposes let's
just take a random user ID no that's a
total this one one three six and you say
well that is easy we just say and user
ID equals this one you run it it's still
running it took two seconds okay let's
look at the explain what's going on here
oh my god we're doing a full table scan
again well luckily the solution to this
is pretty simple because we just
basically reintroduced the same problem
we just solved we added a column to the
query that is
not on the index and then we have to do
all that reading from disk again so if
the problem is the same why not get with
the same solution just add the user ID
column to the index I mean this indexing
stuff is easy really so I'm just going
to put the user ID on the index as well
save it wait for it to finish and then
let's rerun the explained statement
there we go we are now back to using the
index we're doing a range scan something
is still not right though the rows
column says it's still things it has to
look at around half a million rows that
doesn't seem right that's the same
number it thought it had to look at when
we didn't have the user ID constrain
that should be a much more limiting
constraint right a user should not maybe
have a couple dozen orders so why is it
thinking that it has to use this look at
the same number of rows than before this
is telling me that it
yes it's using the index but it's not
using it to its full extent which brings
me to the next pitfall so this is what
our Inuk looks right now as a table okay
we have created add we have total we
have user ID so this means this table or
the index is sorted first by created add
then if two values have the same created
add value they're sorted by total and if
they have the same created add and total
value they are then sorted by user ID
here's what you need to know about multi
column indices you can use a multi
column index from left to right so you
can use this index for a query that uses
that filters on created at filters on
created a done total and created a total
and user ID you cannot skip columns what
we're doing is we have a where clause on
created add and user ID we're skipping
the total that doesn't work why doesn't
it work because the user ID itself is
not sorted it is only sorted in respect
to the total and the created add so if
we just leave out the middle column the
user ID column is essentially unsorted
so it is still
using the index but it's only using it
up until created at point being the
column order in an index matters and
index on a and B is not the same as an
index on BN a so if this is the problem
and we have a where clause on on created
and user ID the solution seems to be to
put the user ID into the second place
right so we can use the first two
columns of the index so let's do that
I'm gonna rearrange the columns in the
index and put the user ID into the
second position save it and then when we
go back and we rerun it it should now
have to look at way fewer rows than
before
it doesn't it actually has to look at
more rows but that's just because this
is an estimation so it fluctuates a bit
but it still says it has to look at four
hundred five hundred thousand rows
nothing changed
so everything I just said is correct
right the column order matters there's
actually another problem with this query
which brings me funnily enough to my
last pitfall which is inequality
operators yes you can use an index from
left to right but as soon as you have an
inequality operation on any of those
columns in the index it's as if the
index just stops there we actually have
an inequality operation we have a
between on the created add column the
created at column is the first column in
our index so it's as if our index just
stops there
this is why it's basically unchanged
between having a user ID and not having
the user in a query because it doesn't
even get to that point right the index
can only be used up until created app
because we have an inequality operation
on that column so again if this is the
problem then we should just put the user
ID into first position right because we
have an equality operation on user ID
and then we put the created either into
the second position because we have an
inequality operation on that so let's
let's try
I'm gonna change the index one last time
put the nut of total put the user ID
into first position created at into
first position so this could should mean
we should be able to use the index to
limit the number of rows on just the
rows that the user ordered and then
limit them further to only the orders
that were placed in 2013 so if
everything went according to plan and we
rerun this query now there we go
886 rows okay now it's using the index
to its full extent to use both column in
the index to filter our number of rows
we have to look at if we remove the
explained statement
it just ran under a millisecond okay we
went from four seconds to under a
millisecond that is really really fast
okay so you commit you for suppose you
go home next day management says cool
the user report is really well but we
noticed that the report on all orders
seems to have gotten slower again and
you're like no this is this isn't
happening let's have a look remove the
user ID run the query Oh God we're back
to 600 milliseconds what is going on
let's take a look at the explain oh this
is a new one index you've seen that one
before
so now we're doing a full index scan we
are still using the index but we're not
using it too limiting the number of rows
we have to look at we are basically
starting at the very first leaf node and
then just traverse through all 200 2.5
million rows which we can see back here
in the rows column where it says yes I
do have to look at every single column
so what just happened well we changed
the order of the columns now the user
IDs in first place that query doesn't
have a user ID and so since we can use
since we can only use an index from left
to right and we can skip columns we
basically can't use the index anymore
because the created add is not
second column okay so the point is there
is no query that can satisfy us our
there is no index that can satisfy both
these queries and this is sort of the
the moral of this entire talk is
indexing is a developer concern it is
not the concern of your database person
or the guys sitting in the corner over
there who seems to have my sequel
workbench open all the time
it is a developer concern and it's
because a index and a query always have
to go together you do not design an
index in a vacuum you always design an
index for a query and only we as
developers know how our queries actually
look like how we are accessing the data
so only we are in a position to write a
good index so in this case you would
actually have to decide do I introduce a
second index is the report that's run
for all users maybe only run once a year
so it really doesn't matter if it takes
600 milliseconds right it's it's
something that you can only decide if
you know the context and you know how
your index how your data is being
accessed so to close this talk everyone
repeat after me I am a developer
indexing is my concern indexing is my
concern thank you all for listening
[Applause]
[Music]
Browse More Related Video
How do SQL Indexes Work
Google SWE teaches systems design | EP27: Search Indexes
How indexes work in Distributed Databases, their trade-offs, and challenges
DynamoDB: Under the hood, managing throughput, advanced design patterns | Jason Hunter | AWS Events
Sequence_data_part_2
Google SWE teaches systems design | EP1: Database Design
5.0 / 5 (0 votes)