Query Optimization - Tunning Indexes for High Performance
- Table of content
- Types of Indexing
- B-tree structure and variances [[#B-Tree structure]]
- Advantages of using B-tree
- When B-tree can’t be used
- Hash indexes [[#Hash Indexes]]
- Spatial Tree indexes
- Indexing Strategies for high performance [[#Indexing Strategies for high performance]]
- Isolating Columns
- Prefix indexes and Index Selectivity
- Multi-Column index
- Choosing a Good column order
- Clustered Index and Non-Clustered Index
- Covering Indexes
- Indexing and Locking
- Index and Table Maintaince [[#Indexing and Locking]]
- Resources [[#Resources]]
- Types of Indexing
Introduction
[!info] You can head over to this Github repo in which I’ve maintained advanced concepts in DB besides other resources and notes.
Intro
Indexes (also called “keys” in MySQL) are data structures that storage engines use to find rows quickly. They also have several other beneficial properties that we’ll explorein this chapter.
Thus, they are not standardized: indexing works slightly differently in each engine, and not all engines support all types of indexes. Even when multiple engines support the same index type, they might implement it differently under the hood
In MySQL, a storage engine uses indexes in as in book’s index you used to search for pages containing specific terms. It searches the index’s data structure for a value. When it finds a match, it can find the row that contains the match. Suppose you run the following query
SELECT first_name FROM sakila.actor WHERE actor_id = 5;
[!info] An index contains values from one or more columns in a table. If you index more than one column, the column order is very important, because MySQL can only search efficiently on a leftmost prefix of the index
Creating an index on two columns is not the same as creating two separate single-column indexes, as you’ll see.
Type of indexes
There are many types of indexes, each designed to perform well for different purposes. We’re going to cover
- B-Tree Index
- Hash Index
- Bitmap Index
[!info] Indexes are implemented in the storage engine layer, not the server layer.
MySQL used B-tree index by default on CREATE TABLE
and other statements. However, storage engines might use different storage structures internally.
Storage engines use B-Tree indexes in various ways, which can affect performance. For instance, MyISAM uses a prefix compression technique that makes indexes smaller, but InnoDB leaves values uncompressed in its indexes. Also, MyISAM indexes refer to the indexed rows by their physical storage locations, but InnoDB refers to them by their primary key values which is the cluster key. [more on that later]. Each variation has benefits and drawbacks.
the general idea of of B-tree is that all the values are stored in order, and each leaf page is the same distance from the root.
This figure shows an abstraction of a B-tree index used by InnoDB storage engine
Leaf pages are special, because they have pointers to the indexed data instead of pointers to other pages. (Different storage engines have different types of “pointers” to the data.)
Because B-tree store the indexes columns in order, they’re useful for searchhind for ranges of data.
Let’s take an example
CREATE TABLE People (
last_name varchar(50) not null,
first_name varchar(50) not null,
dob date not null,
gender enum('m', 'f')not null,
key(last_name, first_name, dob)
);
The index here will contain the values from the lat_name,, first_name, and dob columns for every row in the table
Here’s how the index arranges the data it stores
Notice that the index sorts the values according to the order of the columns given in the index in the CREATE TABLE statement
It’s way more important to know when to use B-tree and when not to!
There’re some types of queries that can use a B-tree index, which are the ones that can be used for lookups by
- full key value
- key range
- key prefix
[!danger] They’re useful only if the lookup uses a leftmost prefix of the index.
Taking the previous example in considerations, let’s look when the index will be used and when it not would be!
Match the full value key Specifying values for all columns in the index.
SELECT first_name FROM People WHERE last_name = 'Allen' AND first_name = 'Cuba' AND dob = '1960-01-01'
Match a leftmost prefix Find all people with the last name Allen, this will use only the first column in the index
SELECT first_name FROM People WHERE last_name = 'Allen'
Match a column prefix Matching on the first part of a column’s value, for example, finding all people whose last names begin with J
SELECT first_name FROM People WHERE last_name LIKE 'J%'
Match a range of values You an find people whose last names are between Allen and Meska
SELECT first_name FROM People WHERE last_name BETWEEN 'Allen' AND 'Meska';
Match one part exactly and match a range on another part This index can help you find everyone whose last name is Allen and whose first name starts with the letter K (Kim, Karl, etc.). This is an exact match on last_ name and a range query on first_name
SELECT first_name FROM People WHERE last_name = 'Allen' AND first_name LIKE 'K%'
Match index-only queries B-Tree indexes can normally support index-only queries, which are queries that access only the index, not the row storage. We discuss this optimization later in “Covering Indexes”
[!info] Another important benefit from using B-tree is finding values in sorted order as well lookups, Because B-tree’s nodes are sorted.
So B-tree index would be helpful for ORDER BY
and GROUP BY
clauses
Drawbacks and limitations of using B-tree index
They’re not useful if the lookup does not start from the leftmost side of the indexed colum
SELECT first_name FROM People WHERE first_name = 'Yousef';
The ubove quey will not use the index. Also you can’t find people whose last name ends with a particular letter, for examaple the below query will not use the index.
SELECT first_name FROM People WHERE last_name LIKE '%J';
The storage engine can’t optimize accesses with any columns to the right of the first range condition. For example, if your query is
WHERE last_name="Smith" AND first_name LIKE 'J%' AND dob='1976-12-23'
the index access will use only the first two columns in the index, because the LIKE is a range condition (the server can use the rest of the columns for other purposes, though). For a column that has a limited number of values, you can often work around this by specifying equality conditions instead of range conditions. We show detailed examples of this in the indexing case study later in this chapter
Some of these limitations are not inherent to B-Tree indexes, but are a result of how the MySQL query optimizer and storage engines use indexes. Some of them might be removed in the future
Now you know why we said the column order is extremely important: these limitations are all related to column ordering. For optimal performance, you might need to create indexes with the same columns in different orders to satisfy your queries.
if the information you’re searching for in the index itself, it’s called “inline query”, like EXPLAIN ANALYZE SELECT id from employees where id = 1; # (Id where is primary key (btree)) It will show you that Heap Fetches are 0 opposing to this query SELECT name from employees where id = 2 (name has no index on it ) So, First thing will happen, the engine will go to the page using the id (on the index) which containthe information we need about name (which is stored on disk [another read]) If we do the query above again, it will result in much smaller time, because caching Let’s look at another query
SELECT id from employee where name = 'Zsh';
this will result in Sequential scan which cost alot of time. [Fulltable scan]. Although mySql does something intelligent somehow by applying worker threads so it can dsequential scan in parallel If we now make an index on name
CREATE INDEX employee_idx on employees(name)
$ Bitmap index created Now let’s seaerch
SELECT id,name FROM employees Where name = 'Yousef'; (index will be used)
SELECT id,name FROM employees WHERE name like '%You%' (index will not be used) #
becausactually this expression is not a single value, we’ve alot of possabilites.
Clustered Index vs Non-Clustered Index
Clustered Index (only 1 on the table, mostly the primary key)
Non-clustered index
suppose we’ve created a non-clustered index on the name column
Row locators contain:
- Cluster key (employee_id)
- actual row [name]
Both clustered index and non-clustered index are working together to find the data!
What are the steps used to find the actual row ?
- SQL servers used non-clustered index on the name column to quicly find the employee entry on the index
- Clustered index (employee_id) is used to find the actual row!
In sql server we read the execution plan from top to bottom and from right to left!
When a query navigate through the clustered index tree to the base table data, this is called Clustered index seek The clustered index contains the base table data itself, this is why you can create one clustere index
non-clustered index is seperate from the base data, the base data could exist instead as clustered index the data directly available may be limited because usually non-clustered indexes only include a subset of columns from the table
if values from columns not in the index are requested, the query may navigate back to the base data using the references mentioned earlier If the query optimizer decides that doing that is too expensive, it may fall back on scaning the base data instead of using the index
Non-clustered index begin seperate from the base data brings about a couple of important features
Filtered indexes only contains rows that meet a user-defind predicate, and to create these you use WHERE
clause on the index definition, so clustered index can’t be used because it has to contain all the data on the table.
CREATE NONCLUSTERED INDEX IX_PhoneBook_NCI
ON dbo.PhoneBook(LastName, FirstName)
WHERE (LastName >= 'Yousef');
Summary: A clusterd index is a way of representing data as a whole A non-clustered index is a physically seperate structure that refernces the base data and can different sort order
/** Clustered index **/
ALTER TABLE dbo.MyTable
ADD CONSTRAINT PK_MyTable PRIMARY KEY(Id);
/**Nonclustered index**/
ALTER TABLE dbo.MyTable
ADD CONSTRAINT UC_MyTable_Id UNIQUE(Id);
Benefits of Indexes
FROM CMU Lecture: Indexing I and II