Optimization and Indexes

Indexes are used to find rows with specific column values quickly. Without an index, MySQL must
begin with the first row and then read through the entire table to find the relevant rows. The larger the
table, the more this costs. If the table has an index for the columns in question, MySQL can quickly
determine the position to seek to in the middle of the data file without having to look at all the data. This
is much faster than reading every row sequentially.

index 用于加速查询.

(PS: 就最简单来说, 索引相当于一个人的特征. 查询数据时不一定需要完全匹配, 也不一定一开始就需要完全匹配搜索, 这时候使用索引就相当于找人时使用年龄, 性别等特征来寻找一样, 是一个很简单也很重要的概念. 如何寻找并规定 index 是极其重要的一环)

Most MySQL indexes (PRIMARY KEY, UNIQUE, INDEX, and FULLTEXT) are stored in B-trees.
Exceptions: Indexes on spatial data types use R-trees; MEMORY tables also support hash indexes;
InnoDB uses inverted lists for FULLTEXT indexes.

大多数 MySQL 索引类型为 B-trees. 除了某些特殊案例: 特殊数据类型使用 R-trees. 内存表支持 hash 索引. InnoDB 对全文索引使用倒排序表.

In general, indexes are used as described in the following discussion. Characteristics specific to hash
indexes (as used in MEMORY tables) are described in Section 8.3.9, “Comparison of B-Tree and Hash
Indexes”.
MySQL uses indexes for these operations:
• To find the rows matching a WHERE clause quickly.

加速查询

• To eliminate rows from consideration. If there is a choice between multiple indexes, MySQL normally
uses the index that finds the smallest number of rows (the most selective index).
• If the table has a multiple-column index, any leftmost prefix of the index can be used by the
optimizer to look up rows. For example, if you have a three-column index on (col1, col2,
col3), you have indexed search capabilities on (col1), (col1, col2), and (col1, col2,
col3).

• To retrieve rows from other tables when performing joins. MySQL can use indexes on columns
more efficiently if they are declared as the same type and size. In this context, VARCHAR and CHAR
are considered the same if they are declared as the same size. For example, VARCHAR(10) and
CHAR(10) are the same size, but VARCHAR(10) and CHAR(15) are not.
For comparisons between nonbinary string columns, both columns should use the same character
set. For example, comparing a utf8 column with a latin1 column precludes use of an index.
Comparison of dissimilar columns (comparing a string column to a temporal or numeric column, for
example) may prevent use of indexes if values cannot be compared directly without conversion. For
a given value such as 1 in the numeric column, it might compare equal to any number of values in
the string column such as ‘1’, ‘ 1’, ‘00001’, or ‘01.e1’. This rules out use of any indexes for
the string column.

同样长度和类型的索引在多表查询时可以提高效率. 类型不一致, 长度不一致, 字符集不一致等都会造成无法使用索引的情况.

• To find the MIN() or MAX() value for a specific indexed column key_col. This is optimized by a
preprocessor that checks whether you are using WHERE key_part_N = constant on all key
parts that occur before key_col in the index. In this case, MySQL does a single key lookup for each
MIN() or MAX() expression and replaces it with a constant. If all expressions are replaced with
constants, the query returns at once. For example:
SELECT MIN(key_part2),MAX(key_part2)
FROM tbl_name WHERE key_part1=10;
• To sort or group a table if the sorting or grouping is done on a leftmost prefix of a usable index (for
example, ORDER BY key_part1, key_part2). If all key parts are followed by DESC, the key
is read in reverse order. (Or, if the index is a descending index, the key is read in forward order.)

加速 ORDER BY

• In some cases, a query can be optimized to retrieve values without consulting the data rows. (An
index that provides all the necessary results for a query is called a covering index.) If a query uses
from a table only columns that are included in some index, the selected values can be retrieved from
the index tree for greater speed:
SELECT key_part3 FROM tbl_name
WHERE key_part1=1
Indexes are less important for queries on small tables, or big tables where report queries process most
or all of the rows. When a query needs to access most of the rows, reading sequentially is faster than
working through an index. Sequential reads minimize disk seeks, even if not all the rows are needed for
the query. See Section 8.2.1.23, “Avoiding Full Table Scans” for details.

索引存在两种不适用的情况 : 一种是对于数据少的表查询. 另外一种是需要访问大部分表数据的查询, 这种情况下的顺序读取要比索引快, 即使大部分数据都不需要.

Primary Key Optimization

The primary key for a table represents the column or set of columns that you use in your most vital
queries. It has an associated index, for fast query performance. Query performance benefits from
the NOT NULL optimization, because it cannot include any NULL values. With the InnoDB storage
engine, the table data is physically organized to do ultra-fast lookups and sorts based on the primary
key column or columns.

表的主键是查询中使用的列. 主键有相关的索引, 用于加速查询.

If your table is big and important, but does not have an obvious column or set of columns to use as a
primary key, you might create a separate column with auto-increment values to use as the primary key.
These unique IDs can serve as pointers to corresponding rows in other tables when you join tables
using foreign keys.

如果没有明显的列作为主键, 可以创建一个自增列作为主键. 唯一 ID 可作为外键.

Multiple-Column Indexes

多列索引是由多个列组从的索引(= = 废话)… 其使用方式必须从左到右, 比如索引 (i1, i2, i3), 则使用 (i1), (i1, i2), (i1, i2, i3) 查询时可使用索引, 其余情况均不可使用索引 (与顺序无关, 与存在有关).

Hash Index Characteristics

Hash indexes have somewhat different characteristics from those just discussed:
• They are used only for equality comparisons that use the = or <=> operators (but are very fast). They
are not used for comparison operators such as < that find a range of values. Systems that rely on
this type of single-value lookup are known as “key-value stores”; to use MySQL for such applications,
use hash indexes wherever possible.

只能用于等式/不等式. (PS: 意料之中, hash 并没有任何比较)

• The optimizer cannot use a hash index to speed up ORDER BY operations. (This type of index cannot
be used to search for the next entry in order.)

不能用于 ORDER BY 优化 (PS: 同理)

• MySQL cannot determine approximately how many rows there are between two values (this is used
by the range optimizer to decide which index to use). This may affect some queries if you change a
MyISAM or InnoDB table to a hash-indexed MEMORY table.

不能得知两值之间存在多少行/

• Only whole keys can be used to search for a row. (With a B-tree index, any leftmost prefix of the key
can be used to find rows.)

只支持全值匹配.

(PS: 这些限制和特性的根源就是 hash 自身的特性)

Use of Index Extensions

InnoDB automatically extends each secondary index by appending the primary key columns to it.

Consider this table definition:
CREATE TABLE t1 (
i1 INT NOT NULL DEFAULT 0,
i2 INT NOT NULL DEFAULT 0,
d DATE DEFAULT NULL,
PRIMARY KEY (i1, i2),
INDEX k_d (d)
) ENGINE = InnoDB;

This table defines the primary key on columns (i1, i2). It also defines a secondary index k_d on
column (d), but internally InnoDB extends this index and treats it as columns (d, i1, i2).
The optimizer takes into account the primary key columns of the extended secondary index when
determining how and whether to use that index. This can result in more efficient query execution plans
and better performance.
The optimizer can use extended secondary indexes for ref, range, and index_merge index access,
for Loose Index Scan access, for join and sorting optimization, and for MIN()/MAX() optimization.
The following example shows how execution plans are affected by whether the optimizer uses
extended secondary indexes. Suppose that t1 is populated with these rows:

MySQL 自动扩展次级索引.

Optimizer Use of Generated Column Indexes

MySQL supports indexes on generated columns. For example:

CREATE TABLE t1 (f1 INT, gc INT AS (f1 + 1) STORED, INDEX (gc));

The generated column, gc, is defined as the expression f1 + 1. The column is also indexed and the
optimizer can take that index into account during execution plan construction. In the following query,
the WHERE clause refers to gc and the optimizer considers whether the index on that column yields a
more efficient plan:

SELECT * FROM t1 WHERE gc > 9;

The optimizer can use indexes on generated columns to generate execution plans, even in the
absence of direct references in queries to those columns by name. This occurs if the WHERE, ORDER
BY, or GROUP BY clause refers to an expression that matches the definition of some indexed generated
column. The following query does not refer directly to gc but does use an expression that matches the
definition of gc:

SELECT * FROM t1 WHERE f1 + 1 > 9;

The optimizer recognizes that the expression f1 + 1 matches the definition of gc and that gc
is indexed, so it considers that index during execution plan construction. You can see this using
EXPLAIN:

mysql> EXPLAIN SELECT * FROM t1 WHERE f1 + 1 > 9\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: t1
partitions: NULL
type: range
possible_keys: gc
key: gc
key_len: 5
ref: NULL
rows: 1
filtered: 100.00
Extra: Using index condition

generated column 指的是一个包含其他字段的表达式.

Descending Indexes

MySQL supports descending indexes: DESC in an index definition is no longer ignored but causes
storage of key values in descending order. Previously, indexes could be scanned in reverse order but
at a performance penalty. A descending index can be scanned in forward order, which is more efficient.
Descending indexes also make it possible for the optimizer to use multiple-column indexes when the
most efficient scan order mixes ascending order for some columns and descending order for others.
Consider the following table definition, which contains two columns and four two-column index
definitions for the various combinations of ascending and descending indexes on the columns:

CREATE TABLE t (
c1 INT, c2 INT,
INDEX idx1 (c1 ASC, c2 ASC),
INDEX idx2 (c1 ASC, c2 DESC),
INDEX idx3 (c1 DESC, c2 ASC),
INDEX idx4 (c1 DESC, c2 DESC)
);

The table definition results in four distinct indexes. The optimizer can perform a forward index scan for
each of the ORDER BY clauses and need not use a filesort operation:

ORDER BY c1 ASC, c2 ASC -- optimizer can use idx1
ORDER BY c1 DESC, c2 DESC -- optimizer can use idx4
ORDER BY c1 ASC, c2 DESC -- optimizer can use idx2
ORDER BY c1 DESC, c2 ASC -- optimizer can use idx3

可以指定索引列的顺序, 以用于 ORDER BY 优化.

Indexed Lookups from TIMESTAMP Columns

Temporal values are stored in TIMESTAMP columns as UTC values, and values inserted into and
retrieved from TIMESTAMP columns are converted between the session time zone and UTC. (This is
the same type of conversion performed by the CONVERT_TZ() function. If the session time zone is
UTC, there is effectively no time zone conversion.)

在 TIMESTAMP 中的值以 UTC 规范存储. 读取和写入在会话时区和 UTC 之间转换. (效果和使用 CONVERT_TZ 函数一样. 如果时区已经是 UTC 了, 那么不需要转换)

Due to conventions for local time zone changes such as Daylight Saving Time (DST), conversions
between UTC and non-UTC time zones are not one-to-one in both directions. UTC values that are
distinct may not be distinct in another time zone. The following example shows distinct UTC values that
become identical in a non-UTC time zone:

mysql> CREATE TABLE tstable (ts TIMESTAMP);
mysql> SET time_zone = 'UTC'; -- insert UTC values
mysql> INSERT INTO tstable VALUES
('2018-10-28 00:30:00'),
('2018-10-28 01:30:00');
mysql> SELECT ts FROM tstable;
+---------------------+
| ts |
+---------------------+
| 2018-10-28 00:30:00 |
| 2018-10-28 01:30:00 |
+---------------------+
mysql> SET time_zone = 'MET'; -- retrieve non-UTC values
mysql> SELECT ts FROM tstable;
+---------------------+
| ts |
+---------------------+
| 2018-10-28 02:30:00 |
| 2018-10-28 02:30:00 |
+---------------------+

Other

why b-trees ?

很好奇为什么是 b 树, 到底优势在哪里? answer

Algorithmic complexity is the same, since O(logb n) = O(c log n) = O(log n) but the constant factors, which are hidden in big-O notation, could vary noticeably, depending on implementation and hardware.

算法的复杂度是一样的.

B-trees were designed for platter hard disks, which have a large access time (moving the head into position) after which an entire physical sector is read. Making the B-tree nodes as large as the sector minimizes the number of access times and maximizes the useful data out of each read operation.

B-trees 为硬盘而设计, 将其节点设置为最小扇区大小可以减少访问时间, 最大化利用每个读取出的数据.

But if you are working out of memory you have a negligible access time, therefore a better comparison is to count the number of single words accessed by your algorithm.

如果用完了内存, 将会有轻微的消耗. 一个对比点是计算算法访问的单个字符数.

For example, let’s plan a data structure to store 220 keys of 1 word each, for a total of 4MiB of raw data on a 32bit machine.

A binary search tree will have 220 nodes, each holding one key and two pointers (3 words). Depth will be log2(220) = 20. The average search will have to read the key and one of the pointers from each node in its path, from the root all the way down = 40 words.

A B-tree made for hard disks will have 4kB nodes. Each node could be stored internally as a sorted array of key and pointer couples, between 256 and 512 of them. What is the average search going to look like? Considering an average 3/4 fill, each node will contain 384 entries, and its internal binary search will have to visit on average log2(384) = 5.95 keys. The average depth will be log384(220) = 2.33, so our search will have to read on average 2.33 times 5.95 keys, or about 14 words.

(PS: 这里因为格式的问题, 220 的含义是 2的 20次方. 这里想表达的是: 如果是 bst 存储, 每个节点可能分布在不同的扇区上, 一次磁盘访问得到的有用信息最坏的情况是只能使用一次, 而 B-trees 将其尽可能最大化, 为 s(扇区大小) / n(单个节点大小) 次, 所以虽然搜索的效率是基本一致的, 但是其每次 I/O 所得的有效率是完全不同的. 这是 B-trees 广泛应用于数据库的根源所在.)

In the case of a low-fanout (branching factor) B-tree, with each node holding between 16 and 32 keys, the average fill will be 24 keys, the average depth log24(220) = 4.36, the binary search in each node will make log2(24) = 4.58 comparisons, and the overall average search will have to read about 20 words.

Keep in mind that the last two data structures achieve a better result than binary trees because they optimize read operations over modifications. To insert a key into one of these B-trees you would have to rewrite on average an entire 384-word or 24-word node, if not more than one, while in the binary-tree case a write operation would still only need to touch up to 40 words.

(PS: 这里说的是关于数据插入, B-trees 可能存在最坏情况, 虽然存在的最好情况优于 bst)

R-trees

R-trees 是专用于空间存储的数据结构. 其结构和 B-trees 类似, 专用于存储空间信息, 子节点和父节点的关系为相交. 父节点包含子节点(但不一定完全)

R-trees are tree data structures used for spatial access methods, i.e., for indexing multi-dimensional information such as geographical coordinates, rectangles or polygons.

R-trees 是用于索引多维信息, 如地理坐标, 矩阵, 或多边形的数据结构.

(PS: 我这里再次深思, 什么是数据结构? 数据的组织方式. 数据的组织方式由其访问方式及存储方式决定. 不同的数据结构就是为了不同的目的而生的. 例如 B-trees 它针对访问做了优化. 尽可能减少其 I/O, 以提高效率. 变量的前后, 热点. 都是很重要的点. 再考虑一下 B+tree, 他的优点是叶子节点包含所有数据, 当进行全表访问时, 就是一个线性的时间, 同时, 可以使用二分. 可能是处于这样的目的, 才有了优化版本的 B+trees. 而 R-trees , 用于多维信息. 其父子节点的关系变成了类似包含/被包含的关系. 可以处理一些例如 xx 离 xx 多远. xx 附近有哪些建筑… 广泛来说, 是 “组织”. 身边的事物都有组织, 公司的管理级. 省市区的划分. 电脑的构成… )