Original article

This page is a work in progress. Currently it is just a list of things you should check if you have problems with memory.

Special encoding of small aggregate data types

Since Redis 2.2 many data types are optimized to use less space up to a certain size. Hashes, Lists, Sets composed of just integers, and Sorted Sets, when smaller than a given number of elements, and up to a maximum element size, are encoded in a very memory efficient way that uses up to 10 times less memory (with 5 time less memory used being the average saving).

Starting with Redis 2.2, many data types are optimized to use less space: hashes, lists, sets composed only of integers, and sorted sets.

When they hold fewer than a given number of elements, and each element stays below a maximum size, the data is encoded in a very memory-efficient way that can use up to 10 times less memory (5 times less on average).

This is completely transparent from the point of view of the user and API. Since this is a CPU / memory trade off it is possible to tune the maximum number of elements and maximum element size for special encoded types using the following redis.conf directives.

From the point of view of the user and the API, this is completely transparent.

Since it is a CPU/memory trade-off, the maximum number of elements and the maximum element size for the specially encoded types can be tuned with the following redis.conf directives.

hash-max-zipmap-entries 512 (hash-max-ziplist-entries for Redis >= 2.6)
hash-max-zipmap-value 64 (hash-max-ziplist-value for Redis >= 2.6)
list-max-ziplist-entries 512
list-max-ziplist-value 64
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
set-max-intset-entries 512

If a specially encoded value will overflow the configured max size, Redis will automatically convert it into normal encoding. This operation is very fast for small values, but if you change the setting in order to use specially encoded values for much larger aggregate types the suggestion is to run some benchmark and test to check the conversion time.

If a specially encoded value grows beyond the configured maximum, Redis automatically converts it to the normal encoding. The conversion is very fast for small values, but if you raise these settings so that much larger aggregates stay specially encoded, it is a good idea to run some benchmarks to measure the conversion time.

Using 32 bit instances

Redis compiled with 32 bit target uses a lot less memory per key, since pointers are small, but such an instance will be limited to 4 GB of maximum memory usage. To compile Redis as 32 bit binary use make 32bit. RDB and AOF files are compatible between 32 bit and 64 bit instances (and between little and big endian of course) so you can switch from 32 to 64 bit, or the contrary, without problems.

Redis compiled as a 32-bit binary uses much less memory per key, because pointers shrink from 8 bytes to 4, but such an instance is limited to 4 GB of memory in total.

Use make 32bit to compile Redis as a 32-bit binary. RDB and AOF files are compatible between 32-bit and 64-bit instances (and between little and big endian as well), so you can switch from 32 to 64 bit, or the other way around, without problems.

Bit and byte level operations

Redis 2.2 introduced new bit and byte level operations: GETRANGE, SETRANGE, GETBIT and SETBIT. Using these commands you can treat the Redis string type as a random access array. For instance if you have an application where users are identified by a unique progressive integer number, you can use a bitmap in order to save information about the sex of users, setting the bit for females and clearing it for males, or the other way around. With 100 million users this data will take just 12 megabytes of RAM in a Redis instance. You can do the same using GETRANGE and SETRANGE in order to store one byte of information for each user. This is just an example but it is actually possible to model a number of problems in very little space with these new primitives.

Redis 2.2 introduced new bit- and byte-level operations: GETRANGE, SETRANGE, GETBIT and SETBIT. With these commands you can treat the Redis string type as a random access array (??? isn't a string random access anyway?). For example, if users of an application are identified by a unique progressive integer, you can use a bitmap to record each user's sex, setting the bit for females and clearing it for males, or the other way around. With 100 million users this takes only about 12 MB of RAM (100,000,000 bits / 8 ≈ 12 MB).

You can do the same with GETRANGE and SETRANGE to store one byte of information per user instead.
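As a small illustration of the bitmap idea (the key name sex:bits is made up for this example), setting and reading one user's bit looks like this:

SETBIT sex:bits 1234 1
GETBIT sex:bits 1234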

Use hashes when possible

Small hashes are encoded in a very small space, so you should try representing your data using hashes every time it is possible. For instance if you have objects representing users in a web application, instead of using different keys for name, surname, email, password, use a single hash with all the required fields.

If you want to know more about this, read the next section.

Using hashes to abstract a very memory efficient plain key-value store on top of Redis

I understand the title of this section is a bit scaring, but I’m going to explain in details what this is about.

Basically it is possible to model a plain key-value store using Redis where values can just be just strings, that is not just more memory efficient than Redis plain keys but also much more memory efficient than memcached.

On top of Redis you can model a plain key-value store where values are just strings, and it turns out to be not only more memory efficient than plain Redis keys, but also much more memory efficient than memcached.

Let’s start with some fact: a few keys use a lot more memory than a single key containing a hash with a few fields. How is this possible? We use a trick. In theory in order to guarantee that we perform lookups in constant time (also known as O(1) in big O notation) there is the need to use a data structure with a constant time complexity in the average case, like a hash table.

But many times hashes contain just a few fields. When hashes are small we can instead just encode them in an O(N) data structure, like a linear array with length-prefixed key value pairs. Since we do this only when N is small, the amortized time for HGET and HSET commands is still O(1): the hash will be converted into a real hash table as soon as the number of elements it contains will grow too much (you can configure the limit in redis.conf).

This works well not only from the point of view of time complexity, but also from the point of view of constant factors, since a linear array of key value pairs happens to play very well with the CPU cache (it has better cache locality than a hash table).
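A minimal C++ sketch of that idea (this is not Redis source code, just the shape of a ziplist-style small hash): a flat array of key/value pairs, with O(N) lookup that stays compact and cache friendly for small N.

#include <string>
#include <utility>
#include <vector>

// A "small hash" stored as a flat array of key/value pairs.
// Lookup is a linear scan, but for a handful of fields this is both
// smaller and more cache friendly than a real hash table.
struct SmallHash {
    std::vector<std::pair<std::string, std::string>> entries;

    const std::string* get(const std::string& field) const {
        for (const auto& kv : entries)            // O(N) scan
            if (kv.first == field) return &kv.second;
        return nullptr;
    }

    void set(const std::string& field, const std::string& value) {
        for (auto& kv : entries)                  // overwrite if present
            if (kv.first == field) { kv.second = value; return; }
        entries.push_back({field, value});        // otherwise append
    }
};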

However since hash fields and values are not (always) represented as full featured Redis objects, hash fields can’t have an associated time to live (expire) like a real key, and can only contain a string. But we are okay with this, this was anyway the intention when the hash data type API was designed (we trust simplicity more than features, so nested data structures are not allowed, as expires of single fields are not allowed).

So hashes are memory efficient. This is very useful when using hashes to represent objects or to model other problems when there are group of related fields. But what about if we have a plain key value business?

Imagine we want to use Redis as a cache for many small objects, that can be JSON encoded objects, small HTML fragments, simple key -> boolean values and so forth. Basically anything is a string -> string map with small keys and values.

Now let’s assume the objects we want to cache are numbered, like:

  • object:102393
  • object:1234
  • object:5

This is what we can do. Every time we need to perform a SET operation to store a new value, we actually split the key into two parts: one part is used as the key, and the other part is used as the field name for the hash. For instance the object named “object:1234” is actually split into:

  • a Key named object:12
  • a Field named 34

So we use all the characters but the latest two for the key, and the final two characters for the hash field name. To set our key we use the following command:

HSET object:12 34 somevalue

As you can see every hash will end containing 100 fields, that is an optimal compromise between CPU and memory saved.

There is another very important thing to note, with this schema every hash will have more or less 100 fields regardless of the number of objects we cached. This is since our objects will always end with a number, and not a random string. In some way the final number can be considered as a form of implicit pre-sharding.

What about small numbers? Like object:2? We handle this case using just “object:” as a key name, and the whole number as the hash field name. So object:2 and object:10 will both end inside the key “object:”, but one as field name “2” and one as “10”.
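For instance, object:2 would be stored as (somevalue again being just a placeholder):

HSET object: 2 somevalue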

How much memory we save this way?

I used the following Ruby program to test how this works:

require 'rubygems'
require 'redis'

UseOptimization = true

def hash_get_key_field(key)
  s = key.split(":")
  if s[1].length > 2
    {:key => s[0]+":"+s[1][0..-3], :field => s[1][-2..-1]}
  else
    {:key => s[0]+":", :field => s[1]}
  end
end

def hash_set(r,key,value)
  kf = hash_get_key_field(key)
  r.hset(kf[:key],kf[:field],value)
end

def hash_get(r,key)
  kf = hash_get_key_field(key)
  r.hget(kf[:key],kf[:field])
end

r = Redis.new
(0..100000).each{|id|
  key = "object:#{id}"
  if UseOptimization
    hash_set(r,key,"val")
  else
    r.set(key,"val")
  end
}

This is the result against a 64 bit instance of Redis 2.2:

  • UseOptimization set to true: 1.7 MB of used memory
  • UseOptimization set to false: 11 MB of used memory

This is an order of magnitude, I think this makes Redis more or less the most memory efficient plain key value store out there.

WARNING: for this to work, make sure that in your redis.conf you have something like this:

hash-max-zipmap-entries 256

Also remember to set the following field according to the maximum size of your keys and values:

hash-max-zipmap-value 1024

Every time a hash will exceed the number of elements or element size specified it will be converted into a real hash table, and the memory saving will be lost.

You may ask, why don’t you do this implicitly in the normal key space so that I don’t have to care? There are two reasons: one is that we tend to make trade offs explicit, and this is a clear tradeoff between many things: CPU, memory, max element size. The second is that the top level key space must support a lot of interesting things like expires, LRU data, and so forth so it is not practical to do this in a general way.

But the Redis Way is that the user must understand how things work so that he is able to pick the best compromise, and to understand how the system will behave exactly.

Memory allocation

To store user keys, Redis allocates at most as much memory as the maxmemory setting enables (however there are small extra allocations possible).

The exact value can be set in the configuration file or set later via CONFIG SET (see Using memory as an LRU cache for more info). There are a few things that should be noted about how Redis manages memory:

  • Redis will not always free up (return) memory to the OS when keys are removed. This is not something special about Redis, but it is how most malloc() implementations work. For example if you fill an instance with 5GB worth of data, and then remove the equivalent of 2GB of data, the Resident Set Size (also known as the RSS, which is the number of memory pages consumed by the process) will probably still be around 5GB, even if Redis will claim that the user memory is around 3GB. This happens because the underlying allocator can’t easily release the memory. For example often most of the removed keys were allocated in the same pages as the other keys that still exist.
  • The previous point means that you need to provision memory based on your peak memory usage. If your workload from time to time requires 10GB, even if most of the times 5GB could do, you need to provision for 10GB.
  • However allocators are smart and are able to reuse free chunks of memory, so after you freed 2GB of your 5GB data set, when you start adding more keys again, you’ll see the RSS (Resident Set Size) to stay steady and don’t grow more, as you add up to 2GB of additional keys. The allocator is basically trying to reuse the 2GB of memory previously (logically) freed.
  • Because of all this, the fragmentation ratio is not reliable when your peak memory usage was much larger than the currently used memory. The fragmentation is calculated as the physical memory actually used (the RSS value) divided by the amount of memory currently in use (the sum of all the allocations performed by Redis). Because the RSS reflects the peak memory, when the (virtually) used memory is low because a lot of keys / values were freed, but the RSS is high, the ratio RSS / mem_used will be very high: for example, 5GB of RSS with 2GB currently in use reports a ratio of 2.5 even if there is little real fragmentation.

If maxmemory is not set Redis will keep allocating memory as it finds fit and thus it can (gradually) eat up all your free memory. Therefore it is generally advisable to configure some limit. You may also want to set maxmemory-policy to noeviction (which is not the default value in some older versions of Redis).

It makes Redis return an out of memory error for write commands if and when it reaches the limit - which in turn may result in errors in the application but will not render the whole machine dead because of memory starvation.
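A minimal redis.conf sketch of that advice (the 100mb value is only a placeholder, pick a limit that fits your machine):

maxmemory 100mb
maxmemory-policy noeviction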

Work in progress

Work in progress… more tips will be added soon.

Minimum Window Substring

Find the smallest substring of S that contains all the characters of T.

my solution

const int _ = []() {
    std::ios::sync_with_stdio(false);
    std::cin.tie(NULL);
    return 0;
}();

class Solution {
public:
    string minWindow(string s, string t) {
        if (t.size() > s.size()) return {};

        int had = 0;
        int b = 0, e = 0;
        string tmpt = t;
        map<char, int> used;
        for (int i = 0; i < s.size(); ++i) {
            int indx = tmpt.find(s[i]);
            if (indx == string::npos) {
                ++used[s[i]]; continue;
            }
            tmpt.erase(tmpt.begin() + indx);

            if (had++ == 0) b = i;
            if (tmpt.size() == 0) {
                e = i; break;
            }
        }
        if (had != t.size()) return {};

        int rb = b, re = e;
        while (b < s.size()) {
            char need = s[b++];
            while (b < s.size() && t.find(s[b]) == string::npos) b++;
            if (b >= s.size()) break;

            if (used[need] > 0) {
                if (e - b < re - rb) {
                    rb = b; re = e;
                }
                --used[need]; continue;
            }

            while (++e < s.size() && s[e] != need)
                ++used[s[e]];

            if (e >= s.size()) break;
            else if (e - b < re - rb) {
                rb = b; re = e;
            }
        }

        return s.substr(rb, re - rb + 1);
    }
};

That is quite a lot of code…

In short, my idea is to first find, starting from the left, the smallest substring that contains all the characters, and then slide the window step by step toward the end of S.

That way the complexity is O(n).

the best solution

class Solution {
public:
    string minWindow(string s, string t) {
        vector<int> hash(128, 0);
        for(auto c: t) hash[c]++;
        int count = t.size(), start = 0, end = 0, minStart, minLen = INT_MAX;
        while(end < s.size()) {
            if(hash[s[end]] > 0) count--;
            hash[s[end]]--;
            end++;

            while(count == 0) {
                if(hash[s[start]] == 0) {
                    if(end - start < minLen) {
                        minStart = start;
                        minLen = end - start;
                    }
                    count++;
                }
                hash[s[start]]++;
                start++;
            }
        }
        if(minLen == INT_MAX) return "";
        return s.substr(minStart, minLen);
    }
};
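A quick sanity check of the solution above (assuming the usual LeetCode environment, i.e. <string>, <vector>, <climits> and using namespace std are in scope):

#include <iostream>

int main() {
    Solution sol;
    // Classic example: the smallest window of "ADOBECODEBANC"
    // containing all characters of "ABC" is "BANC".
    std::cout << sol.minWindow("ADOBECODEBANC", "ABC") << std::endl;
}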

It is also a sliding-window approach, but much more concise than mine; implementation-wise it is clearly better.

PS: this is rated as a hard problem, but after finishing it, it did not feel like much…

Is it that all hard problems are, at their core, not actually hard to solve? Or maybe my understanding of them is still not deep enough.

I feel that with a bit more practice I should be able to handle hard problems fairly comfortably :) (getting cocky…)

15.7.3 Locks Set by Different SQL Statements in InnoDB

If a secondary index is used in a search and index record locks to be set are exclusive, InnoDB also retrieves the corresponding clustered index records and sets locks on them.

If a secondary index is used in the search and the index record locks to be set are exclusive, InnoDB also retrieves the corresponding clustered index records and sets locks on them.

If you have no indexes suitable for your statement and MySQL must scan the entire table to process the statement, every row of the table becomes locked, which in turn blocks all inserts by other users to the table. It is important to create good indexes so that your queries do not unnecessarily scan many rows.

If the statement has no suitable index, MySQL must scan the entire table, and every row of the table gets locked, which in turn blocks inserts into that table by other users. It is important to create good indexes so that queries do not scan more rows than necessary.

InnoDB sets specific types of locks as follows.

  • SELECT ... FROM is a consistent read, reading a snapshot of the database and setting no locks unless the transaction isolation level is set to SERIALIZABLE. ForSERIALIZABLE level, the search sets shared next-key locks on the index records it encounters. However, only an index record lock is required for statements that lock rows using a unique index to search for a unique row.

    SELECT … FROM is a consistent read: it reads a snapshot of the database and sets no locks, unless the transaction isolation level is set to SERIALIZABLE.

    At the SERIALIZABLE level, the search sets shared next-key locks on the index records it encounters; however, statements that lock rows using a unique index to search for a unique row only need an index record lock. (PS: I don't really get this yet = =…)

  • SELECT ... FOR UPDATE and SELECT ... FOR SHARE statements that use a unique index acquire locks for scanned rows, and release the locks for rows that do not qualify for inclusion in the result set (for example, if they do not meet the criteria given in the WHERE clause). However, in some cases, rows might not be unlocked immediately because the relationship between a result row and its original source is lost during query execution. For example, in a UNION, scanned (and locked) rows from a table might be inserted into a temporary table before evaluation whether they qualify for the result set. In this circumstance, the relationship of the rows in the temporary table to the rows in the original table is lost and the latter rows are not unlocked until the end of query execution.

    SELECT … FOR UPDATE and SELECT … FOR SHARE statements that use a unique index acquire locks on the scanned rows and release the locks on rows that do not qualify for the result set (for example, rows that do not match the WHERE clause). However, in some cases rows may not be unlocked immediately, because the relationship between a result row and its original source is lost during query execution. For example, in a UNION, scanned (and locked) rows from a table may be inserted into a temporary table before it is known whether they qualify for the result set; in that case the relationship between the temporary-table rows and the original rows is lost, and the original rows are not unlocked until the end of query execution.

  • For locking reads (SELECT with FOR UPDATE or FOR SHARE), UPDATE, and DELETE statements, the locks that are taken depend on whether the statement uses a unique index with a unique search condition, or a range-type search condition.

    For locking reads (SELECT with FOR UPDATE or FOR SHARE), UPDATE and DELETE statements, the locks that are taken depend on whether the statement uses a unique index with a unique search condition, or a range-type search condition.

    • For a unique index with a unique search condition, InnoDB locks only the index record found, not the gap before it.

      For a unique index with a unique search condition, InnoDB locks only the index record that is found, not the gap before it.

    • For other search conditions, and for non-unique indexes, InnoDB locks the index range scanned, using gap locks or next-key locks to block insertions by other sessions into the gaps covered by the range. For information about gap locks and next-key locks, see Section 15.7.1, “InnoDB Locking”.

      For other search conditions, and for non-unique indexes, InnoDB locks the scanned index range, using gap locks or next-key locks (PS: I really don't know how to translate this term elegantly = =) to block insertions by other sessions into the gaps covered by the range.

  • For index records the search encounters, SELECT ... FOR UPDATE blocks other sessions from doing SELECT ... FOR SHARE or from reading in certain transaction isolation levels. Consistent reads ignore any locks set on the records that exist in the read view.

    For the index records the search encounters, SELECT … FOR UPDATE blocks other sessions from doing SELECT … FOR SHARE, or from reading at certain transaction isolation levels. Consistent reads ignore any locks set on the records in the read view.

  • UPDATE ... WHERE ... sets an exclusive next-key lock on every record the search encounters. However, only an index record lock is required for statements that lock rows using a unique index to search for a unique row.

    UPDATE … WHERE … sets an exclusive next-key lock on every record the search encounters; however, statements that lock rows using a unique index to search for a unique row only need an index record lock.

  • When UPDATE modifies a clustered index record, implicit locks are taken on affected secondary index records. The UPDATE operation also takes shared locks on affected secondary index records when performing duplicate check scans prior to inserting new secondary index records, and when inserting new secondary index records.

    When UPDATE modifies a clustered index record, implicit locks are taken on the affected secondary index records. UPDATE also takes shared locks on affected secondary index records when it performs duplicate-check scans before inserting new secondary index records, and when it inserts new secondary index records.

  • DELETE FROM ... WHERE ... sets an exclusive next-key lock on every record the search encounters. However, only an index record lock is required for statements that lock rows using a unique index to search for a unique row.

    DELETE FROM … WHERE … sets an exclusive next-key lock on every record the search encounters; again, only an index record lock is needed for statements that lock rows using a unique index to search for a unique row. (PS: exactly the same wording as above = =)

  • INSERT sets an exclusive lock on the inserted row. This lock is an index-record lock, not a next-key lock (that is, there is no gap lock) and does not prevent other sessions from inserting into the gap before the inserted row.

    INSERT sets an exclusive lock on the inserted row. This is an index-record lock, not a next-key lock (that is, there is no gap lock), so it does not prevent other sessions from inserting into the gap before the inserted row.

    Prior to inserting the row, a type of gap lock called an insert intention gap lock is set. This lock signals the intent to insert in such a way that multiple transactions inserting into the same index gap need not wait for each other if they are not inserting at the same position within the gap. Suppose that there are index records with values of 4 and 7. Separate transactions that attempt to insert values of 5 and 6 each lock the gap between 4 and 7 with insert intention locks prior to obtaining the exclusive lock on the inserted row, but do not block each other because the rows are nonconflicting.

    Before inserting the row, a type of gap lock called an insert intention gap lock is set. This lock signals the intent to insert in such a way that multiple transactions inserting into the same index gap need not wait for each other if they are not inserting at the same position within the gap. (PS: for example, inserting 4 and 5 between 3 and 7 targets the same gap but different positions, so they should not block each other.) Suppose there are index records with values 4 and 7. Two transactions that attempt to insert 5 and 6 each lock the gap between 4 and 7 with an insert intention lock before obtaining an exclusive lock on its inserted row, but they do not block each other because the rows do not conflict.

    If a duplicate-key error occurs, a shared lock on the duplicate index record is set. This use of a shared lock can result in deadlock should there be multiple sessions trying to insert the same row if another session already has an exclusive lock. This can occur if another session deletes the row. Suppose that an InnoDB table t1 has the following structure:

    If a duplicate-key error occurs, a shared lock is set on the duplicate index record. This use of a shared lock can result in deadlock when multiple sessions try to insert the same row while another session already holds an exclusive lock on it, which can happen, for example, if another session deletes the row. Suppose an InnoDB table t1 has the following structure:

    CREATE TABLE t1 (i INT, PRIMARY KEY (i)) ENGINE = InnoDB;

    Now suppose that three sessions perform the following operations in order:

    Session 1:

    START TRANSACTION;
    INSERT INTO t1 VALUES(1);

    Session 2:

    START TRANSACTION;
    INSERT INTO t1 VALUES(1);

    Session 3:

    START TRANSACTION;
    INSERT INTO t1 VALUES(1);

    Session 1:

    ROLLBACK;

    The first operation by session 1 acquires an exclusive lock for the row. The operations by sessions 2 and 3 both result in a duplicate-key error and they both request a shared lock for the row. When session 1 rolls back, it releases its exclusive lock on the row and the queued shared lock requests for sessions 2 and 3 are granted. At this point, sessions 2 and 3 deadlock: Neither can acquire an exclusive lock for the row because of the shared lock held by the other.

    The first operation by session 1 acquires an exclusive lock on the row. The operations by sessions 2 and 3 both result in a duplicate-key error, and both request a shared lock on the row. When session 1 rolls back, it releases its exclusive lock on the row and the queued shared lock requests of sessions 2 and 3 are granted. At this point sessions 2 and 3 deadlock: neither can acquire an exclusive lock on the row because of the shared lock held by the other.

    A similar situation occurs if the table already contains a row with key value 1 and three sessions perform the following operations in order:

    A similar situation occurs if the table already contains a row with key value 1 and the three sessions perform the following operations in order:

    Session 1:

    START TRANSACTION;
    DELETE FROM t1 WHERE i = 1;

    Session 2:

    START TRANSACTION;
    INSERT INTO t1 VALUES(1);

    Session 3:

    START TRANSACTION;
    INSERT INTO t1 VALUES(1);

    Session 1:

    COMMIT;

    The first operation by session 1 acquires an exclusive lock for the row. The operations by sessions 2 and 3 both result in a duplicate-key error and they both request a shared lock for the row. When session 1 commits, it releases its exclusive lock on the row and the queued shared lock requests for sessions 2 and 3 are granted. At this point, sessions 2 and 3 deadlock: Neither can acquire an exclusive lock for the row because of the shared lock held by the other.

  • INSERT ... ON DUPLICATE KEY UPDATE differs from a simple INSERT in that an exclusive lock rather than a shared lock is placed on the row to be updated when a duplicate-key error occurs. An exclusive index-record lock is taken for a duplicate primary key value. An exclusive next-key lock is taken for a duplicate unique key value.

    INSERT … ON DUPLICATE KEY UPDATE differs from a simple INSERT in that, when a duplicate-key error occurs, an exclusive lock rather than a shared lock is placed on the row to be updated: an exclusive index-record lock for a duplicate primary key value, and an exclusive next-key lock for a duplicate unique key value.

  • REPLACE is done like an INSERT if there is no collision on a unique key. Otherwise, an exclusive next-key lock is placed on the row to be replaced.

    If there is no collision on a unique key, REPLACE behaves like an INSERT; otherwise, an exclusive next-key lock is placed on the row to be replaced.

  • INSERT INTO T SELECT ... FROM S WHERE ... sets an exclusive index record lock (without a gap lock) on each row inserted into T. If the transaction isolation level is READ COMMITTED, InnoDB does the search on S as a consistent read (no locks). Otherwise, InnoDB sets shared next-key locks on rows from S. InnoDB has to set locks in the latter case: During roll-forward recovery using a statement-based binary log, every SQL statement must be executed in exactly the same way it was done originally.

    CREATE TABLE ... SELECT ... performs the SELECT with shared next-key locks or as a consistent read, as for INSERT ... SELECT.

    INSERT INTO T SELECT … FROM S WHERE … sets an exclusive index record lock (without a gap lock) on each row inserted into T.

    If the transaction isolation level is READ COMMITTED, InnoDB performs the search on S as a consistent read (no locks); otherwise InnoDB sets shared next-key locks on the rows from S. InnoDB has to set locks in the latter case: during roll-forward recovery from a statement-based binary log, every SQL statement must be executed exactly the way it was executed originally.

    CREATE TABLE … SELECT … performs the SELECT with shared next-key locks or as a consistent read, just like INSERT … SELECT.

    When a SELECT is used in the constructs REPLACE INTO t SELECT ... FROM s WHERE ... or UPDATE t ... WHERE col IN (SELECT ... FROM s ...), InnoDBsets shared next-key locks on rows from table s.

    When a SELECT is used in the constructs REPLACE INTO t SELECT … FROM s WHERE … or UPDATE t … WHERE col IN (SELECT … FROM s …), InnoDB sets shared next-key locks on the rows from table s.

  • While initializing a previously specified AUTO_INCREMENT column on a table, InnoDB sets an exclusive lock on the end of the index associated with the AUTO_INCREMENTcolumn. In accessing the auto-increment counter, InnoDB uses a specific AUTO-INC table lock mode where the lock lasts only to the end of the current SQL statement, not to the end of the entire transaction. Other sessions cannot insert into the table while the AUTO-INC table lock is held; see Section 15.7.2, “InnoDB Transaction Model”.

    While initializing a previously specified AUTO_INCREMENT column on a table, InnoDB sets an exclusive lock on the end of the index associated with the AUTO_INCREMENT column. When accessing the auto-increment counter, InnoDB uses a special AUTO-INC table lock mode in which the lock lasts only until the end of the current SQL statement, not the end of the transaction. Other sessions cannot insert into the table while the AUTO-INC table lock is held.

    InnoDB fetches the value of a previously initialized AUTO_INCREMENT column without setting any locks.

    InnoDB fetches the value of a previously initialized AUTO_INCREMENT column without setting any locks.

  • If a FOREIGN KEY constraint is defined on a table, any insert, update, or delete that requires the constraint condition to be checked sets shared record-level locks on the records that it looks at to check the constraint. InnoDB also sets these locks in the case where the constraint fails.

    If a FOREIGN KEY constraint is defined on a table, any insert, update or delete that requires the constraint condition to be checked sets shared record-level locks on the records it looks at while checking the constraint. InnoDB also sets these locks when the constraint check fails.

  • LOCK TABLES sets table locks, but it is the higher MySQL layer above the InnoDB layer that sets these locks. InnoDB is aware of table locks if innodb_table_locks = 1(the default) and autocommit = 0, and the MySQL layer above InnoDB knows about row-level locks.

    LOCK TABLES sets table locks, but it is the higher MySQL layer above the InnoDB layer that sets them. InnoDB is aware of table locks if innodb_table_locks = 1 (the default) and autocommit = 0, and the MySQL layer above InnoDB knows about row-level locks.

    Otherwise, InnoDB‘s automatic deadlock detection cannot detect deadlocks where such table locks are involved. Also, because in this case the higher MySQL layer does not know about row-level locks, it is possible to get a table lock on a table where another session currently has row-level locks. However, this does not endanger transaction integrity, as discussed in Section 15.7.5.2, “Deadlock Detection and Rollback”. See also Section 15.6.1.6, “Limits on InnoDB Tables”.

    Otherwise, InnoDB's automatic deadlock detection cannot detect deadlocks that involve such table locks. Also, because in this case the higher MySQL layer does not know about row-level locks, it is possible to get a table lock on a table in which another session currently holds row-level locks. However, this does not endanger transaction integrity.

15.6.1.6 Limits on InnoDB Tables

Limits on InnoDB tables are described under the following topics in this section:

Limits on InnoDB tables are discussed under the following topics: maximums and minimums, restrictions on InnoDB tables, and locking and transactions.

Maximums and Minimums
  • A table can contain a maximum of 1017 columns. Virtual generated columns are included in this limit.

    A table can contain at most 1017 columns; virtual generated columns are included in this limit.

  • A table can contain a maximum of 64 secondary indexes.

    A table can have at most 64 secondary indexes.

  • The index key prefix length limit is 3072 bytes for InnoDB tables that use DYNAMIC or COMPRESSED row format.

    For InnoDB tables that use the DYNAMIC or COMPRESSED row format, the index key prefix length is limited to 3072 bytes.

    The index key prefix length limit is 767 bytes for InnoDB tables that use REDUNDANT or COMPACT row format. For example, you might hit this limit with a column prefixindex of more than 191 characters on a TEXT or VARCHAR column, assuming a utf8mb4 character set and the maximum of 4 bytes for each character.

    For InnoDB tables that use the REDUNDANT or COMPACT row format, the index key prefix length is limited to 767 bytes.

    Attempting to use an index key prefix length that exceeds the limit returns an error.

    Attempting to use an index key prefix length that exceeds the limit returns an error.

    The limits that apply to index key prefixes also apply to full-column index keys.

    The limits that apply to index key prefixes also apply to full-column index keys.

  • If you reduce the InnoDB page size to 8KB or 4KB by specifying the innodb_page_size option when creating the MySQL instance, the maximum length of the index key is lowered proportionally, based on the limit of 3072 bytes for a 16KB page size. That is, the maximum index key length is 1536 bytes when the page size is 8KB, and 768 bytes when the page size is 4KB.

    If you reduce the InnoDB page size to 8KB or 4KB by specifying innodb_page_size when creating the MySQL instance, the maximum index key length is lowered proportionally, based on the 3072-byte limit for a 16KB page.

    That is: 3072 bytes for a 16KB page, 1536 bytes for an 8KB page, and 768 bytes for a 4KB page.

  • A maximum of 16 columns is permitted for multicolumn indexes. Exceeding the limit returns an error.

    A multicolumn index may contain at most 16 columns; exceeding the limit returns an error:

    ERROR 1070 (42000): Too many key parts specified; max 16 parts allowed
  • The maximum row length, except for variable-length columns (VARBINARY, VARCHAR, BLOB and TEXT), is slightly less than half of a page for 4KB, 8KB, 16KB, and 32KB page sizes. For example, the maximum row length for the default innodb_page_size of 16KB is about 8000 bytes. However, for an InnoDB page size of 64KB, the maximum row length is approximately 16000 bytes. LONGBLOB and LONGTEXT columns must be less than 4GB, and the total row length, including BLOB and TEXTcolumns, must be less than 4GB.

    Except for variable-length columns (VARBINARY, VARCHAR, BLOB, TEXT), the maximum row length is slightly less than half a page for 4KB, 8KB, 16KB and 32KB page sizes. For example, with the default 16KB innodb_page_size the maximum row length is about 8000 bytes, while with a 64KB page it is about 16000 bytes. LONGBLOB and LONGTEXT columns must be less than 4GB, and the total row length, including BLOB and TEXT columns, must also be less than 4GB.

    If a row is less than half a page long, all of it is stored locally within the page. If it exceeds half a page, variable-length columns are chosen for external off-page storage until the row fits within half a page, as described in Section 15.11.2, “File Space Management”.

    If a row is shorter than half a page, all of it is stored locally within the page; if it exceeds half a page, variable-length columns are chosen for external off-page storage until the row fits within half a page.

  • Although InnoDB supports row sizes larger than 65,535 bytes internally, MySQL itself imposes a row-size limit of 65,535 for the combined size of all columns:

    Although InnoDB internally supports rows larger than 65,535 bytes, MySQL itself imposes a row-size limit of 65,535 bytes on the combined size of all columns:

    mysql> CREATE TABLE t (a VARCHAR(8000), b VARCHAR(10000),
    -> c VARCHAR(10000), d VARCHAR(10000), e VARCHAR(10000),
    -> f VARCHAR(10000), g VARCHAR(10000)) ENGINE=InnoDB;
    ERROR 1118 (42000): Row size too large. The maximum row size for the
    used table type, not counting BLOBs, is 65535. You have to change some
    columns to TEXT or BLOBs

    See Section C.10.4, “Limits on Table Column Count and Row Size”.

  • On some older operating systems, files must be less than 2GB. This is not a limitation of InnoDB itself, but if you require a large tablespace, configure it using several smaller data files rather than one large data file.

    On some older operating systems, files must be smaller than 2GB. This is not a limitation of InnoDB itself, but if you need a large tablespace, configure it as several smaller data files rather than one large data file.

  • The combined size of the InnoDB log files can be up to 512GB.

    The combined size of the InnoDB log files can be at most 512GB.

  • The minimum tablespace size is slightly larger than 10MB. The maximum tablespace size depends on the InnoDB page size.

    The minimum tablespace size is slightly larger than 10MB; the maximum tablespace size depends on the InnoDB page size.

    Table 15.3 InnoDB Maximum Tablespace Size

    InnoDB Page Size Maximum Tablespace Size
    4KB 16TB
    8KB 32TB
    16KB 64TB
    32KB 128TB
    64KB 256TB

    The maximum tablespace size is also the maximum size for a table.

    The maximum tablespace size is also the maximum size for a table.

  • The path of a tablespace file, including the file name, cannot exceed the MAX_PATH limit on Windows. Prior to Windows 10, the MAX_PATH limit is 260 characters. As of Windows 10, version 1607, MAX_PATH limitations are removed from common Win32 file and directory functions, but you must enable the new behavior.

    The path of a tablespace file, including the file name, cannot exceed the MAX_PATH limit on Windows. Before Windows 10 the limit is 260 characters; as of Windows 10 version 1607 the limitation is removed from common Win32 file and directory functions, though the new behavior has to be enabled.

  • The default page size in InnoDB is 16KB. You can increase or decrease the page size by configuring the innodb_page_size option when creating the MySQL instance.

    The default InnoDB page size is 16KB; it can be increased or decreased with the innodb_page_size option when creating the MySQL instance. (PS: so is the page size fixed for the instance once it has been created?)

    32KB and 64KB page sizes are supported, but ROW_FORMAT=COMPRESSED is unsupported for page sizes greater than 16KB. For both 32KB and 64KB page sizes, the maximum record size is 16KB. For innodb_page_size=32KB, extent size is 2MB. For innodb_page_size=64KB, extent size is 4MB.

    A MySQL instance using a particular InnoDB page size cannot use data files or log files from an instance that uses a different page size.

    32KB and 64KB page sizes are also supported, but ROW_FORMAT=COMPRESSED is not supported for page sizes greater than 16KB, and for both 32KB and 64KB pages the maximum record size is still 16KB.

    A MySQL instance cannot use data files or log files from an instance that uses a different page size.

Restrictions on InnoDB Tables
  • ANALYZE TABLE determines index cardinality (as displayed in the Cardinality column of SHOW INDEX output) by performing random dives on each of the index trees and updating index cardinality estimates accordingly. Because these are only estimates, repeated runs of ANALYZE TABLE could produce different numbers. This makesANALYZE TABLE fast on InnoDB tables but not 100% accurate because it does not take all rows into account.

    ANALYZE TABLE determines index cardinality (shown in the Cardinality column of the SHOW INDEX output) by performing random dives (PS: ???) into each index tree and updating the cardinality estimates accordingly.

    Because these are only estimates, repeated runs of ANALYZE TABLE can produce different numbers. This makes ANALYZE TABLE fast on InnoDB tables, but not 100% accurate, because it does not take all rows into account.

    (PS: ANALYZE TABLE analyzes the table, and part of the result, such as the column cardinality here, can then be viewed with SHOW INDEX.)

    You can make the statistics collected by ANALYZE TABLE more precise and more stable by turning on the innodb_stats_persistent configuration option, as explained in Section 15.8.10.1, “Configuring Persistent Optimizer Statistics Parameters”. When that setting is enabled, it is important to run ANALYZE TABLE after major changes to indexed column data, because the statistics are not recalculated periodically (such as after a server restart).

    Turning on innodb_stats_persistent makes the statistics collected by ANALYZE TABLE more precise and more stable. When that option is enabled, it is important to run ANALYZE TABLE after major changes to indexed column data, because the statistics are not recalculated periodically.

    If the persistent statistics setting is enabled, you can change the number of random dives by modifying the innodb_stats_persistent_sample_pages system variable. If the persistent statistics setting is disabled, modify the innodb_stats_transient_sample_pages system variable instead.

    If persistent statistics are enabled, the number of random dives can be changed with the innodb_stats_persistent_sample_pages system variable; if they are disabled, use innodb_stats_transient_sample_pages instead.

    MySQL uses index cardinality estimates in join optimization. If a join is not optimized in the right way, try using ANALYZE TABLE. In the few cases that ANALYZE TABLEdoes not produce values good enough for your particular tables, you can use FORCE INDEX with your queries to force the use of a particular index, or set themax_seeks_for_key system variable to ensure that MySQL prefers index lookups over table scans. See Section B.4.5, “Optimizer-Related Issues”.

    MySQL uses index cardinality estimates in join optimization. If a join is not optimized the right way, try ANALYZE TABLE. In the few cases where ANALYZE TABLE does not produce values good enough for a particular table, you can use FORCE INDEX with your queries to force a particular index, or set the max_seeks_for_key system variable so that MySQL prefers index lookups over table scans.

  • If statements or transactions are running on a table, and ANALYZE TABLE is run on the same table followed by a second ANALYZE TABLE operation, the second ANALYZE TABLE operation is blocked until the statements or transactions are completed. This behavior occurs because ANALYZE TABLE marks the currently loaded table definition as obsolete when ANALYZE TABLE is finished running. New statements or transactions (including a second ANALYZE TABLE statement) must load the new table definition into the table cache, which cannot occur until currently running statements or transactions are completed and the old table definition is purged. Loading multiple concurrent table definitions is not supported.

    If statements or transactions are running on a table, and ANALYZE TABLE is run on that table followed by a second ANALYZE TABLE, the second ANALYZE TABLE blocks until the earlier statements or transactions are finished. This happens because ANALYZE TABLE marks the currently loaded table definition as obsolete when it completes; new statements or transactions (including a second ANALYZE TABLE) must load the new table definition into the table cache, which cannot happen until the currently running statements or transactions complete and the old definition is purged. Loading multiple concurrent table definitions is not supported.

  • SHOW TABLE STATUS does not give accurate statistics on InnoDB tables except for the physical size reserved by the table. The row count is only a rough estimate used in SQL optimization.

    SHOW TABLE STATUS does not give accurate statistics for InnoDB tables, except for the physical size reserved by the table; the row count is only a rough estimate used in SQL optimization.

  • InnoDB does not keep an internal count of rows in a table because concurrent transactions might “see” different numbers of rows at the same time. Consequently, SELECT COUNT(*) statements only count rows visible to the current transaction.

    InnoDB does not keep an internal count of the rows in a table, because concurrent transactions may "see" different numbers of rows at the same time.

    Consequently, SELECT COUNT(*) only counts the rows visible to the current transaction.

    For information about how InnoDB processes SELECT COUNT(*) statements, refer to the COUNT() description in Section 12.20.1, “Aggregate (GROUP BY) Function Descriptions”.

  • On Windows, InnoDB always stores database and table names internally in lowercase. To move databases in a binary format from Unix to Windows or from Windows to Unix, create all databases and tables using lowercase names.

    On Windows, InnoDB always stores database and table names internally in lowercase (so they are case-insensitive there = =).

  • An AUTO_INCREMENT column ai_col must be defined as part of an index such that it is possible to perform the equivalent of an indexed SELECT MAX(*ai_col*) lookup on the table to obtain the maximum column value. Typically, this is achieved by making the column the first column of some table index.

    An AUTO_INCREMENT column ai_col must be defined as part of an index, so that the equivalent of an indexed SELECT MAX(ai_col) lookup can be performed to obtain the maximum column value. Typically this is achieved by making the column the first column of some table index.

  • InnoDB sets an exclusive lock on the end of the index associated with the AUTO_INCREMENT column while initializing a previously specified AUTO_INCREMENT column on a table.

    While initializing a previously specified AUTO_INCREMENT column on a table, InnoDB sets an exclusive lock on the end of the index associated with that column.

    With innodb_autoinc_lock_mode=0, InnoDB uses a special AUTO-INC table lock mode where the lock is obtained and held to the end of the current SQL statement while accessing the auto-increment counter. Other clients cannot insert into the table while the AUTO-INC table lock is held. The same behavior occurs for “bulk inserts”with innodb_autoinc_lock_mode=1. Table-level AUTO-INC locks are not used with innodb_autoinc_lock_mode=2. For more information, See Section 15.6.1.4, “AUTO_INCREMENT Handling in InnoDB”.

    With innodb_autoinc_lock_mode=0, InnoDB uses a special AUTO-INC table lock mode: while accessing the auto-increment counter, the lock is held until the end of the current SQL statement rather than the end of the transaction, and other clients cannot insert into the table while the AUTO-INC table lock is held.

  • When an AUTO_INCREMENT integer column runs out of values, a subsequent INSERT operation returns a duplicate-key error. This is general MySQL behavior.

    When an AUTO_INCREMENT integer column runs out of values, a subsequent INSERT returns a duplicate-key error; this is general MySQL behavior.

  • DELETE FROM *tbl_name* does not regenerate the table but instead deletes all rows, one by one.

    DELETE FROM tbl_name does not regenerate the table; it deletes all rows one by one.

  • Cascaded foreign key actions do not activate triggers.

    Cascaded foreign key actions do not activate triggers.

  • You cannot create a table with a column name that matches the name of an internal InnoDB column (including DB_ROW_ID, DB_TRX_ID, DB_ROLL_PTR, and DB_MIX_ID). This restriction applies to use of the names in any letter case.

    You cannot create a table with a column whose name matches an internal InnoDB column name (DB_ROW_ID, DB_TRX_ID, DB_ROLL_PTR and DB_MIX_ID), in any letter case:

    mysql> CREATE TABLE t1 (c1 INT, db_row_id INT) ENGINE=INNODB;
    ERROR 1166 (42000): Incorrect column name 'db_row_id'
Locking and Transactions
  • LOCK TABLES acquires two locks on each table if innodb_table_locks=1 (the default). In addition to a table lock on the MySQL layer, it also acquires an InnoDB table lock. Versions of MySQL before 4.1.2 did not acquire InnoDB table locks; the old behavior can be selected by setting innodb_table_locks=0. If no InnoDB table lock is acquired, LOCK TABLES completes even if some records of the tables are being locked by other transactions.

    If innodb_table_locks = 1 (the default), LOCK TABLES acquires two locks on each table: besides the table lock at the MySQL layer it also acquires an InnoDB table lock. If no InnoDB table lock is acquired, LOCK TABLES completes even if some records of the table are locked by other transactions.

    In MySQL 8.0, innodb_table_locks=0 has no effect for tables locked explicitly with LOCK TABLES ... WRITE. It does have an effect for tables locked for read or write by LOCK TABLES ... WRITE implicitly (for example, through triggers) or by LOCK TABLES ... READ.

  • All InnoDB locks held by a transaction are released when the transaction is committed or aborted. Thus, it does not make much sense to invoke LOCK TABLES onInnoDB tables in autocommit=1 mode because the acquired InnoDB table locks would be released immediately.

    All InnoDB locks held by a transaction are released when the transaction commits or aborts, so in autocommit = 1 mode there is little point in calling LOCK TABLES on InnoDB tables: the acquired InnoDB table locks would be released immediately.

  • You cannot lock additional tables in the middle of a transaction because LOCK TABLES performs an implicit COMMIT and UNLOCK TABLES.

    You cannot lock additional tables in the middle of a transaction, because LOCK TABLES performs an implicit COMMIT and UNLOCK TABLES.

  • For limits associated with concurrent read-write transactions, see Section 15.6.6, “Undo Logs”.

source link

Java (JVM) Memory Model - Memory Management in Java

Java (JVM) Memory Model


As you can see in the above image, JVM memory is divided into separate parts. At broad level, JVM Heap memory is physically divided into two parts – Young Generation and Old Generation.

As shown in the image above, JVM memory is divided into separate parts. At a broad level, the JVM heap is physically divided into two parts: the Young Generation and the Old Generation.

Memory Management in Java – Young Generation

The young generation is the place where all the new objects are created. When the young generation is filled, garbage collection is performed. This garbage collection is called Minor GC. Young Generation is divided into three parts – Eden Memory and two Survivor Memory spaces.

All newly created objects start out in the young generation. GC runs when the young generation fills up; this collection is called Minor GC. The young generation is divided into three parts: Eden memory and two Survivor memory spaces.

Important Points about Young Generation Spaces:

  • Most of the newly created objects are located in the Eden memory space.

  • Most newly created objects are located in the Eden memory space.

  • When Eden space is filled with objects, Minor GC is performed and all the survivor objects are moved to one of the survivor spaces.

  • When Eden fills up with objects, a Minor GC is performed and all the surviving objects are moved to one of the survivor spaces.

  • Minor GC also checks the survivor objects and move them to the other survivor space. So at a time, one of the survivor space is always empty.

  • Minor GC also checks the objects already in a survivor space and moves them to the other survivor space, so at any moment one of the survivor spaces is always empty.

  • Objects that are survived after many cycles of GC, are moved to the Old generation memory space. Usually, it’s done by setting a threshold for the age of the young generation objects before they become eligible to promote to Old generation.

  • Objects that survive many GC cycles are moved to the Old Generation memory space.

    Usually this is done by setting an age threshold for young generation objects; objects that pass the threshold are promoted to the Old Generation.

Memory Management in Java – Old Generation

Old Generation memory contains the objects that are long-lived and survived after many rounds of Minor GC. Usually, garbage collection is performed in Old Generation memory when it’s full. Old Generation Garbage Collection is called Major GC and usually takes a longer time.

Old Generation memory holds objects that are long-lived and have survived many rounds of Minor GC. Usually, garbage collection is performed in the old generation when it becomes full.

Old generation garbage collection is called Major GC and usually takes longer.

Stop the World Event

All the Garbage Collections are “Stop the World” events because all application threads are stopped until the operation completes.

All garbage collections are “Stop the World” events, because all application threads are paused until the operation completes.

Since Young generation keeps short-lived objects, Minor GC is very fast and the application doesn’t get affected by this.

Since the young generation holds short-lived objects, Minor GC is very fast and the application is not really affected by it.

However, Major GC takes a long time because it checks all the live objects. Major GC should be minimized because it will make your application unresponsive for the garbage collection duration. So if you have a responsive application and there are a lot of Major Garbage Collection happening, you will notice timeout errors.

Major GC, however, takes a long time because it checks all the live objects. Major GC should be minimized, because it makes the application unresponsive for the duration of the collection; so if you have a responsive application and a lot of Major GCs are happening, you will notice timeout errors.

The duration taken by garbage collector depends on the strategy used for garbage collection. That’s why it’s necessary to monitor and tune the garbage collector to avoid timeouts in the highly responsive applications.

How long garbage collection takes depends on the GC strategy in use; that is why, for highly responsive applications, it is necessary to monitor and tune the garbage collector to avoid timeouts.

Java Memory Model – Permanent Generation

Permanent Generation or “Perm Gen” contains the application metadata required by the JVM to describe the classes and methods used in the application. Note that Perm Gen is not part of Java Heap memory.

The Permanent Generation, or “Perm Gen”, contains the application metadata the JVM needs to describe the classes and methods used in the application. Note that Perm Gen is not part of the Java heap.

Perm Gen is populated by JVM at runtime based on the classes used by the application. Perm Gen also contains Java SE library classes and methods. Perm Gen objects are garbage collected in a full garbage collection.

Perm Gen is populated by the JVM at runtime, based on the classes the application uses. It also contains the Java SE library classes and methods. Perm Gen objects are collected during a full garbage collection.

Java Memory Model – Method Area

Method Area is part of space in the Perm Gen and used to store class structure (runtime constants and static variables) and code for methods and constructors.

The Method Area is part of the Perm Gen space and is used to store class structure (runtime constants and static variables) and the code of methods and constructors.

Java Memory Model – Memory Pool

Memory Pools are created by JVM memory managers to create a pool of immutable objects if the implementation supports it. String Pool is a good example of this kind of memory pool. Memory Pool can belong to Heap or Perm Gen, depending on the JVM memory manager implementation.

Memory pools are created by the JVM memory manager to pool immutable objects, if the implementation supports it. The String pool is a good example of this kind of memory pool. A memory pool can belong to the heap or to Perm Gen, depending on the JVM memory manager implementation.

Java Memory Model – Runtime Constant Pool

Runtime constant pool is per-class runtime representation of constant pool in a class. It contains class runtime constants and static methods. Runtime constant pool is part of the method area.

The runtime constant pool is the per-class runtime representation of the constant pool in a class. It contains class runtime constants and static methods, and it is part of the method area.

Java Memory Model – Java Stack Memory

Java Stack memory is used for execution of a thread. They contain method specific values that are short-lived and references to other objects in the heap that is getting referred from the method. You should read Difference between Stack and Heap Memory.

Java stack memory is used for the execution of a thread. It holds method-specific values (short-lived, and referring to objects on the heap).

Memory Management in Java – Java Heap Memory Switches

Java provides a lot of memory switches that we can use to set the memory sizes and their ratios. Some of the commonly used memory switches are:

Java provides a lot of memory switches that can be used to set the memory sizes and their ratios. Some of the commonly used ones are:

VM Switch VM Switch Description
-Xms For setting the initial heap size when JVM starts
-Xmx For setting the maximum heap size.
-Xmn For setting the size of the Young Generation, rest of the space goes for Old Generation.
-XX:PermGen For setting the initial size of the Permanent Generation memory
-XX:MaxPermGen For setting the maximum size of Perm Gen
-XX:SurvivorRatio For providing ratio of Eden space and Survivor Space, for example if Young Generation size is 10m and VM switch is -XX:SurvivorRatio=2 then 5m will be reserved for Eden Space and 2.5m each for both the Survivor spaces. The default value is 8.
-XX:NewRatio For providing ratio of old/new generation sizes. The default value is 2.
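As an illustration, the switches are passed on the java command line; the sizes and the class name MyApp below are just placeholders, not recommendations:

java -Xms256m -Xmx1024m -Xmn256m -XX:SurvivorRatio=8 MyApp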

Most of the times, above options are sufficient, but if you want to check out other options too then please check JVM Options Official Page.

Most of the time the options above are sufficient; for the other options see the official JVM options page…

Memory Management in Java – Java Garbage Collection

Java Garbage Collection is the process to identify and remove the unused objects from the memory and free space to be allocated to objects created in future processing. One of the best features of Java programming language is the automatic garbage collection, unlike other programming languages such as C where memory allocation and deallocation is a manual process.

Java GC is the process of identifying and removing unused objects from memory, freeing the space so it can be allocated to objects created later. One of Java's features is automatic garbage collection, unlike other languages where memory has to be managed by hand (well, not entirely by hand: the stack takes care of itself, it is the heap that needs manual freeing).

Garbage Collector is the program running in the background that looks into all the objects in the memory and find out objects that are not referenced by any part of the program. All these unreferenced objects are deleted and space is reclaimed for allocation to other objects.

The garbage collector is a program running in the background that looks at all the objects in memory and finds the ones that are not referenced by any part of the program. All these unreferenced objects are deleted and the space is reclaimed for allocation to other objects.

One of the basic ways of garbage collection involves three steps:

  1. Marking: This is the first step where garbage collector identifies which objects are in use and which ones are not in use.

    Marking: the first step, in which the garbage collector identifies which objects are in use and which are not.

  2. Normal Deletion: Garbage Collector removes the unused objects and reclaim the free space to be allocated to other objects.

    Normal Deletion: the garbage collector removes the unused objects and reclaims the free space.

  3. Deletion with Compacting: For better performance, after deleting unused objects, all the survived objects can be moved to be together. This will increase the performance of allocation of memory to newer objects.

    Deletion with Compacting: for better performance, after the unused objects are deleted, all the surviving objects can be moved next to each other; this speeds up memory allocation for newer objects.

There are two problems with a simple mark and delete approach.

  1. First one is that it’s not efficient because most of the newly created objects will become unused

    First, it is not efficient, because most newly created objects will become unused.

  2. Secondly objects that are in-use for multiple garbage collection cycle are most likely to be in-use for future cycles too.

    Second, objects that are in use across multiple GC cycles are most likely to stay in use in future cycles as well.

(In short: most objects become garbage right after they are created, and tracking them has a cost; on top of that, an object that has already survived several GC cycles will most likely survive later ones too, so rescanning it every time is extra wasted work.)

The above shortcomings with the simple approach is the reason that Java Garbage Collection is Generational and we have Young Generation and Old Generation spaces in the heap memory. I have already explained above how objects are scanned and moved from one generational space to another based on the Minor GC and Major GC.

The shortcomings of this simple approach are the reason Java garbage collection is generational, with Young Generation and Old Generation spaces in the heap.

Memory Management in Java – Java Garbage Collection Types

There are five types of garbage collection types that we can use in our applications. We just need to use the JVM switch to enable the garbage collection strategy for the application. Let’s look at each of them one by one.

We can use five types of garbage collectors in our applications; enabling a GC strategy for an application only takes the corresponding JVM switch.

  1. Serial GC (-XX:+UseSerialGC): Serial GC uses the simple mark-sweep-compact approach for young and old generations garbage collection i.e Minor and Major GC.

    Serial GC is useful in client machines such as our simple stand-alone applications and machines with smaller CPU. It is good for small applications with low memory footprint.

    Serial GC: uses the simple mark-sweep-compact approach for both young and old generation collections, i.e. for Minor and Major GC.

    Serial GC is useful on client machines, such as simple standalone applications and machines with small CPUs; it is a good fit for small applications with a low memory footprint.

  2. Parallel GC (-XX:+UseParallelGC): Parallel GC is same as Serial GC except that is spawns N threads for young generation garbage collection where N is the number of CPU cores in the system. We can control the number of threads using -XX:ParallelGCThreads=n JVM option.

    Parallel GC: the same as Serial GC, except that it spawns N threads for young generation collection, where N is the number of CPU cores; the number of threads can be controlled with the -XX:ParallelGCThreads=n JVM option.

    Parallel Garbage Collector is also called throughput collector because it uses multiple CPUs to speed up the GC performance. Parallel GC uses a single thread for Old Generation garbage collection.

    Parallel GC is also called the throughput collector because it uses multiple CPUs to speed up GC performance. Parallel GC uses a single thread for old generation collection.

  3. Parallel Old GC (-XX:+UseParallelOldGC): This is same as Parallel GC except that it uses multiple threads for both Young Generation and Old Generation garbage collection.

    The same as Parallel GC, except that it uses multiple threads for both young generation and old generation garbage collection.

  4. Concurrent Mark Sweep (CMS) Collector (-XX:+UseConcMarkSweepGC): CMS Collector is also referred as concurrent low pause collector. It does the garbage collection for the Old generation. CMS collector tries to minimize the pauses due to garbage collection by doing most of the garbage collection work concurrently with the application threads.

    The CMS collector is also referred to as the concurrent low-pause collector. It does garbage collection for the old generation, and it tries to minimize the pauses caused by garbage collection by doing most of the collection work concurrently with the application threads.

    CMS collector on the young generation uses the same algorithm as that of the parallel collector. This garbage collector is suitable for responsive applications where we can’t afford longer pause times. We can limit the number of threads in CMS collector using -XX:ParallelCMSThreads=n JVM option.

    On the young generation, CMS uses the same algorithm as the parallel collector. This collector is suitable for responsive applications that cannot afford long pause times; the number of CMS threads can be limited with the -XX:ParallelCMSThreads=n JVM option.

  5. G1 Garbage Collector (-XX:+UseG1GC): The Garbage First or G1 garbage collector is available from Java 7 and its long term goal is to replace the CMS collector. The G1 collector is a parallel, concurrent, and incrementally compacting low-pause garbage collector.Garbage First Collector doesn’t work like other collectors and there is no concept of Young and Old generation space. It divides the heap space into multiple equal-sized heap regions. When a garbage collection is invoked, it first collects the region with lesser live data, hence “Garbage First”. You can find more details about it at Garbage-First Collector Oracle Documentation.

    The G1 (Garbage First) collector is available from Java 7, and its long-term goal is to replace the CMS collector. G1 is a parallel, concurrent and incrementally compacting low-pause collector. Unlike the other collectors it has no concept of Young and Old generation spaces: it divides the heap into multiple equal-sized regions, and when a collection runs it first collects the regions with the least live data, hence “Garbage First”. You can find more details in Oracle's Garbage-First Collector documentation.

Iterators and Reverse Iterators

You can convert normal iterators into reverse iterators. Naturally, the iterators must be bidirectional
iterators, but note that the logical position of an iterator is moved during the conversion. Consider
the following program:

You can convert an ordinary iterator into a reverse iterator (the iterator must of course be bidirectional).

Keep in mind that the logical position of the iterator changes during the conversion. Consider the following program:

// iter/reviter2.cpp
#include <iterator>
#include <iostream>
#include <vector>
#include <algorithm>

using namespace std;

int main()
{
    // create list with elements from 1 to 9
    vector<int> coll = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };

    // find position of element with value 5
    vector<int>::const_iterator pos;
    pos = find (coll.cbegin(), coll.cend(),
    5);

    // print value to which iterator pos refers
    cout << "pos: " << *pos << endl;

    // convert iterator to reverse iterator rpos
    vector<int>::const_reverse_iterator rpos(pos);

    // print value to which reverse iterator rpos refers
    cout << "rpos: " << *rpos << endl;
}

This program has the following output:
pos: 5
rpos: 4
Thus, if you print the value of an iterator and convert the iterator into a reverse iterator, the value has
changed. This is not a bug; it’s a feature! This behavior is a consequence of the fact that ranges are
half open. To specify all elements of a container, you must use the position after the last argument.
However, for a reverse iterator, this is the position before the first element. Unfortunately, such a
position may not exist. Containers are not required to guarantee that the position before their first
element is valid. Consider that ordinary strings and arrays might also be containers, and the language
does not guarantee that arrays don’t start at address zero.

So after the conversion the value the iterator refers to has changed. This is not a bug, it is a feature! It is a consequence of ranges being half open.

To specify all elements of a container you have to use the position after the last element; for a reverse iterator that becomes the position before the first element, and unfortunately such a position may not exist: containers are not required to guarantee that the position before their first element is valid.

Consider that ordinary strings and arrays can also be containers, and the language does not guarantee that arrays do not start at address zero.

As a result, the designers of reverse iterators use a trick: They “physically” reverse the “half-open
principle.” Physically, in a range defined by reverse iterators, the beginning is not included, whereas
the end is. However, logically, they behave as usual. Thus, there is a distinction between the physical
position that defines the element to which the iterator refers and the logical position that defines the
value to which the iterator refers (Figure 9.3). The question is, what happens on a conversion from
an iterator to a reverse iterator? Does the iterator keep its logical position (the value) or its physical
position (the element)? As the previous example shows, the latter is the case. Thus, the value is
moved to the previous element (Figure 9.4)

The upshot is that the designers of reverse iterators used a trick: they "physically" reverse the half-open principle. Physically, in a range defined by reverse iterators, the beginning is not included while the end is; logically, however, they behave as usual. So there is a distinction between the physical position, which defines the element the iterator refers to, and the logical position, which defines the value it refers to.

The question is what happens when an ordinary iterator is converted into a reverse iterator: does it keep its logical position (the value) or its physical position (the element)? As the example above shows, it is the latter, so the value moves to the previous element.
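A small check of that relationship, using std::reverse_iterator's base() member (which returns the underlying ordinary iterator):

#include <cassert>
#include <vector>

int main() {
    std::vector<int> coll = {1, 2, 3, 4, 5, 6, 7, 8, 9};

    auto pos = coll.cbegin() + 4;                        // refers to 5
    std::vector<int>::const_reverse_iterator rpos(pos);  // same physical position

    assert(rpos.base() == pos);    // base() recovers the original iterator
    assert(*rpos == *(pos - 1));   // but dereferencing yields the element before it
}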

How it is handled internally

That is the reason for this note: I want to see how the reverse iterator works internally.

The goal this time is to answer the following questions:

  1. What happens when an ordinary iterator is converted into a reverse iterator?
  2. How does a reverse iterator iterate?
  3. How does a reverse iterator make sure it does not go out of bounds?

First, the C++ code under test:

int main() {
    array<int, 9> v{1, 2, 3, 4, 5, 6, 7, 8, 9};
    auto it = v.begin();
    cout << *it << endl;
    array<int,9>::const_reverse_iterator rit(it);
    cout << *rit << endl;
}

And then the generated assembly:

main:
.LFB4368:
▹ .cfi_startproc
▹ pushq▹ %rbp
▹ .cfi_def_cfa_offset 16
▹ .cfi_offset 6, -16
▹ movq▹ %rsp, %rbp
▹ .cfi_def_cfa_register 6
▹ subq▹ $64, %rsp
▹ movl▹ $1, -64(%rbp)
▹ movl▹ $2, -60(%rbp)
▹ movl▹ $3, -56(%rbp)
▹ movl▹ $4, -52(%rbp)
▹ movl▹ $5, -48(%rbp)
▹ movl▹ $6, -44(%rbp)
▹ movl▹ $7, -40(%rbp)
▹ movl▹ $8, -36(%rbp)
▹ movl▹ $9, -32(%rbp)
▹ leaq▹ -64(%rbp), %rax
▹ movq▹ %rax, %rdi
▹ call▹ _ZNSt5arrayIiLm9EE5beginEv // the code above initializes the array
▹ movq▹ %rax, -8(%rbp) // the returned value is begin
▹ movq▹ -8(%rbp), %rax
▹ movl▹ (%rax), %eax // *it
▹ movl▹ %eax, %esi
▹ movl▹ $_ZSt4cout, %edi
▹ call▹ _ZNSolsEi
▹ movl▹ $_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_, %esi
▹ movq▹ %rax, %rdi
▹ call▹ _ZNSolsEPFRSoS_E
▹ movq▹ -8(%rbp), %rdx // -8(%rbp) 即 it
▹ leaq▹ -16(%rbp), %rax // 这是一块未使用的内存
▹ movq▹ %rdx, %rsi
▹ movq▹ %rax, %rdi
▹ call▹ _ZNSt16reverse_iteratorIPKiEC1ES1_ // 这应当就是 reverse_iterator 的 construct
▹ leaq▹ -16(%rbp), %rax
▹ movq▹ %rax, %rdi
▹ call▹ _ZNKSt16reverse_iteratorIPKiEdeEv // 在解引用之前, 会调用这个函数
▹ movl▹ (%rax), %eax // 返回值解引用, 即 *rit
▹ movl▹ %eax, %esi
▹ movl▹ $_ZSt4cout, %edi
▹ call▹ _ZNSolsEi
▹ movl▹ $_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_, %esi
▹ movq▹ %rax, %rdi
▹ call▹ _ZNSolsEPFRSoS_E
▹ movl▹ $0, %eax
▹ leave
▹ .cfi_def_cfa 7, 8
▹ ret

可以看出, 关键在于 _ZNSt16reverse_iteratorIPKiEC1ES1_ (反向迭代器的构造) 和 _ZNKSt16reverse_iteratorIPKiEdeEv (反向迭代器的解引用) 这两个函数

其中, _ZNSt16reverse_iteratorIPKiEC1ES1_ 的代码如下所示 (没有错, 就是这个, 汇编里给出的定义是对应的 C2 版本)

_ZNSt16reverse_iteratorIPKiEC2ES1_:
.LFB4445:
▹ .cfi_startproc
▹ pushq▹ %rbp
▹ .cfi_def_cfa_offset 16
▹ .cfi_offset 6, -16
▹ movq▹ %rsp, %rbp
▹ .cfi_def_cfa_register 6
▹ subq▹ $16, %rsp
▹ movq▹ %rdi, -8(%rbp)
▹ movq▹ %rsi, -16(%rbp)
▹ movq▹ -8(%rbp), %rax
▹ movq▹ %rax, %rdi
▹ call▹ _ZNSt8iteratorISt26random_access_iterator_tagilPKiRS1_EC2Ev
▹ movq▹ -8(%rbp), %rax
▹ movq▹ -16(%rbp), %rdx
▹ movq▹ %rdx, (%rax)
▹ leave
▹ .cfi_def_cfa 7, 8
▹ ret

代码中除了调用 _ZNSt8iteratorISt26random_access_iterator_tagilPKiRS1_EC2Ev 之外, 就是简单地把 %rsi 里的值 (在 main 中来自 -8(%rbp), 也就是 it 这个指针) 存进 %rdi 所指的内存 (在 main 中是 -16(%rbp), 我前面提到的那块未使用的内存). 换句话说, 构造函数把传入的迭代器原样保存进了反向迭代器对象

_ZNSt8iteratorISt26random_access_iterator_tagilPKiRS1_EC2Ev 代码如下所示

_ZNSt8iteratorISt26random_access_iterator_tagilPKiRS1_EC2Ev:
.LFB4443:
▹ .cfi_startproc
▹ pushq▹ %rbp
▹ .cfi_def_cfa_offset 16
▹ .cfi_offset 6, -16
▹ movq▹ %rsp, %rbp
▹ .cfi_def_cfa_register 6
▹ movq▹ %rdi, -8(%rbp)
▹ popq▹ %rbp
▹ .cfi_def_cfa 7, 8
▹ ret
▹ .cfi_endproc

好像… 什么也没做 = = , 这个函数没有再开辟新的栈空间, movq▹ %rdi, -8(%rbp) 也只是把 this 指针随手存了一下, 没有实际作用

那么, 可以暂时得出结论: 反向迭代器对象内部只保存了一份普通迭代器的拷贝 (它本身就是个指针), 通过对象地址去访问它时, 看起来就像一个二级指针

那么, 来看另外一个函数 _ZNKSt16reverse_iteratorIPKiEdeEv

_ZNKSt16reverse_iteratorIPKiEdeEv:
.LFB4447:
▹ .cfi_startproc
▹ pushq▹ %rbp
▹ .cfi_def_cfa_offset 16
▹ .cfi_offset 6, -16
▹ movq▹ %rsp, %rbp
▹ .cfi_def_cfa_register 6
▹ movq▹ %rdi, -24(%rbp)
▹ movq▹ -24(%rbp), %rax
▹ movq▹ (%rax), %rax
▹ movq▹ %rax, -8(%rbp)
▹ subq▹ $4, -8(%rbp)
▹ movq▹ -8(%rbp), %rax
▹ popq▹ %rbp
▹ .cfi_def_cfa 7, 8
▹ ret
▹ .cfi_endproc

一句话概括: 从对象里取出保存的指针, 减 4 (一个 int 的大小), 把结果返回给调用方去解引用. 这就和文档所说的吻合了: 反向迭代器取值时会往前移动一个位置

这个函数在反向迭代器被解引用时调用, 它的行为也就大致清楚了
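照着这两个函数的行为, 可以写出一个极简的反向迭代器草图 (只是示意, 并不是 libstdc++ 的真实实现, 这里假设底层迭代器就是裸指针 const int*) :

// toy_reviter.cpp : 模仿上面汇编行为的玩具版 reverse_iterator
#include <iostream>

struct toy_reverse_iterator {
    const int* current;                    // 构造时原样保存传入的迭代器(指针)

    explicit toy_reverse_iterator(const int* it) : current(it) {}

    // 解引用时先退一格再取值, 对应汇编里的 "取出保存的指针, subq $4"
    const int& operator*() const { return *(current - 1); }
};

int main() {
    const int a[] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    const int* it = a + 1;                 // 指向 2
    toy_reverse_iterator rit(it);
    std::cout << *rit << std::endl;        // 打印 1: 逻辑上指向前一个元素
}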

更进一步

疑问 1 和 疑问 2 解决了, 但是疑问 3 还未解决

我原来想通过更改内存值来看的, 但是貌似有什么防范操作

Breakpoint 1, main () at t.cpp:13
13 array<int, 9> v{1, 2, 3, 4, 5, 6, 7, 8, 9};
(gdb) n
14 auto it = v.begin();
(gdb) p {int}0x7fffffffe480 // 这里就是 it
$16 = 1
(gdb) set {int}0x7fffffffe480 = 100 // 更改 it 中的值
(gdb) p {int}0x7fffffffe480 // 成功更改
$17 = 100
(gdb) n
15 cout << *it << endl;
(gdb) n // 本来该打印 1 的, 更改后打印了 100
100
16 array<int,9>::const_reverse_iterator rit(it);
(gdb) p v // 再次验证更改是有效的
$18 = {_M_elems = {100, 2, 3, 4, 5, 6, 7, 8, 9}}
(gdb) p {int}0x7fffffffe47b // 根据推算, 这个就是 rend 的地址
$19 = 0
(gdb) set {int}0x7fffffffe47b = 200 // 更改 rend ( 在程序崩溃的边缘疯狂试探 :) )
(gdb) p {int}0x7fffffffe47b
$20 = 200
(gdb) n
17 cout << *rit << endl;
(gdb) n // 结果让人很失望
0
18 }
(gdb) p {int}0x7fffffffe47b // 谁让你改回去的!?
$21 = 0

我之后在 end 上做了同样的测试, 发现 end 也有防范操作

但我不知道具体是哪里防住了这个操作, 应该是在解引用附近, 可是那段代码里并没有改写这块内存的语句.

我还看了 objdump 的汇编代码, 依旧没有发现什么特别之处, 和编译器生成的汇编代码是一样的…

算了, 也算有所收获吧…

15.10 InnoDB Row Formats

The row format of a table determines how its rows are physically stored, which in turn can affect the performance of queries and DML operations. As more rows fit into a single disk page, queries and index lookups can work faster, less cache memory is required in the buffer pool, and less I/O is required to write out updated values.

表的行格式决定了行的物理存储方式, 进而会影响查询和 DML 操作的性能.

单个磁盘页里放得下的行越多, 查询和索引查找就越快, 缓冲池需要的缓存内存越少, 写出更新值所需的 I/O 也越少

The data in each table is divided into pages. The pages that make up each table are arranged in a tree data structure called a B-tree index. Table data and secondary indexes both use this type of structure. The B-tree index that represents an entire table is known as the clustered index, which is organized according to the primary key columns. The nodes of a clustered index data structure contain the values of all columns in the row. The nodes of a secondary index structure contain the values of index columns and primary key columns.

表的数据被划分成多个页. 组成每个表的页被组织成一种叫做 B-tree 索引的树形数据结构, 表数据和次级索引都使用这种结构

表示整个表的 B-tree 索引被称作聚簇索引, 它按照主键列组织. 聚簇索引的节点包含行中所有列的值, 而次级索引的节点包含索引列和主键列的值 (PS. 类似二级指针)

Variable-length columns are an exception to the rule that column values are stored in B-tree index nodes. Variable-length columns that are too long to fit on a B-tree page are stored on separately allocated disk pages called overflow pages. Such columns are referred to as off-page columns. The values of off-page columns are stored in singly-linked lists of overflow pages, with each such column having its own list of one or more overflow pages. Depending on column length, all or a prefix of variable-length column values are stored in the B-tree to avoid wasting storage and having to read a separate page.

可变长度列是 "列值存储在 B-tree 索引节点中" 这条规则的例外. 太长而放不进 B-tree 页的可变长度列, 会存储在单独分配的磁盘页中, 这些页叫做溢出页, 这样的列也叫作页外列

页外列的值以单链表的形式存放在溢出页中, 每个这样的列都有自己的一条由一个或多个溢出页组成的链表

取决于列的长度, 变长列值的全部或前缀部分会存储在 B-tree 中, 以避免浪费存储空间和额外读取一个页
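用一个纯示意的玩具模型帮助理解 "页外列以单链表形式存放在溢出页中" (字段名和页大小都是我随手设的, 和 InnoDB 真实的页结构无关) :

// overflow_page_toy.cpp : 把过长的列值切片挂成溢出页链表的示意
#include <cstddef>
#include <iostream>
#include <string>

struct OverflowPage {
    std::string data;          // 本页保存的列值片段
    OverflowPage* next;        // 指向下一个溢出页
};

// 把放不进 B-tree 页的列值按 page_cap 字节切片, 串成单链表
OverflowPage* store_off_page(const std::string& value, std::size_t page_cap) {
    OverflowPage* head = nullptr;
    OverflowPage** tail = &head;
    for (std::size_t off = 0; off < value.size(); off += page_cap) {
        *tail = new OverflowPage{value.substr(off, page_cap), nullptr};
        tail = &(*tail)->next;
    }
    return head;
}

int main() {
    // 一个 10000 字节的列值, 假设每个溢出页能放 4000 字节 -> 3 个溢出页
    OverflowPage* list = store_off_page(std::string(10000, 'x'), 4000);
    int pages = 0;
    for (OverflowPage* p = list; p; p = p->next) ++pages;
    std::cout << "overflow pages: " << pages << std::endl;
}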

The InnoDB storage engine supports four row formats: REDUNDANT, COMPACT, DYNAMIC, and COMPRESSED.

InnoDB 存储引擎支持 4 种行格式化: REDUNDANT, COMPACT, DYNAMIC 和 COMPRESSED

Table 15.16 InnoDB Row Format Overview

Row Format | Compact Storage Characteristics | Enhanced Variable-Length Column Storage | Large Index Key Prefix Support | Compression Support | Supported Tablespace Types
REDUNDANT  | No  | No  | No  | No  | system, file-per-table, general
COMPACT    | Yes | No  | No  | No  | system, file-per-table, general
DYNAMIC    | Yes | Yes | Yes | No  | system, file-per-table, general
COMPRESSED | Yes | Yes | Yes | Yes | file-per-table, general

The topics that follow describe row format storage characteristics and how to define and determine the row format of a table.

REDUNDANT Row Format

Tables that use the REDUNDANT row format store the first 768 bytes of variable-length column values (VARCHAR, VARBINARY, and BLOB and TEXT types) in the index record within the B-tree node, with the remainder stored on overflow pages. Fixed-length columns greater than or equal to 768 bytes are encoded as variable-length columns, which can be stored off-page. For example, a CHAR(255) column can exceed 768 bytes if the maximum byte length of the character set is greater than 3, as it is with utf8mb4.

使用 REDUNDANT 行格式的表, 在 B-tree 节点内的索引记录中只存储变长列值 (VARCHAR, VARBINARY, BLOB 和 TEXT 类型) 的前 768 字节, 其余部分存储在溢出页中. 长度大于等于 768 字节的固定长度列会按变长列编码, 因而可能被存储到页外. 例如, 当字符集的最大字节长度大于 3 (如 utf8mb4) 时, CHAR(255) 列就可能超过 768 字节

If the value of a column is 768 bytes or less, an overflow page is not used, and some savings in I/O may result, since the value is stored entirely in the B-tree node. This works well for relatively short BLOB column values, but may cause B-tree nodes to fill with data rather than key values, reducing their efficiency. Tables with many BLOBcolumns could cause B-tree nodes to become too full, and contain too few rows, making the entire index less efficient than if rows were shorter or column values were stored off-page.

如果列的值不超过 768 字节, 就不会用到溢出页, 因为值完整地保存在 B-tree 节点中, 可能会节省一些 I/O (PS. 现在我中文的语序都有点乱了, 因为被英文语法的顺序影响了 = =)

这对相对较短的 BLOB 列值效果很好, 但可能导致 B-tree 节点被数据而不是键值填满, 降低其效率

包含很多 BLOB 列的表可能导致 B-tree 节点过满, 容纳的行过少, 使整个索引的效率低于行更短或者列值存到页外时的情形

REDUNDANT Row Format Storage Characteristics

The REDUNDANT row format has the following storage characteristics:

  • Each index record contains a 6-byte header. The header is used to link together consecutive records, and for row-level locking.

    每个索引记录包含 6 字节的头部, 用于链接连续的记录和行级锁

  • Records in the clustered index contain fields for all user-defined columns. In addition, there is a 6-byte transaction ID field and a 7-byte roll pointer field.

    在聚簇索引记录中包含了所有用户定义的列, 除此之外, 还有 6 字节的事务ID和 7 字节的回滚指针

  • If no primary key is defined for a table, each clustered index record also contains a 6-byte row ID field.

    如果没有为表定义一个主键, 每个聚簇索引记录还包含 6 字节的行 ID 字段

  • Each secondary index record contains all the primary key columns defined for the clustered index key that are not in the secondary index.

    每个次级索引记录都包含为聚簇索引键定义的, 不在次级索引中的所有主键列

  • A record contains a pointer to each field of the record. If the total length of the fields in a record is less than 128 bytes, the pointer is one byte; otherwise, two bytes. The array of pointers is called the record directory. The area where the pointers point is the data part of the record.

    记录包含指向记录每个字段的指针, 如果记录中所有字段长度总和小于 128 字节, 指针大小为 1 字节, 否则为 2 字节 (PS. 这里的 pointer 可能不是指针的意思, 而是类似指示的含义, 它可能是偏移 offset)

    指针数组被称为记录目录, 指针指向记录的数据部分

  • Internally, fixed-length character columns such as CHAR(10) are stored in fixed-length format. Trailing spaces are not truncated from VARCHAR columns.

    在内部, 固定长度的字符列 (例如 CHAR(10)) 以固定长度格式存储, VARCHAR 列的尾随空格不会被截断

  • Fixed-length columns greater than or equal to 768 bytes are encoded as variable-length columns, which can be stored off-page. For example, a CHAR(255) column can exceed 768 bytes if the maximum byte length of the character set is greater than 3, as it is with utf8mb4.

    长度 >= 768 字节的列即使是固定长度的, 也会按变长列编码, 可能被存储到溢出页

    比如, 如果字符集的最大字节长度大于 3 (正如 utf8mb4), 一个 CHAR(255) 列就可能超过 768 字节

  • An SQL NULL value reserves one or two bytes in the record directory. An SQL NULL value reserves zero bytes in the data part of the record if stored in a variable-length column. For a fixed-length column, the fixed length of the column is reserved in the data part of the record. Reserving fixed space for NULL values permits columns to be updated in place from NULL to non-NULL values without causing index page fragmentation.

    SQL NULL 值在记录目录中占有 1 或 2 个字节, 如果存储在变长列中, 在记录的数据部分不占空间.

    对于固定长度的列, 会在记录的数据部分保留该列的固定长度. 为 NULL 值保留固定空间, 使得列从 NULL 更新为非 NULL 值时可以就地进行, 而不会导致索引页碎片

COMPACT Row Format

Tables that use the COMPACT row format store the first 768 bytes of variable-length column values (VARCHAR, VARBINARY, and BLOB and TEXT types) in the index record within the B-tree node, with the remainder stored on overflow pages. Fixed-length columns greater than or equal to 768 bytes are encoded as variable-length columns, which can be stored off-page. For example, a CHAR(255) column can exceed 768 bytes if the maximum byte length of the character set is greater than 3, as it is with utf8mb4.

(PS. 和 REDUNDANT 一样的 = =)

If the value of a column is 768 bytes or less, an overflow page is not used, and some savings in I/O may result, since the value is stored entirely in the B-tree node. This works well for relatively short BLOB column values, but may cause B-tree nodes to fill with data rather than key values, reducing their efficiency. Tables with many BLOBcolumns could cause B-tree nodes to become too full, and contain too few rows, making the entire index less efficient than if rows were shorter or column values were stored off-page.

(PS. 怎么还是一样的 = = )

COMPACT Row Format Storage Characteristics

The COMPACT row format has the following storage characteristics:

  • Each index record contains a 5-byte header that may be preceded by a variable-length header. The header is used to link together consecutive records, and for row-level locking.

    每个索引记录包含 5 字节的头部, 前面可能是变长头部. 头部用于链接记录和行锁

  • The variable-length part of the record header contains a bit vector for indicating NULL columns. If the number of columns in the index that can be NULL is N, the bit vector occupies CEILING(*N*/8) bytes. (For example, if there are anywhere from 9 to 16 columns that can be NULL, the bit vector uses two bytes.) Columns that are NULLdo not occupy space other than the bit in this vector. The variable-length part of the header also contains the lengths of variable-length columns. Each length takes one or two bytes, depending on the maximum length of the column. If all columns in the index are NOT NULL and have a fixed length, the record header has no variable-length part.

    记录头部的变长部分包含一个用于指示 NULL 列的位向量. 如果索引中可以为 NULL 的列的数量为 N, 位向量占用 CEILING(N/8) 字节 (例如, 如果有 9 到 16 列可以为 NULL, 位向量使用 2 字节; 这个计算见本节列表后面的小例子). 值为 NULL 的列除了这个向量中的位之外, 不再占用其他空间

    头部的变长部分还包含各个变长列的长度, 每个长度占 1 或 2 字节, 取决于该列的最大长度. 如果索引中的所有列都非空并且长度固定, 记录头部就没有变长部分

  • For each non-NULL variable-length field, the record header contains the length of the column in one or two bytes. Two bytes are only needed if part of the column is stored externally in overflow pages or the maximum length exceeds 255 bytes and the actual length exceeds 127 bytes. For an externally stored column, the 2-byte length indicates the length of the internally stored part plus the 20-byte pointer to the externally stored part. The internal part is 768 bytes, so the length is 768+20. The 20-byte pointer stores the true length of the column.

    对于每个非 NULL 的变长字段, 记录头部用 1 或 2 字节记录该列的长度. 只有当列的一部分存储在外部溢出页, 或者最大长度超过 255 字节且实际长度超过 127 字节时, 才需要 2 字节

    对于外部存储的列, 这个 2 字节的长度表示内部存储部分的长度, 加上指向外部存储部分的 20 字节指针

    内部部分是 768 字节, 所以这个长度是 768 + 20. 这个 20 字节指针中存储着列的真实长度 (PS. 这里有点不对)

  • The record header is followed by the data contents of non-NULL columns.

    记录头后跟着非空列的数据内容

  • Records in the clustered index contain fields for all user-defined columns. In addition, there is a 6-byte transaction ID field and a 7-byte roll pointer field.

    聚簇索引记录包含所有用户定义列的字段, 此外还有 6 字节的事务 ID 字段和 7 字节的回滚指针字段

  • If no primary key is defined for a table, each clustered index record also contains a 6-byte row ID field.

    如果没有为表定义主键, 每个聚簇索引还包含 6 字节的行 ID

  • Each secondary index record contains all the primary key columns defined for the clustered index key that are not in the secondary index. If any of the primary key columns are variable length, the record header for each secondary index has a variable-length part to record their lengths, even if the secondary index is defined on fixed-length columns.

    每个次级索引记录都包含为聚簇索引键定义的, 不在次级索引中的所有主键列. 如果其中任何主键列是变长的, 那么每个次级索引的记录头部都会有一个变长部分来记录它们的长度, 即使次级索引本身定义在固定长度的列上

  • Internally, for nonvariable-length character sets, fixed-length character columns such as CHAR(10) are stored in a fixed-length format.

    Trailing spaces are not truncated from VARCHAR columns.

    在内部, 对于非变长字符集, 像 CHAR(10) 这样的固定长度字符列以固定长度格式存储

    VARCHAR 列尾部空白不会被截断

  • Internally, for variable-length character sets such as utf8mb3 and utf8mb4, InnoDB attempts to store CHAR(*N*) in N bytes by trimming trailing spaces. If the byte length of a CHAR(*N*) column value exceeds N bytes, trailing spaces are trimmed to a minimum of the column value byte length. The maximum length of a CHAR(*N*) column is the maximum character byte length × N.

    在内部, 对于 utf8mb3 和 utf8mb4 这样的变长字符集, InnoDB 会尝试通过裁剪尾随空格把 CHAR(N) 存成 N 字节

    如果 CHAR(N) 列值的字节长度超过 N 字节, 尾随空格只会被裁剪到列值字节长度的最小限度. CHAR(N) 列的最大长度是最大字符字节长度 x N

    A minimum of N bytes is reserved for CHAR(*N*). Reserving the minimum space N in many cases enables column updates to be done in place without causing index page fragmentation. By comparison, CHAR(*N*) columns occupy the maximum character byte length × N when using the REDUNDANT row format.

    CHAR(N) 至少会保留 N 字节的空间. 在很多情况下, 保留这个最小空间 N 使得列更新可以就地进行, 不会导致索引页碎片. 相比之下, 使用 REDUNDANT 行格式时, CHAR(N) 列占用最大字符字节长度 x N

    Fixed-length columns greater than or equal to 768 bytes are encoded as variable-length fields, which can be stored off-page. For example, a CHAR(255) column can exceed 768 bytes if the maximum byte length of the character set is greater than 3, as it is with utf8mb4.

    固定长度大于或等于 768 字节的列以变长字段编码, 存储在页外, … (PS. emm… 好像是一样的了)

(PS. REDUNDANT 和 COMPACT 的区别主要有两个 : 1. COMPACT 不保存固定字段的长度 2. COMPACT 会裁剪尾随的空白字符. 所以 REDUNDANT 叫 REDUNDANT, COMPACT 叫 COMPACT, REDUNDANT 牺牲了空间换取效率)
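顺便把上面 COMPACT 记录头里 NULL 位向量 CEILING(N/8) 的计算写成代码 (只是演示这条规则本身) :

// null_bitmap_bytes.cpp : COMPACT 记录头中 NULL 位向量占用的字节数
#include <cstddef>
#include <iostream>

std::size_t null_bitmap_bytes(std::size_t nullable_columns) {
    return (nullable_columns + 7) / 8;     // 即 CEILING(N/8)
}

int main() {
    std::cout << null_bitmap_bytes(8)  << std::endl;  // 1
    std::cout << null_bitmap_bytes(9)  << std::endl;  // 2: 9~16 个可空列占 2 字节
    std::cout << null_bitmap_bytes(16) << std::endl;  // 2
}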

DYNAMIC Row Format

When a table is created with ROW_FORMAT=DYNAMIC, InnoDB can store long variable-length column values (for VARCHAR, VARBINARY, and BLOB and TEXT types) fully off-page, with the clustered index record containing only a 20-byte pointer to the overflow page. Fixed-length fields greater than or equal to 768 bytes are encoded as variable-length fields. For example, a CHAR(255) column can exceed 768 bytes if the maximum byte length of the character set is greater than 3, as it is with utf8mb4.

当表被创建为 DYNAMIC 类型时, InnoDB 能将长的变长列完全存储在页外, 聚簇索引只包含 20 字节的指向溢出页的指针. 固定长度 >= 768 的字段按照变长字段存储, 比如 … (PS. 一样的, 就不比如了… = =)

Whether columns are stored off-page depends on the page size and the total size of the row. When a row is too long, the longest columns are chosen for off-page storage until the clustered index record fits on the B-tree page. TEXT and BLOB columns that are less than or equal to 40 bytes are stored in line.

列是否存储在页外取决于页的大小和行的总大小.

当行太长时, 会依次选择最长的列放到页外存储, 直到聚簇索引记录能放进 B-tree 页为止; 小于等于 40 字节的 TEXT 和 BLOB 列则直接存储在行内 (这个思路的示意见下面的小程序)
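按这段描述, "选最长的列移到页外, 直到记录放得下" 的思路大致如下 (概念性草图, 阈值和 20 字节指针等都是按文档说法做的假设, 并非 InnoDB 源码) :

// choose_off_page.cpp : DYNAMIC 行格式选择页外列的概念性草图
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Column {
    std::string name;
    std::size_t bytes;       // 该列在记录里占用的字节数
    bool off_page;           // 是否已移到溢出页
};

// 行太长时, 反复把最长的仍在页内的列移到页外(页内只留 20 字节指针),
// 直到聚簇索引记录能放进 B-tree 页
void choose_off_page(std::vector<Column>& row, std::size_t page_limit) {
    auto in_page_size = [&row]() -> std::size_t {
        std::size_t total = 0;
        for (const auto& c : row) total += c.off_page ? 20 : c.bytes;
        return total;
    };
    while (in_page_size() > page_limit) {
        auto longest = std::max_element(row.begin(), row.end(),
            [](const Column& a, const Column& b) {
                return (a.off_page ? 0 : a.bytes) < (b.off_page ? 0 : b.bytes);
            });
        if (longest == row.end() || longest->off_page) break;  // 没有可移的列了
        longest->off_page = true;
    }
}

int main() {
    std::vector<Column> row = {
        {"id", 8, false}, {"title", 3000, false},
        {"body", 60000, false}, {"tag", 40, false}
    };
    choose_off_page(row, 8000);   // 假设一条记录最多能占 8000 字节
    for (const auto& c : row)
        std::cout << c.name << (c.off_page ? " -> off-page" : " -> in-page") << std::endl;
}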

The DYNAMIC row format maintains the efficiency of storing the entire row in the index node if it fits (as do the COMPACT and REDUNDANT formats), but the DYNAMIC row format avoids the problem of filling B-tree nodes with a large number of data bytes of long columns. The DYNAMIC row format is based on the idea that if a portion of a long data value is stored off-page, it is usually most efficient to store the entire value off-page. With DYNAMIC format, shorter columns are likely to remain in the B-tree node, minimizing the number of overflow pages required for a given row.

DYNAMIC 行存储维护在索引节点中存储整行(如果大小匹配的话)的效率(就像 COMPACT 和 REDUNDANT 一样), 但是 DYNAMIC 避免了 B-tree 被大量长列填满的问题

DYNAMIC 行格式基于这样的想法: 如果一个长数据值的一部分要存到页外, 那么通常把整个值都存到页外才是最高效的

使用 DYNAMIC 格式, 较短的列更可能留在 B-tree 节点中, 从而把一行所需的溢出页数量降到最低

The DYNAMIC row format supports index key prefixes up to 3072 bytes.

DYNAMIC 行存储支持索引键前缀, 最大可达 3072 字节 (PS. = = 这么大的么…) (需要设置 innodb_large_prefix=1)

Tables that use the DYNAMIC row format can be stored in the system tablespace, file-per-table tablespaces, and general tablespaces. To store DYNAMIC tables in the system tablespace, either disable innodb_file_per_table and use a regular CREATE TABLE or ALTER TABLE statement, or use the TABLESPACE [=] innodb_system table option with CREATE TABLE or ALTER TABLE. The innodb_file_per_table variable is not applicable to general tablespaces, nor is it applicable when using the TABLESPACE [=] innodb_system table option to store DYNAMIC tables in the system tablespace.

使用 DYNAMIC 存储的行能保存在 system tablespace, file-per-table tablespace, 以及 general tablespace 中.

将 DYNAMIC 表存储在 system tablespace 中, 要么禁用 innodb_file_per_table 并使用常规的 CREATE TABLE 或 ALTER TABLE 语句, 要么在 CREATE TABLE 或 ALTER TABLE 时使用 TABLESPACE [=] innodb_system 表选项.

innodb_file_per_table 变量不适用于 general tablespace, 也不适用于使用 TABLESPACE [=] innodb_system 表选项去在 system tablespace 中存储 DYNAMIC 表

DYNAMIC Row Format Storage Characteristics

The DYNAMIC row format is a variation of the COMPACT row format. For storage characteristics, see COMPACT Row Format Storage Characteristics.

DYNAMIC 行存储是 COMPACT 行存储的一种变化

COMPRESSED Row Format

The COMPRESSED row format uses similar internal details for off-page storage as the DYNAMIC row format, with additional storage and performance considerations from the table and index data being compressed and using smaller page sizes. With the COMPRESSED row format, the KEY_BLOCK_SIZE option controls how much column data is stored in the clustered index, and how much is placed on overflow pages. For more information about the COMPRESSED row format, see Section 15.9, “InnoDB Table and Page Compression”.

COMPRESSED 行存储在页外存储上使用和 DYNAMIC 类似的内部实现, 另外由于表和索引数据会被压缩并使用更小的页, 还有额外的存储和性能上的考量.

使用 COMPRESSED 行存储, KEY_BLOCK_SIZE 选项控制多少列数据存储在聚簇索引, 多少存储在溢出页.

The COMPRESSED row format supports index key prefixes up to 3072 bytes.

COMPRESSED 行存储支持索引键前缀, 最大可达 3072 字节

Tables that use the COMPRESSED row format can be created in file-per-table tablespaces or general tablespaces. The system tablespace does not support the COMPRESSEDrow format. To store a COMPRESSED table in a file-per-table tablespace, the innodb_file_per_table variable must be enabled. The innodb_file_per_table variable is not applicable to general tablespaces. General tablespaces support all row formats with the caveat that compressed and uncompressed tables cannot coexist in the same general tablespace due to different physical page sizes. For more information, see Section 15.6.3.3, “General Tablespaces”.

Compressed Row Format Storage Characteristics

The COMPRESSED row format is a variation of the COMPACT row format. For storage characteristics, see COMPACT Row Format Storage Characteristics.

Defining the Row Format of a Table

The default row format for InnoDB tables is defined by innodb_default_row_format variable, which has a default value of DYNAMIC. The default row format is used when the ROW_FORMAT table option is not defined explicitly or when ROW_FORMAT=DEFAULT is specified.

innodb_default_row_format 变量定义了 InnoDB 表的默认行格式, 默认值为 DYNAMIC. 当表选项 ROW_FORMAT 没有显式指定, 或者指定为 ROW_FORMAT=DEFAULT 时, 才会使用默认行格式

The row format of a table can be defined explicitly using the ROW_FORMAT table option in a CREATE TABLE or ALTER TABLE statement. For example:

CREATE TABLE t1 (c1 INT) ROW_FORMAT=DYNAMIC;

An explicitly defined ROW_FORMAT setting overrides the default row format. Specifying ROW_FORMAT=DEFAULT is equivalent to using the implicit default.

The innodb_default_row_format variable can be set dynamically:

mysql> SET GLOBAL innodb_default_row_format=DYNAMIC;

Valid innodb_default_row_format options include DYNAMIC, COMPACT, and REDUNDANT. The COMPRESSED row format, which is not supported for use in the system tablespace, cannot be defined as the default. It can only be specified explicitly in a CREATE TABLE or ALTER TABLE statement. Attempting to set the innodb_default_row_format variable to COMPRESSED returns an error:

COMPRESSED 不能用于默认设置, 只能显式指定

mysql> SET GLOBAL innodb_default_row_format=COMPRESSED;
ERROR 1231 (42000): Variable 'innodb_default_row_format'
can't be set to the value of 'COMPRESSED'

Newly created tables use the row format defined by the innodb_default_row_format variable when a ROW_FORMAT option is not specified explicitly, or when ROW_FORMAT=DEFAULT is used. For example, the following CREATE TABLE statements use the row format defined by the innodb_default_row_format variable.

CREATE TABLE t1 (c1 INT);
CREATE TABLE t2 (c1 INT) ROW_FORMAT=DEFAULT;

When a ROW_FORMAT option is not specified explicitly, or when ROW_FORMAT=DEFAULT is used, an operation that rebuilds a table silently changes the row format of the table to the format defined by the innodb_default_row_format variable.

Table-rebuilding operations include ALTER TABLE operations that use ALGORITHM=COPY or ALGORITHM=INPLACE where table rebuilding is required. See Section 15.12.1, “Online DDL Operations” for more information. OPTIMIZE TABLE is also a table-rebuilding operation.

The following example demonstrates a table-rebuilding operation that silently changes the row format of a table created without an explicitly defined row format.

mysql> SELECT @@innodb_default_row_format;
+-----------------------------+
| @@innodb_default_row_format |
+-----------------------------+
| dynamic |
+-----------------------------+

mysql> CREATE TABLE t1 (c1 INT);

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_TABLES WHERE NAME LIKE 'test/t1' \G
*************************** 1. row ***************************
TABLE_ID: 54
NAME: test/t1
FLAG: 33
N_COLS: 4
SPACE: 35
ROW_FORMAT: Dynamic
ZIP_PAGE_SIZE: 0
SPACE_TYPE: Single

mysql> SET GLOBAL innodb_default_row_format=COMPACT;

mysql> ALTER TABLE t1 ADD COLUMN (c2 INT);

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_TABLES WHERE NAME LIKE 'test/t1' \G
*************************** 1. row ***************************
TABLE_ID: 55
NAME: test/t1
FLAG: 1
N_COLS: 5
SPACE: 36
ROW_FORMAT: Compact
ZIP_PAGE_SIZE: 0
SPACE_TYPE: Single

(PS. 这种情况要格外注意, 更改表操作相当于重新创建, 之前设置的一些默认选项可能已经被改变了)

Consider the following potential issues before changing the row format of existing tables from REDUNDANT or COMPACT to DYNAMIC.

  • The REDUNDANT and COMPACT row formats support a maximum index key prefix length of 767 bytes whereas DYNAMIC and COMPRESSED row formats support an index key prefix length of 3072 bytes. In a replication environment, if the innodb_default_row_format variable is set to DYNAMIC on the master, and set to COMPACT on the slave, the following DDL statement, which does not explicitly define a row format, succeeds on the master but fails on the slave:

    REDUNDANT 和 COMPACT 行存储支持最大 767 字节的索引键前缀, 然而 DYNAMIC 和 COMPRESSED 行存储支持的长度可达 3072 字节.

    在复制 (replication) 环境下, 如果主库 (master) 上的 innodb_default_row_format 设置为 DYNAMIC, 而从库 (slave) 上设置为 COMPACT, 那么下面这条没有显式定义行格式的 DDL 语句会在主库上成功, 在从库上失败

    CREATE TABLE t1 (c1 INT PRIMARY KEY, c2 VARCHAR(5000), KEY i1(c2(3070)));

    For related information, see Section 15.6.1.6, “Limits on InnoDB Tables”.

    (PS. 原因是 REDUNDANT 和 COMPACT 行存储限制了索引前缀必须低于 768 字节

    这里还要加限制, 就是 c2(3070) 真正内存 < 3072 )

  • Importing a table that does not explicitly define a row format results in a schema mismatch error if the innodb_default_row_format setting on the source server differs from the setting on the destination server. For more information, refer to the limitations outlined in Section 15.6.3.7, “Copying Tablespaces to Another Instance”.

    导入一个未显式指定行格式的表时, 如果源服务器上的 innodb_default_row_format 设置与目标服务器上的不同, 会产生 schema 不匹配的错误

Determining the Row Format of a Table

To determine the row format of a table, use SHOW TABLE STATUS:

mysql> SHOW TABLE STATUS IN test1\G
*************************** 1. row ***************************
Name: t1
Engine: InnoDB
Version: 10
Row_format: Dynamic
Rows: 0
Avg_row_length: 0
Data_length: 16384
Max_data_length: 0
Index_length: 16384
Data_free: 0
Auto_increment: 1
Create_time: 2016-09-14 16:29:38
Update_time: NULL
Check_time: NULL
Collation: utf8mb4_0900_ai_ci
Checksum: NULL
Create_options:
Comment:

Alternatively, query the INFORMATION_SCHEMA.INNODB_TABLES table:

mysql> SELECT NAME, ROW_FORMAT FROM INFORMATION_SCHEMA.INNODB_TABLES WHERE NAME='test1/t1';
+----------+------------+
| NAME | ROW_FORMAT |
+----------+------------+
| test1/t1 | Dynamic |
+----------+------------+

15.6.5 Redo Log

By default, the redo log is physically represented on disk by two files named ib_logfile0 and ib_logfile1. MySQL writes to the redo log files in a circular fashion. Data in the redo log is encoded in terms of records affected; this data is collectively referred to as redo. The passage of data through the redo log is represented by an ever-increasing LSN value.

默认情况下, redo 日志在磁盘上由两个名为 ib_logfile0 和 ib_logfile1 的文件物理表示. MySQL 以循环方式写入这些 redo 日志文件. redo 日志中的数据按照受影响的记录来编码, 这些数据统称为 redo. 数据经过 redo 日志的进度由一个不断增长的 LSN 值表示

For related information, see Redo Log File Configuration, and Section 8.5.4, “Optimizing InnoDB Redo Logging”.

For information about data-at-rest encryption for redo logs, see Redo Log Encryption.

Changing the Number or Size of Redo Log Files

  1. Stop the MySQL server and make sure that it shuts down without errors.
  2. Edit my.cnf to change the log file configuration. To change the log file size, configure innodb_log_file_size. To increase the number of log files, configure innodb_log_files_in_group.
  3. Start the MySQL server again.

If InnoDB detects that the innodb_log_file_size differs from the redo log file size, it writes a log checkpoint, closes and removes the old log files, creates new log files at the requested size, and opens the new log files.

Group Commit for Redo Log Flushing

InnoDB, like any other ACID-compliant database engine, flushes the redo log of a transaction before it is committed. InnoDB uses group commit functionality to group multiple such flush requests together to avoid one flush for each commit. With group commit, InnoDB issues a single write to the log file to perform the commit action for multiple user transactions that commit at about the same time, significantly improving throughput.

InnoDB 像其他满足 ACID 的数据库引擎一样, 在事务提交前冲刷该事务的 redo 日志. InnoDB 使用组提交功能把多个这样的冲刷请求合并在一起, 避免每次提交都冲刷一次. 使用组提交时, InnoDB 只对日志文件执行一次写入, 就能为大约同一时间提交的多个用户事务完成提交动作, 从而显著提高吞吐量

For more information about performance of COMMIT and other transactional operations, see Section 8.5.2, “Optimizing InnoDB Transaction Management”.

Redo Log Archiving

Backup utilities that copy redo log records may sometimes fail to keep pace with redo log generation while a backup operation is in progress, resulting in lost redo log records due to those records being overwritten. This issue most often occurs when there is significant MySQL server activity during the backup operation, and the redo log file storage media operates at a faster speed than the backup storage media. The redo log archiving feature, introduced in MySQL 8.0.17, addresses this issue by sequentially writing redo log records to an archive file in addition to the redo log files. Backup utilities can copy redo log records from the archive file as necessary, thereby avoiding the potential loss of data.

在备份进行期间, 拷贝 redo 日志记录的备份工具有时可能跟不上 redo 日志的生成速度, 导致这些记录因被覆写而丢失

这个问题最常发生在备份期间 MySQL 服务器活动频繁, 并且 redo 日志文件所在的存储介质比备份存储介质更快的时候. MySQL 8.0.17 引入的 redo 日志归档功能通过在写 redo 日志文件之外, 再把 redo 日志记录顺序写入一个归档文件来解决这个问题. 备份工具可以按需从归档文件中拷贝 redo 日志记录, 从而避免潜在的数据丢失

If redo log archiving is configured on the server, MySQL Enterprise Backup, available with the MySQL Enterprise Edition, uses the redo log archiving feature when backing up a MySQL server.

Enabling redo log archiving on the server requires setting a value for the innodb_redo_log_archive_dirs system variable. The value is specified as a semicolon-separated list of labeled redo log archive directories. The *label:directory* pair is separated by a colon (:). For example:

mysql> SET GLOBAL innodb_redo_log_archive_dirs='label1:directory_path1[;label2:directory_path2;…]';

The label is an arbitrary identifier for the archive directory. It can be any string of characters, with the exception of colons (:), which are not permitted. An empty label is also permitted, but the colon (:) is still required in this case. A directory_path must be specified. The directory that is selected for the redo log archive file must exist when redo log archiving is activated, or an error is returned. The path can contain colons (‘:’), but semicolons (;) are not permitted.

The innodb_redo_log_archive_dirs variable must be configured before the redo log archiving can be activated. The default value is NULL, which does not permit activating redo log archiving.

Notes

The archive directories that you specify must satisfy the following requirements. (The requirements are enforced when redo log archiving is activated.):

  • Directories must exist. Directories are not created by the redo log archive process. Otherwise, the following error is returned:

    ERROR 3844 (HY000): Redo log archive directory ‘directory_path1‘ does not exist or is not a directory

  • Directories must not be world-accessible. This is to prevent the redo log data from being exposed to unauthorized users on the system. Otherwise, the following error is returned:

    ERROR 3846 (HY000): Redo log archive directory ‘directory_path1‘ is accessible to all OS users

  • Directories cannot be those defined by datadir, innodb_data_home_dir, innodb_directories, innodb_log_group_home_dir, innodb_temp_tablespaces_dir, innodb_tmpdir, innodb_undo_directory, or secure_file_priv, nor can they be parent directories or subdirectories of those directories. Otherwise, an error similar to the following is returned:

    ERROR 3845 (HY000): Redo log archive directory ‘directory_path1‘ is in, under, or over server directory ‘datadir’ - ‘/path/to/data_directory

When a backup utility that supports redo log archiving initiates a backup, the backup utility activates redo log archiving by invoking the innodb_redo_log_archive_start() user-defined function.

If you are not using a backup utility that supports redo log archiving, redo log archiving can also be activated manually, as shown:

mysql> SELECT innodb_redo_log_archive_start('label', 'subdir');
+------------------------------------------+
| innodb_redo_log_archive_start('label') |
+------------------------------------------+
| 0 |
+------------------------------------------+

Or:

mysql> DO innodb_redo_log_archive_start('label', 'subdir');
Query OK, 0 rows affected (0.09 sec)

Note

The MySQL session that activates redo log archiving (using innodb_redo_log_archive_start()) must remain open for the duration of the archiving. The same session must deactivate redo log archiving (using innodb_redo_log_archive_stop()). If the session is terminated before the redo log archiving is explicitly deactivated, the server deactivates redo log archiving implicitly and removes the redo log archive file.

where label is a label defined by innodb_redo_log_archive_dirs; subdir is an optional argument for specifying a subdirectory of the directory identified by label for saving the archive file; it must be a simple directory name (no slash (/), backslash (\), or colon (:) is permitted). subdir can be empty, null, or it can be left out.

Only users with the INNODB_REDO_LOG_ARCHIVE privilege can activate redo log archiving by invoking innodb_redo_log_archive_start(), or deactivate it usinginnodb_redo_log_archive_stop(). The MySQL user running the backup utility or the MySQL user activating and deactivating redo log archiving manually must have this privilege.

The redo log archive file path is *directory_identified_by_label*/[*subdir*/]archive.*serverUUID*.000001.log, where *directory_identified_by_label* is the archive directory identified by the *label* argument for innodb_redo_log_archive_start(). *subdir* is the optional argument used for innodb_redo_log_archive_start().

For example, the full path and name for a redo log archive file appears similar to the following:

/directory_path/subdirectory/archive.e71a47dc-61f8-11e9-a3cb-080027154b4d.000001.log

After the backup utility finishes copying InnoDB data files, it deactivates redo log archiving by calling the innodb_redo_log_archive_stop() user-defined function.

If you are not using a backup utility that supports redo log archiving, redo log archiving can also be deactivated manually, as shown:

mysql> SELECT innodb_redo_log_archive_stop();
+--------------------------------+
| innodb_redo_log_archive_stop() |
+--------------------------------+
| 0 |
+--------------------------------+

Or:

mysql> DO innodb_redo_log_archive_stop();
Query OK, 0 rows affected (0.01 sec)

After the stop function completes successfully, the backup utility looks for the relevant section of redo log data from the archive file and copies it into the backup.

After the backup utility finishes copying the redo log data and no longer needs the redo log archive file, it deletes the archive file.

Removal of the archive file is the responsibility of the backup utility in normal situations. However, if the redo log archiving operation quits unexpectedly beforeinnodb_redo_log_archive_stop() is called, the MySQL server removes the file.

Performance Considerations

Activating redo log archiving typically has a minor performance cost due to the additional write activity.

On Unix and Unix-like operating systems, the performance impact is typically minor, assuming there is not a sustained high rate of updates. On Windows, the performance impact is typically a bit higher, assuming the same.

If there is a sustained high rate of updates and the redo log archive file is on the same storage media as the redo log files, the performance impact may be more significant due to compounded write activity.

If there is a sustained high rate of updates and the redo log archive file is on slower storage media than the redo log files, performance is impacted arbitrarily.

Writing to the redo log archive file does not impede normal transactional logging except in the case that the redo log archive file storage media operates at a much slower rate than the redo log file storage media, and there is a large backlog of persisted redo log blocks waiting to be written to the redo log archive file. In this case, the transactional logging rate is reduced to a level that can be managed by the slower storage media where the redo log archive file resides.

之前就看过关于在内核以及用户空间实现线程的文章, 到现在还对于其中的一些点一知半解, 比如: 为什么实现在用户空间的线程比实现在内核空间的快?. 今天碰巧看到了这篇文章, 原文出自 <modern operating system, fourth edition>

threads implementation in kernel and user space

2.2.4 Implementing Threads in User Space

There are two main places to implement threads: user space and the kernel.
The choice is a bit controversial, and a hybrid implementation is also possible. We
will now describe these methods, along with their advantages and disadvantages.

有两个主要的地方可以实现线程: 用户空间和内核. 在哪里实现有一定争议, 同时也可以采用混合实现. 下面我们将描述这些方法, 以及它们各自的优点和缺点.

The first method is to put the threads package entirely in user space. The kernel knows nothing about them. As far as the kernel is concerned, it is managing
ordinary, single-threaded processes. The first, and most obvious, advantage is that
a user-level threads package can be implemented on an operating system that does
not support threads. All operating systems used to fall into this category, and even
now some still do. With this approach, threads are implemented by a library.

第一种方法是把线程包完全放在用户空间, 内核对它们一无所知. 在内核看来, 它管理的仍然是普通的单线程进程.

首先, 最明显的优点是, 用户级线程包可以实现在一个不支持多线程的操作系统上. 所有的操作系统曾经都属于这一类, 直到现在还有一部分如此. 在这种方式下, 线程由一个库来实现.

All of these implementations have the same general structure, illustrated in
Fig. 2-16(a). The threads run on top of a run-time system, which is a collection of
procedures that manage threads. We have seen four of these already: pthread create, pthread exit, pthread join, and pthread yield, but usually there are more.

所有的实现有同样通用的结构, 如图2-16(a). 线程运行于运行时系统上(一系列管理线程的程序). 我们已经见过四种这样的程序了: 线程创建, 退出, 加入, 放弃(这是本书前面部分的内容, 但为什么是 pthread 呢? 难道是基于 posix 标准的线程实现?)

When threads are managed in user space, each process needs its own private
thread table to keep track of the threads in that process. This table is analogous to
the kernel’s process table, except that it keeps track only of the per-thread proper-
ties, such as each thread’s program counter, stack pointer, registers, state, and so
forth. The thread table is managed by the run-time system. When a thread is
moved to ready state or blocked state, the information needed to restart it is stored
in the thread table, exactly the same way as the kernel stores information about
processes in the process table.

当线程在用户空间管理时, 每个进程需要有自己私有的线程表, 用于跟踪该进程中的线程. 这个表类似于内核的进程表, 只不过它只跟踪每个线程的属性, 比如每个线程的程序计数器, 栈指针, 寄存器, 状态, 以及…

线程表由运行时系统管理, 当线程转变为就绪/阻塞状态时, 用于重启的信息就存储在线程表中, 就和内核在进程表中存储关于进程的信息一样.

When a thread does something that may cause it to become blocked locally, for
example, waiting for another thread in its process to complete some work, it calls a
run-time system procedure. This procedure checks to see if the thread must be put
into blocked state. If so, it stores the thread’s registers (i.e., its own) in the thread
table, looks in the table for a ready thread to run, and reloads the machine registers
with the new thread’s saved values. As soon as the stack pointer and program
counter have been switched, the new thread comes to life again automatically.

如果线程做了某些会导致它在本地阻塞的事情, 比如等待进程内的另一个线程完成某些工作, 它会调用一个运行时系统的过程.

这个过程检查该线程是否必须进入阻塞态. 如果是, 它把该线程的寄存器 (也就是它自己的) 保存到线程表中, 在表中查找一个就绪的线程来运行, 并用新线程保存的值重新装载机器寄存器. 一旦栈指针和程序计数器完成切换, 新线程就自动恢复运行.
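用 POSIX 的 ucontext 接口 (getcontext / makecontext / swapcontext, 假设在 Linux/glibc 环境下) 可以写一个极简的协作式 "用户级线程" 示意, 体会一下 "保存寄存器 -> 挑一个就绪线程 -> 恢复它的寄存器" 这件事完全发生在用户空间 (只是示意, 不是真实线程库的实现) :

// user_thread_switch.cpp : ucontext 版的极简用户级线程切换示意
#include <ucontext.h>
#include <cstdio>

// 两个"线程"的上下文(程序计数器, 栈指针, 寄存器都保存在 ucontext_t 里),
// 相当于书中说的每进程私有线程表里的两个表项
static ucontext_t main_ctx, th_ctx;
static char th_stack[64 * 1024];

static void thread_func() {
    std::printf("in user-level thread\n");
    swapcontext(&th_ctx, &main_ctx);   // 相当于 thread_yield: 保存自己, 切回另一个
    std::printf("thread resumed, exiting\n");
}

int main() {
    getcontext(&th_ctx);               // 初始化新"线程"的上下文
    th_ctx.uc_stack.ss_sp = th_stack;  // 自己的栈
    th_ctx.uc_stack.ss_size = sizeof(th_stack);
    th_ctx.uc_link = &main_ctx;        // 跑完之后回到 main
    makecontext(&th_ctx, thread_func, 0);

    std::printf("switch to thread\n");
    swapcontext(&main_ctx, &th_ctx);   // 保存 main 的寄存器, 装载线程的
    std::printf("back in main, switch again\n");
    swapcontext(&main_ctx, &th_ctx);   // 再切过去, 让它跑完
    std::printf("done\n");
}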

If the machine happens to have an instruction to store all the registers and another
one to load them all, the entire thread switch can be done in just a handful of in-
structions. Doing thread switching like this is at least an order of magnitude—
maybe more—faster than trapping to the kernel and is a strong argument in favor
of user-level threads packages.

如果机器恰好有一条指令能保存所有寄存器, 另一条指令能装载所有寄存器, 那么整个线程切换只需要寥寥几条指令.

这样做线程切换, 比陷入内核至少快一个数量级, 甚至更多. 这是支持用户级线程包的一个有力论据.

However, there is one key difference with processes. When a thread is finished
running for the moment, for example, when it calls thread yield, the code of
thread yield can save the thread’s information in the thread table itself. Fur-
thermore, it can then call the thread scheduler to pick another thread to run. The
procedure that saves the thread’s state and the scheduler are just local procedures,
so invoking them is much more efficient than making a kernel call. Among other
issues, no trap is needed, no context switch is needed, the memory cache need not
be flushed, and so on. This makes thread scheduling very fast.

然而, 与进程相比有一个关键区别. 当线程暂时结束运行时, 例如调用 thread_yield 时, thread_yield 的代码可以自己把线程的信息保存到线程表中.

更进一步, 它可以接着调用线程调度器挑选另一个线程来运行. 保存线程状态的过程和调度器都只是本地过程, 所以调用它们比进行内核调用高效得多. 此外, 不需要陷入内核, 不需要上下文切换, 也不需要刷新内存缓存, 等等. 这使得线程调度非常快.

User-level threads also have other advantages. They allow each process to have
its own customized scheduling algorithm. For some applications, for example,
those with a garbage-collector thread, not having to worry about a thread being
stopped at an inconvenient moment is a plus. They also scale better, since kernel
threads invariably require some table space and stack space in the kernel, which
can be a problem if there are a very large number of threads.

用户级线程还有其他优点. 它允许每个进程拥有自己定制的调度算法. 对某些应用, 比如带有垃圾回收线程的应用, 不必担心线程在不合适的时刻被停下来, 这是一个加分项. 它们的伸缩性也更好, 因为内核线程总是需要在内核中占用一些表空间和栈空间, 当线程数量非常多时这会成为问题.

Despite their better performance, user-level threads packages have some major
problems. First among these is the problem of how blocking system calls are im-
plemented. Suppose that a thread reads from the keyboard before any keys hav e
been hit. Letting the thread actually make the system call is unacceptable, since
this will stop all the threads. One of the main goals of having threads in the first
place was to allow each one to use blocking calls, but to prevent one blocked
thread from affecting the others. With blocking system calls, it is hard to see how
this goal can be achieved readily.

即使它们拥有更好的性能, 用户级线程包也有一些固有的问题.

首先是如何实现阻塞的系统调用. 假设一个线程在还没有任何按键之前就去读键盘, 让该线程真的去执行这个系统调用是不可接受的, 因为这会停住所有线程. 使用线程的首要目标之一就是允许每个线程使用阻塞调用, 同时又不让一个被阻塞的线程影响其他线程. 在使用阻塞系统调用的情况下, 很难看出如何轻易达到这个目标.

The system calls could all be changed to be nonblocking (e.g., a read on the
keyboard would just return 0 bytes if no characters were already buffered), but re-
quiring changes to the operating system is unattractive. Besides, one argument for
user-level threads was precisely that they could run with existing operating sys-
tems. In addition, changing the semantics of read will require changes to many
user programs.

可以把系统调用都改成非阻塞的 (例如, 读键盘在没有字符被缓存时直接返回 0 字节), 但要求修改操作系统并不可取. 此外, 支持用户级线程的一个论据恰恰是它们可以运行在现有的操作系统之上. 另外, 改变 read 的语义还需要修改大量用户程序.

Another alternative is available in the event that it is possible to tell in advance
if a call will block. In most versions of UNIX, a system call, select , exists, which
allows the caller to tell whether a prospective read will block. When this call is
present, the library procedure read can be replaced with a new one that first does a
select call and then does the read call only if it is safe (i.e., will not block). If the
read call will block, the call is not made. Instead, another thread is run. The next
time the run-time system gets control, it can check again to see if the read is now
safe. This approach requires rewriting parts of the system call library, and is inef-
ficient and inelegant, but there is little choice. The code placed around the system
call to do the checking is called a jacket or wrapper.

另一种可选的办法是: 如果能够提前判断某个调用是否会阻塞, 就可以利用这一点.

在大多数 UNIX 版本中都存在一个系统调用 select, 它允许调用者判断一个将要进行的 read 是否会阻塞. 有了这个调用, 库过程 read 就可以被替换成一个新版本: 先执行 select, 只有在安全 (即不会阻塞) 时才真正执行 read; 如果 read 会阻塞, 就不执行这个调用, 转而运行另一个线程.

等到下次运行时系统获得控制权时, 再检查这次 read 是否已经安全. 这种做法需要重写部分系统调用库, 既低效又不优雅, 但也没有什么别的选择. 这种放在系统调用周围做检查的代码被称为 jacket 或 wrapper (示意见下面的代码).
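按这段描述, 一个 "jacket / wrapper" 大致长下面这样 (select 和 read 都是真实的 POSIX 接口; 但这只是示意, 真实的用户级线程库会在 "现在读会阻塞" 的分支里切换到另一个就绪线程, 而不是直接返回) :

// jacket_read.cpp : 先 select 探测, 确认不会阻塞才真正 read
#include <sys/select.h>
#include <unistd.h>
#include <cstdio>

ssize_t wrapped_read(int fd, void* buf, size_t count) {
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    timeval timeout = {0, 0};                   // 不等待, 只做探测
    int ready = select(fd + 1, &readfds, nullptr, nullptr, &timeout);
    if (ready <= 0)
        return -1;                              // 现在读会阻塞(或出错), 先不读
    return read(fd, buf, count);                // 确认安全后才执行真正的 read
}

int main() {
    char buf[128];
    ssize_t n = wrapped_read(STDIN_FILENO, buf, sizeof(buf));
    std::printf("wrapped_read returned %zd\n", n);
}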

Somewhat analogous to the problem of blocking system calls is the problem of
page faults. We will study these in Chap. 3. For the moment, suffice it to say that
computers can be set up in such a way that not all of the program is in main memo-
ry at once. If the program calls or jumps to an instruction that is not in memory, a
page fault occurs and the operating system will go and get the missing instruction
(and its neighbors) from disk. This is called a page fault. The process is blocked
while the necessary instruction is being located and read in. If a thread causes a
page fault, the kernel, unaware of even the existence of threads, naturally blocks
the entire process until the disk I/O is complete, even though other threads might
be runnable.

(简单来说, 这段说的是页错误, 主存和磁盘间虚拟空间内容的交换.)

Another problem with user-level thread packages is that if a thread starts run-
ning, no other thread in that process will ever run unless the first thread voluntarily
gives up the CPU. Within a single process, there are no clock interrupts, making it
impossible to schedule processes round-robin fashion (taking turns). Unless a
thread enters the run-time system of its own free will, the scheduler will never get a
chance.

用户级线程包的另一个问题是: 一旦某个线程开始运行, 该进程中的其他线程就无法运行, 除非第一个线程自愿放弃 CPU.

在单个进程内部没有时钟中断, 因此无法用轮转 (round-robin) 的方式调度线程. 除非某个线程自愿进入运行时系统, 否则调度器根本没有机会运行.

One possible solution to the problem of threads running forever is to have the
run-time system request a clock signal (interrupt) once a second to give it control,
but this, too, is crude and messy to program. Periodic clock interrupts at a higher
frequency are not always possible, and even if they are, the total overhead may be
substantial. Furthermore, a thread might also need a clock interrupt, interfering
with the run-time system’s use of the clock.

一个可能的解决办法是让运行时系统每秒请求一次时钟信号 (中断) 来获得控制权, 但这样做对程序来说既粗糙又混乱. 更高频率的周期性时钟中断并不总是可行, 即使可行, 总开销也可能相当可观. 而且, 线程自己可能也需要时钟中断, 这会干扰运行时系统对时钟的使用 (总而言之这不是个好主意; 下面给出一个定时信号的小示意).
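"每秒请求一次时钟信号" 大致就是下面这种做法 (只演示周期性信号本身; 真要在信号处理函数里做线程切换, 还会碰到异步信号安全之类的一堆问题, 这也是书里说它粗糙混乱的原因之一) :

// runtime_tick.cpp : 用 setitimer 给运行时系统一个周期性的"时钟中断"
#include <csignal>
#include <cstdio>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t ticks = 0;

static void on_tick(int) { ticks = ticks + 1; }   // 运行时系统在这里获得控制权

int main() {
    struct sigaction sa = {};
    sa.sa_handler = on_tick;
    sigaction(SIGALRM, &sa, nullptr);

    itimerval tv = {};
    tv.it_interval.tv_sec = 1;      // 之后每秒触发一次
    tv.it_value.tv_sec = 1;         // 1 秒后第一次触发
    setitimer(ITIMER_REAL, &tv, nullptr);

    while (ticks < 3)               // 等三次"时钟中断"
        pause();
    std::printf("got %d ticks\n", (int)ticks);
}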

Another, and really the most devastating, argument against user-level threads is
that programmers generally want threads precisely in applications where the
threads block often, as, for example, in a multithreaded Web server. These threads
are constantly making system calls. Once a trap has occurred to the kernel to carry
out the system call, it is hardly any more work for the kernel to switch threads if
the old one has blocked, and having the kernel do this eliminates the need for con-
stantly making select system calls that check to see if read system calls are safe.
For applications that are essentially entirely CPU bound and rarely block, what is
the point of having threads at all? No one would seriously propose computing the
first n prime numbers or playing chess using threads because there is nothing to be
gained by doing it that way.

另一个反对用户级线程的论证 (也是最具杀伤力的一个) 是: 程序员通常恰恰是在线程经常阻塞的应用中才需要线程, 比如在一个多线程 web 服务器中. 这些线程不停地进行系统调用. 一旦为了执行系统调用而陷入内核, 如果旧线程已经阻塞, 让内核顺便切换一下线程几乎不需要额外的工作; 而且由内核来做这件事, 也就不再需要不停地用 select 去检查 read 是否安全. 而对于那些基本完全受 CPU 限制, 很少阻塞的应用, 使用线程又有什么意义呢? 没有人会认真地建议用多线程去计算前 n 个素数或者下棋, 因为那样做得不到任何好处.

(总结归纳一下: 大概意思是, 用户级线程最大的优点是在于其切换起来很快, 但是我们通常希望在频繁发生线程阻塞的应用中使用线程, 而在这种情况下, 线程切换所需的操作就会变少(如果旧线程已经被阻塞了的话), 那么 用户级线程存在的意义就不大了)

2.2.5 Implementing Threads in the Kernel

Now let us consider having the kernel know about and manage the threads. No
run-time system is needed in each, as shown in Fig. 2-16(b). Also, there is no
thread table in each process. Instead, the kernel has a thread table that keeps track
of all the threads in the system. When a thread wants to create a new thread or
destroy an existing thread, it makes a kernel call, which then does the creation or
destruction by updating the kernel thread table.

现在让我们考虑由内核来了解并管理线程. 如图 2-16(b) 所示, 此时每个进程中不再需要运行时系统, 也没有线程表. 取而代之的是, 内核有一张线程表, 跟踪系统中所有的线程. 当某个线程想要创建新线程或销毁已有线程时, 它发起一个内核调用, 由内核通过更新内核线程表来完成创建或销毁.

The kernel’s thread table holds each thread’s registers, state, and other infor-
mation. The information is the same as with user-level threads, but now kept in the
kernel instead of in user space (inside the run-time system). This information is a
subset of the information that traditional kernels maintain about their single-
threaded processes, that is, the process state. In addition, the kernel also maintains
the traditional process table to keep track of processes.

内核的线程表保存每个线程的寄存器, 状态, 以及其他信息. 与用户级线程保存的信息一致, 只是保存在内核中.

这些信息是传统内核管理的单线程进程信息的子集. 内核也同样管理传统的进程表, 以用于跟踪进程.

All calls that might block a thread are implemented as system calls, at consid-
erably greater cost than a call to a run-time system procedure. When a thread
blocks, the kernel, at its option, can run either another thread from the same proc-
ess (if one is ready) or a thread from a different process. With user-level threads,
the run-time system keeps running threads from its own process until the kernel
takes the CPU away from it (or there are no ready threads left to run).

所有可能阻塞线程的调用都以系统调用的形式实现, 代价比调用运行时系统的过程高得多. 当某个线程阻塞时, 内核可以选择运行同一进程中的另一个线程 (如果有就绪的), 也可以运行另一个进程的线程. 而使用用户级线程时, 运行时系统会一直运行本进程中的线程, 直到内核把 CPU 从它手里拿走 (或者没有就绪线程可运行为止).

Due to the relatively greater cost of creating and destroying threads in the ker-
nel, some systems take an environmentally correct approach and recycle their
threads. When a thread is destroyed, it is marked as not runnable, but its kernel
data structures are not otherwise affected. Later, when a new thread must be creat-
ed, an old thread is reactivated, saving some overhead. Thread recycling is also
possible for user-level threads, but since the thread-management overhead is much
smaller, there is less incentive to do this.

由于在内核中创建和销毁线程的代价相对较高, 一些系统采取了 "环保" 的做法, 回收它们的线程.

当线程被销毁时, 把它标记为不可运行, 但其内核数据结构不受影响. 之后需要创建新线程时, 重新激活一个旧线程, 省下一些开销. 用户级线程也可以做线程回收, 但由于线程管理的开销本来就小得多, 这么做的动力不大

Kernel threads do not require any new, nonblocking system calls. In addition,
if one thread in a process causes a page fault, the kernel can easily check to see if
the process has any other runnable threads, and if so, run one of them while wait-
ing for the required page to be brought in from the disk. Their main disadvantage is
that the cost of a system call is substantial, so if thread operations (creation, termi-
nation, etc.) are common, much more overhead will be incurred.

内核线程不需要任何新的非阻塞系统调用. 另外, 如果某个线程引发了页错误, 内核可以很容易地检查该进程是否还有其他可运行的线程, 如果有, 就在等待所需页面从磁盘调入期间运行其中一个. 它们的主要缺点是系统调用的代价相当大, 所以如果线程操作 (创建, 终止等) 很频繁, 就会带来多得多的开销.

While kernel threads solve some problems, they do not solve all problems. For
example, what happens when a multithreaded process forks? Does the new proc-
ess have as many threads as the old one did, or does it have just one? In many
cases, the best choice depends on what the process is planning to do next. If it is
going to call exec to start a new program, probably one thread is the correct choice,
but if it continues to execute, reproducing all the threads is probably best.

内核线程依旧有一些未能解决的问题, 比如, 当多线程进程执行 fork 的时候, 会发生什么? 新的进程是否会像旧进程一样拥有同样多的线程呢? 还是只拥有一个呢? 在大多数情况下, 取决于进程将要做什么, 如果它将会调用 exec 执行一个新的程序, 当然只有一个好, 但是如果是继续运行的话, 则保留所有的线程则是最好的.

(PS: 在 linux posix 线程下, 默认是同样多的线程)

Another issue is signals. Remember that signals are sent to processes, not to
threads, at least in the classical model. When a signal comes in, which thread
should handle it? Possibly threads could register their interest in certain signals, so
when a signal came in it would be given to the thread that said it wants it. But what
happens if two or more threads register for the same signal? These are only two of
the problems threads introduce, and there are more.

另一个问题是信号. 信号是发给进程的, 而并非线程(至少在经典模型下). 当信号到达时, 那个线程来处理它呢? 可能线程会注册自己感兴趣的信号, 所以, 当信号到达, 会交由那个注册线程处理. 但是如果多个线程注册了同样的信号呢? 这仅仅是线程引入的其中两个问题.

2.2.6 Hybrid Implementations

Various ways have been investigated to try to combine the advantages of user-
level threads with kernel-level threads. One way is use kernel-level threads and
then multiplex user-level threads onto some or all of them, as shown in Fig. 2-17.
When this approach is used, the programmer can determine how many kernel
threads to use and how many user-level threads to multiplex on each one. This
model gives the ultimate in flexibility.

人们已经研究了多种途径来把用户级线程和内核级线程的优点结合起来. 一种方法是使用内核级线程, 然后把用户级线程复用到部分或全部内核线程之上, 如图 2-17 所示.

采用这种方法时, 程序员可以决定使用多少个内核线程, 以及在每个内核线程上复用多少个用户级线程. 这个模型提供了最大的灵活性.

With this approach, the kernel is aware of only the kernel-level threads and
schedules those. Some of those threads may have multiple user-level threads multi-
plexed on top of them. These user-level threads are created, destroyed, and sched-
uled just like user-level threads in a process that runs on an operating system with-
out multithreading capability. In this model, each kernel-level thread has some set
of user-level threads that take turns using it.

在这种方法下, 内核只知道内核级线程, 并且只调度它们. 其中一些内核线程之上可能复用了多个用户级线程. 这些用户级线程的创建, 销毁和调度, 就像在不支持多线程的操作系统上运行的进程中的用户级线程一样. 在这个模型中, 每个内核级线程都有一组轮流使用它的用户级线程.

内核线程和用户线程各有其优势, 用户线程效率更高, 但是操作系统不知情的情况下, 会产生许多逻辑上是多线程, 但物理上依旧是单线程才会产生的错误. 比如 信号, 中断. 而内核线程虽然相对效率低, 并且占用内核空间, 但是操作系统知晓是多线程, 与操作系统间有更多协作的空间.

看 MySQL 官方文档, 有关 InnoDB 数据管理的时候, 看到有提到 fsync

仔细查了一下, 才发现自己好像对这东西, 乃至文件系统都一无所知…

原地址 : http://blog.httrack.com/blog/2013/11/15/everything-you-always-wanted-to-know-about-fsync/

Everything You Always Wanted to Know About Fsync()

NOV 15TH, 2013

And then the developer wondered:

is my file properly sync’ed on disk ?

You probably know more or less how databases (or things that look like one) store their data on disk in a permanent and safe way. Or at least, you know the basic principles. Or not ?

你可能或多或少知道数据库如何以持久并安全的方式将数据存储到磁盘

最少, 你知道基础的原则, 或者不知道?

Being on AC(I)D

There are a bunch of concepts that first must be understood: what is atomicity, consistency, and durability ? These concepts apply on databases (see ACID), but also on the underlying filesystem.

这里有一系列必须理解的原则: 什么是 原子性, 一致性, 持久性?

这些原则不仅用于数据库, 同时也用于底层文件系统

  • Atomicity: a write operation is fully executed at once, and is not interleaved with another one (if, for example, someone else is writing to the same location)

    原子性 : 写操作一次彻底完成, 不会和另外一个写操作交错(其他写操作在同时更改同样的区域)

    Atomicity is typically guaranteed in operations involving filename handling ; for example, for rename, “specification requires that the action of the function be atomic” – that is, when renaming a file from the old name to the new one, at no circumstances should you ever see the two files at the same time.

    涉及文件名处理的操作通常保证原子性; 例如对于 rename, "规范要求该函数的动作必须是原子的".

    也就是说, 把文件从旧名字重命名为新名字时, 任何情况下你都不应该同时看到这两个文件

  • Consistency: integrity of data must be maintained when executing an operation, even in a crash event – for example, a power outage in the middle of a rename() operation shall not leave the filesystem in a “weird” state, with the filename being unreachable because its metadata has been corrupted. (ie. either the operation is lost, or the operation is committed.)

    一致性 : 执行操作时必须维护数据的完整性, 即使发生崩溃事件也一样. 比如 : rename() 操作进行到一半时断电, 不应让文件系统处于一种 "奇怪" 的状态, 例如因为元数据损坏而导致文件名无法访问 (即, 操作要么丢失, 要么被提交)

    (PS. meta ? 什么是 meta ? 根据 wiki 的解释 :

    Meta (from the Greek meta- μετά- meaning “after” or “beyond”) is a prefix used in English to indicate a concept that is an abstraction behind another concept, used to complete or add to the latter.

    ​ 元是一个概念背后的抽象概念, 用来完整前者

    以及有 about 语义, 两者相同, 前者关联后者, metaprogramming(writing programs manipulate programs) )

    Consistency is guaranteed on the filesystem level ; but you also need to have the same guarantee if you build a database on disk, for example by serializing/locking certain operations on a working area, and committing the transaction by changing some kind of generation number.

    一致性在文件系统层面上有保证; 但如果你在磁盘上构建一个数据库, 你也需要同样的保证, 例如对工作区上的某些操作进行序列化/加锁, 并通过修改某种代 (generation) 编号来提交事务

  • Durability: the write operation is durable, that is, unplugging the power cord (or a kernel panic, a crash…) shall not lose any data (hitting the hard disk with a hammer is however not covered!)

    ​ 写操作是耐久的, 也就是说, 拔掉电源线(或者内核错误, 崩溃)将不会丢失任何数据(用锤子砸硬盘当然不包含在其中!)( PS.挺喜欢有幽默感的作者 :) )

    This is an important one – at a given point, you must ensure that the data is actually written on disk physically, preventing any loss of data in case of a sudden power outage, for example. This is absolutely critical when dealing with a client/server architecture: the client may have its connection or transaction aborted at any time without troubles (ie. the transaction will be retried later), but once the server acknowledges it, no event should ever cause it to be lost (think of responsibility in a commercial transaction, or a digital signature, for example). For this reason, having the data committed in the internal system or hard disk cache is NOT durable for obvious reasons (unless there is a guarantee that no such power outage could happen – if a battery is used on a RAID array, for example).

    这一点很重要 : 在某个给定的时刻, 你必须确保数据确实已经物理写入了磁盘, 以防止例如突然断电造成的数据丢失

    在客户端/服务器架构中这一点绝对关键 : 客户端的连接或事务随时可以中止而不产生问题 (换句话说, 事务稍后重试即可), 但服务器一旦确认, 就不应再有任何事件导致它丢失 (考虑企业事务或数字签名中的责任)

    因为这个理由, 出于显而易见的原因, 数据只提交到系统内部缓存或硬盘缓存中并不算持久 (除非能保证不会发生类似断电的事件 - 比如, RAID 磁盘阵列用了蓄电池)

    On POSIX systems, durability is achieved through sync operations (fsync(), fdatasync(), aio_fsync()): “The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk.”. [Note: The difference between fsync() and fdatasync() is that the later does not necessarily update the meta-data associated with a file – such as the “last modified” date – but only the file data.]

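To make that concrete, here is a minimal POSIX sketch (the data.bin filename and the payload are placeholders I made up, not something from the article) that writes a buffer and then forces it to physical storage:

```c
/* Minimal durability sketch on POSIX: write a buffer, then force it to disk.
 * "data.bin" and the payload are placeholder names for illustration only. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char payload[] = "important record";
    int fd = open("data.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, payload, sizeof payload - 1) < 0) { perror("write"); return 1; }

    /* fsync() flushes the file data and its metadata (size, mtime, ...);
     * fdatasync() may skip metadata that is not required to read the
     * data back, saving an extra device write in some cases. */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}
```

As the rest of the article explains, this only covers the file's data blob; the directory entry that names the file is a separate story.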

Now that these concepts are a bit clearer, let’s go back to our filesystem!

Hey, What is a File, By The Way ?

If we want to simplify the concept, let’s consider the filesystem on POSIX platforms as a very simple flat storage manager, allowing us to read/write data blobs and basic properties (such as the modified time), indexed by an integer number (hint: they sometimes call that the inode number).

For example, you may want to read the file #4242’s data. And later, write some data on file #1234.

To have a more convenient way to handle files (because “I need to send you the presentation number 155324” would not be very convenient in the real world), we use the filename/directory concepts. A file has a name, and it is contained within a directory structure. You may put files and directories in a directory, building a hierarchical structure. But everything relies on our previous basic data blobs to store both the filename and the associated index.

As an example, reading the file foo/bar.txt (ie. the file bar.txt within the foo directory) will require accessing the data blob associated with the directory foo. After parsing this opaque data blob, the system will fetch the entry for bar.txt, and open the associated data blob. (And yes, there is obviously a root entry, storing references to first-level entries, allowing any file to be reached top-down)

If I now want to create a new file named foo/baz.txt, it will require the system to access the data blob associated with the directory foo, add an entry named baz.txt with a newly allocated index for the upcoming file, and write the updated directory blob back; from this point on, writes go to the newly allocated blob. The operation therefore involves two data structures: the directory entry, and the file itself.

Keeping My File(name) Safe

Let’s go back to our database problem: what is the impact of having two data structures for our files ?

Atomicity and consistency of filenames are handled for us by the filesystem, so this is not really a bother.

What about durability ?

(PS. I feel like this article keeps getting more and more long-winded = =)

We know that fsync() provides guarantees related to data and meta-data sync’ing. But if you look more closely at the specification, the only data involved is the data related to the file itself – not its directory entry. The “metadata” concept involves modified time, access time etc. – not the directory entry itself.

It would be cumbersome for a filesystem to provide this guarantee, by the way: on POSIX systems, you can have an arbitrary number of directory links to a filename (or to another directory entry). The most common case is one, of course. But you may delete a file being used (the file entry will be removed by the system when the file is closed) – the very reason why erasing a log file which is flooding a filesystem is a futile and deadly action – in such case, the number of links will be zero. And you may also create as many hard-links as you want for a given file/directory entry.

Therefore, in theory, you may create a file, write some data, synchronize it, close the file, and see your precious file lost forever because of a power outage. Oh, the filesystem must guarantee consistency, of course, but not durability unless explicitly asked by the client – which means that a filesystem check may find your directory entry partially written, and decide to achieve consistency by taking the previous directory blob entry, wiping the unreferenced file entry (note: if you are “lucky” enough, the file will be expelled in lost+found)

The filesystem can, of course, decide to be gentle, and commit all filename operations when fsync’ing. It may also, such as for ext3, commit everything when fsync’ing a file – causing the infamous and horrendous lags in firefox or thunderbird.

But if you need to have guarantees, and not just hope the filesystem “will be gentle”, and do not want to “trust the filesystem” (yes, someone actually told me that: you need to “trust the filesystem” – I swear it), you have to actually make sure that your filename entry is properly sync’ed on disk following the POSIX specification.

Oh, and by the way: according to POSIX, “The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk.”

But things are sometimes a bit obscure on the implementation side :

  • Linux/ext3: If the underlying hard disk has write caching enabled, then the data may not really be on permanent storage when fsync() / fdatasync() return. (do’h!)

  • Linux/ext4: The fsync() implementations in older kernels and lesser-used filesystems do not know how to flush disk caches. (do’h!) – an issue addressed quite recently

  • OSX: For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage (hey, fsync was supposed to do that, no ? guys ?) (Edit: no, fsync is actually not required to do that – thanks for the clarification Florent!)

But we may assume that on Linux with ext4 (and OSX with proper flags ?) the system is properly propagating write barriers.

On Windows, using FlushFileBuffers() is probably the way to go.
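
For completeness, a rough Win32 sketch of that approach (data.bin is again a made-up placeholder, and error handling is trimmed):

```c
/* Win32 sketch: write then flush through the file handle.
 * "data.bin" is a placeholder; error handling is trimmed for brevity. */
#include <windows.h>

int main(void)
{
    HANDLE h = CreateFileA("data.bin", GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    DWORD written = 0;
    WriteFile(h, "important record", 16, &written, NULL);

    /* Ask the OS to flush its buffers for this handle out to the device. */
    FlushFileBuffers(h);

    CloseHandle(h);
    return 0;
}
```

This mirrors the POSIX sequence above; the same kind of drive-level write-cache caveats can apply here too.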

(PS. A little embarrassing to admit, but there are parts of this I did not fully understand… One thing is clear though: it seems I really know nothing about filesystems…)

Syncing Filenames

I told you that a filesystem was actually a bunch of flat data blobs with associated metadata, and that a file had actually two parts: its directory entry (let’s assume there is only one directory entry for the sake of simplicity), and its actual data. We already know how to sync the latter; do we have a way to do the same for the directory container itself?

On POSIX, you may actually open a directory as if you were opening a file (hint: a directory is a file that contains directory entries). It means that open() may successfully open a directory entry. But on the other hand, you generally can not open a directory entry for writing (see the POSIX remark regarding EISDIR: “The named file is a directory and oflag includes O_WRONLY or O_RDWR”), and this is perfectly logical: by directly writing to the internal directory entry, you may be able to mess up the directory structure, ruining the filesystem consistency.

(PS. This “directory entry” term was driving me crazy. What exactly is it? An inode? Then why isn’t it simply called an inode/vnode? According to the answer at https://unix.stackexchange.com/questions/186992/what-is-directory-entry it is a structure that associates a filename with its directory.)

But can we fsync() written data using a file descriptor opened only for reading? The question is… yes, or at least “yes, it should” – even the POSIX group had editorial inconsistencies regarding fdatasync() and aio_fsync(), leading to incorrect behavior in various implementations. And the reason it should execute the operation is that requesting the completion of a write operation does not require actual write access – which has already been checked and enforced.
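
In code, that idea looks roughly like this (a sketch; the sync_directory() helper name is mine, and it assumes the O_DIRECTORY flag is available):

```c
/* Sketch: ask for completion of pending updates to a directory's entries
 * by fsync()'ing a read-only descriptor on the directory itself.
 * The helper name sync_directory() is made up for this example. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int sync_directory(const char *dirpath)
{
    int dfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dfd < 0) { perror("open(dir)"); return -1; }

    /* No write access is needed: we only request completion of directory
     * updates (new/renamed entries) already performed through other calls. */
    int rc = fsync(dfd);
    if (rc < 0) perror("fsync(dir)");

    close(dfd);
    return rc;
}
```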

On Windows… err, there is no clear answer. You can not call FlushFileBuffers() on a directory handle as far as I can see.

Oh, a last funny note: how do you sync the content of a symbolic link (and its related meta-data), that is, the filename pointed to by this link? The answer is… you can’t. Nope. This is not possible with the current standard (hint: you can not open() a symbolic link). Which means that if you handle some kind of database generation update based on symbolic links (ie. changing a “last-version” symlink to the latest built generation file), you have zero guarantee over durability.

Conclusion

Does it mean that we need to call fsync() twice, once on the file data, and once on its parent directory? When you need to achieve durability, the answer is obviously yes. (Remember that the file/filename will be sync’ed to disk by the operating system anyway, so you do not actually need to do that for every single file – only for those for which you want a durability guarantee at a given time)
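
Putting the two pieces together, a durable create of the foo/baz.txt example from earlier would look roughly like this sketch (error handling trimmed for readability):

```c
/* "fsync twice" sketch: sync the file data, then the parent directory,
 * so both the data blob and the directory entry naming it reach the disk.
 * Error handling is trimmed to keep the sequence readable. */
#include <fcntl.h>
#include <unistd.h>

void durable_create(void)
{
    /* 1. write and sync the file data itself */
    int fd = open("foo/baz.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd, "payload", 7);
    fsync(fd);
    close(fd);

    /* 2. sync the parent directory so the new filename entry is durable too */
    int dfd = open("foo", O_RDONLY);
    fsync(dfd);
    close(dfd);
}
```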

However, the question is causing some headache on the POSIX standard, and as a follow-up to the austin-group (ie. POSIX mailing-list) discussion, an editorial clarification request is still pending and is waiting for feedback from various implementors. (you may also have a look at the comp.unix.programmer discussion)

TL;DR: syncing a file is not as simple as it seems!