看 MySQL 官方文档, 有关 InnoDB 数据管理的时候, 看到有提到 fsync

仔细查了一下, 才发现自己好像对这东西, 乃至文件系统都一无所知…

原地址 : http://blog.httrack.com/blog/2013/11/15/everything-you-always-wanted-to-know-about-fsync/

Everything You Always Wanted to Know About Fsync()

NOV 15TH, 2013

And then the developer wondered:

is my file properly sync’ed on disk ?

You probably know more or less how databases (or things that look like one) store their data on disk in a permanent and safe way. Or at least, you know the basic principles. Or not ?

你可能或多或少知道数据库如何以持久并安全的方式将数据存储到磁盘

最少, 你知道基础的原则, 或者不知道?

Being on AC(I)D

There are a bunch of concepts that first must be understood: what is atomicity, consistency, and durability ? These concepts apply on databases (see ACID), but also on the underlying filesystem.

这里有一系列必须理解的原则: 什么是原子性, 一致性, 持久性?

这些原则不仅用于数据库, 同时也用于底层文件系统

Atomicity: a write operation is fully executed at once, and is not interleaved with another one (if, for example, someone else is writing to the same location)

原子性 : 写操作一次彻底完成, 不会和另外一个写操作交错(其他写操作在同时更改同样的区域)

Atomicity is typically guaranteed in operations involving filename handling ; for example, for rename, “specification requires that the action of the function be atomic” – that is, when renaming a file from the old name to the new one, at no circumstances should you ever see the two files at the same time.

涉及到文件名管理的操作通常是保存其原子性的, 比如, 重命名, 规则描述需要这个函数必须是原子的.

也就是说, 当重命名文件时, 不应该同时存在两个相同文件名的文件
Consistency: integrity of data must be maintained when executing an operation, even in a crash event – for example, a power outage in the middle of a rename() operation shall not leave the filesystem in a “weird” state, with the filename being unreachable because its metadata has been corrupted. (ie. either the operation is lost, or the operation is committed.)

一致性 : 当执行操作时, 数据的一致性必须被维护, 即使是在崩溃事件中, 比如 : 断电不会使文件系统处于因其元数据损坏, 文件名不可被检索到的奇怪状态(即, 操作要么丢失, 要么被提交)

(PS. meta ? 什么是 meta ? 根据 wiki 的解释 :

Meta (from the Greek meta- μετά- meaning “after” or “beyond”) is a prefix used in English to indicate a concept that is an abstraction behind another concept, used to complete or add to the latter.

元是一个概念背后的抽象概念, 用来完整前者

以及有 about 语义, 两者相同, 前者关联后者, metaprogramming(writing programs manipulate programs) )

Consistency is guaranteed on the filesystem level ; but you also need to have the same guarantee if you build a database on disk, for example by serializing/locking certain operations on a working area, and committing the transaction by changing some kind of generation number.

一致性在文件系统层面上确保, 但当在磁盘上构建数据库时, 也需要其确保一致性, 比如通过序列化/锁定某些在工作区域上的操作, 通过更改一些生成的数字提交事务
Durability: the write operation is durable, that is, unplugging the power cord (or a kernel panic, a crash…) shall not lose any data (hitting the hard disk with a hammer is however not covered!)

写操作是耐久的, 也就是说, 拔掉电源线(或者内核错误, 崩溃)将不会丢失任何数据(用锤子砸硬盘当然不包含在其中!)( PS.挺喜欢有幽默感的作者 :) )

This is an important one – at a given point, you must ensure that the data is actually written on disk physically, preventing any loss of data in case of a sudden power outage, for example. This is absolutely critical when dealing with a client/server architecture: the client may have its connection or transaction aborted at any time without troubles (ie. the transaction will be retried later), but once the server acknowledges it, no event should ever cause it to be lost (think of responsibility in a commercial transaction, or a digital signature, for example). For this reason, having the data committed in the internal system or hard disk cache is NOT durable for obvious reasons (unless there is a guarantee that no such power outage could happen – if a battery is used on a RAID array, for example).

在某种观点来看, 这是很重要的一点. 你必须确保数据正确地物理性地写入了磁盘, 防止断电而引发的数据丢失

应用与CS模型时, 相当关键: 客户端会在任意时刻无困难断开连接/事务(换句话说, 事务会在之后重试). 但是一旦服务器确认后, 就应该没有任何事件导致它被丢失(考虑企业事务或数字签名中的责任)

因为这个理由, 在内部系统或硬盘缓存中提交数据是不是可一致的(除非保证没有类似断电的行为会发生 - 比如, RDID 磁盘阵列用了蓄电池)

On POSIX systems, durability is achieved through sync operations (fsync(), fdatasync(), aio_fsync()): “The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk.”. [Note: The difference between fsync() and fdatasync() is that the later does not necessarily update the meta-data associated with a file – such as the “last modified” date – but only the file data.]

在 POSIX 系统, 持久性通过 sync 操作(fsync, fdatasync, aio_fsync) 获得 (PS. fsync 和 fdatasync 的区别在于, 后者不冲刷 metadata) fsync 函数故意强制物理性的缓冲的数据写入, 保证在系统崩溃或其他错误之后, 所有 fsync 调用时的数据都被记录在磁盘中

Now that these concepts are a bit clearer, let’s go back to our filesystem!

Hey, What is a File, By The Way ?

If we want to simplify the concept, let’s consider the filesystem on POSIX platforms as a very simple flat storage manager, allowing to read/write data blobs and basic properties (such as the modified time) indexed by an integer number (hint: they sometimes call that the inode number).

如果想要简化这个概念, 让我们考虑在 POSIX 平台上, 文件系统作为一个非常简单的扁平存储管理器, 允许读/写二进制数据和其基础属性(比如更改时间) 被一个整型数字索引 (有时被称作 inode 数字)

For example, you may want to read the file #4242’s data. And later, write some data on file #1234.

To have a more convenient way to handle files (because “I need to send you the presentation number 155324” would not be really convenient in the real world), we use the filename/directory concepts. A file has a name, and it is contained within a directory structure. You may put files and directories in a directory, building a hierarchical structure. But everything rely on our previous basic data blobs to store both filename and the associated index.

为了更方便地操作文件(因为 “我需要发送你描述编号为155324” 在真实世界中很不方便) 我们使用文件名/目录概念.

文件有一个名字, 被包含一个目录结构中, 你可以将文件和目录放到另外一个目录下, 构建一个层次的结构

但是, 所有都依赖于之前的基础数据集合存储文件名和相关联的索引

As an example, reading the file foo/bar.txt (ie. the file bar.txt within the foo directory) will require to access the data blob associated with the directory foo. After parsing this opaque data blob, the system will fetch the entry for bar.txt, and open the associated data blob. (And yes, there is obviously a root entry, storing references to first-level entries, allowing to access any file top-down)

If I now want to create a new file named foo/baz.txt, it will require the system to access the data blob associated with the directory foo, add an entry named baz.txt with a new allocated index for the upcoming file, and write the updated directory blob back, and from this point, write to the newly allocated blob. The operation therefore involves two data structures: the directory entry, and the file itself.

Keeping My File(name) Safe

Let’s go back to our database problem: what is the impact of having two data structures for our files ?

Atomicity and consistency of filenames are handled for us by the filesystem, so this is not really a bother.

What about durability ?

让我们回到数据库的问题(PS. 我觉得, 好像越说越多了 = =) : 如果我们的文件有两个数据结构的话, 会怎样?

文件名原子性和一致性由我们通过文件系统控制, 这不会造成什么麻烦, 但是持久性呢?

We know that fsync() provides guarantees related to data and meta-data sync’ing. But if you look closer to the specification, the only data involved are the one related to the file itself – not its directory entry. The “metadata” concept involves modified time, access time etc. – not the directory entry itself.

我们知道 fsync 提供相关文件和元数据的同步保证

但是深入了解规格, 唯一涉及的数据是和文件本身相关的数据, 不是它的目录入口

元数据概念包含更改时间, 访问时间等等, 但不包含目录入口

It would be cumbersome for a filesystem to provide this guarantee, by the way: on POSIX systems, you can have an arbitrary number of directory links to a filename (or to another directory entry). The most common case is one, of course. But you may delete a file being used (the file entry will be removed by the system when the file is closed) – the very reason why erasing a log file which is flooding a filesystem is a futile and deadly action – in such case, the number of links will be zero. And you may also create as many hard-links as you want for a given file/directory entry.

Therefore, in theory, you may create a file, write some data, synchronize it, close the file, and see your precious file lost forever because of a power outage. Oh, the filesystem must guarantee consistency, of course, but not durability unless explicitly asked by the client – which means that a filesystem check may find your directory entry partially written, and decide to achieve consistency by taking the previous directory blob entry, wiping the unreferenced file entry (note: if you are “lucky” enough, the file will be expelled in lost+found)

The filesystem can, of course, decide to be gentle, and commit all filename operations when fsync’ing. It may also, such as for ext3, commit everything when fsync’ing a file – causing the infamous and horrendous lags in firefox or thunderbird.

But if you need to have guarantees, and not just hope the filesystem “will be gentle”, and do not want to “trust the filesystem*” (yes, someone actually told me that: you *need to “trust the filesystem” – I swear it), you have to actually make sure that your filename entry is properly sync’ed on disk following the POSIX specification.

Oh, and by the way: according to POSIX, The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk.

But things are sometimes a bit obscure on the implementation side :

Linux/ext3: If the underlying hard disk has write caching enabled, then the data may not really be on permanent storage when fsync() / fdatasync() return. (do’h!)

如果底层磁盘允许写入缓存, 当 fsync/fdatasync 时数据可能不会真正的永久性存储
Linux/ext4: The fsync() implementations in older kernels and lesser used filesystems does not know how to flush disk caches. (do’h!) – issue adressed quite recently
OSX: For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage (hey, fsync was supposed to do that, no ? guys ?) (Edit: no, fsync is actually not required to do that – thanks for the clarification Florent!)

But we may assume that on Linux with ext4 (and OSX with proper flags ?) the system is properly propagating write barriers.

On Windows, using FlushFileBuffers() is probably the way to go.

(PS. 很惭愧, 有些我没有读懂… 不过有一件事情是明白的 : 我好像对文件系统真的一无所知… )

Syncing Filenames

I told you that a filesystem was actually a bunch of flat data blobs with associated metadata, and that a file had actually two parts: its directory entry (let’s assume there is only one directory entry for the sake of simplicity), and its actual data. We already know how to sync the later one ; do we have a way to do the same for the directory container itself ?

我已经告诉你, 文件系统实际上是一堆扁平的与元数据关联的数据块, 文件实际上有两部分 : 它的目录入口(为了简单起见, 假设只有一个目录入口), 以及它的实际数据, 我们已经知道如何同步后面的一部分, 我们有同样的刚发同步目录容器自身么?

On POSIX, you may actually open a directory as if you were opening a file (hint: a directory is a file that contains directory entries). It means that open() may successfully open a directory entry. But on the other hand, you generally can not open a directory entry for writing (see POSIX remark regarding EISDIR: The named file is a directory and oflag includes O_WRONLY or O_RDWR), and this is perfectly logical: by directly writing to the internal directory entry, you may be able to mess up with the directory structure, ruining the filesystem consistency.

在 POSIX 下, 你可能正常地像打开一个文件一样打开目录(目录是一个包含目录入口的文件)(PS. MD 我要疯了, 它说的 directory entries 到底是什么东西, 直译目录入口? 什么鬼东西啊, inode?那怎么不叫 inode/vnode ? 根据 google 出的解答 https://unix.stackexchange.com/questions/186992/what-is-directory-entry 那是一个目录和文件名相关的结构体)

But can we fsync() written data using a file descriptor opened only for reading ? The question is… yes, or at least “yes it should*” – even POSIX group had *editorial inconsistencies regarding fdatasync and aio_fsync(), leading to incorrect behavior on various implementations. And the reason it should execute the operation is because requesting the completion of a write operation does not have to require actual write access – which have already been checked and enforced.

On Windows… err, there is no clear answer. You can not call FlushFileBuffers() on a directory handle as far as I can see.

Oh, a last funny note: how do you sync the content of a symbolic link (and its related meta-data), that is, the filename pointed by this link ? The answer is… you can’t. Nope. This is not possible with the current standard (hint: you can not open() a symbolic link). Which means that if you handle some kind of database generation update based on symbolic links (ie. changing a “last-version” symlink to the latest built generation file), you have zero guarantee over durability.

Conclusion

Does it means that we need to call fsync() twice, one on the file data, and one on its parent directory ? When you need to achieve durability, the answer is obviously yes. (Remember that file file/filename will be sync’ed on disk anyway by the operating system, so you do not actually need to do that for every single file – only for those you want to have a durability guarantee at a given time)

However, the question is causing some headache on the POSIX standard, and as a follow-up to the austin-group (ie. POSIX mailing-list) discussion, an editorial clarification request is still pending and is waiting for feedback from various implementors. (you may also have a look at the comp.unix.programmer discussion)

TL;DR: syncing a file is not as simple as it seems!