Group commit and real fsync

Over the past few months I've seen a few cases of customers upgrading to MySQL 5.0 and hitting serious performance slowdowns, up to 10 times in certain cases. What surprised them most was that the problem was hardware and even OS specific: it could show up with one OS version but not with another. Even more interesting, performance may be dramatically affected by the --log-bin setting, which usually adds just a couple of percent of overhead. So what is going on?

Actually we're looking at two issues here which interleave in a funny way:


  • Group commit is broken in MySQL 5.0 if binary logging is enabled (as it enables XA)
  • Certain OS/hardware configurations still fake fsync(), delivering great performance at the cost of not being ACID

The first one can be tracked in this bug. In a nutshell, the problem is that a new feature, XA, was implemented in MySQL 5.0, and it did not work with the former group commit code. New group commit code, however, was never implemented. XA makes it possible to keep different transactional storage engines in sync with each other and with the binary log. XA is enabled whenever the binary log is enabled, which is why this issue is triggered by enabling the binary log. If the binary log is disabled, so is XA, and the old group commit code works just fine.
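To see why losing group commit hurts so much, here is a toy model in Python. It assumes every commit must be covered by an fsync() and that group commit lets one fsync() cover a whole batch of concurrent commits. The function name, the rate of 100 fsync() calls per second and the batch size of 10 are illustrative assumptions, not measurements from MySQL.

```python
def max_commits_per_sec(fsyncs_per_sec, commits_per_fsync):
    """Upper bound on durable commits per second in a toy model.

    fsyncs_per_sec    -- how many fsync() calls the disk can complete per second
    commits_per_fsync -- how many commits share one fsync(); 1 means no group commit
    """
    return fsyncs_per_sec * commits_per_fsync


# Hypothetical disk doing 100 real fsync() calls per second.
for label, batch in [("no group commit", 1), ("group commit, batch of 10", 10)]:
    print(f"{label}: at most {max_commits_per_sec(100, batch)} commits/sec")
```

In this model the gap between the two cases grows with the number of concurrent committers, which is one plausible way to end up with the slowdowns of up to 10 times mentioned earlier.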

The second one is interesting. Actually we would hear many more people screaming about this problem if the OS were honest with us. Happily for us, many OS/hardware pairs are still lying about fsync(). An fsync() call is supposed to place data on the disk securely, which, unless you have a battery-backed cache, gives you only 80-200 sequential fsync() calls per second, depending on your hard drive speed. With a fake fsync() the data is only written to the drive's memory and so can be lost if the power goes down. However, it gives a great performance improvement, and you might see 1000+ fsync() calls per second. So if your OS is not giving you a real fsync() you might not notice this bug. The performance degradation will still happen, but it will be much smaller, especially with large transactions.
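The 80-200 range can be sanity-checked with a rough model: assume that each real fsync() of the same spot on a rotating disk has to wait about one full platter rotation. That is a simplification that ignores seeks and controller behaviour, and the function name and RPM values below are only illustrative.

```python
def fsyncs_per_sec_upper_bound(rpm):
    """Rough ceiling: one durable rewrite per platter rotation."""
    return rpm / 60.0


for rpm in (5400, 7200, 10000):
    bound = fsyncs_per_sec_upper_bound(rpm)
    print(f"{rpm:>6} RPM -> about {bound:.0f} fsync() calls/sec at most")
```

Under this assumption, a single drive without a battery-backed cache that reports well over a thousand fsync() calls per second is probably not doing a real fsync().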

So how can you solve the problem?


  • Disable the binary log. This could be an option, for example, for slaves which do not need point-in-time recovery etc.
  • Check whether your OS is doing a real fsync(). You should know this anyway if you care about your data safety. This can be done, for example, by using SysBench: sysbench --test=fileio --file-fsync-freq=1 --file-num=1 --file-total-size=16384 --file-test-mode=rndwr. This will write and fsync the same page, and you should see how many requests/sec it is doing. You also might want to check diskTest from this page http://www.faemalia.net/mysqlUtils/ which does some extra tests for fsync() correctness.
  • Install RAID with a battery-backed cache. This gives about the same effect as a fake fsync(), but you can make it secure (however, make sure your drives are not caching data themselves). The good thing is that RAID with battery-backed cache is becoming really inexpensive.
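If SysBench is not at hand, the same kind of check can be sketched in a few lines of Python. This is a minimal sketch, not a replacement for SysBench or diskTest: it rewrites one 16384-byte page and calls fsync() after every write, then reports calls per second. The function name and its parameters are made up for illustration, and the directory passed in is assumed to live on the disk you want to test (the default temporary directory may be on a RAM-backed filesystem, which would make the result meaningless).

```python
import os
import tempfile
import time


def measure_fsync_rate(directory=None, duration=1.0, page_size=16384):
    """Rewrite the same page and fsync() it in a loop; return calls per second."""
    page = b"x" * page_size
    # Temporary file in the directory under test (None = system default).
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        calls = 0
        start = time.perf_counter()
        while time.perf_counter() - start < duration:
            os.lseek(fd, 0, os.SEEK_SET)  # always rewrite the same page
            os.write(fd, page)
            os.fsync(fd)                  # ask the OS to push it to the disk
            calls += 1
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    return calls / elapsed


if __name__ == "__main__":
    print(f"{measure_fsync_rate():.0f} fsync() calls/sec")
```

A result in the 80-200 range on a plain hard drive is consistent with a real fsync(), while 1000+ without a battery-backed cache suggests the data is only reaching the drive's memory. The number alone does not prove correctness either way, which is where a tool with extra fsync() correctness tests comes in.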

You probably also want to know whether this bug is going to be fixed. I'm not an authority on this question, but as Heikki describes it as a fundamental task, I'm not sure it will be done in 5.0. It would be good if it is done in 5.1.