Innodb Fuzzy checkpointing woes
As you might know Innodb storage engines uses Fuzzy Checkpointing technique as part of it recovery strategy. It is very nice approach which means database never needs to “stall” to perform total modified pages flush but instead flushing of dirty pages happens gradually in small chunks so database load is very even.
This works great in most cases, however recently while running benchmarks with our partners we’ve observed very different behavior - Instead of being spread over time at certain time Innodb starts to flush pages very agressively seriously slowing down or even almost stalling for few minutes. Finally we were able to track down the bug.
这种方式大多数情况下都表现很好，然而最近我们的搭档运行基准时观察到一些不同的行为 -- Innodb 在某些时候很快就开始刷新内存页了而不是持续较长时间，这导致系统速度严重下降，甚至有时候延迟了好几分钟。最后我们只得跟踪了这个 bug。
Here is what is happening. Then planing fuzzy checkpoints Innodb assumes kind of uniform distribution of LSN (Log Sequence Number) for the pages which basically corresponds to age of unflushed pages. In such case current implementation works just fine - Innodb flush threads selects range of LSN to flush each round which is small enough.In certain workloads however (happily most of these are benchmarks)just after very large amount of pages are modified at once.In this case many pages fall in the same LSN range scheduled for flush and Innodb might end up flushing most of buffer pool pages.Now as the pages are flushed they get modified again, and again in very short period of time so process repeats itself.
There is one more related bug which makes the problem worse. Technically if there are too many dirty pages Innodb should start doing checkpoints more actively, this is if there are more than 70% of pages dirty. However if your database is small (ie only size of allocated buffer pool size) this might never happen. This is very unusual case of increasing your buffer pool size actually may decrease performance.
那么这时究竟发生了什么事情？那么，Innodb 的模糊检查点设计方案假定了内存页的 LSN (日志顺序号)统一发布，它跟未刷新的内存页的时长相对应。在这种情况下，当前的执行方式运行的挺好 - Innod 刷新线程在每个回合里都从 LSN 域中`选择一个足够小的内存页来刷新它。某些工作量（幸好大部分都是基准），在大批量的内存页被立刻修改之后。在这种情况下，许多内存页聚集在定期刷新的同一个 LSN 域中，并且 Innodb 可能停止刷新缓冲内存页。现在当内存内修改后，就要刷新，并且在很短的时间内这样的处理不停的重复。还有一个使问题更糟糕的相关 bug 。从技术上讲，如果存在很多脏内存页，Innodb 会更加活跃地检查点，通常是在脏内存页多于70%的情况下。但是，如果你的数据库很小（例如跟缓冲池大小一样），这样的情况就不会发生。增加缓冲池是很少见的做法，事实上这可能会降低性能。
What I would do to fix it I think fuzzy checkpointing should be dynamic and based at amount of log space free rather than LSN numbers. For example we may be flushing by N pages per round until log files are 50% full (the target can be different), then we increse the number of flushed pages progressively as log free space drops, making sure it never reaches 100%. We could also have low water value, ie 30% utilized at which we can stop checkpoint kind flushes, which would avoid uneeded IO.
我该做什么才能解决这个问题？我想模糊检查点应该是动态的而且是基于空闲日志空间的总量而不是 LSN 编号的。比如我们可能在每个回合里刷新N个内存页，直到日志文件满了50%（这个目标可以不一样），然后，我们逐渐增加刷新内存页数目让日志用掉剩余空间，要确保不能达到100%。我们也可以设定低一点的值，例如30%的利用率时停止使用检查点方法的刷新内存页，这能避免无谓的IO。
One more related problem Innodb has - too many hard coded numbers. For example Innodb will count number of IOs done to see how much disk IO system is loaded and compare it to constant. However the shared drive at shared MySQL virtual hosting and powerful RAID system may have very different performance properties. This means for example if you have RAID system which does 5000 IOs/sec and you have relatively idle interval of performing just 500 IOs/sec Innodb will not catch it as such and will not use it to flush dirty buffers aggressively and perform other needy tasks, which might decrease its performance from where it could be when peak load returns.
跟 Innodb 相关的另一个问题是 -- 太多的硬编号了。例如 Innodb 会计算 IO 总量来知道总共有多高的磁盘 IO 系统负载，并且把它和常量作比较。然而在共享MySQL虚拟机和强大的 RAID 系统中的共享驱动器性能表现可能大不相同。意思是如果有一个能做5000次IO/每秒的RAID系统并且有相关的定期间隔空闲的话，只能达到每秒500次IO，Innodb 就无法达到这个水平，也就无法使用它来集中地刷新脏缓冲和运行其它所需的任务，这样的话可能会降低性能。
This is actually the question I do not have good answer for. How can we detect how good IO subsystem do we have and how much load it can sustain. The problem becomes even worse as with SAN or simply shared disk available bandwidth may be veriable. Can we look at latency of submited requests Probbably but we need to consider caching by OS as well as disk volume itsef. Should we get utilization ratios from OS (ie iostat) - this is nice but needs to be done for all OSes which makes it less portable. Also utilization does not give us enough data - 100% “utiization” corresponds both to 1 or 20 outstanding requests which are very different. One however probably can combine this data with device queue sizes and average request execution time to get some decent values. Anyway it should be better than fixed constants.
对这个问题实际上我还没有好的答案。我们怎样检测IO子系统性能如何,还有它能承受多少负载呢？使用 SAN 或硬盘共享产生的带宽变化可能会导致问题更严重。我们可能看到潜在的提交请求，但是我们需要跟磁盘卷一样考虑操作系统的缓存。可以获得 OS 的利用率（比如 iostat），但是最好是让所有不方便使用它的操作系统也能支持。但是这个利用率并不能给我们足够的数据 -- 100% 的利用率可能对应 1 或者 20 个不等的不同未完成请求。然而一方面，我们可能可以把这个数据与设备排队尺寸以及平均请求执行时间结合起来以得到一些合适的数值。不管怎样，这比固定的常量好。
At very least I would like to get the control about this in my hands and being able to say this device can handle 1000 requests per second and the best would be to keep 16 outstanding requests at the time.The number of outstanding requests is also very important - if you do not submit enough concurrent requests IO subsystems will not be able to perform at their full capacity.One more thing to take into account.
至少，我想把这个控制在手中，并且有能力说这个设备可以每秒钟处理1000个请求，而且，最好的情况只有 16 个未完成请求。未完成请求的数量同样很重要 -- 如果你不提交足够的并发请求，IO 子系统就不能以最大载量运行。这个需要引起重视。
Now some good news. As I mentioned this Fuzzy checkpointing issue does not happen that frequently in real workloads. If you run into what you think is one please help us by commenting on this bug. Real word use case reports are best way to boost priority of such performance bugs.