Deep dive into postgres stats: pg_stat_bgwriter reports
Everything you always wanted to know about Postgres stats
Today I would like to make a little detour from the main series and dive into pg_stat_bgwriter. If you have been following my previous posts, you will remember that the pg_stat_bgwriter view holds summary statistics about the bgwriter and the checkpointer. Here I would like to show an interesting report query based on pg_stat_bgwriter. The original query was found in the Postgres mailing lists, shared by my colleague Victor Yegorov and slightly modified by me. The report provides comprehensive information about bgwriter and checkpointer activity and helps to configure them better.
A tiny recommendation: run this query with expanded output in psql. The report produces only one row and looks like this:
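If you just want to poke at the underlying view before running the full report, a minimal psql session looks like this (this is only an illustration of expanded output, not the report query itself):

```sql
-- Turn on expanded output so the single wide row is readable,
-- then look at the raw counters the report is built from.
\x on
SELECT * FROM pg_stat_bgwriter;
```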
The report consists of two parts separated by a horizontal dotted line: the first part is the report itself, and the second contains the raw values from pg_stat_bgwriter and the auxiliary items used in the report.
The first part is the more interesting one, and here is why.
Next is the information about checkpoints. «Forced checkpoint ratio» is the share of checkpoints that were triggered by xlog rather than by timeout. The current value is 7.6% and that is good enough. High values, for example more than 40%, indicate that xlog checkpoints occur too frequently. As you might remember, xlog checkpoints are less desirable than timed checkpoints, so the general idea is to reduce their number by increasing the amount of WAL required to trigger a checkpoint. «Minutes between checkpoints» is the time interval between checkpoints. When everything is ok, this value should be near checkpoint_timeout; values significantly lower than checkpoint_timeout are another sign of frequent xlog checkpoints. The general recommendation in both cases is to raise max_wal_size (or checkpoint_segments on 9.4 and older).
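As a rough sketch (not the original report query; the column aliases and the use of stats_reset as the interval are my assumptions), both values can be derived directly from pg_stat_bgwriter:

```sql
-- Forced checkpoint ratio: requested (xlog) checkpoints as a share of all checkpoints.
-- Minutes between checkpoints: time since stats_reset divided by the checkpoint count.
SELECT
    round(100.0 * checkpoints_req
          / nullif(checkpoints_timed + checkpoints_req, 0), 1) AS forced_checkpoint_ratio_pct,
    round((extract(epoch from now() - stats_reset) / 60)::numeric
          / nullif(checkpoints_timed + checkpoints_req, 0), 1) AS minutes_between_checkpoints
FROM pg_stat_bgwriter;
```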
Next are the average write and sync times, in seconds. The write time is near 13 minutes and that is a good value: it shows that the write stage of checkpoints was performed quickly enough despite the long interval between checkpoints. Values close to the interval between checkpoints mentioned above are not good; they indicate that the storage spent too much time writing out buffers. The average sync time should be near zero, and values far from zero indicate poor storage performance.
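Under the same assumptions, a sketch of the per-checkpoint averages, built from the accumulated checkpoint_write_time and checkpoint_sync_time counters (which are in milliseconds):

```sql
-- Average write and sync time per checkpoint, converted from milliseconds to seconds.
SELECT
    round(checkpoint_write_time::numeric / 1000
          / nullif(checkpoints_timed + checkpoints_req, 0), 1) AS avg_checkpoint_write_s,
    round(checkpoint_sync_time::numeric / 1000
          / nullif(checkpoints_timed + checkpoints_req, 0), 2) AS avg_checkpoint_sync_s
FROM pg_stat_bgwriter;
```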
The next items are informational: they tell us about the average throughput of the checkpointer, bgwriter and backends, and give additional information about the workload. «Total MB written» is simply the amount of data written by all sub-processes. «MB per checkpoint» is the average amount written per checkpoint. The following values are measured in MB/s and describe per-process throughput. In the example above the values are low, less than 1 MB/s, which means that the server does not have a huge amount of dirty data, or perhaps the report was built over a long stats interval and the throughput is averaged out across it.
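These size and throughput figures can be approximated from the buffer counters and the server's block_size (again a sketch; the aliases are mine and only the checkpointer throughput is shown):

```sql
-- Buffers written, converted to megabytes, and a rough average throughput since stats_reset.
SELECT
    round((buffers_checkpoint + buffers_clean + buffers_backend)
          * current_setting('block_size')::numeric / 1048576, 1) AS total_mb_written,
    round(buffers_checkpoint * current_setting('block_size')::numeric / 1048576
          / nullif(checkpoints_timed + checkpoints_req, 0), 1)   AS mb_per_checkpoint,
    round((buffers_checkpoint * current_setting('block_size')::numeric / 1048576)
          / nullif(extract(epoch from now() - stats_reset), 0)::numeric, 3) AS checkpointer_mb_per_s
FROM pg_stat_bgwriter;
```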
The «New buffer allocation ratio» field is the ratio of newly allocated buffers to all written buffers. When a backend needs data, it first checks whether that data is already in the shared buffers area. If not, the backend allocates a new buffer, loads the data from the main storage into shared buffers and then processes it (see the BufferAlloc() and StrategyGetBuffer() functions for details). Thus, a high number here tells us that backends allocated a lot of buffers because the required data was not present in shared buffers.
There are two possible reasons for this: the first is that the backends read rarely used, «cold» data, such as old archived partitions or something similar; the second is that previously used data had been evicted from shared buffers due to lack of space. That is not all, however: this number also shows how many times more data was read into shared buffers than was written out of them. This item goes well with the cache hit ratio, and a high allocation ratio combined with a low cache hit ratio can indicate insufficient shared buffers, though it is hard to know for sure.
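A possible way to look at both numbers side by side (a sketch; the exact formula in the original report may differ, and the cache hit ratio here is taken from pg_stat_database):

```sql
-- How many times more buffers were allocated (read in) than written out,
-- next to the overall cache hit ratio for comparison.
SELECT
    round(buffers_alloc::numeric
          / nullif(buffers_checkpoint + buffers_clean + buffers_backend, 0), 2)
        AS new_buffer_allocation_ratio,
    (SELECT round(100.0 * sum(blks_hit) / nullif(sum(blks_hit + blks_read), 0), 1)
       FROM pg_stat_database) AS cache_hit_ratio_pct
FROM pg_stat_bgwriter;
```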
The next set of values shows what percentage of buffers was cleaned by each of the sub-processes. A high «Clean by checkpoints» value is the sign of a write-intensive workload. A high «Clean by bgwriter» value points to a read workload. A high «Clean by backends» value is the sign that backends did a lot of the bgwriter's work, and that is not good: values above 50% indicate an ineffective bgwriter configuration, and in this case I would suggest trying to make the bgwriter more aggressive.
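These percentages map onto the buffers_checkpoint, buffers_clean and buffers_backend counters; a sketch of the calculation (aliases are mine):

```sql
-- Share of buffer writes done by the checkpointer, the bgwriter and the backends.
SELECT
    round(100.0 * buffers_checkpoint / nullif(total, 0), 1) AS clean_by_checkpoints_pct,
    round(100.0 * buffers_clean      / nullif(total, 0), 1) AS clean_by_bgwriter_pct,
    round(100.0 * buffers_backend    / nullif(total, 0), 1) AS clean_by_backends_pct
FROM (
    SELECT buffers_checkpoint, buffers_clean, buffers_backend,
           buffers_checkpoint + buffers_clean + buffers_backend AS total
    FROM pg_stat_bgwriter
) s;
```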
The next values are «Bgwriter halt-only length» and «Bgwriter halt ratio». They describe how often the bgwriter stopped its cleaning scan because it had exceeded bgwriter_lru_maxpages. The values in our example are perfect; high values, conversely, indicate that the bgwriter went into sleep mode too frequently and did not do its work fast enough. In this case, I would recommend configuring the bgwriter in a more aggressive way: decrease bgwriter_delay and increase bgwriter_lru_maxpages.
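The underlying counter is maxwritten_clean, and a «more aggressive» configuration could look roughly like this (illustrative values only, not a recommendation for every workload):

```sql
-- maxwritten_clean counts how many times the bgwriter stopped a cleaning scan
-- because it had already written bgwriter_lru_maxpages buffers in one round.
SELECT maxwritten_clean FROM pg_stat_bgwriter;

-- Example of a more aggressive bgwriter: shorter delay, more pages per round.
ALTER SYSTEM SET bgwriter_delay = '10ms';        -- default is 200ms
ALTER SYSTEM SET bgwriter_lru_maxpages = 1000;   -- default is 100
SELECT pg_reload_conf();
```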
The second part of the report contains the raw values from pg_stat_bgwriter and the configuration parameters related to the bgwriter and checkpointer. They are used in the report's query, so you don't need to check them with separate queries.
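In essence it boils down to something like this (a sketch; the original report may pull in more parameters):

```sql
-- Raw counters plus the relevant settings fetched with current_setting().
SELECT b.*,
       current_setting('checkpoint_timeout')    AS checkpoint_timeout,
       current_setting('max_wal_size')          AS max_wal_size,
       current_setting('bgwriter_delay')        AS bgwriter_delay,
       current_setting('bgwriter_lru_maxpages') AS bgwriter_lru_maxpages
FROM pg_stat_bgwriter b;
```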
Here I have prepared a few reports with my comments, built from different pgbench workloads:
- here is eight hours of read-only workload with 4GB shared buffers (default bgwriter)
- here is eight hours of read-only workload with 8GB shared buffers (default bgwriter)
- here is eight hours of read/write workload with 4GB shared buffers (default bgwriter)
- here is eight hours of read/write workload with 8GB shared buffers (default bgwriter)
- here is eight hours of read/write workload with 8GB shared buffers (aggressive bgwriter)
Tests were performed on a server with 16 CPUs, 32GB RAM, RAID1 on 2x SSD (data directory) and RAID1 on 2x SAS (WAL), running PostgreSQL 9.6.2 with a 96GB test database.
That is all for this time. I hope you enjoyed this post and found it helpful.