Deep dive into postgres stats: pg_stat_database
Everything you always wanted to know about Postgres stats
You can find a full description of the view’s columns in the official documentation, so here I will focus on the types of problems it helps us solve:
- Cache hit ratio.
- Commit ratio.
- Database anomalies.
- Load distribution.
Cache hit ratio. For cache hit ratio, the blks_hit and blks_read counters are used. Note that both are bigint, so plain division would be truncated to an integer; casting to numeric gives a precise ratio. Here is a per-database query:
SELECT
  datname, round(100 * blks_hit::numeric / (blks_hit + blks_read), 2) as cache_hit_ratio
FROM pg_stat_database WHERE (blks_hit + blks_read) > 0;
or a summary across all databases:
SELECT
  round(100 * sum(blks_hit) / sum(blks_hit + blks_read), 3) as cache_hit_ratio
FROM pg_stat_database;
The sweet spot here is values close to 100: it means that almost all the necessary data was read from shared buffers. Values near 90 show that postgres reads from disk from time to time, and values below 80 indicate an insufficient amount of shared buffers or physical RAM: the data required by the most frequently called queries doesn’t fit into memory, so postgres has to read it from disk. It’s not too bad if this data is in the OS page cache; if it isn’t there, it’s a bad scenario. The basic remedy is to increase the amount of shared buffers, and a good starting point is 25% of the available RAM. When all databases are able to fit in RAM, a good starting point is to allocate 80% of the available RAM.
Note that when postgres has just been restarted and is actively filling the buffer cache, it shows a low cache hit ratio, and this is normal behaviour.
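Keep in mind that these counters accumulate since the last statistics reset, so old numbers can mask recent changes. If you want to observe the ratio over a recent interval, you can zero the counters for the current database first:

```sql
-- Reset the statistics counters of the current database
-- (requires superuser privileges; all counters start from zero afterwards).
SELECT pg_stat_reset();
```

After a reset, let the workload run for a while before drawing conclusions, since a freshly reset cache hit ratio behaves just like one after a restart.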
Commit ratio. Commit ratio is similar to cache hit ratio, but it shows the share of successful operations instead. As is well known, changes made by transactions may be aborted (rolled back) or committed. If a rollback isn’t properly handled by an application, it can be considered an error. Also, single queries that fail outside of transactions are accounted as rollbacks. So, in general, the commit ratio lets us estimate the amount of errors in a particular database. Commit ratio is calculated from the xact_commit and xact_rollback counters, using queries like the ones above. Here is an example which shows per-database results:
SELECT
  datname, round(100 * xact_commit::numeric / (xact_commit + xact_rollback), 2) as commit_ratio
FROM pg_stat_database WHERE (xact_commit + xact_rollback) > 0;
Result values vary between 0 and 100. Values close to 100 mean that your database has very few errors. When the commit ratio is below 90, a good idea is to configure proper logging, check the logs for errors, build a list of the most frequent ones, and eliminate them step by step.
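As a sketch of such a logging setup, the following postgresql.conf settings make failed statements visible in the server log (the exact thresholds are a matter of taste, and these two settings are in fact the defaults in recent Postgres versions):

```
# postgresql.conf: make errors and the statements that caused them visible
log_min_messages = warning
log_min_error_statement = error
```

With this in place, every query that ends in a rollback due to an error leaves a trace in the log, which is exactly what you need to build the error list mentioned above.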
Database anomalies. Another useful capability of pg_stat_database is anomaly detection. Anomalies are unwanted events occurring in databases, and this stats view provides information about rollbacks, recovery conflicts, deadlocks and temporary files. All these events are unwanted, and if there are too many of them, you should pay attention and try to eliminate their sources.
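All of these counters live in pg_stat_database itself, so they can be pulled with one query; here is a sketch:

```sql
-- Per-database anomaly counters: rollbacks, recovery conflicts,
-- deadlocks and temporary file usage.
SELECT
  datname,
  xact_rollback,
  conflicts,
  deadlocks,
  temp_files,
  pg_size_pretty(temp_bytes) AS temp_size
FROM pg_stat_database
WHERE datname IS NOT NULL;
```

A monitoring system that graphs the deltas of these counters makes a sudden burst of rollbacks or deadlocks immediately visible.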
As mentioned above, rollbacks aren’t only aborted transactions: failed queries count too, so check the logs to understand what caused the errors. Another type of anomaly is recovery conflicts, the situation when queries running on standbys are cancelled. From the user’s perspective it looks like a failed query, and when it fails, additional details are written to the postgres logs. A good place to start an investigation is the pg_stat_database_conflicts view; there may be various reasons for conflicts to occur, and this view allows us to understand the exact cause.
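The view breaks conflicts down by cause, one counter per reason; a sketch of the query (the columns are only incremented on standby servers):

```sql
-- Recovery conflicts on a standby, broken down by cause.
SELECT
  datname,
  confl_tablespace,  -- conflicts due to dropped tablespaces
  confl_lock,        -- conflicts due to lock timeouts
  confl_snapshot,    -- conflicts due to old snapshots
  confl_bufferpin,   -- conflicts due to pinned buffers
  confl_deadlock     -- conflicts due to deadlocks
FROM pg_stat_database_conflicts;
```

For example, a high confl_snapshot count hints that long-running standby queries collide with vacuum cleanup on the primary, which is typically addressed by tuning hot_standby_feedback or max_standby_streaming_delay.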
Another issue is deadlocks: two or more transactions have locked resources and are trying to obtain resources already locked by the other transactions. In a normal situation, xact A locks resource A and xact B locks resource B. In a deadlock situation, xact A additionally tries to lock resource B while xact B tries to lock resource A; each is put on hold waiting for a resource held by the other, so neither can proceed. When a deadlock occurs, postgres cancels one of the involved transactions and removes it from the waiting queue so that the other xacts can continue their work.
Load distribution. And to finish, here are a few words on load distribution. This metric isn’t as important as the metrics mentioned above; however, it is sometimes useful when you need to understand the kind of workload in your database. Is your workload write- or read-intensive? The following counters answer this question: tup_returned, tup_fetched, tup_inserted, tup_updated, tup_deleted. They reveal how many row operations were performed by queries. With your favorite monitoring system, which can make graphs using these numbers, you can quickly track load spikes or collapses and react appropriately.
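As a rough sketch, the read and write counters can be folded into two numbers per database, which makes the read/write split obvious at a glance:

```sql
-- Rough read vs. write workload split per database.
SELECT
  datname,
  tup_returned + tup_fetched AS rows_read,
  tup_inserted + tup_updated + tup_deleted AS rows_written
FROM pg_stat_database
WHERE datname IS NOT NULL;
```

As with the other counters, the absolute values matter less than their rate of change, so it is the graphed deltas that tell you whether the workload is shifting.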
To summarise, the pg_stat_database view contains general information about databases and the events that occurred in them, so tracking numbers from this view is a good starting point that allows you to assess whether everything is fine with your databases. Of course, this view alone is insufficient, and additional resources such as other stats views and logs should be used. In my next post I will continue to explain other stats views and how to use them efficiently.