Waiting for PostgreSQL 11: Amcheck enhancement.
Check table’s indexed columns with amcheck.
This weekend, an interesting patch related to amcheck was committed. Note, amcheck is a contrib module which allow to verify the logical consistency of indexes’ structure.
commit 7f563c09f8901f6acd72cb8fba7b1bd3cf3aca8e
Author: Andres Freund <[email protected]>
Date: Sat Mar 31 19:52:01 2018 -0700
Add amcheck verification of heap relations belonging to btree indexes.
Add a new, optional, capability to bt_index_check() and
bt_index_parent_check(): check that each heap tuple that should have an
index entry does in fact have one. The extra checking is performed at
the end of the existing nbtree checks.
This is implemented by using a Bloom filter data structure. The
implementation performs set membership tests within a callback (the same
type of callback that each index AM registers for CREATE INDEX). The
Bloom filter is populated during the initial index verification scan.
Reusing the CREATE INDEX infrastructure allows the new verification
option to automatically benefit from the heap consistency checks that
CREATE INDEX already performs. CREATE INDEX does thorough sanity
checking of HOT chains, so the new check actually manages to detect
problems in heap-only tuples.
Author: Peter Geoghegan
Reviewed-By: Pavan Deolasee, Andres Freund
Discussion: https://postgr.es/m/CAH2-Wzm5VmG7cu1N-H=nnS57wZThoSDQU+F5dewx3o84M+jY=g@mail.gmail.com
As mentioned in the commit message, amcheck has obtained the ability to check heap tuples associated with index entries. Sounds great, because due to this, users have an option to check not only indexes, but indexed columns too. Let’s take a look on how it works in real life. For the next test I’ve created a test cluster with no data page checksums.
Let’s create test table with some data with primary key, it’s enough for our test.
$ create table products (id serial primary key, payload text);
$ insert into products (payload) select md5(random()::text) from generate_series(1,100);
With introduction of heapallindexed — the new boolean flag which should be passed into amcheck’s functions, checking the heap tuples becomes possible. It tells us about check heap tuples for each index entry. More detailed information about this option can be found in devel documentation.
$ select bt_index_check('products_pkey', true);
bt_index_check
—————-
(1 row)
No errors, as expected, but what will we see if we bring some corruptions into the heap data file? There are many ways to do this, and pg_hexdump is one of those. Here, I’ve opened table’s data file and just zeroed one value in the indexed column for one row. When it’s done, you would need to restart Postgres — this is required for re-initialization of shared buffers, because without that, all further readings from test table are made from shared buffers where table’s pages are still valid. After a restart the first reading of the table will be made directly from the disk. Now let’s take at the index on its own.
$ select bt_index_check('products_pkey');
bt_index_check
—————-
(1 row)
No errors again, and things look fine. As next step, let’s pass the heapallindexed flag and see what happens.
$ select bt_index_check('products_pkey', true);
ERROR: heap tuple (0,100) from table «products» lacks matching index tuple within index «products_pkey»
HINT: Retrying verification using the function bt_index_parent_check() might provide a more specific error.
Nice, amcheck has catched the corruption and even shows us information about the broken tuple.
I think, this patch is very important step on the way to creating built-in PostgreSQL tool for verification of data and for checking its consistency, so kudos to Peter Geoghegan and all the reviewers involved.