LVM Cache woops

2020-04-23 2-minute read

At May First, disk i/o has been our most serious bottleneck for many years. We have plenty of RAM, disk space and even CPU.

But when too much data is being written to our spinning disks, everything grinds to a halt.

As we have been adding SSD disks to our servers, we’ve recently begun experimenting with adding SSD-backed lvm caches. This approach has had a tremendous impact - resolving most of our disk i/o problems.

However, this morning we rebooted one of the virtual guests that uses such a cache and I almost had a heart attack:

0 claudette:~# mount /home
mount: special device /dev/mapper/vg_claudette0-home does not exist
32 claudette:~# lvs vg_claudette0/home
  LV   VG            Attr       LSize   Pool             Origin       Data%  Meta%  Move Log Cpy%Sync Convert
  home vg_claudette0 Cwi---C--- 309.00g [home_cachepool] [home_corig]                                        
0 claudette:~# ls /dev/mapper/
control             vg_claudette0-swap_1  vg_claudette0-var
vg_claudette0-root  vg_claudette0-tmp     vg_claudette0-var+lib+mysql
0 claudette:~# 

Wah! lvm ate our data!

Let’s remove the cache and return to the way it was:

0 claudette:~# lvconvert --uncache vg_claudette0/home
  /usr/sbin/cache_check: execvp failed: No such file or directory
  Check of pool vg_claudette0/home_cachepool failed (status:2). Manual repair required!
  Failed to active cache locally vg_claudette0/home.
5 claudette:~#

Wah! That doesn’t work either!

Let’s repair:

0 claudette:~# lvconvert --repair vg_claudette0/home_cachepool
  Using default stripesize 64.00 KiB.
  Operation not permitted on cache pool LV vg_claudette0/home_cachepool.
  Operations permitted on a cache pool LV are:
  --splitcache    (operates on cache LV)

5 claudette:~#

What is happening!!

I booted into a live rescue disk with a more modern version of lvm that really should support the --repair option:

0 debirf-rescue:~# lvconvert --repair vg_claudette0/home_cachepool
  /dev/vg_claudette0/lvol1: not found: device not cleared
  Aborting. Failed to wipe start of new LV.
  WARNING: If everything works, remove vg_claudette0/home_cachepool_meta0 volume.
  WARNING: Use pvmove command to move vg_claudette0/home_cachepool_cmeta on the best fitting PV.
0 debirf-rescue:~#

Help! Help!

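Wait… thanks to a cool-headed colleague, it turns out the only problem was that thin-provisioning-tools was not installed on the host. That package provides the cache_check binary that the failed execvp call in the --uncache attempt was looking for.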
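
In hindsight, a small pre-reboot check could have caught this. The following is only a sketch, not something we actually ran: check_cache_tools is a hypothetical helper name, and it simply reports which of the named binaries are missing from the PATH.

```shell
# Hypothetical helper (not part of lvm): report any of the named
# binaries that cannot be found in PATH.
# Prints "missing: NAME" for each one and returns 1 if any are missing.
check_cache_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example (run as root, since cache_check lives in /usr/sbin):
#   check_cache_tools cache_check cache_repair
```

If it reports cache_check as missing, installing thin-provisioning-tools and then activating the volume again (for example with lvchange -ay vg_claudette0/home followed by mount /home) should presumably be enough, though the exact recovery commands are an assumption and are not recorded in this post.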

Feel free to review the whole fiasco as it unfolded.