After upgrading node after node, we saw storage fail to attach, and "REPAIR" did not help.
A pbd-plug reported that the required volume group could not be found.
A "pvs -v" revealed the whole disaster: some PVs were no longer members of any VG.
Somewhat rattled, we got another coffee and thought about what had gone wrong.
We started looking through the install log:
The log shows a VG removal at two points: first, the correct VG on the local disk is removed.
But before it is recreated, another vgremove is run - and that one removes the VG on the first disk whose name starts with sda, for example sdab.
So we took a deep dive into the script (upgrade.py in the installer.img under /opt/xenserver/installer/) and found the problem beginning at line 227, peaking at line 231:
    if storage_partnum > 0 and self.vgs_output:
        storage_part = partitionDevice(primary_disk, storage_partnum)
        rc, out = util.runCmd2(['pvs', '-o', 'pv_name,vg_name', '--noheadings'], with_stdout = True)
        vgs_list = out.split('\n')
        vgs_output_wrong = filter(lambda x: str(primary_disk) in x, vgs_list)
        vgs_output_wrong = vgs_output_wrong.strip()
        if ' ' in vgs_output_wrong:
            _, vgs_label = vgs_output_wrong.split(None, 1)
            util.runCmd2(['vgremove', '-f', vgs_label])
        util.runCmd2(['vgcreate', self.vgs_output, storage_part])
Just before the vgcreate, another vgremove is run under certain circumstances, but the check is done with "str(primary_disk) in x" - and of course the primary-disk string "sda" is also contained in "sdab" and so on...
Since the result is not looped over here, only the first disk containing "sda" gets its VG removed...
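The flaw is easy to reproduce in plain Python: substring containment on device names matches more disks than intended. The sketch below uses made-up pvs output and a hypothetical helper (not the actual patched installer code) to show a stricter per-field comparison that would avoid the false match:

```python
# Simulated `pvs -o pv_name,vg_name --noheadings` output: the second PV
# ("/dev/sdab") merely *contains* the primary disk's name as a substring.
pvs_out = "  /dev/sda3  VG_XenStorage-local\n  /dev/sdab  VG_XenStorage-other\n"

primary_disk = "/dev/sda"

# The installer's check: plain substring containment over whole lines.
buggy_matches = [line for line in pvs_out.splitlines() if primary_disk in line]
# "/dev/sda" is a substring of "/dev/sdab", so BOTH lines match here.

def is_on_disk(pv_name, disk):
    # Hypothetical stricter check: the PV must be the disk itself or one of
    # its partitions (disk name followed only by digits, e.g. /dev/sda3).
    return pv_name == disk or (pv_name.startswith(disk) and pv_name[len(disk):].isdigit())

safe_matches = [line for line in pvs_out.splitlines()
                if is_on_disk(line.split()[0], primary_disk)]

print(len(buggy_matches))  # 2 - /dev/sdab is caught as well
print(len(safe_matches))   # 1 - only /dev/sda3
```

Note that this simple partition check would need extending for device names like /dev/nvme0n1p3, where the partition suffix is not purely numeric; it only illustrates why matching on the parsed pv_name field beats matching on the whole line.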
In our case this happened on each node - we stopped the upgrade early enough to prevent major downtime, but it was not pretty either.
We resolved the resulting problems with vgcfgrestore - and were extremely happy to have backups of the volume group metadata in /etc/lvm/backup...