Details
Bug
Resolution: Fixed
Major
7.3, 7.4, 7.5
None
We take snapshots of running VMs every night on a pool of 3 XenServer 7.2 hosts (fully patched) connected to 2 iSCSI SRs with multipath.
XSI-75
Description
This bug report is related to ------, XSO-837 and XSO-855. As I haven't received any constructive answer, I investigated the problem myself.
iSCSI SR volume group metadata is corrupted at random during backups (snapshots) on 3 different installations. Every time the metadata was corrupted, I noticed that the last process that wrote it was 'vgs' on one of the slave nodes (as recorded in /etc/lvm/backup/VG_XenStorage-ead98b75-7449-80e9-a54d-1b0a0c9449ac):
description = "Created *after* executing '/sbin/vgs VG_XenStorage-ead98b75-7449-80e9-a54d-1b0a0c9449ac'"
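To repeat this check on other hosts, the description line quoted above can be extracted with a few lines of Python. This is only a minimal sketch: the helper names `last_writer` and `written_by_vgs` are my own, and it assumes the backup file contains a `description = "..."` line in the same form as the one shown above.

```python
import re

# Matches the description line of an LVM metadata backup file, e.g.
# description = "Created *after* executing '/sbin/vgs VG_...'"
DESCRIPTION_RE = re.compile(r'^\s*description\s*=\s*"(.*)"\s*$', re.MULTILINE)
# Pulls the quoted command out of the description text.
COMMAND_RE = re.compile(r"executing '([^']*)'")


def last_writer(backup_text):
    """Return the command recorded in the backup text, or None if absent."""
    desc = DESCRIPTION_RE.search(backup_text)
    if desc is None:
        return None
    cmd = COMMAND_RE.search(desc.group(1))
    return cmd.group(1) if cmd else None


def written_by_vgs(backup_text):
    """Return True if the recorded command is an invocation of vgs."""
    cmd = last_writer(backup_text)
    if not cmd:
        return False
    # Compare only the executable name, ignoring its directory and arguments.
    return cmd.split()[0].rsplit("/", 1)[-1] == "vgs"
```

Reading the file under /etc/lvm/backup/ on each host and passing its contents to `written_by_vgs` shows whether that host's last recorded metadata write came from vgs.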
vgs should not write metadata in normal operation, but it can do so if it detects an anomaly. Because it is often called from the function '_checkVG' without holding any lock on the SR, it can corrupt the metadata if it writes while another LVM command is writing from another node.
There is an undocumented flag (absent from the manual, but listed in the command-line help) that prevents vgs from writing to the volume group: "--readonly".
Since I patched '/opt/xensource/sm/lvutil.py' with the following change:
184c184
< cmd_lvm([CMD_VGS, vgname])
---
> cmd_lvm([CMD_VGS, "--readonly", vgname])
I have had no more volume group metadata corruption!
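For readers who want to see the effect of the one-line patch outside of lvutil.py, the argument list it builds can be sketched on its own. The helper `vgs_check_command` is hypothetical and not part of the storage manager, and the value of `CMD_VGS` is an assumption taken from the path in the backup description; the real constant in lvutil.py may differ.

```python
# Assumed value; in lvutil.py CMD_VGS is defined by the storage manager itself.
CMD_VGS = "/sbin/vgs"


def vgs_check_command(vgname, readonly=True):
    """Build the argument list for a vgs check on a volume group.

    With readonly=True the list matches the patched call,
    [CMD_VGS, "--readonly", vgname]; with readonly=False it matches
    the original call, [CMD_VGS, vgname].
    """
    cmd = [CMD_VGS]
    if readonly:
        cmd.append("--readonly")
    cmd.append(vgname)
    return cmd
```

Keeping the flag as the default means any caller that only needs to check the volume group cannot trigger a metadata write by accident.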
This also explains why the corruption happens more frequently on pools with more than 2 nodes: the more nodes you have, the higher the risk that one of them calls vgs at the wrong time.
I think this bug exists in every version of XenServer (even the latest).
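The effect of pool size can be illustrated with a deliberately simple model. It is an assumption of mine, not a measurement: each node that may call vgs does so during another node's metadata write with the same small probability `p`, independently of the others. The function name `collision_risk` is hypothetical.

```python
def collision_risk(p, callers):
    """Probability that at least one of `callers` nodes runs vgs during a write.

    Assumes every caller collides independently with probability p,
    so the chance that nobody collides is (1 - p) ** callers.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if callers < 0:
        raise ValueError("callers must not be negative")
    return 1.0 - (1.0 - p) ** callers
```

Under this model the risk grows with every node added to the pool, which is consistent with what I observed, although the real value of `p` is unknown.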
Kind regards,
Nicolas Michaux