XenServer Org / XSO-574

Storage stack is unstable in general in 7.0 pool on GC / coalesce



    • Type: Bug
    • Status: Done
    • Priority: Blocker
    • Resolution: Done
    • Affects Version/s: 7.0
    • Fix Version/s: None
    • Component/s: Storage
    • Labels:
    • Environment:

      Pool of 3 to 5 nodes


      Alas, I cannot provide many details, because we aborted the upgrade to 7.0 after encountering this for the second time, but I'll describe my findings in general.

      The GC/coalesce part of the storage stack in a 7.0 pool seems critically unstable on iSCSI/LVM SRs. SRs of the same type, with the same VMs, on the same SAN work perfectly with 6.5 (even after moving the SR itself from the 7.0 pool to a 6.5 pool). There are no error messages in dmesg on 7.0 or on the SAN (EqualLogic), so the SAN itself can be excluded from the equation.

      When GC/coalesce runs on 7.0, even on simple VDI chains (cross-pool migrated VMs without any snapshots), it tends to fail with various errors in SMlog: timeouts, operation failures, tapdisk pause/unpause failures, and LVM operation failures due to either locking or timeouts. After it fails, it leaves the system in an undefined state: stale LVM volumes such as coalesce_*, LVM volumes not renamed (while the VDI already carries the new LVM name), or renamed but not linked to any VDI. I also saw metadata errors on the SR, and VDIs became orphaned and lost.
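      One way to spot such leftovers after a failed coalesce is to compare the LV names in the SR's volume group against the VDI UUIDs XenServer still knows about. A minimal sketch, assuming the VHD-&lt;uuid&gt; and coalesce_* naming observed above (feed it the LV names from lvs and the VDI UUIDs from xe vdi-list yourself):

```python
import re

# Standard UUID pattern; LVHD volumes embed the VDI UUID in the LV name
# (VHD-<uuid> style naming is assumed here for illustration).
UUID_RE = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"

def find_stale_lvs(lv_names, vdi_uuids):
    """Flag coalesce_* leftovers and LVs whose UUID matches no known VDI."""
    stale = []
    for name in lv_names:
        if name.startswith("coalesce_"):
            stale.append((name, "leftover coalesce volume"))
            continue
        m = re.search(UUID_RE, name)
        if m and m.group(0) not in vdi_uuids:
            stale.append((name, "no matching VDI"))
    return stale
```

      This only reports suspects; deciding which LVs are safe to remove still requires checking the VHD chain by hand.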

      It can also leave the VHDs themselves in a bad state. What I saw included parent mismatches ('parent not found'), 'block clobbers footer', 'backup footer checksum mismatch', 'wrong cookie', etc. With a bit of manual tinkering, this terrible mess could be repaired so that the VHD chains were attachable to VMs and at least readable, but not writable. So it does damage metadata, but the data itself remains intact.
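      The 'wrong cookie' and checksum errors above are detectable directly from the on-disk VHD footer, which per Microsoft's VHD format specification is a 512-byte structure beginning with the cookie "conectix" and carrying a big-endian one's-complement checksum at offset 64. A sketch of such a check (not XenServer's actual code; the vhd-util tool shipped with XenServer does this properly):

```python
import struct

VHD_COOKIE = b"conectix"

def vhd_footer_checksum(footer: bytes) -> int:
    """One's-complement sum of the 512-byte footer with the checksum
    field (bytes 64..67) treated as zero, per the VHD format spec."""
    total = sum(footer[:64]) + sum(footer[68:])
    return (~total) & 0xFFFFFFFF

def check_footer(footer: bytes) -> list:
    """Return a list of problems resembling the SMlog errors seen here.
    The footer lives in the last 512 bytes of the file; dynamic VHDs
    keep a backup copy at the start of the file as well."""
    problems = []
    if len(footer) != 512:
        problems.append("short footer")
        return problems
    if footer[:8] != VHD_COOKIE:
        problems.append("wrong cookie")
    stored = struct.unpack(">I", footer[64:68])[0]
    if stored != vhd_footer_checksum(footer):
        problems.append("footer checksum mismatch")
    return problems
```

      When the primary footer is clobbered but the backup copy is intact, this is exactly the situation a footer repair tool can recover from without touching the data blocks.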

      I also saw the following condition: tapdisk for running VMs was accessing one part of the disk space while the LVM volumes had already been redefined. I inspected the LVM tables with dmsetup and backed up the LVM metadata to a text file; the data in the text file did not match the actual device-mapper tables, being either stale or ahead of the actual running configuration. Some SR metadata errors arose as well (GUIDs in the MGT volume did not match the actual LVM GUIDs), and I had to edit the MGT volume manually to make the VDIs reappear.
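      To check whether the live device-mapper tables agree with an LVM metadata backup, one can compare the mapped sector counts per LV. A rough sketch, assuming input in the shape of `dmsetup table` output (the per-LV sizes from the metadata backup must be extracted separately, e.g. from the vgcfgbackup text file):

```python
def parse_dmsetup_table(output: str) -> dict:
    """Parse `dmsetup table`-style output into {device: mapped_sectors}.
    Lines look like: 'VG-LV: 0 2097152 linear 8:16 2048'; a device with
    several segments gets its segment lengths summed."""
    sizes = {}
    for line in output.strip().splitlines():
        name, rest = line.split(":", 1)
        fields = rest.split()
        length = int(fields[1])  # fields: start, length, target, args...
        sizes[name] = sizes.get(name, 0) + length
    return sizes

def find_mismatches(dm_sizes: dict, metadata_sizes: dict) -> list:
    """LVs whose live dm table disagrees with the metadata backup."""
    out = []
    for lv, meta_len in metadata_sizes.items():
        live = dm_sizes.get(lv)
        if live is None:
            out.append((lv, "missing from dm tables"))
        elif live != meta_len:
            out.append((lv, "dm=%d metadata=%d sectors" % (live, meta_len)))
    return out
```

      Any mismatch reported here corresponds to the stale/forward state described above, where the on-disk metadata and the running configuration have diverged.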

      How to reproduce? Create a pool of, say, 3 nodes with an iSCSI SR. Then do cross-pool migrations one after another: many VMs at once, large VMs (which take 10-12 hours to migrate), SR-to-SR moves, with VMs running at the time, and wait for the background coalesce process to start a few times. It will eventually fail and badly damage the metadata. That's what it took for us to encounter this bug: ~8 VMs (two of about 500 GB and six of about 100 GB) migrated from a 6.5 pool to the 7.0 pool were enough to hit it.
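      For stress-testing, the migrations can be scripted. A sketch that merely emits the `xe vm-migrate` storage-motion command for each VM (the parameter names are assumed from the XenServer CLI of that era; verify against `xe help vm-migrate` on your own hosts before running anything):

```python
def migration_commands(vm_uuids, dest_master, dest_sr_uuid, password):
    """Build one cross-pool `xe vm-migrate` command line per VM.
    dest_master/dest_sr_uuid/password are placeholders for the target
    pool master address, destination SR UUID, and root password."""
    return [
        ("xe vm-migrate uuid={vm} live=true "
         "remote-master={m} remote-username=root remote-password={p} "
         "destination-sr-uuid={sr}").format(vm=vm, m=dest_master,
                                            p=password, sr=dest_sr_uuid)
        for vm in vm_uuids
    ]
```

      Running several of these concurrently against large VMs approximates the load pattern under which we hit the bug.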

      To be honest, I recommend recalling the current 7.0 release back into beta, because this will certainly cause data loss for many who try to use it.




            Alex/AT Alexey Asemov