XenServer Org / XSO-793

Upgrade deletes wrong volume group if more than 25 SRs or paths

    Details

    • Type: Bug
    • Status: Done
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: 7.2
    • Fix Version/s: None
    • Component/s: Installer
    • Labels:
      None
    • Environment:

      XenServer 6.5 farm, upgrading to 7.2

      Many SRs attached through HBA FC.

      Many paths (4 each)

      Description

      After upgrading node after node, we saw storage attachment failing. "REPAIR" did not work.

      A pbd-plug told us that the required volume group was not found.

       

      A "pvs -v" revealed the whole disaster: some PVs were no longer members of any VG.

      Kind of screwed up, we needed another coffee and thought about what had gone wrong.

       

      Started looking through the install log:

      There are two VG removals in the log: first, the correct VG on the local disk is removed.

       

      But before recreating it, another vgremove is run - and it removes the VG on the first disk whose name starts with "sda", for example sdab.

       

      So we took a deep dive into the script (upgrade.py in installer.img under /opt/xenserver/installer/) and found the problem beginning at line 227, reaching its peak at line 231:

       

                      if storage_partnum > 0 and self.vgs_output:
                          storage_part = partitionDevice(primary_disk, storage_partnum)
                          rc, out = util.runCmd2(['pvs', '-o', 'pv_name,vg_name', '--noheadings'], with_stdout = True)
                          vgs_list = out.split('\n')
                          # BUG: substring test - primary_disk "sda" also matches "sdab", "sdac", ...
                          vgs_output_wrong = filter(lambda x: str(primary_disk) in x, vgs_list)
                          if vgs_output_wrong:
                              vgs_output_wrong = vgs_output_wrong[0].strip()
                              if ' ' in vgs_output_wrong:
                                  _, vgs_label = vgs_output_wrong.split(None, 1)
                                  util.runCmd2(['vgremove', '-f', vgs_label])
                          util.runCmd2(['vgcreate', self.vgs_output, storage_part])

       

      Just before the vgcreate there is another vgremove, run when special circumstances are true, but the check is done with the code "str(primary_disk) in x" - and of course the primary_disk string "sda" is contained in "sdab" and so on...

      Since there is no loop here, only the first disk whose name contains "sda" gets removed...
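
      The pitfall is easy to demonstrate in isolation: Python's `in` operator is a plain substring test, so "/dev/sda" matches every pvs line mentioning "/dev/sdab", "/dev/sdac", and so on. Below is a minimal sketch of a stricter filter; the helper name match_disk and the sample pvs lines are made up for illustration, they are not from the installer:

```python
import re

def match_disk(primary_disk, pvs_line):
    # Accept the line only when its device column is the primary disk
    # itself or one of its partitions (e.g. /dev/sda, /dev/sda3),
    # rejecting longer device names such as /dev/sdab1.
    parts = pvs_line.split()
    dev = parts[0] if parts else ''
    return re.match(r'^%s(p?\d+)?$' % re.escape(primary_disk), dev) is not None

lines = [
    '  /dev/sdab1 VG_XenStorage-other',
    '  /dev/sda3  VG_XenStorage-local',
]

# Buggy substring check matches BOTH lines - and picks the wrong one first:
buggy = [l for l in lines if '/dev/sda' in l]

# Stricter check keeps only the primary disk's own partitions:
fixed = [l for l in lines if match_disk('/dev/sda', l)]
```

      With the substring check, buggy[0] is the sdab line, which is exactly the VG that got removed in the field; the regex variant only matches /dev/sda3.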

      In our case that happened at each node - we stopped the upgrade early enough to prevent a big downtime, but it was not nice either.

       

       

      We resolved the resulting problems with vgcfgrestore - and we were extremely happy to have backups of the volume group metadata in /etc/lvm/backup...
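
      The recovery relies on the per-VG metadata backups LVM writes automatically. As a rough sketch of that step - the helper plan_vg_restores is ours and only builds the vgcfgrestore command lines for review, it does not run them:

```python
import glob
import os

def plan_vg_restores(backup_dir='/etc/lvm/backup', vgs=None):
    """Build (but do not run) one vgcfgrestore command per VG backup.

    LVM keeps one metadata backup file per volume group in backup_dir;
    restoring a VG is `vgcfgrestore -f <backup-file> <vg-name>`.
    """
    if vgs is None:
        # Default: one restore per backup file found on disk.
        vgs = sorted(os.path.basename(p)
                     for p in glob.glob(os.path.join(backup_dir, '*')))
    return [['vgcfgrestore', '-f', os.path.join(backup_dir, vg), vg]
            for vg in vgs]

# Review the commands before handing them to subprocess or a shell:
for cmd in plan_vg_restores(vgs=['VG_XenStorage-example']):
    print(' '.join(cmd))
```

      Printing the plan first and executing it by hand matches what we did: restore only the VGs we had verified against the backup files, nothing blindly.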

       

       

       

       

        Attachments

          Activity

            People

            • Assignee:
              Simon Crowe
            • Reporter:
              Daniel Benden
            • Votes:
              0
            • Watchers:
              4

              Dates

              • Created:
                Updated:
                Resolved: