  1. XenServer Org
  2. XSO-822

Cannot take Snapshots - SR_BACKEND Failure 82 input/output

    Details

    • Type: Bug
    • Status: Done
    • Priority: Blocker
    • Resolution: Won't Do
    • Affects Version/s: 7.2
    • Fix Version/s: None
    • Component/s: Storage
    • Labels:
      None
    • Environment:

      Description

      Over the last month, some of our scheduled XOA snapshots have begun failing with an SR_BACKEND 82 error, and input/output errors have started appearing. These errors were limited to a specific SR (storage3-fastdrive2, UUID 3edcf6a6-0d0f-fe34-56e7-bf6d468c6a0f).
      Nothing special was done on this SR and no special changes were introduced into the system.

      We have no clue what could have led to this (there were 2 TB free on that drive, so it is not space related). Now, suddenly, things have come to a total halt. Any action on the SR fails: snapshots fail, and we can't add, delete, or do anything else ("The SR failed to complete the operation").

      Running VMs are still running fine and no corruption seems to have taken place. If a VM is shut down, it will not come back up: starting it fails with the same error.

      Attempting to MOVE a VDI fails with the same error.

      Currently the storage system is literally 'frozen' in its state. Running VMs keep running, but we can't shut them down because they would not start again. We can't MOVE the VDIs and we can't do anything else on the SR.
      THIS IS A SHOW-STOPPER THAT CAN TRASH SRs IN PRODUCTION ENVIRONMENTS.

      We found that using Xackup to MIGRATE the machines within the same pool to a different SR (on the same SAN!) works, but this is a large SR and migration takes forever. This is a production system, so we can't stop the world, though that is what we are doing, having no other choice.

       

      2017-12-16 12:06:22,970 ERROR XenAdmin.Actions.AsyncAction [29] - The SR failed to complete the operation
      The SR failed to complete the operation
      2017-12-16 12:06:22,970 ERROR XenAdmin.Actions.AsyncAction [29] - The SR failed to complete the operation
      2017-12-16 12:06:22,970 ERROR XenAdmin.Actions.AsyncAction [29] -    at XenAdmin.Actions.VMActions.VMDestroyAction.DestroyVM(Session session, VM vm, List`1 deleteDisks, IEnumerable`1 deleteSnapshots)
         at XenAdmin.Actions.VMSnapshotDeleteAction.Run()
         at XenAdmin.Actions.AsyncAction.RunWorkerThread(Object o)
      2017-12-16 12:06:22,970 WARN  Audit [29] - Operation failure: VMSnapshotDeleteAction: PROD: ic-vps-pool2: VM 4c8478a0-15f1-4741-870c-a49055ac352c (ic-svc-mailgw1): Pool 150f6a7a-cc1f-6ece-f30f-5148c72d58f0 (PROD: ic-vps-pool2): Deleting snapshot 'rollingSnapshot_20171005T190359Z_SNAPSHOT: Regular Clients_ic-svc-mailgw1'...
      2017-12-16 12:06:22,985 ERROR XenAdmin.Actions.MultipleAction [19] - The SR failed to complete the operation
      2017-12-16 12:06:22,985 ERROR XenAdmin.Actions.MultipleAction [19] - The SR failed to complete the operation
      2017-12-16 12:06:22,985 ERROR XenAdmin.Actions.MultipleAction [19] - The SR failed to complete the operation
      2017-12-16 12:06:22,985 ERROR XenAdmin.Actions.MultipleAction [19] - The SR failed to complete the operation

       

      Other sample logs of the failure:

       

      Dec 16 09:30:22 ic-blc1-vmsrv12 xapi: [error|ic-blc1-vmsrv12|11477483 ||backtrace] Raised Server_error(SR_BACKEND_FAILURE, [ non-zero exit; ; Traceback (most recent call last):
      #012 File "/opt/xensource/sm/LVMoHBASR", line 243, in #012 SRCommand.run(LVHDoHBASR, DRIVER_INFO)
      #012 File "/opt/xensource/sm/SRCommand.py", line 351, in run#012 sr = driver(cmd, cmd.sr_uuid)
      #012 File "/opt/xensource/sm/SR.py", line 147, in init#012 self.load(sr_uuid)
      #012 File "/opt/xensource/sm/LVMoHBASR", line 105, in load#012 LVHDSR.LVHDSR.load(self, sr_uuid)
      #012 File "/opt/xensource/sm/LVHDSR.py", line 199, in load#012 self._undoAllJournals()
      #012 File "/opt/xensource/sm/LVHDSR.py", line 1133, in _undoAllJournals#012 self._handleInterruptedCloneOps()
      #012 File "/opt/xensource/sm/LVHDSR.py", line 882, in _handleInterruptedCloneOps#012 self._handleInterruptedCloneOp(uuid, val)
      #012 File "/opt/xensource/sm/LVHDSR.py", line 919, in _handleInterruptedCloneOp#012 self._undoCloneOp(lvs, origUuid, baseUuid, clonUuid)
      #012 File "/opt/xensource/sm/LVHDSR.py", line 988, in _undoCloneOp#012 lvhdutil.inflate(self.journaler, self.uuid, baseUuid, fullSize)
      #012 File "/opt/xensource/sm/lvhdutil.py", line 179, in inflate#012 lvmCache.setSize(lvName, newSize)
      #012 File "/opt/xensource/sm/lvmcache.py", line 49, in wrapper#012 ret = op(self, args)
      #012 File "/opt/xensource/sm/lvmcache.py", line 136, in setSize#012 lvutil.setSize(path, newSize, (newSize < size))
      #012 File "/opt/xensource/sm/lvutil.py", line 546, in setSize#012 cmd_lvm([CMD_LVRESIZE, "-L", str(sizeMB), path], pread_func=util.pread)
      #012 File "/opt/xensource/sm/lvutil.py", line 157, in cmd_lvm#012 stdout = pread_func([os.path.join(LVM_BIN, lvm_cmd)] + lvm_args, args)
      #012 File "/opt/xensource/sm/util.py", line 182, in pread#012 raise CommandException(rc, str(cmdlist), stderr.strip())
      #012util.CommandException: Input/output error#012 ])
      Dec 16 09:30:22 ic-blc1-vmsrv12 xapi: [error|ic-blc1-vmsrv12|11477483 ||backtrace] 1/1 xapi @ ic-blc1-vmsrv12 Raised at file (Thread 11477483 has no backtrace table. Was with_backtraces called?, line 0
      Dec 16 09:30:22 ic-blc1-vmsrv12 xapi: [error|ic-blc1-vmsrv12|11477483 ||backtrace]
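
      For anyone reading the raw log: syslog writes each newline inside a message as the octal escape #012, which is why the Python traceback from the storage manager arrives as one long xapi line. Below is a minimal sketch for unfolding such a line. It assumes the escapes always take the #ooo form (three octal digits) seen in the log above. The function name unfold_syslog is hypothetical, and any literal '#' followed by three octal digits in a message would also be rewritten, so treat the output as a reading aid only.

```python
import re

# Matches rsyslog-style control-character escapes such as #012 (newline)
# or #011 (tab): a '#' followed by exactly three octal digits.
_ESCAPE = re.compile(r"#([0-7]{3})")


def unfold_syslog(line: str) -> str:
    """Replace each #ooo octal escape with the character it encodes."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), line)


if __name__ == "__main__":
    # Short excerpt taken from the xapi error line quoted in this report.
    raw = (
        'Traceback (most recent call last):'
        '#012 File "/opt/xensource/sm/util.py", line 182, in pread'
        '#012 raise CommandException(rc, str(cmdlist), stderr.strip())'
        '#012util.CommandException: Input/output error#012'
    )
    print(unfold_syslog(raw))
```

      Read this way, the last frames show that the failure comes from the lvresize call made while the SR is loading and undoing an interrupted clone operation, not from the snapshot request itself.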

            People

            • Assignee: chandrikas Chandrika Srinivasan
            • Reporter: interconnect Avi Bluestein
            • Votes: 1
            • Watchers: 8
