Details

Type: Bug
Resolution: Won't Do
Priority: Blocker
Fix Version/s: None
Affects Version/s: 7.2
Component/s: Storage
Labels:
None
Environment:

XenServer 7.1, with all patches applied.

Running in a pool with 7 members, on IBM HS23 blades attached to an EMC VNX 5700 SAN through Brocade 8/16GB FC switches. Intel E5-2620 CPUs, 128GB RAM each.

The pool has access to 2 LUNs served from the same storage system and the same RAID set. All FC paths are up, and no errors are reported anywhere in the hardware.

2 x 1Gbps Cisco switches, all pool members synchronized through local NTP proxy.

Team:
- xs-storage
Internal JIRA Reference:
XSI-2

Description

Over the last month, some of the scheduled XOA snapshots have begun failing with an SR_BACKEND 82 error, and input/output errors have started appearing. These errors are limited to a specific SR (storage3-fastdrive2, UUID: 3edcf6a6-0d0f-fe34-56e7-bf6d468c6a0f).
Nothing special was done on this SR, and no changes were introduced into the system.

We have no clue what could have led to this (there were 2TB free on that drive, so it's not space related). Now, suddenly, things have come to a total halt. Any action on the SR fails: snapshots fail, and we can't add, delete, or do anything else ("The SR failed to complete the operation").

Running VMs are still running fine, and no corruption seems to have taken place. If a VM is shut down, it will not start again and fails with the same error.

Attempting to MOVE a VDI fails with the same error.

Currently the storage system is effectively frozen in its state. Running VMs keep running, but we can't shut them down, we can't MOVE the VDIs, and we can't perform any other operation on the SR.
THIS IS A SHOW-STOPPER THAT CAN TRASH SRs IN PRODUCTION ENVIRONMENTS.

We found that using Xackup to MIGRATE the machines to a different SR in the same pool (on the same SAN!) works, but this is a large SR and the migration takes forever. This is a production system, so we can't simply stop the world, though that is what we are doing, having no other choice.

2017-12-16 12:06:22,970 ERROR XenAdmin.Actions.AsyncAction [29] - The SR failed to complete the operation
The SR failed to complete the operation
2017-12-16 12:06:22,970 ERROR XenAdmin.Actions.AsyncAction [29] - The SR failed to complete the operation
2017-12-16 12:06:22,970 ERROR XenAdmin.Actions.AsyncAction [29] -    at XenAdmin.Actions.VMActions.VMDestroyAction.DestroyVM(Session session, VM vm, List`1 deleteDisks, IEnumerable`1 deleteSnapshots)
   at XenAdmin.Actions.VMSnapshotDeleteAction.Run()
   at XenAdmin.Actions.AsyncAction.RunWorkerThread(Object o)
2017-12-16 12:06:22,970 WARN Audit [29] - Operation failure: VMSnapshotDeleteAction: PROD: ic-vps-pool2: VM 4c8478a0-15f1-4741-870c-a49055ac352c (ic-svc-mailgw1): Pool 150f6a7a-cc1f-6ece-f30f-5148c72d58f0 (PROD: ic-vps-pool2): Deleting snapshot 'rollingSnapshot_20171005T190359Z_SNAPSHOT: Regular Clients_ic-svc-mailgw1'...
2017-12-16 12:06:22,985 ERROR XenAdmin.Actions.MultipleAction [19] - The SR failed to complete the operation
2017-12-16 12:06:22,985 ERROR XenAdmin.Actions.MultipleAction [19] - The SR failed to complete the operation
2017-12-16 12:06:22,985 ERROR XenAdmin.Actions.MultipleAction [19] - The SR failed to complete the operation
2017-12-16 12:06:22,985 ERROR XenAdmin.Actions.MultipleAction [19] - The SR failed to complete the operation

Other sample logs of the failure:

Dec 16 09:30:22 ic-blc1-vmsrv12 xapi: [error|ic-blc1-vmsrv12|11477483 ||backtrace] Raised Server_error(SR_BACKEND_FAILURE, [ non-zero exit; ; Traceback (most recent call last):#012 File "/opt/xensource/sm/LVMoHBASR", line 243, in #012 SRCommand.run(LVHDoHBASR, DRIVER_INFO)#012 File "/opt/xensource/sm/SRCommand.py", line 351, in run#012 sr = driver(cmd, cmd.sr_uuid)#012 File "/opt/xensource/sm/SR.py", line 147, in init#012 self.load(sr_uuid)#012 File "/opt/xensource/sm/LVMoHBASR", line 105, in load#012 LVHDSR.LVHDSR.load(self, sr_uuid)#012 File "/opt/xensource/sm/LVHDSR.py", line 199, in load#012 self._undoAllJournals()#012 File "/opt/xensource/sm/LVHDSR.py", line 1133, in _undoAllJournals#012 self._handleInterruptedCloneOps()#012 File "/opt/xensource/sm/LVHDSR.py", line 882, in _handleInterruptedCloneOps#012 self._handleInterruptedCloneOp(uuid, val)#012 File "/opt/xensource/sm/LVHDSR.py", line 919, in _handleInterruptedCloneOp#012 self._undoCloneOp(lvs, origUuid, baseUuid, clonUuid)#012 File "/opt/xensource/sm/LVHDSR.py", line 988, in _undoCloneOp#012 lvhdutil.inflate(self.journaler, self.uuid, baseUuid, fullSize)#012 File "/opt/xensource/sm/lvhdutil.py", line 179, in inflate#012 lvmCache.setSize(lvName, newSize)#012 File "/opt/xensource/sm/lvmcache.py", line 49, in wrapper#012 ret = op(self, args)#012 File "/opt/xensource/sm/lvmcache.py", line 136, in setSize#012 lvutil.setSize(path, newSize, (newSize < size))#012 File "/opt/xensource/sm/lvutil.py", line 546, in setSize#012 cmd_lvm([CMD_LVRESIZE, "-L", str(sizeMB), path], pread_func=util.pread)#012 File "/opt/xensource/sm/lvutil.py", line 157, in cmd_lvm#012 stdout = pread_func([os.path.join(LVM_BIN, lvm_cmd)] + lvm_args, args)#012 File "/opt/xensource/sm/util.py", line 182, in pread#012 raise CommandException(rc, str(cmdlist), stderr.strip())#012util.CommandException: Input/output error#012 ])
Dec 16 09:30:22 ic-blc1-vmsrv12 xapi: [error|ic-blc1-vmsrv12|11477483 ||backtrace] 1/1 xapi @ ic-blc1-vmsrv12 Raised at file (Thread 11477483 has no backtrace table. Was with_backtraces called?, line 0
Dec 16 09:30:22 ic-blc1-vmsrv12 xapi: [error|ic-blc1-vmsrv12|11477483 ||backtrace]
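
The xapi entry above packs a whole Python traceback into one syslog line, with "#012" standing in for each newline (the octal escape syslog daemons commonly use for control characters). For anyone trying to read the call chain, a minimal sketch along these lines may help. The helper names decode_syslog_escapes and extract_frames are hypothetical and are not part of XenServer or the sm scripts, and the sample string is only a short excerpt of the log above.

```python
import re


def decode_syslog_escapes(line):
    """Turn '#NNN' octal escapes (such as '#012' for newline) back into characters.

    Assumption: every '#' followed by three octal digits is an escape. Genuine
    text that happens to look like that would be altered too.
    """
    return re.sub(r"#([0-7]{3})", lambda m: chr(int(m.group(1), 8)), line)


def extract_frames(text):
    """Return (file, line_number, function) for each traceback frame in the text."""
    pattern = re.compile(r'File "([^"]+)", line (\d+), in (\S*)')
    return [(path, int(num), func) for path, num, func in pattern.findall(text)]


# Short excerpt of the xapi log line above (last frame and the final exception).
sample = (
    'Traceback (most recent call last):#012 '
    'File "/opt/xensource/sm/util.py", line 182, in pread#012 '
    'raise CommandException(rc, str(cmdlist), stderr.strip())#012'
    'util.CommandException: Input/output error#012'
)

decoded = decode_syslog_escapes(sample)
print(decoded)
for path, num, func in extract_frames(decoded):
    print("%s:%d in %s" % (path, num, func))
```

Run against the full log line, this should list the frames from LVMoHBASR down to util.py, which makes it easier to see that the failure surfaces in the lvresize call made while undoing an interrupted clone operation.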

Cannot take Snapshots - SR_BACKEND Failure 82 input/output