Details
- Type: Bug
- Resolution: Done
- Priority: Critical
- Affects Version/s: None
- Fix Version/s: 6.5 SP1, 7.0
- Component/s: None
- Labels:
XS65E001
XS65E002
XS65E003
XS65E005
XS65E006
XS65E007
XS65E008
XS65E009
XS65E010
XS65E013
XS65E014
XS65E015
XS65E017
XS65E018
XS65ESP1
XS65ESP1002
XS65ESP1003
XS65ESP1004
XS65ESP1005
XS65ESP1008
XS65ESP1009
XS65ESP1010
XS65ESP1011
XS65ESP1012
XS65ESP1013
XS65ESP1014
XS65ESP1016
XS65ESP1018
XS65ESP1019
XS65ESP1020
XS65ESP1021
XS65ESP1022
XS65ESP1023
XS65ESP1024
XS65ESP1025
XS65ESP1026
XS65ESP1028
XS65ESP1029
XS65ESP1032
XS65ESP1033
Description
Hello,
We're being hit with this error showing up in messages:
Jul 19 16:45:19 aus-vhp5 tapdisk[9007]: ERROR: errno -12 at tapdisk_control_accept: failed to allocate new control connection
Jul 19 16:45:19 aus-vhp5 tap-ctl[10782]: tap-err:tap_ctl_read_raw: failure reading data 0/544
Jul 19 16:45:19 aus-vhp5 tap-ctl[10782]: tap-err:tap_ctl_send_and_receive: failed to receive 'pid' message
Jul 19 16:45:19 aus-vhp5 tapdisk[9058]: ERROR: errno -12 at tapdisk_control_accept: failed to allocate new control connection
Jul 19 16:45:19 aus-vhp5 tap-ctl[10782]: tap-err:tap_ctl_read_raw: failure reading data 0/544
Jul 19 16:45:19 aus-vhp5 tap-ctl[10782]: tap-err:tap_ctl_send_and_receive: failed to receive 'pid' message
Jul 19 16:45:20 aus-vhp5 tapdisk[9007]: ERROR: errno -12 at tapdisk_control_accept: failed to allocate new control connection
Jul 19 16:45:20 aus-vhp5 tap-ctl[10796]: tap-err:tap_ctl_read_raw: failure reading data 0/544
Jul 19 16:45:20 aus-vhp5 tap-ctl[10796]: tap-err:tap_ctl_send_and_receive: failed to receive 'pid' message
Jul 19 16:45:20 aus-vhp5 tapdisk[9058]: ERROR: errno -12 at tapdisk_control_accept: failed to allocate new control connection
Jul 19 16:45:20 aus-vhp5 tap-ctl[10796]: tap-err:tap_ctl_read_raw: failure reading data 0/544
Jul 19 16:45:20 aus-vhp5 tap-ctl[10796]: tap-err:tap_ctl_send_and_receive: failed to receive 'pid' message
This is happening roughly every 5 seconds. It appears tapdisk accepts a connection from a tap-ctl subprocess, errors out, and then loops.
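For what it's worth, errno -12 is -ENOMEM, i.e. the allocation of the new control connection is failing. To quantify how often the loop fires, a quick sketch that buckets the failures by minute (the /var/log/messages path is an assumption; feed it whatever syslog file the host uses):

```shell
# Sketch: bucket tapdisk control-connection failures by minute.
# Reads syslog-style lines on stdin, e.g.:
#   count_failures_per_minute < /var/log/messages
count_failures_per_minute() {
    grep 'failed to allocate new control connection' \
        | awk '{ print $1, $2, substr($3, 1, 5) }' \
        | sort | uniq -c
}
```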
On the VM side, we can see high %util and await for the VDI associated with the tapdisk:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
xvda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
xvdb 0.00 0.00 0.00 135.00 0.00 4.98 75.61 59.97 546.99 7.41 100.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 30.00 0.00 0.48 32.47 59.97 2469.30 33.33 100.00
dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
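xvdb and dm-2 are pegged at 100% utilization with await of roughly 550 ms and 2.5 s respectively. A small helper to pull saturated devices out of extended iostat output (a sketch; it assumes `iostat -x`-style rows with %util as the last column):

```shell
# Sketch: print device names whose %util (last column) is at or above
# a threshold (default 95). Expects `iostat -x`-style rows on stdin;
# header lines are skipped because their last field is not numeric.
saturated_devices() {
    threshold="${1:-95}"
    awk -v t="$threshold" '$NF ~ /^[0-9.]+$/ && $NF + 0 >= t { print $1 }'
}
```

Usage: `iostat -x 1 1 | saturated_devices 95`.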
Another user reported this precise issue in XSO-419, where it was determined that it is not fully covered by XS65ESP1029: it looks similar, but differs in some respects.
The storage here is backed by 4TB HGST NVMe drives, in case that has any influence on the behavior.
Let me know if you need additional information.
Edit: One additional, possibly coincidental, observation: this only seems to trigger on our RabbitMQ VMs. They're reasonably busy, with moderate to high I/O, but we've had higher-I/O VMs that did not trip this.
Edit2: Even after rebooting the affected VM, the old tapdisk process remains and continues to loop on its failures. At that point I can manually kill the process without any ill effects, as far as I can tell. Here's an strace of a stale tapdisk process looping:
[root@aus-vhp4 ~]# strace -p 26386
Process 26386 attached
select(24, [8 9 11 12 16 20 21 23], [], [], {0, 126999}) = 0 (Timeout)
fcntl(8, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(11, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(12, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(9, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(16, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
fcntl(20, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(21, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(23, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
select(24, [8 9 11 12 16 20 21 23], [], [], {1, 0}) = 1 (in [11], left {0, 999987})
accept(11, 0, NULL) = 15
close(15) = 0
sendto(14, "<27>Jul 19 17:09:41 tapdisk[2638"..., 122, MSG_DONTWAIT, NULL, 0) = 122
fcntl(8, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(11, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(12, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(9, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(16, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
fcntl(20, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(21, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(23, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
select(24, [8 9 11 12 16 20 21 23], [], [], {1, 0}) = 0 (Timeout)
msync(0x7f43a65a9000, 4096, MS_ASYNC) = 0
ftruncate(25, 116) = 0
fcntl(8, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(11, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(12, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(9, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(16, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
fcntl(20, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(21, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(23, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
select(24, [8 9 11 12 16 20 21 23], [], [], {0, 0}) = 1 (in [11], left {0, 0})
accept(11, 0, NULL) = 15
close(15) = 0
sendto(14, "<27>Jul 19 17:09:43 tapdisk[2638"..., 122, MSG_DONTWAIT, NULL, 0) = 122
fcntl(8, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(11, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(12, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(9, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(16, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
fcntl(20, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(21, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(23, F_GETFL) = 0x8002 (flags O_RDWR|O_LARGEFILE)
select(24, [8 9 11 12 16 20 21 23], [], [], {1, 0}) = 0 (Timeout)
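The trace is the same select/accept/close/sendto cycle repeating. To confirm a stale process is stuck in such a loop, one way is to capture a few seconds of strace output to a file and summarize syscall frequency; a sketch over strace's default output format:

```shell
# Sketch: count syscalls by name in saved strace output read from stdin,
# most frequent first. Lines that don't start with a syscall name
# (e.g. "Process 26386 attached") are ignored.
syscall_histogram() {
    sed -n 's/^\([a-z_0-9]*\)(.*/\1/p' | sort | uniq -c | sort -rn
}
```

Usage: `strace -p 26386 -o /tmp/trace.out & sleep 5; kill %1; syscall_histogram < /tmp/trace.out`.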
And the fds:
[root@aus-vhp4 ~]# ls -al /proc/26386/fd
total 0
dr-x------ 2 root root 0 Jul 19 16:53 .
dr-xr-xr-x 8 root root 0 Jun 15 15:20 ..
lrwx------ 1 root root 64 Jul 19 16:53 0 -> /dev/null
lrwx------ 1 root root 64 Jul 19 16:53 1 -> /dev/null
lrwx------ 1 root root 64 Jul 19 16:53 10 -> socket:[91954]
lrwx------ 1 root root 64 Jul 19 16:53 11 -> socket:[91955]
lrwx------ 1 root root 64 Jul 19 16:53 12 -> anon_inode:[eventfd]
l-wx------ 1 root root 64 Jul 19 16:53 13 -> /var/log/blktap/tapdisk.26386.log
lrwx------ 1 root root 64 Jul 19 16:53 14 -> socket:[92898]
lrwx------ 1 root root 64 Jul 19 16:53 16 -> /dev/xen/blktap-2/blktap7
lr-x------ 1 root root 64 Jul 19 16:53 17 -> /dev/zero
lrwx------ 1 root root 64 Jul 19 16:53 18 -> /dev/mapper/VG_XenStorage--dc7b3bcc--a56d--22a0--6f31--0297f544eb3a-VHD--5ad84812--9a58--43f3--b745--20a445c96e6b
lrwx------ 1 root root 64 Jul 19 16:53 19 -> /dev/shm/metrics/tap-26386-7
lrwx------ 1 root root 64 Jul 19 16:53 2 -> /dev/null
lrwx------ 1 root root 64 Jul 19 16:53 20 -> socket:[93507]
lrwx------ 1 root root 64 Jul 19 16:53 21 -> socket:[91977]
lr-x------ 1 root root 64 Jul 19 16:53 22 -> /dev/xen/gntdev
lrwx------ 1 root root 64 Jul 19 16:53 23 -> /dev/xen/evtchn
lrwx------ 1 root root 64 Jul 19 16:53 24 -> /dev/xen/gntdev
lrwx------ 1 root root 64 Jul 19 16:53 25 -> /dev/shm/vbd3-4-51728/io_ring
lrwx------ 1 root root 64 Jul 19 16:53 26 -> /dev/shm/vbd3-4-51728/statistics
lrwx------ 1 root root 64 Jul 19 16:53 3 -> /var/lock/sm/dc7b3bcc-a56d-22a0-6f31-0297f544eb3a/sr
lrwx------ 1 root root 64 Jul 19 16:53 4 -> /var/lock/sm/5ad84812-9a58-43f3-b745-20a445c96e6b/vdi
lrwx------ 1 root root 64 Jul 19 16:53 5 -> /var/lock/sm/dc7b3bcc-a56d-22a0-6f31-0297f544eb3a/sr
lr-x------ 1 root root 64 Jul 19 16:53 6 -> /sys/fs/cgroup/memory.pressure_level
l-wx------ 1 root root 64 Jul 19 16:53 7 -> /sys/fs/cgroup/cgroup.event_control
lrwx------ 1 root root 64 Jul 19 16:53 8 -> anon_inode:[eventfd]
lrwx------ 1 root root 64 Jul 19 16:53 9 -> socket:[92899]
As a note here, fds 8, 9, 10, 11, 12, 14, 20, 21 all show up as broken symlinks.
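Sockets and eventfds always show as dangling `socket:[inode]` / `anon_inode:[...]` links in /proc, since they have no filesystem path; the inode numbers can still be cross-referenced against /proc/net/unix (or `ss`) to check whether the peer end is gone. A small parser for the `ls -l` output, as a sketch:

```shell
# Sketch: extract socket inode numbers from `ls -l /proc/<pid>/fd`
# output on stdin, for lookup in /proc/net/unix or via `ss`.
socket_inodes() {
    sed -n 's/.*-> socket:\[\([0-9][0-9]*\)\]$/\1/p'
}
```

Usage: `ls -l /proc/26386/fd | socket_inodes | while read i; do grep " $i " /proc/net/unix; done`.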