Details
-
Bug
-
Resolution: Done
-
Major
-
None
-
7.0
-
None
Description
In a pool of 2 servers (two R810s with 2 x 10G Ethernet (Intel X520-DA2), connected over NFS to an R510 on the same 10G Ethernet through a Dell 8024F 10G switch), I have had 2 crashes in 4 days since the last HA update. The first crash produced this log:
*WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x1a4/0x280()
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497092] NETDEV WATCHDOG: eth5 (ixgbe): transmit queue 4 timed out*
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497095] Modules linked in: tun nfsv4 nfs fscache bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc scsi_tgt openvswitch(O) gre libcrc32c 8021q garp mrp stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack iptable_filter dm_multipath ipmi_devintf dcdbas dm_mod coretemp crc32_pclmul aesni_intel aes_x86_64 ablk_helper cryptd lrw gf128mul glue_helper microcode psmouse lpc_ich mfd_core sg i7core_edac ipmi_si wmi ipmi_msghandler edac_core hed shpchp nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc nls_utf8 isofs ip_tables x_tables sr_mod cdrom ata_generic pata_acpi hid_generic usbhid hid sd_mod serio_raw ata_piix libata ehci_pci ixgbe(O) e1000e(O) ehci_hcd uhci_hcd ptp pps_core megaraid_sas(O) bnx2(O) scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh scsi_mod ipv6 autofs4
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497202] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 3.10.0+10 #1
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497204] Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.9.0 07/29/2013
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497207] 0000000000000009 ffff880183203d58 ffffffff81545307 ffff880183203d90
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497212] ffffffff81054da1 ffff88017a680000 0000000000000004 0000000000000000
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497215] ffff88017a675800 ffff88017a675780 ffff880183203df0 ffffffff81054e0c
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497219] Call Trace:
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497222] <IRQ> [<ffffffff81545307>] dump_stack+0x19/0x1b
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497240] [<ffffffff81054da1>] warn_slowpath_common+0x61/0x80
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497243] [<ffffffff81054e0c>] warn_slowpath_fmt+0x4c/0x50
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497248] [<ffffffff8149f914>] dev_watchdog+0x1a4/0x280
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497251] [<ffffffff8149f770>] ? dev_deactivate_queue.constprop.29+0x60/0x60
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497256] [<ffffffff81063cd3>] call_timer_fn+0x53/0x130
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497259] [<ffffffff8149f770>] ? dev_deactivate_queue.constprop.29+0x60/0x60
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497264] [<ffffffff810658fd>] run_timer_softirq+0x22d/0x290
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497272] [<ffffffff8105d48b>] __do_softirq+0xfb/0x240
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497277] [<ffffffff8155509c>] call_softirq+0x1c/0x30
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497287] [<ffffffff81014203>] do_softirq+0x43/0x80
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497289] [<ffffffff8105d6d9>] irq_exit+0x49/0xa0
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497299] [<ffffffff81384ca5>] xen_evtchn_do_upcall+0x35/0x50
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497302] [<ffffffff815550fe>] xen_do_hypervisor_callback+0x1e/0xa0
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497303] <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497311] [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497318] [<ffffffff8100a340>] ? xen_safe_halt+0x10/0x30
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497325] [<ffffffff8101a844>] ? default_idle+0x44/0xd0
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497328] [<ffffffff8101b038>] ? arch_cpu_idle+0x18/0x30
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497335] [<ffffffff810a3532>] ? cpu_startup_entry+0x1c2/0x280
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497343] [<ffffffff8152b442>] ? rest_init+0x72/0x80
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497352] [<ffffffff81ad6eee>] ? start_kernel+0x404/0x40f
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497355] [<ffffffff81ad68f3>] ? repair_env_string+0x5e/0x5e
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497358] [<ffffffff81ad65ee>] ? x86_64_start_reservations+0x2a/0x2c
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497362] [<ffffffff81ad9b48>] ? xen_start_kernel+0x531/0x53d
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497364] ---[ end trace f1167427101bfd71 ]---
Dec 3 03:52:20 xenserver-1 kernel: [1364416.497374] ixgbe 0000:0d:00.1 eth5: Fake Tx hang detected with timeout of 5 seconds
Dec 3 03:52:30 xenserver-1 kernel: [1364426.497042] ixgbe 0000:0d:00.1 eth5: Fake Tx hang detected with timeout of 10 seconds
Dec 3 03:52:50 xenserver-1 kernel: [1364446.496920] ixgbe 0000:0d:00.1 eth5: Fake Tx hang detected with timeout of 20 seconds
—
Dec 3 03:52:53 xenserver-1 squeezed: [debug|xenserver-1|3 ||xenops] watch /control/feature-balloon <- 1
Dec 3 03:52:54 xenserver-1 squeezed: [debug|xenserver-1|3 ||xenops] watch /control/feature-balloon <- 1
Dec 3 03:53:00 xenserver-1 squeezed: [debug|xenserver-1|3 ||xenops] watch /control/feature-balloon <- 1
Dec 3 03:53:01 xenserver-1 squeezed: [debug|xenserver-1|3 ||xenops] watch /control/feature-balloon <- 1
Dec 3 03:53:02 xenserver-1 squeezed: [debug|xenserver-1|3 ||xenops] watch /control/feature-balloon <- 1
*Dec 3 03:53:24 xenserver-1 xha[8287]: Watchdog is expiring soon id=1 label=statefile.*
*-> REBOOT HERE*
Dec 3 03:56:01 xenserver-1 systemd[1]: Started System Logging Service.
Dec 3 03:56:01 xenserver-1 systemd[1]: Reached target Basic System.
Dec 3 03:56:01 xenserver-1 systemd[1]: Starting Basic System.
Dec 3 03:56:01 xenserver-1 systemd[1]: Starting Dump dmesg to /var/log/dmesg...
Dec 3 03:56:01 xenserver-1 systemd[1]: Starting Set up crash environment...
Dec 3 03:56:01 xenserver-1 systemd[1]: Starting QLogic ESwitch Configuration...
Dec 3 03:56:01 xenserver-1 systemd[1]: Starting Move kernel messages to tty2...
The host rebooted without problems and, thanks to HA, no VMs were lost, but the secondary was also force-rebooted 20 minutes later.
Last crash:
Dec 6 17:23:28 xenserver-2 kernel: [306707.861263] IPv6: ADDRCONF(NETDEV_CHANGE): vif11.1: link becomes ready
Dec 6 17:23:28 xenserver-2 kernel: [306707.864790] vif vif-11-2 vif11.2: Guest Rx ready
Dec 6 17:23:28 xenserver-2 kernel: [306707.865266] IPv6: ADDRCONF(NETDEV_CHANGE): vif11.2: link becomes ready
Dec 6 17:23:38 xenserver-2 kernel: [306717.708268] device vif12.2 entered promiscuous mode
Dec 6 17:23:38 xenserver-2 kernel: [306717.820788] IPv6: ADDRCONF(NETDEV_UP): vif12.2: link is not ready
Dec 6 17:24:13 xenserver-2 kernel: [306752.789700] vif vif-9-1 vif9.1: Guest Rx stalled
Dec 6 17:24:19 xenserver-2 kernel: [306759.095789] block tdo: sector-size: 512/512 capacity: 209715200
Dec 6 17:24:20 xenserver-2 kernel: [306760.074132] vif vif-12-2 vif12.2: Guest Rx ready
Dec 6 17:24:20 xenserver-2 kernel: [306760.074718] IPv6: ADDRCONF(NETDEV_CHANGE): vif12.2: link becomes ready
*Dec 6 17:24:36 xenserver-2 kernel: [306775.783573] vif vif-11-1 vif11.1: Guest Rx stalled
Dec 6 17:24:44 xenserver-2 kernel: [306783.717686] vif vif-11-2 vif11.2: Guest Rx stalled*
Dec 6 17:42:49 xenserver-2 kernel: [ 0.000000] PAT configuration [0-7]: WB WT UC- UC WC WP UC UC
Dec 6 17:42:49 xenserver-2 kernel: [ 0.000000] Initializing cgroup subsys cpuset
Dec 6 17:42:49 xenserver-2 kernel: [ 0.000000] Initializing cgroup subsys cpu
Dec 6 17:42:49 xenserver-2 kernel: [ 0.000000] Initializing cgroup subsys cpuacct
Dec 6 17:42:49 xenserver-2 kernel: [ 0.000000] Linux version 3.10.0+10 (root@yeltha-4) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Sep 22 12:31:44 UTC 2016
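The silent period between the last "Guest Rx stalled" message and the first boot message can be measured directly from the timestamps. The following is only a rough sketch: the function names `parse_syslog_time` and `largest_gap` are made up for illustration, and the year is passed in by hand because the syslog prefix does not carry it (2016 is taken from the xha.log timestamps).

```python
from datetime import datetime

def parse_syslog_time(line, year=2016):
    """Parse the 'Dec 6 17:24:44' prefix of a syslog-style line.

    The year is an assumption supplied by the caller; it is not in the line.
    A leading '*' (emphasis marker from the report) is ignored.
    """
    month, day, clock = line.lstrip("* ").split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")

def largest_gap(lines, year=2016):
    """Return (gap, time_before, time_after) for the longest silence between
    consecutive log lines. Needs at least two lines."""
    times = [parse_syslog_time(line, year) for line in lines]
    gaps = [(later - earlier, earlier, later) for earlier, later in zip(times, times[1:])]
    return max(gaps)

if __name__ == "__main__":
    sample = [
        "Dec 6 17:24:36 xenserver-2 kernel: [306775.783573] vif vif-11-1 vif11.1: Guest Rx stalled",
        "Dec 6 17:24:44 xenserver-2 kernel: [306783.717686] vif vif-11-2 vif11.2: Guest Rx stalled",
        "Dec 6 17:42:49 xenserver-2 kernel: [ 0.000000] Initializing cgroup subsys cpuset",
    ]
    gap, before, after = largest_gap(sample)
    print(f"longest silence: {gap} (from {before} to {after})")
```

Run against the full kernel log, this makes it easy to see how long the host logged nothing before the boot messages appear.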
In xha.log:
Dec 06 17:40:43 CET 2016 [warn] SC: (script_service_do_query_liveset) reporting "heartbeat approaching timeout". host[1].time_since_last_hb=34522.
Dec 06 17:40:43 CET 2016 [warn] SC: (script_service_do_query_liveset) reporting "State-File approaching timeout". host[1].time_since_last_update_on_sf=36035.
Dec 06 17:41:07 CET 2016 [debug] SM: SF domain is updated [sfdomain = (1_)].
Dec 06 17:41:07 CET 2016 [debug] FH: Start fault handler.
Dec 06 17:41:11 CET 2016 [debug] HB: HB domain is updated [hbdomain = (m_)].
Dec 06 17:41:12 CET 2016 [debug] FH: HB/SF state has become stable.
Dec 06 17:41:12 CET 2016 [debug] FH: weight value [1] is commited.
Dec 06 17:41:13 CET 2016 [debug] FH: waiting for consistent view...
Dec 06 17:41:13 CET 2016 [warn] All hosts now have a consistent view.
Dec 06 17:41:13 CET 2016 [warn] HB domain = (10)
Dec 06 17:41:13 CET 2016 [warn] SF domain = (10)
Dec 06 17:41:13 CET 2016 [debug] FH: All hosts now have consistent view to the pool membership.
Dec 06 17:41:13 CET 2016 [debug] FH: I have won.
Dec 06 17:41:13 CET 2016 [debug] SM: Node (1) will be removed from liveset.
Dec 06 17:41:35 CET 2016 [notice] Liveset has been updated. new liveset = (10)
Dec 06 17:41:36 CET 2016 [debug] FH: End fault handler.
Dec 06 17:41:43 CET 2016 [info] SC: (script_service_do_query_liveset) "Heartbeat approaching timeout" turned FALSE
Dec 06 17:41:43 CET 2016 [info] SC: (script_service_do_query_liveset) "State-file approaching timeout" turned FALSE
Dec 06 17:43:23 CET 2016 [debug] HB: HB domain is updated [hbdomain = (m@)].
Dec 06 17:43:39 CET 2016 [debug] SM: SF domain is updated [sfdomain = (1@)].
Dec 06 17:43:39 CET 2016 [debug] Join Agent: Send ack to join request from host (1).
Dec 06 17:43:39 CET 2016 [info] Join Agent: Join request from host (1) is accepted by the local host.
Dec 06 17:43:39 CET 2016 [notice] Liveset has been updated. new liveset = (11)
Dec 06 17:43:39 CET 2016 [info] Join Agent: proposed_liveset = (11)
The only error is "vif vif-11-2 vif11.2: Guest Rx stalled", after which this server rebooted, but the primary did not.
No errors appear on the iDRAC or on the Dell 10G switch. The watchdog force-restarts the servers because of a timeout.
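To check whether the same symptoms show up elsewhere in the logs, the messages quoted in this report can be counted with a small script. This is only a sketch: the pattern names and the `count_signatures` helper are invented for illustration, the patterns are copied from the excerpts above, and the log path is whatever file you choose to feed it.

```python
import re
import sys
from collections import Counter

# Message patterns copied from the log excerpts in this report.
# The dictionary keys are illustrative labels, not official names.
SIGNATURES = {
    "netdev_watchdog": re.compile(r"NETDEV WATCHDOG: \S+ \(\S+\): transmit queue \d+ timed out"),
    "tx_hang": re.compile(r"Tx hang detected"),
    "guest_rx_stalled": re.compile(r"Guest Rx stalled"),
    "xha_watchdog": re.compile(r"Watchdog is expiring soon"),
}

def count_signatures(lines):
    """Count how many lines match each known symptom pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

if __name__ == "__main__":
    # Usage: python scan_log.py <path-to-log-file>
    with open(sys.argv[1], errors="replace") as handle:
        for name, count in count_signatures(handle).items():
            print(f"{name}: {count}")
```

Comparing the counts on both hosts would show whether the ixgbe transmit timeouts and the "Guest Rx stalled" messages occur on each of them or only on one.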