  1. XenServer Org
  2. XSO-658

Host in pool crashes due to network/watchdog


Details

    • Bug
    • Resolution: Done
    • Major
    • None
    • 7.0
    • Networking, Storage
    • None

    Description

      In a pool of 2 servers (2 Dell R810, each with 2 Intel X520-DA2 10G Ethernet, connected over NFS to an R510 with the same 10G Ethernet through a Dell 8024F 10G switch), I have had 2 crashes in 4 days since the last HA update. The first crash produced this log:

      *WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x1a4/0x280() 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497092] NETDEV WATCHDOG: eth5 (ixgbe): transmit queue 4 timed out *
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497095] Modules linked in: tun nfsv4 nfs fscache bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc scsi_tgt openvswitch(O) gre libcrc32c 8021q garp mrp stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack iptable_filter dm_multipath ipmi_devintf dcdbas dm_mod coretemp crc32_pclmul aesni_intel aes_x86_64 ablk_helper cryptd lrw gf128mul glue_helper microcode psmouse lpc_ich mfd_core sg i7core_edac ipmi_si wmi ipmi_msghandler edac_core hed shpchp nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc nls_utf8 isofs ip_tables x_tables sr_mod cdrom ata_generic pata_acpi hid_generic usbhid hid sd_mod serio_raw ata_piix libata ehci_pci ixgbe(O) e1000e(O) ehci_hcd uhci_hcd ptp pps_core megaraid_sas(O) bnx2(O) scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh scsi_mod ipv6 autofs4 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497202] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O 3.10.0+10 #1 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497204] Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.9.0 07/29/2013 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497207] 0000000000000009 ffff880183203d58 ffffffff81545307 ffff880183203d90 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497212] ffffffff81054da1 ffff88017a680000 0000000000000004 0000000000000000 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497215] ffff88017a675800 ffff88017a675780 ffff880183203df0 ffffffff81054e0c 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497219] Call Trace: 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497222]  <IRQ> [<ffffffff81545307>] dump_stack+0x19/0x1b 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497240] [<ffffffff81054da1>] warn_slowpath_common+0x61/0x80 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497243] [<ffffffff81054e0c>] warn_slowpath_fmt+0x4c/0x50 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497248] [<ffffffff8149f914>] dev_watchdog+0x1a4/0x280 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497251] [<ffffffff8149f770>] ? dev_deactivate_queue.constprop.29+0x60/0x60 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497256] [<ffffffff81063cd3>] call_timer_fn+0x53/0x130 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497259] [<ffffffff8149f770>] ? dev_deactivate_queue.constprop.29+0x60/0x60 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497264] [<ffffffff810658fd>] run_timer_softirq+0x22d/0x290 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497272] [<ffffffff8105d48b>] __do_softirq+0xfb/0x240 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497277] [<ffffffff8155509c>] call_softirq+0x1c/0x30 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497287] [<ffffffff81014203>] do_softirq+0x43/0x80 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497289] [<ffffffff8105d6d9>] irq_exit+0x49/0xa0 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497299] [<ffffffff81384ca5>] xen_evtchn_do_upcall+0x35/0x50 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497302] [<ffffffff815550fe>] xen_do_hypervisor_callback+0x1e/0xa0 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497303]  <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497311] [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497318] [<ffffffff8100a340>] ? xen_safe_halt+0x10/0x30 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497325] [<ffffffff8101a844>] ? default_idle+0x44/0xd0 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497328] [<ffffffff8101b038>] ? arch_cpu_idle+0x18/0x30 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497335] [<ffffffff810a3532>] ? cpu_startup_entry+0x1c2/0x280 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497343] [<ffffffff8152b442>] ? rest_init+0x72/0x80 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497352] [<ffffffff81ad6eee>] ? start_kernel+0x404/0x40f 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497355] [<ffffffff81ad68f3>] ? repair_env_string+0x5e/0x5e 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497358] [<ffffffff81ad65ee>] ? x86_64_start_reservations+0x2a/0x2c 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497362] [<ffffffff81ad9b48>] ? xen_start_kernel+0x531/0x53d 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497364] ---[ end trace f1167427101bfd71 ]--- 
      Dec  3 03:52:20 xenserver-1 kernel: [1364416.497374] ixgbe 0000:0d:00.1 eth5: Fake Tx hang detected with timeout of 5 seconds 
      Dec  3 03:52:30 xenserver-1 kernel: [1364426.497042] ixgbe 0000:0d:00.1 eth5: Fake Tx hang detected with timeout of 10 seconds 
      Dec  3 03:52:50 xenserver-1 kernel: [1364446.496920] ixgbe 0000:0d:00.1 eth5: Fake Tx hang detected with timeout of 20 seconds 

      Dec  3 03:52:53 xenserver-1 squeezed: [debug|xenserver-1|3 ||xenops] watch /control/feature-balloon <- 1 
      Dec  3 03:52:54 xenserver-1 squeezed: [debug|xenserver-1|3 ||xenops] watch /control/feature-balloon <- 1 
      Dec  3 03:53:00 xenserver-1 squeezed: [debug|xenserver-1|3 ||xenops] watch /control/feature-balloon <- 1 
      Dec  3 03:53:01 xenserver-1 squeezed: [debug|xenserver-1|3 ||xenops] watch /control/feature-balloon <- 1 
      Dec  3 03:53:02 xenserver-1 squeezed: [debug|xenserver-1|3 ||xenops] watch /control/feature-balloon <- 1 
      *Dec  3 03:53:24 xenserver-1 xha[8287]: Watchdog is expiring soon id=1 label=statefile.*   <- REBOOT HERE (before the next line)
      Dec  3 03:56:01 xenserver-1 systemd[1]: Started System Logging Service.
      Dec  3 03:56:01 xenserver-1 systemd[1]: Reached target Basic System. 
      Dec  3 03:56:01 xenserver-1 systemd[1]: Starting Basic System. 
      Dec  3 03:56:01 xenserver-1 systemd[1]: Starting Dump dmesg to /var/log/dmesg... 
      Dec  3 03:56:01 xenserver-1 systemd[1]: Starting Set up crash environment... 
      Dec  3 03:56:01 xenserver-1 systemd[1]: Starting QLogic ESwitch Configuration... 
      Dec  3 03:56:01 xenserver-1 systemd[1]: Starting Move kernel messages to tty2... 

      The host rebooted without problems and no VMs were lost thanks to HA, but the secondary was also forcibly rebooted 20 minutes later.
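
To check whether the same pattern shows up in other kernel logs, a small script along these lines could pull out the watchdog and Tx hang messages. This is only a sketch: the two regular expressions are derived from the messages quoted in the log above, and the function name `find_tx_events` is hypothetical.

```python
import re

# Patterns derived from the kernel messages quoted in this report.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (\S+) \((\w+)\): transmit queue (\d+) timed out"
)
TX_HANG_RE = re.compile(
    r"(\S+): Fake Tx hang detected with timeout of (\d+) seconds"
)


def find_tx_events(lines):
    """Return (kind, interface, detail) tuples for watchdog and Tx hang lines."""
    events = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            events.append(("watchdog", match.group(1), "queue " + match.group(3)))
            continue
        match = TX_HANG_RE.search(line)
        if match:
            events.append(("tx_hang", match.group(1), match.group(2) + "s"))
    return events


# Sample lines copied from the first crash log.
sample = [
    "Dec  3 03:52:20 xenserver-1 kernel: [1364416.497092] NETDEV WATCHDOG: eth5 (ixgbe): transmit queue 4 timed out",
    "Dec  3 03:52:20 xenserver-1 kernel: [1364416.497374] ixgbe 0000:0d:00.1 eth5: Fake Tx hang detected with timeout of 5 seconds",
    "Dec  3 03:52:53 xenserver-1 squeezed: [debug|xenserver-1|3 ||xenops] watch /control/feature-balloon <- 1",
]

for event in find_tx_events(sample):
    print(event)
```

Passing an open log file instead of `sample` works the same way, since the function only iterates over lines.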

      Last crash:

      Dec  6 17:23:28 xenserver-2 kernel: [306707.861263] IPv6: ADDRCONF(NETDEV_CHANGE): vif11.1: link becomes ready 
      Dec  6 17:23:28 xenserver-2 kernel: [306707.864790] vif vif-11-2 vif11.2: Guest Rx ready 
      Dec  6 17:23:28 xenserver-2 kernel: [306707.865266] IPv6: ADDRCONF(NETDEV_CHANGE): vif11.2: link becomes ready 
      Dec  6 17:23:38 xenserver-2 kernel: [306717.708268] device vif12.2 entered promiscuous mode 
      Dec  6 17:23:38 xenserver-2 kernel: [306717.820788] IPv6: ADDRCONF(NETDEV_UP): vif12.2: link is not ready 
      Dec  6 17:24:13 xenserver-2 kernel: [306752.789700] vif vif-9-1 vif9.1: Guest Rx stalled 
      Dec  6 17:24:19 xenserver-2 kernel: [306759.095789] block tdo: sector-size: 512/512 capacity: 209715200 
      Dec  6 17:24:20 xenserver-2 kernel: [306760.074132] vif vif-12-2 vif12.2: Guest Rx ready 
      Dec  6 17:24:20 xenserver-2 kernel: [306760.074718] IPv6: ADDRCONF(NETDEV_CHANGE): vif12.2: link becomes ready 
      *Dec  6 17:24:36 xenserver-2 kernel: [306775.783573] vif vif-11-1 vif11.1: Guest Rx stalled 
      Dec  6 17:24:44 xenserver-2 kernel: [306783.717686] vif vif-11-2 vif11.2: Guest Rx stalled *
      Dec  6 17:42:49 xenserver-2 kernel: [    0.000000] PAT configuration [0-7]: WB  WT  UC- UC  WC  WP  UC  UC   
      Dec  6 17:42:49 xenserver-2 kernel: [    0.000000] Initializing cgroup subsys cpuset 
      Dec  6 17:42:49 xenserver-2 kernel: [    0.000000] Initializing cgroup subsys cpu 
      Dec  6 17:42:49 xenserver-2 kernel: [    0.000000] Initializing cgroup subsys cpuacct 
      Dec  6 17:42:49 xenserver-2 kernel: [    0.000000] Linux version 3.10.0+10 (root@yeltha-4) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Thu Sep 22 12:31:44 UTC 2016

      In xha.log:

      Dec 06 17:40:43 CET 2016 [warn] SC: (script_service_do_query_liveset) reporting "heartbeat approaching timeout". host[1].time_since_last_hb=34522.
      Dec 06 17:40:43 CET 2016 [warn] SC: (script_service_do_query_liveset) reporting "State-File approaching timeout". host[1].time_since_last_update_on_sf=36035.
      Dec 06 17:41:07 CET 2016 [debug] SM: SF domain is updated [sfdomain = (1_)].
      Dec 06 17:41:07 CET 2016 [debug] FH: Start fault handler.
      Dec 06 17:41:11 CET 2016 [debug] HB: HB domain is updated [hbdomain = (m_)].
      Dec 06 17:41:12 CET 2016 [debug] FH: HB/SF state has become stable.
      Dec 06 17:41:12 CET 2016 [debug] FH: weight value [1] is commited.
      Dec 06 17:41:13 CET 2016 [debug] FH: waiting for consistent view...
      Dec 06 17:41:13 CET 2016 [warn] All hosts now have a consistent view.
      Dec 06 17:41:13 CET 2016 [warn] 	HB domain = (10)
      Dec 06 17:41:13 CET 2016 [warn] 	SF domain = (10)
      Dec 06 17:41:13 CET 2016 [debug] FH: All hosts now have consistent view to the pool membership.
      Dec 06 17:41:13 CET 2016 [debug] FH: I have won.
      Dec 06 17:41:13 CET 2016 [debug] SM: Node (1) will be removed from liveset.
      Dec 06 17:41:35 CET 2016 [notice] Liveset has been updated.  new liveset = (10)
      Dec 06 17:41:36 CET 2016 [debug] FH: End fault handler.
      Dec 06 17:41:43 CET 2016 [info] SC: (script_service_do_query_liveset) "Heartbeat approaching timeout" turned FALSE 
      Dec 06 17:41:43 CET 2016 [info] SC: (script_service_do_query_liveset) "State-file approaching timeout" turned FALSE 
      Dec 06 17:43:23 CET 2016 [debug] HB: HB domain is updated [hbdomain = (m@)].
      Dec 06 17:43:39 CET 2016 [debug] SM: SF domain is updated [sfdomain = (1@)].
      Dec 06 17:43:39 CET 2016 [debug] Join Agent: Send ack to join request from host (1).
      Dec 06 17:43:39 CET 2016 [info] Join Agent: Join request from host (1) is accepted by the local host.
      Dec 06 17:43:39 CET 2016 [notice] Liveset has been updated.  new liveset = (11)
      Dec 06 17:43:39 CET 2016 [info] Join Agent: proposed_liveset = (11)

      The only error is "vif vif-11-2 vif11.2: Guest Rx stalled", after which this server rebooted, but the primary did not.

      No errors appear on the iDRAC or on the Dell 10G switch. The watchdog forcibly restarts the servers because of a timeout.
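
For comparing incidents, the "approaching timeout" warnings in xha.log can be extracted with a short script. This is a sketch under assumptions: the regular expression is written against the two warning lines quoted above, the function name `parse_timeout_warnings` is hypothetical, and the counter values are kept as raw integers because the log does not state their unit.

```python
import re

# Matches the "approaching timeout" warnings as they appear in the xha.log excerpt.
TIMEOUT_RE = re.compile(
    r'reporting "(?P<kind>[^"]+) approaching timeout"\. '
    r"host\[(?P<host>\d+)\]\.(?P<field>\w+)=(?P<value>\d+)"
)


def parse_timeout_warnings(lines):
    """Return one dict per warning: kind, host index, counter name, raw value."""
    warnings = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            warnings.append(
                {
                    "kind": match.group("kind"),
                    "host": int(match.group("host")),
                    "field": match.group("field"),
                    # Unit is not stated in the log, so keep the raw integer.
                    "value": int(match.group("value")),
                }
            )
    return warnings


# Sample lines copied from the xha.log excerpt.
sample_xha = [
    'Dec 06 17:40:43 CET 2016 [warn] SC: (script_service_do_query_liveset) reporting "heartbeat approaching timeout". host[1].time_since_last_hb=34522.',
    'Dec 06 17:40:43 CET 2016 [warn] SC: (script_service_do_query_liveset) reporting "State-File approaching timeout". host[1].time_since_last_update_on_sf=36035.',
    "Dec 06 17:41:07 CET 2016 [debug] FH: Start fault handler.",
]

for warning in parse_timeout_warnings(sample_xha):
    print(warning)
```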

          People

            andrewhal Andrew Halley
            david_telemaque DALBERA David