Microsoft Azure - Kernel panic related to mlx5_core driver
This document (000021005) is provided subject to the disclaimer at the end of this document.
Environment
SUSE Linux Enterprise Server 12 - SP 5
SUSE Linux Enterprise Server 15 - SP 1
SUSE Linux Enterprise Server 15 - SP 2
SUSE Linux Enterprise Server 15 - SP 3
SUSE Linux Enterprise Server 15 - SP 4
SUSE Linux Enterprise Server 15 - SP 5
SUSE Linux Enterprise Server for SAP Applications 12 - SP 4 - ESPOS
SUSE Linux Enterprise Server for SAP Applications 12 - SP 5
SUSE Linux Enterprise Server for SAP Applications 15 - SP 1
SUSE Linux Enterprise Server for SAP Applications 15 - SP 2
SUSE Linux Enterprise Server for SAP Applications 15 - SP 3
SUSE Linux Enterprise Server for SAP Applications 15 - SP 4
SUSE Linux Enterprise Server for SAP Applications 15 - SP 5
Microsoft Azure
Situation
Additional information and output from commands executed on SUSE Linux Enterprise Server for SAP Applications 15 Service Pack 4:
# dmesg
[...]
[20747.904589] hv_netvsc 00224882-00d1-0022-4882-00d100224882 eth0: Data path switched from VF: eth2
[20748.082681] hv_netvsc 00224882-00d1-0022-4882-00d100224882 eth0: VF unregistering: eth2
[20748.087866] mlx5_core f0a2:00:02.0 eth2 (unregistering): Disabling LRO, not supported in legacy RQ
[20749.938028] hv_netvsc 00224882-00d1-0022-4882-00d100224882 eth0: VF slot 1 removed
[20749.940727] pci_bus f0a2:00: busn_res: [bus 00] is released
[20750.661865] hv_netvsc 00224882-00d1-0022-4882-00d100224882 eth0: VF slot 1 added
[20750.662626] hv_pci edb08bf6-f0a2-48c9-9c9a-f120cb611ef5: PCI VMBus probing: Using version 0x10004
[20752.539511] hv_pci edb08bf6-f0a2-48c9-9c9a-f120cb611ef5: PCI host bridge to bus f0a2:00
[20752.541175] hv_netvsc 00224882-00d1-0022-4882-00d100224882 eth0: VF slot 1 removed
[...]
# modinfo mlx5_core
filename: /lib/modules/5.14.21-150400.24.38-default/kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko.zst
license: Dual BSD/GPL
description: Mellanox 5th generation network adapters (ConnectX series) core driver
author: Eli Cohen <eli@mellanox.com>
suserelease: SLE15-SP4
srcversion: EA83021FF5434929ED33F2F
alias: auxiliary:mlx5_core.eth
alias: pci:v000015B3d0000A2DFsv*sd*bc*sc*i*
alias: pci:v000015B3d0000A2DCsv*sd*bc*sc*i*
alias: pci:v000015B3d0000A2D6sv*sd*bc*sc*i*
alias: pci:v000015B3d0000A2D3sv*sd*bc*sc*i*
alias: pci:v000015B3d0000A2D2sv*sd*bc*sc*i*
alias: pci:v000015B3d00001023sv*sd*bc*sc*i*
alias: pci:v000015B3d00001021sv*sd*bc*sc*i*
alias: pci:v000015B3d0000101Fsv*sd*bc*sc*i*
alias: pci:v000015B3d0000101Esv*sd*bc*sc*i*
alias: pci:v000015B3d0000101Dsv*sd*bc*sc*i*
alias: pci:v000015B3d0000101Csv*sd*bc*sc*i*
alias: pci:v000015B3d0000101Bsv*sd*bc*sc*i*
alias: pci:v000015B3d0000101Asv*sd*bc*sc*i*
alias: pci:v000015B3d00001019sv*sd*bc*sc*i*
alias: pci:v000015B3d00001018sv*sd*bc*sc*i*
alias: pci:v000015B3d00001017sv*sd*bc*sc*i*
alias: pci:v000015B3d00001016sv*sd*bc*sc*i*
alias: pci:v000015B3d00001015sv*sd*bc*sc*i*
alias: pci:v000015B3d00001014sv*sd*bc*sc*i*
alias: pci:v000015B3d00001013sv*sd*bc*sc*i*
alias: pci:v000015B3d00001012sv*sd*bc*sc*i*
alias: pci:v000015B3d00001011sv*sd*bc*sc*i*
alias: auxiliary:mlx5_core.eth-rep
alias: auxiliary:mlx5_core.sf
depends: tls,pci-hyperv-intf,mlxfw,psample
supported: yes
retpoline: Y
intree: Y
name: mlx5_core
vermagic: 5.14.21-150400.24.38-default SMP preempt mod_unload modversions
sig_id: PKCS#7
signer: SUSE Linux Enterprise Secure Boot CA
sig_key: ED:87:85:B7:8F:FC:12:7F
sig_hashalgo: sha256
signature: BE:84:24:0C:CA:94:E6:0D:36:29:D1:13:79:BC:E5:54:3B:85:94:A0:
FD:6F:7F:71:1B:AE:CD:46:54:6D:E9:4D:6C:9B:53:BB:DF:34:6A:DC:
95:59:CC:C0:3E:8D:AA:BE:E3:F5:B0:5F:DB:69:02:C8:50:65:31:1D:
86:E4:EB:1C:6D:62:B4:6C:28:43:19:C9:75:FB:0C:D7:5D:DC:C2:DB:
ED:3B:27:70:86:1D:64:13:CE:E1:C8:EE:2F:0D:8A:2A:C8:72:23:85:
64:C8:02:8B:59:8B:92:30:C3:CE:2A:FF:5E:9A:7E:33:F8:17:33:DD:
BE:84:B9:F6:5C:12:4E:05:E4:B8:7C:77:E7:8A:E6:55:AC:53:69:CA:
6A:E1:15:AE:3E:E2:16:DF:FB:48:1D:D0:E1:83:BE:51:1D:57:C8:8F:
D0:D5:BC:F6:46:7A:A1:C0:0A:78:0B:DE:25:33:D2:BD:ED:14:CC:72:
9E:1F:28:A7:6C:93:27:95:83:72:F1:EA:C1:1F:E4:34:66:91:54:E3:
3F:49:DC:BD:4D:58:79:08:E8:03:E4:F3:A2:31:9B:96:CB:A9:91:3B:
47:EC:79:B9:CB:B3:1C:CF:61:08:0E:82:9F:D1:FB:7A:3B:38:40:FC:
C7:6E:35:CD:D1:68:FA:16:E9:88:90:1B:35:34:00:F5
parm: debug_mask:debug mask: 1 = dump cmd data, 2 = dump cmd exec time, 3 = both. Default=0 (uint)
parm: prof_sel:profile selector. Valid range 0 - 2 (uint)
# lspci -vvv
f0a2:00:02.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] (rev 80)
Subsystem: Mellanox Technologies Device 0190
Physical Slot: 1
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
NUMA node: 0
Region 0: Memory at fc0000000 (64-bit, prefetchable) [size=1M]
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <4us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR-, OBFF Not Supported
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
AtomicOpsCtl: ReqEn-
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [9c] MSI-X: Enable+ Count=32 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core
Resolution
Fixes that improve kernel stability during the platform event known to trigger this panic have been officially released.
The corresponding kernel versions are:
SLES Version | Kernel Version
SLES 12 SP4 LTSS, released on 2023-05-17 | kernel-default-4.12.14-95.125.1.x86_64.rpm
SLES 12 SP5, released on 2023-04-10 | kernel-default-4.12.14-122.156.1.x86_64.rpm
SLES 15 SP1 LTSS, released on 2023-04-14 | kernel-default-4.12.14-150100.197.142.1.x86_64.rpm
SLES 15 SP2 LTSS, released on 2023-04-10 | kernel-default-5.3.18-150200.24.148.1.x86_64.rpm
SLES 15 SP3 LTSS, released on 2023-04-11 | kernel-default-5.3.18-150300.59.118.1.x86_64.rpm
SLES 15 SP4, released on 2023-04-18 | kernel-default-5.14.21-150400.24.60.1.x86_64.rpm
SLES 15 SP5, released on 2023-07-19 | kernel-default-5.14.21-150500.55.7.1.x86_64.rpm
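To judge whether a system already carries one of the fixed kernels, the version strings can be compared numerically. The following is a minimal sketch, not a SUSE-provided tool: the helper name kernel_at_least is hypothetical, and it assumes GNU sort with the -V (version sort) option is available. The two version strings are copied from this document (the vermagic line of the modinfo output, without the flavor suffix, and the SLES 15 SP4 entry in the table).

```shell
#!/bin/sh
# Sketch: compare two kernel version strings with GNU "sort -V".
# kernel_at_least is a hypothetical helper name, not a SUSE tool.

# Returns 0 when version $1 is equal to or newer than version $2.
kernel_at_least() {
    # The lower of the two versions sorts first; if that is $2,
    # then $1 is at least as new as $2.
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Version seen in the modinfo output of this document (flavor suffix
# "-default" removed), compared with the fixed SLES 15 SP4 kernel.
running="5.14.21-150400.24.38"
fixed="5.14.21-150400.24.60.1"

if kernel_at_least "$running" "$fixed"; then
    echo "kernel $running is at or above $fixed"
else
    echo "kernel $running is older than $fixed, update recommended"
fi
```

On a live system, the installed package versions can be listed with rpm -q kernel-default. The output of uname -r carries a flavor suffix such as -default and may not include the last component of the package release, so the package version is likely the safer input for this comparison.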
Cause
When a large number of bad memory pages need to be corrected, Hyper-V currently handles the pages one by one, and each correction causes a VF remove event followed by a VF add event. When a Linux workload receives a large number of these back-to-back VF add/remove events within a short period of time, kernel stability issues have been observed. Several race conditions have been found in the pci-hyperv driver, which is responsible for handling these hardware changes.
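The VF add/remove events described above appear in the kernel log as hv_netvsc "VF slot ... added" and "VF slot ... removed" messages, as in the dmesg excerpt in the Situation section. The following is a rough sketch for counting them; the helper name count_vf_events is hypothetical, and the pattern is derived only from the log lines shown in this document, so other kernel versions may word these messages differently.

```shell
#!/bin/sh
# Sketch: count hv_netvsc "VF slot N added/removed" messages on stdin.
# count_vf_events is a hypothetical helper name.

count_vf_events() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # exit status at 0 when no line matches.
    grep -E -c 'hv_netvsc .*VF slot [0-9]+ (added|removed)' || true
}

# Sample input copied from the dmesg excerpt in this document.
count_vf_events <<'EOF'
[20749.938028] hv_netvsc 00224882-00d1-0022-4882-00d100224882 eth0: VF slot 1 removed
[20749.940727] pci_bus f0a2:00: busn_res: [bus 00] is released
[20750.661865] hv_netvsc 00224882-00d1-0022-4882-00d100224882 eth0: VF slot 1 added
[20752.541175] hv_netvsc 00224882-00d1-0022-4882-00d100224882 eth0: VF slot 1 removed
EOF
```

On an affected system the same function could be fed from the live log, for example dmesg | count_vf_events. This document does not state how many events are needed to trigger the panic, so the count is only an indicator of how often the VF was removed and re-added.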
Status
Additional Information
Microsoft has paused the known problematic behavior on the platform and intends to roll out the re-implemented version of the feature at a later time.
More information can be found here:
https://lwn.net/ml/linux-kernel/20230328045122.25850-1-decui@microsoft.com/
Disclaimer
This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.
- Document ID: 000021005
- Creation Date: 08-Mar-2023
- Modified Date: 15-May-2024
- SUSE Linux Enterprise Server
- SUSE Linux Enterprise Server for SAP Applications
For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com