miranon
Member
- Oct 13, 2024
- 13
Hello everyone.
I have an S2D cluster of 4 servers.
A BSOD occurs with the error "driver_irql_not_less_or_equal" in vmswitch.sys. At the same time, an "Uncorrectable ECC" error message appears, each time on the second CPU. This happens from time to time on all servers in the cluster.
We noticed that the problem occurs more often with SR-IOV enabled on the VMs and under heavy network load. We completely reinstalled all servers in the cluster, including switching to the Core installation, and updated all drivers and firmware to the latest versions.
The configuration of all 4 servers is identical:
ASUS RS700-E10-RS12U
2xXeon 6346 (Hyper-Threading disabled)
512GB RAM (8x64GB Samsung M393A8G40BB4-CWE, installed in A1, C1, E1, G1, J1, L1, N1, R1)
2x480GB SSD SATA in RAID 1 for boot (SAMSUNG MZ7L3480HCHQ-00A07)
4x3.84TB NVMe for Storage (only 3.2TB available: we use the Micron Flex Capacity feature to increase DWPD. Micron 7300 MTFDHBE3T8TDF)
1x400GB NVMe for Storage (Intel Optane DC P5800X SSDPF21Q400GB)
1x6.4TB NVMe for Storage (Intel D7-P5620 SSDPF2KE064T1)
Intel E810-XXV Network Adapter. iWARP RDMA enabled.
2x1600W Power Supply (CHICONY POWER R1K6AW03P)
I use Windows Server 2022 Datacenter Core with all updates.
No additional software is installed on the servers.
I don't use overclocking (disabled in the BIOS).
The High performance power plan is enabled (in the BIOS too).
I ran the memory and processor tests recommended by the vendor (stress-ng --cpu 32 --cpu-method all --metrics --timeout 8h for CPU and stress-ng --vm 8 --vm-bytes 80% --timeout 8h for memory).
Everything is OK. The vendor said it was not a hardware problem.
http://speccy.piriform.com/results/sOtiaDuzbQ4TPf860Ja8ACt
I apologize for any mistakes - English is not my native language.
I would appreciate your help.
Alex.