We recently experienced a number of recurring, unexpected restarts of guest VMs, all Windows 2008 R2 servers running Microsoft SQL Server 2008. These VMs were all hosted on ESXi 5.0.0 build-1489271 (Update 3). All hosts in the cluster are relatively new HP ProLiant DL380p Gen8 servers, each with two Intel Xeon E5-2667 v2 processors @ 3.3 GHz.
My first thought was that it was something to do with SQL Server 2008, as these were the only guest VMs affected. I used the Windows debugging tools (with the correct symbols) to analyse the kernel dumps, and in all cases the probable cause was memory corruption:
Bugcheck analysis, probably caused by:
## START ##
memory_corruption ( nt!MiDeletePageTableHierarchy+9c )
######################################################
'IRQL_NOT_LESS_OR_EQUAL (a)'
An attempt was made to access a pageable (or completely invalid) address at an interrupt request level (IRQL) that is too high. This is usually caused by drivers using improper addresses. If a kernel debugger is available get the stack backtrace.
Arguments:
Arg1: fffff6fb40001de0, memory referenced
Arg2: 0000000000000000, IRQL
Arg3: 0000000000000000, bitfield :
bit 0 : value 0 = read operation, 1 = write operation
bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)
Arg4: fffff800016bbacc, address which referenced memory

memory_corruption ( nt!MiBadShareCount+4c )
###########################################
'PFN_LIST_CORRUPT (4e)'
Typically caused by drivers passing bad memory descriptor lists (ie: calling MmUnlockPages twice with the same list, etc). If a kernel debugger is available get the stack trace.
Arguments:
Arg1: 0000000000000099, A PTE or PFN is corrupt
Arg2: 0000000000000000, page frame number
Arg3: 0000000000000000, current page state
Arg4: 0000000000000000, 0

memory_corruption ( nt!MiResolveDemandZeroFault+2e2 )
#####################################################
'IRQL_NOT_LESS_OR_EQUAL (a)'
An attempt was made to access a pageable (or completely invalid) address at an interrupt request level (IRQL) that is too high. This is usually caused by drivers using improper addresses. If a kernel debugger is available get the stack backtrace.
Arguments:
Arg1: fffff700010818a0, memory referenced
Arg2: 0000000000000000, IRQL
Arg3: 0000000000000000, bitfield :
bit 0 : value 0 = read operation, 1 = write operation
bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)
Arg4: fffff800016ce272, address which referenced memory

memory_corruption ( nt!MiRemoveWorkingSetPages+388 )
####################################################
'PAGE_FAULT_IN_NONPAGED_AREA (50)'
Invalid system memory was referenced. This cannot be protected by try-except, it must be protected by a Probe. Typically the address is just plain bad or it is pointing at freed memory.
Arguments:
Arg1: fffff68000000088, memory referenced.
Arg2: 0000000000000000, value 0 = read operation, 1 = write operation.
Arg3: fffff80001855350, If non-zero, the instruction address which referenced the bad memory address.
Arg4: 0000000000000005, (reserved)
## END ##
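If you want to run the same analysis yourself, the dumps were simply opened in the Windows debugging tools with the Microsoft public symbol server configured and !analyze -v run against each one. A small wrapper script along the lines of the sketch below will batch that over a folder of dumps and save the output; the kd.exe path and dump folder are illustrative placeholders, not taken from our environment.

```python
# Minimal sketch (paths are placeholders): batch "!analyze -v" over a folder of
# memory dumps using kd.exe from the Debugging Tools for Windows, with symbols
# pulled from the Microsoft public symbol server.
import os
import subprocess
from pathlib import Path

KD = r"C:\Program Files (x86)\Windows Kits\8.1\Debuggers\x64\kd.exe"
DUMP_DIR = Path(r"C:\dumps")
SYMBOLS = r"srv*C:\symbols*https://msdl.microsoft.com/download/symbols"

# Point the debugger at the symbol server via the standard environment variable
env = dict(os.environ, _NT_SYMBOL_PATH=SYMBOLS)

for dump in DUMP_DIR.glob("*.dmp"):
    # -z opens the crash dump, -c runs the debugger commands and then quits
    result = subprocess.run([KD, "-z", str(dump), "-c", "!analyze -v; q"],
                            capture_output=True, text=True, env=env)
    out_file = dump.with_suffix(".txt")
    out_file.write_text(result.stdout)
    print(f"Analysed {dump.name} -> {out_file.name}")
```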
The other common factor was the identical host hardware, so I started looking into the firmware levels and customer advisories for any known issues. One of the first things I found was a BIOS firmware update addressing a cross-platform issue caused by the Intel microcode in the E5-2600 v2 series processors; this is specific to the v2 series, and the original E5-2600 series is not affected.
HP have addressed this in a number of recent BIOS updates, none of which are included in the most recent HP Service Pack for ProLiant (SPP), Version 2014.02.0(B):
Version: 2014.02.10 (2 May 2014)
Version: 2013.12.20 (A) (21 Jan 2014)
Version: 2013.11.14 (A) (20 Dec 2013)
Addressed an issue where servers configured with Intel Xeon E5-2600 v2 processors and 32 GB LRDIMMs may experience an increased rate of corrected memory errors or uncorrected memory errors. This issue impacts servers configured with 2 DIMMs per channel or 3 DIMMs per channel. Any server configured with Intel Xeon E5-2600 v2 processors using LRDIMMs should be updated to this revision of the System ROM or later. If experiencing memory errors with the indicated configuration, HP recommends updating to this revision of the System ROM or later before contacting HP service.
Addressed an issue where Memory Address or Command Parity errors are not logged to the Integrated Management Log (IML) if they occur. With previous revisions of the System ROM, these types of errors would cause the server to reset without any notification of the error. A “283-Memory Address/Command Parity Error Detected” error will now be displayed during system boot and logged to the IML.
Version: 2013.09.18 (A) (24 Sep 2013)
Added support for LRDIMMs for systems configured with Intel Xeon E5-2600 Series v2 processors. Previous System ROM revisions that supported E5-2600 Series v2 processors displayed a “274-Unsupported DIMM Configuration Detected” message during system boot when LRDIMMs were installed with Intel Xeon E5-2600 v2 processors. Previous ROM revisions did support LRDIMMs with Intel Xeon E5-2600 processors.
Version: 2013.09.08 (A) (13 Aug 2013)
The excerpts above are all taken from the HP ProLiant DL380p Gen8 Server BIOS release notes, so please refer to the vendor's site for the full list of fixes and enhancements.
We do have a support case open with Microsoft regarding this, so it will be interesting to see what recommendations come back, but I plan to address this by updating the hosts' BIOS to the current version, 2014.02.10 (A), released on 2nd May 2014.
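Before scheduling the update it is worth confirming which hosts in the cluster are still on an older System ROM. Assuming vCenter and the pyVmomi Python bindings are available, something like the sketch below will report the model and BIOS version of every host; the vCenter address and credentials are placeholders, not details from our environment.

```python
# Minimal sketch (hostname and credentials are placeholders): list the BIOS
# (System ROM) version of every ESXi host registered in vCenter, so hosts still
# needing the 2014.02.10 update are easy to spot.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

context = ssl._create_unverified_context()  # lab use only; validate certificates in production
si = SmartConnect(host="vcenter.example.local",
                  user="administrator@vsphere.local",
                  pwd="password",
                  sslContext=context)
try:
    content = si.RetrieveContent()
    # Walk the inventory for all HostSystem objects
    view = content.viewManager.CreateContainerView(content.rootFolder,
                                                   [vim.HostSystem], True)
    for host in view.view:
        if host.hardware is None:  # skip disconnected / not-responding hosts
            continue
        bios = host.hardware.biosInfo
        print(f"{host.name}: {host.hardware.systemInfo.model}, "
              f"System ROM {bios.biosVersion}")
finally:
    Disconnect(si)
```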
Update 13/06/2014:
Since deploying the BIOS update, we haven’t had a single instance recur – success!
Hi Jon,
Did the HP BIOS update resolve your issue with the Intel microcode? We ran into an issue today with a Cisco UCS blade running E5-2640 v2 processors; a Windows 2008 R2 VM with SQL Server 2008 blue screened with a similar memory dump.
Thanks,
Ben Warner
Hi Ben,
Yes, not a single incident since upgrading the BIOS, and we were seeing two per week before this.
Cheers,
Jon
Thanks Jon.
Excellent article. I just happened across this through specific Googling of parts of the memory dump. I “believe” we also saw this issue on a Cisco UCS blade yesterday (E5-2640 v2). Cisco has confirmed this and also relayed the VMware article detailing an MMU setting workaround through guest configuration. Cisco has apparently also released a code update that resolves the issue (2.2.c).
This is one of those times I wish I was using good ole’ HP ProLiants. The UCS situation will require a full firmware update on many different components – not least of which is the actual server. LOL.
Thanks again for a great post.
Ben Warner
Thanks for this article. We were plagued by this on our UCS cluster. Updated the BIOS on all hosts and haven’t had the problem since!