Intel microcode issue affecting E5-2600 v2 series processors

We recently experienced a number of recurring, unexpected restarts of guest VMs, all Windows 2008 R2 servers running MSSQL Server 2008. These VMs were all hosted on ESXi 5.0.0 build-1489271 (Update 3). All hosts in the cluster are relatively new HP DL380p Gen8 servers, each with two Intel Xeon E5-2667 v2 processors @ 3.3GHz.

My first thought was that it had something to do with MSSQL Server 2008, as these were the only guest VMs affected. I used the Windows debugging tools (with the correct symbols) to analyse the kernel dumps, and in all cases the probable cause was memory corruption:

Bugcheck analysis, probably caused by:
## START ##

memory_corruption ( nt!MiDeletePageTableHierarchy+9c )
######################################################

'IRQL_NOT_LESS_OR_EQUAL (a)'
An attempt was made to access a pageable (or completely invalid) address
at an interrupt request level (IRQL) that is too high.  This is usually
caused by drivers using improper addresses. If a kernel debugger is
available get the stack backtrace.

Arguments:
Arg1: fffff6fb40001de0, memory referenced
Arg2: 0000000000000000, IRQL
Arg3: 0000000000000000, bitfield :
bit 0 : value 0 = read operation, 1 = write operation
bit 3 : value 0 = not an execute operation, 1 = execute operation (only
on chips which support this level of status)
Arg4: fffff800016bbacc, address which referenced memory

memory_corruption ( nt!MiBadShareCount+4c )
###########################################

'PFN_LIST_CORRUPT (4e)'
Typically caused by drivers passing bad memory descriptor lists (ie:
calling MmUnlockPages twice with the same list, etc).  If a kernel
debugger is available get the stack trace.

Arguments:
Arg1: 0000000000000099, A PTE or PFN is corrupt
Arg2: 0000000000000000, page frame number
Arg3: 0000000000000000, current page state
Arg4: 0000000000000000, 0

memory_corruption ( nt!MiResolveDemandZeroFault+2e2 )
#####################################################

'IRQL_NOT_LESS_OR_EQUAL (a)'
An attempt was made to access a pageable (or completely invalid) address
at an interrupt request level (IRQL) that is too high.  This is usually
caused by drivers using improper addresses. If a kernel debugger is
available get the stack backtrace.

Arguments:
Arg1: fffff700010818a0, memory referenced
Arg2: 0000000000000000, IRQL
Arg3: 0000000000000000, bitfield :
bit 0 : value 0 = read operation, 1 = write operation
bit 3 : value 0 = not an execute operation, 1 = execute operation (only
on chips which support this level of status)
Arg4: fffff800016ce272, address which referenced memory

memory_corruption ( nt!MiRemoveWorkingSetPages+388 )
####################################################

'PAGE_FAULT_IN_NONPAGED_AREA (50)'
Invalid system memory was referenced.  This cannot be protected by
try-except, it must be protected by a Probe.  Typically the address is
just plain bad or it is pointing at freed memory.

Arguments
Arg1 fffff68000000088, memory referenced.
Arg2 0000000000000000, value 0 = read operation, 1 = write operation.
Arg3 fffff80001855350, If non-zero, the instruction address which
referenced the bad memory address.
Arg4 0000000000000005, (reserved)

## END ##
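To see the pattern across many dumps at a glance, the summaries can be tallied with a short script. This is only a rough sketch: it assumes the text is laid out like the excerpt above (a "cause ( module!symbol+offset )" line and a quoted "NAME (code)" line), which is my own summary format and not raw debugger output. The function name summarise is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as: IRQL_NOT_LESS_OR_EQUAL (a)
BUGCHECK_RE = re.compile(r"^([A-Z][A-Z0-9_]+) \(([0-9a-fA-F]+)\)$")
# Matches lines such as: memory_corruption ( nt!MiBadShareCount+4c )
CAUSE_RE = re.compile(r"^(\w+) \( (\S+?)\+[0-9a-fA-F]+ \)$")


def summarise(text):
    """Count bugcheck (name, code) pairs and probable causes in summary text."""
    bugchecks = Counter()
    causes = Counter()
    for raw in text.splitlines():
        # Drop surrounding whitespace and straight or curly single quotes.
        line = raw.strip().strip("'\u2018\u2019")
        match = BUGCHECK_RE.match(line)
        if match:
            # The code in brackets is hexadecimal, e.g. "4e" is 0x4E.
            bugchecks[(match.group(1), int(match.group(2), 16))] += 1
            continue
        match = CAUSE_RE.match(line)
        if match:
            causes[match.group(1)] += 1
    return bugchecks, causes
```

Run against the four summaries above, this would report memory_corruption as the probable cause every time, spread across three different bugcheck codes, which is what pointed me away from a single faulty driver.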

The other common factor was the identical host hardware, so I started looking into the firmware levels and customer advisories for any known issues. One of the first things I found was a BIOS firmware update to address a cross-platform issue caused by the Intel microcode of the E5-2600 v2 series processors. This is specific to the v2 series; the non-v2 E5-2600 series is not affected.
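Because the distinction between E5-2600 and E5-2600 v2 matters here, it can help to check the CPU model strings from a host inventory export. The following is a minimal sketch that assumes you already have the model string as text (for example "Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz"); the function names are hypothetical and nothing here queries a hypervisor or vendor API.

```python
import re

# E5-26xx model number, optional single-letter suffix (e.g. "L"), then "vN".
E5_2600_VERSIONED_RE = re.compile(r"E5-26\d{2}[A-Z]?\s*v(\d+)\b", re.IGNORECASE)
# Any E5-26xx model number, with or without a version suffix.
E5_2600_FAMILY_RE = re.compile(r"E5-26\d{2}", re.IGNORECASE)


def e5_2600_generation(model_name):
    """Return 1 for non-v2 E5-2600, N for 'vN' parts, None for other CPUs."""
    versioned = E5_2600_VERSIONED_RE.search(model_name)
    if versioned:
        return int(versioned.group(1))
    if E5_2600_FAMILY_RE.search(model_name):
        # No "vN" suffix: treat as the first-generation E5-2600 series.
        return 1
    return None


def is_e5_2600_v2(model_name):
    """True only for E5-2600 v2 series parts, the ones the advisory names."""
    return e5_2600_generation(model_name) == 2
```

For the hosts in this cluster, is_e5_2600_v2("Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz") would flag the processor as one to check against the advisory.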

HP have addressed this in a number of recent BIOS updates, which are not included in the most recent HP Service Pack for ProLiant (SPP), version 2014.02.0(B):

Version:2014.02.10 (2 May 2014)
Addressed a processor issue which can result in a Blue Screen of Death (BSOD) in a Windows virtual machine or Linux Kernel Panic in a Linux virtual machine when running on Microsoft Hyper-V or VMware ESX 5.x on Intel Xeon E5-2600 series v2 processors. This issue is not unique to HP ProLiant servers and could impact any system using affected processors operating with the conditions listed. This revision of the System ROM contains an updated version of Intel’s microcode that addresses this issue. This issue does NOT affect servers configured with the Intel Xeon E5-2600 series processors.

 

Version:2013.12.20 (A) (21 Jan 2014)
Addressed an issue where Memory Address or Command Parity errors may occur on servers configured with Intel Xeon E5-2600 series v2 processors and memory configurations where the memory speed is running at 1600 MHz or 1866 MHz. These errors may have resulted in the server resetting without notification of the error or the system resetting and displaying a “283-Memory Address/Command Parity Error Detected Error” and logging the event to the Integrated Management Log (IML). HP strongly recommends that all servers utilizing Intel E5-2600 v2 processors with impacted memory speeds update to this revision of the System ROM or later. This issue does NOT affect servers configured with the Intel Xeon E5-2600 series processor.

 

Version:2013.11.14 (A) (20 Dec 2013)
Addressed an issue where the server may not be able to enter processor idle power states (C-states) which can increase idle power when configured with 2 Intel Xeon E5-2600 v2 Series Processors.
Addressed an issue where servers configured with Intel Xeon E5-2600 v2 processors and 32 GB LRDIMMs may experience an increased rate of corrected memory errors or uncorrected memory errors. This issue impacts servers configured with 2 DIMMs per channel or 3 DIMMs per channel. Any server configured with Intel Xeon E5-2600 v2 processors using LRDIMMs should be updated to this revision of the System ROM or later. If experiencing memory errors with the indicated configuration, HP recommends updating to this revision of the System ROM or later before contacting HP service.
Addressed an issue where Memory Address or Command Parity errors are not logged to the Integrated Management Log (IML) if they occur. With previous revisions of the System ROM, these types of errors would cause the server to reset without any notification of the error. A “283-Memory Address/Command Parity Error Detected” error will now be displayed during system boot and logged to the IML.

 

Version:2013.09.18 (A) (24 Sep 2013)
Addressed an issue where a system configured with Intel Xeon E5-2690 v2, E5-2680 v2, E5-2670 v2, and E5-2660 v2 processors and Advanced Memory Protection configured to Online Spare Mode may experience incorrect behavior when multiple Online Spare switchovers occur on the same processor.
Added support for LRDIMMs for systems configured with Intel Xeon E5-2600 Series v2 processors. Previous System ROM revisions that supported E5-2600 Series v2 processors displayed a “274-Unsupported DIMM Configuration Detected” message during system boot when LRDIMMs were installed with Intel Xeon E5-2600 v2 processors. Previous ROM revisions did support LRDIMMs with Intel Xeon E5-2600 processors.

 

Version:2013.09.08 (A) (13 Aug 2013)
Addressed a processor issue under which a rare and complex sequence of internal processor microarchitecture events that occur in specific operating environments could cause a server system to experience unexpected page faults, general protection faults, or machine check exceptions or other unpredictable system behavior. While all processors supported by this server have this issue, to be affected by this issue the server must be operating in a virtualized environment, have Intel Hyperthreading enabled, have a hypervisor that enables Intel VT FlexPriority and Extended Page Tables, and have a guest OS utilizing 32-bit PAE Paging Mode. This issue is not unique to HP ProLiant servers and could impact any system utilizing affected processors operating with the conditions listed above. This revision of the System ROM contains an updated version of Intel’s microcode that addresses this issue. Due to the potential severity of the issue addressed in this revision of the System ROM, this System ROM upgrade is considered a critical fix.

 

The excerpts above are all taken from the HP ProLiant DL380p Gen8 Server BIOS release notes, so please refer to the vendor’s site for a full list of fixes and enhancements.

We do have a support case open with Microsoft regarding this, so it will be interesting to see what recommendations come back, but I plan to address this by updating the hosts’ BIOS to the current version, 2014.02.10 (A), released on 2nd May 2014.
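When rolling this out across a cluster, a quick check of which hosts are still below the fixed ROM revision is useful. The sketch below assumes the System ROM version is available as a string in the YYYY.MM.DD form used in the release notes above (optionally followed by a revision letter such as "(A)"); how your tooling reports the version may differ, and the names FIXED_ROM, parse_rom_version and has_microcode_fix are my own.

```python
import re
from datetime import date

# First System ROM revision in the release notes above that carries the
# microcode fix for the BSOD / kernel panic issue.
FIXED_ROM = date(2014, 2, 10)

ROM_VERSION_RE = re.compile(r"\s*(\d{4})\.(\d{2})\.(\d{2})")


def parse_rom_version(version):
    """Turn '2014.02.10 (A)' into a date; the revision letter is ignored."""
    match = ROM_VERSION_RE.match(version)
    if not match:
        raise ValueError("unrecognised ROM version: %r" % version)
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


def has_microcode_fix(version):
    """True if the ROM version is at or after the fixed revision."""
    return parse_rom_version(version) >= FIXED_ROM
```

Any host where has_microcode_fix returns False (for example one still on "2013.12.20 (A)") would be a candidate for the update.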

Update 13/06/2014:

Since deploying the BIOS update, we haven’t had a single recurrence. Success!

Comments

  1. Ben Warner says:

    Hi Jon,

    Did the HP BIOS update resolve your issue with the Intel microcode?

    We ran into an issue today with a Cisco UCS blade running the E5-2640 V2 processors. A Windows 2008R2 VM with SQL 2008 blue screened with a similar memory dump.

  2. Ben Warnr says:

    Jon,

    Did the BIOS upgrade resolve your issue with the Intel microcode?

    Thanks,

    Ben Warner

    • Hi Ben,

      Yes, not a single incident since upgrading the BIOS and we were seeing 2 per week before this.

      Cheers,
      Jon

      • Ben Warnr says:

        Thanks Jon.

        Excellent article. Just happened across this through specific Googling of parts of the memory dump. I “believe” we also saw this issue on a Cisco UCS blade yesterday (E5-2640 V2). Cisco has confirmed and also relayed the VMware article detailing an MMU setting workaround through guest configuration. Cisco apparently has also released a code update that resolves the issue (2.2.c).

        This is the time that I wish I was using good ole’ HP Proliants. The UCS situation will require a full firmware update on many different components – least of which being the actual server. LOL.

        Thanks again for a great post.

        Ben Warner

  3. Bernard Banks says:

    Thanks for this article. We were plagued by this on our UCS cluster. Updated the BIOS on all hosts and haven’t had the problem since!

Trackbacks

  1. Alert: Win2K8R2 and Solaris 10 64-bit VMs blue screen or kernel panic when running on ESXi 5.x with Intel E5 v2 processors
    This is something I have heard of happening in the wild so it is something to be aware of. There is no solution yet but there is a workaround to help avoid this issue until there is a solution. Find out the details here. While the workaround will avoid the issue, it is not ideal, and I hope we can get a solution sooner. The article implies a BIOS upgrade may be the solution but we will see. By the way, I used a CloudPhysics card called Host Inventory that showed me the CPU series of my hosts that confirmed I do not have this issue! Have I mentioned before how much I like CloudPhysics?! Here is some further information that was out before the KB article.
