Dell VC Plugin – iDRAC Queued Jobs

I’ve been doing some standard maintenance on my Dell R815 hosts (upgrading BIOS, firmware, drivers etc.) and ran into some difficulty with one particular server. I use the Dell Management Plugin for VMware vCenter to carry out these operations, and besides being slow it works well.

I simply run the firmware upgrade wizard which stages the current versions to the iDRAC and schedules the deployment job to execute on boot using UEFI system services. The process automatically enters the host into maintenance mode and reboots to start the UEFI jobs.

On this particular server the job failed, and I was left with a job that was scheduled but would not execute, and no visible option to delete it. In my case it was a BIOS update to address the PSOD issue affecting AMD Opteron 62xx series processors.

I had a look through the logs and found the following entries (also shown in the vCenter tasks);

  • [Firmware Update] File: R815_BIOS_F8FCX_WN32_3.0.5.EXE – Status: Failed – Message: iDRAC – A job for this update already exists in the queue and prevents the new job from being scheduled.
  • [Firmware Update] File: R815_BIOS_F8FCX_WN32_3.0.5.EXE, Status: Failed, iDRAC: 10.170.210.107, Host xxxxxxxx.xxxxxxxxxxxxx.xxx, Recommended resolution actions: iDRAC – Delete the existing job before scheduling a new update job. For Remote Services API consumers, enumerate all the jobs to find and identify the duplicate job. Then invoke the DeleteJob() method to delete this job and enable the new update job to be scheduled
  • [Firmware Update] The state of update job for file (R815_BIOS_F8FCX_WN32_3.0.5.EXE) has changed from (NEW) state to (FAILED) state.

Looking at these log entries, the solution is pretty simple: delete the existing job and then run the wizard again. The only problem is that there is no obvious way to delete the jobs.

To speed up the resolution I logged a support call with Dell, who wanted me to reset the iDRAC to factory defaults and send them various logs (which sounded like they were clutching at straws). So I started doing my own research and found a solution using WinRM, nested deep in one of the Dell support forums.

Windows Remote Management (WinRM) is the Microsoft implementation of the WS-Management protocol which provides a secure way to communicate with local and remote computers using web services.

Remember to replace the $USERNAME, $PASSWORD and $IPADDRESS with your own values.

Check for existing jobs;
## Use WinRM to check for existing Jobs
winrm e cimv2/root/dcim/DCIM_LifecycleJob -u:$USERNAME -p:$PASSWORD -SkipCNcheck -SkipCAcheck -r:https://$IPADDRESS/wsman:443 -auth:basic -encoding:utf-8
Results;
DCIM_LifecycleJob
    InstanceID = JID_001371642178
    JobStartTime = TIME_NA
    JobStatus = Failed
    JobUntilTime = TIME_NA
    Message = Task for this device is already present.
    MessageID = RED014
    Name = update:DCIM:INSTALLED:NONPCI:159:3.0.4
    PercentComplete = NA

** I had multiple entries but have only shown an excerpt to keep the post short.
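
If you have a lot of entries, it can help to let a script pick out the failed ones. The following is a minimal Python sketch, assuming you have saved the text output of the enumeration command above; the function names `parse_lifecycle_jobs` and `failed_jobs` are my own and are not part of any Dell tool.

```python
# Sample text copied from the enumeration output shown above.
SAMPLE = """\
DCIM_LifecycleJob
    InstanceID = JID_001371642178
    JobStartTime = TIME_NA
    JobStatus = Failed
    JobUntilTime = TIME_NA
    Message = Task for this device is already present.
    MessageID = RED014
    Name = update:DCIM:INSTALLED:NONPCI:159:3.0.4
    PercentComplete = NA
"""


def parse_lifecycle_jobs(text):
    """Parse `winrm e ... DCIM_LifecycleJob` text output into a list of dicts."""
    jobs = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped == "DCIM_LifecycleJob":
            # Each record starts with the class name on its own line.
            current = {}
            jobs.append(current)
        elif current is not None and "=" in stripped:
            # Split on the first "=" only, so values keep any later characters.
            key, _, value = stripped.partition("=")
            current[key.strip()] = value.strip()
    return jobs


def failed_jobs(jobs):
    """Return only the jobs whose JobStatus is 'Failed'."""
    return [job for job in jobs if job.get("JobStatus") == "Failed"]


if __name__ == "__main__":
    for job in failed_jobs(parse_lifecycle_jobs(SAMPLE)):
        print(job["InstanceID"], job["MessageID"], job["Message"])
```

The parser only relies on the `Key = Value` layout visible in the output above, so treat it as a convenience for reading the queue rather than a complete WS-Management client.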

Delete all existing jobs;
## Use WinRM to delete all jobs
winrm invoke DeleteJobQueue "cimv2/root/dcim/DCIM_JobService?CreationClassName=DCIM_JobService+Name=JobService+SystemName=Idrac+SystemCreationClassName=DCIM_ComputerSystem" @{JobID="JID_CLEARALL"} -r:https://$IPADDRESS/wsman -u:$USERNAME -p:$PASSWORD -SkipCNcheck -SkipCAcheck -encoding:utf-8 -a:basic -format:pretty
Results;
<n1:DeleteJobQueue_OUTPUT xml:lang="" xmlns:n1="http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/root/dcim/DCIM_JobService">
<n1:Message>The specified job was deleted</n1:Message>
<n1:MessageID>SUP020</n1:MessageID>
<n1:ReturnValue>0</n1:ReturnValue>
</n1:DeleteJobQueue_OUTPUT>
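
If you want to check this response in a script rather than by eye, here is a small Python sketch that reads the three fields from the XML. It assumes the response has the shape shown above and that a `ReturnValue` of 0 indicates success, as the accompanying message suggests; `parse_delete_result` and `delete_succeeded` are hypothetical helper names of my own.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns:n1 attribute in the response above.
NS = {"n1": "http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/root/dcim/DCIM_JobService"}

# Sample response copied from the output shown above.
SAMPLE_OUTPUT = '''
<n1:DeleteJobQueue_OUTPUT xml:lang="" xmlns:n1="http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/root/dcim/DCIM_JobService">
<n1:Message>The specified job was deleted</n1:Message>
<n1:MessageID>SUP020</n1:MessageID>
<n1:ReturnValue>0</n1:ReturnValue>
</n1:DeleteJobQueue_OUTPUT>
'''.strip()


def parse_delete_result(xml_text):
    """Extract Message, MessageID and ReturnValue from a DeleteJobQueue response."""
    root = ET.fromstring(xml_text)

    def field(name):
        node = root.find(f"n1:{name}", NS)
        return node.text if node is not None else None

    return {
        "Message": field("Message"),
        "MessageID": field("MessageID"),
        "ReturnValue": field("ReturnValue"),
    }


def delete_succeeded(result):
    """Assumption: ReturnValue '0' means the request was accepted."""
    return result["ReturnValue"] == "0"


if __name__ == "__main__":
    result = parse_delete_result(SAMPLE_OUTPUT)
    print(result["MessageID"], result["Message"], delete_succeeded(result))
```
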
Check for existing jobs (again);
## Use WinRM to check for existing Jobs
winrm e cimv2/root/dcim/DCIM_LifecycleJob -u:$USERNAME -p:$PASSWORD -SkipCNcheck -SkipCAcheck -r:https://$IPADDRESS/wsman:443 -auth:basic -encoding:utf-8
Results;
DCIM_LifecycleJob
    InstanceID = JID_CLEARALL
    JobStartTime = TIME_NA
    JobStatus = Pending
    JobUntilTime = TIME_NA
    Message = NA
    MessageID = NA
    Name = CLEARALL
    PercentComplete = 0
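
To confirm the old jobs really are gone, you can compare the job IDs before and after. This is a rough Python sketch that assumes, as the output above suggests, that a lone `JID_CLEARALL` entry means nothing else is left in the queue; `remaining_job_ids` is a name I made up for illustration.

```python
import re


def remaining_job_ids(enumeration_output):
    """Return InstanceIDs found in `winrm e` output, ignoring the CLEARALL placeholder."""
    ids = re.findall(
        r"^\s*InstanceID\s*=\s*(\S+)\s*$",
        enumeration_output,
        flags=re.MULTILINE,
    )
    return [job_id for job_id in ids if job_id != "JID_CLEARALL"]


if __name__ == "__main__":
    before = "DCIM_LifecycleJob\n    InstanceID = JID_001371642178\n    JobStatus = Failed\n"
    after = "DCIM_LifecycleJob\n    InstanceID = JID_CLEARALL\n    JobStatus = Pending\n"
    print("before:", remaining_job_ids(before))
    print("after:", remaining_job_ids(after))
```

An empty list for the second enumeration would suggest it is safe to run the firmware upgrade wizard again.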

So it’s easy when you know how, but this issue still managed to waste a lot of my time! Hopefully my post will save you from a similar experience.


FIX : PSOD on hosts with AMD Opteron 62xx series processors

If you’re running ESX/ESXi on either HP or Dell hosts with AMD Opteron 62xx series processors and were affected by the PSOD issue, then you will be happy to know that both vendors have now released BIOS updates to address this. My understanding is that this was actually a problem with the AMD microcode rather than a VMware, HP or Dell issue.

I was affected by this using Dell PowerEdge R815 servers immediately after upgrading the BIOS on my hosts from a mix of version 2.8.2 and version 2.9.0 to version 3.0.4. The workaround up till now was to downgrade the BIOS version on all hosts back to version 2.8.2.

Dell PowerEdge R815 Hosts

| Item | Value |
|---|---|
| Processor Model | AMD Opteron(tm) Processor 6276 |
| Processor Speed | 2.3 GHz |
| Processor Sockets | 4 |
| Processor Cores per Socket | 16 |
| Logical Processors | 64 |
| Memory | 256 GB |

Here are the links to the vendor advisories and updates;

Interestingly Dell have only flagged the importance of this upgrade as “Recommended” – not sure about you, but I quite like my hosts to be up and running!


Why are my new servers so much slower than the old ones?

I had a support call today where I was asked to have a look at some servers to find out why they seemed so much slower than the existing ones. With not much detail to go on I first looked at some basic metrics;

Basic Metrics

| Metric | Old Server | New Server |
|---|---|---|
| Hardware Model | HP ProLiant DL380 G5 | Dell PowerEdge R715 |
| Operating System | Microsoft(R) Windows(R) Server 2003 Standard x64 Edition | Microsoft Windows Server 2008 R2 Enterprise |
| Memory | 32,766 MB | 131,046 MB |
| Processor | 2 Processor(s) Installed. [01]: EM64T Family 6 Model 23 Stepping 10 GenuineIntel ~3000 Mhz; [02]: EM64T Family 6 Model 23 Stepping 10 GenuineIntel ~3000 Mhz | 2 Processor(s) Installed. [01]: AMD64 Family 21 Model 1 Stepping 2 AuthenticAMD ~3000 Mhz; [02]: AMD64 Family 21 Model 1 Stepping 2 AuthenticAMD ~3000 Mhz |

The first thing that stands out is that the new server is from a different hardware vendor, but it is a higher-spec, later-generation system – so what could be wrong?

I decided to have a look in the BIOS first to see if there were any obvious misconfigurations, and noticed that the power management settings were not set to “Maximum Performance” and that the C1E state was enabled.

Before changing anything I downloaded and ran Super Pi to get a simple baseline of single threaded calculations on the new higher spec server.

I then changed three BIOS settings, and re-ran the Super Pi calculations;

  • Enabled “Processor HPC mode”
  • Disabled “C1E”
  • Set Power Management to “Maximum Performance”

The results;

[Screenshot: SuperPi results, before on the left, after on the right]

WOW, what a difference! By simply changing the power management settings in the BIOS, a calculation that previously took 1 minute 8 seconds now only takes 11 seconds!
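
To put a number on that, here is a quick Python sketch of the arithmetic using the two run times quoted above; the helper names are mine and purely illustrative.

```python
def speedup(before_seconds, after_seconds):
    """How many times faster the second run is."""
    return before_seconds / after_seconds


def percent_reduction(before_seconds, after_seconds):
    """Percentage of run time saved by the change."""
    return 100.0 * (before_seconds - after_seconds) / before_seconds


BEFORE = 1 * 60 + 8  # 1 minute 8 seconds, before the BIOS changes
AFTER = 11           # 11 seconds, after the BIOS changes

if __name__ == "__main__":
    print(f"{speedup(BEFORE, AFTER):.1f}x faster")                   # 6.2x faster
    print(f"{percent_reduction(BEFORE, AFTER):.0f}% less run time")  # 84% less run time
```

That is roughly a sixfold improvement on a single-threaded test from the three BIOS settings alone.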
