vExpert 2016 Applications are Open


The VMware vExpert program is VMware’s global evangelism and advocacy program. It is designed to put VMware’s marketing resources behind your advocacy efforts: promotion of your articles, exposure at our global events, co-op advertising, traffic analysis, and early access to beta programs and VMware’s roadmap.

Each year, we bring together in the vExpert Program the people who have made some of the most important contributions to the VMware community. These are the bloggers, book authors, VMUG leaders, speakers, tool builders, community leaders and general enthusiasts. They work as IT admins and architects for VMware customers, they act as trusted advisors and implementors for VMware partners or as independent consultants, and some work for VMware itself. All of them have the passion and enthusiasm for technology and applying technology to solve problems. They have contributed to the success of us all by sharing their knowledge and expertise over their days, nights, and weekends.

vExperts who participate in the program have access to private betas, free licenses, early access briefings, exclusive events, free access to VMworld conference materials online, exclusive vExpert parties at VMworld and other opportunities to interact with VMware product teams. They also get access to a private community and networking opportunities.

Evangelist Path
The Evangelist Path includes book authors, bloggers, tool builders, public speakers, VMTN contributors, and other IT professionals who share their knowledge and passion with others, leveraging a personal public platform to reach many people. Employees of VMware can also apply via the Evangelist Path. A VMware employee reference is recommended if your activities weren’t all public or were in a language other than English.

Customer Path
The Customer Path is for leaders from VMware customer organizations. They have been internal champions in their organizations, worked with VMware to build success stories, acted as customer references, given public interviews, spoken at conferences, or served as VMUG leaders. A VMware employee reference is recommended if your activities weren’t all public.

VPN (VMware Partner Network) Path
The VPN Path is for employees of our partner companies who lead with passion and by example, who are committed to continuous learning through accreditations and certifications, and to making their technical knowledge and expertise available to many. This can take the form of event participation, videos, IP generation, and public speaking engagements. A VMware employee reference is required for VPN Path candidates.

Apply for vExpert 2016

Questions & Updates
For questions about the application process or the vExpert Program, please send email to vexpert@vmware.com. Be sure to follow @vExpert for updates on the 2016 vExpert program.

Thank you to all who have participated in years past,
Corey Romero & the Social Media Team

Source: https://blogs.vmware.com/vmtn/2015/11/vexpert-2016-applications-are-now-open.html


Download URLs for VMware vSphere Client

VMware has finally released a single-page reference for all vSphere Client versions – KB article ID 2089791. You can either follow the link to the original KB article for the download links, or use the version and build reference in the table below;

Version                              Build
-----------------------------------  -------
VMware vSphere Client 6.0 Update 2   3562874
VMware vSphere Client 6.0 Update 1   3016447
VMware vSphere Client 6.0            2502222
VMware vSphere Client 5.5 Update 3   3024345
VMware vSphere Client 5.5 Update 2   1993072
VMware vSphere Client 5.5 Update 1b  1880841
VMware vSphere Client 5.5 Update 1a  1746248
VMware vSphere Client 5.5 Update C   1745234
VMware vSphere Client 5.5 Update 1   1618071
VMware vSphere Client 5.5            1281650
VMware vSphere Client 5.1 Update 2a  1880906
VMware vSphere Client 5.1 Update 2   1471691
VMware vSphere Client 5.1 Update 1c  1364039
VMware vSphere Client 5.1 Update 1b  1235233
VMware vSphere Client 5.1 Update 1   1064113
VMware vSphere Client 5.1.0b         941893
VMware vSphere Client 5.1.0a         860230
VMware vSphere Client 5.1            786111
VMware vSphere Client 5.0 Update 3   1300600
VMware vSphere Client 5.0 Update 2   913577
VMware vSphere Client 5.0 Update 1b  804277
VMware vSphere Client 5.0 Update 1a  755629
VMware vSphere Client 5.0 Update 1   623373
VMware vSphere Client 5.0            455964
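If you prefer to script the download, the installer can be pulled straight down with PowerShell once you have picked the appropriate link out of the KB article. A minimal sketch (the URL and output path below are placeholders, substitute the real values);

## START ##

## Download a vSphere Client installer using a link taken from KB 2089791
$url = 'https://<download-url-from-KB-2089791>'
Invoke-WebRequest -Uri $url -OutFile 'C:\Temp\VMware-viclient.exe'

## END ##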


VMware Security Hardening Guides

Security Hardening Guides provide prescriptive guidance for customers on how to deploy and operate VMware products in a secure manner. Guides for vSphere are provided in an easy to consume spreadsheet format, with rich metadata to allow for guideline classification and risk assessment. They also include script examples for enabling security automation. Comparison documents are provided that list changes in guidance in successive versions of the guide.
– See more at: VMware Security Hardening Guides
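As a flavour of the automation the guides describe, here is a minimal PowerCLI sketch (an illustrative example, not taken from the guides themselves) that reports one common guideline, the isolation.tools.copy.disable setting controlling copy operations between the guest and the remote console, across all VMs;

## START ##

## Report the isolation.tools.copy.disable setting for every VM
## (one of the guidelines covered in the vSphere hardening guide)
Connect-VIServer -Server 'myvcenter.fqdn'

Get-VM | ForEach-Object {
  $setting = Get-AdvancedSetting -Entity $_ -Name 'isolation.tools.copy.disable'
  $value   = if ($setting) { $setting.Value } else { 'not set' }
  New-Object PSObject -Property @{ VM = $_.Name; CopyDisable = $value }
} | Format-Table VM, CopyDisable -AutoSize

Disconnect-VIServer -Server * -Force -Confirm:$false

## END ##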


vMotion operation fails with the error, “The VM failed to resume on the destination during early power on”.

I had an interesting issue to resolve today, one that I hadn’t seen before and that took a bit of digging to get to the bottom of. The problem related to migrating some Exchange mailbox servers from a legacy ESXi 4.1 host onto a new ESXi 5.1 host.

This should have been a simple vMotion operation, but the task failed repeatedly at approximately 65% complete. I tried both high and standard priority migrations, but it failed every time, simply reporting “The VM failed to resume on the destination during early power on”.

[Screenshot: vMotion task failing with “The VM failed to resume on the destination during early power on”]

The first thing I did was check the host log files (vmkwarning and vmkernel), as well as the virtual machine log file (vmware.log) located in the virtual machine folder on the datastore;

## START ##

Nov  6 13:32:30 src-host vmkernel: 843:05:02:40.662 cpu6:64249)WARNING: Migrate: 296: 1415280590445360 S: Failed: Failed to resume VM (0xbad0044) @0x418023e4b250
Nov  6 13:32:30 src-host vmkernel: 843:05:02:40.664 cpu3:64248)WARNING: VMotionSend: 3857: 1415280590445360 S: failed to send final set of pages to the remote host <xx.xx.xx.xx>: Failure.
Nov  6 13:32:30 src-host vmkernel: 843:05:02:40.689 cpu12:48347)WARNING: Migrate: 4328: 1415280590445360 S: Migration considered a failure by the VMX.  It is most likely a timeout, but check the VMX log for the true error.
Nov  6 14:51:31 src-host vmkernel: 843:06:21:41.945 cpu6:64267)WARNING: Migrate: 296: 1415285312829818 S: Failed: Failed to resume VM (0xbad0044) @0x418023e4b250
Nov  6 14:51:31 src-host vmkernel: 843:06:21:41.953 cpu19:64266)WARNING: VMotionSend: 3857: 1415285312829818 S: failed to send final set of pages to the remote host <xx.xx.xx.xx>: Failure.
Nov  6 14:51:31 src-host vmkernel: 843:06:21:41.970 cpu12:48347)WARNING: Migrate: 4328: 1415285312829818 S: Migration considered a failure by the VMX.  It is most likely a timeout, but check the VMX log for the true error.

## END ##
## START ##

Nov  6 13:35:23 dst-host vmkernel: 501:21:49:25.404 cpu1:63073)WARNING: MemSched: 12625: Non-overhead memory reservation for vmx user-world (worldID=63073) is greater than desired minimum amount of 57344 KB (min=57344 KB, reservedOverhead=0 KB, totalReserved=68812 KB)
Nov  6 13:36:05 dst-host vmkernel: 501:21:50:07.143 cpu6:63073)WARNING: MemSched: vm 63073: 5199: Cannot reduce reservation by 2021 pages (total reservation: 27428 pages, consumed reservation: 27428 pages)
Nov  6 13:36:28 dst-host vmkernel: 501:21:50:29.836 cpu5:63091)WARNING: MemSched: 12625: Non-overhead memory reservation for vmx user-world (worldID=63091) is greater than desired minimum amount of 57344 KB (min=57344 KB, reservedOverhead=0 KB, totalReserved=68552 KB)
Nov  6 13:36:39 dst-host vmkernel: 501:21:50:41.256 cpu6:63091)WARNING: MemSched: vm 63091: 5199: Cannot reduce reservation by 1913 pages (total reservation: 24038 pages, consumed reservation: 24038 pages)
Nov  6 13:37:10 dst-host vmkernel: 501:21:51:12.241 cpu5:63106)WARNING: MemSched: 12625: Non-overhead memory reservation for vmx user-world (worldID=63106) is greater than desired minimum amount of 57344 KB (min=57344 KB, reservedOverhead=0 KB, totalReserved=68724 KB)
Nov  6 13:37:50 dst-host vmkernel: 501:21:51:51.758 cpu11:63106)WARNING: MemSched: vm 63106: 5199: Cannot reduce reservation by 2021 pages (total reservation: 27327 pages, consumed reservation: 27327 pages)
Nov  6 13:38:16 dst-host vmkernel: 501:21:52:18.119 cpu6:63124)WARNING: MemSched: 12625: Non-overhead memory reservation for vmx user-world (worldID=63124) is greater than desired minimum amount of 57344 KB (min=57344 KB, reservedOverhead=0 KB, totalReserved=69464 KB)
Nov  6 13:40:23 dst-host vmkernel: 501:21:54:25.336 cpu2:63124)WARNING: MemSched: vm 63124: 5199: Cannot reduce reservation by 2019 pages (total reservation: 38944 pages, consumed reservation: 38944 pages)
Nov  6 14:48:34 dst-host vmkernel: 501:23:02:35.673 cpu1:63154)WARNING: MemSched: 12625: Non-overhead memory reservation for vmx user-world (worldID=63154) is greater than desired minimum amount of 57344 KB (min=57344 KB, reservedOverhead=0 KB, totalReserved=69540 KB)
Nov  6 14:52:04 dst-host vmkernel: 501:23:06:05.978 cpu15:63154)WARNING: Migrate: 4328: 1415285312829818 D: Migration considered a failure by the VMX.  It is most likely a timeout, but check the VMX log for the true error.
Nov  6 14:52:04 dst-host vmkernel: 501:23:06:05.978 cpu15:63154)WARNING: Migrate: 296: 1415285312829818 D: Failed: Migration determined a failure by the VMX (0xbad0092) @0x41801fb9acb9
Nov  6 14:52:04 dst-host vmkernel: 501:23:06:05.978 cpu15:63154)WARNING: VMotionUtil: 3548: 1415285312829818 D: timed out waiting 0 ms to transmit data.
Nov  6 14:52:04 dst-host vmkernel: 501:23:06:05.978 cpu15:63154)WARNING: VMotionSend: 624: 1415285312829818 D: (9-0x410300002058) failed to receive 72/72 bytes from the remote host <xx.xx.xx.xx>: Timeout

## END ##

Reading through the host log files, it looks like there was a problem reserving enough memory on the destination host and the operation timed out. This sounded plausible, but the exact same results were observed when trying to migrate the VM onto an empty host.
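As an aside, you can pull just the migration-related warnings out of the host logs with a simple grep rather than scrolling through the full files. A rough sketch (log locations vary between releases; on ESXi 5.x the kernel log is /var/log/vmkernel.log, while older ESX/ESXi hosts log via syslog to /var/log/vmkernel or /var/log/messages);

## START ##

## Show the most recent migration-related warnings from the host kernel log (ESXi 5.x path shown)
grep -iE "Migrate|VMotion" /var/log/vmkernel.log | tail -n 20

## END ##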

The next step was to review the guest VM’s log file;

## START ##

Nov 06 14:52:04.416: vmx| DISKLIB-CTK   : Could not open tracking file. File open returned IO error 4.
Nov 06 14:52:04.416: vmx| DISKLIB-CTK   : Could not open change tracking file "/vmfs/volumes/4f1edf9e-b9c5beba-cd04-0025b30202ac/GUEST-VM/GUEST-VM_3-ctk.vmdk": Could not open/create change tracking file.
Nov 06 14:52:04.417: vmx| DISKLIB-LIB   : Could not open change tracker /vmfs/volumes/4f1edf9e-b9c5beba-cd04-0025b30202ac/GUEST-VM/GUEST-VM_3-ctk.vmdk: Could not open/create change tracking file.
Nov 06 14:52:04.421: vmx| DISKLIB-VMFS  : "/vmfs/volumes/4f1edf9e-b9c5beba-cd04-0025b30202ac/GUEST-VM/GUEST-VM_3-rdm.vmdk" : closed.
Nov 06 14:52:04.421: vmx| DISKLIB-LIB   : Failed to open '/vmfs/volumes/4f1edf9e-b9c5beba-cd04-0025b30202ac/GUEST-VM/GUEST-VM_3.vmdk' with flags 0xa Could not open/create change tracking file (2108).
Nov 06 14:52:04.421: vmx| DISK: Cannot open disk "/vmfs/volumes/4f1edf9e-b9c5beba-cd04-0025b30202ac/GUEST-VM/GUEST-VM_3.vmdk": Could not open/create change tracking file (2108).
Nov 06 14:52:04.422: vmx| Msg_Post: Error
Nov 06 14:52:04.422: vmx| [msg.disk.noBackEnd] Cannot open the disk '/vmfs/volumes/4f1edf9e-b9c5beba-cd04-0025b30202ac/GUEST-VM/GUEST-VM_3.vmdk' or one of the snapshot disks it depends on.
Nov 06 14:52:04.423: vmx| [msg.disk.configureDiskError] Reason: Could not open/create change tracking file.----------------------------------------
Nov 06 14:52:04.437: vmx| Module DiskEarly power on failed.
Nov 06 14:52:04.439: vmx| VMX_PowerOn: ModuleTable_PowerOn = 0
Nov 06 14:52:04.440: vmx| MigrateSetStateFinished: type=2 new state=11
Nov 06 14:52:04.440: vmx| MigrateSetState: Transitioning from state 10 to 11.
Nov 06 14:52:04.441: vmx| Migrate_SetFailure: The VM failed to resume on the destination during early power on.
Nov 06 14:52:04.441: vmx| Msg_Post: Error
Nov 06 14:52:04.442: vmx| [msg.migrate.resume.fail] The VM failed to resume on the destination during early power on.

## END ##

Interestingly, we now start to see hints that a file lock may be the problem, and we also see the same error message that was observed in the vSphere Client: “The VM failed to resume on the destination during early power on”.

I decided to have a look at the contents of the virtual machine folder, and found a number of suspicious-looking “-ctk.vmdk” files, mostly time-stamped more than two years ago.

## START ##

[root@host GUEST-VM]# ls -lh
total 73G
-rw------- 1 root root  4.3M Feb 28  2012 GUEST-VM_10-ctk.vmdk
-rw------- 1 root root  1.1T Feb 27  2012 GUEST-VM_10-rdm.vmdk
-rw------- 1 root root   582 Feb 28  2012 GUEST-VM_10.vmdk
-rw------- 1 root root  7.7M Nov  6 14:51 GUEST-VM_11-ctk.vmdk
-rw------- 1 root root  983G Feb  5  2014 GUEST-VM_11-rdm.vmdk
-rw------- 1 root root   559 Feb  5  2014 GUEST-VM_11.vmdk
-rw------- 1 root root   65K Jun 22  2013 GUEST-VM_1-ctk.vmdk
-rw------- 1 root root  1.0G Nov  3 17:33 GUEST-VM_1-flat.vmdk
-rw------- 1 root root   586 Feb 27  2012 GUEST-VM_1.vmdk
-rw------- 1 root root   65K Jun 22  2013 GUEST-VM_2-ctk.vmdk
-rw------- 1 root root  1.0G Nov  3 17:33 GUEST-VM_2-flat.vmdk
-rw------- 1 root root   586 Feb 27  2012 GUEST-VM_2.vmdk
-rw------- 1 root root  4.3M Feb 28  2012 GUEST-VM_3-ctk.vmdk
-rw------- 1 root root  1.1T Feb 27  2012 GUEST-VM_3-rdm.vmdk
-rw------- 1 root root   580 Feb 28  2012 GUEST-VM_3.vmdk
-rw------- 1 root root  4.3M Feb 28  2012 GUEST-VM_4-ctk.vmdk
-rw------- 1 root root  1.1T Feb 27  2012 GUEST-VM_4-rdm.vmdk
-rw------- 1 root root   580 Feb 28  2012 GUEST-VM_4.vmdk
-rw------- 1 root root  4.3M Feb 28  2012 GUEST-VM_5-ctk.vmdk
-rw------- 1 root root  1.1T Feb 27  2012 GUEST-VM_5-rdm.vmdk
-rw------- 1 root root   580 Feb 28  2012 GUEST-VM_5.vmdk
-rw------- 1 root root  4.3M Feb 28  2012 GUEST-VM_6-ctk.vmdk
-rw------- 1 root root  1.1T Feb 27  2012 GUEST-VM_6-rdm.vmdk
-rw------- 1 root root   580 Feb 28  2012 GUEST-VM_6.vmdk
-rw------- 1 root root  4.3M Feb 28  2012 GUEST-VM_7-ctk.vmdk
-rw------- 1 root root  1.1T Feb 27  2012 GUEST-VM_7-rdm.vmdk
-rw------- 1 root root   580 Feb 28  2012 GUEST-VM_7.vmdk
-rw------- 1 root root  4.3M Feb 28  2012 GUEST-VM_8-ctk.vmdk
-rw------- 1 root root  1.1T Feb 27  2012 GUEST-VM_8-rdm.vmdk
-rw------- 1 root root   580 Feb 28  2012 GUEST-VM_8.vmdk
-rw-r--r-- 1 root root    48 Nov  6 14:48 GUEST-VM-98045e41.hlog
-rw------- 1 root root  4.3M Feb 28  2012 GUEST-VM_9-ctk.vmdk
-rw------- 1 root root  1.1T Feb 27  2012 GUEST-VM_9-rdm.vmdk
-rw------- 1 root root   580 Feb 28  2012 GUEST-VM_9.vmdk
-rw------- 1 root root  4.4M Feb 28  2012 GUEST-VM-ctk.vmdk
-rw------- 1 root root   70G Nov  6 15:11 GUEST-VM-flat.vmdk
-rw------- 1 root root  8.5K Nov  3 17:36 GUEST-VM.nvram
-rw------- 1 root root   585 Feb 27  2012 GUEST-VM.vmdk
-rw-r--r-- 1 root root    44 Feb 28  2012 GUEST-VM.vmsd
-rwxr-xr-x 1 root root  5.8K Feb 19  2014 GUEST-VM.vmx
-rw-r--r-- 1 root root   266 Feb  5  2014 GUEST-VM.vmxf
drwxr-xr-x 1 root root   420 Feb 19  2014 phd
-rw-r--r-- 1 root root   57K Nov 19  2012 vmware-43.log
-rw-r--r-- 1 root root   57K Jun 22  2013 vmware-44.log
-rw-r--r-- 1 root root   57K Jun 22  2013 vmware-45.log
-rw-r--r-- 1 root root  1.0M Feb 19  2014 vmware-46.log
-rw-r--r-- 1 root root 1020K Nov  6 14:51 vmware-47.log
-rw-r--r-- 1 root root   57K Nov  6 13:33 vmware-48.log
-rw-r--r-- 1 root root   57K Nov  6 14:52 vmware.log

## END ##

You can consolidate this view with a simple grep of just the “-ctk.vmdk” files;

## START ##

[root@host GUEST-VM]# ls -al | grep ctk
-rw------- 1 root root       4436480 Feb 28  2012 GUEST-VM_10-ctk.vmdk
-rw------- 1 root root       8053248 Nov  6 14:51 GUEST-VM_11-ctk.vmdk
-rw------- 1 root root         66048 Jun 22  2013 GUEST-VM_1-ctk.vmdk
-rw------- 1 root root         66048 Jun 22  2013 GUEST-VM_2-ctk.vmdk
-rw------- 1 root root       4436480 Feb 28  2012 GUEST-VM_3-ctk.vmdk
-rw------- 1 root root       4436480 Feb 28  2012 GUEST-VM_4-ctk.vmdk
-rw------- 1 root root       4436480 Feb 28  2012 GUEST-VM_5-ctk.vmdk
-rw------- 1 root root       4436480 Feb 28  2012 GUEST-VM_6-ctk.vmdk
-rw------- 1 root root       4436480 Feb 28  2012 GUEST-VM_7-ctk.vmdk
-rw------- 1 root root       4436480 Feb 28  2012 GUEST-VM_8-ctk.vmdk
-rw------- 1 root root       4436480 Feb 28  2012 GUEST-VM_9-ctk.vmdk
-rw------- 1 root root       4588032 Feb 28  2012 GUEST-VM-ctk.vmdk

## END ##

For ESX/ESXi 3.x/4.x and ESXi 5.0, the lock status of these “-ctk.vmdk” files can be obtained using the vmkfstools command. The process and syntax are explained in detail in KB1003397, titled “Unable to perform operations on a virtual machine with a locked disk.”

## START ##

vmkfstools -D /vmfs/volumes/LUN/VM/disk-flat.vmdk

## END ##

You will see output similar to the example below; we are specifically interested in the mode value returned;

## START ##

Lock [type 10c00001 offset 54009856 v 11, hb offset 3198976
gen 9, mode 0, owner 4655cd8b-3c4a19f2-17bc-00145e808070  mtime 114]
Addr <4, 116, 4>, gen 5, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 5368709120, nb 0 tbz 0, cow 0, zla 3, bs 1048576

## END ##

The mode indicates the type of lock that is on the file, as follows;

## START ##

mode 0 = no lock
mode 1 = exclusive lock (e.g. the vmx file of a powered-on VM, the currently used disk (flat or delta), *.vswp, etc.)
mode 2 = read-only lock (e.g. on the -flat.vmdk of a running VM with snapshots)
mode 3 = multi-writer lock (e.g. used for MSCS cluster disks or FT VMs)

## END ##

In my case, all “-ctk.vmdk” files reported an exclusive mode 1 lock;

## START ##

[root@host GUEST-VM]# vmkfstools -D GUEST-VM-ctk.vmdk
Lock [type 10c00001 offset 62181376 v 7105, hb offset 3211264
gen 87017, mode 1, owner 5003ce7b-b04b7d3f-f4e5-b499babda354 mtime 14822239]
Addr <4, 89, 194>, gen 7012, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 4588032, nb 1 tbz 0, cow 0, zla 1, bs 8388608

[root@host GUEST-VM]# vmkfstools -D GUEST-VM_1-ctk.vmdk
Lock [type 10c00001 offset 62185472 v 7106, hb offset 3211264
gen 87017, mode 1, owner 5003ce7b-b04b7d3f-f4e5-b499babda354 mtime 14822246]
Addr <4, 89, 196>, gen 7013, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 66048, nb 1 tbz 0, cow 0, zla 1, bs 8388608

[root@host GUEST-VM]# vmkfstools -D GUEST-VM_2-ctk.vmdk
Lock [type 10c00001 offset 62187520 v 7107, hb offset 3211264
gen 87017, mode 1, owner 5003ce7b-b04b7d3f-f4e5-b499babda354 mtime 14822253]
Addr <4, 89, 197>, gen 7014, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 66048, nb 1 tbz 0, cow 0, zla 1, bs 8388608

[root@host GUEST-VM]# vmkfstools -D GUEST-VM_3-ctk.vmdk
Lock [type 10c00001 offset 62189568 v 7052, hb offset 3211264
gen 87017, mode 1, owner 5003ce7b-b04b7d3f-f4e5-b499babda354 mtime 11201757]
Addr <4, 89, 198>, gen 7015, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 4436480, nb 1 tbz 0, cow 0, zla 1, bs 8388608

[root@host GUEST-VM]# vmkfstools -D GUEST-VM_4-ctk.vmdk
Lock [type 10c00001 offset 62191616 v 7053, hb offset 3211264
gen 87017, mode 1, owner 5003ce7b-b04b7d3f-f4e5-b499babda354 mtime 11201767]
Addr <4, 89, 199>, gen 7016, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 4436480, nb 1 tbz 0, cow 0, zla 1, bs 8388608

[root@host GUEST-VM]# vmkfstools -D GUEST-VM_5-ctk.vmdk
Lock [type 10c00001 offset 61786112 v 7054, hb offset 3211264
gen 87017, mode 1, owner 5003ce7b-b04b7d3f-f4e5-b499babda354 mtime 11201777]
Addr <4, 89, 1>, gen 7017, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 4436480, nb 1 tbz 0, cow 0, zla 1, bs 8388608

[root@host GUEST-VM]# vmkfstools -D GUEST-VM_6-ctk.vmdk
Lock [type 10c00001 offset 61792256 v 7055, hb offset 3211264
gen 87017, mode 1, owner 5003ce7b-b04b7d3f-f4e5-b499babda354 mtime 11201787]
Addr <4, 89, 4>, gen 7018, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 4436480, nb 1 tbz 0, cow 0, zla 1, bs 8388608

[root@host GUEST-VM]# vmkfstools -D GUEST-VM_7-ctk.vmdk
Lock [type 10c00001 offset 61794304 v 7056, hb offset 3211264
gen 87017, mode 1, owner 5003ce7b-b04b7d3f-f4e5-b499babda354 mtime 11201797]
Addr <4, 89, 5>, gen 7019, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 4436480, nb 1 tbz 0, cow 0, zla 1, bs 8388608

[root@host GUEST-VM]# vmkfstools -D GUEST-VM_8-ctk.vmdk
Lock [type 10c00001 offset 61796352 v 7057, hb offset 3211264
gen 87017, mode 1, owner 5003ce7b-b04b7d3f-f4e5-b499babda354 mtime 11201807]
Addr <4, 89, 6>, gen 7020, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 4436480, nb 1 tbz 0, cow 0, zla 1, bs 8388608

[root@host GUEST-VM]# vmkfstools -D GUEST-VM_9-ctk.vmdk
Lock [type 10c00001 offset 61798400 v 7058, hb offset 3211264
gen 87017, mode 1, owner 5003ce7b-b04b7d3f-f4e5-b499babda354 mtime 11201817]
Addr <4, 89, 7>, gen 7021, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 4436480, nb 1 tbz 0, cow 0, zla 1, bs 8388608

[root@host GUEST-VM]# vmkfstools -D GUEST-VM_10-ctk.vmdk
Lock [type 10c00001 offset 61800448 v 7059, hb offset 3211264
gen 87017, mode 1, owner 5003ce7b-b04b7d3f-f4e5-b499babda354 mtime 11201827]
Addr <4, 89, 8>, gen 7022, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 4436480, nb 1 tbz 0, cow 0, zla 1, bs 8388608

[root@host GUEST-VM]# vmkfstools -D GUEST-VM_11-ctk.vmdk
Lock [type 10c00001 offset 12601344 v 46751, hb offset 3211264
gen 87017, mode 1, owner 5003ce7b-b04b7d3f-f4e5-b499babda354 mtime 14822300]
Addr <4, 10, 9>, gen 46740, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 8053248, nb 1 tbz 0, cow 0, zla 1, bs 8388608

## END ##
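A useful detail in the output above: the last portion of the owner field is the MAC address of the host holding the lock (b4:99:ba:bd:a3:54 in this case), so you can usually identify the locking host by matching it against the NIC MAC addresses of your hosts. A quick way to list them on a suspected host;

## START ##

## List the physical NICs and their MAC addresses, to compare against the MAC
## embedded in the lock owner UID reported by vmkfstools -D
esxcfg-nics -l

## END ##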

At this stage I created a “tmp” directory in the virtual machine folder and moved all the “-ctk.vmdk” files into it. Since this was a live, powered-on VM, I felt more comfortable doing this with a GUI than with the shell, so I used WinSCP to move the files.
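If you would rather do the same thing from the ESXi shell, the equivalent is just a mkdir and a mv inside the VM folder; a minimal sketch, assuming you are already in the virtual machine’s directory on the datastore;

## START ##

## Move all change tracking files out of the VM folder into a tmp sub-directory
mkdir tmp
mv *-ctk.vmdk tmp/

## END ##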

I then confirmed there were no longer any “-ctk.vmdk” files in the virtual machine folder, and that they were all in the newly created “tmp” folder;

## START ##

[root@host GUEST-VM]# ls -al | grep ctk
[root@host GUEST-VM]#

[root@host GUEST-VM]# cd /vmfs/volumes/datastore/GUEST-VM/tmp/

[root@host tmp]# ls -lh
total 96M
-rw------- 1 root root 4.3M Feb 28  2012 GUEST-VM_10-ctk.vmdk
-rw------- 1 root root 7.7M Nov  6 15:32 GUEST-VM_11-ctk.vmdk
-rw------- 1 root root  65K Jun 22  2013 GUEST-VM_1-ctk.vmdk
-rw------- 1 root root  65K Jun 22  2013 GUEST-VM_2-ctk.vmdk
-rw------- 1 root root 4.3M Feb 28  2012 GUEST-VM_3-ctk.vmdk
-rw------- 1 root root 4.3M Feb 28  2012 GUEST-VM_4-ctk.vmdk
-rw------- 1 root root 4.3M Feb 28  2012 GUEST-VM_5-ctk.vmdk
-rw------- 1 root root 4.3M Feb 28  2012 GUEST-VM_6-ctk.vmdk
-rw------- 1 root root 4.3M Feb 28  2012 GUEST-VM_7-ctk.vmdk
-rw------- 1 root root 4.3M Feb 28  2012 GUEST-VM_8-ctk.vmdk
-rw------- 1 root root 4.3M Feb 28  2012 GUEST-VM_9-ctk.vmdk
-rw------- 1 root root 4.4M Feb 28  2012 GUEST-VM-ctk.vmdk
[root@host tmp]#

## END ##

Now the vMotion operation completes successfully;

[Screenshot: vMotion task completing successfully]

Here are the accompanying entries from the hosts’ vmkernel logs, which look much healthier than the original logs;

## START ##

Nov  6 15:53:18 src-host vmkernel: 843:07:23:28.414 cpu12:48347)Migrate: vm 48348: 3046: Setting VMOTION info: Source ts = 1415289229136448, src ip = <xx.xx.xx.xx> dest ip = <xx.xx.xx.xx> Dest wid = 4359 using SHARED swap
Nov  6 15:53:18 src-host vmkernel: 843:07:23:28.415 cpu12:48347)Tcpip_Vmk: 1013: Affinitizing xx.xx.xx.xx to world 64297, Success
Nov  6 15:53:18 src-host vmkernel: 843:07:23:28.415 cpu12:48347)VMotion: 2366: 1415289229136448 S: Set ip address 'xx.xx.xx.xx' worldlet affinity to send World ID 64297
Nov  6 15:53:18 src-host vmkernel: 843:07:23:28.415 cpu5:4327)MigrateNet: vm 4327: 1378: Accepted connection from <xx.xx.xx.xx>
Nov  6 15:53:18 src-host vmkernel: 843:07:23:28.415 cpu5:4327)MigrateNet: vm 4327: 1422: dataSocket 0x4100a6063410 receive buffer size is 563272
Nov  6 15:53:18 src-host vmkernel: 843:07:23:28.497 cpu23:64298)VMotionDiskOp: 769: 1415289229136448 S: DiskOps handshake successful.
Nov  6 15:55:59 src-host vmkernel: 843:07:26:10.006 cpu20:48348)VMotion: 3714: 1415289229136448 S: Another pre-copy iteration needed with 640966 pages left to send (prev2 4194304, prev 4194304, network bandwidth ~91.894 MB/s)
Nov  6 15:56:29 src-host vmkernel: 843:07:26:39.611 cpu21:48348)VMotion: 3714: 1415289229136448 S: Another pre-copy iteration needed with 240035 pages left to send (prev2 4194304, prev 640966, network bandwidth ~91.010 MB/s)
Nov  6 15:56:42 src-host vmkernel: 843:07:26:53.116 cpu22:48348)VMotion: 3666: 1415289229136448 S: Stopping pre-copy: not enough forward progress (Pages left to send: prev2 640966, prev 240035, cur 185166, network bandwidth ~87.701 MB/s)
Nov  6 15:56:42 src-host vmkernel: 843:07:26:53.116 cpu22:48348)VMotion: 3696: 1415289229136448 S: Remaining pages can be sent in 8.445 seconds, which is less than the maximum switchover time of 100 seconds, so proceeding with suspend.
Nov  6 15:56:54 src-host vmkernel: 843:07:27:04.521 cpu9:64297)VMotionSend: 3866: 1415289229136448 S: Sent all modified pages to destination (network bandwidth ~81.079 MB/s)

## END ##
## START ##

Nov  6 15:57:20 dst-host vmkernel: 0:01:16:26.628 cpu19:4359)VMotion: 4635: 1415289229136448 D: Page-in made enough progress during checkpoint load. Resuming immediately.
Nov  6 15:57:20 dst-host vmkernel: 0:01:16:26.664 cpu19:4359)VmMemMigrate: vm 4359: 4786: Regular swap file bitmap checks out.
Nov  6 15:57:20 dst-host vmkernel: 0:01:16:26.667 cpu19:4359)VMotion: 4489: 1415289229136448 D: Resume handshake successful
Nov  6 15:57:20 dst-host vmkernel: 0:01:16:26.667 cpu13:4374)Swap: vm 4359: 9066: Starting prefault for the migration swap file
Nov  6 15:57:20 dst-host vmkernel: 0:01:16:26.736 cpu21:4374)Swap: vm 4359: 9205: Finish swapping in migration swap file. (faulted 0 pages, pshared 0 pages). Success.
Nov  6 15:57:20 dst-host vmkernel: 0:01:16:26.824 cpu18:4359)Net: 1421: connected GUEST-VM eth0 to Network xx.xx.xx, portID 0x2000004
Nov  6 15:57:20 dst-host vmkernel: 0:01:16:27.003 cpu21:4359)NetPort: 982: enabled port 0x2000004 with mac 00:00:00:00:00:00
Nov  6 15:57:20 dst-host vmkernel: 0:01:16:27.003 cpu21:4359)Net: 1421: connected GUEST-VM eth0 to Network xx.xx.xx, portID 0x2000005
Nov  6 15:57:20 dst-host vmkernel: 0:01:16:27.003 cpu21:4359)NetPort: 982: enabled port 0x2000005 with mac 00:00:00:00:00:00
Nov  6 15:57:26 dst-host vmkernel: 0:01:16:33.040 cpu16:4359)VmMemMigrate: vm 4359: 1946: pgNum (0x3fce57) changed type to 1 while remotely faulting it in.
Nov  6 15:57:26 dst-host vmkernel: 0:01:16:33.061 cpu16:4359)VmMemMigrate: vm 4359: 1946: pgNum (0x3fd43e) changed type to 1 while remotely faulting it in.
Nov  6 15:57:26 dst-host vmkernel: 0:01:16:33.096 cpu16:4359)VmMemMigrate: vm 4359: 1946: pgNum (0x3fe6b3) changed type to 1 while remotely faulting it in.
Nov  6 15:57:27 dst-host vmkernel: 0:01:16:33.166 cpu1:4367)VMotionRecv: 1984: 1415289229136448 D: DONE paging in
Nov  6 15:57:27 dst-host vmkernel: 0:01:16:33.166 cpu1:4367)VMotionRecv: 1992: 1415289229136448 D: Estimated network bandwidth 81.633 MB/s during page-in

## END ##

I’m not sure where these “-ctk.vmdk” files came from, but I suspect they originated from a legacy backup process that pre-dates my time here. At least for now the issue is resolved and we know what to look for the next time this happens.
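For background, “-ctk.vmdk” files are created when Changed Block Tracking (CBT) is enabled on a disk, which is typically switched on by image-level backup products, so a long-gone backup job is a plausible explanation. If you want to check whether CBT is still enabled for a VM, the flags live in the .vmx file; a quick sketch;

## START ##

## Check whether Changed Block Tracking is enabled in the VM's configuration
grep -i ctkEnabled GUEST-VM.vmx

## When CBT is on you would expect entries along these lines:
## ctkEnabled = "TRUE"
## scsi0:0.ctkEnabled = "TRUE"

## END ##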

Credits;

Thanks to Jakob Fabritius Nørregaard for posting this blog article which helped identify and resolve this issue.


Intel microcode issue affecting E5-2600 v2 series processors

We recently experienced a number of recurring, unexpected restarts of guest VMs, all Windows 2008 R2 servers running MSSQL Server 2008. These VMs were all hosted on ESXi 5.0.0 build 1489271 (Update 3). All hosts in the cluster are relatively new HP DL380p Gen8 servers, each with two Intel Xeon E5-2667 v2 processors @ 3.3GHz.

My first thought was that it had something to do with MSSQL Server 2008, as these were the only guest VMs affected. I used the Windows debugging tools (with the correct symbols) to analyse the kernel dumps, and in all cases the probable cause was memory corruption;
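For anyone wanting to repeat the analysis, a minimal WinDbg session along the following lines (an illustrative sketch; it assumes the guest memory dump has been copied off the VM and that the Microsoft public symbol server is reachable) will produce bugcheck summaries like those shown below;

## START ##

## In WinDbg: File > Open Crash Dump (the MEMORY.DMP from the guest), then run:

.symfix      $$ use the Microsoft public symbol server
.reload      $$ reload symbols
!analyze -v  $$ verbose automated bugcheck analysis

## END ##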

Bugcheck analysis, probably caused by;
## START ##

memory_corruption ( nt!MiDeletePageTableHierarchy+9c )
######################################################

'IRQL_NOT_LESS_OR_EQUAL (a)'
An attempt was made to access a pageable (or completely invalid) address
at an interrupt request level (IRQL) that is too high.  This is usually
caused by drivers using improper addresses. If a kernel debugger is
available get the stack backtrace.

Arguments:
Arg1: fffff6fb40001de0, memory referenced
Arg2: 0000000000000000, IRQL
Arg3: 0000000000000000, bitfield :
bit 0 : value 0 = read operation, 1 = write operation
bit 3 : value 0 = not an execute operation, 1 = execute operation (only
on chips which support this level of status)
Arg4: fffff800016bbacc, address which referenced memory

memory_corruption ( nt!MiBadShareCount+4c )
###########################################

'PFN_LIST_CORRUPT (4e)'
Typically caused by drivers passing bad memory descriptor lists (ie:
calling MmUnlockPages twice with the same list, etc).  If a kernel
debugger is available get the stack trace.

Arguments:
Arg1: 0000000000000099, A PTE or PFN is corrupt
Arg2: 0000000000000000, page frame number
Arg3: 0000000000000000, current page state
Arg4: 0000000000000000, 0

memory_corruption ( nt!MiResolveDemandZeroFault+2e2 )
#####################################################

'IRQL_NOT_LESS_OR_EQUAL (a)'
An attempt was made to access a pageable (or completely invalid) address
at an interrupt request level (IRQL) that is too high.  This is usually
caused by drivers using improper addresses. If a kernel debugger is
available get the stack backtrace.

Arguments:
Arg1: fffff700010818a0, memory referenced
Arg2: 0000000000000000, IRQL
Arg3: 0000000000000000, bitfield :
bit 0 : value 0 = read operation, 1 = write operation
bit 3 : value 0 = not an execute operation, 1 = execute operation (only
on chips which support this level of status)
Arg4: fffff800016ce272, address which referenced memory

memory_corruption ( nt!MiRemoveWorkingSetPages+388 )
####################################################

'PAGE_FAULT_IN_NONPAGED_AREA (50)'
Invalid system memory was referenced.  This cannot be protected by
try-except, it must be protected by a Probe.  Typically the address is
just plain bad or it is pointing at freed memory.

Arguments
Arg1 fffff68000000088, memory referenced.
Arg2 0000000000000000, value 0 = read operation, 1 = write operation.
Arg3 fffff80001855350, If non-zero, the instruction address which
referenced the bad memory address.
Arg4 0000000000000005, (reserved)

## END ##

The other common factor was the identical host hardware, so I started looking into firmware levels and customer advisories for any known issues. One of the first things I found was a BIOS firmware update to address a cross-platform issue caused by the Intel microcode of the E5-2600 v2 series processors. This is specific to the v2 series; the original E5-2600 series is not affected.

HP has addressed this in a number of recent BIOS updates, which are not yet included in the most recent HP Service Pack for ProLiant (SPP), Version 2014.02.0(B);

Version:2014.02.10 (2 May 2014)
Addressed a processor issue which can result in a Blue Screen of Death (BSOD) in a Windows virtual machine or Linux Kernel Panic in a Linux virtual machine when running on Microsoft Hyper-V or VMware ESX 5.x on Intel Xeon E5-2600 series v2 processors. This issue is not unique to HP ProLiant servers and could impact any system using affected processors operating with the conditions listed. This revision of the System ROM contains an updated version of Intel’s microcode that addresses this issue. This issue does NOT affect servers configured with the Intel Xeon E5-2600 series processors.

 

Version:2013.12.20 (A) (21 Jan 2014)
Addressed an issue where Memory Address or Command Parity errors may occur on servers configured with Intel Xeon E5-2600 series v2 processors and memory configurations where the memory speed is running at 1600 MHz or 1866 MHz. These errors may have resulted in the server resetting without notification of the error or the system resetting and displaying a “283-Memory Address/Command Parity Error Detected Error” and logging the event to the Integrated Management Log (IML). HP strongly recommends that all servers utilizing Intel E5-2600 v2 processors with impacted memory speeds update to this revision of the System ROM or later. This issue does NOT affect servers configured with the Intel Xeon E5-2600 series processor.

 

Version:2013.11.14 (A) (20 Dec 2013)
Addressed an issue where the server may not be able to enter processor idle power states (C-states) which can increase idle power when configured with 2 Intel Xeon E5-2600 v2 Series Processors.
Addressed an issue where servers configured with Intel Xeon E5-2600 v2 processors and 32 GB LRDIMMs may experience an increased rate of corrected memory errors or uncorrected memory errors. This issue impacts servers configured with 2 DIMMs per channel or 3 DIMMs per channel. Any server configured with Intel Xeon E5-2600 v2 processors using LRDIMMs should be updated to this revision of the System ROM or later. If experiencing memory errors with the indicated configuration, HP recommends updating to this revision of the System ROM or later before contacting HP service.
Addressed an issue where Memory Address or Command Parity errors are not logged to the Integrated Management Log (IML) if they occur. With previous revisions of the System ROM, these types of errors would cause the server to reset without any notification of the error. A “283-Memory Address/Command Parity Error Detected” error will now be displayed during system boot and logged to the IML.

 

Version:2013.09.18 (A) (24 Sep 2013)
Addressed an issue where a system configured with Intel Xeon E5-2690 v2, E5-2680 v2, E5-2670 v2, and E5-2660 v2 processors and Advanced Memory Protection configured to Online Spare Mode may experience incorrect behavior when multiple Online Spare switchovers occur on the same processor.
Added support for LRDIMMs for systems configured with Intel Xeon E5-2600 Series v2 processors. Previous System ROM revisions that supported E5-2600 Series v2 processors displayed a “274-Unsupported DIMM Configuration Detected” message during system boot when LRDIMMs were installed with Intel Xeon E5-2600 v2 processors. Previous ROM revisions did support LRDIMMs with Intel Xeon E5-2600 processors.

 

Version:2013.09.08 (A) (13 Aug 2013)
Addressed a processor issue under which a rare and complex sequence of internal processor microarchitecture events that occur in specific operating environments could cause a server system to experience unexpected page faults, general protection faults, or machine check exceptions or other unpredictable system behavior. While all processors supported by this server have this issue, to be affected by this issue the server must be operating in a virtualized environment, have Intel Hyperthreading enabled, have a hypervisor that enables Intel VT FlexPriority and Extended Page Tables, and have a guest OS utilizing 32-bit PAE Paging Mode. This issue is not unique to HP ProLiant servers and could impact any system utilizing affected processors operating with the conditions listed above. This revision of the System ROM contains an updated version of Intel’s microcode that addresses this issue. Due to the potential severity of the issue addressed in this revision of the System ROM, this System ROM upgrade is considered a critical fix.

 

These excerpts are all taken from the HP ProLiant DL380p Gen8 Server BIOS release notes, so please refer to the vendor’s site for a full list of fixes and enhancements.

We have a support case open with Microsoft regarding this, so it will be interesting to see what recommendations come back, but I plan to address the issue by updating the hosts’ BIOS to the current version 2014.02.10 (A), released on 2nd May 2014.
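Before rolling the update out, it is worth confirming exactly which System ROM each host is currently running. A quick PowerCLI sketch that pulls the BIOS version and release date exposed through the vSphere API (it assumes an existing Connect-VIServer session; adjust the cluster name for your environment);

## START ##

## Report the BIOS version and release date for every host in the cluster
Get-Cluster 'MY-CLUSTER' | Get-VMHost | Sort-Object Name | ForEach-Object {
  $bios = $_.ExtensionData.Hardware.BiosInfo
  New-Object PSObject -Property @{
    Host        = $_.Name
    BiosVersion = $bios.BiosVersion
    ReleaseDate = $bios.ReleaseDate
  }
} | Format-Table Host, BiosVersion, ReleaseDate -AutoSize

## END ##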

Update 13/06/2014;

Since deploying the BIOS update, we haven’t had a single instance re-occur – success!


Creating custom SATP claimrules for EMC Symmetrix

As part of my migration work to EMC storage, I need to create a custom SATP rule on each of my vSphere 5.0 hosts. The two obvious options are using ESXCLI from an SSH session to each host, or using PowerShell, where ESXCLI is exposed via the Get-EsxCli cmdlet. The PowerShell option suits me better as I can add the claimrule to all my hosts in one go.

These are the two options;

Using ESXCLI via SSH;

## START ##

## Add custom SATP claimrule
esxcli storage nmp satp rule add -s "VMW_SATP_SYMM" -V "EMC" -M "SYMMETRIX" -P "VMW_PSP_RR" -O "iops=1" -e "EMC Symmetrix (custom rule)"

## List all SATP rules (filtered)
esxcli storage nmp satp rule list -s VMW_SATP_SYMM

## Results
Name           Device  Vendor  Model      Driver  Transport  Options  Rule Group  Claim Options  Default PSP  PSP Options  Description
-------------  ------  ------  ---------  ------  ---------  -------  ----------  -------------  -----------  -----------  ---------------------------
VMW_SATP_SYMM          EMC     SYMMETRIX                              user                       VMW_PSP_RR   iops=1       EMC Symmetrix (custom rule)
VMW_SATP_SYMM          EMC     SYMMETRIX                              system                                               EMC Symmetrix

## END ##

For the PowerShell method, make sure you check the ESXCLI method signature first. Refer to the vSphere documentation for the ESXCLI options, and to Robert van den Nieuwendijk’s blog for the Get-EsxCli syntax;

Using ESXCLI via PowerShell;

## START ##

## Required syntax for ESXCLI (see note above)
$esxcli.storage.nmp.satp.rule.add(boolean boot, string claimoption, string description, string device, string driver, boolean force, string model, string option, string psp, string pspoption, string satp, string transport, string type, string vendor)

## This syntax translates to this (my example)
$esxcli.storage.nmp.satp.rule.add($null,$null,"EMC Symmetrix (custom rule)",$null,$null,$null,"SYMMETRIX",$null,"VMW_PSP_RR","iops=1","VMW_SATP_SYMM",$null,$null,"EMC")

## List SATP claimrules (filtered)
$esxcli.storage.nmp.satp.rule.list() | where {$_.Description -like "*Symmetrix*"} | Format-Table -AutoSize

## Results
ClaimOptions DefaultPSP Description                 Device Driver Model     Name          Options PSPOptions RuleGroup
------------ ---------- -----------                 ------ ------ -----     ----          ------- ---------- ---------
             VMW_PSP_RR EMC Symmetrix (custom rule)               SYMMETRIX VMW_SATP_SYMM         iops=1     user
                        EMC Symmetrix                             SYMMETRIX VMW_SATP_SYMM                    system   

## END ##

Now that you have the correct syntax, use PowerShell to apply this change to all your hosts. Just remember to update the scope to reflect your own environment.

Adding SATP claimrule to multiple hosts;

## START ##

Clear-Host

## Connect to vCenter
Connect-VIServer -Server 'myvcenter.fqdn'
Write-host ""

## Get list of hosts that you want to create the SATP claimrule on
$scope = Get-Datacenter 'MY-DATACENTER' | Get-Cluster * | Get-VMHost * | Sort-Object Name

## Action for each of the hosts in scope
Foreach ($esx in $scope){

  Write-Host $esx -ForegroundColor Yellow

  ## Exposes the ESXCLI functionality
  $esxcli = Get-EsxCli -VMHost $esx

  ## Create user defined SATP rule for EMC/Symmetrix (Vendor/Model)
  $esxcli.storage.nmp.satp.rule.add($null,$null,"EMC Symmetrix (custom rule)",$null,$null,$null,"SYMMETRIX",$null,"VMW_PSP_RR","iops=1","VMW_SATP_SYMM",$null,$null,"EMC")

  ## List SATP rule with "Symmetrix" in the description
  $esxcli.storage.nmp.satp.rule.list() | where {$_.Description -like "*Symmetrix*"} | Format-Table -AutoSize

}

## Disconnect from vCenter
Disconnect-VIServer -Server * -Force -Confirm:$false 

## END ##

If you just want to list the rules across all your hosts, comment out the $esxcli.storage.nmp.satp.rule.add line (which creates the claimrule) and execute the script again. The formatting is not the best in this instance, but it’s a quick way of validating what you’ve just done.
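If you ever need to back the change out, there is a matching esxcli remove command which, as far as I’m aware, takes the same options as the add; a sketch;

## START ##

## Remove the custom SATP claimrule (the options must match those used when the rule was added)
esxcli storage nmp satp rule remove -s "VMW_SATP_SYMM" -V "EMC" -M "SYMMETRIX" -P "VMW_PSP_RR" -O "iops=1" -e "EMC Symmetrix (custom rule)"

## END ##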


EMC Symmetrix VMAX 40K testing on vSphere 5.0

We are in the process of migrating from HDS to EMC storage, and I have been testing our Symmetrix VMAX 40K on vSphere 5.0. This has been an interesting journey and has highlighted that, although the concepts are similar (i.e. block storage with FC connectivity), storage arrays differ and need careful implementation if you want to get the best performance from your infrastructure.

This post will cover my testing with this specific storage array and hopefully prompt some feedback on other implementations. Perhaps it will help identify any obvious areas that I have missed and need to address? Either way, some feedback would be awesome.

In terms of storage presentation to the hosts (HP DL380p Gen8 servers), I used 2x single-port QLE2560 HBAs, each connected at 4Gb to Brocade FC switches, with 2x paths to each LUN (4 in total). The LUNs were configured as striped METAs, each 2TB in size.

For my performance tests, I ran a series of iometer access specifications and graphed the results in Excel for an easy side by side comparison.

Iometer Access Specifications;

Access Specification       Transfer Size  Read / Write  Random / Sequential  Aligned on
-------------------------  -------------  ------------  -------------------  ----------
Max Throughput-100%Read    32K            100% / 0%     0% / 100%            32K
RealLife-60%Rand-65%Read   8K             65% / 35%     60% / 40%            8K
Max Throughput-50%Read     32K            50% / 50%     0% / 100%            32K
Random-8k-70%Read          8K             70% / 30%     100% / 0%            8K

All tests were run in 30-second intervals, increasing the number of outstanding IO requests using exponential stepping to the power of 2 (i.e. 1, 2, 4, 8 ... 256, 512), up to a maximum queue depth of 512 outstanding IOs.

The workload was initially placed on a single ESXi host with a single worker thread (to get a baseline), and then scaled out to multiple ESXi hosts with multiple worker threads. The guest VMs were not optimized in any way and had a single LSI Logic SAS controller (Windows 2008 R2 Standard Edition).

For my first baseline, I used standard NMP with all the defaults;

# START #

~ # esxcli storage nmp device list -d naa.60000970000295700663533030383446
naa.60000970000295700663533030383446
   Device Display Name: EMC Fibre Channel Disk (naa.60000970000295700663533030383446)
   Storage Array Type: VMW_SATP_SYMM
   Storage Array Type Device Config: SATP VMW_SATP_SYMM does not support device configuration.
   Path Selection Policy: VMW_PSP_RR
   Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0;lastPathIndex=3: NumIOsPending=0,numBytesPending=0}
   Path Selection Policy Device Custom Config:
   Working Paths: vmhba3:C0:T3:L1, vmhba3:C0:T4:L1, vmhba2:C0:T4:L1, vmhba2:C0:T3:L1

~ # esxcli storage nmp psp roundrobin deviceconfig get -d naa.60000970000295700663533030383446
   Byte Limit: 10485760
   Device: naa.60000970000295700663533030383446
   IOOperation Limit: 1000
   Limit Type: Default
   Use Active Unoptimized Paths: false

# END #

Here are the results;

NMP Results (policy=rr,iops=1000);

[Chart: EMC Symmetrix VMAX 40K, NMP round robin iops=1000 results]

For my next test, I changed the IO operations limit from the default 1000 to 1, as recommended in the EMC document (see pages 82-83);

Using EMC Symmetrix Storage in VMware vSphere Environments

# START #

~ # esxcli storage nmp satp rule add -s "VMW_SATP_SYMM" -V "EMC" -M "SYMMETRIX" -P "VMW_PSP_RR" -O "iops=1"

~ # esxcli storage nmp satp rule list -s VMW_SATP_SYMM
Name           Device  Vendor  Model      Driver  Transport  Options  Rule Group  Claim Options  Default PSP  PSP Options  Description
-------------  ------  ------  ---------  ------  ---------  -------  ----------  -------------  -----------  -----------  -------------
VMW_SATP_SYMM          EMC     SYMMETRIX                              user                       VMW_PSP_RR   iops=1
VMW_SATP_SYMM          EMC     SYMMETRIX                              system                                               EMC Symmetrix

# END #

I rebooted my hosts at this point, and confirmed that the device had been claimed correctly;

# START #

~ # esxcli storage nmp device list 
naa.60000970000295700663533030383446
   Device Display Name: EMC Fibre Channel Disk (naa.60000970000295700663533030383446)
   Storage Array Type: VMW_SATP_SYMM
   Storage Array Type Device Config: SATP VMW_SATP_SYMM does not support device configuration.
   Path Selection Policy: VMW_PSP_RR
   Path Selection Policy Device Config: {policy=rr,iops=1,bytes=10485760,useANO=0;lastPathIndex=1: NumIOsPending=0,numBytesPending=0}
   Path Selection Policy Device Custom Config:
   Working Paths: vmhba2:C0:T3:L1, vmhba3:C0:T1:L1, vmhba2:C0:T1:L1, vmhba3:C0:T3:L1

# END #
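Note that a new claimrule is only applied as devices are claimed, which is why I rebooted the hosts. If a reboot is not practical, the same setting can be applied directly to a device that has already been claimed; a sketch for a single device;

# START #

## Apply the round robin iops=1 setting to an already-claimed device (no reboot required)
esxcli storage nmp psp roundrobin deviceconfig set -d naa.60000970000295700663533030383446 --iops=1 --type=iops

# END #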

Here are the new performance results using the {policy=rr,iops=1} device configuration.

NMP Results (policy=rr,iops=1);

[Chart: EMC Symmetrix VMAX 40K, NMP round robin iops=1 results]

WOW, that single change increased performance by a factor of 4 using a single worker thread! Throughput increased from just over 300 MB/s to 800 MB/s, operations increased from 10,000 IOPS to 40,000 IOPS, and average latency dropped from 70ms to 25ms at a maximum queue depth of 512.

The results were similar in the scale-out tests, with the same observation that guest CPU utilisation increased with the additional workload it was able to process.

I then implemented EMC PowerPath/VE to see how it compared to NMP in identical iometer tests. My assumption was that PowerPath/VE would far outperform NMP and that the cost would be easily justified by the performance gains.

Interestingly, the results were similar to NMP with the IO operations limit set to 1, which has made it a hard sell to management. I understand the benefits it offers over NMP, but perhaps these will only become apparent when we ramp up the workload and need the extra intelligence behind the multipathing.

PowerPath/VE Results;

[Chart: EMC Symmetrix VMAX 40K, PowerPath/VE results]

These are obviously very simple tests, but it’s incredible how much performance can change simply by reading vendor recommendations and testing them in your own environment.


Update NTP configuration on multiple ESXi 5.0 hosts

We recently upgraded our NTP infrastructure and I had to reconfigure 46x ESXi 5.0 hosts to reflect the change. I’m not keen on doing these kinds of operations manually, so I wrote this script to automate the process.

I’m sure there are more elegant scripts available, but this one works perfectly well. Just change the vCenter, NTP server, and host scope values to reflect your own environment;

 

PowerShell Script;
## START ##
Clear-Host

## Connect to vCenter
$vcenter = 'myvcenter.mydomain.fqdn'
Write-Host "Connecting to $vcenter" -ForegroundColor Green
Connect-VIServer $vcenter | Out-Null

## OLD NTP servers
$ntp1_old = 'old.ntp.server1'
$ntp2_old = 'old.ntp.server2'
$ntp3_old = 'old.ntp.server3'
$ntp4_old = 'old.ntp.server4'

## NEW NTP servers
$ntp1_new = 'new.ntp.server1'
$ntp2_new = 'new.ntp.server2'
$ntp3_new = 'new.ntp.server3'

## Hosts to Configure
$vmHosts =  Get-Datacenter 'MY-DC' | Get-Cluster 'MY-CLUSTER' | Get-VMHost *

## Action for Each Host
ForEach ($vmHost in $vmHosts) {

Write-Host $vmHost -BackgroundColor Red

## Remove existing NTP servers
Write-Host " - Removing existing NTP servers" -ForegroundColor Yellow
Remove-VmHostNtpServer -NtpServer $ntp1_old -VMHost $vmHost -Confirm:$false | Out-Null
Remove-VmHostNtpServer -NtpServer $ntp2_old -VMHost $vmHost -Confirm:$false | Out-Null
Remove-VmHostNtpServer -NtpServer $ntp3_old -VMHost $vmHost -Confirm:$false | Out-Null
Remove-VmHostNtpServer -NtpServer $ntp4_old -VMHost $vmHost -Confirm:$false | Out-Null

## Add new NTP servers
Write-Host " - Adding New NTP servers" -ForegroundColor Green
Add-VmHostNtpServer -NtpServer $ntp1_new -VMHost $vmHost | Out-Null
Add-VmHostNtpServer -NtpServer $ntp2_new -VMHost $vmHost | Out-Null
Add-VmHostNtpServer -NtpServer $ntp3_new -VMHost $vmHost | Out-Null

## Restart NTP Service
Write-Host " - Restarting the NTP service" -ForegroundColor Yellow
Get-VmHostService -VMHost $vmHost | Where-Object {$_.key -eq "ntpd"} | Restart-VMHostService -Confirm:$false | Out-Null 

## Update Complete
Write-Host " - NTP Servers updated" -ForegroundColor Green
Write-Host ""
}

## Disconnect from vCenter
Disconnect-VIServer -Server * -Force -Confirm:$false -ErrorAction SilentlyContinue | Out-Null

## END ##
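Once the script has run, a quick way to verify the result is to list the NTP servers now configured on each host;

## START ##

## Verify the NTP configuration across all hosts in scope
Get-Datacenter 'MY-DC' | Get-Cluster 'MY-CLUSTER' | Get-VMHost * |
  Sort-Object Name |
  Select-Object Name, @{N='NTPServers';E={$_ | Get-VMHostNtpServer}} |
  Format-Table -AutoSize

## END ##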


RVTools Version 3.6, FREE download

RVTools has been publicly available since 2008, and the latest version is now available for FREE download – thanks Rob de Veij, awesome work!

Version 3.6 (February, 2014)
  • New tabpage with cluster information
  • New tabpage with multipath information
  • On vInfo tabpage new fields HA Isolation response and HA restart priority
  • On vInfo tabpage new fields Cluster affinity rule information
  • On vInfo tabpage new fields connection state and suspend time
  • On vInfo tabpage new field The vSphere HA protection state for a virtual machine (DAS Protection)
  • On vInfo tabpage new field guest state.
  • On vCPU tabpage new fields Hot Add and Hot Remove information
  • On vCPU tabpage cpu/socket/cores information adapted
  • On vHost tabpage new fields VMotion support and storage VMotion support
  • On vMemory tabpage new field Hot Add
  • On vNetwork tabpage new field VM folder.
  • On vSC_VMK tabpage new field MTU
  • RVToolsSendMail: you can now also set the mail subject
  • Fixed a datastore bug for ESX version 3.5
  • Fixed a vmFolder bug when started from the commandline
  • Improved documentation for the commandline options


PSOD : LINT1 motherboard interrupt

I had a Dell R815 host crash yesterday, with the following PSOD error message;

The system has found a problem on your machine and cannot continue.
LINT1 motherboard interrupt. This is a hardware problem; please contact your hardware vendor.

[Screenshot: PSOD showing the LINT1 motherboard interrupt error]

When I checked the system logs on the iDRAC, I could see a bus fatal error logged;

System Event Logs

Severity  Time      Description
--------  --------  --------------------------------------------------------------------------
Critical  18:24:36  The watchdog timer expired.
Normal    18:16:37  An OEM diagnostic event has occurred.
Critical  18:16:36  A bus fatal error was detected on a component at bus 4 device 4 function 0.

I ran the integrated hardware diagnostics from System Services at boot (F10), which confirmed these errors, but only because it read the system event logs. I find this really annoying, because had I cleared the event logs prior to running the hardware diagnostics, no errors would have been reported, and now I’m not sure whether the hardware is faulty or not. Here are the reported errors;

[Screenshot: watchdog sensor error reported by the hardware diagnostics]

[Screenshot: PCIe fatal error reported by the hardware diagnostics]

Either way, I can’t put the host back into production without further analysis, and I need to find out which hardware component is located at bus 4 device 4 function 0 so that I can log a support call with Dell. It turns out this is really easy using the lspci command, which returns detailed information on all PCI devices.

lspci prints each device in the [domain]:[bus]:[device].[function] format, so it’s easy to grep for the specific component without wading through all the other PCI devices. Here is what mine returned;

lspci

~ # lspci | grep '000:004:04.0'
000:004:04.0 Bridge: PLX Technology, Inc. PEX 8624 24-lane, 6-Port PCI Express Gen 2 (5.0 GT/s) Switch [ExpressLane]

~ # lspci --help
lspci   -p --pciinfo   Prints detailed info on all PCI devices
        -n --nolookup  Don't look up PCI device names and info
        -d --dump      Print hex dump of the full config space
        -v --verbose   Verbose information

So now I know there was a problem with the PCI bridge, and I can log this with Dell in the hope that they simply replace the component under warranty.
