This post is a warning about an issue you may see in an ESXi 6.5 based linked-clone View environment. I had built a ten-host ESXi 6.5 cluster for our VDI linked-clone VMs in order to move away from a dying blade chassis.
When I was building up the View pools, I was testing vMotion and maintenance mode operations in the cluster. Everything vMotion’d fine and the host entered maintenance mode fine. Everything looked great until someone logged into View and used a desktop. View (as it should) attempted to power on another VM in the pool and I saw this lovely error:
File system specific implementation of LookupAndOpen[file] failed. An Error was received from the ESX host while powering on VM. Failed to start the VM. Module ‘Disk’ power on failed. Cannot open the disk ‘vmfs/volumes/…vm_5-checkpoint.vmdk’ or one of the snapshot disks it depends on. Failed to lock file.
Lovely, ain’t it? I proceeded to investigate the locks by running the nifty vmfsfilelockinfo tool against the files from one host. I traced the locks to another host, and then traced a different lock to yet another host. I could free up some of the locks by rebooting a host or two, but this was simply not a scalable solution, so I opened a case with VMware to remedy this.
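Chasing locks file by file gets tedious once several hosts are involved. Here is a minimal Python sketch of how the bookkeeping could be done, assuming you have saved the text output of vmfsfilelockinfo for each file and that the output includes the lock owner's MAC address. The function names, the extra file names, and the sample text are all hypothetical and only illustrate the idea; this is not the tool's exact output.

```python
import re
from collections import defaultdict

# Matches a colon-separated MAC address such as 00:50:56:ab:cd:ef.
MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")


def lock_owners(captured):
    """Map each file label to the MAC addresses found in its captured output.

    Labels whose text contains no MAC address are left out of the result.
    """
    owners = defaultdict(set)
    for label, text in captured.items():
        for mac in MAC_RE.findall(text):
            owners[label].add(mac.lower())
    return dict(owners)


def files_by_owner(owners):
    """Invert the mapping: MAC address -> sorted list of files it appears to lock."""
    inverted = defaultdict(list)
    for label, macs in owners.items():
        for mac in macs:
            inverted[mac].append(label)
    return {mac: sorted(labels) for mac, labels in inverted.items()}


# Hypothetical captured text; the wording is illustrative, not real tool output.
SAMPLE = {
    "vm_5-checkpoint.vmdk": "locked by host having mac address ['00:50:56:AA:BB:01']",
    "vm_7-checkpoint.vmdk": "locked by host having mac address ['00:50:56:aa:bb:02']",
    "vm_9-checkpoint.vmdk": "locked by host having mac address ['00:50:56:aa:bb:01']",
}

if __name__ == "__main__":
    for mac, files in sorted(files_by_owner(lock_owners(SAMPLE)).items()):
        print(mac, files)
```

Grouping by MAC address makes it easier to see whether the locks pile up on one host or are scattered across the cluster, which is what I was seeing.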
The first proposed solution from VMware was to follow KB 2146451 and disable ATS just for the VMFS heartbeat (bottom of the article). We implemented this on all of our hosts, to no avail. The failure was still reproducible: when a powered-off VM was moved to another host by maintenance operations and vCenter then attempted to power it on, the operation failed with the error above.
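For anyone who wants to try the same change across several hosts, here is a rough Python sketch that only builds the esxcli command lines and runs nothing. The advanced option name is the one I recall from VMware's ATS heartbeat guidance, so treat it as an assumption and confirm it against KB 2146451 before use. The host names are hypothetical.

```python
# Advanced option for ATS-based VMFS5 heartbeats. Assumption: recalled from
# VMware's KB guidance; verify the exact name in KB 2146451 before use.
ATS_HB_OPTION = "/VMFS3/UseATSForHBOnVMFS5"


def ats_heartbeat_set_cmd(enabled):
    """Return the esxcli argument list that turns the option on (1) or off (0)."""
    value = "1" if enabled else "0"
    return ["esxcli", "system", "settings", "advanced", "set",
            "-i", value, "-o", ATS_HB_OPTION]


def ats_heartbeat_list_cmd():
    """Return the esxcli argument list that shows the option's current value."""
    return ["esxcli", "system", "settings", "advanced", "list",
            "-o", ATS_HB_OPTION]


def plan(hosts):
    """Pair each host with the command lines to run on it: set, then verify."""
    commands = [" ".join(ats_heartbeat_set_cmd(False)),
                " ".join(ats_heartbeat_list_cmd())]
    return {host: list(commands) for host in hosts}


if __name__ == "__main__":
    # Hypothetical host names; substitute your own.
    for host, commands in plan(["esx01", "esx02"]).items():
        for command in commands:
            print(f"{host}: {command}")
```

Printing the plan first, rather than executing it, leaves room to review exactly what would be run on each host.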
In a Reddit conversation, another user experiencing this exact issue stated that his support contact recommended rolling back to ESXi 6.0. I also have a second cluster running 6.0, and I have never experienced this error there.
So I spent a Saturday rolling back my hosts to 6.0. The cluster is completely rebuilt on 6.0u3 with all 800+ seats available. I’ve placed all my hosts into maintenance mode, powered off hosts with a test pool on them, and vMotion’d VMs all around the cluster, and I have not had a single error yet.
Hopefully 6.5u1 will have some better results.