This post is a warning about an issue you may see in an ESXi 6.5-based linked-clone View environment. I had built a ten-host ESXi 6.5 cluster for our VDI linked-clone VMs in order to move away from a dying blade chassis.
While building up the View pools, I tested vMotion and maintenance mode operations in the cluster. Everything vMotion'd fine and hosts entered maintenance mode without issue. It all looked great until someone logged into View and used a desktop. View (as it should) attempted to power on another VM in the pool, and I saw this lovely error:
File system specific implementation of LookupAndOpen[file] failed. An Error was received from the ESX host while powering on VM. Failed to start the VM. Module ‘Disk’ power on failed. Cannot open the disk ‘vmfs/volumes/…vm_5-checkpoint.vmdk’ or one of the snapshot disks it depends on. Failed to lock file.
Lovely, ain't it? I investigated the locks using the nifty vmfsfilelockinfo tool, running it against the files from one host, tracing one lock to a second host, and then tracing a different lock to yet another host. I could free up some of the locks by rebooting a host or two, but that was simply not a scalable solution, so I opened a case with VMware.
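For anyone who hasn't used it, the lock tracing looks roughly like this from an SSH session on the ESXi host. The datastore, VM folder, and vCenter credentials below are placeholders for illustration; the tool reports the MAC address of the host holding the lock, and passing vCenter credentials lets it resolve that to a host name.

```shell
# Identify which host holds the lock on a delta/checkpoint disk.
# Paths and credentials here are placeholders -- substitute your own.
vmfsfilelockinfo -p /vmfs/volumes/datastore1/vm_5/vm_5-checkpoint.vmdk \
    -v vcenter.example.com -u administrator@vsphere.local
```

Run it once per locked file; in my case different files traced back to different hosts, which is what made rebooting a single host an incomplete fix.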
The first proposed solution from VMware was to follow KB 2146451 and disable ATS just for the VMFS heartbeat (bottom of the article). We implemented this on all of our hosts, to no avail. When vCenter tried to power on a VM that had originally been powered off, then moved to another host during maintenance operations, and then powered back on, the operation failed with the error above.
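For reference, the workaround from KB 2146451 is a per-host advanced setting change along these lines:

```shell
# Disable ATS for the VMFS heartbeat only (VMware KB 2146451).
# Run on each ESXi host; setting -i 1 reverts to the default.
esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5

# Verify the current value of the option:
esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5
```

Note this only changes how the heartbeat region is maintained; regular ATS locking for other operations is unaffected, which may be why it didn't help here.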
While in a Reddit conversation with another user experiencing this exact issue, he said his support contact had recommended rolling back to ESXi 6.0. I also have a second cluster on 6.0, and I have never experienced this error there.
So I spent a Saturday rolling back my hosts to 6.0, completely rebuilding them on 6.0u3 with all 800+ seats available. Since then I've placed all my hosts into maintenance mode, powered off hosts with a test pool on them, and vMotion'd VMs all around the cluster, and I haven't had a single error yet.
Hopefully 6.5u1 will have some better results.
To solve this issue, just disable the storage accelerator. https://pubs.vmware.com/view-52/index.jsp?topic=%2Fcom.vmware.view.administration.doc%2FGUID-77B22AC9-EF9F-4161-9856-88DADEE095DD.html
This issue exists with or without the storage accelerator. We do not use the storage accelerator in our environment because our images are too large.
I got the same problem with Horizon 7.4 and ESXi 6.5 U1e (build 7526125). Rebooting the host solves the problem temporarily.
It seems to be an ESXi issue rather than a View issue. I'm still on 6.0, which doesn't seem to be affected.
What type of storage are you using? I’ve heard this bug doesn’t always affect NFS storage. I’ve heard of it in vSAN, Fibre Channel, and iSCSI.
Any further developments on this front? I am having this issue as well, using NFS storage, running 6.5.0 build 5969303. It's really getting old having to manually delete locked linked clones.
I actually haven't heard of this issue occurring with NFS storage. Interesting to hear, though.
To my knowledge this hasn't even been acknowledged as an issue in any of the 6.5 release notes. I have since stayed on 6.0u3, which has been very stable for me. Spending a weekend rolling back my hosts wasn't part of the plan, but it was a necessity for us to get our Horizon environment stable again.