Tuesday, May 10, 2016

Taxonomy Of A VMWare Outage After A Power Failure

A few VMs hosted on the two blade servers indicated they were inaccessible. Some were simply orphaned which is sometimes a result of a power loss event.  Others listed the GUID of the path (rather than the label for the storage).  Orphaned VMs must be removed from inventory and then added to inventory from the datastore browser.  VMs that show the GUID path can often be recovered simply by refreshing the storage after all services are restored.  Neither of these were successful.

The hosts have been configured with local storage and NetApp storage for some time.  VMs have been running from both datastore sources without issue.

After the power failure, we experienced problems with VMs on NetApp storage.

-          VMs running from local storage experienced no issues and could be started normally.
-          VMs running on the NetApp could not be started.  Orphaned VMs could not be added to inventory.  We were not able to modify or copy the files on the NetApp.   We could not unmount the NetApp.  Neither SSH or Desktop clients actions were successful.  Later in the morning, we identified another blade experiencing the same condition.  This indicated it was not isolated to just these two blades.

We verified the NetApp was in a normal state by accessing the files  as a CIFS share. 

We identified changes since the last power failure (a week ago).  1. We added a vSwitch and network interface to the Storage network.  2. We attached a Nimble volume to the host.  We decided to back out these two changes.

We set the Nimble Volume to Offline in Nimble.  Surprisingly, we were then able to unmount the NetApp.  This indicated a strange link to the datastores. VMWare indicated the datastore was in use. Several iterations of removing and adding did not solve the VM issues on NetApp.

We unmounted NetApp then removed the new Storage vSwitch followed by mounting the NetApp again.  VM operation on the NetApp were normal again.

We unmounted NetApp, added the Storage vSwitch then mounted NetApp again.  VM operations failed on the NetApp once again.

A short time later we realized the NetApp export access permissions permitted the (public) IP address of the original vSwitch of the blades.  Adding the vSwitch on the Storage network established a connection from the new IP on the local storage subnet.

The solution was to change the NetApp export permissions to allow root access from the new Storage vSwitch IP address.  Resolution time was 7 hours.


Last week we added another blade to the system.  In the course of adding the additional server, we added a Nimble storage volumes to the existing two servers.  We add a new Storage vSwitch connected to the IO Aggregator which has a 10Gb connection directly attached to the storage network.  After the power failure, the NetApp was remounted over the new Storage vSwitch as the destination was on the locally attached network.  This put the datastore in a read only state.  Technically, the NFS export is mounted as non-root which does not have write privileges.  Additional blades have this same configuration and experienced the same issue though the users hadn’t realized it yet.

No comments: