When “Not” to Kill the Cluster Services In Hyper-V – Update
In a recent article I wrote for SearchServerVirtualization, I described times when it is necessary to kill the cluster service on your Hyper-V cluster. This was a last resort for when no other way was available to manipulate the cluster, i.e. Failover Cluster Manager, cluster.exe, SCVMM, etc. In the past, this has most commonly happened sporadically after DPM has performed many Hyper-V VSS writer backups. That is a different issue and I won’t go into it here.

Recently I had the pleasure of a lockup on one of my Hyper-V nodes, which left me no way to manipulate the cluster resources. Cluster.exe would show me the resources, but commands to move a VM resource over to another node just hung or left the resource in a pending state on the affected node. So after trying every option I could come up with, and dodging calls from angry end users trying to do their jobs, I decided it was time to kill the cluster service. All went as expected. The hung VMs on the affected node moved over to the remaining healthy nodes of the cluster and started up. The problem node was still up and running, but with the cluster service stopped.
This is the solution I usually recommend for this circumstance, but after a recent experience with this exact situation, I am reconsidering. My recommended solution now is to hard power the server off. You could try to shut down the node, but in my experience the clussvc.exe process never lets go and blocks the shutdown.

The reason I now recommend the even bigger hammer of hard powering down the problem node is for the good of the VMs. How can this be good? Let me say, it takes some nuggets to power down a node that may have 10-30 VMs living on it, but if there were an easier way to get them functioning again, believe me, I would have tried it before getting to this conclusion.

One strange thing that happens when you just kill the cluster service is that the problem host never flushes the network layer. Essentially, this means the VMs move to another node and start back up, but they come up with a message that a duplicate IP address exists on the network, or they acquire an APIPA address. Even when you shut these VMs down, their addresses still ping, because the network associated with the VMs is still responding on the problem host. Rebooting the VMs will not help. If you do just kill the cluster service, the only thing that will help is to reboot the problem host and then reboot the VMs that were originally on the problem host and have now moved to other nodes of the cluster. This can be a little hard to track down if you have many VMs on a particular cluster, since they scatter to the remaining healthy nodes.

So in order to allow the VMs from the problem host to come up as cleanly as possible on a healthy node with full network capabilities, I now recommend hard powering down the problem host. It may seem a bit harsh, but it can save you a lot of time. Do it for the good of your VMs.
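If you did only kill the cluster service and now have to hunt down the scattered VMs, a small script can help with the bookkeeping. The following is a minimal Python sketch, not a tested tool: it assumes you already have a record of which node owned each VM before the lockup and where each VM is running now (for example, exported by hand from your management tools). The VM names, node names, and IP addresses are all hypothetical, and the helper names `vms_to_reboot` and `has_apipa` are made up for illustration. It lists the VMs that originally lived on the problem host and flags any that report an APIPA (169.254.x.x) address.

```python
import ipaddress


def vms_to_reboot(owners_before, owners_after, problem_host):
    """Return {vm: current_node} for every VM that was on problem_host before the failure."""
    return {
        vm: owners_after.get(vm, "unknown")
        for vm, node in sorted(owners_before.items())
        if node == problem_host
    }


def has_apipa(address):
    """True if the address is link-local; for IPv4 that is 169.254.0.0/16 (APIPA)."""
    return ipaddress.ip_address(address).is_link_local


# Hypothetical inventory: VM -> owning node, before and after killing the cluster service.
owners_before = {"vm-web01": "hv-node1", "vm-sql01": "hv-node1", "vm-app01": "hv-node2"}
owners_after = {"vm-web01": "hv-node2", "vm-sql01": "hv-node3", "vm-app01": "hv-node2"}

# Hypothetical addresses the moved VMs currently report.
reported_ips = {"vm-web01": "169.254.12.7", "vm-sql01": "10.0.0.21"}

for vm, node in vms_to_reboot(owners_before, owners_after, "hv-node1").items():
    ip = reported_ips.get(vm)
    if ip and has_apipa(ip):
        note = "has an APIPA address"
    else:
        note = "check for a duplicate IP warning"
    print(f"{vm}: now on {node}; {note}")
```

The list only tells you which VMs to reboot once the problem host itself has been rebooted; it does nothing to clear the stale network state on that host.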
Questions? Easier way? Did I miss something? Pass it along and I will post it up.