Azure's default load balancing mechanism presents challenges for virtual machines running SharePoint. When load balancing is first set up, it uses a simple TCP check on port 80 to see whether the virtual machine responds on that port. If the machine fails to respond to two successive checks (which happen every 15 seconds), it is removed from the load balancer.
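To make the default behaviour concrete, here is a minimal Python sketch of that kind of check. It is not Azure's implementation: the function names `tcp_check` and `in_rotation` are made up for illustration, and the only values taken from the description above are the 15 second interval and the two-failure threshold.

```python
import socket

PROBE_INTERVAL_SECONDS = 15   # interval quoted in the text above
FAILURES_BEFORE_REMOVAL = 2   # successive failures quoted in the text above


def tcp_check(host, port=80, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    This only proves that something is listening on the port; it says
    nothing about whether the web application behind it is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def in_rotation(results, threshold=FAILURES_BEFORE_REMOVAL):
    """Replay a sequence of probe results (True = responded) and report,
    after each check, whether the machine would still be in rotation."""
    failures = 0
    states = []
    for ok in results:
        # Any successful check resets the count of successive failures.
        failures = 0 if ok else failures + 1
        states.append(failures < threshold)
    return states
```

For example, `in_rotation([True, False, False, True])` returns `[True, True, False, True]`: one missed check is tolerated, the second removes the machine, and the next successful check puts it back.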
The problem with this configuration is that port 80 will respond pretty much all the time, even if your application pool is stopped and users are receiving a “503 – Service Unavailable” error. For this reason the Azure platform includes the ability to add HTTP-level load balancing probes, which check for a 200 response instead. This ensures the web server is actually responding with content, and you can even direct the probe at a particular page, e.g. health.aspx.
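The difference between the two kinds of check is easy to see in code. The sketch below is a stand-in for an HTTP-level probe, not the Azure probe itself: `http_probe` is a hypothetical helper, and the default path `/health.aspx` is just the example page mentioned above. It treats anything other than a 200 (including a 503 from a stopped application pool) as a failure.

```python
import http.client


def http_probe(host, path="/health.aspx", port=80, timeout=5.0):
    """Return True only if a GET for `path` comes back with HTTP 200.

    A 503 from a stopped application pool, a refused connection, or a
    timeout all count as a failed probe.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()
```

Note that the request is addressed to the machine itself rather than to a SharePoint host name, so it is answered by whichever IIS site catches requests on that port without a matching host header.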
This is fine for virtual machines that run web services under the default web site, with a port 80 binding. However, SharePoint typically has a number of applications under IIS, all with different bindings, and SharePoint uses host headers to distinguish one from another. These map back to alternate access mappings, so when a request is received through an IIS application, SharePoint responds with the correct content, served in the context of a particular security zone. Each SharePoint application will also have its own application pool, generally running as a distinct user account. The upshot is that even the HTTP load balancing probe, which is answered by the default web site and its application pool, won’t know if a SharePoint application is having issues.
There is a way to bring all this together, though. By changing the default web site to run in the same application pool as the SharePoint site(s) you wish to load balance, you can have the load balancer respond to issues with that application pool instead. The limitation is that you can only monitor one application pool. That is not a problem if all your SharePoint sites run in the same pool, but this won’t necessarily be the case. I suppose you could get around this by adding further “dummy” load balanced ports/probes, each with a corresponding non-SharePoint IIS application that responds on that port.
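The reasoning can be shown with a toy model. This is purely illustrative Python, not IIS configuration: the `Site` class, `probe_status` function, and the pool name `SharePointPool` are all invented for the example, and the model simply assumes that a site whose application pool is stopped answers with a 503.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    app_pool: str
    running: bool = True


def probe_status(site, pool_started):
    """HTTP status the probe would see from `site` in this simplified model."""
    if not site.running:
        return None   # stopped site: the probe gets no 200 (modelled as None)
    if not pool_started.get(site.app_pool, False):
        return 503    # site is up but its application pool is not
    return 200


# The SharePoint pool has stopped; the default pool is still fine.
pools = {"DefaultAppPool": True, "SharePointPool": False}

before = Site("Default Web Site", "DefaultAppPool")   # out-of-the-box setup
after = Site("Default Web Site", "SharePointPool")    # after switching pools

print(probe_status(before, pools))  # 200: the probe misses the outage
print(probe_status(after, pools))   # 503: the probe now sees the outage
```

With the default web site left in its own pool, the probe keeps reporting a healthy server while SharePoint users get errors. Once the site shares the SharePoint pool, the probe fails along with it.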
A useful side effect of this is that stopping the default web site on a server removes that server from the load balancing, which is handy when you need to perform maintenance, for example.
One thing to look out for: ensure your SharePoint application pool identity has access to the default web site’s content directory. And as always, an IISReset seems to sort everything out when switching app pools on an active server.
The end result is that if/when the SharePoint application pool suffers an outage, or is recycling, the load balancing probe will pick this up and stop sending requests to that server. As soon as the application pool has recovered, the next load balancing check will add the server back. This includes the daily application pool recycles early in the morning, so it should be possible to get very close to 100% uptime using this solution, allowing for the short window before failed probes take a server out of rotation.