VMware Auto Deploy is a great way to manage a stateless ESXi environment. For more information on Auto Deploy, read the VMware docs.
Whilst Auto Deploy is a great idea, as the number of hosts in your infrastructure increases your Auto Deploy server may become a bottleneck that prevents your hosts from powering on in a reasonable timeframe. As the VMware docs state:
“Simultaneously booting large numbers of hosts places a significant load on the Auto Deploy server. Because Auto Deploy is a web server at its core, you can use existing web server scaling technologies to help distribute the load. For example, one or more caching reverse proxies can be used with Auto Deploy to serve up the static files that make up the majority of an ESXi boot image. Configure the reverse proxy to cache static content and pass requests through to the Auto Deploy server.”
“After a massive power outage, VMware recommends that you bring up the hosts on a per-cluster basis. If you bring up multiple clusters simultaneously, the Auto Deploy server might experience CPU bottlenecks. All hosts come up after a potential delay. The bottleneck is less severe if you set up the reverse proxy.”
As I couldn't find any docs detailing how this could be achieved, I thought I would write something that may help others who need to scale out.
The basic premise is this:
Configure multiple tftpboot servers with custom tramp files pointing to different reverse caching proxies and then balance the requests to these hosts.
To do this I built multiple CentOS 6 hosts, configured each as a tftp server and a Squid server, and then used round robin DNS to balance requests between these hosts.
Install and configure a tftp server.
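On CentOS 6 the tftp daemon runs under xinetd, so the install looks something like this (stock CentOS 6 package and service names):

```shell
# tftp-server on CentOS 6 is managed by xinetd
yum install -y tftp-server xinetd

# enable the service by flipping "disable = yes" to "no" in the xinetd config
sed -i 's/disable\s*= yes/disable = no/' /etc/xinetd.d/tftp

service xinetd restart
chkconfig xinetd on
```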
Copy the files from your current tftp server to your tftp root, usually /var/lib/tftpboot.
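Assuming the usual /var/lib/tftpboot root on both sides, something like this copies the boot files across (the source hostname is a placeholder):

```shell
# copy the existing Auto Deploy tftp tree to this new server
scp -r root@existing-tftp-server:/var/lib/tftpboot/* /var/lib/tftpboot/
```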
Edit the tramp files as follows:
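The tramp file is a small gPXE script that chains the booting host on to the Auto Deploy web server. The edit is simply to replace the Auto Deploy server's address and port with those of the reverse caching proxy in front of it; roughly like this, where the addresses are placeholders for your environment:

```
#!gpxe
# original: chain straight to the Auto Deploy server, e.g.
#   set filename https://<autodeploy IP>:6501/vmw/rbd/tramp
# edited: chain via this host's caching proxy instead
set filename http://<proxy IP>/vmw/rbd/tramp
chain http://<proxy IP>/vmw/rbd/tramp
```

Each tftp server's tramp file points at its own proxy, so the boot traffic is spread across the proxies rather than all landing on the Auto Deploy server.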
Install squid and then configure it as a reverse caching proxy as follows:
Edit /etc/squid/squid.conf, replacing the placeholder values with the correct values for your infrastructure; only add the section that skips certificate verification if you are using self-signed certs.
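A minimal reverse-proxy fragment for squid.conf might look like the following. The host name, IP and ACL name are placeholders, and the sslflags option is the part you only need when the Auto Deploy server uses a self-signed cert:

```
# listen as an accelerator (reverse proxy) for the Auto Deploy site
http_port 80 accel defaultsite=autodeploy.example.com

# origin server: Auto Deploy serves HTTPS on port 6501 by default
# sslflags=DONT_VERIFY_PEER is only needed with self-signed certs
cache_peer 172.21.1.10 parent 6501 0 no-query originserver ssl sslflags=DONT_VERIFY_PEER name=autodeploy

# only proxy requests destined for the Auto Deploy site
acl autodeploy_site dstdomain autodeploy.example.com
http_access allow autodeploy_site
cache_peer_access autodeploy allow autodeploy_site
```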
Finally, you need to update your DHCP config to serve multiple tftp servers with option 66. I have read some articles suggesting that this can be done in some DHCP servers; however, supplying multiple values for option 66 is against RFC 2132 and will most likely not work.
I decided to use round robin DNS instead, and so configured option 66 with a host name that has an A record in DNS for each tftp server. This balances requests between the servers.
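As a sketch, with ISC dhcpd and a BIND zone file this might look like the following (all names and addresses are examples, not values from my environment):

```
# dhcpd.conf -- hand out the round-robin name as option 66
option tftp-server-name "tftpboot.example.com";

# BIND zone file -- one A record per tftp/proxy host
tftpboot    IN  A   172.21.1.101
tftpboot    IN  A   172.21.1.102
tftpboot    IN  A   172.21.1.103
```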
It is worth mentioning that if you are serious about resilience, you would be better off using an intelligent load balancer to handle requests to the tftp servers.
Finally, test the setup. Boot a number of hosts and monitor the Squid access log on your proxies. You will start to see that elements of the image are loaded directly from Squid's cache, removing load from the Auto Deploy host:
1334833786.434 334 172.21.1.103 TCP_MEM_HIT/200 367334 GET http://<proxy IP>/vmw/cache/52/5d959364a6cd2e4609e471bad4f246/scsi-lpf.07f8f2635938dc247dd71cf757947ad6 - NONE/- text/html
1334833786.486 10 172.21.1.103 TCP_MEM_HIT/200 30402 GET http://<proxy IP>/vmw/cache/a7/67d0a90aec4fe0345daed522ef47db/scsi-meg.6717a7a3865d8a3775691dcc5b434a03 - NONE/- text/html
That's it! You now have an Auto Deploy infrastructure capable of booting many hosts quickly without overloading the Auto Deploy service.