Sunday, June 17, 2012

Hunting the silent ESX disk murderer

...or how your old ESX servers that boot from SAN can create disk bottlenecks while collecting statistics

State of things

VMware ESX 4.1 cluster with 25 hosts. These hosts have no local disks; they BOOT from SAN. All of them boot from the same array that also hosts 200 virtual machines. The array has about 200 disks, but even though it is optimally set up regarding SATP/PSP, it cannot cope with the I/O traffic generated by the virtual machines, the backup system and the service consoles of the ESX hosts. At times it is so bad that some operations, such as Storage vMotion, time out.

Analyzing which virtual machines and datastores generate most of the I/O activity, I see in the vSphere Client that the boot datastores show a sustained write rate of about 700 to 1000 KB/s. This raises my concern: why the hell are the ESX hosts constantly writing that much data to disk? It is not a lot of I/O load by itself, but multiplied by 25 on an array that is collapsing it is significant, and worth removing in order to improve the overall performance of the system.

Each boot LUN is presented only to its own host, so its datastores host nothing but the local service console. This write activity must therefore come from the service console.

Who is the murderer?

The service console is not swapping at all and has a good amount of free memory. So there is no lack of memory and no thrashing going on.

In the service console I run my favorite command: vmstat 1
Surprisingly, the figures in the "bo" column (blocks written to disk, roughly KB/s) are not high. Looking at it in more detail, two or three times a minute there is a write burst, say around 20 MB, but most of the time little or nothing is written.
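To correlate the bursts with the clock, it helps to print a timestamp next to each sample. A minimal sketch, assuming awk is GNU awk, as it is on the RHEL-based service console:

    # One vmstat sample per second, prefixed with the wall-clock time,
    # to make the cadence of the bursts in the "bo" column easy to spot
    vmstat 1 | awk '{ print strftime("%H:%M:%S"), $0; fflush() }'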

My beloved esxtop matches the vmstat results: the LUN receives only burst writes, made by world 4096 (the service console).

What is the cause of such a discrepancy between the vSphere Client stats and what esxtop and vmstat say? The explanation is simple: vmstat and esxtop take samples every one or two seconds, respectively, while in the vSphere Client each sample is the average of the last 20 seconds. So what vmstat shows as a burst of 20 MB written in a single second, the vSphere Client shows as an average of 1 MB/s over the past 20 seconds.

There is a difference, isn't there? This reminds me of what someone told me many years ago: statistics is a science whose results say that if I eat two hot dogs and you do not eat any, we both ate one each.

There definitely seems to be a murderer in the service console. It's just a matter of finding it.

Prime Suspect: generation of logs

Nothing found at all. Everything in /var/log is growing, but nowhere near that rate.

Second Suspect: other files

Unfortunately, Linux does not have very good tools to see which files are being written to at any given time. iotop did not work in the service console, but I was able to enable block write logging and analyze the resulting dmesg output. Nothing relevant came of it.
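For reference, the block write logging I mean is the block_dump switch of the 2.6 kernel; a sketch, assuming the knob is present in your kernel build:

    # Log every block write to the kernel ring buffer for one minute.
    # Lines look like "command(pid): WRITE block N on sda1". Note that
    # syslogd's own writes get logged too, adding some noise.
    echo 1 > /proc/sys/vm/block_dump
    sleep 60
    echo 0 > /proc/sys/vm/block_dump
    dmesg | grep WRITE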

A find across the entire disk looking for files written in the last minute shed more light on the subject. It at least narrowed the search, but it was not very conclusive.
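The one-liner was something along these lines (-xdev keeps find on the root filesystem, so /proc, /vmfs and friends do not add noise):

    # Regular files modified during the last minute, root filesystem only
    find / -xdev -type f -mmin -1 2>/dev/null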

Which process is responsible for this?

In order to diagnose the cause, I put the host in maintenance mode, stopped all hardware monitoring agents and the vpxa (VirtualCenter agent), and disconnected the host from the vCenter Server. The behavior persisted. There was no doubt now: the activity came from hostd, as it was the only service left running. I shut down hostd and voilà, the burst writes disappeared.
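For reference, on classic ESX these services are handled through their init scripts; hostd itself hides behind the mgmt-vmware script (the hardware agent service names vary by vendor, so I omit them here):

    # Stop the VirtualCenter agent and, for the final test, hostd itself
    service vmware-vpxa stop     # vpxa
    service mgmt-vmware stop     # hostd, the host management service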

Thinking and thinking, looking and looking again, I see that the bursts of writes occur every 20 seconds. What does hostd do every 20 seconds even if the host is in maintenance mode and disconnected from the vCenter Server? Bingo! It collects statistics. That was the suspicion; now I had to collect evidence and hold a trial.

Collecting evidence

To test the suspicions I enabled trivia-level logging in hostd. This is done by editing /etc/vmware/hostd/config.xml as described in KB 1004795 and restarting hostd.
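The relevant fragment looks roughly like this (a sketch based on the KB; double-check the element names against your own config.xml):

    <config>
      <log>
        <!-- default is "verbose"; "trivia" logs everything, including Statssvc activity -->
        <level>trivia</level>
      </log>
    </config>

Followed by a restart of hostd:

    service mgmt-vmware restart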



Then I checked that the moments at which statistics were collected (by the hostd Statssvc service) according to the log coincided with the write peaks and with updates to the statistics files stored in /var/lib/vmware/hostd/stats. Moreover, the size of these files matches approximately the volume of each burst (in the screenshot, 1 MB). The murderer was surrounded; we're going to trial. (Note: this screenshot is not from an affected host, it is from a lab host. The pattern, however, is the same.)

The trial

Is this normal? Yes.
Stats are collected and recorded every 20 seconds and retained locally on the host for an hour. Every five minutes (300 s) these data are aggregated into new samples, and it is these five-minute samples that are sent to the vCenter Server to be stored in its database. They are also kept on the host for one day, as an optimization: queries covering the last day do not need to go to the vCenter database.
Figure taken from PowerCLI & vSphere statistics

In the previous screenshot there were two files: hostAgentStats-20.stats and hostAgentStats-300.stats. These files appear to store the local data for the 20-second and 300-second (five-minute) samples. It is the moment these files are updated, every 20 seconds, when we see the disk traffic generated by the alleged problem.
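You can watch them being rewritten in step with the bursts with something as simple as this:

    # The modification timestamps should advance every 20 seconds
    watch -n 5 'ls -l --full-time /var/lib/vmware/hostd/stats'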

At this point we can say that we have found the murderer. But we had no choice but to acquit him of manslaughter: what it does is perfectly normal and understandable. On a small host with one CPU and access to a few LUNs, the statistics to collect take up just over 1 MB. But when you have to collect statistics on a host with 8 NICs, 2 HBAs, 24 CPUs, 80 LUNs and 650 paths, writing all these stats every 20 seconds can be substantial.

To make things worse, you have a cluster of 25 hosts with clocks perfectly synchronized, as the Flying Spaghetti Monster mandates, and hostd aligns stats collection to seconds 0, 20 and 40 of every minute. So the 25 hosts, like zombies, write about 500 MB to the disk array at the same time.

What about ESXi?

Wasn't I told that ESXi runs from RAM and, once booted, does not require a physical disk (except to save configuration changes)? In fact, I have even removed the boot disk from a running ESXi host and it kept running quite happily. How can that work if the statistics cannot be written to disk the way ESX does it?

Let's check which file systems are mounted on ESXi.
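From Tech Support Mode, something along these lines shows it (which of these tools is available depends on the ESXi build, so take this as a sketch):

    # A visorfs filesystem should appear mounted over the stats path
    cat /proc/mounts | grep stats
    df -h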



Surprise! There is a filesystem mounted exactly where the stats files live. Why?

The visorfs is something like a RAM disk. It has its own features, but basically it is a RAM disk. In short: ESXi does write these files, but to memory, not to the physical drive!

The creative solution

So how do I get rid of these write bursts to the SAN disk? The answer lies in ESXi: write to memory.

The ESX service console is not the vmkernel; there is no visorfs in a Red Hat Linux. Does Linux have something like it? Yes, it is called tmpfs.

With several exceptions, tmpfs is something like visorfs: a RAM disk. The good part is that, unlike a traditional RAM disk, it uses memory only as needed and releases it when it is no longer needed.

So the solution is to mount a tmpfs on /var/lib/vmware/hostd/stats. We do this by adding a line to /etc/fstab and rebooting the host.
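The line is, give or take the mount options, the following (the 64 MB cap is my own choice, not a VMware recommendation; size it to comfortably fit your host's stats files):

    # /etc/fstab -- RAM-backed tmpfs over the hostd stats directory
    tmpfs  /var/lib/vmware/hostd/stats  tmpfs  defaults,size=64m  0 0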



After rebooting the host, the tmpfs is mounted.
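A quick check:

    # tmpfs should now appear over the stats directory
    mount | grep /var/lib/vmware/hostd/stats
    df -h /var/lib/vmware/hostd/stats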



And the stats files are in place.
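For instance:

    # hostd has recreated its stats files inside the tmpfs
    ls -l /var/lib/vmware/hostd/stats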


Yeah, you are right: when you reboot, these files are lost. Do not worry: the 5-minute stats are already in the vCenter database, and when hostd starts and finds these files missing, it recreates them.

In fact, this is exactly what ESXi does: it throws the stats away. hostd's init script (/etc/init.d/hostd) creates a RAM disk whose size is based on the number of VMs and resource pools in the system.
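I will not reproduce the script here, but translated to the tmpfs world of the service console the idea is roughly this (an illustrative paraphrase, not the actual code; the constants are my own guesses, and the real script also accounts for resource pools):

    # Illustrative paraphrase of the sizing logic -- NOT the literal init script.
    # Count the VMs registered in hostd's inventory file, then size the RAM
    # disk with a base amount plus a per-VM allowance before hostd starts.
    NUM_VMS=$(grep -c "<vmxCfgPath>" /etc/vmware/hostd/vmInventory.xml)
    SIZE_MB=$((8 + NUM_VMS / 4))
    mount -t tmpfs -o size=${SIZE_MB}m tmpfs /var/lib/vmware/hostd/stats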





And they lived happily ever after. Case closed.

Extra! Extra!

There is a statssvc section in the hostd config file /etc/vmware/hostd/config.xml that lets you specify several parameters, including how often to collect statistics and in which directory to store them.
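Something along these lines (the element names below are hypothetical, for illustration only; I have not verified the exact schema, so check your own config.xml before touching anything):

    <config>
      <statssvc>
        <!-- hypothetical element names, shown only to illustrate the idea -->
        <collectionIntervalSeconds>20</collectionIntervalSeconds>
        <storeDirectory>/var/lib/vmware/hostd/stats</storeDirectory>
      </statssvc>
    </config>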

Disclaimer: I'm not sure this will be "supported" by VMware, so this information is "as is". In any case, the generic disclaimer of this blog applies here.