0 Replies Latest reply on Apr 3, 2011 4:31 AM by ahoogerhuis

    FAQ 17, random NFS issues and "Can't get vim_service - the session object is uninitialized or not logged in…"

    ahoogerhuis Enthusiast

      I thought I'd do a quick writeup on this issue. I've battled it on some installs and decided to dig.


      In every instance, this boils down to inherent issues with Linux and its filesystems.


      The symptoms people see are:


      - ghettovcbg2 failing randomly after having worked for some time, often with "Can't get vim_service - the session object is uninitialized or not logged in…"

      - Systems that use NFS for VM storage fail randomly when folding large snapshots.


      Common factors:


      - Servers providing storage run Linux with NFS (whether an actual Linux-based NAS such as QNAP, or other embedded implementations).

      - Using ghettovcbg2 to back up to NFS.


      FAQ 17 mentions that testing shows this has issues with ext3, but not with reiserfs or XFS. This is key information.


      The root cause, and this can bite in any situation with ESX and NFS, not only during backup, is "what rate can the underlying filesystem delete files at?".


      Ext3 is known for exceptionally poor performance when unlinking files. Ext4 is better, but for the file sizes common on VM storage, it will still fail at commonly used sizes.


      Does it matter whether my NFS is mounted at 100 Mbit or Gbit? Highly unlikely. This is purely a question of how many IOPS the OS on the NAS can process (if it runs Linux), and in general whether it can unlink a file faster than the NFS timeout in the vSphere/ESX NFS client setup.


      Why does something work for one person while failing for another? It depends on the disks, the layout of the storage, and other performance factors in the NAS, such as the presence of battery-backed cache memory. The limiting factor is how many IOPS the OS on the NAS can perform given your mix of disks, cache and RAID levels. On a QNAP-459 with 4x2TB green disks in RAID5 and 1GB RAM, this is not an impressive number. On an HP server with SAS disks in RAID5 and 1GB of battery-backed cache accepting data for committing, this number is 10-30 times higher. Also, if your NAS is busy with other work, that may cut into the IOPS and IO bandwidth available to vSphere/ESX.


      For a lot of QNAP users the limit seems to kick in anywhere from 40 to 200GB. My setups have QNAP x59 Pro series boxes, with a few 6- and 8-slot models around. These are all populated with 2TB green SATA disks (this is backup only, and a 6-slot QNAP with green SATA disks will perform at Gbit line speed in RAID5 without problems).


      My QNAPs have ext4 filesystems, and I mount them using NFS. I can typically delete about 200GB in the time it takes vSphere/ESX's NFS implementation to time out with the default values.


      So why did everything work wonderfully the first few times and then suddenly grind to a halt? Because my backup was set to keep 6 rotations. During the first 6 runs ghettovcbg2 never needed to delete anything. On the 7th run it needed to remove an old version of a 400GB VMDK file. Kaboom. Enter the famous "vim_service - the session object is uninitialized or not logged in…". Remove a few versions of VMs from the backup store by force and restart, and everything runs nicely.


      You'll see the same issue if you have a very big snapshot on a VM and use vSphere to fold it into the base VMDK. In some cases this will spectacularly leave your whole VMDK file in an inconsistent state, and it isn't recommended at all.


      So, how do I know what size is the magic number where my NAS cannot cope? How do we fix this? Is there a fix? Is there a workaround?


      How to check: make a copy of a VM onto your NAS, and then from the console of your vSphere/ESX machine try to delete it. If it works for 50GB, try 100GB, and at some point it will blow up.
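      That check can be scripted. Here's a minimal sketch, assuming a POSIX shell on the vSphere/ESX console; the datastore path and sizes are placeholders, not values from my setup:

```shell
#!/bin/sh
# Probe how big a file the NAS can unlink before the delete outlasts
# the ESX NFS client timeout. Run from the vSphere/ESX console.
probe_unlink() {
    # $1 = directory on the NFS datastore, $2 = test file size in MB
    dir=$1; size_mb=$2
    # dd writes real blocks, so the NAS filesystem must allocate -- and
    # later free -- every one of them.
    dd if=/dev/zero of="$dir/unlink-test.bin" bs=1M count="$size_mb" 2>/dev/null
    start=$(date +%s)
    rm -f "$dir/unlink-test.bin"
    echo "unlink of ${size_mb}MB took $(( $(date +%s) - start ))s"
}

# Example: step up through 50/100/200GB until the delete blows up.
# for gb in 50 100 200; do probe_unlink /vmfs/volumes/nfs-backup $((gb * 1024)); done
```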


      The fix is: make sure your disk implementation is speedy enough to perform what you need within the time limits provided by NFS in vSphere/ESX.


      The best workaround for QNAP: if you have a QNAP TS-x59 and are running firmware 3.3.6 or newer, don't use NFS for backup. Create an iSCSI target and use that instead. That removes all the problematic NFS/NAS file-deletion semantics, and VMFS on iSCSI copes beautifully. Works for me on 3.5, 4.0 and 4.1.


      The first workaround: http://vmware.com/files/pdf/VMware_NFS_BestPractices_WP_EN.pdf . Try extending the time vSphere waits for the NAS on operations that hang.
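      On ESX those knobs live in the advanced NFS settings, something along these lines with esxcfg-advcfg on the host. The values shown are illustrative only, not a recommendation; check the PDF above for what suits your version:

```shell
# Inspect the current NFS heartbeat settings on the ESX host:
esxcfg-advcfg -g /NFS/HeartbeatFrequency
esxcfg-advcfg -g /NFS/HeartbeatTimeout
esxcfg-advcfg -g /NFS/HeartbeatMaxFailures

# Make the host more patient with a slow NAS (example values only):
esxcfg-advcfg -s 20 /NFS/HeartbeatFrequency
esxcfg-advcfg -s 10 /NFS/HeartbeatTimeout
```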


      The second workaround: tell ghettovcbg2 you want one more rotation than you need. If you have space for 3 and that is what you want, tell it you want 4. Then have a cron job that goes around and deletes the 4th rotation after the backup runs, roughly like so: "rm -rf -- ${MY_BACKUP_STORE}*/*--4". This has to be run on the NAS itself.
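      A sketch of what that cron job could run; the store path, rotation number and function name are mine, so adjust them to your own layout:

```shell
#!/bin/sh
# Drop the sacrificial extra rotation locally on the NAS. Running it
# there means the delete never crosses NFS, so it cannot trip the ESX
# client timeout.
purge_rotation() {
    # $1 = backup store root (with trailing slash), $2 = rotation number
    rm -rf -- "$1"*/*--"$2"
}

# Example crontab line on the NAS, an hour after the backup window:
# 0 5 * * * root /share/scripts/purge-rotation.sh
# purge_rotation /share/backup/ 4
```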


      The dirty workaround, if you can't access the OS on your NAS: manually delete the files belonging to the last rotation prior to running the backup, with the same command as above. It will time out and dump some errors in your shell, but they are harmless. If the first round of rm doesn't nuke all the files, loop around again until the files are actually gone.
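      Sketched as a loop (the store path is a placeholder; the errors each pass prints are the harmless NFS timeouts mentioned above):

```shell
#!/bin/sh
# Keep retrying the delete over NFS until the files are actually gone.
# Each rm may time out and print errors; just loop until nothing is left.
nuke_rotation() {
    # $1 = backup store root, $2 = rotation number, $3 = pause between passes
    while ls "$1"*/*--"$2" >/dev/null 2>&1; do
        rm -rf -- "$1"*/*--"$2"
        sleep "${3:-30}"   # give the NAS a breather before the next pass
    done
}

# nuke_rotation /vmfs/volumes/nfs-backup/ 4
```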


      The workaround for ghettovcbg2? When processing backup rotations and deleting the last rotation, run rm in a thread of its own, and make it retry the delete up to N times in case NFS times out. Having NFS time out is not a huge issue; vSphere/ESX recovers quickly and nicely from it.
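      As a rough shell sketch of what ghettovcbg2 could do; the function name and retry count here are made up, not part of the script:

```shell
#!/bin/sh
# Delete the expired rotation in a background process, retrying a bounded
# number of times, so one slow NAS delete cannot wedge the whole backup run.
delete_rotation_async() {
    # $1 = backup store root, $2 = rotation number, $3 = max retries
    (
        n=0
        while [ "$n" -lt "$3" ] && ls "$1"*/*--"$2" >/dev/null 2>&1; do
            rm -rf -- "$1"*/*--"$2"
            n=$((n + 1))
        done
    ) >/dev/null 2>&1 &
}

# Kick off the delete, carry on with the backup, then wait at the end:
# delete_rotation_async /vmfs/volumes/backup/ 4 5
# ... rest of the backup run ...
# wait
```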


      Alternative fix for ghettovcbg2: have the vMA mount the same export the vSphere/ESX boxes use for backup, mount it with much longer timeouts than vSphere/ESX uses, and perform the deletion from the vMA.
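      Roughly like this on the vMA; host, export and mountpoint are placeholders:

```shell
# Mount the same export the ESX hosts use, but with a far more patient
# NFS client (timeo is in tenths of a second, so timeo=600 waits 60s
# per attempt before a retransmit).
sudo mkdir -p /mnt/backup
sudo mount -t nfs -o hard,timeo=600,retrans=5 nas.example.com:/share/backup /mnt/backup

# Then delete the expired rotation from the vMA instead of the ESX host:
rm -rf -- /mnt/backup/*/*--4
```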


      Hope this helps. All and any feedback appreciated.