VMware Cloud Community
Masa201110141
Contributor

ESX 4.1 AD integration - lsassd segfault, anybody experiencing the same?

Hello, I would like to see if anybody in the community has seen the same issue. I have already opened a case with VMware support, and I am told that this may be the first reported incident, which I have a hard time believing... So far VMware support has not been very helpful.

Background: We integrated all of our ESX 4.1 servers with Active Directory about two weeks ago so that we can use AD authentication to log on, sudo, etc.

Issue: lsassd segfaults randomly on multiple ESX 4.1 servers after integrating with Windows Active Directory. Once this happens, lsassd is no longer running, and we cannot log in using AD credentials.

Temporary workaround: Restart the lsassd process, manually or automatically, with the 'service lsassd start' command when we detect the segfault.

Some observations:

- This segfault happens on multiple systems at random times. So far we have seen the following in /var/log/messages (system names hidden with *****). These entries come from multiple systems.

Jan 21 01:12:36 ***** kernel: [3836705.238640] lsassd[24692]: segfault at 00000000f7d39fec rip 00000000008e86df rsp 00000000f7d39ff0 error 6
Jan 20 09:16:52 ***** kernel: [3785634.760625] lsassd[3961]: segfault at 00000000f7d00fec rip 00000000004f86df rsp 00000000f7d00ff0 error 6
Jan 18 21:42:02 ***** kernel: [3649570.760920] lsassd[8785]: segfault at 00000000f7ce1fec rip 000000000066c6df rsp 00000000f7ce1ff0 error 6
Jan 11 01:54:56 ***** kernel: [1606325.992594] lsassd[10142]: segfault at 00000000f7d06fec rip 0000000000bdd6df rsp 00000000f7d06ff0 error 6
Jan 11 02:04:27 ***** kernel: [1676465.246263] lsassd[1250]: segfault at 00000000f7d5efec rip 00000000006686df rsp 00000000f7d5eff0 error 6

- Some systems have not had the issue since we integrated them with AD. We also believe the configuration is OK on all the servers, because whenever lsassd is running, AD login appears to function properly.

- All of our ESX hosts run VMware ESX 4.1.0 build-320092

- All of our ESX hosts have the following patches installed:

------------Bulletin ID------------- -----Installed----- ---------------Summary----------------
ESX410-201010409-SG                  2010-11-26T12:46:39 Updates Tar package                   
ESX410-201010415-BG                  2010-11-26T12:46:39 Updates Cisco fnic driver             
ESX410-201010404-SG                  2010-11-26T12:46:39 Updates NSS_db package                
ESX410-201010414-SG                  2010-11-26T12:46:39 Updates vmware-esx-pam-config         
ESX410-201010412-SG                  2010-11-26T12:46:39 Updates Perl RPM                      
ESX410-201010413-SG                  2010-11-26T12:46:39 Updates GNU cpio package              
ESX410-201010402-SG                  2010-11-26T12:46:39 Updates GnuTLS, NSS, and openSSL      
ESX410-201010401-SG                  2010-11-26T12:46:39 Updates vmkernel64, VMX, CIM          
ESX410-201010405-BG                  2010-11-26T12:46:39 Updates VMware Tools                  
ESX410-201010410-SG                  2010-11-26T12:46:39 Updates Curl RPM                      
ESX410-201010419-SG                  2010-11-26T12:46:39 Updates likewisekrb5, likewiseopenldap
hp-classic-mgmt-solution-840.21.2217 2011-01-25T15:07:08 HP SNMP Agents for ESX 4.0            
ESX400-Update02                      2011-01-25T15:07:08 VMware ESX 4.0 Complete Update 2      
ESX410-GA                            2011-01-25T15:07:08 ESX upgrade Bulletin

- I am not finding a lot of lsassd failure articles on the Internet. Is this really something new?!?
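
The /var/log/messages entries above can be pulled out and compared with a short shell function. This is only a sketch: `segfault_summary` is a hypothetical helper name, and it assumes the kernel log format shown in the entries above. It prints the fault address, the stack pointer (rsp), and the distance between the two.

```shell
#!/bin/bash
# Summarize kernel segfault lines for one process name.
# Usage: segfault_summary lsassd /var/log/messages
segfault_summary() {
    local proc="$1" file="${2:-/var/log/messages}"
    # Keep only "<proc>[pid]: segfault at ..." lines, then pick the fields
    # by their position relative to the word "segfault".
    grep " ${proc}\[[0-9]*]: segfault at " "$file" |
    awk '{ for (i = 1; i <= NF; i++) if ($i == "segfault")
               print $1, $2, $3, $(i+2), $(i+6), $(i+8) }' |
    while read -r mon day time addr rsp err; do
        # 16#... tells bash arithmetic to read the value as hexadecimal.
        echo "$mon $day $time addr=$addr rsp=$rsp rsp-addr=$(( 16#$rsp - 16#$addr )) error=$err"
    done
}
```

For the five entries above, every line comes out with rsp-addr=4 and error=6. On x86, page-fault error code 6 means a user-mode write to a page that is not mapped, and a write just below the stack pointer into an unmapped page is a pattern often associated with a thread running out of stack. That is a hint worth passing to support, not a diagnosis.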

Has anybody experienced similar lsassd segfaults? Any info would be appreciated.

Thanks!

17 Replies
GDillon
Contributor

We have seen these segfaults on several hosts. Logging in as root and restarting lsassd seems to fix it, but I don't know the cause yet. It has happened on almost all of our hosts so far. I am considering writing a cron job to check the service status and restart it as needed. We are running 4.1.0, 260247.

-GDillon
Masa201110141
Contributor

Hi, Geoff,

Thank you for your response. I am so glad I am not the only one experiencing the issue.

I have been working (?) with VMware Support for nearly one month, and we still don't have a resolution.

We have so far tried:

1. Enable lsassd logging based on KB: http://kb.vmware.com/selfservice/search.do?cmd=displayKC&docType=kc&externalId=1026554

=> Logging is enabled. We have submitted numerous lsassd segfault log files to VMware from our systems. Unfortunately, this didn't really help them figure out what's going on… lsassd.log grows pretty quickly. If you have a separate file system for /var/log, you should watch its disk space or set up disk space monitoring so that it doesn't fill up to 100%.

2. Enable core dumps. I received the following instructions, supposedly from the VMware Support Engineering team:

They need the core-dumps of lsassd to debug it.

We need to do the following:

  1. ulimit -c unlimited

  2. echo 1 >/proc/sys/kernel/core_uses_pid

  3. echo '/var/core/%e-%p-%t.core' >/proc/core_pattern

On my systems, the last command fails because /proc/core_pattern does not exist, but I was told that this command only controls the core file naming and is not critical. All of our systems already had /proc/sys/kernel/core_uses_pid set to 1, so I don't think the 2nd command changed anything. We have had many lsassd segfaults since then, but no core file has been created. I asked VMware again a few days ago why not, and I haven't received a response.
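
For what it is worth, on stock Linux 2.6 kernels the core file pattern normally lives at /proc/sys/kernel/core_pattern rather than /proc/core_pattern. Whether the ESX 4.1 service console exposes it there is an assumption to verify on the host. The sketch below only reads the current values, so it is safe to run; `show_core_settings` is a hypothetical helper name.

```shell
#!/bin/bash
# Print the settings that influence whether and where core files are written.
# The optional argument overrides the sysctl directory (useful for dry runs).
show_core_settings() {
    local base="${1:-/proc/sys/kernel}"
    local name
    for name in core_pattern core_uses_pid; do
        if [ -r "$base/$name" ]; then
            echo "$name = $(cat "$base/$name")"
        else
            echo "$name = <not present>"
        fi
    done
    # This is the limit of the current shell only, not of a running daemon.
    echo "core size limit (this shell) = $(ulimit -c)"
}
```

Note that `ulimit -c unlimited` only affects the shell it is typed in and the processes started from that shell. If lsassd is started some other way (at boot, or by a cron job), it will not inherit that limit, which may be one reason no core file appeared.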

This is becoming quite annoying since AD integration is a new feature they recently introduced (and we really wanted to keep running), and it doesn't appear to be stable.

I currently have a cron job running every hour to check the existence of the lsassd process, and if it is dead, it will restart. Here it is if you would like to use it. I place it in /etc/cron.hourly. Nothing fancy, but it seems to do the restart task.

#!/bin/bash

################################################################################

  1. File name: lsassd-restart

  2. File location: /etc/cron.hourly on ESX hosts

  3. Date: 02/04/2011

  4. Version: 0.1

  5. Author: Masa

  6. Summary: This script detects and restarts lsassd if not running.

  7. Usage: Place this in /etc/cron.hourly, and make it 0755.

#

  1. Pre-req: ESX Host to be integrated to AD.

  2. Note: VMware Support Call #***** is in process as of 02/04/2011.

#

################################################################################

  1. Global variables

RESTART="/sbin/service lsassd restart"

PGREP="/usr/bin/pgrep"

LSASSD="lsassd"

GREP="/bin/grep -q -v $$"

  1. Look for lsassd pid

$PGREP $LSASSD | $GREP

  1. Restart lsassd if not running.

if

then

#echo "lsassd not running!"

$RESTART

#else

  1. echo "lsassd running ok"

fi

I will update the forum once we have resolved the issue.

Masa

GDillon
Contributor

Thanks for the script, although the forum post editor may have mangled it a bit. What is the expression in the if statement under "1. Restart lsassd if not running"?

-GDillon
lamw
Community Manager

Does this problem still occur in the latest 4.1 Update 1 release? Has VMware noted an issue with the Likewise AD components or with ESX itself? The version of Likewise that VMware uses is not the latest, so I'm wondering whether they could roll out the newer 6.0 version to see if it does in fact resolve the problem.

Masa2
Contributor

I am sorry, I had just replied to the email that came in, and the response got mangled. I am not sure whether pasting through the website works better, but here it is.

#!/bin/bash
################################################################################
# File name:     lsassd-restart
# File location: /etc/cron.hourly on ESX hosts
# Date:          02/04/2011
# Version:       0.1
# Author:        Masa
# Summary:       This script detects and restarts lsassd if not running.
# Usage:         Place this in /etc/cron.hourly, and make it 0755.
#               
# Pre-req:       ESX Host to be integrated to AD.
# Note:          VMware Support Call #******** is in process as of 02/04/2011.
#             
################################################################################
# Global variables
RESTART="/sbin/service lsassd restart"
PGREP="/usr/bin/pgrep"
LSASSD="lsassd"
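# Exclude this script's own PID ($$): pgrep can also match this script,
# since its name (lsassd-restart) contains "lsassd".
GREP="/bin/grep -q -v $$"
# Look for an lsassd pid other than our own
$PGREP $LSASSD | $GREP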
# Restart lsassd if not running.
if [ $? -ne 0 ]
then
   #echo "lsassd not running!"
   $RESTART
#else
#   echo "lsassd running ok"
fi

I am also attaching the script. That may work better.
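
The same check can also be sketched with an exact-name match, which avoids having to filter out the script's own PID. This is a variant under stated assumptions, not a tested replacement: `check_and_restart`, `PGREP_CMD` and `RESTART_CMD` are illustrative names, and it assumes the host's pgrep supports the `-x` (exact match) option.

```shell
#!/bin/bash
# Sketch: restart a daemon only when no process has exactly that name.
# PGREP_CMD and RESTART_CMD are illustrative and can be overridden for dry runs.
DAEMON="${DAEMON:-lsassd}"
PGREP_CMD="${PGREP_CMD:-/usr/bin/pgrep}"
RESTART_CMD="${RESTART_CMD:-/sbin/service $DAEMON restart}"

check_and_restart() {
    # -x matches the whole process name, so "lsassd-restart" (this script)
    # is not mistaken for the daemon itself.
    if $PGREP_CMD -x "$DAEMON" >/dev/null 2>&1; then
        return 0
    fi
    # Leave a trace in syslog so restarts can be counted later.
    logger -t lsassd-restart "$DAEMON not running, restarting" 2>/dev/null || true
    $RESTART_CMD
}
```

To use it from cron, end the script with a call to `check_and_restart`. The syslog line makes it possible to count how often the restart actually fires.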

Masa

Masa2
Contributor

Hi, thank you for your response. We try hard to stay current with all available patches, and we currently have the following:

vmware-esx-likewise-krb5-4.1.0-1.4.348481
vmware-esx-likewise-openldap-4.1.0-1.4.348481
vmware-esx-likewise-open-4.1.0-1.4.348481
vmware-esx-likewise-krb5-64-4.1.0-1.4.348481
vmware-esx-likewise-openldap-64-4.1.0-1.4.348481
vmware-esx-likewise-open-64-4.1.0-1.4.348481
vmware-esx-likewise-ad-provider-4.1.0-1.4.348481
vmware-esx-likewise-krb5-workstation-4.1.0-1.4.348481

We are still seeing lsassd segfaults on a daily basis. I passed your suggestions on to VMware Support and was told that they would be communicated to the engineering team. I am waiting for their reply.

Thank you again for your suggestions!
Masa
GDillon
Contributor

Any news from VMware Support on this? The cron job that I set up on each host seems to be holding up most of the time. Because of another dependent application, we can't upgrade our hosts to 4.1 Update 1, so I'm wondering whether VMware Support will come up with a specific patch.

-GDillon
Masa2
Contributor

I have some news, but not good enough after 2 months. The latest response I received from VMware Technical Support on 4/25 is that the core files I had sent to them when lsassd crashed have been analyzed and their engineering team is currently waiting on a few confirmations from Likewise.

Therefore, there is still no patch. I am still waiting. 😞

Masa2
Contributor

This issue may finally be going somewhere. Below is what I received from VMware Support yesterday. Due to our resource limitations, I asked a few questions about how to test and verify this patch. I am not sure we can provide a fully separate testing environment similar to our current production environment, and I am waiting for VMware Support to respond. Once I get more helpful updates from them, I will pass them along.

The good thing is that VMware is now aware that other customers are experiencing the same problem, and that it needs to be fixed.

Masa

>>>

Thank you for your Support Request.

My name is ***** and I am writing this email on behalf of my colleague *****. We have received an update from engineering that we are seeing same issue at multiple customer places and we need help to install a debug patch to further dig into it.

Currently we are also trying to reproduce it in-house and we need to check from you if you can install debug patch on one of you ESX hosts to help debug this issue further.

Please note that any ESX hosts installed with debug patch must not be placed into production and has to be re-installed as the debug patch cannot be removed once integrated with ESX kernel.

Looking forward to hearing from you.

Thanks & Regards,

***** VCP
Technical Support Engineer
Global Support Services
VMware Inc.
Contact Number:  1-877-4-VMWARE, Option 4.
Business Days: Monday to Friday
Business Hours: 0930 Hrs to 1730 Hrs (PST) Scheduled Absence: None

emeirell21
Contributor

Hi there,

Any news about this issue?

We are facing the same problem.

It seems to be intermittent: sometimes it works, sometimes it does not.

thanks a lot.

Eduardo Meirelles

www.emeirell.blogspot.com

Masa2
Contributor

I got this on 8/1/2011 from VMware Technical Support. However, this patch did not install in my environment, possibly because we have kept up with the latest VMware ESX patches. I asked Technical Support about it, but haven't received a response. You could try the attached patch and see if it installs on yours.

I would be very much interested in knowing if it installs on yours, and lsassd stays up.

Thanks!
Masaru

>>>>>>>

Thank you for your patience. The debug patch has been made available by the engineering team. I have attached the same zip file here.
Please note since this is a debug patch, you are not advised to use this patch in production environments.
First off, we need to apply this patch on one of the ESX Servers using the esxupdate command.
         The syntax for the same is { esxupdate -b <filename.zip> update }


The overall steps would be as follows:
Steps:
1) Remove ESX from AD.
2) Update ESX with the above debug patch.
3) Reboot ESX
4) Set stack to unlimited
    a) edit /etc/likewise/init-base.sh
    b) In function daemon_start() add ulimit -s unlimited in case of REDHAT

so it should look somewhat like this.
...
case "${PLATFORM}" in
REDHAT)
ulimit -s unlimited <<<< new line
echo -n $"Starting `basename ${PROG_BIN}`: "
...

5) Join this ESX server to AD and monitor if 'lsassd' service crashes.
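
Step 4 relies on the fact that resource limits set with ulimit are inherited by child processes, so a limit raised inside daemon_start() applies to the daemon it launches. A minimal sketch of that inheritance follows; `run_with_stack_limit` is a hypothetical helper, not part of the Likewise scripts.

```shell
#!/bin/bash
# Run a command with a specific stack size limit (in KB, or "unlimited").
# The subshell keeps the change away from the calling shell.
run_with_stack_limit() {
    local limit="$1"
    shift
    (
        ulimit -s "$limit" || exit 1
        exec "$@"
    )
}
```

For example, `run_with_stack_limit unlimited /bin/bash -c 'ulimit -s'` prints the limit as seen by the child process, while `ulimit -s` in the calling shell still shows the old value.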

GDillon
Contributor

I established a workaround using a cron job that checks the service state every 5 minutes and restarts it if needed.  I was having this problem on lots of my hosts before that and the workaround seems to have made the problem go away.

-GDillon
emeirell21
Contributor

Hi Masaru,

I also have a case open with VMware, but they take far too long to answer me.

I'll see if I can get a host to install this patch on.

From netlogond.log, it seems the host was trying to find a DC, but as you can see, it is not associated with any site.

20110819133337:0xf740cb90:INFO:[LWNetSrvGetDCName() /build/mts/release/bora-301967/likewise/esxi-esxi/src/linux/netlogon/server/api/dcinfo.c:97] Looking for a DC in domain 'domain.com', site '<null>' with flags 1000

Later in the log, it seems to have reached a DC, but in my case that DC is at a very distant site, and latency will really be a problem.

20110819133436:0xf73eab90:VERBOSE:[LWNetDnsGetAddressForServer() /build/mts/release/bora-301967/likewise/esxi-esxi/src/linux/netlogon/utils/lwnet-dns.c:983] Getting address for 'DC.domain.com'

Gdillon,

Sometimes restarting the service does not work. It runs for a few seconds and then crashes. If I check its status, I get:

lsassd dead but subsystem locked
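
On Red Hat style init scripts, a "dead but ... locked" status usually means the service's lock file is still present although no process is running. A small sketch to see both facts side by side follows. `service_lock_state` is a hypothetical helper, and the lock directory /var/lock/subsys is the Red Hat convention; whether the Likewise init script uses it for lsassd is an assumption.

```shell
#!/bin/bash
# Report whether a service has a running process and whether its lock file exists.
# Usage: service_lock_state lsassd [lock_dir]
service_lock_state() {
    local name="$1" lock_dir="${2:-/var/lock/subsys}"
    local running=no locked=no
    # -x: match the process name exactly.
    pgrep -x "$name" >/dev/null 2>&1 && running=yes
    [ -e "$lock_dir/$name" ] && locked=yes
    echo "running=$running locked=$locked"
}
```

A result of `running=no locked=yes` would match the status message above. The init script's restart action normally handles the stale lock itself, so this is for diagnosis only.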

as soon as I have more news I'll post here.

thanks

Eduardo Meirelles

www.emeirell.blogspot.com

Miccoli
Contributor

Hello,


We are having the same issue in our environment.

Has anyone found a fix for this issue?

We are still restarting lsassd as a workaround, but sometimes it works and sometimes it does not.

GDillon
Contributor

What version of ESX 4.1 are you on? Can you apply any of the updates? Could you upgrade to ESX 5? I'm not holding out hope that this will be fixed in 4.1 if the update packs haven't already fixed it.

-GDillon
Miccoli
Contributor

Hello

ESX version is ESX 4.1.0 Build-52767
We did apply the patches, but no luck.
Unfortunately, we don't have the option to upgrade to 5.0.


We have a case open with VMware, but so far there is no improvement...


I'm starting to think the same: that it will only be solved in 5.0 =\

Masa2
Contributor

All,

I was the one who started this thread a long time ago... It has been a long time since we got the last response from VMware Support, but basically, the issue was not going to be fixed on the particular version of ESX 4.1 we were running. I had mentioned that we kept up with ESX updates, and the issue still existed, although we may have seen fewer instances of lsassd segfaulting.

Unfortunately, in our environment the cron job that checks on the process was working fairly well, so we just moved on. Since then, we have upgraded our environment to ESXi 5.0, and we no longer seem to be seeing the issue. I know this is probably not the answer you want to hear, but that is what we have experienced...

Good luck!
