Re: cannot debug service console kernel dump (cos core file)

v_potnis2001 · ‎10-22-2009

Hi experts,

1. We are running

vmware -v

VMware ESX Server 3.5.0 build-110268

2. We have a service console (sc) core

# file cos-core-XX

cos-core-XX: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV)

3. and also the corresponding vmlinux image

file vmlinux-2.4.21-57.ELvmnix

vmlinux-2.4.21-57.ELvmnix: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, stripped

4. along with System.map file

5. Installed crash

crash -v

crash 4.0

...

GNU gdb 6.1

...

This GDB was configured as "i686-pc-linux-gnu".

6. When we try to debug the core, crash itself crashes with a segmentation fault.

crash vmlinux-2.4.21-57.ELvmnix cos-core-HWSVR02

crash 4.0

...

..

Segmentation fault

7. Then we compiled crash 4.1.0 from source and ran it under gdb

Program received signal SIGSEGV, Segmentation fault.

0x08124860 in dump_Elf32_Nhdr (offset=4096, store=1) at netdump.c:1734

1734 netdump_print("%08lx ", *uptr++);

It looks like crash segfaults while it is dumping the ELF32 note headers of the core.
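One plausible cause of a fault at that line is a note header whose size fields point past the data that is actually mapped. As an illustration only (this is not crash's netdump.c; `walk_notes` and `align4` are made-up names), here is a minimal Python sketch of walking ELF32 note headers defensively, assuming a little-endian PT_NOTE segment already read into `buf`:

```python
import struct

# Elf32_Nhdr: n_namesz, n_descsz, n_type (three little-endian 32-bit words)
NHDR = struct.Struct("<III")


def align4(n):
    """Round n up to the next multiple of 4 (ELF32 note padding)."""
    return (n + 3) & ~3


def walk_notes(buf):
    """Return (name, n_type, desc) for each note in buf.

    Stops at the first all-zero header or at the first header whose
    sizes would run past the end of buf, instead of reading on.
    """
    notes = []
    off = 0
    while off + NHDR.size <= len(buf):
        namesz, descsz, ntype = NHDR.unpack_from(buf, off)
        if namesz == 0 and descsz == 0 and ntype == 0:
            break  # empty header: padding or corruption
        name_off = off + NHDR.size
        desc_off = name_off + align4(namesz)
        if desc_off + descsz > len(buf):
            break  # sizes point outside the segment
        raw_name = buf[name_off:name_off + namesz]
        name = raw_name.rstrip(b"\0").decode("ascii", "replace")
        notes.append((name, ntype, buf[desc_off:desc_off + descsz]))
        off = desc_off + align4(descsz)
    return notes
```

Running something like this over the note segment of the core would show at which offset the headers stop making sense.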

We are aware that the vmkdump utility exists for debugging vmkernel crash dumps, but what is the best way to debug the cos core?

We intend to open an SR for this with VMware, but posting questions on VMTN has often helped.

So far many of our questions have been answered here, and we hope to get a response to this one soon.

Many thanks in advance.

sundarrd · ‎10-28-2009

Hi, do you have these files?

vmlinux-2.4.21-57.ELvmnix and vmlinux-2.4.21-57.ELvmnix.debug

It looks like a malformed header.

Please use the netdump.c which is attached to this post and recompile the crash utility and try.

Note: this is not a patch; it is the actual netdump.c.

$ tar xvzmf crash-4.1.0.tar.gz

...

$ cd crash-4.1.0

$ cp path/to/attached/netdump.c netdump.c

$ make

v_potnis2001 · ‎10-28-2009

Many thanks, let me try out netdump.c asap. Yes, I can provide all the required files: vmlinux-2.4.21-57.ELvmnix and vmlinux-2.4.21-57.ELvmnix.debug

Where do I upload these files? We are really in need of help here.

sundarrd · ‎10-28-2009

You can upload them somewhere and let me know the location privately.

v_potnis2001 · ‎10-28-2009

The files we have are:

System.map

vmlimnux-2.4.21-57.ELvmnix.21-752-594

vmlinuz-2.4.21-57.ELvmnix-281-752-594

cos-core-HWSVR02

vmlinux-2.4.21-57.EL.debug

We don't have the exact files you mentioned. I recompiled crash with the netdump.c you provided; there is no segfault now, but I get this error:

crash vmlinux-2.4.21-57.EL.debug vmlimnux-2.4.21-57.ELvmnix.21-752-594 cos-core-HWSVR02

crash 4.1.0

crash: vmlimnux-2.4.21-57.ELvmnix.21-752-594: not a supported file format

sundarrd · ‎10-28-2009

I think the syntax might be wrong:

crash [-h ][-v][-s][-i file][-d num]

This patch http://download3.vmware.com/software/esx/ESX350-200808201-UG.zip contains the kernel source rpm.

Maybe use the -g flag to rebuild the debug image and then try again.

kernel-source-350.2.4.21-57.EL.110268.i386.rpm

Please try:

crash

You may need the debug image from VMware support, or build one from the source. Also make sure that the same version of ESX Server is used to build the debug image.
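Before handing a rebuilt vmlinux to crash, it may be worth checking that it really carries debug information, since the `file` output earlier in the thread reports the shipped image as stripped. A rough Python sketch, assuming a little-endian ELF32 image read fully into memory (`section_names` and `debug_status` are hypothetical helper names, not part of any tool):

```python
import struct

EHDR = struct.Struct("<16sHHIIIIIHHHHHH")  # Elf32_Ehdr, 52 bytes
SHDR = struct.Struct("<IIIIIIIIII")        # Elf32_Shdr, 40 bytes


def section_names(data):
    """Return the section names of a little-endian ELF32 image (bytes)."""
    if len(data) < EHDR.size or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 1 or data[5] != 1:
        raise ValueError("not a little-endian ELF32 file")
    hdr = EHDR.unpack_from(data, 0)
    e_shoff, e_shentsize, e_shnum, e_shstrndx = hdr[6], hdr[11], hdr[12], hdr[13]
    if e_shoff == 0 or e_shnum == 0:
        return []  # no section header table at all
    shdrs = [SHDR.unpack_from(data, e_shoff + i * e_shentsize)
             for i in range(e_shnum)]
    # sh_offset and sh_size of the section-name string table
    str_off, str_size = shdrs[e_shstrndx][4], shdrs[e_shstrndx][5]
    strtab = data[str_off:str_off + str_size]
    names = []
    for sh in shdrs:
        start = sh[0]  # sh_name: offset into the string table
        end = strtab.find(b"\0", start)
        if end == -1:
            end = len(strtab)
        names.append(strtab[start:end].decode("ascii", "replace"))
    return names


def debug_status(data):
    """Return (has_debug_info, has_symtab) for the image."""
    names = section_names(data)
    return ".debug_info" in names, ".symtab" in names
```

An image built with -g would be expected to report a `.debug_info` section; if neither flag is true, crash has little to work with beyond System.map.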

v_potnis2001 · ‎10-29-2009

Thanks for the update. I'll upload the core I could reproduce on my test system (ESX 3.5.4) and send you a private note.

Meanwhile, let me try compiling the kernel.

Another point I want to add is that I couldn't find the *.debug file under /usr/lib/debug/boot.

v_potnis2001 · ‎10-29-2009

1. Compiled crash with the new netdump.c

2. Recompiled ESX 3.5 with -g option

3. Generated a cos core file

4. # pwd

/usr/src/linux-2.4.21-47.0.1.EL

5. crash -S System.map vmlinux /vmfs/volumes/469e8d3d-5bda02b2-5a67-0015c5eaf19d/cos-core-thorpc205

Again, I got a segmentation fault:

crash 4.1.0

crash: overriding /boot/System.map with System.map

WARNING: possibly corrupt ELF32_Nhdr: n_namesz: 0 n_descsz: 0 n_type: 0

Segmentation fault

Will upload the files and provide a location soon.

sundarrd · ‎10-29-2009

Have you checked this?

http://www.vmware.com/support/vi3/doc/vi3_esx35u4_rel_notes.html#knownissues

See if anything there applies to the cause of the panic.

If the core is not debuggable, there is nothing we can do, as it is a proprietary OS with some modifications. VMware support is the one that needs to be actively involved.

Maybe this is applicable:

ESX Server hosts become unresponsive during a network broadcast storm

When a network broadcast storm occurs, ESX Servers might become unresponsive due to an issue with the tg3 network driver. During this time, service console or virtual machines that use the tg3 NIC might lose network connectivity. Rebooting the machine or unloading/loading the driver restores connectivity, but does not resolve the issue.

ESX hosts with tg3 port cannot send or receive packets after being subjected to a broadcast storm. The following error messages might be logged in VMkernel:

1. WARNING: Net: 1082: Rx storm on 1 queue 3501>3500, run 321>320

2. VMNIX: WARNING: NetCos: 1086: virtual HW appears wedged (bug number 90831), resetting

Core dumps are lost when multiple ESX hosts share a dump partition

If multiple ESX hosts that share a dump partition fail and save core dumps simultaneously, the core dumps might be lost.

What is the messages output?

Maybe something is bad with the hardware. Is it a VMware HA environment?

Does the panic happen on all nodes, and are you unable to debug the other cores as well?

From what I see, the core is not valid. Even with gdb you might not be able to do anything.

Do you get any error when running gdb, such as a message that the core file is truncated, or "Cannot access memory at address 0x0"?

We might not be able to do anything further without VMware support assistance.

Is there a difference between the readelf -a output of the core files you generated and that of the bad core file you are dealing with?
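A truncated core can also be spotted without gdb by comparing each program header's file offset and size against the actual file size. The following is only a sketch under the assumption of a little-endian ELF32 core; `read_phdrs` and `truncated_segments` are illustrative names:

```python
import struct

EHDR = struct.Struct("<16sHHIIIIIHHHHHH")  # Elf32_Ehdr, 52 bytes
PHDR = struct.Struct("<IIIIIIII")          # Elf32_Phdr, 32 bytes


def read_phdrs(f):
    """Return (p_type, p_offset, p_filesz) for each program header.

    f is a binary file object positioned anywhere; only the headers
    are read, so this also works on very large core files.
    """
    f.seek(0)
    raw = f.read(EHDR.size)
    if len(raw) < EHDR.size or raw[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if raw[4] != 1 or raw[5] != 1:
        raise ValueError("not a little-endian ELF32 file")
    hdr = EHDR.unpack(raw)
    e_phoff, e_phentsize, e_phnum = hdr[5], hdr[9], hdr[10]
    phdrs = []
    for i in range(e_phnum):
        f.seek(e_phoff + i * e_phentsize)
        raw = f.read(PHDR.size)
        if len(raw) < PHDR.size:
            break  # the program header table itself is cut off
        p_type, p_offset, _, _, p_filesz, *_ = PHDR.unpack(raw)
        phdrs.append((p_type, p_offset, p_filesz))
    return phdrs


def truncated_segments(f, file_size):
    """Return the segments whose data would extend past file_size."""
    return [(t, off, size) for t, off, size in read_phdrs(f)
            if off + size > file_size]
```

With a real core, one would open the file in binary mode and pass `os.path.getsize(path)` as `file_size`. Any entry returned suggests the dump was cut short; an empty list does not prove the contents are valid.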

v_potnis2001 · ‎10-29-2009

Many thanks once again. I will read through your post and respond. I did check the ESX 3.5.4 release notes for another issue, but it's not applicable here.

Meanwhile, here is the location of the core files and I'm pretty sure these core files are not corrupt.

Name (ftp.veritas.com:support): anonymous

331 Guest login ok, send your complete e-mail address as password.

Password:

230 Logged in anonymously.

Remote system type is UNIX.

Using binary mode to transfer files.

ftp> bin

200 Type okay.

ftp> hash

Hash mark printing on (8192 bytes/hash mark).

ftp> prompt

Interactive mode off.

ftp> cd /pub/support/281-752-594

250 "/pub/support/281-752-594" is new cwd.

ftp> ls

200 PORT command successful.

150 Opening ASCII mode data connection for /bin/ls.

System.map

cos-core-thorpc205

vmlinux

#

226 Listing completed.

41 bytes received in 0.0067 seconds (5.95 Kbytes/s)

ftp>

sundarrd · ‎10-29-2009

Could you please post the readelf -a output of this file and of the other core file that you created in the lab, the one on which you were able to run crash?
