MDB-Cheats


A collection of commands for analyzing kernel crash dumps. There is a lot more you can do with mdb; perhaps I will update this page soon.

# cd /var/crash/`uname -n`
# mdb 0  
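
If the dump was only saved in the compressed vmdump.N format, expand it first (dump number 0 and the target directory are just examples):

# savecore -vf vmdump.0 .       -> writes unix.0 and vmcore.0 into the current directory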

Or attach to the live kernel on a running Solaris system:

# mdb -k /dev/ksyms
> ::msgbuf		-> look at the message buffer
> $<msgbuf		-> the old macro form, also seen; this syntax doesn't always work

> ::cpuinfo -v          -> what's running?
> ::ps		        -> processes
> ::ptree		-> processes in a tree
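
These dcmds can also be run non-interactively by piping them into mdb, e.g. for quick checks from the shell or against a saved dump:

# echo "::ptree" | mdb -k
# echo "::msgbuf" | mdb 0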

Which process panicked the machine?

> panic_thread/K
panic_thread:
panic_thread:   2a10001bd40
> 2a10001bd40$<thread
[...]
0x2a10001be50:  lwp             procp           audit_data
                0               1438788         0
[...]
> 1438788$<proc2u
[...]
0x30003e15971: psargs /usr/ccs/bin/sparcv9/nm /dev/ksyms\0\0\0\0\0\0\0\0\0\0\0 \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0 \0\0\0 
[...]
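
On kernels with CTF data, the ::print dcmd gets you the same answer as the adb macros above; a sketch reusing the addresses from this example:

> 2a10001bd40::print kthread_t t_procp        -> yields the procp value used below
> 1438788::print proc_t p_user.u_psargs       -> prints the command line of the process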

Real memory consumption

> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                    1947820              7608   47%
ZFS File Data             1448500              5658   35%
Anon                       131224               512    3%
Exec and libs               29826               116    1%
Page cache                 237040               925    6%
Free (cachelist)             8618                33    0%
Free (freelist)            339957              1327    8%

Total                     4142985             16183
Physical                  4054870             15839
>
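
Sanity check on the units: the MB column is just Pages times the page size. The numbers here only add up with 4 KB pages (which points to an x86 box), e.g. for the Kernel line:

1947820 pages x 4 KB = 7791280 KB ~ 7608 MB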

Change values

# mdb -kw
Loading modules: [ unix krtld genunix ip logindmux ptm nfs ipc lofs ]
> maxusers/D
maxusers:
maxusers:       495
> maxusers/W 200
maxusers:       0x1ef           =       0x200
> $q
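
Mind the radix: mdb treats bare integers as hexadecimal, which is why /W 200 wrote 0x200 (512 decimal). To write a decimal value, use the 0t prefix:

> maxusers/W 0t200       -> sets maxusers to decimal 200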

Do not forget how to quit mdb ;)

> ::quit

Debugging a core file

# echo ::status | mdb core 
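
A couple more one-liners that are usually worth a try against a user core:

# echo '$c' | mdb core        -> stack trace of the thread that dumped core
# echo '$C' | mdb core        -> the same, including frame pointers and saved arguments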



Solaris 11.4 example

Who panicked my box? Which files were opened? In this example I hit a bug where Solaris could not handle too many procfs interests on the same target, so monitoring killed the server...

vmcore.0> $c
vpanic(10817d08, 2a13c710340, 0, 18403ca7b0dc0, 18403ca7b0dc1, 0)
turnstile_will_prio+0x210(18403ca7b0dc0, 1, 0, 18403a2f056c0, 31, 208c9c00)
turnstile_block+0x168(184039bf68c50, 0, 18403f73272b0, 2012d188, 0, 0)
mutex_vector_enter+0x3f0(4, 18403a2f056c0, 18403a2f056c0, 2012d208, 18403f73272b0, 209092e8)
holdf+4(18403f73272b0, 0, 0, 10, 18403a46563c0, 1)
flist_fork+0x12c(18403d1112198, 184042321d408, 1, 7fffffff, 1ff, 184042321c560)
cfork+0xa50(0, 1, 18403f4b244c0, 3, 1, 184042321c560)
spawn+0x1b0(ffffffff4e6d50d0, 18403f4b244c8, ac, 18403f4b244c0, 1224, 0)
syscall_trap+0x238(ffffffff4e6d50d0, ffffffff7f501100, ac, ffffffff7f504000, 1224, 1018d54ac)
vmcore.0>
vmcore.0> ::msgbuf
[...]
panic[cpu111]/thread=18403ca7b0dc0:
Deadlock: cycle in blocking chain


000002a131201530 genunix:turnstile_will_prio+210 (18403ca7b0dc0, 1, 0, 18403a2f056c0, 31, 208c9c00)
  %l0-3: 0000000000000000 00000000208c9e48 0000000000000000 0000000000000000
  %l4-7: 0000000000000000 0000000000000000 0000000000000000 00018403ca7b0dc0
000002a1312015e0 genunix:turnstile_block+168 (184039bf68c50, 0, 18403f73272b0, 2012d188, 0, 0)
  %l0-3: 0000000000069286 0000000000000001 0000000000000001 00018403d11112f0
  %l4-7: 00018403d643f160 00018403ca7b0dc0 000184039bf68c80 0000000000000001
000002a131201690 unix:mutex_vector_enter+3f0 (4, 18403a2f056c0, 18403a2f056c0, 2012d208, 18403f73272b0, 209092e8)
  %l0-3: 00022ecb6ef08e34 0000000000000000 0000000000000000 000000002012d1f0
  %l4-7: 00018403a2f056c0 0000000000000001 0000000000000000 0000000000000000
000002a131201740 genunix:holdf+4 (18403f73272b0, 0, 0, 10, 18403a46563c0, 1)
  %l0-3: 000184038cc06500 ffffffffffffffff 0000000000000000 00000300000dc000
  %l4-7: 00018404dfcdb148 0001840391137388 00018403faaf2150 0000000000000001
000002a1312017f0 genunix:flist_fork+12c (18403d1112198, 184042321d408, 1, 7fffffff, 1ff, 184042321c560)
  %l0-3: 0000000000000001 0000000000000001 0000000000000000 0000000000000100
  %l4-7: 00018403c5658000 00018403f73272b0 00018403a465a000 00000000000001ff
000002a1312018a0 genunix:cfork+a50 (0, 1, 18403f4b244c0, 3, 1, 184042321c560)
  %l0-3: 00018403d11112f0 0000000000000000 00018404867ef2b0 000002a131201978
  %l4-7: 000184038cc06500 00018403ca7b0dc0 00000000209ce978 00000000209ce930
000002a1312019c0 genunix:spawn+1b0 (ffffffff4e6d50d0, 18403f4b244c8, ac, 18403f4b244c0, 1224, 0)
  %l0-3: 0000000000000002 000000007fffffff 0000000097c818a3 0000000097c818a2
  %l4-7: 0000000097c96f17 0000000000000000 0001840455e61800 000184046431a000

syncing file systems...
 done

vmcore.0> panic_thread/K
panic_thread:
panic_thread:   18403ca7b0dc0
vmcore.0> 18403ca7b0dc0$<thread ! grep procp
    t_procp = 0x18403d11112f0
vmcore.0>  0x18403d11112f0$<proc2u ! grep psargs
    p_user.u_psargs = [ "/opt/IBM/ITM/sol296/ux/bin/kuxagent" ]
vmcore.0> 0x18403d11112f0::pfiles
FD   TYPE            VNODE INFO
   0  CHR 00018403530daa00 /zoneHome/z03a/root/dev/null
   1  REG 00018403b90a2480 /zoneHome/z03a/root/opt/IBM/ITM/logs/z03a_ux_1571165952.log
   2  REG 00018403b90a2480 /zoneHome/z03a/root/opt/IBM/ITM/logs/z03a_ux_1571165952.log
   4 DOOR 00018403c0d9ab00 /zoneHome/z03a/root/system/volatile/name_service_door [door to 'nscd' (proc=18403c6119210)]
   5 SOCK 000184036b8e7780 socket: AF_INET 0.0.0.0 17658 listen backlog: 8

   6 SOCK 00018403b4fec400 socket: AF_INET 0.0.0.0 33005 listen backlog: 8

   7 SOCK 00018403a49cd980 socket: AF_INET 127.0.0.1 62718 remote: AF_INET 127.0.0.1 1920
   8 SOCK 00018403b5a08880 socket: AF_INET 127.0.0.1 14856 listen backlog: 49

   9 SOCK 00018403b13c4e80 socket: AF_INET 127.0.0.1 22620 remote: AF_INET 127.0.0.1 3661
  10 SOCK 00018403d14c9580 socket: AF_INET 127.0.0.1 14856 remote: AF_INET 127.0.0.1 13294
  11 SOCK 00018403a4583240 socket: AF_INET 127.0.0.1 13294 remote: AF_INET 127.0.0.1 14856
  12 SOCK 00018403afd68b80 socket: AF_UNIX  remote: AF_?? (0)
  13 SOCK 00018403ac6e4940 socket: AF_UNIX  remote: AF_?? (0)
  14  REG 000184036ad61a40 /zoneHome/z03a/root/opt/IBM/ITM/auditlogs/root.z03a_ux_audit.log
  15 SOCK 00018403b4906640 socket: AF_UNIX /opt/IBM/ITM/sol296/ux/bin/pasipc/.pas_sock listen backlog: 16
  16  CHR 00018403bafb8980 /zoneHome/z03a/root/dev/kstat
  17 SOCK 0001840373fb8c00 socket: AF_INET 192.168.111.146 15797 remote: AF_INET 192.168.100.97 63358
  18 SOCK 00018403cbe12780 socket: AF_INET 0.0.0.0 18302 listen backlog: 8

  19 SOCK 0001840373eaaf00 socket: AF_INET 192.168.111.146 23827 remote: AF_INET 192.168.100.97 1918
  20 SOCK 00018403a0cc9880 socket: AF_INET 127.0.0.1 14856 remote: AF_INET 127.0.0.1 16927
  21 SOCK 00018403a1667700 socket: AF_INET 127.0.0.1 16927 remote: AF_INET 127.0.0.1 14856
  22  REG 00018403ac48f240 /zoneHome/z03a/root/opt/IBM/ITM/logs/z03a_ux_kuxagent_5da61702-02.log
  23  DIR 00018403b775fb80 /zoneHome/z03a/root/opt/IBM/ITM/tmp/osfcp
  24 FIFO 00018403b31b3e80
  25  DIR 00018403a2ee2b80 /zoneHome/z03a/root/proc
  26 PROC 000184050cf70980 /zoneHome/z03a/root/proc/67521/psinfo (proc=184042fc42788)
  27 FIFO 000184052c1df340
  28 FIFO 00018403b5e37e80
 256  DIR 00018403b8634780 /zoneHome/z03a/root/proc/67521
vmcore.0>
vmcore.0> ::ps ! grep 67521
R  67521      1  67521  67521   1001 0x4a004400 000184042fc42788 oracle
vmcore.0> ::ps -zf ! grep 67521
R  67521      1  67521  67521     1   1001 0x4a004400 000184042fc42788 oracleSID_1 (LOCAL=NO)
vmcore.0>
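
A few follow-ups against the same dump (a sketch; ::panicinfo is only there if your mdb version provides it, and the proc address is the one from ::ps above):

vmcore.0> ::panicinfo                                   -> panicking cpu, thread, message and registers in one place
vmcore.0> 184042fc42788::pfiles                         -> open files of the oracle target process
vmcore.0> 184042fc42788::print proc_t p_user.u_psargs   -> its full command line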


Some mdb commands

Kernel cage - all free memory ended up in the kernel cage (KCAGE) bucket:

::fed -v
  USER MN T   TOTAL        8k      64k       4m     256m
credit  0 u       -       [-]      [-]      [-]      [-]
   fed  0 u       -         -      [-]      [-]        -
 total  0 u       -         -        -        -        -
credit  1 u       -       [-]      [-]      [-]      [-]
   fed  1 u       -       [-]      [-]      [-]      [-]
 total  1 u       -         -        -        -        -
credit  2 u       -       [-]      [-]      [-]      [-]
   fed  2 u       -         -      [-]      [-]      [-]
 total  2 u       -         -        -        -        -
    total u       -         -        -        -        -

 KCAGE MN T   TOTAL        8k      64k       4m     256m
credit  0 k   12.6m        56        -        3      [-]
   fed  0 k  317.8g   27b6b73        -       2b      [-]
 total  0 k  317.8g   27b6bc9        -       2e        -
credit  1 k   21.7m       4f6       bd      [-]      [-]
   fed  1 k    2.3g     16e44      880      17d      [-]
 total  1 k    2.3g     1733a      93d      17d        -
credit  2 k    3.8m       1e9        -      [-]      [-]
   fed  2 k   13.9g    1bd3f5        -       16      [-]
 total  2 k   14.0g    1bd5de        -       16        -
    total k  334.2g   298b4e1      93d      1c1        -    <---- TOTAL 334 GB!
  reserve k    4.6g     94d71        -        -        -
 rsrvhigh k    2.3g     4a6b8        -        -        -

In this case, ::memstat still reports plenty of available memory:

Page Summary                            Pages             Bytes  %Tot
---------------------------- ----------------  ----------------  ----
Kernel                                5691438             43.4G    4%
Kernel (ZFS ARC excess)                  2452             19.1M    0%
Defdump prealloc                      2029989             15.4G    1%
ZFS Kernel Data                       2043300             15.5G    1%
ZFS Metadata                           376392              2.8G    0%
ZFS File Data                         1406002             10.7G    1%
Anon                                 15523780            118.4G   10%
Exec and libs                          299223              2.2G    0%
Page cache                            1339587             10.2G    1%
OSM                                  86447616            659.5G   54%
Free (cachelist)                        16284            127.2M    0%
Free (freelist)                      43814273            334.2G   28%
Total                               158990336              1.1T

Access to this memory is much slower because a user process requesting a page first needs to "uncage" it. This manifests itself as random delays... That is a bug and should be fixed in 11.4+.
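
To look at the cage itself, the cage counters can be read directly (a sketch; the variable names assume the standard SPARC mem_cage implementation):

> kcage_on/D              -> 1 if the kernel cage is enabled
> kcage_freemem/E         -> free pages inside the cage
> kcage_needfree/E        -> pages the cage is currently trying to reclaim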

You might see threads blocked on the waiting queue for memory (empty output is good):

 echo "::walk fed_blocked |::stacks" | mdb -k 

On a healthy system, ::fed -v should look like this:

> ::fed -v
  USER MN T   TOTAL        8k      64k       4m     256m       2g      16g
credit  0 u   11.4m       519       14        -      [-]      [-]      [-]
   fed  0 u    5.4g     9ec51      61f       6b        -        -      [-]
 total  0 u    5.4g     9f16a      633       6b        -        -        -
credit  1 u   25.0m       3ca       98        2        -      [-]      [-]
   fed  1 u    5.2g     a1a72      2ff       20        -        -      [-]
 total  1 u    5.2g     a1e3c      397       22        -        -        -
credit  2 u   34.4m       6a6       93        3      [-]      [-]      [-]
   fed  2 u  903.6m     131de      cbf       16        -      [-]      [-]
 total  2 u  938.1m     13884      d52       19        -        -        -
    total u   11.6g    15482a     171c       a6        -        -        -

 KCAGE MN T   TOTAL        8k      64k       4m     256m       2g      16g
credit  0 k   22.6m       454       20        3        -        -        -
   fed  0 k    3.3g     663d2        -       34        -        -        -
 total  0 k    3.4g     66826       20       37        -        -        -
credit  1 k   12.6m       44e        -        1        -        -        -
   fed  1 k    3.9g     76a05        -       49        -        -        -
 total  1 k    4.0g     76e53        -       4a        -        -        -
credit  2 k   18.7m       558        1        2        -        -        -
   fed  2 k    6.5g     cd4dd        -       2a        -        -        -
 total  2 k    6.5g     cda35        1       2c        -        -        -
    total k   14.0g    1ab0ae       21       ad        -        -        -
  reserve k    4.7g     97a00        -        -        -        -        -
 rsrvhigh k    2.3g     4bd00        -        -        -        -        -
>