.NET Hang Debugging Walkthrough
I have talked about a number of different hang/performance issues in my posts. This post is a generic debugging walk through for hangs. It applies to all types of .NET processes and to some extent also to non-.net processes even though it has some .net specific items.
Define the hang/performance issue
The first step in troubleshooting a performance issue/hang of any kind, whether it is .NET or native is to find out what we are dealing with and what the expected resolution is.
Asking yourself a few standard questions may help you zone in on the problem without even attaching a debugger, and if you need to attach a debugger you will have a lot better understanding of what data you should collect and what you should be looking for.
If you call into support with a hang situation it is also very good to have the answers to these questions handy.
My standard questions are:
- What is the user experience? For example. When they browse to your asp.net application do they get timeouts? Do the calls ever return? If they return, do they always seem to return after around 45 seconds? – This helps you get a clue about if you have a dead-lock or if you are waiting for a resource that is timing out etc.
- What is the CPU usage? Low CPU problems usually indicate that we are waiting for a lock or an external resource (web service, database etc), while a high CPU usage indicates that we have an infinite loop, high CPU in garbage collection or that the server simply can’t handle the load. If it is a high CPU issue, it is also interesting to note if we are constantly at 100% CPU in the affected process, or if we are running high but not necessarily 100%. And also, if any other processes on the box run at high CPU.
- Can you reproduce the problem? If you can reproduce the problem reliably, make an effort to try to reproduce it on a test server so you don’t have to put any additional pressure on the production server. In addition, if you can repro you are half-way home because we have a much smaller subset of the application to work with.
- Is there a temporary resolution? For example. Does the problem go away if you do an IISReset? This may give an indication of where the problem lies.
- What external processes are involved? Does the application call web services, do database access, call into a remoting server , access a main frame computer etc. And if so, what do we know about these processes? Do they seem to be blocking, have they stopped receiving requests etc.
- What modules are affected? For example in an ASP.NET application, do all pages hang or only a few (if so, what is special about the few?). Do html/asp pages work? etc.
- Do we have any interesting event log entries during the hang? Anything can be interesting, network entries, event log entries from other applications etc. There might be a clue out there and our job is to play Sherlock.
- What else do we know about the hang? Things that are typically interesting is “It only happens during load”, “Seems to happen more frequently on Mondays around 9:45 am”, “When the process hangs memory also seems to increase rapidly”.
If you have any type of information that seems even remotely related to the problem you should write it down in your log book.
Finally, if you are dealing with a performance issue rather than a full blown hang it might be worthwhile to take a few moments and think about what you consider acceptable performance to be, so you don’t spend hours and hours on optimizing after you have reached acceptable levels.
Gathering dumps
Once you have defined the problem a bit you can start collecting dumps and needless to say, it is very important that you get the dumps while the problem is occurring.
To take memory dumps of a hanging process you can use adplus that comes with the debugging tools for windows by running
adplus -hang -pn <process name.exe>
If it is possible, it is good practice to take two consecutive dumps during the same hang so that you can compare the dumps and determine if the process is moving forward at all, but one is enough in most cases.
During the time the dump is taken the process will momentarily stall. You can usually count 1 second per 10MB data that needs to be written, and with a full memory dump the size of the dump will typically be the same as the private bytes shown in perfmon for that process.
If you do take multiple dumps, make sure that the first one is completely written before you start the second dump, and preferably let it go10-20 seconds after the first dump is completely written before you start the second dump.
Debugging
Now we can start with the fun part of the process, yeah you guessed it right, debugging :)
- Open the memory dump in windbg File/Open Crash Dump
-
Set up your symbol path
.symfix c:\mysymcache .reload
Read this post to learn more about why symbols are important.
-
Load sos.dll
.load clr10\sos (for 1.1) or .load <framework directory>\sos.dll (for 2.0)
-
Check out all the .NET stacks and the native stacks
~* kb 2000 native stacks ~* e !clrstack .net stacks
If you are debugging an ASP.NET app you can read my post on what you should ignore while debugging an asp.net hang.
So far so good… now comes the tricky part, understanding what it is you are seeing…
Low CPU hangs
A low CPU hang is typically caused by one of these:
- Lock
- WaitOne
- Critical Section
- Waiting for external resources
Lock
A lock is a way to synchronize access to a resource and the code will look something like this
lock(_myLock){
//do some work
}
A thread that is waiting for a lock like the one in the sample above looks like this
0:003> kb 200
ChildEBP RetAddr Args to Child
0334f9ec 77f7f49f 77e74bd8 00000001 0334fa38 SharedUserData!SystemCallStub+0x4
0334f9f0 77e74bd8 00000001 0334fa38 00000000 ntdll!NtWaitForMultipleObjects+0xc
0334fa8c 79281971 00000001 00179f88 00000001 KERNEL32!WaitForMultipleObjectsEx+0x12c
0334fabc 79282444 00000001 00179f88 00000001 mscorwks!Thread::DoAppropriateWaitWorker+0xc1
0334fb10 7929a8ed 00000001 00179f88 00000001 mscorwks!Thread::DoAppropriateWait+0x46
0334fb94 7924884b 00184d88 ffffffff 00000000 mscorwks!AwareLock::EnterEpilog+0x9d
0334fbb0 792d2d1e 00cfd47c 00d109bc 00d109a4 mscorwks!AwareLock::Enter+0x69
If you have any stacks that are waiting in this state you should take a look at this blog post to see how to troubleshoot them further
WaitOne
WaitOne is a function called on a ManualResetEvent or an AutoResetEvent. It is used to wait for the return from an async call. For example when you call a web service, under the covers it will make an async call to the web service and the code that called the web service will sit in a WaitOne waiting for the web service call to return so that you can process the results.
The call stack for a thread waiting in a WaitOne looks like this
61 Id: 1164.a44 Suspend: 1 Teb: 7ff3a000 Unfrozen
ChildEBP RetAddr Args to Child
0d58f430 7c822114 77e6711b 00000001 0d58f480 ntdll!KiFastSystemCallRet
0d58f434 77e6711b 00000001 0d58f480 00000000 ntdll!NtWaitForMultipleObjects+0xc
0d58f4dc 791e0b3b 00000001 0d58f5bc 00000001 kernel32!WaitForMultipleObjectsEx+0x11a
0d58f50c 791e0bdd 00000001 0d58f5bc 00000001 mscorsvr!Thread::DoAppropriateWaitWorker+0xc1
0d58f560 791d5ba9 00000001 0d58f5bc 00000001 mscorsvr!Thread::DoAppropriateWait+0x46
0d58f5a8 024840c8 0d58f5b4 00000000 000186a0 mscorsvr!WaitHandleNative::CorWaitOneNative+0x6f
If you have any threads waiting in this type of call, check out this post on debugging locks and critical sections
Critical Section
A Critical Section is the native equivalent of a lock, and the call stack looks something like this:
0:021> kb
ChildEBP RetAddr Args to Child
0210e7cc 7c822124 7c83970f 000009f8 00000000 ntdll!KiFastSystemCallRet
0210e7d0 7c83970f 000009f8 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
0210e80c 7c839620 00000000 00000004 00000001 ntdll!RtlpWaitOnCriticalSection+0x19c
0210e82c 4bfa4c16 027c0714 0277a5b4 4bf727af ntdll!RtlEnterCriticalSection+0xa8
To troubleshoot these, look at the post on debugging locks and critical sections
External Resource
If neither of the above reasons for a low CPU hang applies, look for threads that are making DB calls or calls to other external resources - like APIs
High CPU Hangs
A high CPU hang is typically caused by one of these:
- Infinite loop
- High CPU in GC
- Busy Server
High CPU hangs are a little bit harder to troubleshoot since it is often hard to figure out if it is a single thread that is using up all the CPU or if it is a lot of small actions that make up for the high CPU.
If the process is steadily at 100% it is typically an infinite loop.
To determine what threads are using up most of the CPU you can run .time
and !runaway
to get an idea of how long your process has been up, how much time it has spent on the CPU (user mode time) and how much user mode time each thread has been using to find out what threads are using the most.
For example in this case we have spent a total of 14 seconds in user mode, and out of that, thread 14 has used 4 seconds.
0:000> .time
Debug session time: Mon Oct 16 12:20:22.000 2006 (GMT+2)
System Uptime: 3 days 17:11:24.778
Process Uptime: 0 days 0:09:07.000
Kernel time: 0 days 0:00:26.000
User time: 0 days 0:00:14.000
0:000> !runaway
User Mode Time
Thread Time
14:554 0 days 0:00:04.656
25:2ec 0 days 0:00:02.906
27:1264 0 days 0:00:02.359
17:13b4 0 days 0:00:01.625
16:12dc 0 days 0:00:01.375
...
Check out !runaway
in the windbg help file and play around with some of its options.
Infinite loop
If we are in an infinite loop, the thread with the loop will normally show up on top of !runaway
. You can also identify it by running ~* e !clrstack
and ~* kb
to see if any of the threads seem to be in a looping construct (if you know the code).
High CPU in GC
If the GC threads show up as the top threads in !runaway
. Take a look at the perfmon counter .NET CLR Memory/% Time in GC to see if the average is very high. If it is follow this post on debugging high CPU in GC to determine why.
Busy Server
If all the threads are just spread out, busy with different things you may have some optimization work ahead of you. There is no easy way to debug this type of thing to determine where to optimize. Instead you should pull out some of the most commonly used features and run profilers on them to see how much time / CPU time they take and optimize the pieces that take the longest time.
Phew, That was a long post :) But hopefully it has given you some insight in how to debug hangs and performance issues.