What on earth caused my process to crash?
So you got an event 1000 (“w3wp.exe stopped unexpectedly”) in the event viewer, or your process just exited in some weird, undefined way and you don’t know why.
When a process crashes or exits, a special event called EPR (Exit PRocess) is fired, so with a debugger like windbg.exe we can attach to the process, wait for the EPR event and take a memory dump. When you install the Debugging Tools for Windows you get a VBS script called adplus that does this automatically for you, and it also prints logs for most exceptions that occur during the lifetime of the process.
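For example, if the process that keeps dying is an IIS worker process, a command line along these lines (just a sketch; the process name and output directory are examples, and you should check the adplus documentation for the exact switches in your version) attaches adplus in crash mode and produces dumps and logs when exceptions occur and when the process exits:

adplus -crash -pn w3wp.exe -o c:\dumps

adplus then sits and waits, so leave it running until the crash you are chasing actually happens.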
Debugging tip!: When you open a dump taken in -crash mode, you will automatically be positioned on the thread that was active when the crash occurred (the most likely suspect). If you switch threads and want to get back to the faulting thread, type ~ to list all threads; the current thread is marked with a dot and the thread that caused the crash is marked with a #, and you can switch back to it with ~#s.
If your dump only shows one active thread in the process, and that thread is the main thread, the process was likely killed by something external (health monitoring, low system memory, iisreset, etc.).
I’ll go into more depth on some of these scenarios in later posts, but I’m a very top-down person, so I wanted to start off with a post about some common situations where you will see a managed process exit, so that you have some idea of what to look for in the dump.
In no particular order, here are some of the ones we see most in support:
Stack Overflow Exceptions
A stack overflow exception occurs when the memory allocated for a thread’s stack is used up. By default the stack is 1 MB, so your call stack can be pretty deep, and most of the time when this happens it’s because of infinite recursion, i.e. FunctionA calls FunctionB, which calls FunctionB again, which calls FunctionB again… and there is no stop condition.
A bit of an obscure infinite recursion situation that unfortunately is pretty common is incorrect usage of the Exception Handling application block. Imagine this scenario: your application gets an exception, the exception handler kicks in, and you have set it up to log to a file. While logging you get an exception of some kind (say access denied), and you have set the exception handler to handle this exception too. In this case you end up in an infinite recursive loop where you handle an exception, throw another, handle it, throw another… you get the gist. The moral of the story? Don’t use the exception handler to handle exceptions thrown in the exception handler. :)
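To make the pattern concrete, here is a contrived C# sketch (the type and method names are made up for illustration, this is not the actual Exception Handling block API) of a handler that ends up calling itself when logging fails:

using System;
using System.IO;

class ExceptionPolicySketch
{
    public static void HandleException(Exception ex)
    {
        try
        {
            // Imagine this throws as well, e.g. because access to the log file is denied.
            File.AppendAllText(@"c:\logs\app.log", ex + Environment.NewLine);
        }
        catch (Exception loggingException)
        {
            // Handling the logging failure with the same handler means
            // HandleException -> HandleException -> ... with no stop condition,
            // until the 1 MB stack is exhausted and the process dies with a
            // StackOverflowException.
            HandleException(loggingException);
        }
    }
}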
If you run kb 2000 (to see the native stack) and !clrstack (from sos.dll, to see the managed stack) you can spot the recursive pattern and track down where and why the recursion occurred.
Out Of Memory Exceptions
Most of the time when an Out Of Memory exception occurs it is caused by a design problem, where too much memory is stored in cache or session scope. Caching can be great for increasing performance if used in the right way, i.e. you cache the data that will be accessed the most and you don’t cache it for longer than you need. In old ASP you got problems if you stored objects in session scope, and believe me, that was a blessing in disguise because developers stored only the most necessary items in session scope. Storing large datasets in session scope, for example, is often counter-productive: you reduce the number of concurrent users your web site can handle, and once memory gets high enough, the overhead of garbage collections and searches through the cache may cost you more than requesting the data from the database when you actually need it.
There is no one-size-fits-all answer for when you should store something in session/cache and when you shouldn’t. The best thing to do is to determine, at an early stage, how many users your application needs to be able to handle, and based on that determine how much you can allow yourself to store per user. Then stress-test with more than your maximum number of users to make sure you can cope. Preferably, stress-test with and without the objects in session state to see what the performance difference is for different numbers of users.
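As an illustration of the “don’t cache it for longer than you need” point, here is a rough C# sketch (the key name, the time-out and the stubbed-out data access call are just examples) of caching shared data with an absolute expiration instead of parking a copy per user in session state:

using System;
using System.Web;
using System.Web.Caching;

public static class ProductCacheSketch
{
    public static object GetProducts()
    {
        Cache cache = HttpRuntime.Cache;
        object products = cache["Products"];
        if (products == null)
        {
            products = LoadProductsFromDatabase();
            // One shared copy for everyone, kept for at most 5 minutes,
            // instead of one copy per user sitting in session state forever.
            cache.Insert("Products", products, null,
                         DateTime.UtcNow.AddMinutes(5), Cache.NoSlidingExpiration);
        }
        return products;
    }

    // Hypothetical data access call, stubbed out for the sketch.
    private static object LoadProductsFromDatabase()
    {
        return new object();
    }
}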
Memory issues are very hard to fix when you are in production since they often require a lot of re-design, so a penny spent in the early stages will save a lot of pennies later.
Debugging tip!: Run !dumpheap -type System.Web.Caching.Cache to get the cache roots, and then run !objsize on those addresses to find out how much you are storing in cache for the different applications. (Note: InProc session state is stored in the cache as well.)
For a more in-depth discussion on why you get Out of Memory exceptions see my earlier post.
Unhandled exception in COM Component
If your application calls a native COM component, the process can crash if an unhandled exception occurs in it, for example if it references memory that has already been freed or something similar.
Running kb 2000 will then show the COM component on the stack, so you can narrow it down from there.
Native heap corruption
This, along with GC holes, is one of the nastiest issues to troubleshoot. A native heap corruption occurs when someone writes to a location that it was not supposed to write to. The problem is that you will not see the error when the code that writes to the wrong address executes, but rather when someone else tries to access that memory address in the correct way. In other words, long after the “thief” was there. The location that is written to can be the heap, or even worse, a location where code is stored (so instructions are overwritten), or the stack, so your code calls into the middle of nowhere. Most frequently this occurs when you write outside the bounds of a buffer or something to that effect.
For an introduction to heap corruption, read Geoff Gray’s article on heap corruption. When you get a crash because of heap corruption, the faulting stack will often be in a heap allocation function in ntdll, and to resolve it you need to run with GFlags/PageHeap enabled so you can catch the thief in the act. The reason these issues are so hard to catch is that they are very random and very hard to reproduce.
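As a rough example (the image name is just a placeholder, and you should double-check the switches against the GFlags documentation for your version), enabling and later disabling full page heap for a process typically looks something like this:

gflags.exe /p /enable w3wp.exe /full
gflags.exe /p /disable w3wp.exe

With full page heap on, the corrupting write triggers an access violation at the moment it happens, so the dump points at the real culprit instead of at the innocent code that stumbled over the corruption later.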
Managed Heap Corruption
Managed heap corruption is a heap corruption that occurs on the managed heaps. Again, you see the problem long after it occurred. A managed heap corruption occurs when someone overwrites a piece of the managed heap that it wasn’t allowed to. Normally you can’t buffer overflow in managed code; if you have a byte[] and try to write outside its bounds you get an IndexOutOfRangeException or similar. The most common reason for a managed heap corruption is that the code called PInvoke and passed in a buffer of some sort, but the buffer was too small. The PInvoked function writes to the buffer, but writes beyond its bounds and onto the next object on the managed heap. Later on, when the garbage collector does its work and tries to walk the heap, things go bad and the process crashes.
If you get a crash and the top of the active stack contains a GC function, start looking for PInvokes in your code and see if you might be passing buffers that are too small.
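Here is a contrived C# sketch of how this can happen (the function and the buffer sizes are just picked for illustration): the StringBuilder below has capacity for 8 characters, but the native API is told it may write up to 260, so it can write well past the end of the buffer that the interop marshaler set up:

using System;
using System.Runtime.InteropServices;
using System.Text;

class PInvokeBufferSketch
{
    [DllImport("kernel32.dll", CharSet = CharSet.Auto)]
    static extern uint GetTempPath(uint nBufferLength, StringBuilder lpBuffer);

    static void Main()
    {
        // The managed side only reserves room for 8 characters...
        StringBuilder buffer = new StringBuilder(8);

        // ...but we tell the native function that the buffer is 260 characters
        // long, so it is free to write far beyond what we actually allocated,
        // stomping on whatever lives next to the buffer. The crash shows up
        // much later, typically during a garbage collection.
        GetTempPath(260, buffer);

        Console.WriteLine(buffer.ToString());
    }
}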
Fatal Execution Engine Exceptions
Fatal Execution Engine Exceptions are fairly rare, but when they occur it’s normally a bug. It means that for some reason we ended up in a piece of code in the CLR that we were not supposed to reach, and the CLR has decided that in the unlikely event that someone gets here, it should throw a Fatal Execution Engine exception and die, because there is no way to recover from that point. In the event log this will be logged as an Execution Engine Exception, and the address listed will tell you exactly where in the code it occurred. If you reach one of these and can’t find a knowledge base article about it, contact support, preferably with a crash dump available, since that will speed up the time to resolve the case tremendously.
GC Holes
These are fairly rare too. A GC hole occurs when the unmanaged portion of the CLR holds a pointer to managed memory but “forgot” to tell the GC about it, so the GC doesn’t know to keep the object around (if there are no other roots) or to track its movement. This means that if a GC happens at the “wrong time”, the pointer could end up pointing basically anywhere, causing a lot of havoc. Yun Jin talks a little about this here.
To date I have had few or no issues involving GC holes, so it wouldn’t be the first thing to look for in a “crash” dump.
No available memory left
If your process dies and at the same time you see a huge dip in Memory\Available MBytes, to the point where it hits 0, the lack of memory is likely what caused the process to crash. Here, of course, the task is to check which process stole all the memory.
External process kills/Recycles the process
Countless times, when I have had a customer with adplus attached in crash mode, I have gotten memory dumps where the process was killed by something external. I wanted to add this gotcha since normally, when you troubleshoot a crash, this is not the actual reason for the crash you are troubleshooting (only in rare cases); rather, someone ran iisreset or killed the process not knowing that you had a debugger attached waiting for the crash. So if you are troubleshooting a crash and get a memory dump, check the event log to make sure that no-one ran an iisreset and that this was the reason for your crash dump. That will save you some hours trying to find the culprit of the crash. :)
Another curious one is where some kind of monitoring software is installed to check that the server serves pages as expected, and it shuts down the process if it doesn’t. Usually this logs an event of the occurrence in the event viewer as well, though.
Health monitoring settings
This one might be a “doh!!! of course I know that we are set to recycle the process every 24 hours, or if the server has been idle for more than 20 minutes”, but it’s worth mentioning anyway because we get these issues pretty frequently. Again, normally an event will be logged in the event log stating why the process was recycled, but the moral of the story is to have a brief look at the health monitoring settings for your application pool in IIS, or in machine.config, to see what recycling options you have turned on so you don’t get a surprise later.
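For reference, with the ASP.NET process model (IIS 5 mode) these settings live in the <processModel> section of machine.config; on IIS 6 the equivalent recycling options are configured on the application pool in the IIS manager instead. The snippet below is only an illustration of the kind of attributes to look at, with example values rather than recommendations:

<!-- machine.config, ASP.NET process model (IIS 5 mode): example values only -->
<processModel enable="true"
              timeout="Infinite"
              idleTimeout="00:20:00"
              memoryLimit="60"
              requestLimit="Infinite" />

Here idleTimeout recycles the process after 20 minutes of inactivity and memoryLimit recycles it when it uses more than 60% of physical memory, so settings like these can easily look like mysterious crashes if you don’t know they are turned on.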
There are more reasons for crashes than the ones above, but they are usually pretty obscure, so I hope this post has given you a little insight into where to start looking if your process suddenly dies.
Over and out…