garbage collection
Moazam Raja







 

Debugging thread-related hangs in the JVM

Once in a while Java users and developers run into problems where their Java application simply seems to hang. No core file is generated, no IO is detected, the process just sits there waiting...for something. Usually these problems can be traced to OS- and JVM-level threading.

The following is very Solaris-oriented; maybe I'll write up something from a Linux threading perspective soon.

A tale of two threading models
Solaris 8 and 9 have two separate threading models. There is a decent explanation of the two on the Sun website, here. To make a long story short, Solaris 8, by default, uses a Many-to-Many threading model, and Solaris 9 uses a One-to-One threading model.

After Solaris 9 was released, Solaris 8 also included the One-to-One threading model as an 'alternate' threading model. Users should make sure they are running the latest 108993 Patch on Solaris 8 if they want to use the alternate thread library. Earlier iterations of this thread library had flaws which were later fixed by the aforementioned patch. The libraries for this threading model are in /usr/lib/lwp. More on that later.

What to try when a Java process hangs
There are a couple of different things which you can try when your Java process hangs.

1. Get a Java level stack trace.
2. Get a snapshot of the current LWPs and their status.
3. Get a native level stack trace.
4. Try throwing a SIGWAITING signal at the process.
5. Force the process to dump a core.
6. Switch to the alternate thread library and try to reproduce the problem.


Let's go into more detail on each step.

1. Get a Java level stack trace.
The purpose here is to get a Java level stack trace from the hung process. Java level stack traces look like the following:

(Please note that the stack traces shown on this page are not examples of hanging applications; they are only provided to show what stack traces generally look like.)


There are a couple of different ways to induce a Java level stack trace from the hanging process. On Unix, you can send the process a SIGQUIT signal (kill -3, or kill -SIGQUIT), or press Ctrl-\ in the terminal that started it, and the process will dump the stack trace to its console. On Windows, you will have to press Ctrl-Break at the console which started the process.
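
As a rough sketch of the Unix approach, the following shell function sends SIGQUIT to a given PID. The function name request_thread_dump is hypothetical and not part of any JDK tooling. Note that the trace appears on the JVM's own console (or wherever its output was redirected), not in the shell that sends the signal.

```shell
# Hypothetical helper: ask a running JVM for a Java level stack trace.
# Assumes a Unix shell and a JVM that was NOT started with -Xrs.
request_thread_dump() {
    pid="$1"
    # SIGQUIT is signal 3; the JVM prints the trace and keeps running.
    kill -3 "$pid"
}

# Usage (replace 12345 with the PID of the hung Java process):
#   request_thread_dump 12345
```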

If you have started your JVM with the -Xrs option, sending SIGQUIT and SIGWAITING signals will not work. The purpose of the -Xrs flag is to reduce the VM's use of operating system signals, so the VM does not install its usual signal handlers.

In order to log the stack trace output from the process, you can redirect STDOUT and STDERR to a file when initially starting your Java application.

For example, to start jEdit with STDERR and STDOUT redirected to a file named 'console.txt', we do the following:

java -jar jedit.jar > console.txt 2>&1

This works in both Unix and Windows.

2. Get a snapshot of the current LWPs and their status.
This section is geared towards Solaris users. Many people use 'top' or 'prstat' to see how much CPU, memory, etc. is being consumed by each process. The 'prstat' command also allows the user to view resource usage at a finer granularity, the LWP level. The way to get this info is to run 'prstat -L'.

The output from prstat looks something like this:

prstat -L output

From the above output, you can see that there is one java process but it has multiple LWPs running. The top two (13 and 10) are taking 6.1% and 1.9% of the CPU. When you reach step 3 and look at the native level stack trace, you will be able to see exactly what a specific LWP is executing.

3. Get a native level stack trace.
Solaris includes a very useful program called 'pstack'. Fortunately for Linux users, Ross Thompson has ported this command over to Linux and it can be downloaded here.

The 'pstack' program returns the process call stack for a given PID (pstack <pid>). It shows each LWP number and whatever calls, including system calls, were in progress at the time pstack was run. This is extremely helpful in determining why a process is hanging or crashing.

The following is a native level stack trace which corresponds to the Java level stack trace from above. The Java level trace shows the nid as 0x22, which is hexadecimal for 34. You can see this within the first two lines of the first stack trace above.

"SideKick #1" prio=1 tid=0x00593b80 nid=0x22

Hence we know to look for LWP 34 in the pstack output. You may also have noticed that LWP 34 is the same LWP listed in the sixth line of the 'prstat -L' output, which is currently using 0.3% CPU.
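
The hexadecimal-to-decimal step is easy to get wrong by hand. Here is a minimal sketch of doing it in the shell; the function name nid_to_lwp is made up for this example.

```shell
# Hypothetical helper: convert the hexadecimal nid from a Java level
# stack trace into the decimal LWP number shown by pstack and prstat -L.
nid_to_lwp() {
    # printf interprets a 0x-prefixed argument as hexadecimal.
    printf '%d\n' "$1"
}

nid_to_lwp 0x22    # prints 34
```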

Native Stack for LWP 34

4. Try throwing a SIGWAITING signal at the process.

The SIGWAITING game
One of the first things I recommend people try when their Java app is hung is to send it a SIGWAITING signal (for example, kill -32 <pid>). This is Solaris 8 specific, and you should not have to do this on Solaris 9, since it has a different default threading behavior. What is SIGWAITING? It's signal number 32 and is defined as:

"A signal that is implemented in Solaris and used to tell a threaded process that it should consider creating a new LWP." [Threads Primer]

Usually when a hang happens, there are multiple threads scheduled on a single LWP and one of those threads is blocked while the others may still be runnable. The rest of the threads become starved and have no LWP to run on. When a SIGWAITING signal is sent to a hanging process, the threads library will create a new LWP and schedule a runnable thread on it. This may cause the hanging process to come back to life.

5. Force the process to dump a core.
If all else fails, you will probably need to do deeper analysis on the stack traces and core file to figure out what is causing the process to hang. Since the process is in a hanging state, it does not automatically dump core. On Solaris there is a program called 'gcore' which will force a process in practically any state to dump a core file. The usage is simple: gcore <pid>.

Make sure you have core dumps enabled within Solaris by using the coreadm or ulimit commands.

For example, to have uniquely named global core files stored in /var/cores/, issue the following commands as root:

mkdir -p /var/cores
coreadm -g /var/cores/%f.%n.%p.core -e global -e process -e global-setid -e proc-setid -e log


Personally, I feel that the data from the Java level stack trace, the native stack trace from pstack, and the prstat -L output is usually sufficient, so the core file is not necessarily needed. But your mileage may vary.

6. Switch to the alternate thread library and try to reproduce the problem.
If the SIGWAITING signal does indeed help the hanging process come back to life, then you are in luck. At this point you should switch to the alternate thread library and see if the hang recurs. Most likely, it will not.

The way to run your application under the alternate thread library is relatively simple and does not require recompiling. Simply reset your LD_LIBRARY_PATH environment variables to point to the new thread library. For example:

LD_LIBRARY_PATH=/usr/lib/lwp
LD_LIBRARY_PATH_64=/usr/lib/lwp/64

You can find many more details about the alternate thread library and how to use it by reading this article, Alternate Thread Library (T2) -- New in the Solaris 8 Operating Environment

Nerd Note: Using the alternate thread library with Java

Some articles on the web have mentioned that only the LD_LIBRARY_PATH needs to be set to use the alternate thread library with Java. Other documents mention the -XX:+OverrideDefaultLibthread flag and state that it is required to attain the full benefits of the new thread library. The fact is, -XX:+OverrideDefaultLibthread is required on certain VM versions. The reason is that certain JVMs cannot tell exactly which threading model the OS is using and can only be notified of the model via this flag. Depending on this flag, the internal behavior of certain versions of the VM did indeed change. The majority of VM versions in use do not need it, but it's much safer to add it anyway. In the end, be on the safe side and just specify -XX:+OverrideDefaultLibthread if you want to use the alternate thread library in Solaris 8.
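
Putting the pieces of this step together, a launch wrapper might look like the sketch below. The function name run_with_alt_libthread is hypothetical. The library paths and the JVM flag are the ones given above, and jedit.jar is reused from the earlier redirect example.

```shell
# Hypothetical wrapper: run a command with the Solaris 8 alternate
# thread library directories set as the library search path.
run_with_alt_libthread() {
    env LD_LIBRARY_PATH=/usr/lib/lwp \
        LD_LIBRARY_PATH_64=/usr/lib/lwp/64 \
        "$@"
}

# Usage on Solaris 8 (the flag is only required by some VM versions):
#   run_with_alt_libthread java -XX:+OverrideDefaultLibthread -jar jedit.jar
```

Setting the variables through env keeps the change local to the launched process, so the rest of the shell session continues to use the default thread library.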


© Copyright 2005 Moazam Raja.
Last update: 3/12/05; 2:10:07 PM.