top is a great tool to diagnose a *nix system. The focus of this post is on using it for debugging Java threads on Linux.
First, the environment:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
A thread is a thread of execution. The way it differs from an operating system process (process, for short) is that it shares the address space of a process with other threads, whereas by default, the address space of a process is ‘protected’ from other processes. If you believe in multithreaded programming model, then the sharing of memory is the key strength of (and major source of problems with) threads.
Observation and measurement are critical. Galileo once said, “Measure what is measurable, and make measurable what is not so.”. The Observer Effect, the changes that the act of observation will make on a phenomenon being observed, can be reduced by better instrumentation and observation techniques.
top is one such good tool (although it takes up resources of its own) for *nix systems. It takes a bit of getting used to, but once you are used to it, it is almost indispensable. Until you hit the right keys however, it displays numbers that don’t make sense. This is compounded by the fact that every option usually has a ‘default value’. The default values are great, but the problem with them is that if default value is not something that pleases me, I hate not just the choice of that default, but the software/tool itself (that’s just human psychology)! But believe me, even though you hate to read man pages, coming up with the default value that pleases every user is very hard.
top is for the entire Linux system. It displays the resources used by each process and by the computer. The slowdown of a computer sometimes can be traced back to a few processes/programs running on it. But initial display of top can be daunting for the uninitiated. Hastily, you run top on the command line (and assuming you have no .toprc in your home folder – and if you do, you can skip to the next section), you’ll see something like this (that changes every three seconds):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
There are many references available to decipher the meanings of those lines and you can refer to them by googling around. But remember this: top is highly configurable and most if not all of its controls are (thankfully) toggles! This helps a lot because almost always, you have to press a single key to turn something on, or press the same key to turn it off. This is extremely useful in practice.
So, just press 1 (yeah, the number 1) on the top window display above. And here’s what you see:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Now compare this and that. You see that the statistics for each processor are suddenly appearing! Yeah, it is a quad-core processor! The recent load average is hovering around 1, which means about one of the four processors this box has, is busy! The Cpu1 is busy almost all the time in the so-called ‘userland’ (shown by 99.0us) which is my code. So, my code is busy keeping the processor that runs it busy :-). But which process is that? Ah, I see the first line with PID 7525 to have %CPU as 100.1. Thus, this process is keeping more than 1 processor busy!
I also see that the memory available on this box is about 8 GB and about 2 GB is just free, and about 1 G is in cache, which means plenty of memory is still available. But yes, with processing, we are falling somewhat short.
1 2 3 4 5 6 7 8 9 10 11
You see that the second line in this display shows that the number of Tasks is 1 total. So, you wonder what ‘Task’ means. Googling around you may come to know that for top, Task is same as process (like sometimes it happens in math, in computers, the terminology can be quite confusing). Then it says that 0 task(s) are running and 1 task(s) are sleeping, out of 1 total. That is definitely confusing. If the task/process was consuming 100% of the CPU, how come top shows that it is sleeping (don’t know if that’s a bug)?
OK, but this doesn’t tell me about the Java threads (yet). The thread that you create as a Java Language Construct (by extending java.lang.Thread or implementing java.lang.Runnable and starting it by calling the start() method) should somehow result in something at the system level – that’s a fair expectation – you tell yourself. And you are right, top helps you ‘see’ that as shown in the next section.
Java and top
Let’s now look at a simple Java program that I compile and run: File /home/kedar/ourgit/new_jymob_blog/source/downloads/code/Threads.java could not be found
As you can see, the program creates ‘Runner Threads’ at will. It has a Main Thread and a Reader Thread. The Reader Thread interacts with the stdin and following happens:
- If you press ‘n’ (meaning “new thread”), then a new thread is started. The new thread in its
runmethod runs the so-called ‘busy loop’ where it just chooses several random numbers and adds them up. It does this forever. The moment this thread is scheduled to be run on a processor, it is expected to keep it busy, one hundred percent!
- If you press ’m' (meaning “main thread”), the main thread, which, in Java parlance, is a non-daemon thread stops gracefully. While it’s not ‘stopped’ the main thread just waits to stop. In other words, it just checks the value of a volatile variable (sending too many reads on what is called the ‘interconnect’) all the time. This too, is expected to keep the processor and the interconnect (the bus that connects the chip and the main memory) very busy.
- If you press ‘q’ (meaning “quit”), the JVM program/process quits gracefully, stopping all threads properly. There’s only one thread (Reader) that writes to the volatile variable and others just read it (no compound operations like the innocuous-looking but dangerous ++). So, any change to the variable’s value should be read by other threads. In other words, this is a ‘thread-safe’ little program.
I run it as:
1 2 3
Now, how do you find the process ID (pid) of the JVM? Right, by doing: ps -afe | grep java. When I do that, here’s what I see:
1 2 3
There’s a reason for describing this in (rather excruciating) detail. The point I am trying to make is, you are looking for the pid of a process. We generally understand that every process has a ‘unique’ ID that identifies it and that ID is pid. But because of something peculiar that Linux does, this becomes complicated. And this is for historic reasons that we won’t go into here, but will come back to it.
OK, now do the same: top -p 7525 which will show you last top display.
Press c (yes, lowercase c).
Press H (yes, uppercase H).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
You say. Yeah, I did too! You notice:
- The processor Cpu1 is busy doing work as expected (because of ‘1’).
- You see the entire java command line (because of ‘c’)
- The second line changes from
Threads:and it starts showing 16 threads!
- The 16 threads are listed with every one of them getting its own pid!
Say what? Each thread getting a unique process ID, aka pid? So, is a process same as a thread? How come each thread gets a pid? And even if it gets, shouldn’t it be the same for the threads running in one single Java VM, because, after all, the JVM or your code started those threads? How can they get different pid’s?
If you now go back to Java program’s command prompt, and press ‘n’, you will see:
1 2 3 4 5 6 7
If you pressed ‘n’ again, a new thread that keeps yet another processor busy is started. Thus, if you pressed ‘n’ twice you will have 3 of the 4 CPU’s busy at 100%. (Don’t assume that a thread is ‘pinned to’ a CPU however. It’s completely up to the System Scheduler to assign processors to threads.) Roughly though, 75% of your entire system is now busy running your little Java threads! In this case, if you pressed ’m', the main thread would stop (gracefully), but the JVM leaves on. The reason for that is the non-daemon-ness of threads is inherited and since main thread (which was non-daemon) started off Reader thread, it (Reader) becomes non-daemon making each of the Runner threads non-daemon too (this is all configurable though, I am describing the ‘default behavior’ :-)). Since at least a non-daemon thread is running, JVM process does not go down.
Back at the top window:
Press f (lowercase f).
Then, take the cursor (down arrow) to PPID, press Space, take the cursor to nTH, press Space and take it to TGID and press Space.
Press Escape or q.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Scroll the above window All the Way to Right.
Look at the TGID column. This is where the glimpses of Linux’s threading library implementation are beginning to reveal. All the threads have the same ‘Thread Group ID’, although all of them have different pid’s. The Thread Group ID of all the threads is same as the process ID of the Java (JVM) process – this is the process ID that ps reported back before we started analyzing it with top.
So, you think that, maybe the Java/Native threads created in the JVM are fully specified by the (PID, TGID) pair? Yeah, that seems right. The JVM process (PID: 7525) also has the TGID 7525. The JVM then creates a native thread that runs the Java code in the so-called Main Thread. The main thread, you guess, would be the (7526, 7525) one, because at the moment, it’s keeping the CPU at a 100% (because of that busy-loop that reads the value of volatile variable: stopMain) and all the other are not doing any work.
Is that enough to conclude? Perhaps. But we programmers should sleuth and since (wo)man is a tool-using animal, we need tools. When I was befuddled by this, I somehow recalled another ‘excellent’ Linux tool:
strace. This tool lets you trace the system-calls done by any process. So, I run strace next. For this, I close the Java program and rerun it, however, strace can attach to a running process and profile its system calls.
Run: strace -ff -o /tmp/trace.log java -Xmx1g Threads
From strace man page:
1 2 3 4 5 6
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Now, my favorite Linux system call is
fork which (I thought) was the only way to create a new process in *nix. I thought maybe JVM is using pthreads library, who knows? I was prepared to grep for something like pthreads_create. So, I look at the trace.log.17449 and here’s what I find as the last line:
1 2 3 4 5
Oh I see! On Linux, on this particular kernel, the system call this JVM uses is
And this SO thread tells you about the ‘exceptional approach’ Linux takes about threads.
From man 2 clone:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Now things begin to make (more) sense. If you carefully analyze the parameters passed by the JVM to the clone() system call you will know the choices it is making. The clone system call, just like the fork system call returns the pid of the created thread/process. The TGID of the created thread however stays the same, which is the pid of the ultimate process that starts all of it, the pid of the ‘java’ process that identifies the JVM.
This seems to nicely come together. I see the clone system calls in the relevant strace log files.
Java Thread Dumps
What about the Java Thread Dumps? IMO, thread dumps are one of the best diagnosis tools to find out what is going on with your Java Threads. And you can get thread dump using the
kill -3 <pid> command that sends the SIGQUIT signal to the given pid. And now that we know that each thread has its own pid, you can send the SIGQUIT signal to any of the threads (pid: 17449 to 17463) to produce the dump or stack of all the threads! The pid’s of the threads are revealed in the thread dumps as well. If you did kill -3 18788 assuming 18788 is the pid of one of the threads started by the JVM, you will see something like:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Note that you can send the SIGQUIT signal to any of the pid’s that are identified by those nid values in the thread dump: It’s all in the family :-) of the thread group the thread belongs to. For some reason, the JVM guys chose to format those nid values in hex (e.g. hex 496a = 18794 in decimal).
- We should know top (and other diagnosis tools) really well. Memorizing the keystrokes and understanding their meaning really helps.
- Be curious.