My Guide to Performance Tuning For Android and Java

This is my guide to tuning performance of Java programs. It covers both analysis of cpu time and object allocation. I have found that this works for Java programs and for many Android applications. In the latter case, it helps if your Android application is structured so you can turn a section of an application into a standalone test case. Since many  Android applications have background tasks, often implemented as subclasses of AsyncTask, this is usually easy to do.

The tool I prefer to use is HPROF. It is a simple profiling tool that is available with the Java SDK.  It provides a variety of tools so you can address common performance problems, like the following: (1) you are allocating too many objects, which leads to too much garbage collection; (2) you are allocating objects and not properly releasing them, leading to a memory leak; (3) there are long delays for the user because your computations are taking too much CPU time.

The topics I cover in this article include:

  • Getting started with HPROF in Eclipse
  • HPROF Results: cpu times, object allocation
  • Performance tuning suggestions
  • More on Android performance tuning
In the last section, I elaborate on why I prefer HPROF over some of the other Android performance tools.

1 Getting Started With HPROF in Eclipse

This section presents an overview of the steps I suggest for performance tuning Java programs and Android applications using HPROF.

To get started, identify a section of your application for which you’d like to check performance. If you don’t already have standalone test program for that part of your application, create one. If you have a Junit test case, you can start with that and turn it into a program you can run.

public static void main (String args []) {
    String result = testPerformance ();
    System.out.println ("Performance test result: " + result); 
}

Be sure that your test case runs correctly and gives you some indication that it is working. Something as simple as printing something to standard output, as shown in the code example above, will suffice. You are about to start a period of performance tuning work that could extend over hours or even days. During that, you will make lots of changes, some small, some not. Having some output displayed lets you make changes with some confidence that you have not introduced an error while working on improving performance.

Set up Eclipse to run the application with the HPROF performance tool. That is easy to do.  Simply edit the Run Configuration for your program. There is a section on that form for adding VM arguments. Enter “-Xrunhprof:heap=sites”,  as shown in Figure 1 below. Then click the “Apply” button so the changes are saved and you are ready to run.

Figure 1 – HPROF Run Configuration  (click to enlarge)

It’s a good idea to start with a check of object allocation. It runs without slowing down your application very much. So you won’t have to wait long to know if you have everything for HPROF set up correctly. Run the application, and then look for the HPROF output  in a file named “java.hprof.txt” file in the project folder inside your Eclipse workspace.

The HPROF output file has a lot of information. It starts out with a bit of explanation. That section is followed by a long section of call traces that point to spots in your code. It is best to skip over all of that and locate the summary data, which is near the bottom of the file. That section begins with “SITES BEGIN”. For “runhprof:heap=sites”, the summary part looks like the output in Figure 2.

Sample Output for HPROF

Figure 2 – HPROF heap=sites summary section
(click to enlarge)

The output tells you what kinds of objects your application is allocating. It shows the numbers objects, how many bytes, and percentages for each allocation site. In HPROF, the place in your code where allocation occurred is called a “site”. For each site, you know what type of object was being allocated and where in the code those objects were allocated. The code location is assigned a number  that you can use to look up trace information in the first part of the profile results file. The code location number is the value in the “Trace” column. For example, the number one item on the list is the allocation of PossibleMove objects. The stack trace for that is trace 303047. For this run of HPROF, 303047 looks like the lines below. It indicates the constructor in the PossibleMove class.

TRACE 303047:

	com.lahti.game.Operator.<init>(Operator.java:25)
	com.lahti.game.GameMove.<init>(GameMove.java:25)
	com.lahti.game.gomoku.Move.<init>(Move.java:41)
	dev.wgl.gomoku.PossibleMove.<init>(PossibleMove.java:54)

Item #11 from Figure 2 shows where in the code all the Point objects are being allocated. Method newMove calls evalMove, which calls squareIdToPoint, which calls the constructor for Point. Having the line numbers helps you can find the exact line where the objects were allocated.

TRACE 303050:
	com.lahti.game.Point.<init>(Point.java:24)
	com.lahti.game.gomoku.Board.squareIdToPoint(Board.java:583)
	com.lahti.game.gomoku.Evaluation.evalMove(Evaluation.java:345)
	dev.wgl.gomoku.MinimaxPlayer.newMove(MinimaxPlayer.java:949)

Later in this article, I will say more about HPROF results related to object allocation. Let’s take a quick look at how you set up for analysis of  CPU times for your application.

For cpu time, there are two ways to use HPROF: one based on sampling and the other based on measuring more precisely how much time is spent in each method. Sampling can be used without much overhead. Measuring CPU times takes much, much longer. To set this up, set the VM Arguments section of your test application’s Run Configuration to one of the following:

-Xrunhprof:cpu=samples

or

-Xrunhprof:cpu=times

You will likely run the first of these to get a rough idea of where the most time is being spent. Getting HPROF to collect samples is reasonably fast, like the “heap=sites” option. The other option “cpu=times” takes a long time to run. Be prepared to wait on this one. If you have a test case that takes 5 seconds to run, HPROF might take much longer. (Some of my own tests are well over 20 times slower. )

Output for cpu times is like that for heap sites, except that the summary at the bottom shows where time is being spent. You can see which methods account for the most time and how many times they are called. As with “heap=sites”, the “Trace” column points you to additional information in the results file that indicates where the method was called from.

HPROF output for cpu times

Figure 3 – HPROF cpu times summary

Once you have a performance test case and you know how to set up the different types of profiling (heap sites, cpu samples, cpu times), your performance tuning can begin in earnest. You will need to be comfortable with reading the output, locating the code sections within your application, and understanding the logic and algorithms that are being used. More guidance on these topics is provided in Section 2 of this article.

1.1 Considerations for Android Applications

For Android applications, you follow essentially the same steps. The first thing is to write a separate class that you can run as a Java application, independent of the Android emulator or device, Depending on how you’ve structured your application, this first step might require a bit of work on your part because you already have a set of Android Activity objects doing the work of your application.  For many of our apps, we already are using background tasks and services because we know that if we leave those running in the UI thread, our app can become unresponsive. In those cases, it is going to be pretty easy to come up with a standalone test application.

For a class defined as a subclass of AsyncTask, all you need to do is convert the doInBackground method to a main program in a standalone class.

protected Object doInBackground(Object... args) {
    ...                   // argument handling code 
    step1 (arg1, ...);    // code to execute in background task
    step2 (arg2, ...);
    step3 (arg3, ...);
}

After converting to an application, you would have something like:

public static void main (String args []) {
    arg1 = ...            // argument handling code 
    arg2 = ...
    ... 
    step1 (arg1, ...);    // code to execute in background task
    step2 (arg2, ...);
    step3 (arg3, ...);
    System.out.println ("something to show it is working");
}

Be sure to write something to standard output so you will know if you accidentally introduce an error. Once you have done that and have verified that the new application runs correctly, you are ready to modify the Run Configuration in Eclipse, as explained in the preceding section. Then run the performance test and start analyzing the results

2 HPROF Results

In the following sections, I have a few examples that illustrate how to read and interpret the results. They serve as an introduction to HPROF results. For detailed information about HPROF and its results, see the following references:

2.1 Object Allocation

Allocation Surprises

Every time I do performance tuning work, I am surprised. There are always one or two places where large numbers of objects are being allocated that I never would have imagined could be a problem. Usually, my first clue is that there are lots of garbage collection messages, right about the time that the application feels sluggish. Garbage collection messages are pretty easy to see while your Android application is running. Just switch to the DDMS view in Eclipse and watch the Logcat window as you run your application, either using the Emulator or on the actual device. The situation previously noted  in Figure 1 (above) was exactly this kind of problem.

          percent          live          alloc'ed  stack class
 rank   self  accum     bytes objs     bytes  objs trace name
   11  3.25% 55.46%    580544 36284    616656 38541 303050 com.lahti.game.Point
   12  3.25% 58.71%    580528 36283    616656 38541 303056 com.lahti.game.Point

TRACE 303050:
	com.lahti.game.Point.<init>(Point.java:24)
	com.lahti.game.gomoku.Board.squareIdToPoint(Board.java:583)
	com.lahti.game.gomoku.Evaluation.evalMove(Evaluation.java:345)
	dev.wgl.gomoku.MinimaxPlayer.newMove(MinimaxPlayer.java:949)

I was very surprised at the large number of Point objects listed as item 11. Over 38,000 Point objects were allocated, for a total of 600K bytes. Upon reflection on what the trace information (303050) was telling me, I understood. It turned out that there were integer square numbers for a game board being converted to X-Y coordinates. Every time a square number was converted, a new Point object was allocated. What I did to address this was to allocate a single Point object and save it as a temporary object inside the Board class.  That temporary object was used over and over gain, rather than allocating new ones. All those objects immediately dropped out on the next run of the performance test. I did something similar for the second spot in the code that was allocating Point objects. Even though the total amount of space was relatively small (6.5% of the total), if you find several places, like that, the savings in allocated objects adds up quickly.

Here are a couple of code fragments to illustrate what I did.  The original code looked like this, with the allocation of the Point objects being done in the squareIdToPoint method.

while (looping) {
  ...
  Point p = board.squareIdToPoint (sq);
  ...
  // Do something with the x and y parts of the point
  ...
}

To move the allocation out of the loop, I replaced the squareIdToPoint method with another method that used the same logic but updated an existing point rather than allocating a new one. Then I created a temporary point object and used it over and over again in the loop. The new code looks like:

Point p = getTempPoint ();
while (looping) {
  ...
  board.squareIdToPoint2 (sq, p);
  ...
  // Do something with the x and y parts of the point
  ...
}
...
public void squareIdToPoint2 (int sq, Point pt)
{
   ...   // calculations for x, y
   pt.x = x;
   pt.y = y;
}

In my case, I chose to make the TempPoint be an instance variable of my class and accessed it with getTempPoint. In most cases, you can get by with simply allocating a temporary point right outside the loop.

Point p = new Point (-1, -1);
while (looping) {
...
}

Reuse Objects Rather Than Allocate Them

I had another situation where several thousand objects were being allocated, used for their intended purpose, and then discarded. This was inside a loop. The end result was a lot of objects being generated each time through the loop. The garbage collector handled the clean up of temporary objects just fine, but it still had to do work to do so, and that showed up in overall slow elapsed time. I found that if I did a simple memory manager object that supported recycling the objects for that specific class, I greatly reduced the amount of time spent in garbage collection.

Then all I had to do was make sure I released the memory manager object, which was holding on to those reusable objects, when the code broke out of the loop. To give me a more accurate view of what was allocated and was already taken care of by the garbage collector, I called “system.gc ()” at the end of the performance test code. That resulted in a line in the HPROF summary table that looked like this:

SITES BEGIN (ordered by live bytes) Tue Jul 19 08:06:51 2011
          percent          live          alloc'ed  stack class
 rank   self  accum     bytes objs     bytes  objs trace name
    9  2.87% 45.77%     64000 1000     64000  1000 302297 com.lahti.tree.Node

Notice that there are two columns of  “live” bytes and two columns  of “alloc’ed” (allocated). The first pair shows what objects were still around at the end of the application. The second pair shows the total allocated during the test run. In my case, the numbers matched. From this, I learned that my memory manager was working. I was recycling the same 1000 objects over and over again, but I did still had to arrange to release the objects it held at the end of the application.

The code changes I made were fairly easy to do. Most of the work is in the memory manager object. I started with code that looked like the following:

while (looping) {
   ...
   Node n = new Node ();
   ...
   // Work with n
   ...
}

When the memory manager was in place, the call to “new Node” is replaced with a call to the memory manager, giving it the responsibility for coming up with a Node object ready for use. Whether that object was allocated or recycled didn’t matter to the caller. The code looked like:

NodeManager m = new NodeManager (1000);
while (looping) {
   ...
   Node n = m.newNode ();
   ...
   // Work with n
   ...
}

I also found that I had methods calling one another that created the equivalent of two nested loops. So I had to recycle the nodes at the end of the inner loop. As long as I chose a value large enough for the number of objects to pre-allocate, it worked fine.

NodeManager m = new NodeManager (1000);
while (looping) {
   ...
   while (moreLooping) {
      ..
      Node n = m.newNode ();
      collection.add (n);
      ...
      // Work with n
   }
   // Tell node manager that nodes are free to be reused.
   recycleNodes (collection);
   ...
}

So in this example, I figured out how to identify a place where a custom memory manager worked better than simply relying on the Java garbage collector. I also avoided introducing a memory leak.

2.2 Cpu times

Unnecessary Method Calls

An example for HPROF “cpu=times” is given below. It shows the time spent in individual methods, starting with the ones that use the most  cpu time. The column “self” shows the percentage of cpu time for the method itself. The column “accum” shows the cumulative total for the current row and all preceding rows. The “count” column shows how many times each method is called. In the output below, note that the first two methods account for about 44% of the total time. That numbers is the sum of the “self” percentages of rows one and two. That’s 26% plus 18%.

CPU TIME (ms) BEGIN (total = 206770) Fri Jul 15 07:03:29 2011
rank   self  accum   count trace method
   1 26.23% 26.23%  575160 306454 com.lahti.g.Line.checkBoard
   2 18.17% 44.41% 4026120 306447 com.lahti.g.Board.safeCheckPiece
   3  9.39% 53.80%   28758 306475 com.lahti.g2.EvaluationL2x.evalMove
   4  5.81% 59.61% 4026120 306446 com.lahti.g.Board.squareId
   5  4.73% 64.34% 3194498 306451 com.lahti.g.Board.getBoardArray
   6  4.56% 68.90% 3194498 306450 com.lahti.g.Board.getSquareMaxId
   7  3.07% 71.97%  337192 306470 com.lahti.g.LinePatternTable.encodeAsInt
   8  1.77% 73.74%  337192 306461 com.lahti.g.Evaluation.getCountForLine
   9  1.76% 75.50%  337136 306465 com.lahti.g2.EvaluationL2x.getLineKind
  10  0.96% 76.46%  170850 306474 com.lahti.g2.EvaluationL2x.getIntersectionScore
  11  0.86% 77.32%  604106 306458 com.lahti.g.Evaluation.getDebug
  12  0.86% 78.18%  575160 306449 com.lahti.g.Line.setEvalPt
  13  0.85% 79.03%  575160 306452 com.lahti.g.Line.setValueAfter
  14  0.85% 79.88%  575160 306457 com.lahti.g.Line.setEval
  15  0.84% 80.72%  575160 306453 com.lahti.g.Line.setOnBoard
  16  0.83% 81.55%  575160 306442 com.lahti.g.Line.getValuesArray
  17  0.82% 82.37%  575160 306456 com.lahti.g.Line.setLineKind
  18  0.82% 83.19%  575160 306443 com.lahti.g.Line.getXPointsArray
  19  0.82% 84.01%  575160 306441 com.lahti.g.Line.setBoard
  20  0.81% 84.82%  575160 306445 com.lahti.g.Line.getEvalIndex

As was the case with the HPROF analysis of object allocation, there was a lot here that I found surprising. The first was the very large number of calls to the #2 item: a method named safeCheckPiece. It was being called over 4 million times, as was the #4 item, method squareId. That was unexpected and it led me to reconsider the algorithm I was using.

The second thing that was surprising was that there so many “get” and “set” methods on the list (items 8-20). I tend to like access methods for object instance variables because it helps me keep a cleaner separation between different classes. For example:

private int [] fieldValuesArray;

public int [] getValuesArray ()
{
	if (fieldValuesArray == null) {
		setValuesArray (new int [5]); }
   return fieldValuesArray;
}

However, in this case, with a combined total of close to 15-20% for get and set methods, I was willing to to rework the code to reference the instance variables directly. To understand this point, you need to know that “accum” is short for “accumulated” so that column is the cumulative total of all the items before. From item 6 at 68.9% to item 20 at 84.8% is about 16%. Most of that time went away once I converted from private to public instance variables and eliminated the calls to the get methods. For the code sample above, with a change from “private” to “public”,  callers changed from doing

a = obj.getValuesArray ();

to

a = obj.fieldValuesArray;

That makes this an interesting example. The performance data got me to consider changing the overall algorithms and also identified a large a set of relatively easy changes involving unnecessary calls to methods. The changes turned out to be pretty extensive. They happened over the course of two or three days. During that, I was very happy I had taken the time to build an easy-to-run test case that included output that reminded me what correct execution looked like. I could move fast and be reasonably confident that I had not introduced any errors while performance turning. (Of course, it’s also good if you have a set of Junit tests to run.)

3 More Thoughts Related to Android

I’m working on an app and as it nears completion, I have found a few places where the performance was not quite what I wanted. It didn’t take long to find the performance tools that are provided with the Android development environment.  I used those tools for awhile but found that my analysis and tuning went faster if I went back to using the HPROF tool available at the Java level.

For me, the Java HPROF tools are just easier, primarily because each run of the test case produces a text file. That makes test results easy to examine and easy to compare two points in time. That helps me be sure that I’m making progress because the files give me before and after output to study and reflect on. There’s also something about the format for cpu timing results that I find easier to understand than what you get with Traceview.

Don’t get me wrong.The TraceView tool in Android is great to have — and I still use it. Sometimes I use it first, to be sure that most of the time is being spent in my process. And I use it to identify specific sections of code that I need to work on. I build a test case for one particular aspect of the performance problem and switch over to Java level testing in Eclipse. Doing it that way gives me a performance test case that I can run over and over again. Like most of the problems we developers face, we’re really good at fixing them once we get to the place where we can reliably reproduce the problem. For me, performance testing is no different.

Learning More About Android TraceView

As with so many other of the articles on this blog, I am sharing tips and suggestions that I have found useful in Android development. I have found that using HPROF for Android performance tuning is a good thing for me, given my past Java experience and the types of apps I am working on. The experiences of others doing Android performance tuning work might lead to different conclusions. I intend to take a closer look at the Traceview tool in the coming months.

4 References

Java HPROF – complete description of the HPROF tool.

Performance Analysis – a tutorial from Oracle on performance analysis.

Profiling with Traceview and dmtracedump – describes how to use Traceview in the Android development environment. Traceview works when you are debugging your application on the emulator or directly on a phone or other device.

Memory Allocation for Android Applications – A good article in the Android Developers’ blog about heap analysis tools. I am particularly interested in trying the tool that lets you compare two heap dumps.


Advertisements

About Bill Lahti

Bill Lahti is a software engineer building mobile applications and knowledge management solutions. Two of his interests are writing for this blog and building Android apps, with strategy games being an area of particular interest.
This entry was posted in Android, Java and tagged , , . Bookmark the permalink.

2 Responses to My Guide to Performance Tuning For Android and Java

  1. Pupetto says:

    Great post, thanks for sharing your findings.

    I personally would like to have more information (code would be super) on the 2 techniques used that improved memory usage of your application, i.e. (a) how to modify code to use a single tmp Point object and (b) the details of the memory manager you implemented to recycle objects in loop.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s