Preface
Because of my previous post about autoboxing I get many search queries about autoboxing performance. Now I’m curious, too. So let’s get ready for some benchmarks.
When we talk about boxing performance, there are always two sides: throughput and memory consumption.
Throughput with Arrays
For micro-benchmarking I use the JMH framework, which I already described in my other post here.
To get a bigger test set, I create an array of 2 million entries. We measure the throughput in operations per second.
```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

private static final int MAX_SIZE = 2_000_000;

@Benchmark
public int[] testPrimitiveIntegers() {
    int[] ints = new int[MAX_SIZE];
    for (int i = 0; i < MAX_SIZE; i++) {
        // no boxing: plain int store
        ints[i] = i;
    }
    return ints;
}

@Benchmark
public Integer[] testIntegerBoxing() {
    Integer[] ints = new Integer[MAX_SIZE];
    for (int i = 0; i < MAX_SIZE; i++) {
        // boxing here -> object instantiation
        ints[i] = i;
    }
    return ints;
}

@Benchmark
public int[] testIntegerInclUnboxing() {
    int[] ints = new int[MAX_SIZE];
    for (int i = 0; i < MAX_SIZE; i++) {
        // manual boxing here, followed by automatic unboxing
        ints[i] = Integer.valueOf(i);
    }
    return ints;
}

public static void main(String[] args) throws RunnerException {
    Options opt = new OptionsBuilder()
            .include(IntegerBoxingArray.class.getSimpleName())
            .warmupIterations(15)
            .measurementIterations(10)
            .forks(1)
            .shouldDoGC(true)
            .build();
    new Runner(opt).run();
}
```
The result (ops/s, higher is better)
| Benchmark | Score | Error |
|---|---|---|
| testIntegerBoxing | 176 919 | ± 5 843 |
| testIntegerInclUnboxing | 193 915 | ± 4 597 |
| testPrimitiveIntegers | 521 617 | ± 5 347 |
Primitives seem to be 2-3 times faster than their Integer equivalent. But wait a minute: why is boxing plus unboxing faster than boxing alone? The reason seems to be the handling of the Integer[] as array type and return value — testIntegerInclUnboxing works on a plain int[].
Still, we can create nearly 200 000 arrays of 2 million entries per second using autoboxing and Integer arrays! But we are very deep in the basic operations here, where a few milliseconds can have a big impact on overall application performance.
Now I’ve done the same with the byte data type. The result is the following:
| Benchmark | Score | Error |
|---|---|---|
| testByteBoxing | 398 484 | ± 3 886 |
| testByteInclUnboxing | 1 029 260 | ± 7 629 |
| testPrimitiveByte | 1 025 612 | ± 10 827 |
That seems even weirder. The performance loss here comes only from handling a Byte[] and returning it instead of a byte[]. Boxing itself carries no penalty here, because all 256 possible Byte values are cached internally by the JVM. The same value range (-128 to 127) is cached for the other integral wrapper types, too.
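The caching claim is easy to verify. A minimal sketch (the class name ByteCacheDemo is my own): boxing the same byte value twice always yields the identical cached instance, so the reference comparison holds over the whole value range.

```java
// Every possible byte value is served from the JVM-internal Byte cache,
// so autoboxing a byte never allocates a new object.
public class ByteCacheDemo {
    public static void main(String[] args) {
        for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
            byte b = (byte) i;
            Byte first = b;                // autoboxing compiles to Byte.valueOf(b)
            Byte second = Byte.valueOf(b); // manual boxing
            if (first != second) {         // reference comparison on purpose
                throw new AssertionError("uncached Byte value: " + b);
            }
        }
        System.out.println("all 256 Byte values are cached");
    }
}
```

This is also why testByteBoxing only pays for the Byte[] handling, not for object instantiation.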
Memory consumption
Explanation of WORD
The memory consumption of objects and data types in the JVM is pretty straightforward. The most important factor is the architecture data type “WORD“. The WORD size depends on the system architecture and operating system; on most modern PCs it is currently either 32 or 64 bit. It defines the natural data size the CPU can handle in one step.
Wrapper Types
Boolean, Byte
If autoboxing (or a manual valueOf) is used, there is no need to create a new instance, because these values are already cached in the VM — every JVM is required to cache them.
But we still need a pointer (an “oop”) to the instance.
32bit: 4 byte pointer
64bit: 4 or 8 byte pointer (see Hotspot enhancements)
Short, Integer, Character, Float
-128 to 127 is cached; Character is unsigned, so only 0 to 127 is cached. Float and Double values have no cache.
Everything else always yields a new instance. A Short or Character variable needs 4/8 bytes as well, because it has to fit the word size (padding). The same applies to int in a 64-bit environment.
32bit: 4 bytes pointer + 8 bytes instance header + 4 bytes content = 16 bytes
64bit: 4/8 bytes pointer + 16 bytes instance header + 8 bytes content = 28/32 bytes
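The cache boundary described above can be demonstrated with reference comparisons. A minimal sketch (the class name IntegerCacheDemo is my own); the results for values outside the cache assume a default JVM without a raised -XX:AutoBoxCacheMax:

```java
// Values -128..127 come from the Integer cache; anything above 127
// is a fresh instance on a default JVM. Float has no cache at all.
public class IntegerCacheDemo {
    public static void main(String[] args) {
        Integer a = Integer.valueOf(127);
        Integer b = Integer.valueOf(127);
        Integer c = Integer.valueOf(128);
        Integer d = Integer.valueOf(128);
        System.out.println(a == b); // true  -> same cached instance
        System.out.println(c == d); // false -> two new instances (default JVM)
        // typically false: Float.valueOf allocates a new instance every time
        System.out.println(Float.valueOf(1.0f) == Float.valueOf(1.0f));
    }
}
```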
Long, Double
32bit: 4 bytes pointer + 8 bytes instance header + 8 bytes content = 24 bytes
64bit: 4/8 bytes pointer + 16 bytes instance header + 8 bytes content = 28/32 bytes
Primitives
All data types occupy at least the word size. Therefore…
boolean, byte, short, char, int, float
32bit: 4 byte
64bit: 8 byte
long, double
32bit, 64bit: 8 byte
primitives in arrays
In arrays, primitives retain their individual size, but the array as a whole still has to be padded to the WORD size.
Some examples:
boolean[299]: 299 * 1 + 12/24 header + 4/8 pointer = 315/331 bytes + padding = 316/336 bytes
int[123]: 123 * 4 + 12/24 header + 4/8 pointer = 508/524 bytes + padding = 508/528 bytes
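The two examples above can be reproduced with a small arithmetic sketch (class and method names are my own, and the header/pointer figures are taken from the article):

```java
// Array footprint per the article's formula: element bytes + array header
// + the pointer to the array, rounded up to the word size.
public class ArrayFootprint {
    static long pad(long bytes, long word) {
        return ((bytes + word - 1) / word) * word; // round up to word size
    }
    static long footprint(long length, long elemSize, long header, long pointer, long word) {
        return pad(length * elemSize + header + pointer, word);
    }
    public static void main(String[] args) {
        // boolean[299] on 32 bit: 299 * 1 + 12 + 4 = 315 -> padded to 316
        System.out.println(footprint(299, 1, 12, 4, 4)); // 316
        // boolean[299] on 64 bit: 299 * 1 + 24 + 8 = 331 -> padded to 336
        System.out.println(footprint(299, 1, 24, 8, 8)); // 336
        // int[123] on 32 bit: 123 * 4 + 12 + 4 = 508 (already aligned)
        System.out.println(footprint(123, 4, 12, 4, 4)); // 508
        // int[123] on 64 bit: 123 * 4 + 24 + 8 = 524 -> padded to 528
        System.out.println(footprint(123, 4, 24, 8, 8)); // 528
    }
}
```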
object sizes
Of course, object sizes have to be padded to a multiple of 4/8 bytes. But internally, attributes can retain their sizes. That means primitives keep their original 1 to 8 bytes, and object references are probably reduced to 4-byte pointers on 64-bit systems (see Hotspot enhancements).
For example:
```java
int id;
String name;
Date lastLogin;
boolean enabled;
short balance;
```
Looking at the flat memory consumption (ignoring the referenced objects themselves), we have the following:
32bit: 4 (id) + 4 (name) + 4 (lastLogin) + 1 (enabled) + 2 (balance) = 15 + padding = 16
most 64bit: 16 (with 4-byte compressed pointers)
rare 64bit: 4 + 8 + 8 + 1 + 2 = 23 + padding = 24
All with wrappers instead of primitives (counting only the pointers):
32 bit: 5 * 4 = 20 + padding = 24
most 64 bit: 24
rare 64 bit: 5 * 8 = 40
Hotspot enhancements
Compressed Oops
This feature reduces pointers on 64-bit systems from 8 bytes to 4 bytes if the heap is smaller than 32 GB. This works because the pointer no longer holds a raw hardware address; instead it is an offset from the start of the heap space, and it addresses objects rather than individual bytes.
Conclusion
Boxing is still quite fast. I suggest using primitives for mandatory attributes in an object, for heavy CPU calculations, and as a last resort in bottleneck fights.
I wouldn't recommend using primitives for optional attributes, because that creates evil magic numbers. Of course, in very rare cases where every tiny bit of performance counts, this might be needed, too.
As Donald Knuth put it: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”
The difference on a higher scale
Let’s say we have a JEE application with 5 million instances where the mandatory id is an Integer instead of a primitive int.
Assuming we have a 64-bit system with 4-byte pointers:
Integer: 28 bytes * 5 million = 140 000 000 bytes
int: 4 bytes * 5 million = 20 000 000 bytes
That makes a difference of 120 MB we can save.
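The arithmetic behind those 120 MB, as a quick sketch (class name BoxingOverhead is my own, and the 28-byte figure is the article's 64-bit Integer footprint):

```java
// 5 million ids stored as Integer (28 bytes each, per the article's
// 64-bit figures) versus plain int (4 bytes each).
public class BoxingOverhead {
    public static void main(String[] args) {
        long instances = 5_000_000L;
        long integerBytes = 28 * instances; // 140 000 000 bytes
        long intBytes = 4 * instances;      //  20 000 000 bytes
        long savedMB = (integerBytes - intBytes) / 1_000_000; // 120
        System.out.println(savedMB + " MB saved by using int");
    }
}
```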
Computer setup
CPU: 4 core Intel(R) Core(TM) i5-4670K CPU @ 3.40GHz
Java: Java 8 update 25 64bit
OS: Linux Mint 17.1 64bit
Kernel: 3.13.0-37-generic
IDE used: Eclipse Luna
testPrimitiveIntegers runs 2-3 times faster than the other tests because of flawed benchmarks.
You forgot about profile-guided optimizations (http://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_12_Forking.java).
Try running the tests one by one instead of all three together, and you will see a totally different result.
Also, do not forget about loop unrolling (http://hg.openjdk.java.net/code-tools/jmh/file/183e50c96c54/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_11_Loops.java).
You are right that looking only at the boxing would give different results.
The reason why I used loops is also to show that handling objects is far more time-consuming than handling primitives, even if it is not solely the boxing.