My low-level programming language performance comparison left me wondering how well the automatic garbage collection of some languages really performs. I have run into a few long-lived Java processes that never seemed to return memory back to the operating system. Their resident memory usage only ever seemed to increase with the size of the workload and never really decrease. I have also seen many long-lived C processes with high peak resident memory usage and low idle resident memory usage. Those processes definitely returned memory back to the OS when they were finished with it and I thought that was one of the goals of automatic garbage collection.
I wrote a small C program as a starting point to test this. The allocate() function performs millions of memory allocations, uses that memory for some simple calculations, and then frees all of it. The main function calls allocate(), then waits 10 minutes, then calls allocate() twice in rapid succession, then waits another 10 minutes. While this ran, I logged resident memory usage of the process about 5 times per second. I then ported the program to other libraries and languages for comparison. The main difference is that the garbage-collected languages supposedly don't require explicit frees of allocated memory. The garbage collector is supposed to figure out when allocated memory is no longer needed and free it for you, reducing the burden on the programmer. I decided to set variables to null and clean up anyway to help the garbage collectors as much as possible. There are many different garbage collection algorithms and associated tunable parameters. There is more detailed discussion in the results, in the conclusions, and in each language section.
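The memory logger itself is not shown in this post. As a rough sketch only (not the exact tool I used), resident memory of a running process can be sampled on Linux by reading the VmRSS line from /proc/<pid>/status about 5 times per second:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char path[64], line[256];
    FILE *f;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);

    for (;;) {
        f = fopen(path, "r");
        if (f == NULL)
            break; /* target process has exited */
        while (fgets(line, sizeof(line), f) != NULL) {
            if (strncmp(line, "VmRSS:", 6) == 0) {
                printf("%s", line); /* e.g. "VmRSS:   345678 kB" */
                break;
            }
        }
        fclose(f);
        fflush(stdout);
        usleep(200000); /* about 5 samples per second */
    }
    return 0;
}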
My test system runs Fedora 20 Linux with the latest updates as of January 25, 2015. I had plenty of RAM to avoid swapping. The Makefile at the end shows compile and test execution.
This plot shows C and C++ using pretty standard allocation and free/delete. Notice that glibc did not return all freed memory back to the OS during this test. This behavior is discussed more in the source code section. It retains a large portion of the memory allocated during allocate() for faster reuse. Notice that peak resident memory usage during the double-allocate() event at 610 seconds does not rise much above the first single-allocate() event at 10 seconds. Freed memory is reused immediately and efficiently. Also notice that resident memory usage during the second wait period is pretty much the same as it was during the first wait period. The unused heap memory retained by the program is not unnecessarily large; it is somewhere between minimum and peak. This plot represents what an ideal performance-focused garbage collector should be capable of.
This plot shows the same C and C++ code, but with malloc_trim(0) called immediately after the last free/delete. Notice that peak memory usage is the same for both events and the resident memory usage is minimal during both wait periods. The behavior is discussed more in the source code section. Resident memory usage is minimal at all times, but allocations result in more system calls, which likely had a performance penalty. This plot represents what an ideal resource-focused garbage collector should be capable of.
This plot shows the same Java code compiled with gcj and run natively, and run with OpenJDK java/JVM. Gcj had lower resident memory usage for the entire test. The peak gcj resident memory usage is close to the peak C/C++ memory usage, which indicates good garbage collection and reuse of previously allocated memory. Java was run with no options, so it used the default garbage collector and parameters. Notice the significant jump in resident memory usage at the second allocation event, about double the C/C++ peak. There appears to be no garbage collection and reuse during the rapid double allocation event. Neither approach returned memory to the OS during this test.
This plot shows the same Java code running in java/JVM with various garbage collection options. Notice the UseParallelGC/UseParallelOldGC option performed similarly to having no options, so it may be the default. The other three options had lower peak memory usage, but still well above what should be required, which indicates partial collection and reuse of allocated memory. The UseConcMarkSweepGC option did not return any memory to the OS. Both UseSerialGC and UseParNewGC returned some memory to the OS and had lower resident memory usage during the wait periods.
This plot shows the same Java code running in java/JVM with various garbage collection options. These three garbage collection options all returned significant amounts of memory back to the OS. The UseParNewGC with specified HeapFreeRatio bounds had the lowest java/JVM peak resident memory usage during the double allocation event and the lowest usage during the second wait period, but somewhat average usage during the first wait period. The UseG1GC had high double allocation event usage, but very low usage during the first and second wait periods.
This plot shows Go code compiled natively and JavaScript code running in Node.js. Notice that Go's garbage collector returns unused memory to the OS at timed intervals and it achieves only partial reuse during the double allocation event. Node had higher resident memory usage during the first event, but it achieved excellent reuse during the double allocation event. Strangely, it seemed to return a bunch of memory to the OS right before or during the double allocation event.
This plot shows the same Go and JavaScript code, but with a couple of lines added. The Go program calls debug.FreeOSMemory() after the first and second events. The JavaScript program calls global.gc() after the first and second events. Peak usage for both remains decent, but wait period usage is dramatically reduced. These setups collected all of the garbage and returned it back to the OS. While peak usage is higher here, these graphs are closest to C/C++ with malloc_trim(0). However, I had to suggest garbage collection and/or freeing memory to achieve this result.
This plot shows PHP, Python, and Ruby code. Ruby is the only one of these three with what I would consider decent memory usage. PHP and Python returned a bunch of memory to the OS during the first wait period, but memory usage is still ridiculous. They did not return much memory to the OS during the second wait period after the double allocation event. Peak usage for both events is about the same, which indicates good reuse, but the high levels indicate memory-hungry data structure implementations. Ruby returned a little memory to the OS, but usage stayed close to its relatively low peak for most of the time.
This plot shows the same PHP, Python, and Ruby code, but with a couple explicit garbage collection calls added. PHP usage looks about the same as before. Peak usage for all three looks about the same as before, but wait period usage for Ruby is slightly lower and wait period usage for Python is dramatically reduced.
When I first wrote and tested this program, idle memory usage was nearly as high as peak, which really threw me for a loop. I have written some tests like this before and free() seemed to be doing its job every time. Not this time, however. After reading an interesting forum post and a great article, it is clear that malloc() and free() are glibc or libc library functions that provide an additional layer between C and actual kernel system calls. The additional layer reduces the number of system calls by grouping many small requests into fewer large requests. It also has some built-in intelligence to determine when to actually perform the system calls. For example, a loop that repeatedly allocates 8 bytes, does something with them, and then frees those 8 bytes can be served by reusing a single underlying allocation instead of going to the kernel every iteration. These two approaches together improve performance by reducing system calls. An unfortunate side effect of this strategy is that resident memory usage may be greater than required at any specific time because of the extra/standby/preemptive/free heap space occupied by the program. The behavior of malloc() and free() can be tuned using mallopt(). There are situations when the automatic behavior or the tuning parameters may not be sufficient. In that case, the malloc_trim() function returns free memory in the program heap to the OS. Normally, free() decides when to call malloc_trim().
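As a rough illustration of those knobs, here is a minimal sketch showing mallopt() tuning followed by an explicit malloc_trim(). The threshold values are arbitrary examples, not the settings used for these tests:

#include <malloc.h>
#include <stdlib.h>

int main(void)
{
    void *blocks[100000];
    int i;

    /* Automatically return free space above 128 KB at the top of the heap to the OS. */
    mallopt(M_TRIM_THRESHOLD, 128*1024);
    /* Serve requests of 1 MB or more with mmap() so they are unmapped as soon as they are freed. */
    mallopt(M_MMAP_THRESHOLD, 1024*1024);

    /* Allocate and free many small blocks on the main heap. */
    for (i = 0; i < 100000; i++)
        blocks[i] = malloc(64);
    for (i = 0; i < 100000; i++)
        free(blocks[i]);

    /* Explicitly return whatever free heap space remains to the OS. */
    malloc_trim(0);
    return 0;
}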
For this simple test program, the idle resident memory usage during the wait periods was hundreds of megabytes, probably in anticipation of many more allocation requests. The program wasn't returning all of the freed memory back to the OS; it was saving much of it for faster future library allocations without system calls. As a result, resident memory usage rose to around 300-400MB and stayed relatively constant throughout execution. Adding an explicit call to malloc_trim(0) after the last free() reduced resident memory usage from hundreds of MB to just a couple KB during the wait periods, which brought the resident memory usage down close to the actual program memory usage. I definitely had to dig a bit deeper into this stuff than I planned, but I'm certainly glad I did.
#include <stdio.h>
#include <malloc.h>
#include <unistd.h>

typedef struct {
    int data1;
    int data2;
} my_data_t;

/* Allocate millions of small structures, use them in a simple calculation, then free everything. */
void allocate(void)
{
    my_data_t *my_data = NULL;
    my_data_t **array = NULL;
    int element = 0;
    int list_size = 10000000;
    double sum = 0.0;

    for (element = 0; element < list_size; element++) {
        my_data = (my_data_t*)malloc(sizeof(my_data_t));
        my_data->data1 = element;
        my_data->data2 = element;
        array = (my_data_t**)realloc(array, sizeof(my_data_t*)*(element+1));
        array[element] = my_data;
    }

    for (element = 0; element < list_size; element++) {
        my_data = array[element];
        sum += my_data->data1;
        sum += my_data->data2;
        free(my_data);
    }
    free(array);

#ifdef USE_TRIM
    malloc_trim(0);    /* return freed heap space to the OS */
#endif

    printf("sum %E\n", sum);
}

void waitsec(int sec)
{
    int i;

    for (i = 0; i < sec; i++)
        sleep(1);
}

int main(int argc, char **argv)
{
    waitsec(10);
    allocate();      /* first allocation event */
    waitsec(600);
    allocate();      /* double allocation event */
    allocate();
    waitsec(600);
    return 0;
}
#!/usr/bin/node
function allocate() {
    var element = 0;
    var list_size = 10000000;
    var sum = 0.0;
    var array = new Array();
    var my_data = null;

    // build a large array of small objects
    for (element = 0; element < list_size; element++) {
        my_data = {data1: element, data2: element};
        array.push(my_data);
    }

    // use the objects in a simple calculation
    for (element = 0; element < list_size; element++) {
        my_data = array[element];
        sum += my_data.data1;
        sum += my_data.data2;
    }

    console.log("sum " + sum);
    array = null; // drop the only reference so the objects become garbage
}

function waitsec(sec) {
    var i;
    var sleep = require('sleep');
    for (i = 0; i < sec; i++)
        sleep.sleep(1);
}

waitsec(10);
allocate();
waitsec(600);
allocate();
allocate();
waitsec(600);
#!/usr/bin/node --expose-gc
function allocate() {
    var element = 0;
    var list_size = 10000000;
    var sum = 0.0;
    var array = new Array();
    var my_data = null;

    // build a large array of small objects
    for (element = 0; element < list_size; element++) {
        my_data = {data1: element, data2: element};
        array.push(my_data);
    }

    // use the objects in a simple calculation
    for (element = 0; element < list_size; element++) {
        my_data = array[element];
        sum += my_data.data1;
        sum += my_data.data2;
    }

    console.log("sum " + sum);
    array = null; // drop the only reference so the objects become garbage
}

function waitsec(sec) {
    var i;
    var sleep = require('sleep');
    for (i = 0; i < sec; i++)
        sleep.sleep(1);
}

waitsec(10);
allocate();
global.gc(); // suggest garbage collection
waitsec(600);
allocate();
allocate();
global.gc(); // suggest garbage collection
waitsec(600);