
Programming Language Performance Comparison

Synopsis
Introduction
Results Table
Conclusions
Feedback Update 2021/03
test.c (C)
test.cpp (C++)
test.java (Java)
test.adb (Ada)
test.go (Go)
Python
test.js (JavaScript)
test.pl (Perl)
test.rb (Ruby)
test.pas (Pascal)
test.lua (Lua)
test.php (PHP)
Makefile

Synopsis

My programming language comparison involves the performance of: C, C++, Java, Ada, Go, Python (including NumPy and Cython variants), JavaScript, Perl, Ruby, Pascal, Lua, and PHP. I am not comparing: development speed or advanced language features.

Introduction

I'd like to thank my brother, Kevin Selwyn, for comparing two simple programs, written in C and Go, which sparked my interest in this. I had wanted to do a programming language performance comparison like this for quite some time, since I come across various programming languages at work and hear about many more. Each has strengths and weaknesses, and sometimes the language chosen is not the most efficient for the task.

My goal here is to compare the performance of some relatively low level operations, not development speed or advanced language features. A secondary goal is to get some minimal exposure to some languages I'm not familiar with. The test program I came up with dynamically allocates some heap memory once, initializes the array/vector, and then performs many 32-bit integer and 64-bit floating point calculations using that array. The algorithm does not calculate anything useful and it does not measure the performance of any single operation. It is just a low-level workload that should be easy to port to various languages. This comparison is most relevant to programs dominated by user process time, not kernel/system process time. In other words, it emphasizes calculations and memory access over system calls like memory allocation, disk I/O, or network communication.

I'm most familiar with C, so it is my benchmark and the language I chose to start with. The program takes a single command-line argument, which is the number of iterations over the whole inner calculation loop. For example, 'test-c 100'. It is important to note that most of these languages can call statically-typed native-compiled code when performance is important or when you need to leverage existing code. I demonstrated this with Python/Cython, but APIs exist for pretty much all of the languages tested here. Some other examples are the Java Native Interface (JNI) and PHP/Hack/HHVM, and it is also possible to call C library functions from Go and Pascal if you create the right bindings.
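
To illustrate the native-call pattern, here is a minimal sketch, not part of the benchmark, of how the inner workload could be exposed as a C shared library that a higher-level language then calls through its foreign-function interface (for example Python's ctypes); the file name, function name, and build command below are hypothetical.

/* sumkernel.c - hypothetical illustration, not part of the benchmark.
 * Build as a shared library: gcc -O3 -shared -fPIC -o libsumkernel.so sumkernel.c
 * A scripting language can then load it through its FFI, for example
 * Python: ctypes.CDLL("./libsumkernel.so").sum_kernel(...)
 */
double sum_kernel(const double *array, int array_length, int iterations) {
    double sum = 0.0;
    for (int iteration = 0; iteration < iterations; iteration++)
        for (int innerloop = 0; innerloop < 1000000000; innerloop++)
            sum += array[(iteration + innerloop) % array_length];
    return sum;
}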

My test system is Fedora 20 Linux with the latest updates as of January 14, 2015. I had plenty of RAM to avoid swapping. The Makefile at the end shows compile and test execution.

Results Table


Conclusions

The results table shows everything, but I'll highlight some observations anyway. C is the highest-performance language I tested. It's also probably the lowest-level, most capable, and most difficult to write safe code in. I was surprised to find that clang/clang++ generated faster executables from the same C/C++ source code than gcc/g++. The next highest performers are C++, native-compiled Java/gcj, and then Ada. Native-compiled Java/gcj is the fastest language I tested with automatic garbage collection.

The execution time of Java/javac/JVM is deceiving because it maxed out two CPU cores when everything else maxed out only a single CPU core. So, even though it finished a little faster than Cython, it used about twice as many CPU resources to do it. In terms of efficiency, it probably performs somewhere between Cython and Pascal or Go.

The performance of the scripting languages is obviously lower than that of the native executables. Dynamic typing, interpretation, garbage collection, and object-oriented array implementations are probable culprits for many. The notable standout among the non-pre-compiled scripting languages is JavaScript running on Node.js. It isn't as fast as Java bytecode running on the JVM, but it is significantly faster than the rest of the scripting-style languages I tested.

The highest performing languages also produced the smallest native executables. Pascal and Go produced seemingly very bloated executables. Many of the scripts and bytecodes were very small, but the bytecode-native combination of Python/Cython ended up somewhere between the efficient native executables and Pascal.

Resident memory usage was all over the map. The memory-efficient languages were in the 0.8GB range. As expected, many of the scripting languages were less memory efficient because of all the automatic behind-the-scenes work they do and, in many cases, inefficient array implementations. The two worst memory performers were Perl at 6GB and PHP at 14GB.

I'm certainly not an expert in all of these languages. It is entirely possible, maybe even probable, that I overlooked some language features that would make for more efficient implementations. I did my best with the time I spent; for example, I discovered a performance improvement in Ruby by converting the array to a vector. If I overlooked something similar anywhere else, please let me know and I'll try to update the results.

Feedback Update 2021/03

A person named Isaac reached out with some interesting observations. He noted that some relatively minor modifications to the code can have a large impact on the results. First, adding the final keyword to the Java code significantly improves JVM performance. Second, initializing with a variable instead of a constant significantly decreases C performance. He posted some of his results and observations on Hacker News. Both of these effects are related to compiler optimization.
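
For the second point, here is a minimal sketch of the kind of change Isaac describes, as I read it, rather than his exact code: with no extra argument the array is filled from the loop index exactly as in test.c, while a hypothetical third argument seeds the fill values from runtime input, so the optimizer can assume less about the array contents when compiling the summation loops.

#include <stdio.h>
#include <stdlib.h>
/* Sketch only: compare constant-pattern initialization (as in test.c)
 * against initialization that depends on a runtime value. */
int main(int argc, char **argv) {
    int array_length = 100000000;
    int iterations = (argc > 1) ? atoi(argv[1]) : 0;
    double seed = (argc > 2) ? atof(argv[2]) : 0.0;   /* hypothetical runtime seed */
    double *array = malloc(array_length * sizeof(double));
    double sum = 0.0;
    for (int element = 0; element < array_length; element++)
        array[element] = (argc > 2) ? element * seed : element;
    for (int iteration = 0; iteration < iterations; iteration++)
        for (int innerloop = 0; innerloop < 1000000000; innerloop++)
            sum += array[(iteration + innerloop) % array_length];
    printf("sum %E\n", sum);
    free(array);
    return 0;
}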

He also didn't see the same CPU usage I did. He measured roughly one core per JVM during the test, whereas my testing showed roughly two cores per JVM. I don't know anything about his setup or how it differs from my 2015 setup. The important takeaway is that the JVM is multi-threaded even if your Java code isn't. Whether that has an impact will depend on your environment and your workload.

It's also worth noting that there have likely been improvements across the board since my original test. Hardware, OS, compilers, runtimes, support libraries, etc. have all likely improved.

test.c (C)

#include <stdio.h>
#include <malloc.h>
#include <stdlib.h>
int main(int argc, char **argv) {
    int element = 0;
    int iteration = 0;
    int iterations = 0;
    int innerloop = 0;
    double sum = 0.0;
    int array_length = 100000000;
    double *array = (double*)malloc(array_length * sizeof(double));
    if (argc > 1)
        iterations = atoi(argv[1]);
    printf("iterations %d\n", iterations);
    for (element = 0; element < array_length; element++)
        array[element] = element;
    for (iteration = 0; iteration < iterations; iteration++)
        for (innerloop = 0; innerloop < 1000000000; innerloop++)
            sum += array[(iteration + innerloop) % array_length];
    printf("sum %E\n", sum);
    free(array);
    array = NULL;
    return 0;
}

test.cpp (C++)

#include <iostream>
#include <cstdlib>
using namespace std;
int main(int argc, char **argv) {
    int element = 0;
    int iteration = 0;
    int iterations = 0;
    int innerloop = 0;
    double sum = 0;
    int array_length = 100000000;
    double *array = new double[array_length];
    if (argc > 1)
        iterations = atoi(argv[1]);
    cout << "iterations " << iterations << endl;;
    for (element = 0; element < array_length; element++)
        array[element] = element;
    for (iteration = 0; iteration < iterations; iteration++)
        for (innerloop = 0; innerloop < 1000000000; innerloop++)
            sum += array[(iteration + innerloop) % array_length];
    cout << "sum " << sum << endl;
    delete[] array;
    array = NULL;
    return 0;
}

test.java (Java)

This Java code can be compiled and executed two different ways. It can be compiled natively with gcj and then executed directly, or it can be compiled into bytecode with javac and then executed in the Java Virtual Machine (JVM) using java.
public class test {
    public static void main(String[] args) {
        int element = 0;
        int iteration = 0;
        int iterations = 0;
        int innerloop = 0;
        double sum = 0.0;
        int array_length = 100000000;
        double[] array = new double[array_length];
        if (args.length > 0)
            iterations = Integer.parseInt(args[0]);
        System.out.println("iterations " + iterations);
        for (element = 0; element < array_length; element++)
            array[element] = element;
        for (iteration = 0; iteration < iterations; iteration++)
            for (innerloop = 0; innerloop < 1000000000; innerloop++)
                sum += array[(iteration + innerloop) % array_length];
        System.out.println("sum " + sum);
        array = null;
    }
}

test.adb (Ada)

with Ada.Text_IO, Ada.Command_Line, Ada.Unchecked_Deallocation;
use Ada.Text_IO, Ada.Command_Line;
procedure test is
    element : Integer := 0;
    iteration : Integer := 0;
    iterations : Integer := 0;
    innerloop : Integer := 0;
    sum : Standard.Long_Float := 0.0;
    array_length : Integer := 100000000;
    type vector is array (0..array_length) of Standard.Long_Float;
    type vector_access is access vector;
    procedure free_vector is new Ada.Unchecked_Deallocation
        (Object => vector, Name => vector_access);
begin
    declare
        test_array : vector_access := new vector;
    begin
        if Argument_Count > 0 then
            iterations := Integer'Value(Argument(1));
        end if;
        Put_Line("iterations " & Integer'Image(iterations));
        while element < array_length loop
            test_array(element) := Standard.Long_Float(element);
            element := element + 1;
        end loop;
        while iteration < iterations loop
            innerloop := 0;
            while innerloop < 1000000000 loop
                sum := sum + test_array((iteration + innerloop) mod array_length);
                innerloop := innerloop + 1;
            end loop;
            iteration := iteration + 1;
        end loop;
        Put_Line("sum " & Standard.Long_Float'Image(sum));
        free_vector(test_array);
    end;
end test;

test.go (Go)

package main
import "os"
import "fmt"
import "strconv"
func main() {
    var (
        element int = 0
        iteration int = 0
        iterations int = 0
        innerloop int = 0
        sum float64 = 0.0
        array_length int = 100000000
        array []float64 = make([]float64, array_length)
    )
    if len(os.Args) > 1 {
        iterations,_ = strconv.Atoi(os.Args[1])
    }
    fmt.Printf("iterations %d\n", iterations)
    for element = 0; element < array_length; element++ {
        array[element] = float64(element)
    }
    for iteration = 0; iteration < iterations; iteration++ {
        for innerloop = 0; innerloop < 1000000000; innerloop++ {
            sum += array[(iteration + innerloop) % array_length]
        }
    }
    fmt.Printf("sum %E\n", sum)
    array = nil
}

Python

Python normally compiles .py files into .pyc files, which are bytecode files that run in the Python Virtual Machine (PVM). There does not seem to be any way to use a dumb array, just a block of memory with adjacent values, directly from ordinary Python code. A Python list is an array of references to boxed number objects, and although a NumPy array does store its values in one contiguous block, indexing it one element at a time from interpreted Python still creates an object for every access. That per-element overhead probably accounts for the poor performance.

test-python.py (Python)

import sys
import test_python
iterations = 0
if len(sys.argv) > 1:
    iterations = int(sys.argv[1])
test_python.test_python(iterations)

test_python.py (Python)

import sys
def test_python(iterations):
    element = 0
    iteration = 0
    innerloop = 0
    total = float(0.0)
    array_length = 100000000
    array = [i for i in range(array_length)]
    print 'iterations', iterations
    while iteration < iterations:
        innerloop = 0
        while innerloop < 1000000000:
            total += array[(iteration + innerloop) % array_length];
            innerloop += 1
        iteration += 1
    print 'sum', total
    del array

Python w/ Cython Compile

The
NumPy reference documentation has a really good quote that describes how to get the best performance from Python: "Those who want really good performance out of their low level operations should strongly consider directly using the iteration API provided in C, but for those who are not comfortable with C or C++, Cython is a good middle ground with reasonable performance tradeoffs." Basically, use C and not Python when performance matters. Cython converts Python source code to C and then compiles it natively into a shared library; the remainder of the Python program can then call the shared library. This approach results in a mixture of interpreted/PVM and native-compiled code.

I did not convert any of the code to C or Cython-specific constructs here because I'm still testing Python. The Cython .pyx compilation ends up running faster than the identical Python code in an interpreted/PVM .py/.pyc file, but it's still slow.

test-python-cython.py (Python)

import sys
import test_python_cython
iterations = 0
if len(sys.argv) > 1:
    iterations = int(sys.argv[1])
test_python_cython.test_python(iterations)

test_python_cython.pyx (Python)

import sys
def test_python(iterations):
    element = 0
    iteration = 0
    innerloop = 0
    total = float(0.0)
    array_length = 100000000
    array = [i for i in range(array_length)]
    print 'iterations', iterations
    while iteration < iterations:
        innerloop = 0
        while innerloop < 1000000000:
            total += array[(iteration + innerloop) % array_length];
            innerloop += 1
        iteration += 1
    print 'sum', total
    del array

setup_python_cython.py (Python)

from distutils.core import setup
from Cython.Build import cythonize
setup(name = 'test python cython', ext_modules = cythonize("test_python_cython.pyx"))

NumPy

test-numpy.py (Python)

import sys
import test_numpy
iterations = 0
if len(sys.argv) > 1:
    iterations = int(sys.argv[1])
test_numpy.test_python(iterations)

test_numpy.py (Python)

import numpy
import sys
def test_python(iterations):
    element = 0
    iteration = 0
    innerloop = 0
    total = numpy.float64(0.0)
    array_length = 100000000
    array = numpy.zeros(array_length, numpy.float64)
    print 'iterations', iterations
    while element < array_length:
        array[element] = element
        element += 1
    while iteration < iterations:
        innerloop = 0
        while innerloop < 1000000000:
            total += array[(iteration + innerloop) % array_length];
            innerloop += 1
        iteration += 1
    print 'sum', total
    del array

NumPy w/ Cython Compile

I did not convert any of the code to C or Cython-specific constructs here because I'm still testing NumPy. The Cython .pyx compilation ends up running faster than the identical NumPy code in an interpreted/PVM .py/.pyc file, but it's still slow.

test-numpy-cython.py (Python)

import sys
import test_numpy_cython
iterations = 0
if len(sys.argv) > 1:
    iterations = int(sys.argv[1])
test_numpy_cython.test_python(iterations)

test_numpy_cython.pyx (Python)

import numpy
import sys
def test_python(iterations):
    element = 0
    iteration = 0
    innerloop = 0
    total = numpy.float64(0.0)
    array_length = 100000000
    array = numpy.zeros(array_length, numpy.float64)
    print 'iterations', iterations
    while element < array_length:
        array[element] = element
        element += 1
    while iteration < iterations:
        innerloop = 0
        while innerloop < 1000000000:
            total += array[(iteration + innerloop) % array_length];
            innerloop += 1
        iteration += 1
    print 'sum', total
    del array

setup_numpy_cython.py (Python)

from distutils.core import setup
from Cython.Build import cythonize
setup(name = 'test numpy cython', ext_modules = cythonize("test_numpy_cython.pyx"))

Cython w/ Cython Compile

I used Cython-specific code here. I declared variables as the correct C types using Cython's cdef syntax. I used Cython's malloc syntax to create a C-style dumb array instead of a Python or NumPy array. The end result is a function with a mixture of Python and C-like stuff that is orders of magnitude faster than pure Python and NumPy. This results in a mixture of interpreted/PVM and native-compiled code. The speedup likely comes from forcing explicit types and using a simple array with less overhead in the native-compiled code.

test-cython.py (Python)

import sys
import test_cython
iterations = 0
if len(sys.argv) > 1:
    iterations = int(sys.argv[1])
test_cython.test_cython(iterations)

test_cython.pyx (Cython)

import sys
from libc.stdlib cimport malloc, free
def test_cython(iterations):
    cdef int element = 0
    cdef int iteration = 0
    cdef int innerloop = 0
    cdef double total = 0.0
    cdef int array_length = 100000000
    cdef double *array = <double *>malloc(array_length * sizeof(double))
    print 'iterations', iterations
    while element < array_length:
        array[element] = element
        element += 1
    while iteration < iterations:
        innerloop = 0
        while innerloop < 1000000000:
            total += array[(iteration + innerloop) % array_length];
            innerloop += 1
        iteration += 1
    print 'sum', total
    free(array)
    array = NULL

setup_cython.py (Python)

from distutils.core import setup
from Cython.Build import cythonize
setup(name = 'test cython', ext_modules = cythonize("test_cython.pyx"))

test.js (JavaScript)

Traditionally, JavaScript runs in web browsers, but Node.js allows JavaScript to run on servers or on the command line.
#!/usr/bin/node --max-old-space-size=4096

var element = 0;
var iteration = 0;
var iterations = 0;
var innerloop = 0;
var sum = 0.0;
var array_length = 100000000;
var array = new Array(array_length);
var argc = process.argv.length;
if (argc > 2)
    iterations = parseInt(process.argv[2], 10);
console.log("iterations " + iterations);
for (element = 0; element < array_length; element++)
    array[element] = element;
for (iteration = 0; iteration < iterations; iteration++)
    for (innerloop = 0; innerloop < 1000000000; innerloop++)
        sum += array[(iteration + innerloop) % array_length];
console.log("sum " + sum);
array = 0

test.pl (Perl)

#!/usr/bin/perl

$element = 0.0;
$iteration = 0;
$iterations = 0;
$innerloop = 0;
$sum = 0.0;
$array_length = 100000000;
@array = ();
$argc = @ARGV;
if ($argc > 0) {
    $iterations = $ARGV[0];
}
print("iterations $iterations\n");
for ($element = 0.0; $element < 100000000.0; $element++) {
    $array[$element] = $element;
}
for ($iteration = 0; $iteration < $iterations; $iteration++) {
    for ($innerloop = 0; $innerloop < 1000000000; $innerloop++) {
        $sum += $array[($iteration + $innerloop) % $array_length];
    }
}
print("sum $sum\n");
@array = ();

test.rb (Ruby)

#!/usr/bin/ruby

require 'matrix'
element = 0.0
iteration = 0
iterations = 0
innerloop = 0
sum = 0.0
array_length = 100000000
array = Array.new(array_length) {0.0}
vector = [array]
if ARGV[0]
    iterations = ARGV[0].to_i
end
puts "iterations #{iterations}"
for element in 0..array_length-1
    vector[element] = element
end
for iteration in 0..iterations-1
    for innerloop in 0..1000000000-1
        sum = sum + vector[(iteration + innerloop) % array_length];
    end
end
printf("sum %E\n", sum);
array = nil

test.pas (Pascal)

program testpascal;
uses sysutils;
type
    vector = array of double;
var
    element : longint;
    iteration : longint;
    iterations : longint;
    innerloop : longint;
    sum : double;
    array_length : longint;
    my_array : vector;
begin
    element := 0;
    iteration := 0;
    iterations := 0;
    innerloop := 0;
    sum := 0.0;
    array_length := 100000000;
    setlength(my_array, array_length);
    if paramcount > 0 then
        iterations := strtoint(paramstr(1));
    writeln('iterations ', iterations);
    for element := 0 to array_length-1 do
        my_array[element] := element;
    for iteration := 0 to iterations-1 do
        for innerloop := 0 to 1000000000-1 do
            sum := sum + my_array[(iteration + innerloop) mod array_length];
    writeln('sum ', sum);
    my_array := nil;
end.

test.lua (Lua)

#!/usr/bin/lua
element = 0
iteration = 0
iterations = 0
innerloop = 0
sum = 0
array_length = 100000000
array = {}
if #arg > 0 then
    iterations = tonumber(arg[1])
end
print("iterations ", iterations)
for element=0, array_length-1 do
    array[element] = element
end
for iteration=0, iterations-1 do
    for innerloop=0, 1000000000-1 do
        sum = sum + array[((iteration + innerloop) % array_length)]
    end
end
print("sum ", sum)
array = nil

test.php (PHP)

#!/usr/bin/php
<?php
ini_set('memory_limit', '-1');
$element = 0;
$iteration = 0;
$iterations = 0;
$innerloop = 0;
$sum = 0.0;
$array_length = 100000000;
$array[] = 0;
if ( $argc > 1 ) {
    $iterations = $argv[1];
}
fwrite(STDOUT, "iterations ". $iterations . "\n");
for ($element = 1; $element < $array_length; $element++) {
    $array[] = $element;
}
for ($iteration = 0; $iteration < $iterations; $iteration++) {
    for ($innerloop = 0; $innerloop < 1000000000; $innerloop++) {
        $sum = $sum + $array[($iteration + $innerloop) % $array_length];
    }
}
fwrite(STDOUT, "sum ". $sum . "\n");
$array = 0;
?>

Makefile

all:\
test-c-gcc test-c-clang\
test-cpp-g++ test-cpp-clang++\
test-java test.class\
test-go\
test-ada\
test_python.pyc test_python_cython.so\
test_numpy.pyc test_numpy_cython.so\
test_cython.so\
test-pascal

test-c-gcc: test.c
	gcc -O3 -o test-c-gcc test.c
test-c-clang: test.c
	clang -O3 -o test-c-clang test.c

test-cpp-g++: test.cpp
	g++ -O3 -o test-cpp-g++ test.cpp
test-cpp-clang++: test.cpp
	clang++ -O3 -o test-cpp-clang++ test.cpp

test-java: test.java
	gcj -O3 --main=test -o test-java test.java
test.class: test.java
	javac test.java

test-go: test.go
	go build -o test-go test.go

test-ada: test.adb
	gnatmake -O3 -o test-ada test.adb

test_python.pyc: test_python.py test-python.py
	python -m py_compile test_python.py
	python -m py_compile test-python.py
test_python_cython.so: test_python_cython.pyx setup_python_cython.py test-python-cython.py
	python setup_python_cython.py build_ext --inplace
	python -m py_compile test-python-cython.py

test_numpy.pyc: test_numpy.py test-numpy.py
	python -m py_compile test_numpy.py
	python -m py_compile test-numpy.py
test_numpy_cython.so: test_numpy_cython.pyx setup_numpy_cython.py test-numpy-cython.py
	python setup_numpy_cython.py build_ext --inplace
	python -m py_compile test-numpy-cython.py

test_cython.so: test_cython.pyx setup_cython.py test-cython.py
	python setup_cython.py build_ext --inplace
	python -m py_compile test-cython.py

test-pascal: test.pas
	fpc -O3 -otest-pascal test.pas

clean:
	rm -f *.o *.so *.pyc
	rm -f test-c-gcc test-c-clang test-cpp-g++ test-cpp-clang++
	rm -f test-java test.class test-go test-ada test.ali
	rm -rf build test_python_cython.c test_numpy_cython.c test_cython.c
	rm -f test-pascal

run_test: all
	echo "-------------------------------------"
	time -p ./test-c-gcc 100
	time -p ./test-c-clang 100
	time -p ./test-cpp-g++ 100
	time -p ./test-cpp-clang++ 100
	time -p ./test-java 100
	time -p java test 100
	time -p ./test-ada 100
	time -p ./test-go 100
	echo "Multiply the test-python.pyc time by 100 for comparison."
	time -p python test-python.pyc 1
	echo "Multiply the test-python-cython.pyc time by 100 for comparison."
	time -p python test-python-cython.pyc 1
	echo "Multiply the test-numpy.pyc time by 100 for comparison."
	time -p python test-numpy.pyc 1
	echo "Multiply the test-numpy-cython.pyc time by 100 for comparison."
	time -p python test-numpy-cython.pyc 1
	time -p python test-cython.pyc 100
	time -p ./test.js 100
	echo "Multiply the test.pl time by 100 for comparison."
	time -p ./test.pl 1
	echo "Multiply the test.rb time by 100 for comparison."
	time -p ./test.rb 1
	time -p ./test-pascal 100
	echo "Multiply the test.lua time by 100 for comparison."
	time -p ./test.lua 1
	echo "Multiply the test.php time by 100 for comparison."
	time -p ./test.php 1