User Process vs Kernel Module AES Performance and The Impact of DEBUG_KERNEL

This work was sparked by earlier work integrating AES encryption with an iSCSI server daemon. That work showed that AES throughput was higher in a user process than in a kernel module on the platform tested, which led me to wonder whether the distribution's default kernel had options compiled in that caused the poor kernel-side performance. I started with the same aestest-user application and aestest-kernel module used previously, and examined both source files carefully to make sure the calculation loops were as close to identical as possible. After running make, I ran aestest-user, then modprobe aestest-kernel, and finally dd bs=1 count=1 if=/proc/aestest to trigger the kernel module's calculations. The results were the same as before:
User/Kernel     OS                     Version                         CPU                            Throughput (MB/s)
user process    CentOS 5.1             -                               Intel Celeron 2.5GHz 32-bit    47
user process    RHEL 5                 -                               Intel Celeron 2.5GHz 32-bit    47
user process    RHEL5LSPP3 TTC 1.3.4   -                               Intel Celeron 2.5GHz 32-bit    47
kernel module   CentOS 5.1             2.6.18-53.el5                   Intel Celeron 2.5GHz 32-bit    44
kernel module   CentOS 5.1             2.6.18-92.1.13.el5              Intel Celeron 2.5GHz 32-bit    44
kernel module   RHEL5LSPP3 TTC 1.3.4   2.6.18-8.1.3.lspp.el5.tcs.12    Intel Celeron 2.5GHz 32-bit    44


I then set out to build a fully customized 2.6.26.5 kernel to see whether performance could be improved over the distribution's default kernel. I compiled in only what my system needed, and performance was impressive. The problem was that I had not kept careful track of my changes relative to a useful baseline.

Next I untarred the 2.6.26.5 kernel again and started testing with a nearly stock configuration. The only modifications made were to compile a few drivers as modules to satisfy my initrd boot situation. I managed to match the performance of the distribution's stock kernel, so I achieved a useful baseline. Next, I tried many of the changes implemented previously and tested performance one kernel configuration variable at a time. The result was a 15% kernel module performance improvement over the stock kernel. Here are the performance results:
User/Kernel     OS           Version                  CPU                            Throughput (MB/s)
user process    CentOS 5.1   2.6.26.5                 Intel Celeron 2.5GHz 32-bit    47
kernel module   CentOS 5.1   nearly stock 2.6.26.5    Intel Celeron 2.5GHz 32-bit    44
kernel module   CentOS 5.1   customized 2.6.26.5      Intel Celeron 2.5GHz 32-bit    51


A diff of the two .config files revealed the differences, which were minimal. The stock configuration enabled a lot of debugging functionality and optimized the kernel for size rather than speed. The faster customized kernel was optimized for speed and had all kernel debugging functionality removed. Here are the differences:
Nearly Stock 2.6.26.5 ('<' lines) vs. Customized 2.6.26.5 ('>' lines):

< CONFIG_CC_OPTIMIZE_FOR_SIZE=y
< CONFIG_KALLSYMS_ALL=y
< CONFIG_KPROBES=y
< CONFIG_KRETPROBES=y

< # CONFIG_DEFAULT_DEADLINE is not set
< CONFIG_DEFAULT_CFQ=y
< CONFIG_DEFAULT_IOSCHED="cfq"

< # CONFIG_PREEMPT_NONE is not set
< CONFIG_PREEMPT_VOLUNTARY=y

< CONFIG_ENABLE_WARN_DEPRECATED=y
< CONFIG_MAGIC_SYSRQ=y
< CONFIG_UNUSED_SYMBOLS=y

< CONFIG_DEBUG_KERNEL=y
< CONFIG_DETECT_SOFTLOCKUP=y
< CONFIG_TIMER_STATS=y
< CONFIG_DEBUG_STACKOVERFLOW=y
< # many CONFIG_DEBUG_* settings were not set
< # and have been removed from this list

> # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
> # CONFIG_KPROBES is not set

> # changing the I/O scheduler had no effect
> CONFIG_DEFAULT_DEADLINE=y
> # CONFIG_DEFAULT_CFQ is not set
> CONFIG_DEFAULT_IOSCHED="deadline"

> # changing preemption had no effect
> CONFIG_PREEMPT_NONE=y
> # CONFIG_PREEMPT_VOLUNTARY is not set

> # CONFIG_ENABLE_WARN_DEPRECATED is not set
> # CONFIG_MAGIC_SYSRQ is not set
> # CONFIG_UNUSED_SYMBOLS is not set

> # removing all kernel debugging and hacking
> # features had a huge effect on performance
> # CONFIG_DEBUG_KERNEL is not set

Next I repeated this testing method on an x86_64 computer with dual Intel Xeon 3GHz CPUs. The results were even more dramatic: a 32% kernel module performance improvement over the stock kernel. Removing kernel hacking and debugging features appears critical to high performance in the kernel. Here are the performance results:
User/Kernel     OS           Version                  CPU                        Throughput (MB/s)
user process    CentOS 5.1   2.6.26.5                 Intel Xeon 3GHz 64-bit     86
kernel module   CentOS 5.1   nearly stock 2.6.26.5    Intel Xeon 3GHz 64-bit     68
kernel module   CentOS 5.1   customized 2.6.26.5      Intel Xeon 3GHz 64-bit     90
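The improvement figures quoted in the text follow directly from the table values. A quick check of the arithmetic (throughput numbers taken from the two results tables, percentages truncated to whole numbers):

```shell
# Percentage improvement of the customized kernel module over the nearly
# stock one: (customized - stock) / stock * 100, truncated by %d.
awk 'BEGIN {
  printf "Celeron 32-bit: %d%%\n", (51-44)/44*100
  printf "Xeon 64-bit: %d%%\n", (90-68)/68*100
}'
```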


Here are the core sections of the user process and the kernel module that perform all of the calculations and timing. Notice that they are nearly identical.

User Process:
//set start time
time(&starttime);
    
//encrypt buffer 1 million times with the same key
for(i=0; i < 1000000; i++){
    //perform aes-256-CBC on 512 bytes at a time using the same iv
    for(aes_cbcoffset=0;aes_cbcoffset<bufsize/512;aes_cbcoffset++){
        memcpy(aes_ivec,aes_ivec2,AES_BLOCK_SIZE);
        AES_cbc_encrypt(buf+aes_cbcoffset*512,buf+aes_cbcoffset*512,
	512,&(aes_encryptionkeyschedule),aes_ivec,AES_ENCRYPT);
    }
}

//print first char of ciphertext version
printf("buf: %c\n",buf[0]);

//decrypt buffer 1 million times with the same key
for(i=0; i < 1000000; i++){
    //perform aes-256-CBC on 512 bytes at a time using the same iv
    for(aes_cbcoffset=0;aes_cbcoffset<bufsize/512;aes_cbcoffset++){
        memcpy(aes_ivec,aes_ivec2,AES_BLOCK_SIZE);
        AES_cbc_encrypt(buf+aes_cbcoffset*512,buf+aes_cbcoffset*512,
	512,&(aes_decryptionkeyschedule),aes_ivec,AES_DECRYPT);
    }
}

//print plaintext version, which matches the original
printf("buf: %s\n",buf);

//determine end time and calculate stats
time(&endtime);
printf("elapsed: %ld\n",(long)(endtime-starttime));
//1,000,000 iterations x bufsize bytes = bufsize MB per pass; x2 for encrypt+decrypt
printf("MB/s: %ld\n",(long)((2*bufsize)/(endtime-starttime)));
Kernel Module:

//set start time
starttime = current_kernel_time();
    
//encrypt buffer 1 million times with the same key
for(i=0; i < 1000000; i++){
    //perform aes-256-CBC on 512 bytes at a time using the same iv
    for(aes_cbcoffset=0;aes_cbcoffset<bufsize/512;aes_cbcoffset++){
        memcpy(aes_ivec,aes_ivec2,AES_BLOCK_SIZE);
        AES_cbc_encrypt(buf+aes_cbcoffset*512,buf+aes_cbcoffset*512,
	512,&(aes_encryptionkeyschedule),aes_ivec,AES_ENCRYPT);
    }
    if(i%1000==0) schedule();//don't hog the cpu
}

//print first char of ciphertext version
printk("buf: %c\n",buf[0]);

//decrypt buffer 1 million times with the same key
for(i=0; i < 1000000; i++){
    //perform aes-256-CBC on 512 bytes at a time using the same iv
    for(aes_cbcoffset=0;aes_cbcoffset<bufsize/512;aes_cbcoffset++){
        memcpy(aes_ivec,aes_ivec2,AES_BLOCK_SIZE);
        AES_cbc_encrypt(buf+aes_cbcoffset*512,buf+aes_cbcoffset*512,
	512,&(aes_decryptionkeyschedule),aes_ivec,AES_DECRYPT);
    }
    if(i%1000==0) schedule();//don't hog the cpu
}

//print plaintext version, which matches the original
printk("buf: %s\n",buf);

//determine end time and calculate stats
endtime = current_kernel_time();
printk("elapsed: %ld\n",endtime.tv_sec-starttime.tv_sec);
//1,000,000 iterations x bufsize bytes = bufsize MB per pass; x2 for encrypt+decrypt
printk("MB/s: %ld\n",(2*bufsize)/(endtime.tv_sec-starttime.tv_sec));
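The MB/s line relies on a units coincidence worth spelling out: with 1,000,000 iterations and 1 MB taken as 10^6 bytes, each full pass over the buffer processes exactly bufsize MB, so the total is simply 2*bufsize MB for the encrypt plus decrypt passes. A small self-contained check of that arithmetic, using hypothetical example values for bufsize and elapsed (the real values are not given in the source):

```shell
# Worked example of the throughput formula with made-up numbers:
# bufsize = 512 bytes per iteration, elapsed = 20 seconds.
awk 'BEGIN {
  bufsize = 512; elapsed = 20; iters = 1000000
  total_mb = 2 * iters * bufsize / 1000000    # encrypt + decrypt, bytes -> MB
  shortcut = 2 * bufsize / elapsed            # the formula used in the code
  printf "total MB: %d\n", total_mb
  printf "MB/s: %d  (shortcut: %d)\n", total_mb / elapsed, shortcut
}'
```

Both forms agree because the 10^6 iterations cancel the bytes-to-MB conversion.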