
The Performance Impact of CONFIG_DEBUG_KERNEL

Previous work integrating AES encryption with the iSCSI Enterprise Target kernel module sparked this investigation. Preliminary tests showed that AES performance was greater in a user process than in a kernel module for the platforms tested, but that did not make sense to me initially. Everything is faster in the kernel, right?

This quandary led me to wonder whether the Linux distributions' default kernels had options compiled in that caused this poor performance. I started with the aestest-user application and aestest-kernel module that I had used for initial performance testing. I thoroughly examined both source files to make sure the calculation loops were as close to identical as possible. After running make, I ran aestest-user, then modprobe aestest-kernel, and finally dd bs=1 count=1 if=/proc/aestest to trigger the kernel module calculations. The results were the same as before, but I tested multiple operating systems and different kernels to be sure. Here are the performance results:
User/Kernel   | OS                   | Kernel Version               | CPU                        | Throughput (MB/s)
user process  | CentOS 5.1           |                              | Intel Celeron 2.5GHz 32bit | 47
user process  | RHEL 5               |                              | Intel Celeron 2.5GHz 32bit | 47
user process  | RHEL5LSPP3 TTC 1.3.4 |                              | Intel Celeron 2.5GHz 32bit | 47
kernel module | CentOS 5.1           | 2.6.18-53.el5                | Intel Celeron 2.5GHz 32bit | 44
kernel module | CentOS 5.1           | 2.6.18-92.1.13.el5           | Intel Celeron 2.5GHz 32bit | 44
kernel module | RHEL5LSPP3 TTC 1.3.4 | 2.6.18-8.1.3.lspp.el5.tcs.12 | Intel Celeron 2.5GHz 32bit | 44
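
For context, the dd command above works because the kernel module registers a /proc entry whose read handler runs the benchmark loops shown in the appendix. The following is only a rough sketch of how such a hook might look on a 2.6-era kernel; the function and entry names are my assumptions, not the actual aestest-kernel source.

#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/proc_fs.h>

static void run_aes_benchmark(void)
{
    //the encrypt/decrypt timing loops from the appendix would go here
}

//read handler for /proc/aestest: each read triggers the benchmark once
static int aestest_read(char *page, char **start, off_t off,
                        int count, int *eof, void *data)
{
    if (off > 0) {
        *eof = 1;
        return 0;
    }
    run_aes_benchmark();
    *eof = 1;
    return sprintf(page, "\n"); //return one byte so dd has something to copy
}

static int __init aestest_init(void)
{
    struct proc_dir_entry *entry = create_proc_entry("aestest", 0444, NULL);
    if (!entry)
        return -ENOMEM;
    entry->read_proc = aestest_read;
    return 0;
}

static void __exit aestest_exit(void)
{
    remove_proc_entry("aestest", NULL);
}

module_init(aestest_init);
module_exit(aestest_exit);
MODULE_LICENSE("GPL");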


I then set out to try a totally customized 2.6.26.5 kernel to see if performance could be improved over the distribution's default kernel. I compiled in only the necessary system functionality, and performance was impressive. The problem was that I did not keep careful track of what I had changed from a useful baseline, so I was not sure which variables had the most significant performance impact.

Next I untarred the 2.6.26.5 kernel again and started testing with a nearly stock configuration. The only modifications were to compile a few drivers as modules to satisfy my initrd boot situation. I managed to match the performance of the distribution's stock kernel, giving me a useful baseline; I verified this by diffing against the stock kernel's configuration in /boot/config-<version>. Next, I tried many of the changes implemented previously and tested performance one kernel configuration variable at a time. The result was a 15% kernel module performance improvement over the stock kernel. Here are the performance results:
User/Kernel   | OS         | Kernel Version        | CPU                        | Throughput (MB/s)
user process  | CentOS 5.1 | 2.6.26.5              | Intel Celeron 2.5GHz 32bit | 47
kernel module | CentOS 5.1 | nearly stock 2.6.26.5 | Intel Celeron 2.5GHz 32bit | 44
kernel module | CentOS 5.1 | customized 2.6.26.5   | Intel Celeron 2.5GHz 32bit | 51


A diff of the two .config files revealed that the differences were minimal. The stock settings were full of debugging functionality, and the kernel was optimized for size instead of speed. The faster customized kernel was optimized for speed, and all kernel debugging and kernel hacking functionality was removed. Removing CONFIG_DEBUG_KERNEL had the biggest impact on performance by far. The additional interrupts, bookkeeping routines, and checking routines that run in a debug kernel seriously hinder kernel performance (see the illustrative sketch after the listing). Here are the kernel configuration differences:
Nearly Stock 2.6.26.5 (< lines) vs. Customized 2.6.26.5 (> lines)
< CONFIG_CC_OPTIMIZE_FOR_SIZE=y
< CONFIG_KALLSYMS_ALL=y
< CONFIG_KPROBES=y
< CONFIG_KRETPROBES=y

< # CONFIG_DEFAULT_DEADLINE is not set
< CONFIG_DEFAULT_CFQ=y
< CONFIG_DEFAULT_IOSCHED="cfq"

< # CONFIG_PREEMPT_NONE is not set
< CONFIG_PREEMPT_VOLUNTARY=y

< CONFIG_ENABLE_WARN_DEPRECATED=y
< CONFIG_MAGIC_SYSRQ=y
< CONFIG_UNUSED_SYMBOLS=y

< CONFIG_DEBUG_KERNEL=y
< CONFIG_DETECT_SOFTLOCKUP=y
< CONFIG_TIMER_STATS=y
< CONFIG_DEBUG_STACKOVERFLOW=y
< # many CONFIG_DEBUG_* settings
< # have been removed from this list
> # changing the optimization had minimal effect
> # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
> # CONFIG_KPROBES is not set

> # changing I/O scheduler had no effect
> CONFIG_DEFAULT_DEADLINE=y
> # CONFIG_DEFAULT_CFQ is not set
> CONFIG_DEFAULT_IOSCHED="deadline"


> # changing preemption had no effect
> CONFIG_PREEMPT_NONE=y
> # CONFIG_PREEMPT_VOLUNTARY is not set

> # CONFIG_ENABLE_WARN_DEPRECATED is not set
> # CONFIG_MAGIC_SYSRQ is not set
> # CONFIG_UNUSED_SYMBOLS is not set

> # removing all kernel debugging and hacking
> # had a huge effect on performance
> # CONFIG_DEBUG_KERNEL is not set
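
To give a feel for where the cost comes from, debug-oriented config options typically compile extra sanity checks and bookkeeping into hot code paths. The pattern below is purely illustrative; the option and macro names are made up and not taken from the kernel source. When the option is off, the check compiles away to nothing, and when it is on, every call site pays for it.

//illustrative pattern only; hypothetical option and macro names
#ifdef CONFIG_DEBUG_SOMETHING
#define CHECK_STATE(cond) \
    do { \
        if (!(cond)) \
            printk(KERN_WARNING "state check failed\n"); \
    } while (0)
#else
#define CHECK_STATE(cond) do { } while (0) //compiles to nothing
#endif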


Next I repeated this test on an x86_64 computer with dual Intel Xeon 3GHz CPUs. The results were even more dramatic with a 32% kernel module performance improvement over the stock kernel. Removing kernel hacking and debugging features seems critical to high performance in the kernel. Here are the performance results:
User/Kernel   | OS         | Kernel Version        | CPU                   | Throughput (MB/s)
user process  | CentOS 5.1 | 2.6.26.5              | Intel Xeon 3GHz 64bit | 86
kernel module | CentOS 5.1 | nearly stock 2.6.26.5 | Intel Xeon 3GHz 64bit | 68
kernel module | CentOS 5.1 | customized 2.6.26.5   | Intel Xeon 3GHz 64bit | 90


I have shown why the initial kernel-module performance was slower than the user process and that it can be improved to outperform the user process by removing CONFIG_DEBUG_KERNEL. The question now is: why do so many distributions, like RHEL, Fedora, and Debian, enable debugging in their default kernels? And why do these distributions not provide a nearly identical kernel with debugging disabled as an alternative in the GRUB menu? With a potential kernel performance improvement of 15%-32%, it should be easy for users to switch to a non-debug kernel once they know their systems are stable.


Appendix

Here are the business parts of the user process and kernel module that perform all of the calculations and timing. Notice that they are nearly identical.
User Process
//set start time
time(&starttime);
    
//encrypt buffer 1 million times with the same key
for(i=0; i < 1000000; i++){
    //perform aes-256-CBC on 512 bytes at a time using the same iv
    for(aes_cbcoffset=0;aes_cbcoffset<bufsize/512;aes_cbcoffset++){
        memcpy(aes_ivec,aes_ivec2,AES_BLOCK_SIZE);
        AES_cbc_encrypt(buf+aes_cbcoffset*512,buf+aes_cbcoffset*512,
	512,&(aes_encryptionkeyschedule),aes_ivec,AES_ENCRYPT);
    }
}

//print first char of ciphertext version
printf("buf: %c\n",buf[0]);

//decrypt buffer 1 million times with the same key
for(i=0; i < 1000000; i++){
    //perform aes-256-CBC on 512 bytes at a time using the same iv
    for(aes_cbcoffset=0;aes_cbcoffset<bufsize/512;aes_cbcoffset++){
        memcpy(aes_ivec,aes_ivec2,AES_BLOCK_SIZE);
        AES_cbc_encrypt(buf+aes_cbcoffset*512,buf+aes_cbcoffset*512,
	512,&(aes_decryptionkeyschedule),aes_ivec,AES_DECRYPT);
    }
}

//print plaintext version, which matches the original
printf("buf: %s\n",buf);

//determine end time and calculate stats
time(&endtime);
printf("elapsed: %d\n",endtime-starttime);
printf("MB/s: %d\n",(2*bufsize)/(endtime-starttime));
Kernel Module

//set start time
starttime = current_kernel_time();
    
//encrypt buffer 1 million times with the same key
for(i=0; i < 1000000; i++){
    //perform aes-256-CBC on 512 bytes at a time using the same iv
    for(aes_cbcoffset=0;aes_cbcoffset<bufsize/512;aes_cbcoffset++){
        memcpy(aes_ivec,aes_ivec2,AES_BLOCK_SIZE);
        AES_cbc_encrypt(buf+aes_cbcoffset*512,buf+aes_cbcoffset*512,
	512,&(aes_encryptionkeyschedule),aes_ivec,AES_ENCRYPT);
    }
    if(i%1000==0) schedule();//don't hog the cpu
}

//print first char of cyphertext version
printk("buf: %c\n",buf[0]);

//decrypt buffer 1 million times with the same key
for(i=0; i < 1000000; i++){
    //perform aes-256-CBC on 512 bytes at a time using the same iv
    for(aes_cbcoffset=0;aes_cbcoffset<bufsize/512;aes_cbcoffset++){
        memcpy(aes_ivec,aes_ivec2,AES_BLOCK_SIZE);
        AES_cbc_encrypt(buf+aes_cbcoffset*512,buf+aes_cbcoffset*512,
	512,&(aes_decryptionkeyschedule),aes_ivec,AES_DECRYPT);
    }
    if(i%1000==0) schedule();//don't hog the cpu
}

//print plaintext version, which matches the original
printk("buf: %s\n",buf);

//determine end time and calculate stats
endtime = current_kernel_time();
printk("elapsed: %d\n",endtime.tv_sec-starttime.tv_sec);
printk("MB/s: %d\n",(2*bufsize)/(endtime.tv_sec-starttime.tv_sec));