
Kernel vs User Implementation for Software Data Encryption

Research Paper
ITEC 610
Professor Richardson

11/11/2007

Gregory Alan Hildstrom

Introduction
Kernel-Space Vs. User-Space Overview
Kernel-Space Vs. User-Space File/Disk Encryption
Kernel-Space Vs. User-Space Network Encryption
Kernel-Space Vs. User-Space AES256CBC Encryption
AES256CBC Throughput Performance and Relevant Tests
Other Relevant Research
Encryption Key Management
Development and Portability
Conclusions
References

Introduction

My company is responsible for implementing a secure data storage server. The server must run the Red Hat Enterprise Linux 5 Labeled Security Protection Profile (RHEL5LSPP) certified operating system, which uses the Security Enhanced Linux (SELinux) multilevel security (MLS) policy.

"Encryption is the foundation of many aspects of security. For example, encryption protects messages sent across the Internet and protects files stored on servers." (Anderson & Post, 2006, p. 179) The latter case applies to our situation.

Our client requested that recorded data be encrypted on the underlying storage devices to prevent any security breach from yielding usable data and to allow rapid destruction of data. Encryption greatly reduces the time necessary to wipe or clean the data storage devices after use because encryption keys can be destroyed in an instant, which is much faster than overwriting entire data storage devices with random data.
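The difference in wipe time can be estimated with simple arithmetic. The sketch below is illustrative only: it borrows the 57MB/s raw write speed and the 10GB test partition reported later in this paper, and the 2TB array size is a hypothetical figure chosen purely for scale.

```python
def overwrite_seconds(device_bytes: int, write_mb_per_s: float) -> float:
    """Seconds needed to overwrite a device once at a sustained write rate."""
    return device_bytes / (write_mb_per_s * 1_000_000)


RAW_WRITE_MB_S = 57                 # raw write speed measured later in this paper
TEST_PARTITION_BYTES = 10 * 10**9   # the 10GB test partition used later in this paper
HYPOTHETICAL_ARRAY_BYTES = 2 * 10**12  # assumption: a 2TB array, for illustration only
AES256_KEY_BYTES = 32               # a 256-bit key is 32 bytes

if __name__ == "__main__":
    for label, size in (("10GB partition", TEST_PARTITION_BYTES),
                        ("2TB array (hypothetical)", HYPOTHETICAL_ARRAY_BYTES)):
        seconds = overwrite_seconds(size, RAW_WRITE_MB_S)
        print(f"{label}: {seconds:,.0f} s ({seconds / 3600:.2f} h) for one overwrite pass")
    print(f"Key to destroy instead: {AES256_KEY_BYTES} bytes")
```

One overwrite pass scales linearly with device size, while the amount of key material to destroy stays fixed at a few bytes per key.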

The server must be capable of sustaining high data throughput, over 200MB/s, to a large number of network clients, so encryption performance is critical. There are two primary places that software encryption can be utilized on the Linux storage server or network clients: at the operating system (OS) level (kernel space) or at the application level (user space). Encryption at the OS level is typically built into the kernel or kernel modules, which are the core of the operating system. Application-level encryption occurs in user process space in either end user applications or in server daemons. This paper examines the performance of different encryption implementations and mechanisms in kernel-space and user-space execution as they apply to high performance servers. File, disk, network, and raw encryption are examined for both implementation strategies. Encryption key management, development issues, and portability are also addressed.

I will show that application-level software data encryption is a superior choice for high performance servers when low-latency driver-level access is not required.

"In 2001, the U.S. Government chose a new method known as the Advanced Encryption Standard (AES) because it is fast and users have a choice of a key length of 128, 192, or 256 bits. Keep in mind that longer keys make the message more secure, but increase the time it takes to encrypt and decrypt the message." (Anderson & Post, 2006, p. 181) AES is approved by the National Security Agency (NSA) for the encryption of classified data when using 128 or 256 bit keys, so AES with 256 bit keys coupled with cipher block chaining (AES256CBC) is the focus of this research. (National Security Agency, 2005) (Dworkin, 2001) AES 256 is much faster than triple-DES, which has a 168-bit key, and slightly slower than DES, which has a 56-bit key, but it is considered more secure because of the key size and strong algorithm. (Laoutaris, Merakos, Stavrakakis, & Xenakis, 2006, p. 14) Because longer keys increase the time needed for encryption and decryption, performance optimization is critical on servers with high performance requirements.

OS-level encryption modules generally strive for transparent encryption of some device or service that can be used by many user-level applications. OS-level modules might encrypt entire block devices, like disks and partitions, filesystems, or network connections. Two examples of block device (disk) OS-level encryption are dm-crypt and loop-AES. Two examples of filesystem OS-level encryption are encfs and cryptfs. IPsec is one example of OS-level network encryption.

Application-level encryption can occur in custom application code, library code, or in separate third party executables. An example of an encryption library and third party executable is OpenSSL. Other examples of third party executables are aesutil and WinZip. SSH is one example of application-level network connection encryption.

Kernel-Space Vs. User-Space Overview

The advantages of user-space drivers [applications] are:

The full C library can be linked in. The driver can perform many exotic tasks without resorting to external programs (the utility programs implementing usage policies that are usually distributed along with the driver itself).
The programmer can run a conventional debugger on the driver code without having to go through contortions to debug a running kernel.
If a user-space driver hangs, you can simply kill it. Problems with the driver are unlikely to hang the entire system, unless the hardware being controlled is really misbehaving.
User memory is swappable, unlike kernel memory. An infrequently used device with a huge driver won’t occupy RAM that other programs could be using, except when it is actually in use.
A well-designed driver program can still, like kernel-space drivers, allow concurrent access to a device.
If you must write a closed-source driver, the user-space option makes it easier for you to avoid ambiguous licensing situations and problems with changing kernel interfaces. (Corbet, Kroah-Hartman, & Rubini, 2005, p. 38)

But the user-space approach to device driving has a number of drawbacks. The most important are:

Interrupts are not available in user space. There are workarounds for this limitation on some platforms, such as the vm86 system call on the IA32 architecture.
Direct access to memory is possible only by mmapping /dev/mem, and only a privileged user can do that.
Access to I/O ports is available only after calling ioperm or iopl. Moreover, not all platforms support these system calls, and access to /dev/port can be too slow to be effective. Both the system calls and the device file are reserved to a privileged user.
Response time is slower, because a context switch is required to transfer information or actions between the client and the hardware.
Worse yet, if the driver has been swapped to disk, response time is unacceptably long. Using the mlock system call might help, but usually you’ll need to lock many memory pages, because a user-space program depends on a lot of library code. mlock, too, is limited to privileged users.
The most important devices can’t be handled in user space, including, but not limited to, network interfaces and block devices. (Corbet et al., 2005, p. 39)

User-space programs are inherently safer because they run in protected memory as part of individual address space. User programs are more easily debugged and generally cannot bring down the entire operating system if they crash. Kernel-space programs can handle interrupts, require less context switching, and have lower-level access to system resources.

User-space processes are not allowed to interfere with kernel memory or other user process memory. They also have significant overhead when making system calls, which execute in kernel space on behalf of the user process. Processes executing in kernel space have no memory protection and may be interrupted at any time. "Another important difference between kernel programming and application programming is in how each environment handles faults: whereas a segmentation fault is harmless during application development and a debugger can always be used to trace the error to the problem in the source code, a kernel fault kills the current process at least, if not the whole system." (Corbet et al., 2005, p. 19) Also, "if you do not write your code with concurrency in mind, it will be subject to catastrophic failures that can be exceedingly difficult to debug." (Corbet et al., 2005, p. 21) Yet another difference is that kernel code does not have any sort of automatic cleanup like user programs; every allocation must be freed when no longer needed or the memory will be occupied until the next reboot. (Corbet et al., 2005, p. 18) Implementing at the OS level is much more tedious and unforgiving, but it has potential performance benefits for some applications.

The kernel is geared toward low-level functionality needed by many different user applications to access services and hardware. Kernel modules cannot be linked to static or shared libraries to take advantage of preexisting code. "The kernel needs its own printing function because it runs by itself, without the help of the C library." (Corbet et al., 2005, p. 17) "A [kernel] module is linked only to the kernel, and the only functions it can call are the ones exported by the kernel; there are no libraries to link to." (Corbet et al., 2005, p. 18) This makes high-level application development, which could rely on numerous libraries, more difficult and time consuming in the kernel.

The difficulty in designing any high-performance application is balancing the potential performance benefits of kernel-space implementation with the easier development of user-space implementation.

Kernel-Space Vs. User-Space File/Disk Encryption

There are four main transparent data storage encryption mechanisms available for the average Linux system: dm-crypt, loop-aes, encfs, and cryptfs. Loop-aes adds encryption to the standard Linux kernel loop device driver, while dm-crypt is implemented as a device-mapper target; both provide a virtual disk device that encrypts all I/O to the actual device. Dm-crypt deprecates the older cryptoloop driver, but the underlying method is similar. Both encfs and cryptfs add encryption to the Linux filesystem layer and encrypt mounted filesystem I/O to the underlying device. I chose to analyze dm-crypt because it is included with RHEL5LSPP and it shares most of its implementation strategy with loop-aes, so performance should be similar. I also chose to analyze encfs because it was easier to install and is similar to cryptfs. (Dave, Wright, & Zadok, 2003, p. 2) I decided to use the industry standard library and program OpenSSL for application-level file encryption; version 0.9.8b is included with RHEL5LSPP.

I created a 10GB partition /dev/sdb1 and dumped random data to a 1GB ram file in /dev/shm. I used dd/sync and openssl/sync with a 1MB block size to read random data from memory and store it using one of the following encryption methods. Here are the results:

Method	Write Speed MB/s	Read Speed MB/s
Raw / no encryption	57	43
Dm-crypt	31	35
Encfs	15	29
OpenSSL	45	43

This result surprised me because I assumed that anything implemented at the kernel level would be faster, which was incorrect. The openssl user-space application was clearly the fastest encryption method for throughput, but it does not provide transparent access to other user applications like the kernel methods. "Systems such as Cryptoloop that can encrypt an entire partition are able to make effective use of the buffer cache in order to prevent many encryption and decryption operations." (Dave et al., 2003, p. 14) An application would need to implement its own cache to mimic this behavior for small repeated I/O requests, which would certainly add to development time and application complexity.

Kernel-Space Vs. User-Space Network Encryption

Next I decided to compare kernel space and user space for network encryption. I installed Openswan 2.4.9, which uses the IPsec implementation and encryption in the Linux kernel to encrypt network communication. Secure Shell (SSH) uses the OpenSSL library for encryption, runs entirely in user space, and can create an encrypted network tunnel between two computers similar to IPsec. I measured performance using 4 different applications, but they did not all work for both methods. The two computers were connected with 10GbE tuned rather poorly.

Netperf is a widely used network I/O performance tool. iSCSI connectivity was provided by the iSCSI Enterprise Target 0.4.15 server and Open-iSCSI 2.0-865.15 client. Netcat is included with most Linux distributions and simply pipes data from the shell on one computer to another. Filemover is a custom TCP application that copies a file from client to server as fast as possible. A 1MB block size was used where possible for copying from /dev/shm (memory) on one computer to another. Here are the results:

Method	Netperf MB/s	iSCSI MB/s	Netcat MB/s	Filemover MB/s
Raw / no encryption	230	165	126	225
IPsec	40	38	33	38
SSH tunnel	n/a	n/a	43	41

Once again, application-level functionality was faster than OS-level. The SSH tunnel was faster than IPsec and transparent to the application as long as it communicated on a single network port. IPsec is more transparent and robust, but slower. This led me to wonder if encryption performance in general was slower when executing in kernel space.

Kernel-Space Vs. User-Space AES256CBC Encryption

I consolidated all of the functions and data structures related to AES CBC encryption and decryption from the OpenSSL 0.9.8b source code into a single aes.h file. I created a user-space program and a kernel-space module with nearly identical encryption loop functions. The only differences were replacing 2 calls of printf with printk, malloc with kmalloc, and time with current_kernel_time; the actual calculation loops were identical.

The function populated a 1KB buffer with random data and then encrypted and decrypted it 1000000 times using AES256CBC and a random key. This is roughly the equivalent of encrypting and decrypting 1GB in memory and no other system variables should come into play. The user-space program aestest-user was executed from the command line. The kernel-space module aestest-kernel called the encryption loop test function when /proc/aestest was read using the command dd bs=1 count=1 if=/proc/aestest of=/dev/zero. For comparison, openssl speed -evp aes-256-cbc yielded 80MB/s in user space. Here are the results:

	Dual 3.0GHz Xeon MB/s	Dual dual-core 265 Opteron MB/s	Single 2.53GHz Celeron MB/s
	RHEL5LSPP x86_64	RHEL5LSPP x86_64	RHEL5LSPP i686
User space	89	66	47
Kernel space	70	58	44
Kernel speed relative to user (%)	-21	-12	-6

This result is very surprising. On three separate computers, both 64-bit and 32-bit, AES256CBC encryption was slower in kernel space than user space for identical function calls.
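The structure of the timing loop described in this section can be sketched as follows. This is not the aestest code: the transform is a placeholder XOR pad standing in for AES256CBC, all function names are hypothetical, and counting one encrypt/decrypt round trip as one buffer of throughput is an assumption about how the figures above were computed.

```python
import os
import time


def make_placeholder_cipher(buffer_size: int):
    """Return (encrypt, decrypt) stand-ins. NOT AES: a fixed XOR pad, used only
    so the timing loop has reversible work to perform."""
    pad = int.from_bytes(os.urandom(buffer_size), "big")

    def transform(data: bytes) -> bytes:
        return (int.from_bytes(data, "big") ^ pad).to_bytes(buffer_size, "big")

    return transform, transform  # XOR with the same pad is its own inverse


def run_loop(encrypt, decrypt, buffer_size: int = 1024, iterations: int = 1000):
    """Encrypt and decrypt one random buffer repeatedly; return (MB/s, round_trip_ok)."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    plaintext = os.urandom(buffer_size)   # random 1KB buffer, as in the test
    recovered = plaintext
    start = time.perf_counter()
    for _ in range(iterations):
        ciphertext = encrypt(plaintext)
        recovered = decrypt(ciphertext)
    elapsed = time.perf_counter() - start
    megabytes = buffer_size * iterations / 1_000_000
    return megabytes / elapsed, recovered == plaintext


if __name__ == "__main__":
    enc, dec = make_placeholder_cipher(1024)
    rate, ok = run_loop(enc, dec, buffer_size=1024, iterations=100_000)
    print(f"placeholder transform: {rate:.1f} MB/s, round trip ok: {ok}")
```

Because the buffer stays in memory and no I/O occurs inside the timed region, the measured rate reflects calculation speed only, which is the property the original test relied on.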

AES256CBC Throughput Performance and Relevant Tests

The results of aestest, network encryption, and storage encryption tests clearly show that user-space encryption delivers higher sustained performance than kernel-space execution. The kernel implementations are more transparent to the final application, which would not need to perform the encryption directly. Kernel drivers are faster for functions that require low latency, precise timing, and low-level control, but not necessarily faster for calculations.

Since user-space implementation proved faster for storage, networking, and raw encryption, I decided to modify the filemover server program to encrypt data before writing to ram disk using several different methods. This is a much more relevant estimate of the solution that needs to be provided to the customer. The filemover client sends data to the server in plaintext and the server encrypts the buffers in memory before writing them to disk. This test copied a 500MB file from /dev/shm on one computer to another.

Method	Filemover Speed MB/s
Single server process no encryption	213
Single server process with encryption	38
Multithreaded server with encryption (1 CPU)	72
Forked server with encryption (2 CPUs)	61

The multithreaded server started a new thread to deal with each I/O request, which ranges from 1B to something less than 65536B because of the IPv4 packet size limit. Better buffer handling could increase the maximum I/O request size and reduce thread and function call overhead, but this has not been implemented yet. The single server process was the slowest for encryption. The added encryption latency in network and storage I/O prevents it from reaching the CPU encryption limit. The multithreaded server was reasonably fast for throughput because it decreased the communication and storage latency and utilized more of the available CPU cycles. A received packet was encrypted and written by a separate thread, so the main process could continue to respond to the client and receive new packets, but overall CPU power for threads is limited to 1 processor for user-space processes. This test actually exceeded the calculation-only performance of the kernel even though it also had to handle network traffic and data storage. Forking a new process for each I/O request did not perform as well as I hoped. Forked processes can utilize more than 1 processor, but carry higher inherent creation overhead.
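The thread-per-request structure can be sketched roughly as below. This is not the filemover source: the function names are hypothetical, a bit-inverting placeholder stands in for encryption, and the write to storage is modelled as storing the result in a list. In CPython the threads also share one interpreter lock, so the sketch models the structure rather than the throughput.

```python
import threading


def placeholder_transform(chunk: bytes) -> bytes:
    """Stand-in for encryption (NOT a cipher): invert every bit of the chunk."""
    return bytes(b ^ 0xFF for b in chunk)


def handle_request(index: int, chunk: bytes, storage: list) -> None:
    """Worker: 'encrypt' one received chunk and 'write' it at its position."""
    storage[index] = placeholder_transform(chunk)


def receive_loop(chunks: list) -> list:
    """Main loop: hand each received chunk to a new thread and keep receiving."""
    storage = [None] * len(chunks)
    threads = []
    for index, chunk in enumerate(chunks):
        worker = threading.Thread(target=handle_request, args=(index, chunk, storage))
        worker.start()            # the main loop does not wait for the encryption
        threads.append(worker)
    for worker in threads:
        worker.join()             # wait for all outstanding requests before returning
    return storage


if __name__ == "__main__":
    received = [bytes([i]) * 1024 for i in range(8)]
    stored = receive_loop(received)
    print(f"stored {len(stored)} transformed chunks")
```

Each worker writes to its own slot, so no locking is needed here; a real server writing to a shared file would need to track offsets in the same way.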

This is one area where kernel-space execution could prove beneficial for data encryption. Threads spawned by the kernel have a very low creation overhead and can utilize multiple processors without the overhead associated with forking entire new processes. While calculation performance in the kernel is slower, simple threads may be used to utilize more processors with very little overhead. However, the maximum single client connection throughput requirement for our project is 40MB/s, which is well within the reach of the simple user-space multithreaded server. The 200MB/s requirement is aggregate for about 100 total clients. Additional processes could be forked once for each client connection, which will distribute the load efficiently over multiple processors, and each resulting process can use multiple threads to handle concurrent I/O requests quickly. This approach should be able to efficiently use all of the CPU resources on the system and retain high calculation performance. In this multiple process multiple thread scenario, I do not believe that kernel-space execution will provide any performance benefit over user-space execution. If large buffers are used for I/O requests, read and write system call context switching overhead can be kept to a minimum.

As a sanity check, I ran the OpenSSL speed benchmark to see if enough CPU power was available to achieve 200MB/s with software encryption. I used the command openssl speed -evp aes-256-cbc [-multi n] and used the 1KB block size results. The total throughput required might be attainable with careful server hardware and software design.

Server	1 process speed MB/s	4 process speed MB/s
Dual 3.0GHz Xeon	80	160
Dual Dual-Core 265 Opteron	60	240
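The benchmark figures can be checked against the 200MB/s aggregate requirement with a short calculation. This sketch uses only the numbers in the table above; the notion of "scaling efficiency" (measured 4-process speed divided by four times the 1-process speed) is introduced here for illustration.

```python
REQUIRED_AGGREGATE_MB_S = 200   # aggregate throughput requirement stated in this paper

# (1 process MB/s, 4 process MB/s) from the openssl speed results above
BENCHMARKS = {
    "Dual 3.0GHz Xeon": (80, 160),
    "Dual Dual-Core 265 Opteron": (60, 240),
}


def scaling_efficiency(one_process: float, n_process: float, n: int = 4) -> float:
    """Fraction of ideal linear scaling achieved with n processes."""
    return n_process / (one_process * n)


def meets_requirement(n_process: float) -> bool:
    """True if raw encryption throughput alone reaches the aggregate requirement."""
    return n_process >= REQUIRED_AGGREGATE_MB_S


if __name__ == "__main__":
    for server, (one, four) in BENCHMARKS.items():
        print(f"{server}: efficiency {scaling_efficiency(one, four):.0%}, "
              f"meets {REQUIRED_AGGREGATE_MB_S}MB/s: {meets_requirement(four)}")
```

This counts CPU capacity for encryption only and leaves no headroom for network and storage handling, which is why the requirement is described as attainable only with careful design.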

Other Relevant Research

Other research has arrived at similar performance conclusions. The conclusion reached by a team investigating web server (u-server and TUX) performance is particularly relevant.

We demonstrate that the gap in the performance of the two servers becomes less significant as the proportion of dynamic-content requests increases. In fact, for workloads with a majority of dynamic requests, the u-server outperforms TUX. We conclude that a well-designed user-space web server can compete with an in-kernel server on performance, while retaining the reliability and security benefits that come from operating in user space. (Brecht, Li, Shukla, Subramanian, & Ward, 2004, p. 1)

Their results correlate well with mine. Static content request performance is largely the result of response and I/O latency, which is where kernel-space execution shines. Dynamic content request performance is also a result of response and I/O latency, but it involves many calculations to generate the content, which is where user-space excels.

This conclusion is also backed up by other research that aimed to run user-level code in the kernel.

This project has two goals. The first goal is to improve application performance by reducing context switches and data copies. We do this by either running select sections of the application in kernel-mode, or by creating new, more efficient system calls. The second goal is to ensure that kernel safety is not violated when running user-level code in the kernel. (Callanan, Rai, Sivathanu, Traeger, & Zadok, 2005, p. 1)

The team invested significant effort into error checking and security by implementing new development tools and compiler extensions. (Callanan et al., 2005, p. 8) The speed improvements to applications came from reducing context switches and data copies, and not because instructions or calculations are inherently faster in the kernel. This type of approach will likely benefit applications that spend most of their time making system calls and copying data between user space and kernel space. This will likely not benefit applications that spend most of their time performing calculations.

A simple way to demonstrate system call impact is to use the dd command with a block size of 1B, which will have terrible performance compared to a block size of 1MB. Copying a 1GB file with the larger block size would call the system functions read and write 1000 times, but the smaller block size would result in 1 billion calls to the read and write functions. All of the data must be copied between kernel and user space with either method, but every system call involves a context switch and the system call overhead quickly becomes a problem with small block sizes.
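The effect of block size on the number of system calls can be reproduced on a small scale. The sketch below is a hypothetical stand-in for dd, not a measurement from this paper: it copies a file using unbuffered os.read and os.write and counts the calls it makes.

```python
import os
import tempfile


def copy_counting_calls(src_path: str, dst_path: str, block_size: int) -> int:
    """Copy src to dst with unbuffered I/O; return the number of read/write calls."""
    calls = 0
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        while True:
            chunk = os.read(src, block_size)   # one read system call
            calls += 1
            if not chunk:                      # empty result signals end of file
                break
            view = memoryview(chunk)
            while len(view) > 0:               # os.write may write fewer bytes than asked
                written = os.write(dst, view)  # one write system call
                calls += 1
                view = view[written:]
    finally:
        os.close(src)
        os.close(dst)
    return calls


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        src = os.path.join(workdir, "src.bin")
        dst = os.path.join(workdir, "dst.bin")
        with open(src, "wb") as f:
            f.write(os.urandom(64 * 1024))     # small file so the 1B case finishes quickly
        for block in (1, 64 * 1024):
            print(f"block size {block}B: {copy_counting_calls(src, dst, block)} calls")
```

The same data crosses the kernel boundary in both cases; only the number of crossings changes, which is the overhead the paragraph above describes.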

Encryption Key Management

Encryption key management is a very serious issue in any encryption system. Great care must be exercised to ensure that keys cannot be compromised, leaked, faked, or lost. In data transmission systems "both people also need to have the same key, which is the difficult part. How do you deliver a secret key to someone? And if you can deliver a secret key, you might as well send the message the same way." (Anderson & Post, 2006, p. 180) This is the primary reason that the client decided to encrypt stored data and not data transmitted over the private LAN. Key distribution to hundreds of network clients is a nightmare for accreditors. It is very difficult to guarantee, with high certainty, that keys have not been compromised during distribution. Even if keys are automatically exchanged using Diffie-Hellman or some other public key protocol, it is more difficult to guarantee that the key is not sniffed by a rogue process on the remote computer. This matters less for short-term session keys, but it is a much bigger issue for long-term data storage keys. Similarly, it is also difficult to assure that all keys have been completely destroyed once they have been widely distributed and are no longer needed. Key management will be much easier with all of the data encryption taking place on a centralized server. Centrally located physical security, access, configuration, audits, and files make management easier from a security perspective.

Keys will be generated by the server and stored on internal disks. Removable media can present reliability and security problems. Each security level will have a different encryption key. The security administrator or system administrator users will be the only users with read access to the secret encryption keys, for backup purposes only. Neither user space nor kernel space has any particular advantage here. Both must read the keys from some physical media, have a method of overwriting keys, and have a method of loading keys. For user-space applications this could be as simple as updating a configuration file, overwriting a key file, issuing a command, or restarting a daemon.
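A minimal user-space sketch of generating, loading, and overwriting a key file is shown below. The function names and file layout are hypothetical, not part of the project described here. It also assumes that an in-place overwrite reaches the same physical blocks, which journaling or copy-on-write filesystems and flash devices do not guarantee.

```python
import os
import secrets

KEY_BYTES = 32  # 256-bit key


def generate_key_file(path: str) -> None:
    """Create a new random key file readable only by its owner."""
    key = secrets.token_bytes(KEY_BYTES)
    # O_EXCL refuses to replace an existing key by accident
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, key)
        os.fsync(fd)
    finally:
        os.close(fd)


def load_key(path: str) -> bytes:
    """Read a key file and check that it has the expected length."""
    with open(path, "rb") as f:
        key = f.read()
    if len(key) != KEY_BYTES:
        raise ValueError("unexpected key length")
    return key


def destroy_key_file(path: str) -> None:
    """Overwrite the key file with random bytes, flush it, then remove it."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, secrets.token_bytes(size))
        os.fsync(fd)
    finally:
        os.close(fd)
    os.remove(path)
```

A daemon could call load_key at startup or on a reload command, which corresponds to the "overwriting a key file" and "restarting a daemon" options mentioned above.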

The /proc filesystem is a special, software-created filesystem that is used by the kernel to export information to the world. Each file under /proc is tied to a kernel function that generates the file’s "contents" on the fly when the file is read... Fully featured /proc entries can be complicated beasts; among other things, they can be written to as well as read from. (Corbet et al., 2005, p. 83)

The proc and other virtual filesystems provide a way to perform I/O and interact with running kernel modules.

Loading new keys into memory and overwriting keys does not present any particular difficulty for either approach. Physical security of the server and proper security configuration on the server are the most important things to consider.

Development and Portability

Neither approach should have much impact on network client application development. Server application and daemon development will be drastically impacted depending on implementation strategy. The ease of library linking and debugging are major benefits of user space implementation even without considering the higher calculation performance. If latency and low-level access are most important, kernel space implementation has an advantage.

Most software developers are familiar with user space; there are far fewer kernel and driver developers in the industry. From a management standpoint, if the performance goal is feasible, it is probably more cost effective to develop in user space. However, a transparent encryption method may already exist in the kernel, so a development team must decide if its performance is acceptable.

What will happen if a cryptographic vulnerability is discovered in AES? How easy will it be to change or update algorithms? If a user-space application links to a third party library, like OpenSSL, installing an updated library and restarting the server could be all that is necessary to fix the problem. If a completely new algorithm is needed, code will probably need to be modified. Also, user space programs can be tested with different libraries, library versions, and program versions concurrently. Updating an algorithm in kernel space always requires code modification because no libraries are linked during compilation. Updated kernel source code may fix or add algorithms to the cryptographic API in the kernel, which may be called by your kernel module. If your module uses its own encryption source code, as mine did, manual updating of the source code will be necessary. Also, only one version of the kernel can run at any time. I think that user space offers a significant advantage here.

Portability is also highly desired in code developed for a given platform. Porting server code to other platforms increases the potential profitability of a software product, which is the most important aspect in many cases. This is where user-space implementation really shines. A completely user-space server application for Linux can be trivially modified to run on countless other Unix-like operating systems from AIX to z/OS and possibly even Microsoft Windows. A completely or partially kernel-space server application for Linux is limited to Linux operating systems and that approach does not even guarantee that it will work for all Linux kernels. Changes from one kernel version to the next may render developed kernel modules broken in newer versions. I think user space has a clear advantage here also.

Conclusions

I have shown that application-level software data encryption is a superior choice for high performance servers when low-latency driver-level access is not required. This small caveat about low-latency and driver-level access is not to be taken lightly. The answer to the question originally posed in the title of this paper depends on how the encryption is going to be used and what the server is actually doing. However, I have shown that user-space execution can outperform kernel-space execution for common encryption goals such as network communication and data storage. I have also shown that raw encryption calculation performance is superior in user-space. Other research has arrived at similar conclusions.

It seems that high performance server applications should minimize the number of system calls and maximize the amount of calculation executed outside of the kernel. Opponents to this philosophy will claim that some specific server applications have such high performance requirements that the protections and features provided by the kernel are a hindrance. Usually this argument involves the performance of I/O, memory access, and system calls. These can usually be overcome by intelligent design and proper use of existing system features. Linux provides the ability to open files and devices with the O_DIRECT flag, which prevents the page cache from being used. This cuts the number of data copies in half. Linux also provides the ability to tag areas of allocated memory to control how the memory manager handles it. This can prevent critical areas of memory from being swapped to disk.
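One mechanism for keeping memory out of swap is the mlock system call mentioned in the drawbacks quoted earlier; whether it is the specific facility intended here is an assumption. The sketch below calls mlock and munlock through ctypes on a small buffer, with hypothetical helper names, and reports failure rather than raising because the call can be refused by resource limits.

```python
import ctypes
import ctypes.util

# Load the C library; on Linux a None name falls back to the running process.
_libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
_libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
_libc.mlock.restype = ctypes.c_int
_libc.munlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
_libc.munlock.restype = ctypes.c_int


def lock_buffer(buf) -> bool:
    """Ask the kernel to keep the pages holding buf in RAM; True on success."""
    return _libc.mlock(ctypes.addressof(buf), ctypes.sizeof(buf)) == 0


def unlock_buffer(buf) -> bool:
    """Release a lock taken with lock_buffer; True on success."""
    return _libc.munlock(ctypes.addressof(buf), ctypes.sizeof(buf)) == 0


if __name__ == "__main__":
    key_buffer = ctypes.create_string_buffer(32)   # e.g. room for one 256-bit key
    if lock_buffer(key_buffer):
        print("buffer locked in RAM")
        unlock_buffer(key_buffer)
    else:
        print("mlock refused, errno", ctypes.get_errno())
```

Locking only the few pages that hold sensitive or latency-critical data keeps the request small enough to fit within typical per-process limits.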

In most cases, application-level implementation is a better strategy if a convenient operating system implementation does not exist or is insufficient.

Follow-up

I later discovered the source of the performance difference, which is documented here.

References

Albinet, A., Arlat, J., & Fabre, J. (2004, June). Characterization of the Impact of Faulty Drivers on the Robustness of the Linux Kernel. Paper presented at the 2004 International Conference on Dependable Systems and Networks. Retrieved October 1, 2007, from the IEEE Computer Society Digital Library database.

Anderson, D., & Post, G. (2006). Management Information Systems. New York, NY: McGraw-Hill/Irwin.

Brecht, T., Li, L., Shukla, A., Subramanian, A., & Ward, P. (2004). Evaluating the Performance of User-Space and Kernel-Space Web Servers. Paper presented at the 14th IBM Center for Advanced Studies Conference in October 2004. Retrieved October 1, 2007, from http://cs.uwaterloo.ca/~brecht/papers/getpaper.php?file=cascon-2004.pdf

Callanan, S., Rai, A., Sivathanu, G., Traeger, A., & Zadok, E. (2005, April). Efficient and Safe Execution of User-Level Code in the Kernel. Paper presented at the 19th IEEE International Parallel and Distributed Processing Symposium. Retrieved October 1, 2007, from the IEEE Computer Society Digital Library database.

Corbet, J., Kroah-Hartman, G., & Rubini, A. (2005). Linux Device Drivers, Third Edition. Sebastopol, CA: O’Reilly Media, Inc.

Dave, J., Wright, C., & Zadok, E. (2003, October). Cryptographic File Systems Performance: What You Don't Know Can Hurt You. Paper presented at the Second IEEE International Security in Storage Workshop. Retrieved October 1, 2007, from the IEEE Computer Society Digital Library database.

Dworkin, M. (2001). Recommendation for Block Cipher Modes of Operation. National Institute of Standards and Technology Special Publication 800-38A. Retrieved October 1, 2007, from http://csrc.nist.gov/publications/nistpubs/800-38a/sp800-38a.pdf

Laoutaris, N., Merakos, L., Stavrakakis, I., & Xenakis, C. (2006). A generic characterization of the overheads imposed by IPsec and associated cryptographic algorithms. Computer Networks, 50(17), 3225-3241. Retrieved October 1, 2007, from the Science Direct database.

National Security Agency. (2005). Fact Sheet NSA Suite B Cryptography. Retrieved October 1, 2007, from http://www.nsa.gov/ia/industry/crypto_suite_b.cfm
