

JPEG 2000 Audio (j2kaudio)
Gregory Alan Hildstrom
13th November 2003
Abstract
This document details a method and implementation that uses JPEG 2000 wavelet
compression to store audio data. The goal of this project is to create an audio
compression scheme that achieves higher quality and higher compression than methods
that use the discrete cosine transform. For simplicity of implementation,
understanding, and testing, only CD-quality audio is used, although the same methods
could be applied to 24-bit, 4-channel, or 5.1-channel audio. The implementation is
called j2kaudio; it has command line flags to convert from wav to jp2 and from jp2
to wav. This technology may also be able to use JPEG 2000 code-streams (jpc files),
which may allow real-time streaming of j2kaudio data.
Contents
1 Introduction
2 CD Quality Audio Data
3 Image Data
4 Audio Matrix Manipulation
5 Audio Image Encoding
6 Results
6.1 Source Audio
6.2 Lossy Compression Results
6.2.1 LAME
6.2.2 j2kaudio
6.2.3 First Listen
6.2.4 Error Analysis
6.2.5 Summary
6.3 Lossless Compression Results
6.3.1 LPAC
6.3.2 FLAC
6.3.3 Summary
7 Conclusions
1 Introduction
The goal of this project is to store audio data in a compressed format using the JPEG
2000 wavelet image compression standard. The main hope and goal for this project
is to create an audio compression scheme that will achieve higher quality and higher
compression than methods that use the discrete cosine transform.
Audio data is linear and one-dimensional, and is usually stored in a one-dimensional
array for each audio channel. Images are two-dimensional and are usually stored in a
matrix. In order to use image compression algorithms on audio data, the audio needs
to be converted into a matrix so that it looks like image data. The approach taken
here is to store audio in an image file that is sample-rate pixels wide x
number-of-seconds pixels high. One channel of CD audio would be 44100 pixels wide x
60 pixels high for one minute of audio. The JPEG 2000 spec allows for both lossless
and lossy compression, which would be a great feature for audio files.
One distinct advantage of using the JPEG 2000 standard for audio compression
applies to multimedia applications that use MJPEG 2000 for video compression. MJPEG
2000 uses the JPEG 2000 standard to compress the individual frames of video, so the
same encoding and decoding engine could serve both video and audio, which greatly
reduces application complexity. If an AVI video file, or some new format, used
MJPEG 2000 compression for video and JPEG 2000 for audio, the software applications
that deal with the video could be greatly simplified. Hardware implementations that
create, manipulate, or play MJPEG 2000 video with JPEG 2000 audio would be simplified
in the same way. Using JPEG 2000 audio would be much simpler than using additional
code or hardware to encode the audio using MP3, Ogg Vorbis, FLAC, or PCM.
2 CD Quality Audio Data
When CD audio is stored in a wav file, it is stored as interleaved 16-bit binary
numbers. Each number is equivalent to a 16-bit short integer in C++, which can
range from -32768 to 32767. CD audio is sampled at 44100 Hz and has two audio
channels: left and right.
wav file representation   t0  t1  t2  t3  t4  t5  t6  t7  t8  t9
Channel 0 (left)          00  01  02  03  04  05  06  07  08  09
Channel 1 (right)         10  11  12  13  14  15  16  17  18  19
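
As a minimal sketch of the layout above (not taken from the j2kaudio source), the following Python splits interleaved 16-bit samples into one sequence per channel. The function name deinterleave and the toy sample values are assumptions chosen to mirror the table: left samples are 0, 1, 2, ... and right samples are 10, 11, 12, ...

```python
from array import array


def deinterleave(samples, n_channels=2):
    """Split interleaved PCM samples into one list per channel."""
    if len(samples) % n_channels != 0:
        raise ValueError("sample count is not a multiple of the channel count")
    # Channel ch owns every n_channels-th sample, starting at offset ch.
    return [list(samples[ch::n_channels]) for ch in range(n_channels)]


# Interleaved stereo frames (L0, R0, L1, R1, ...) held as signed 16-bit values.
interleaved = array("h", [0, 10, 1, 11, 2, 12, 3, 13])
left, right = deinterleave(interleaved)
```

The "h" type code stores signed short values, which matches the -32768 to 32767 range of CD audio samples.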
3 Image Data
Image data is usually a matrix of pixels. Many color images use 3-channel color,
which means that each pixel in an image is composed of 3 different colors, typically
red, green, and blue. Typical true color images use 8 bits for each color component,
which results in 24 bits representing each pixel. If we are trying to store 16-bit
numbers from audio data, 24-bit color will be horribly inefficient.
Grayscale images do not contain color information. Grayscale images can be created
that use 16 bits per pixel, which would be ideal for storing 16-bit audio data.
The only obstacles are getting the audio data into a highly compressible format and
retaining as much information about the audio as possible.
Grayscale image representation
     0   1   2   3   4   5   6   7   8
0   00  10  20  30  40  50  60  70  80
1   01  11  21  31  41  51  61  71  81
2   02  12  22  32  42  52  62  72  82
3   03  13  23  33  43  53  63  73  83
4   04  14  24  34  44  54  64  74  84
5   05  15  25  35  45  55  65  75  85
6   06  16  26  36  46  56  66  76  86
7   07  17  27  37  47  57  67  77  87
8   08  18  28  38  48  58  68  78  88
4 Audio Matrix Manipulation
Audio data gets read in as a samples-per-channel x n-channel matrix. One minute of
CD audio would be stored as a (44100 * 60) x 2 matrix. Several minutes of CD audio
can easily exceed tens of millions of data points, which may make the matrix too
long in one dimension to be treated as an image. The size of the direct 2-dimensional
audio matrix is also not inherently useful; the only information it contains about
the audio is the number of channels and the total number of samples per channel.
My idea is to manipulate the 2-dimensional audio matrix into a 3-dimensional matrix.
The 3d matrix would be n-channels deep, which means that there would be one 2d
matrix per channel. Each new 2d audio matrix would be sample rate x number of
seconds in size. One minute of CD audio would be represented by two matrices, one
for the left channel and one for the right channel, that are each 44100 x 60 pixels.
The dimensions of the image created from each audio channel therefore encode the
sample rate of the audio and the length of the audio track in seconds, both of which
are useful and easily extracted from image files.
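
The reshaping described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the j2kaudio code itself: the function name channel_to_rows is hypothetical, and a toy sample rate of 4 Hz keeps the example readable.

```python
def channel_to_rows(channel, sample_rate=44100):
    """Arrange one channel as rows of sample_rate samples, one row per second."""
    n_seconds = len(channel) // sample_rate  # a trailing partial second is dropped
    return [channel[r * sample_rate:(r + 1) * sample_rate] for r in range(n_seconds)]


# Toy example: at a "sample rate" of 4 Hz, 10 samples give 2 full rows.
rows = channel_to_rows(list(range(10)), sample_rate=4)
height, width = len(rows), len(rows[0])

# One minute of CD audio for a single channel gives a 44100 x 60 image.
cd_rows = channel_to_rows([0] * (44100 * 60))
```

Here width recovers the sample rate and height recovers the duration in whole seconds, which is the property the image dimensions are meant to carry.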
5 Audio Image Encoding
The 16-bit audio channels can be independently encoded as grayscale images or
potentially encoded as a single image with 2-channel color. JPEG 2000 seems to have
allowed for many different color component schemes. Normal 24-bit color has 3 color
components: red, green, and blue. JPEG 2000 also allows for 2, 4, 5, 12, and many
other numbers of color components to define a pixel. Grayscale 16-bit images would
be very useful for encoding mono or single audio channels independently.
CD wav files have 2 audio channels, which may be able to be encoded as 2 color
components of a pixel. The resultant image would be 32-bit color, with two 16-bit
color components. This 2-component image would contain both channels in a single
2d matrix of 32-bit pixels.
Encoding the audio data in 2 dimensions has several potential benefits. If you are
encoding a song that happens to be at 60 beats per minute, there is 1 beat per second.
Because each image row holds one second of audio, that beat may land at nearly the
same horizontal position in every row and appear as lines running vertically in
the image. The two dimensions allow the compression algorithm to compress features
in the audio that repeat over time, meaning over seconds or over the length of the
song, which linear DCT-type compression is not expected to do as well as this
application of wavelet compression.
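
To make the 60 beats per minute example concrete, the sketch below maps absolute sample positions to image coordinates. The function name beat_columns and the beat positions are hypothetical; the only assumption carried over from the text is that each row holds exactly one second of audio.

```python
def beat_columns(beat_positions, sample_rate=44100):
    """Map absolute sample indices to (row, column) image coordinates."""
    return [divmod(pos, sample_rate) for pos in beat_positions]


# Hypothetical 60 bpm track: one beat per second, first beat 1000 samples in.
beats = [1000 + k * 44100 for k in range(4)]
coords = beat_columns(beats)
columns = {col for _, col in coords}
```

Every beat falls in the same column on successive rows, which is what would show up as a vertical line in the image. At tempos that do not divide evenly into one-second rows, the beats would instead land in different columns from row to row.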
j2kaudio creates an image that is a whole number of seconds long, because each image
row is one second wide, so any fraction of a second at the end of an audio file is
truncated.
6 Results
6.1 Source Audio
My reference wav file was Everything Zen, by BUSH. I ripped the wav file from CD
using cdda2wav in Linux. The original wav file was 49062764B or about 49MB.
6.2 Lossy Compression Results
6.2.1 LAME
For comparison purposes I encoded the song using LAME, version 3.93, into an mp3
file. I used the command, lame -m s audio.wav audio.mp3, which forced stereo
encoding and used a nominal bit-rate of 128 kbit/s. The mp3 file was 4451264B or
about 4.4MB and required 26 seconds to encode and 10 seconds to decode.
6.2.2 j2kaudio
The ratio of the mp3 file size to the original file size determined the JPEG 2000
encoding rate of 0.090726. The jp2 file was 4448875B or about 4.4MB and required 54
seconds to encode and 24 seconds to decode.
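
The encoding rate quoted above follows directly from the two file sizes reported in this document. A short check in Python (the variable names are illustrative only):

```python
wav_bytes = 49062764  # original wav size from section 6.1
mp3_bytes = 4451264   # LAME output size from section 6.2.1
jp2_bytes = 4448875   # j2kaudio output size from section 6.2.2

# Target rate: the fraction of the original size that the mp3 occupies.
rate = mp3_bytes / wav_bytes
rate_rounded = round(rate, 6)

# Fraction of the original size that the jp2 file actually occupies.
achieved = jp2_bytes / wav_bytes
```

The rounded value matches the 0.090726 rate given in the text, and the jp2 file comes in slightly under that target.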
6.2.3 First Listen
The original wav file obviously sounded great. I was paying particular attention to
the sound of the cymbals and hi-hats, which are usually the first instruments to
suffer when using lossy compression. The mp3 file sounded pretty good, but the high
frequencies were lacking, as expected. The jp2 file did not sound quite as good as
either the original wav or the mp3 file. The jp2 file contains mid to high frequency
static sounds, which I believe are due to compression in the vertical direction,
meaning that the current sound samples are affected by samples that occur a second
or more earlier or later.
6.2.4 Error Analysis
I converted the compressed files to wavs, then loaded all three wavs into WaveMetrics
IGOR Pro. I chose to compute abs(originalwav[sample]-compressedwav[sample])
for both mp3 and j2kaudio. The mp3 absolute difference is much greater than the
j2kaudio difference, but that does not mean much for audio compression if the mp3
sounds better.
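
The per-sample absolute difference can be sketched in Python as follows. This is not the IGOR Pro procedure used for the analysis; the function names and the toy sample values are assumptions, and the sketch compares only the samples that both signals have in common.

```python
def absolute_differences(original, decoded):
    """Per-sample abs(original[i] - decoded[i]) over the common length."""
    n = min(len(original), len(decoded))
    return [abs(original[i] - decoded[i]) for i in range(n)]


def mean_absolute_difference(original, decoded):
    """Average of the per-sample absolute differences (0.0 if no samples)."""
    diffs = absolute_differences(original, decoded)
    return sum(diffs) / len(diffs) if diffs else 0.0


# Toy 16-bit sample values standing in for one channel of each wav.
original = [0, 100, -100]
decoded = [10, 90, -100]
diffs = absolute_differences(original, decoded)
mean_diff = mean_absolute_difference(original, decoded)
```

A comparison like this assumes the two signals are sample-aligned; this document does not state whether any offset between the decoded files was compensated before the differences were taken.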
6.2.5 Summary
Lossy Compression Method   Compressed Size (B)   Encoding Time (s)   Decoding Time (s)
LAME                       4451264               26                  10
j2kaudio                   4448875               54                  24
6.3 Lossless Compression Results
The lossless compression, which uses a JPEG 2000 encoding rate of 1, works well. The
reconstructed audio wav file is indistinguishable from the original. The only side
effect of my algorithm is that the reconstructed wav file has been truncated to a
whole number of seconds. The lossless compressed jp2 image file was 40251799B or
about 40MB and required 50 seconds to encode and 49 seconds to decode.
6.3.1 LPAC
I compressed the test song with lpac command line version 1.40, which used the lpac
codec version 3.08. LPAC compressed the song to 35281296B or about 35MB and
required 33 seconds to encode and 15 seconds to decode.
6.3.2 FLAC
I compressed the test song with flac version 1.1.0. FLAC compressed the song to
35420852B or about 35MB and required 45 seconds to encode and 4 seconds to decode.
6.3.3 Summary
Lossless Encoding Method   Compressed Size (B)   Encoding Time (s)   Decoding Time (s)
LPAC                       35281296              33                  15
FLAC                       35420852              45                  4
j2kaudio-2003-11-12        40251799              50                  49
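
For a size-only comparison, the compressed sizes in the table can be expressed as fractions of the original 49062764-byte wav file. The sketch below uses only figures reported in this document; the variable names are illustrative.

```python
original_bytes = 49062764  # source wav size from section 6.1

lossless_sizes = {
    "LPAC": 35281296,
    "FLAC": 35420852,
    "j2kaudio": 40251799,
}

# Compressed size as a fraction of the original (smaller is better).
ratios = {name: size / original_bytes for name, size in lossless_sizes.items()}
smallest = min(ratios, key=ratios.get)

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio:.1%} of original size")
```

Note that the j2kaudio figure is for audio truncated to a whole number of seconds, so its input was marginally shorter than what LPAC and FLAC compressed.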
7 Conclusions
These results describe the current state and performance of j2kaudio as of 2003-11-13.
j2kaudio is not currently up to par with LAME for lossy compression or with LPAC or
FLAC for lossless compression; there is a lot of work to do to make j2kaudio
competitive. There are many things about the encoding algorithm that can be changed
to improve speed, compression, and sound quality. j2kaudio is approaching the same
ball park as the other encoders, but it is unique in that it can be used to perform
lossless and lossy audio compression using the same algorithms that can be used for
video or image data.