Why Digital Audio Systems Upsample

Digital audio systems have long used elevated sampling rates for processing while down-sampling the product to 44.1 or 48 kHz for storage and transmission. These design decisions satisfy the requirements of the sampling theorem using easily implemented anti-aliasing and reconstruction filters.

References

  1. https://people.xiph.org/~xiphmont/demo/neil-young.html
  2. https://en.wikipedia.org/wiki/Dither
  3. https://en.wikipedia.org/wiki/Oversampling
  4. https://en.wikipedia.org/wiki/Butterworth_filter

Reference 1 gives a good explanation of the sampling theorem, of up-sampling for implementation convenience, and of the interaction of digital audio with human hearing. Reference 2 explains the theory behind dither in digital systems. Reference 3 explains oversampling techniques.

The Sampling Theorem

The sampling theorem states that any band-limited signal can be perfectly reconstructed from a sequence of uniformly spaced samples taken at more than twice the frequency of the highest spectral component present in the signal. The catch is that no energy may be present above one-half the sampling frequency and that ideal low-pass filters (brick-wall filters) must be used in both the recording and playback systems.

The sampling theorem is not a matter of belief. It is mathematically proven theory that describes the behavior of sampled data systems. Any well-behaved sampled data system must satisfy the sampling theorem's requirements. Sampled data systems that work do so because they meet those requirements.

As with any engineering theory, the devil is in the details of satisfying its requirements. The sampling theorem requires unrealizable ideal low pass anti-aliasing filters and reconstruction filters. The art is in designing a sampled data system that works with realizable filters. For audio systems, an important part of the art is in choosing and designing anti-aliasing and reconstruction filters that do not introduce non-musical artifacts to the reconstructed audio signal.

After the break, we'll explain how digital audio system engineers design real world digital recording and playback systems that can be built with realizable filters. As you might guess, it is over-sampling that makes this possible.

What is over-sampling?

Over-sampling is the practice of sampling an analog signal at a frequency much higher than the sampling theorem requires, usually 4 or 8 times the base rate. The over-sampled sample stream is processed digitally and is down-sampled for storage and transmission. The playback system reconstructs the high frequency sample stream as part of the digital-to-analog conversion process.
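The capture-then-store flow can be sketched in a few lines (a toy model with a made-up 1 kHz test tone; the digital processing and the pre-decimation low-pass filter are omitted):

```python
import math

FS_BASE = 44_100           # storage rate from the article
OVERSAMPLE = 4             # one of the common 4x / 8x factors
FS_HIGH = FS_BASE * OVERSAMPLE

def sample_sine(freq_hz, n, fs):
    """n samples of a unit-amplitude sine wave of freq_hz taken at rate fs."""
    return [math.sin(2 * math.pi * freq_hz * i / fs) for i in range(n)]

# Capture at the elevated rate, then keep every 4th sample for storage.
# (A real down-sampler low-pass filters first; this shows only the decimation.)
high_rate = sample_sine(1_000, 1_600, FS_HIGH)
stored = high_rate[::OVERSAMPLE]
```

Keeping every fourth sample of the 4x stream leaves exactly the samples that sampling directly at 44.1 kHz would have produced.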

Why up-sample to record?

The purpose of up-sampling is to allow the use of a well-behaved anti-aliasing filter at 20 kHz in the recorder by raising the frequency at which aliasing will actually occur to 96 kHz (half of the 192 kHz system working frequency).

The signals from microphones and instrument transducers are band limited by the mechanics of those devices. By raising the system aliasing frequency through oversampling, a 20 kHz Butterworth filter can achieve the needed rejection above half the system sampling frequency. Butterworth filters have a maximally flat amplitude response and well-behaved phase in the pass band.
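The rejection a Butterworth filter buys at the raised aliasing frequency follows from its magnitude response, |H(f)|² = 1/(1 + (f/fc)^2n). A quick sketch (the filter orders are illustrative, not a claim about any particular converter):

```python
import math

def butterworth_attenuation_db(f, fc, order):
    """Attenuation in dB of an order-n Butterworth low-pass at frequency f:
    10*log10(1 + (f/fc)^(2n))."""
    return 10 * math.log10(1 + (f / fc) ** (2 * order))

# Rejection of a 20 kHz filter at the 96 kHz aliasing frequency of a
# 192 kHz system, for a few plausible filter orders:
for order in (4, 6, 8):
    print(order, butterworth_attenuation_db(96_000, 20_000, order))
```

Even a modest-order filter achieves tens of dB of rejection at 96 kHz, which is the whole point of moving the aliasing frequency far above the 20 kHz break frequency.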

Note that aliasing is not distortion in the traditional sense. The sampling and reconstruction process folds the aliased spectral components back on top of the program material's high frequency components. The result is dramatically unmusical, hence the importance of an adequate anti-aliasing filter in the sampler.
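Where a given out-of-band component folds to is simple arithmetic (the 25 kHz tone is an illustrative value):

```python
def alias_frequency(f, fs):
    """Frequency at which a tone of f Hz appears after sampling at fs Hz."""
    f = f % fs
    return min(f, fs - f)

# An unfiltered 25 kHz ultrasonic component sampled at 44.1 kHz folds
# back on top of the program material near the top of the audio band:
print(alias_frequency(25_000, 44_100))  # 19100
```

The folded tone at 19.1 kHz bears no harmonic relationship to the program material, which is why aliasing sounds so unmusical.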

Upsampling in playback

The same magic works on playback. Up-sampling the 44.1 kHz CD sample stream by a factor of 2 or 4 allows a 20 kHz Butterworth filter to recreate the audio waveform from the sample stream. Raising the aliasing frequency well above the filter break frequency gives a good reconstruction. The original Sony and Philips players up-sampled from 44.1 to 88.2 kHz for exactly this reason.
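A common first step of playback up-sampling is zero-stuffing; the analog reconstruction filter (with a digital interpolation filter ahead of it) then removes the resulting spectral images. A minimal sketch, not the interpolator of any particular player:

```python
def zero_stuff(samples, factor):
    """Insert factor - 1 zeros between samples, raising the sample rate
    by factor. A low-pass filter afterward removes the spectral images,
    turning the zeros into interpolated values."""
    out = []
    for s in samples:
        out.append(s)
        out.extend([0.0] * (factor - 1))
    return out

print(zero_stuff([1.0, -1.0], 2))  # [1.0, 0.0, -1.0, 0.0]
```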

The reconstruction filter form is chosen to avoid ripple in the pass band and overshoot in the impulse response. Several filter families (Bessel, Butterworth, and Legendre) have these amplitude characteristics with differing phase behavior. The choice of filter type affects the voicing and stereo image of the playback converter.

Why mixers use long sample word-lengths

The reason digital mixers use 24 bit samples is to gain headroom for mixing down multiple channels to an output. Adding two 16 bit numbers produces a 17 bit result; adding four 16 bit numbers produces an 18 bit result. Storing an 18 bit result in 16 bits means shifting the result right 2 bits, discarding detail. Digital mixers and audio workstations avoid this problem in two ways. First, they use a longer 24 bit sample word internally. Second, many use 32 bit or 64 bit floating point for filter calculations.
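The headroom arithmetic generalizes: summing n full-scale words grows the result by ceil(log2 n) bits. A quick sketch (the channel counts are illustrative):

```python
import math

def bits_needed_for_sum(channel_bits, n_channels):
    """Word length required to sum n_channels full-scale words
    without overflow: channel_bits + ceil(log2(n_channels))."""
    return channel_bits + math.ceil(math.log2(n_channels))

print(bits_needed_for_sum(16, 2))    # 17
print(bits_needed_for_sum(16, 4))    # 18
print(bits_needed_for_sum(16, 256))  # 24
```

The last line shows why a 24 bit bus is comfortable: it can sum 256 full-scale 16 bit channels before overflow becomes possible.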

If the digital signal processor attempted to work with 16 bit words, each operation could lose precision (think detail) as the system continually tossed away the low order bits. And if a result exceeds full scale, overflow occurs, causing a total loss of the mix.

Digital overflow presents as a raspberry in the output stream. Overflow makes the resultant signal unlistenable, nay unrecognizable as anything other than a machine malfunction. Digital mixers and audio workstations take great pains to avoid overflow.
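The raspberry comes from wraparound: an overflowing two's-complement sum jumps from near positive full scale to near negative full scale. A toy model (the sample values are made up, and real mixers saturate or re-scale rather than wrap):

```python
def wrap16(x):
    """Two's-complement wraparound of an overflowing 16-bit accumulator."""
    return ((x + 32768) % 65536) - 32768

# Two loud samples whose sum exceeds full scale (32767) wrap to a large
# negative value, producing a harsh discontinuity in the output.
print(wrap16(30000 + 10000))  # -25536
```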

Are 16 bits enough for storage and transmission?

So how many bits do we need for storage and transmission? There are two practical considerations.

  • Playback system signal to noise ratio
  • The resolution of the listener's ears

Reference 1 gives a very good explanation of why 16 bits is enough for storage and transmission. Basically, 16 bits offers a dynamic range of 96 dB without dither and about 120 dB with dither. In comparison, a well mastered and pressed LP offers about 60 dB of dynamic range. A quiet concert hall with an audience has a similar dynamic range.
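The 96 dB figure follows from the quantization dynamic range of a linear PCM word, roughly 6.02 dB per bit. A quick check:

```python
import math

def dynamic_range_db(bits):
    """Dynamic range of a linear PCM word: 20*log10(2^bits)."""
    return 20 * math.log10(2 ** bits)

print(round(dynamic_range_db(16), 1))  # 96.3
print(round(dynamic_range_db(24), 1))  # 144.5
```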

Dithering is a technique for adding a small amount of Gaussian noise to the signal before sampling to influence the value of the least significant bit of the sample stream. Without dither, a signal with an amplitude of less than 1 bit would produce a sequence of zero samples. By adding a random signal as dither, the sequence of samples becomes a mix of zeros and non-zeros from which the small input can be reconstructed.

In effect the small signal is modulating a pseudo-random zero mean carrier signal. The reconstruction process recovers the low level modulating signal. When the signal becomes much larger than the dither, the presence of dither is not noticed in the output. The dither amplitude is generally less than the noise present in the playback amplifier and is well below the noise in the listening environment.

Use of dither makes it possible to encode signals having an amplitude less than the least significant bit. It is possible to represent and reconstruct sinusoidal signals on the order of -105 dB, where 0 dB represents a signal of maximum amplitude. With proper dither, it is possible to represent about 120 dB of dynamic range in a 16 bit sample. That is enough dynamic range to represent both an operating jackhammer 1 foot from the microphone and a mosquito flying about the room, and resolve them both.
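The effect is easy to demonstrate numerically. This sketch uses uniform rather than Gaussian dither for simplicity, and made-up amplitudes: a sine only 0.4 LSB in amplitude quantizes to all zeros without dither, while with dither the average over many periods recovers the waveform:

```python
import math
import random

random.seed(0)
LSB = 1.0  # quantizer step

def quantize(x):
    return round(x / LSB) * LSB

# A sine of 0.4 LSB amplitude, 64 samples per period, 1000 periods.
signal = [0.4 * math.sin(2 * math.pi * i / 64) for i in range(64_000)]

# Without dither every sample rounds to zero: the signal vanishes.
plain = [quantize(s) for s in signal]

# With +/-0.5 LSB of dither, the duty cycle of the LSB carries the
# signal; averaging the samples at each phase of the period recovers it.
dithered = [quantize(s + random.uniform(-0.5, 0.5)) for s in signal]
recovered = [sum(dithered[p::64]) / 1000 for p in range(64)]
```

`plain` is all zeros, while `recovered` traces out the 0.4 LSB sine, just as the modulated-carrier description above predicts.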

By using 16 bits for storage and transmission, a digital recording and playback system can represent any sound the listener is capable of resolving; the playback amplifier's internal noise becomes the limit for small signals. Increasing the sample word size above 16 bits provides no discernible benefit to the listener at a significant increase in storage and transmission cost.

Are High Sample Rate Sources of Discernible Benefit?

Reference 1 cites several double blind tests conducted in quiet studio environments using high sample rate (96 kHz or 192 kHz) program material. The playback system down-sampled to 44.1 kHz and 16 bits using professional signal processing tools, and professional converters produced playback signals from the two streams. In blind A/B testing, listeners picked the two sample streams equally over a large number of listeners and trial playbacks. The test audience was guessing: they could not differentiate between the two playback chains in a studio reference playback environment using studio reference grade playback equipment.