Audio compression formats, using MP3 and FLAC as examples. Digital audio compression methods

Lectures 15-16. Compression of audio information. Lecture plan: 1. General information. 2. Structure of an encoder with digital audio data compression. 3. Psychoacoustic models (PAM). 4. Basic coding systems.

1. Sound compression methods are based on eliminating redundancy in the signal. A distinction is made between the statistical and psychoacoustic redundancy of natural sound signals. Reduction of statistical redundancy exploits the properties of the sound signals themselves, while reduction of psychoacoustic redundancy exploits the properties of auditory perception.

Statistical redundancy is due to the correlation between adjacent samples of the time function of the sound signal obtained during sampling. To reduce it, fairly complex processing algorithms are used; no information is lost, and the original signal is represented in a more compact form that requires fewer bits to encode. However, even with fairly complex processing procedures, eliminating the statistical redundancy of audio signals reduces the required throughput of the communication channel by only 15...25% compared to its original value, which cannot be considered a revolutionary achievement.

Even after statistical redundancy has been eliminated, the bit rate of the digital stream for high-quality signals and the rate at which a human can process sound information still differ by at least several orders of magnitude.

This indicates the significant psychoacoustic redundancy of the primary digital sound signal and, therefore, the possibility of reducing it. The most promising methods from this point of view are those that take into account such properties of hearing as masking. If it is known which parts of the sound signal the ear perceives and which it does not because of masking, then only the audible parts need be selected and transmitted over the communication channel, while the inaudible ones can simply be discarded. In addition, signals can be quantized with the coarsest level resolution possible, so that the quantization distortions, which change in magnitude with the level of the signal itself, still remain inaudible: they are masked by the signal itself. However, after psychoacoustic redundancy has been eliminated, exact restoration of the shape of the time function of the sound signal during decoding is no longer possible.

Two features are important in practice. First, if compression of digital audio signals has already been applied in a communication channel, then applying it again leads to significant distortion; it is therefore important to know the "history" of the digital signal and which encoding methods have already been used.

Second, traditional quality assessment methods (for example, using tonal test signals) are not suitable for codecs with audio data compression; testing is carried out on digital and real audio signals.

Work on analyzing quality and assessing the effectiveness of digital audio compression algorithms for the purpose of their subsequent standardization began in 1988, when the international expert group MPEG (Moving Pictures Experts Group) was formed.

The result of the group's work at the first stage was the adoption, in November 1992, of the international standard MPEG-1 ISO/IEC 11172-3 (the suffix 3 after the standard number refers to the coding of audio signals).

Several other MPEG standards have since become widespread, such as MPEG-2 ISO/IEC 13818-3 and 13818-7 and MPEG-4 ISO/IEC 14496-3. In the United States, the Dolby AC-3 standard was developed as an alternative to the MPEG standards.

Somewhat later, several distinct digital platforms for radio and television emerged: DAB (Digital Audio Broadcasting), DRM (Digital Radio Mondiale), DVB (with terrestrial DVB-T, cable DVB-C and satellite DVB-S varieties) and ATSC (which uses Dolby AC-3).

The first group (DAB, DRM, DVB) is promoted by Europe, ATSC by the USA. These platforms differ primarily in the algorithm chosen for compressing the digital audio data, the type of digital modulation, and the noise-resistant coding procedure applied to the audio signal.

2. Despite the considerable variety of digital audio data compression algorithms, the structure of an encoder implementing such a signal processing algorithm can be represented in the form of a generalized diagram.

In the time and frequency segmentation block, the original audio signal is divided into subband components and segmented in time. The length of the encoded block depends on the time-domain characteristics of the audio signal.

In the absence of sharp amplitude changes, a so-called long block is used; when sharp amplitude changes occur, the length of the encoded block is reduced, which gives significantly higher time resolution.


The NMR (noise-to-mask ratio) model uses the following properties of hearing. Absolute threshold of hearing. Critical bands of hearing (the frequency groups into which a person divides a sound signal when perceiving it), which even have their own unit of pitch measurement, the bark.
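Since the critical-band (bark) scale comes up repeatedly below, a short sketch may help. One widely used approximation is Zwicker's formula; this is an assumption on my part, since the lecture does not specify which Hz-to-bark mapping it uses:

```python
import math

def hz_to_bark(f):
    # Zwicker's approximation of the critical-band rate (bark) for frequency f in Hz
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)
```

For example, `hz_to_bark(1000)` gives about 8.5, i.e. 1 kHz lies roughly in the ninth critical band; the whole audible range spans about 24 bark.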

Relative hearing threshold and masking in the frequency domain. When hearing is exposed to two signals simultaneously, one may not be heard against the background of the other; this is masking, and the relative threshold of audibility is the threshold of audibility of one signal in the presence of the other, taking frequency masking into account.

Masking in the time domain characterizes the dynamic properties of hearing, showing how the relative threshold of audibility changes in time when the masking and masked signals do not sound simultaneously.

A distinction is made between post-masking (a change in the audibility threshold after the end of a high-level signal) and pre-masking (a change in the audibility threshold before the arrival of a high-level signal). This kind of masking, in which the sounds do not overlap in time, is called temporal masking.

Post-masking appears in a time interval of 100...200 ms after the end of the masking signal, while pre-masking lasts only about 10 ms and depends strongly on the individual listener. For this reason, pre-masking is practically not used in digital coding.

The main calculation procedures are performed on the basis of psychoacoustic analysis implemented with the NMR model, which is based on the principle of additive (mutually independent) action of spectral components on the organ of hearing when they act simultaneously. The primary PCM signal is supplied to the input of the psychoacoustic analysis block of the encoder (slide 17) at a rate of 48 × 16 = 768 kbit/s. The following procedures are performed. Procedure 1: calculation of the energy spectrum of a block of the input audio signal and its normalization. Example: let the FFT length be N = 512 samples (Layer 1) or 1024 samples (Layer 2). Let n denote the sample number within the block and k the FFT coefficient index.

At the output of the FFT block we have a line spectrum X(k) in dB with frequency resolution ΔF = fs/N. With fs = 48 kHz and N = 1024 we obtain ΔF = 46.875 Hz. The FFT is performed with a Hann window function to suppress the Gibbs effect.
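The windowing step and the frequency-resolution arithmetic can be sketched as follows; a minimal illustration in Python, with function names of my own choosing:

```python
import math

def hann_window(N):
    # Hann window: tapers the block edges to suppress spectral leakage (Gibbs effect)
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / N) for n in range(N)]

fs = 48000          # sampling rate, Hz
N = 1024            # FFT length (Layer 2)
delta_f = fs / N    # frequency resolution of the line spectrum, Hz
```

Here `delta_f` is 46.875 Hz, matching the value above; the window is 0 at the block edges and 1 in the middle.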

The calculated spectrum is normalized; the maximum spectral component is assigned a level of 92 dB. Procedure 2: calculation of the signal energy of the block in the coding subbands. Procedure 3: calculation of the local maxima of the energy spectrum of the block. The algorithm here is simple: a spectral component X(k) is a local maximum if it is greater than the previous component X(k-1) and not less than the next one X(k+1). Procedure 4: formation of the list of tonal components. The frequency region near each local maximum is examined, and a local maximum X(k) is included in the list of tonal components if within this region it exceeds every component (except the two immediately neighbouring ones, which are taken into account when calculating its energy level) by no less than 7 dB. Procedure 5: formation of the list of non-tonal (noise-like) components, carried out after the list of tonal components has been formed; the tonal components and the neighbouring components already accounted for are excluded from consideration. This procedure is necessary in order to apply the appropriate masking coefficients. Procedure 6: thinning of the spectrum of tonal and non-tonal components, carried out with respect to masking outside the critical band of hearing, which is the same for both tonal and non-tonal components.
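Procedures 3 and 4 above are straightforward to express in code. A minimal sketch follows; the search-window width and the helper names are my own illustrative choices (the standard defines the examined neighbourhoods per frequency region):

```python
def local_maxima(X):
    # X: list of spectral levels in dB; a component is a local maximum
    # if it is greater than its predecessor and not less than its successor
    return [k for k in range(1, len(X) - 1) if X[k] > X[k - 1] and X[k] >= X[k + 1]]

def tonal_components(X, window=3, threshold_db=7.0):
    # a local maximum is tonal if it exceeds every component in its
    # neighbourhood (except the two immediate neighbours) by >= threshold_db
    tonal = []
    for k in local_maxima(X):
        neighbours = [j for j in range(max(0, k - window), min(len(X), k + window + 1))
                      if abs(j - k) > 1]
        if all(X[k] - X[j] >= threshold_db for j in neighbours):
            tonal.append(k)
    return tonal
```

On a toy spectrum with one sharp peak and one shallow bump, only the sharp peak is classified as tonal.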

After thinning, a new grid of spectral components is formed: in the first three subbands (0...2250 Hz) all components are taken into account; in the next three subbands (2250...4500 Hz), every second one; in the next three subbands (4500...6750 Hz), every fourth; and in the remaining 20 subbands, only every eighth spectral component.

Thus, if the upper frequency of the audio signal is 22500 Hz, then after such thinning a spectrum of 126 spectral components is obtained (the original spectrum had 512 components). Procedure 7: calculation of the masking coefficients. Procedure 8: calculation of the masking thresholds.

Procedure 9: calculation of the global masking threshold curve. Here a global masking threshold is formed for each subband, and the permissible noise level for quantization in each subband is determined; in particular, a histogram of the bit allocation used when encoding the subband samples is constructed.
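The bit-allocation step of Procedure 9 can be illustrated with a greedy loop: each extra quantization bit lowers the noise in a subband by roughly 6 dB, so bits go to whichever subband currently has the worst noise-to-mask ratio. This is a simplified sketch, not the exact procedure of the standard:

```python
def allocate_bits(smr_db, bit_pool, max_bits=15):
    # smr_db: signal-to-mask ratio per subband (dB); higher = noise more audible
    bits = [0] * len(smr_db)
    for _ in range(bit_pool):
        # current noise-to-mask ratio: each bit buys ~6.02 dB of noise reduction
        nmr = [smr_db[i] - 6.02 * bits[i] if bits[i] < max_bits else float("-inf")
               for i in range(len(bits))]
        worst = max(range(len(bits)), key=nmr.__getitem__)
        bits[worst] += 1
    return bits
```

A subband whose signal sits far above its masking threshold soaks up bits first; subbands whose noise is already masked receive none.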

4.1. The audio part of the MPEG-1 standard (ISO/IEC 11172-3) includes three algorithms of different levels of complexity: Layer I, Layer II and Layer III. The general structure of the encoding process is the same at all layers, but the layers differ in their intended use and internal mechanisms. Each layer has its own digital stream format, that is, its own total stream width and its own decoding algorithm. The layers differ in compression ratio and in the sound quality of the resulting streams. MPEG-1 is designed to encode signals digitized at sampling rates of 32, 44.1 and 48 kHz.

For all three layers the MPEG-1 standard normalizes the following digital stream rates: 32, 48, 56, 64, 96, 112, 192, 256, 384 and 448 kbit/s; the input signal resolution is 16 to 24 bits.

The standard input signal for an MPEG-1 encoder is an AES/EBU digital signal (a two-channel digital audio signal with 20...24 quantization bits per sample). The following encoder operating modes are provided: single channel (mono), dual channel (stereo or two mono channels) and joint stereo (a signal with partial separation of the right and left channels). The most important property of MPEG-1 is full backward compatibility of all three layers: every decoder can decode signals not only of its own layer but also of the lower layers.

The Layer I algorithm is based on the DCC (Digital Compact Cassette) format developed by Philips for recording on compact cassettes. Layer I coding is used where the degree of compression is not critical and the deciding factors are the complexity and cost of the encoder and decoder.

A Layer I encoder provides a high-quality digital audio stream at 384 kbit/s per stereo program. Layer II requires a more complex encoder and a slightly more complex decoder, but provides better compression: "transparency" of the channel is achieved already at 256 kbit/s. It allows up to 8 encoding/decoding cycles without noticeable degradation in sound quality. The Layer II algorithm is based on the MUSICAM format, popular in Europe.

The most complex, Layer III, includes all the basic compression tools: subband coding, an additional MDCT, entropy coding and an advanced psychoacoustic model. At the cost of a complex encoder and decoder it provides a high degree of compression: it is believed that a "transparent" channel is formed already at 128 kbit/s, although high-quality transmission is possible at lower rates as well. The standard recommends two psychoacoustic models: the simpler Model 1 and the more complex but higher-quality Model 2. They differ in the algorithm used to process the blocks. Both models can be used at all three layers, but Model 2 has a special modification for Layer III. MPEG-1 turned out to be the first international standard for digital compression of audio signals, which led to its widespread use in many areas:

broadcasting, sound recording and multimedia communication applications. Layer II is the most widely used; it has become part of the European standards for satellite, cable and terrestrial digital TV broadcasting, of audio broadcasting standards, of DVD recording, and of ITU Recommendations BS.1115 and J.52. Layer III (also called MP3) is widely used in integrated services digital networks (ISDN) and on the Internet; the vast majority of music files on the Internet are recorded in this format.

4.2. MPEG-2 is an extension of MPEG-1 towards multichannel audio. MPEG-2 takes into account the various multichannel transmission modes, including the five-channel format and seven-channel audio with two additional loudspeakers used in very-wide-screen cinemas, and extends these formats with a low-frequency channel.

4.3. Of all the many innovative approaches MPEG-4 offers, the audio sections of the standard are perhaps its most interesting and revolutionary part. The object-based approach to images is new to television, but it has previously been used in a number of animation systems.

As regards the audio part of the standard (so-called object sound), there is simply no system comparable to MPEG-4 in the complexity of its approach, the range of technologies used and the range of applications.

4.4. The fundamental difference of the MPEG-7 standard is that it was not developed to establish any rules for compressing audio and video data of a particular type. It is intended as a descriptive standard, regulating the characterization of multimedia of any type, down to analog data, recorded in different formats (for example, with different spatial and temporal frame resolutions).

The larger the memory of a WT (wavetable) card, the more realistic the sound (because more samples, recorded at higher resolution, are stored in memory). The General MIDI standard describes more than 200 instruments; storing samples of their sounds (the tables) requires at least 8 MB of memory (a minimum of about 20 KB for each sample).

The WF (Wave Form) sound-generation method is also known; it is based on converting sounds into complex mathematical formulas, which a powerful processor then uses to reproduce the sound. WF synthesis is expected to provide even better realism of musical-instrument sounds (relative to FM and WT technologies) with limited sound-file sizes.

A typical diagram for connecting external devices to an IBM PC sound card is shown in Fig. 4.8.

To reduce the data flow, other analog-signal coding methods (different from PCM) are used. For example, a coding technique based on the known characteristics of the analog signal can significantly reduce the amount of stored data: with so-called logarithmic coding, the analog signal is converted into a digital code determined by the logarithm of the signal magnitude (rather than by a linear transformation of it). The disadvantage of this method is the need for a priori information about the characteristics of the original signal.
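A well-known instance of such logarithmic coding is µ-law companding, used in telephony; the text does not name the scheme, so take this as an illustrative assumption. A sketch of the compress/expand pair for samples normalized to [-1, 1]:

```python
import math

def mu_law_compress(x, mu=255.0):
    # boost small amplitudes logarithmically; x in [-1, 1] -> y in [-1, 1]
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=255.0):
    # inverse mapping: recover the (approximately) original amplitude
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)
```

Quiet samples are expanded before quantization (0.01 maps to about 0.23), so fewer bits are wasted on the loud range where the ear is less sensitive to absolute error.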

Conversion methods that do not require a priori information about the source signal are also known. In differential pulse code modulation (DPCM, Differential Pulse Code Modulation), only the difference between the current and previous signal levels is stored (the difference requires fewer bits for its digital representation than the full amplitude). In delta modulation (DM, Delta Modulation), each sample consists of only one bit, which determines the sign of the change in the original signal (increase or decrease); delta modulation requires a higher sampling rate. Differential pulse-code-modulation technologies are subject to error that accumulates over time, so special measures are taken to periodically recalibrate the ADC.
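The DPCM idea is easy to show in code: only differences are stored, and the decoder integrates them back. This minimal sketch ignores quantization of the differences, which is the step where the accumulated-error problem mentioned above actually arises:

```python
def dpcm_encode(samples):
    # store only the difference from the previous sample
    prev, diffs = 0, []
    for s in samples:
        diffs.append(s - prev)
        prev = s
    return diffs

def dpcm_decode(diffs):
    # integrate the differences to reconstruct the signal
    level, out = 0, []
    for d in diffs:
        level += d
        out.append(level)
    return out
```

Without quantization the round trip is exact; once the differences are quantized, each rounding error propagates into all later samples, hence the periodic recalibration.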

The most widely used method for recording sound is adaptive differential pulse code modulation (ADPCM, Adaptive Differential Pulse Code Modulation), which uses 8- or 4-bit coding of the signal difference. The technology was first used by Creative Labs and provides data compression of up to 4:1.

However, other (software) methods of compressing/decompressing audio information are often used; among them, the most popular format in recent years has been MP3, developed by the Fraunhofer IIS institute (Fraunhofer Institut Integrierte Schaltungen, www.iis.fhg.de) together with THOMSON (the full MP3 format specification is published at www.mp3tech.org). The full name of the MP3 standard is MPEG-Audio Layer-3 (where MPEG stands for Moving Picture Expert Group; it should not be confused with the MPEG-3 standard once intended for use in high-definition television).

MP3 encodes data by dividing it into independent blocks called frames: during encoding, the original signal is divided into sections of equal duration (frames), each of which is encoded separately (to further reduce the amount of data, compression with the Huffman algorithm is applied); during decoding, the signal is reassembled from the sequence of decoded frames. The encoding process requires significant time, while decoding (during playback) is carried out on the fly.
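The Huffman step mentioned above assigns shorter codewords to more frequent values. A compact sketch of the code-construction algorithm follows; this is generic Huffman coding, not the specific code tables defined by the MP3 standard:

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    # build a prefix-free code: frequent symbols get shorter codewords
    # (assumes at least two distinct symbols)
    freq = Counter(symbols)
    heap = [[weight, [sym, ""]] for sym, weight in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)   # two least-frequent subtrees
        hi = heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return {sym: code for sym, code in heap[0][1:]}
```

For the input `"aaaabbc"` the frequent symbol `a` receives a 1-bit code while the rare `c` receives 2 bits, and no codeword is a prefix of another, so the stream decodes unambiguously.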

The MP3 format provides the best sound quality at the smallest file size. This is achieved by taking into account the characteristics of human hearing, including masking: a weak signal in one frequency range may be masked by a more powerful signal in an adjacent range (when they occur simultaneously) or by a powerful signal in the previous frame, which causes a temporary decrease in the ear's sensitivity to the signal of the current frame. In other words, secondary sounds that the ear cannot hear because of the simultaneous or preceding presence of another, louder sound are discarded. The format also takes into account the inability of most people to distinguish signals below a certain power level, a level that varies across frequency ranges. This process is called adaptive coding and makes it possible to save on the details of the sound that are least significant for human perception. The degree of compression (and therefore the quality) is determined not by the MP3 format itself but by the width of the data stream chosen during coding.

Audio information compressed with this technology can be streamed or stored in MP3 or WAV-MP3 files. The difference between the latter and the former is an additional WAV-file header, which makes it possible, if an MP3 codec (coder and decoder in one package) is installed in the system, to work with such a file using standard Windows tools. The compression parameters used when encoding a file can be varied within wide limits. Quality indistinguishable by most ordinary listeners from CD quality is achieved at bit rates of 112-128 kbit/s; the compression is approximately 14:1 relative to the original volume. Specialists usually require a transfer rate of 256-320 kbit/s (this corresponds to only double the speed of a CD player, yet is not available on most domestic Internet lines).

The fundamental feature of MPEG coding (of both video and audio information) is lossy compression. After packing and unpacking an audio file by the MP3 method, the result is not identical to the original bit for bit. On the contrary, packing purposefully excludes unimportant components from the signal, which leads to a dramatic increase in the compression ratio (up to 96:1 at telephone-channel quality).

A lot of user-friendly software has been written for MP3, and the production of hardware (pocket and car) MP3 players has been launched (MP3 supports up to 5 channels).

At the turn of 1998-1999 the company XingTech (www.xingtech.com) was the first to use variable bitrate technology (VBR, Variable Bit Rate). With VBR, the maximum acceptable loss level is specified, and the encoder selects the minimum bitrate sufficient to meet it; frames adjacent to each other in the final stream may therefore be encoded with different parameters.

According to experts, MP3 will remain relevant for the next decade (even despite the existence of the AAC and VQF formats and the heavily promoted MS format WMA). On the existence of other coders (converters of information from one format to another), see www.sulaco.org/mp3/free.html and www.xiph.org.

A possible competitor to MP3 in the (not so near) future could be the MPEG-4 format (more precisely, its audio component), based on an object-oriented approach to sound scenes: the BIFS language allows sound sources to be placed in the three-dimensional space of a scene, their characteristics to be controlled and effects to be applied to them independently of one another, etc.; future versions are expected to add the ability to set the acoustic parameters of the environment.

For encoding audio objects, MPEG-4 offers toolkits for both natural and synthesized sounds. MPEG-4 specifies the bitstream syntax and the decoding process in terms of these toolkits, allowing various compression algorithms to be used. The standard offers a range of bit rates for encoding natural sounds, from 2 to 128 kbit/s and higher. With variable-bitrate encoding, the minimum average rate may be even lower (about 1.2 kbit/s). For the highest-quality audio the AAC algorithm is used, which gives quality better than that of a CD with a stream more than 10 times smaller. Another possible algorithm for encoding natural sound is TwinVQ. For speech coding, the HVXC (Harmonic Vector eXcitation Coding) algorithm is proposed for rates of 2-4 kbit/s and CELP (Code Excited Linear Predictive) for rates of 4-24 kbit/s.

MPEG-4 also provides for speech synthesis. The synthesizer input receives the text to be spoken, as well as various voice "coloring" parameters: stress, pitch changes, phoneme pronunciation speed and so on. The "speaker's" gender, age, accent, etc. can also be set. A control code can be inserted into the text; upon detecting it, the synthesizer, synchronously with the pronunciation of the corresponding phoneme, transmits parameters or commands to other components of the system (for example, a stream of facial-animation parameters can be generated in parallel with the voice). As always, MPEG-4 defines the operating rules and interface of the synthesizer, but not its internal structure.

An interesting part of the "sound" component is the means for synthesizing arbitrary sounds and music. As a standard, MPEG-4 offers an approach developed in the cradle of many advanced technologies, the MIT Media Lab, and named SA (Structured Audio). This is not a specific synthesis method but a format for describing synthesis methods, in which any existing (and, allegedly, future) method can be specified. Two languages are provided for this: SAOL (Structured Audio Orchestra Language) and SASL (Structured Audio Score Language). The first specifies an orchestra, the second what this orchestra should play. An orchestra consists of instruments; each instrument is represented by a network of digital signal-processing elements (synthesizers, digital filters) which together synthesize the desired sound. With SAOL one can program almost any desired instrument, with a natural or artificial sound. First a set of instruments is loaded into the decoder, and then the SASL data stream makes this orchestra play, controlling the synthesis process; this ensures identical sound on all decoders with a very low input data rate and high control precision. With the advent of MPEG-4, the idea of ITV (Interactive TeleVision), which has been debated for several years and by which everyone understands something different (from simple video-on-demand to detective stories with multivariant plot development and viewer participation), actually takes on a more real and comprehensible shape.

The data on MPEG-4 are provided primarily as information on current trends in media recording and synthesis; those interested are referred to cselt.it/mpeg and www.mpeg.org. At the end of 2000 the MPEG development team planned to announce the completion of work on the MPEG-7 standard (official name: Multimedia Content Description Interface).

Audio information can also be compressed using special methods based on analysis of the data structure and subsequent compression with some losses.

A real possibility of audio processing comparable in quality to existing analog examples appeared only in the late 1980s. In 1988 the International Organization for Standardization (ISO) formed the MPEG (Moving Pictures Expert Group) committee, whose main task is to develop coding standards for moving pictures, sound and their combinations. Over the ten years of its existence the committee developed a number of standards on this subject and, summarizing extensive research in this area, recommended a number of specific data-storage formats differing in the quality of the results and the data rate.

Currently the three most common standards for storing video data are MPEG-1, MPEG-2 and MPEG-4. Within the first two there are also formats for storing audio information: Layer-1, Layer-2 and Layer-3. These three audio formats are defined for MPEG-1 and are used, with minor extensions, in MPEG-2. All three are similar to one another but use different trade-offs between compression and complexity. Layer-1 is the simplest: it does not require significant computation but provides only a modest degree of compression. Layer-3 is the most labour-intensive and provides the best compression; recently this format has gained enormous popularity and is often called MP3, a name that comes from the extension of audio files stored in this format.

The basic idea on which all lossy audio compression techniques are based is to neglect the subtle details of the original sound that lie beyond the range of what the human ear can perceive. Several points can be highlighted here.

Noise level. Sound compression relies on a simple fact: if a person stands close to a loud siren, they are unlikely to hear the conversation of people standing nearby. This happens not so much because the person pays attention to the loud sound, but because the human ear actually loses sounds that lie in the same frequency range as a louder sound. This effect is called masking, and it varies with differences in sound volume and frequency.

The second point is the division of the audio frequency band into subbands, each of which is processed separately. The encoding program isolates the loudest sounds in each band and uses this information to determine the acceptable noise level for that band. The best encoding programs also take into account the influence of neighbouring bands: a very loud sound in one band can extend the masking effect to nearby bands.
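The per-band bookkeeping described here can be sketched very simply. The fixed 12 dB offset below the band peak is an arbitrary illustrative number; real encoders derive the allowed noise level from a psychoacoustic model and from neighbouring bands:

```python
def band_noise_floor(spectrum_db, band_size, offset_db=12.0):
    # for each band, take the loudest component and allow quantization
    # noise up to offset_db below it (illustrative rule only)
    floors = []
    for start in range(0, len(spectrum_db), band_size):
        band = spectrum_db[start:start + band_size]
        floors.append(max(band) - offset_db)
    return floors
```

A band containing a loud component tolerates a high noise floor (coarse quantization), while a quiet band must be quantized more finely.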

Another element of coding is the use of a psychoacoustic model based on the characteristics of human sound perception. Compression with such a model removes frequencies that are clearly inaudible while preserving more carefully the sounds that the human ear distinguishes well. Unfortunately, there can be no exact mathematical formulas here: human perception of sound is a complex and not fully understood process, so compression methods are chosen on the basis of listening tests in which groups of experts compare differently compressed sounds. This leaves practically unlimited room for improving psychoacoustic models. Most existing algorithms for encoding the human voice rely on the high predictability of such a signal; universal MPEG compression algorithms try to apply this technique with varying success.

Another compression technique is the use of so-called joint stereo. It is known that human hearing can determine the direction only of medium frequencies; high and low frequencies are perceived as if detached from the source. This means that these frequencies can be encoded in a mono signal. In addition, the difference in complexity between the channel streams is exploited: for example, if there is complete silence in the right channel for some time, the "reserved" space is used to improve the quality of the left channel, or bits that did not fit into the stream a little earlier are placed there. The final stage of compression uses the Huffman algorithm, which improves the compression of relatively homogeneous signals that are poorly compressed by the techniques described above. Compression algorithms built on these ideas make it possible to achieve compression ratios of 10:1 or higher with virtually no audible loss in sound quality. During encoding the required compression level is set, and the algorithm achieves it at the expense of quality loss; the required level is usually specified as a bit rate measured in kbit/s.

As an initial step of image processing, the MPEG-1 and MPEG-2 compression formats split reference frames into several equal blocks, which are then subjected to the discrete cosine transform (DCT). Compared with MPEG-1, the MPEG-2 format provides better image resolution at a higher video data rate through the use of new compression and redundancy-removal algorithms, as well as encoding of the output data stream. The MPEG-2 format also allows the compression level to be selected via the quantization accuracy. For video with a resolution of 352x288 pixels, the MPEG-1 format provides a transmission rate of 1.2-3 Mbit/s, while MPEG-2 provides up to 4 Mbit/s.
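The block transform at the heart of this step is the DCT. A naive, unnormalized DCT-II of a one-dimensional block is sketched below; real codecs use a fast, normalized 2-D version, so this is purely illustrative:

```python
import math

def dct2_1d(x):
    # naive DCT-II: X[k] = sum_n x[n] * cos(pi * (n + 0.5) * k / N)
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
            for k in range(N)]
```

For a constant (flat) block all the energy lands in the first coefficient and the rest are zero, which is exactly why smooth image blocks compress so well after the transform.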

Compared to MPEG-1, the MPEG-2 compression format has the following advantages:

  • MPEG-2 provides scalability of different levels of image quality in a single video stream.
  • In the MPEG-2 compression format, the accuracy of motion vectors is increased to 1/2 pixel.
  • The user can select an arbitrary precision of the discrete cosine transform.
  • The MPEG-2 compression format includes additional prediction modes.

MPEG-4 uses so-called fractal image compression technology. Fractal (contour-based) compression involves extracting the contours and textures of objects from the image. Contours are represented as so-called splines (polynomial functions) and are encoded by reference points. Textures can be represented as coefficients of a spatial frequency transform (for example, a discrete cosine or wavelet transform).

The range of data rates that the MPEG-4 video compression format supports is much wider than those of MPEG-1 and MPEG-2; further development by specialists is aimed at completely replacing the processing methods used by the MPEG-2 format. MPEG-4 includes progressive and interlaced scanning techniques and supports arbitrary spatial resolutions and bit rates ranging from 5 kbit/s to 10 Mbit/s. MPEG-4 has an improved compression algorithm whose quality and efficiency are increased at all supported bit rates.



Audio compression for music lovers

The truth about high bitrates with lossy compression

Preface

In the understanding of most people the word music lover is most often associated with a person who not only loves and collects music, but also appreciates high-quality music, not only in artistic and aesthetic terms, but also the quality of the recording of the phonogram itself. Just think, just a few years ago the audio CD was considered the standard for music quality, but a computer, even in my dreams, could not compete with CD quality. However, time is a great joker, and often likes to turn everything upside down. It would seem that quite a bit of time passed, some year or two and... that’s it, the CD on the PC receded into the background. Don’t ask “why?”, you yourself know the answer to this question. It's all because of the revolution in the world of sound on a computer - audio compression (hereinafter referred to as audio compression implies lossy compression to reduce the size of the audio file), which made it possible to store music on the hard drive, a lot of music! Moreover, it became possible to exchange it via the Internet. New sound cards have been released that are capable of squeezing almost studio quality out of seemingly useless hardware in terms of music. Today, even if you have a computer that is not very fast in terms of performance, if you buy a Creative SoundBlaster Live sound card! and remembering that since Soviet times you have had a good amplifier and high-quality acoustics, you will get nothing more than a high-quality music center, the sound of which is inferior only to very expensive audio equipment (medium or even the highest Hi-Fi category). Add to this the availability of music files, and you will understand that you have power in your hands. And then a revolution occurs, and you understand that a CD is no longer so convenient, something completely different fascinates you - the magic signs of “MP3”. 
You can neither eat nor sleep, faced with a seemingly insoluble chicken-and-egg question: what to "squeeze" with and, most importantly, how to "squeeze"...

Of the audio compression formats existing today, three deserve attention, in my opinion: MP3 (MPEG-1 Audio Layer III), LQT (a member of the MPEG-2 AAC / MPEG-4 family), and the completely new OGG format (Ogg Vorbis), developed by a group of enthusiasts:

  • MP3 is today the most common of them (primarily because it is free). Let me remind you that it was the MP3 format that made the victorious march of compressed audio possible. However, as often happens with pioneers, it is gradually losing ground to newer and better formats.
  • The second format, LQT, represents a new direction in audio coding algorithms, the AAC family. It is a fairly high-quality but commercial and closed format.
  • OGG became widely known to the public this summer and is developing rapidly; soon (with the release versions of the encoder and decoder) it should beat MP3, offering better sound quality in smaller files.

I will not give detailed descriptions of the technologies and formats here; you can easily find them yourself. There will be only facts, conclusions and recommendations; I plan to present in-depth research on each format in separate articles.

The task

I decided to pit the three formats head to head in order to obtain the highest-quality sound at the minimum file size. Several samples were selected for the test (here a sample is a short fragment cut from a PCM file) from compositions of two types. The first type is a very dense, loud sound with amplitude normalization ("vertical" compaction of the sound so that a 24-bit master fits into 16 bits) and dynamic range compression (so that all instruments always sound loud). For the first type (as in my previous tests) the composition Crush On You from the album Have A Nice Day by Roxette was chosen; three samples of 15-20 seconds from different parts of the composition were studied. The second type is a clean and transparent sound (a light orchestral or acoustic arrangement); it was taken from the composition Mano a Mano from the album Tango by the famous pianist Richard Clayderman.

Why these particular recordings? The Roxette samples have very strong dynamic compression: the amplitude very often hits the maximum (which is bad), leading to overload of the reproducing equipment and severe distortion.

On such samples the encoders have to work in extreme mode, so any distortion becomes easily audible, because coding distortions are added to the distortions already present in the original. You may ask: why then take such a sample for the test? Because it is very much needed. The vast majority of albums released today are recorded this way, so an encoder must be tolerant of overloaded audio.

With Clayderman's samples the situation is diametrically opposite. The original analog recording, after very high-quality digital remastering, was recorded on a CD, without dynamic compression.

Great sound, very pleasant and soft highs. We will pay special attention to them during the analysis and try to preserve them. But these are the frequencies that will be most difficult for coders to convey.

What do we compress with?

My research into reference quality for different MP3 bitrates and encoders is embodied in the OrlSoft MPeg eXtension program; the encoding parameters there were selected on the basis of test results.

The undisputed leader in quality at high bitrates is the LAME encoder. The Fraunhofer IIS encoders are still good only at low bitrates, 128 and 160 kbps. I won't even talk about the others, except for one warning: NEVER deal with encoders based on the XING code (the best-known representative is Audio Catalyst). They are the worst; the sound is simply terrible.

For most users of the MP3 format, the question of sound quality usually comes down to "256 or 320? Maybe try VBR?", and this question torments them every day. Not all recordings sound good at 256: there are strong losses in the high frequencies, both audible and visible in measurements. With the VBR (variable bitrate) mode it often happens that the music sounds better to the ear than at 256, but this cannot be taken as a general rule. Encode recordings of little value or mediocre quality this way and you can't go wrong. My VBR parameters are chosen to obtain the maximum quality VBR can give.

For the commercial LQT format there is only the proprietary encoder from its authors, Liquifier Pro, so that is what we encode with. Note that the LQT format is based on VBR encoding from the outset, so it simply offers several modes along the lines of "bad", "good" and "excellent". Naturally, for our tests we take the "excellent" (Audiophile) mode, which yields a stream from 192 to 256, most often 200-220 kbps. Let me remind you that the LQT format is based on the MPEG-2 AAC family of algorithms; moreover, it is the highest-quality implementation of AAC to date (verified against its analogues).

The OGG format is a relative of MP3, but it contains a different psychoacoustic model and some technical innovations that MP3 lacks. To begin with, OGG supports only VBR mode: the user sets an approximate bitrate, and the encoder tries to stay as close to it as possible. The range is extremely wide, from 8 to 512 kbps, with a much finer step between available bitrates than MP3. The upper limit is as much as 512 kbps, while today's MP3 encoders really only "pull" up to 320. You may ask: can 320 really not be enough? Yes, it happens, but rarely.

Roxette samples

Well, now we come to the most interesting part. Let's start with my auditory sensations.

With MP3 at a 256 kbit/s stream, disturbances in the high frequencies are clearly audible. Not only is a considerable part of them missing from the sound, but strong distortion, wheezing, metallic clanging and other "charms" are mixed in. This is a sign that 256 is clearly not enough, so we need to try higher. We take a sample compressed at 320. The sound changes significantly, and this is a completely different matter: the top is in place, and no difference from the original is detected by ear. For the purity of the experiment, let's see what happens in variable bitrate mode. We get an average bitrate of 290 kbit/s, which already suggests that 256 will not be enough for the sample under study. Indeed, to the ear a sample encoded in VBR mode sounds a little better than 256, but clearly does not reach the sound of 320. So with MP3, only the 320 kbit/s mode, i.e. the maximum possible, is suitable for high-quality compression.

Now let's take OGG, our "modified MP3". The encoder offers five approximate bitrates: 128, 160, 192, 256 and 350. Let's try 192 and 256; we won't take 350, because we already know that MP3 at 320 kbit/s delivers excellent quality, so there seems to be no need for anything higher. For the 192 mode we get an average stream of 226 kbps, and for the 256 mode as much as 315 kbps. So much for accuracy. Such a large deviation from the target is a sign of very hard-to-encode material; on a less dense sample the accuracy would be higher. To be honest, I spent a long time comparing MP3 at 320 and OGG at 315 and came to the conclusion that both sound almost identical to the original. But they are based on different psychoacoustic models, and their sound coloring differs; personally I liked MP3 a little more. This is a genuinely debatable point, though: the OGG encoder is still only a beta version, and when it is released I think it should surpass MP3 in quality. Comparing each separately with the original, I was inclined to believe that OGG is closer in sound to the original, but something is wrong with this encoder's upper frequencies, and because of that MP3 sounds a little better. It goes without saying that in the 350 mode (the average bitrate came out at 365) OGG reproduces the original "perfectly".

Now about a little-known format widely advertised as the "highest quality": LQT. And indeed, overall it sounds very good; however, after listening I realized what I didn't like about its sound. It does not distort the high frequencies the way MP3 at 256 kbps does, but it smears the sound, and smears it a lot: sharp sounds blur in time. Yes, this is bad. On the other hand, it is pointless to compare LQT at a bitrate of only 230 kbit/s with MP3 at the same bitrate: there MP3 is inferior in overall sound. Of course, there is something to complain about on both sides: MP3 loses and distorts the upper frequencies, while LQT somewhat "drops" the mid frequencies and smears the highs. In the end it is a matter of taste, but that is a topic for another article; today we are talking only about the higher bitrates. Yes, LQT gives good quality, but by no means superb. Apparently the bitrate is simply insufficient: if a higher-bitrate mode appears in LQT, it should beat even 320 kbps MP3 on recordings like the one under study.

Those were my purely subjective impressions; let's now move on to more objective tests. We examine the spectra of the samples recognized as the best (320 for MP3, 315 for OGG and 230 for LQT). The diagram presented is a so-called sonogram, a time-frequency representation of the sound: time runs along the horizontal axis, and a linear frequency scale runs along the vertical axis.
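A sonogram of this kind is easy to compute yourself with a short-time Fourier transform. Below is a naive pure-Python sketch (real analyzers use an FFT; the frame length, hop size and test tone are my own illustrative choices, not parameters of any of the encoders discussed):

```python
import cmath
import math

def stft_magnitudes(signal, frame_len=64, hop=32):
    """Naive short-time Fourier transform: for each frame, return the
    magnitudes of the first frame_len // 2 DFT bins (a linear frequency
    axis, like the vertical axis of the sonogram)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Hann window reduces leakage between neighbouring bins
        windowed = [x * 0.5 * (1.0 - math.cos(2.0 * math.pi * n / frame_len))
                    for n, x in enumerate(frame)]
        mags = []
        for k in range(frame_len // 2):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(windowed))
            mags.append(abs(acc))
        frames.append(mags)
    return frames  # outer index = time, inner index = frequency

# A test tone that falls exactly into DFT bin 4 of a 64-point frame
tone = [math.sin(2 * math.pi * 4 * n / 64) for n in range(256)]
sono = stft_magnitudes(tone)
peak_bin = max(range(len(sono[0])), key=lambda k: sono[0][k])
print(peak_bin)  # -> 4: the energy concentrates in the tone's bin
```

Plotting `frames` as an image (time across, bin index up, magnitude as brightness) gives exactly the kind of picture discussed here.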

Did you look closely? Here is clear confirmation of my words: the newest Ogg Vorbis format in 256 mode is clearly not up to par; the frequency cut-off is visible to the naked eye. The "super-commercial" LQT format seems to convey the high-frequency range even better than LAME, but its overall quality is worse. The fact is that LQT has no pure stereo mode: in effect it is always Joint Stereo (the encoder compresses the left channel first and then encodes only the difference between left and right). Because of this, the highs are smeared when the bitrate runs short, which is clearly visible in the illustrations; this conclusion is also easily confirmed by examining the signal in the MS matrix, i.e. switching it to mid (center) + side (stereo difference) mode. As for the LAME sample, everything is just great: the upper frequencies are slightly cut off, but tolerably, and no visible dropouts were noted.

Let's summarize. At the finish line for the Roxette sample, the OGG 256 kbit/s and LQT entries left the race, while the OGG 350 kbit/s sample is not inferior to the leader. However, let's not bury the new format ahead of time; we'll wait for the release and then run the tests again: OGG 256 vs LAME 320.

Richard Clayderman samples

With the Roxette samples everything seems clear: for now, dense sound is best compressed with the LAME encoder in 320 kbps mode. What about a more transparent sound? Let's first try the 256 kbit/s mode, which in theory should make everyone happy. Result: the low frequencies seem to be in place, and the mids too, but the highs... the beautiful high end is gone. The high frequencies as such remain, without major losses, but the sound of the cymbals, which is very hard not to notice in this recording, has become somewhat synthetic, harsh and very unpleasant. Such a sound has no right to claim the title of quality. Well, we'll have to use 320 again, though I really wanted to squeeze it into 256... Compared with 256, the transmission of the high frequencies at 320 is much better. Compared with the original, however, one can hear that the recording is still unsatisfactory in quality. After comparing a few more samples it becomes obvious that these are errors of the psychoacoustic model: even at 320 kbit/s, MP3 does not render high frequencies properly on recordings of this type. The highs become sharper and more metallic, with a synthetic flavor, and, oddly enough, they seem louder (frequency response measurements do not show this; it is a purely auditory effect).

Now let's explore Ogg Vorbis. As in the previous test, we take samples compressed in 256 kbit/s mode. After the failure with MP3, the result is hard to believe: the sound of Ogg Vorbis is better in every respect and beyond comparison with what LAME produces at 320 kbps! It is also very difficult to hear any difference from the original. Ogg Vorbis at an average bitrate of 287 beat LAME at 320. This is exactly what I said at the beginning of the article: the OGG format may well beat MP3.

Okay, and what can the much-praised LQT format show us at a bitrate of only 252? Here too the result is striking: an extremely close match to the original! At least, the difference is so small that it can be considered insignificant. Note an interesting fact as well: when encoding the Roxette samples the average bitrate was about 230 kbps, while on the seemingly simpler Clayderman samples it was 250 kbps. This suggests that LQT adapts much better to the actual sound of the music and takes its nuances into account more accurately. A great format. All it needs is a normal encoder without frills and a slightly higher bitrate, so that it can also cope with more complex samples.

These were my subjective “auditory” studies. Now let's look at the frequency response.

And again, spectral analysis only confirms the conclusions drawn from listening: LQT produces simply outstanding results, this time better than LAME. Excellent transmission of the frequency range, and the losses above 21 kHz are distant high-frequency noise whose removal is even welcome. LAME is behind, but not by much; as expected, MP3's frequency range is in order. The spectrum of the Ogg Vorbis sample, however, was disappointing: look at how the highs are cut. Yet it sounds better than one might expect from its spectrum. Apparently, by cutting some frequencies it manages to convey the overall sound picture more accurately.

So what do we get in the end? Two leaders: LAME and LQT at maximum bitrate. OGG is hot on MP3's heels and will win in the future if its developers carry their idea through to the end: smaller size and better quality.

Delta Signal Research

The MP3 format, thanks to its higher bitrate, is better on most recordings. It loses ground, however, when we are dealing with very high-quality sound; there LQT is the absolute favorite. But the difference between 256 and 320 is not that big, so it can usually be sacrificed for the sake of the more convenient and widespread format. Many people, including me, do exactly that with their music library, and simply buy especially high-quality recordings on disc.

All this is of course good, but the two formats sound different, and that bothers many. There is another interesting study one can do: calculate the difference signal (hereinafter, the delta signal) of two samples and thereby find out how they differ. This is, of course, a purely numerical study, since the difference might not be large enough to be audible. In our case, however, everything turned out quite differently.
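Computing a delta signal is straightforward: subtract two time-aligned decoded PCM sequences and measure the level of the remainder. The sketch below does exactly that; the `mp3_like` and `lqt_like` names and the injected error are purely illustrative stand-ins, not real codec output:

```python
import math

def delta_signal_level_db(a, b):
    """RMS level of the difference of two aligned PCM sequences,
    in dB relative to full scale (here full scale = 1.0)."""
    assert len(a) == len(b)
    delta = [x - y for x, y in zip(a, b)]
    rms = math.sqrt(sum(d * d for d in delta) / len(delta))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")

# Toy stand-ins for two decoded codec outputs: the second copy is the
# first with a small broadband-style error added.
mp3_like = [math.sin(2 * math.pi * 440 * n / 48000) for n in range(4800)]
lqt_like = [x + 0.03 * math.sin(2 * math.pi * 9000 * n / 48000)
            for n, x in enumerate(mp3_like)]
level = delta_signal_level_db(mp3_like, lqt_like)
print(round(level, 1))  # about -33.5 dB for this synthetic error
```

Identical inputs give minus infinity; the closer the level creeps toward 0 dB, the more the two decoders disagree.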

The level of the difference signal reaches -25 dB, and its spectrum looks very much like broadband noise. If you listen to the delta signal, it sounds like a broadband mix of distortions; in other words, you can clearly hear the difference between the psychoacoustic models of MP3 and LQT.

Comparing MP3 with the OGG format using the same scheme, we did not get anything new (the difference, of course, is smaller, but it is still significant):

Similar results are obtained for the pair LQT and OGG.

The results of the delta-signal study indicate that the psychoacoustic models of the three formats differ greatly from one another, and that comparing them by frequency-response differences therefore makes no sense.

Conclusion

Well, let's try to draw some final conclusions, presenting them in the form of practical recommendations:

  1. LAME is the best of the MP3-format encoders; it delivers almost the maximum that can be obtained from MP3. For all very loud and dense recordings I recommend LAME at 320.
  2. OGG is a structural modification of the MP3 idea with a new psychoacoustic model, whose mathematics and practical implementation differ fundamentally from MP3. For recordings of little value or low quality, OGG in 192 kbit/s mode will do (or LQT in 128 Transparent mode, averaging 160-180 kbit/s).
  3. Unlike MP3 and OGG, which are MPEG-1 encoders, the LQT format is based on the MPEG-2 AAC specification. AAC delivers significantly better quality at lower bitrates thanks to fundamentally different audio processing. For recordings of average value I recommend LQT (at maximum) or, at your choice (the difference between them is small), OGG in 256 kbps mode or LAME at 256. It is better not to use the LAME encoder's VBR mode; it is noticeably worse.
  4. For very high-quality recordings, where even at 320 kbps something significant is clearly missing from the sound of the sample, try encoding with the Ogg Vorbis encoder at 350 kbps.
  5. If lossy-compressed sound still does not satisfy you, you will have to buy the compositions you like on a CD-DA disc.

Perhaps some part of the article interested you more. Write to me - I will be very glad to hear your feedback.

MINISTRY OF AGRICULTURE

FEDERAL STATE EDUCATIONAL INSTITUTION OF HIGHER PROFESSIONAL EDUCATION

STAVROPOL STATE AGRARIAN UNIVERSITY

Faculty of Economics

Department of Applied Informatics

INDEPENDENT

CONTROLLED WORK

in the discipline "Multimedia"

Topic “Compression of audio information”

Completed:

2PO group student

Checked:

Associate Professor of the Department of PI,

Ph.D., Associate Professor

Stavropol, 2011

COMPRESSION OF AUDIO INFORMATION

General information

During primary encoding in the studio channel, uniform quantization of audio signal samples is used with a resolution of ∆A = 16...24 bits/sample at a sampling frequency f = 44.1...96 kHz. In studio-quality channels usually ∆A = 16 bits/sample, f = 48 kHz, and the frequency band of the encoded audio signal is ∆F = 20...20000 Hz. The dynamic range of such a digital channel is about 96 dB. If f = 48 kHz and ∆A = 16 bits/sample, the digital stream rate when transmitting one such signal is V = 48 × 16 = 768 kbit/s. Transmitting an audio signal in the 5.1 (Dolby Digital) format, or 3/2 plus an ultra-low-frequency channel (Dolby Surround, Dolby Pro Logic, Dolby THX), therefore requires a total communication channel capacity of more than 3.84 Mbit/s. Yet a person is capable of consciously processing only about 100 bit/s of information with his senses. We can therefore speak of the significant redundancy inherent in primary digital audio signals.
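The arithmetic behind these figures is a one-line calculation; the sketch below (the function name and parameters are illustrative) reproduces the single-channel and multichannel values:

```python
def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample, channels=1):
    """Uncompressed PCM bitrate: V = f * dA * channels, in kbit/s."""
    return sample_rate_hz * bits_per_sample * channels / 1000

mono = pcm_bitrate_kbps(48_000, 16)         # one studio-quality channel
surround = pcm_bitrate_kbps(48_000, 16, 6)  # 5.1: five full channels + LFE
print(mono, surround)  # 768.0 and 4608.0 kbit/s
```

Five full-band channels alone already take 5 × 768 = 3840 kbit/s = 3.84 Mbit/s, which is where the "more than 3.84 Mbit/s" figure comes from.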

A distinction is made between statistical and psychoacoustic redundancy of primary digital signals. The reduction of statistical redundancy is based on taking into account the properties of the sound signals themselves, and psychoacoustic redundancy is based on taking into account the properties of auditory perception.

Statistical redundancy is due to the correlation between adjacent samples of the time function of the audio signal after sampling. Fairly complex processing algorithms are used to reduce it. They lose no information: the original signal is merely represented in a more compact form that requires fewer bits to encode, and all of these algorithms allow the original signal to be restored without distortion on inverse transformation.

Orthogonal transforms are most often used for this purpose. The optimal one from this point of view is the Karhunen-Loeve transform, but its implementation requires significant computational cost. The modified discrete cosine transform (MDCT) is only slightly inferior in efficiency, and fast computational algorithms have been developed for it. In addition, there is a simple relationship between the familiar Fourier transform coefficients and the MDCT coefficients, which allows the results of calculations to be presented in a form consistent with the operation of the hearing mechanisms.

Coding methods that take into account the statistics of audio signals (for example, the probability of occurrence of signal levels of different sizes) make it possible to reduce the digital stream rate further. An example is Huffman coding, where the most probable signal values are assigned shorter codewords, while sample values whose probability of occurrence is low are encoded with longer codewords. It is for these two reasons that the most effective compression algorithms for digital audio data encode not the audio samples themselves but the MDCT coefficients, and use Huffman code tables to encode them. Note that the number of such tables is quite large, and each of them is adapted to sound material of a certain genre.
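The Huffman idea mentioned above (shorter codewords for more probable values) can be illustrated with a minimal sketch. It builds a code over a toy stream of quantized coefficients and reports only the code lengths; the real MPEG code tables are fixed in the standard, not built on the fly like this:

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Build a Huffman tree over the symbol stream and return the code
    length assigned to each distinct symbol (shorter = more probable)."""
    freq = Counter(symbols)
    # heap items: (weight, tiebreak, {symbol: depth_so_far})
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        # merging two subtrees pushes every contained symbol one level down
        merged = {s: d + 1 for s, d in {**t1, **t2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Quantized coefficients where small magnitudes dominate (typical of MDCT output)
stream = [0] * 50 + [1] * 25 + [2] * 15 + [3] * 10
lengths = huffman_code_lengths(stream)
total_bits = sum(lengths[s] for s in stream)
print(lengths[0], lengths[3], total_bits)  # 1 3 175 (vs 200 bits at fixed 2 bits/symbol)
```

The skewed distribution is exactly what makes the entropy coding pay off: here 175 bits instead of 200 for a fixed-length code.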

However, even when using fairly complex processing procedures, eliminating the statistical redundancy of audio signals ultimately makes it possible to reduce the required communication channel capacity by only 15...25% compared to its original value, which cannot be considered a revolutionary achievement.

After statistical redundancy is eliminated, the digital stream rate for high-quality signals and the human ability to process information still differ by at least several orders of magnitude. This points to the significant psychoacoustic redundancy of the primary digital audio signal and, therefore, the possibility of reducing it. The most promising methods in this respect turned out to be those that exploit such properties of hearing as masking, pre-masking and post-masking. If it is known which components of the sound signal the ear perceives and which it does not because of masking, then only the audible components need be isolated and transmitted over the communication channel, while the inaudible components of the original signal can simply be discarded. In addition, signals can be quantized with the lowest possible level resolution, so that quantization distortions, which change in magnitude with the level of the signal itself, still remain inaudible, i.e. are masked by the original signal. However, after psychoacoustic redundancy has been eliminated, exact restoration of the shape of the time function of the audio signal during decoding is no longer possible.

In this regard, two features of great practical importance should be noted. First, if digital audio data compression has already been applied earlier in the communication channel delivering a program, then applying it again often leads to significant distortion, even though the signal seemed to be of quite good quality before re-encoding. It is therefore very important to know the "history" of a digital signal and which coding methods have already been applied to it. Second, if the quality parameters of such codecs are measured with tonal signals using traditional methods (as is often done), then almost ideal values of the measured parameters are obtained at any, even the lowest, digital stream rate, whereas test listening on real audio signals gives fundamentally different results. In other words, traditional quality assessment methods are not suitable for codecs with digital audio data compression.

Work on analyzing the quality and assessing the effectiveness of digital audio data compression algorithms, with a view to their subsequent standardization, began in 1988, when the international expert group MPEG (Moving Picture Experts Group) was formed. The first result of the group's work was the adoption, in November 1992, of the international standard MPEG-1 ISO/IEC 11172-3 (here and below, the suffix 3 after the standard number refers to the part dealing with audio signal coding).

To date, several other MPEG standards have also become widespread in radio broadcasting, such as MPEG-2 ISO/IEC 13818-3, 13818-7 and MPEG-4 ISO/IEC 14496-3.

In the USA, by contrast, the Dolby AC-3 (A/52) standard was developed as an alternative to the MPEG standards. Somewhat later, distinct platforms of digital radio and television broadcasting technology clearly emerged: DAB (Digital Audio Broadcasting), DRM (Digital Radio Mondiale), DVB (with its terrestrial DVB-T, cable DVB-C and satellite DVB-S varieties), and ATSC (Dolby AC-3). The first three (DAB, DRM, DVB) are promoted by Europe, ATSC by the USA. These platforms differ, first of all, in the chosen compression algorithm for digital audio data, the type of digital modulation, and the error-correction coding procedure.

Despite the considerable variety of digital audio data compression algorithms, the structure of an encoder implementing such an algorithm can be represented by the generalized diagram shown in Fig. 4.1. In the time and frequency segmentation block, the original audio signal is divided into subband components and segmented in time. The length of the encoded block depends on the shape of the time function of the audio signal. In the absence of sharp amplitude spikes, a so-called long block is used, providing high frequency resolution. When the signal amplitude changes abruptly, the length of the encoded block is sharply reduced, which gives higher time resolution. The decision to change the block length is made by the psychoacoustic analysis block, which calculates the psychoacoustic entropy of the signal. After segmentation, the subband signals are normalized, quantized and encoded. In the most effective compression algorithms it is not the audio samples themselves that are encoded but the corresponding MDCT coefficients.
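The long/short block decision can be caricatured by a simple energy-based transient detector. This is not the psychoacoustic-entropy calculation real encoders perform, only a sketch of the idea (sub-block count and threshold ratio are invented for illustration):

```python
import math

def choose_window(frame, n_sub=8, ratio=8.0):
    """Split the frame into sub-blocks and compare their energies; a sharp
    energy jump suggests a transient, so short blocks (better time
    resolution) are chosen, otherwise a long block (better frequency
    resolution) is kept."""
    step = len(frame) // n_sub
    energies = [sum(x * x for x in frame[i * step:(i + 1) * step])
                for i in range(n_sub)]
    floor = 1e-12  # avoids division-like blow-ups on silent sub-blocks
    for prev, cur in zip(energies, energies[1:]):
        if cur > ratio * (prev + floor):
            return "short"
    return "long"

steady = [math.sin(2 * math.pi * n / 32) for n in range(512)]              # steady tone
attack = [0.0] * 256 + [math.sin(2 * math.pi * n / 32) for n in range(256)]  # sudden onset
print(choose_window(steady), choose_window(attack))  # long short
```

Switching to short blocks on attacks is what prevents the audible "pre-echo" that quantization noise smeared over a long block would cause.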

Typically, when compressing digital audio data, entropy coding is used, which simultaneously takes into account both the properties of human hearing and the statistical characteristics of the audio signal. However, the main role is played by procedures for eliminating psychoacoustic redundancy. Taking into account the patterns of auditory perception of a sound signal is carried out in the psychoacoustic analysis block. Here, using a special procedure, the maximum permissible level of quantization distortion (noise) is calculated for each subband signal, at which they are still masked by the useful signal of this subband. The dynamic bit distribution block, in accordance with the requirements of the psychoacoustic model, allocates for each coding subband the minimum possible number of bits at which the level of distortion caused by quantization does not exceed the threshold of their audibility calculated by the psychoacoustic model. Modern compression algorithms also use special procedures in the form of iterative loops, which make it possible to control the amount of energy of quantization distortion in subbands when there is an insufficient number of bits available for coding.
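The dynamic bit distribution described above can be sketched as a greedy loop that always helps the subband whose quantization noise sits furthest above its masking threshold. The ~6 dB-per-bit noise reduction is the usual rule of thumb; the SMR values and bit budget below are invented for illustration:

```python
def allocate_bits(smr_db, total_bits, max_bits=16):
    """Greedy allocation: repeatedly give one more bit to the subband with
    the worst noise-to-mask ratio. smr_db is the signal-to-mask ratio per
    subband as delivered by the psychoacoustic model; each extra bit
    lowers quantization noise by about 6 dB."""
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        # noise-to-mask ratio under the current allocation: SMR - 6.02*bits
        nmr = [s - 6.02 * b if b < max_bits else float("-inf")
               for s, b in zip(smr_db, bits)]
        worst = max(range(len(nmr)), key=lambda i: nmr[i])
        bits[worst] += 1
    return bits

# Three audible subbands and one band fully masked by its neighbours
alloc = allocate_bits([30.0, 24.0, 12.0, -10.0], total_bits=10)
print(alloc)  # [4, 4, 2, 0]: the masked band gets nothing
```

A negative SMR means the band is already below its masking threshold unquantized, so the loop never spends bits on it, exactly the behaviour the text describes.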

The MPEG audio compression algorithms are based on the properties of human auditory perception described in the first chapter. Using the masking effect makes it possible to significantly reduce the amount of audio data while maintaining acceptable sound quality. The principle is quite simple: if a component is inaudible, it need not be transmitted. In practice this means that in the masking region the number of bits per sample can be reduced to the point where the quantization noise still remains below the masking threshold. For the audio encoder to work, it must therefore know the masking thresholds for the various combinations of signals involved. An important unit of the encoder, the psychoacoustic model of hearing (PAM), is responsible for calculating these thresholds. It analyzes the input signal over successive time intervals and determines, for each block of samples, the spectral components and the corresponding masking regions. The input signal is analyzed in the frequency domain: a block of time-domain samples is converted into a set of frequency-spectrum coefficients using a discrete Fourier transform. Developers of compression encoders have considerable freedom in constructing the model; the accuracy of its operation depends on the required compression ratio.
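A drastically simplified PAM might look as follows: the spectrum of a block is computed with a DFT, and each bin's masking threshold is taken as the loudest nearby component minus a linear spreading slope. Real models work on the Bark scale with asymmetric, level-dependent slopes; every constant here is invented purely for illustration:

```python
import cmath
import math

def masking_threshold_db(block, spread_db_per_bin=3.0, floor_db=-60.0):
    """Toy masking model: each spectral component masks its neighbours
    with a level that falls off linearly with bin distance; the masker
    is assumed to sit ~6 dB above what it can mask."""
    n = len(block)
    mags = []
    for k in range(n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(block))
        # amplitude of bin k in dB relative to full scale
        mags.append(20 * math.log10(max(abs(acc) * 2 / n, 1e-9)))
    thresh = []
    for k in range(len(mags)):
        t = max(mags[j] - spread_db_per_bin * abs(k - j)
                for j in range(len(mags)))
        thresh.append(max(t - 6.0, floor_db))
    return mags, thresh

# A pure tone in bin 8 of a 64-point block
tone = [math.sin(2 * math.pi * 8 * i / 64) for i in range(64)]
mags, thresh = masking_threshold_db(tone)
print(thresh[8] < mags[8], thresh[10] > mags[10])  # True True
```

The tone itself stays above its own threshold (it is audible), while the empty neighbouring bins fall far below theirs: any quantization noise quieter than `thresh` there would go unheard.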

Subband coding and the filter bank. The best method of audio coding that takes the masking effect into account is subband coding. Its essence is as follows. A group of input audio samples, called a frame, is fed to a filter bank which, as a rule, contains 32 bandpass filters. Given what has been said about critical bands and masking, it would be desirable for the passbands of the filter bank to coincide as far as possible with the critical bands. However, the practical implementation of a digital filter bank with unequal bands is quite complex and is justified only in devices of the highest class. Typically a filter bank based on quadrature mirror filters with equal passbands is used, covering the entire band of audible frequencies with a small mutual overlap (Fig. 4.2). In this case the filter bandwidth is π/32T, and the centre frequencies of the bands are (2k + 1)π/64T, where T is the sampling period and k = 0, 1, ..., 31. At a sampling rate of 48 kHz the bandwidth of one filter section is 750 Hz.
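The band parameters quoted above follow directly from the formulas: a width of π/32T in angular frequency corresponds to fs/64 Hz, and centres of (2k + 1)π/64T correspond to (2k + 1)·fs/128 Hz. A quick check:

```python
def subband_params(sample_rate_hz, n_bands=32):
    """Bandwidth and centre frequencies of a uniform n_bands filter bank
    with width pi/(n_bands*T) and centres (2k+1)*pi/(2*n_bands*T)."""
    width = sample_rate_hz / (2 * n_bands)
    centres = [(2 * k + 1) * sample_rate_hz / (4 * n_bands)
               for k in range(n_bands)]
    return width, centres

width, centres = subband_params(48_000)
print(width, centres[0], centres[31])  # 750.0 375.0 23625.0
```

So band 0 spans 0-750 Hz (centre 375 Hz) and band 31 reaches up to the 24 kHz Nyquist limit, covering the whole audible range.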

The output of each filter is the part of the input signal that falls within the passband of that filter. Next, the spectral composition of the signal in each band is analyzed using the PAM, and it is estimated which part of the signal must be transmitted without reduction and which lies below the masking threshold and can be requantized with fewer bits. Since in real audio signals the maximum energy is usually concentrated in just a few frequency bands, it may turn out that signals in other bands contain no distinguishable sounds and need not be transmitted at all; the presence of a strong signal in one band, for example, means that several adjacent bands will be masked and can be encoded with fewer bits.

To reduce the dynamic range, the maximum sample in the frame is found and a scale factor is calculated that brings this sample to the upper quantization level; the operation is similar to companding in analogue broadcasting. All other samples in the frame are multiplied by the same factor. The scale factor is sent to the decoder along with the encoded data so that the decoder can correct its gain. After scaling, the masking threshold is estimated and the total number of bits is redistributed among all the bands.
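A sketch of this scaling step, assuming floating-point samples normalized to ±1 (real MPEG coders pick the scale factor from a standardized table rather than computing it exactly as here):

```python
def scale_frame(frame, full_scale=1.0):
    """Find the peak sample in the frame and the scale factor that brings
    it to the top quantization level; every sample is multiplied by that
    factor, and the factor itself travels to the decoder alongside the
    encoded data so the gain can be undone."""
    peak = max(abs(x) for x in frame)
    factor = full_scale / peak if peak else 1.0
    return factor, [x * factor for x in frame]

factor, scaled = scale_frame([0.05, -0.25, 0.10])
print(factor, max(abs(x) for x in scaled))  # 4.0 1.0
```

After scaling, the quiet frame fully occupies the quantizer's range, so the subsequent coarse requantization wastes none of its few levels on unused headroom.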

Quantization and bit allocation. None of the above operations significantly reduced the amount of data; they were essentially a preparatory stage for the actual audio compression. As with digital video compression, most of the compression takes place in the quantizer. Based on the decisions made by the PAM about requantizing the samples in individual frequency bands, the quantizer changes the quantization step so as to bring the quantization noise of a given band close to the calculated masking threshold. As a result, a sample may require only 4 or 5 bits instead of the original 16.
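The effect of requantizing to fewer bits can be seen with a uniform quantizer sketch (a midtread quantizer over ±1; the bit widths below are illustrative):

```python
def requantize(x, bits):
    """Uniform midtread requantization of a sample in [-1, 1]: round to
    the nearest of the available levels. Fewer bits means coarser steps
    and therefore larger quantization error (noise)."""
    levels = 2 ** (bits - 1) - 1  # e.g. 7 positive levels for 4 bits
    return round(x * levels) / levels

x = 0.3
err16 = abs(requantize(x, 16) - x)  # tiny error at full resolution
err4 = abs(requantize(x, 4) - x)    # much coarser at 4 bits
print(err4 > err16)  # True
```

The encoder's whole bet is that `err4`-sized noise, placed in a band whose masking threshold the PAM computed to be high enough, is exactly as inaudible as `err16` noise, while costing a quarter of the bits.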

Decision-making about the transmitted signal components in each frequency band occurs independently of the others, and a certain “dispatcher” is required that would allocate to each of the 32 band signals a portion of the total bit resource corresponding to the significance of this signal in the overall ensemble. The role of such a dispatcher is performed by a dynamic bit distribution device.
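The dispatcher's job can be sketched as a greedy loop (my own minimal illustration, not the algorithm mandated by any standard; the 6 dB-per-bit figure is the usual rule of thumb):

```python
def allocate_bits(nmr_db, bit_pool, max_bits=15):
    """Greedy dynamic bit allocation sketch: repeatedly give one more bit
    to the band whose quantization noise exceeds its masking threshold
    the most; each extra bit buys roughly 6 dB of signal-to-noise ratio."""
    alloc = [0] * len(nmr_db)
    nmr = list(nmr_db)                    # noise-to-mask ratio per band, dB
    while bit_pool > 0:
        worst = max(range(len(nmr)), key=lambda b: nmr[b])
        if nmr[worst] <= 0:               # all noise already below threshold
            break
        if alloc[worst] < max_bits:
            alloc[worst] += 1
            nmr[worst] -= 6.0
            bit_pool -= 1
        else:
            nmr[worst] = float("-inf")    # band saturated, skip it
    return alloc

# Band 0 is far above threshold, band 2 is already masked:
assert allocate_bits([12.0, 3.0, -5.0], bit_pool=4) == [2, 1, 0]
```

Note that the loop can stop before the pool is empty: once every band's noise is below its masking threshold, spending further bits buys nothing audible.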

Three bit allocation strategies are possible.

In a direct adaptation system, the encoder does all the calculations and sends the results to the decoder. The advantage of this method is that the bit allocation algorithm can be updated and changed without affecting the operation of the decoder. However, sending additional data to the decoder consumes a significant portion of the total bit supply.

A backward adaptation system performs the same calculations in both the encoder and decoder, so there is no need to send additional data to the decoder. However, the complexity and cost of the decoder are significantly higher than in the previous version, and any change in the algorithm requires updating or reworking the decoder.

A compromise system with forward and backward adaptation divides the bit-allocation calculations between the encoder and decoder: the encoder performs the most complex calculations and sends only the key parameters to the decoder, spending relatively few bits on this, while the decoder performs only simple calculations. In such a system the algorithm cannot be changed significantly, but adjusting some parameters is acceptable.

A generalized diagram of an audio encoder and decoder performing digital compression according to the described algorithm with direct adaptation is shown in Figure 4.3a. The signals at the output of the frequency bands are combined into a single digital stream using a multiplexer.

In the decoder, the processes occur in the reverse order. The signal is demultiplexed, divided by the scaling factor, the original values of the digital samples in the frequency bands are restored and fed to a combining filter block, which generates an output stream of audio data that is equivalent to the input from the point of view of psychophysiological perception of the audio signal by the human ear.

MPEG family of standards

MPEG stands for Moving Picture Experts Group, a group of experts on coding moving pictures. MPEG dates back to January 1988. From its first meeting in May 1988 the group began to grow, and it has become a very large body of specialists. Typically, about 350 specialists from more than 200 companies take part in an MPEG meeting. The majority of MPEG participants are specialists employed in various scientific and academic institutions.

MPEG-1 standard

The MPEG-1 standard (ISO/IEC 11172-3) includes three algorithms of different levels of complexity: Layer I, Layer II and Layer III. The general structure of the coding process is the same for all levels; however, despite this similarity in the general approach to encoding, the levels differ in their intended use and internal mechanisms. For each level, a digital stream format (total stream width) and its own decoding algorithm are defined. MPEG-1 encodes signals digitized at sampling rates of 32, 44.1 and 48 kHz. As stated above, MPEG-1 has three layers (Layer I, II and III), which differ in the compression ratio they provide and in the sound quality of the resulting streams. MPEG-1 normalizes the following digital stream rates for all three levels: 32, 48, 56, 64, 96, 112, 192, 256, 384 and 448 kbit/s; the input signal word length is from 16 to 24 bits. The standard input signal for an MPEG-1 encoder is an AES/EBU digital signal (a two-channel digital audio signal). The following operating modes of the audio encoder are provided:

■ single channel (mono);

■ double channel (stereo or two mono channels);

■ joint stereo (signal with partial separation of the right and left channels).

The most important property of MPEG-1 is the full backward compatibility of all three levels: each decoder can decode signals not only of its own layer but also of the lower layers.

The Level I algorithm is based on the DCC (Digital Compact Cassette) format developed by Philips for recording on compact cassettes. First-level coding is used where the degree of compression is not very important and the deciding factors are the complexity and cost of the encoder and decoder. The Level I encoder provides high-quality audio at a bit rate of 384 kbps per stereo program.

Level II requires a more complex encoder and a somewhat more complex decoder, but provides better compression: channel "transparency" is achieved already at 256 kbit/s. It allows up to 8 encode/decode cycles without noticeable degradation in sound quality. The Level II algorithm is based on the MUSICAM format, popular in Europe.

The most complex Level III includes all the basic compression tools: bandpass coding, additional DCT, entropy coding, advanced SAM. Due to the complexity of the encoder and decoder, it provides a high degree of compression - it is believed that a “transparent” channel is formed at a speed of 128 kbit/s, although high-quality transmission is possible at lower speeds. The standard recommends two psychoacoustic models: the simpler Model 1 and the more complex, but also higher quality Model 2. They differ in the algorithm for processing samples. Both models can be used at all three levels, but Model 2 has a special modification for Level III.

MPEG-1 turned out to be the first international standard for digital audio compression, and this led to its widespread use in many areas: broadcasting, sound recording, communications and multimedia applications. Level II is the most widely used and has become part of European satellite, cable and terrestrial digital TV broadcasting standards, audio broadcasting standards, DVD recording, and ITU Recommendations BS.1115 and J.52. Level III (also called MP3) is widely used in integrated services digital networks (ISDN) and on the Internet. The vast majority of music files on the network are recorded in this standard.

First-level coder. Let us take a closer look at the work of the first-level encoder (Figure 4.4). The filter block (FB) simultaneously processes 384 samples of audio data and distributes them, with appropriate subsampling, into 32 bands, 12 samples in each band at a subsampled rate of 48/32 = 1.5 kHz. The frame duration at a sampling rate of 48 kHz is 8 ms. The simplified psychoacoustic model evaluates only frequency masking, based on the presence and "instantaneous" level of signal components in each band. Based on the evaluation results, the coarsest possible quantization is assigned to each band, but such that the quantization noise does not exceed the masking threshold. The scaling factors are 6 bits wide and cover a dynamic range of 120 dB in 2 dB steps. The digital stream also carries 32 bit-allocation codes. They are 4 bits wide and indicate the length of the sample codeword in a given band after requantization.
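The frame arithmetic above is easy to verify. The sketch below (my own illustration) checks the frame duration and computes the bit budget per frame at the Layer I rate of 384 kbit/s, ignoring header and CRC overhead for simplicity:

```python
fs = 48_000                 # sampling rate, Hz
frame_samples = 384         # 32 bands x 12 subsampled samples

# Frame duration: 384 samples at 48 kHz
frame_ms = 1000 * frame_samples / fs
assert frame_ms == 8.0

# Total bit budget per frame at 384 kbit/s for the stereo programme
bitrate = 384_000
bits_per_frame = bitrate * frame_samples // fs
assert bits_per_frame == 3072

# Worst-case side information for one channel:
# 32 four-bit allocation codes + 32 six-bit scale factors
side_info_bits = 32 * 4 + 32 * 6
assert side_info_bits == 320
```

The remaining budget after side information is what the dynamic bit allocator distributes among the requantized subband samples.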

In the decoder, samples of each frequency band are separated by a demultiplexer and fed to a multiplier, which restores their original dynamic range. Before this, the original bit depth of the samples is restored: the least significant bits discarded in the quantizer are replaced with zeros. The bit-allocation codes help the demultiplexer to separate, in the serial stream, codewords belonging to different samples and transmitted with a variable word length. Then the samples of all 32 channels are fed to the synthesizing filter block, which carries out upsampling and arranges the samples properly in time, restoring the original waveform.
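The bit-depth restoration step can be sketched as a shift that pads the discarded least significant bits with zeros (my own illustration; the real decoder works from the standard's quantization tables):

```python
def restore_sample(code, coded_bits, full_bits=16):
    """Pad a requantized integer code back to the full bit depth by
    shifting zeros into the discarded least significant bits."""
    return code << (full_bits - coded_bits)

# A 4-bit code becomes a 16-bit sample with the low 12 bits zeroed:
assert restore_sample(0b1011, 4) == 0b1011_0000_0000_0000
```

The restored integer is then multiplied by the transmitted scaling factor to recover the original dynamic range.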

Second-level coder. The second-level encoder eliminates the main disadvantages of the basic bandpass coding model associated with the discrepancy between the critical bands of hearing and the real FB bands, which is why the masking effect was practically unused in the low-frequency sections of the range. The frame size has been tripled, up to 24 ms at 48 kHz sampling, and 1152 samples are processed simultaneously (3 subframes of 384 samples each). The input for the SAM is not the bandpass signals from the FB output but the spectral coefficients obtained from a 512-point Fourier transform of the encoder's input signal. Thanks to the longer frame and the greater accuracy of spectral analysis, the efficiency of the SAM increases.

At the second level, a more complex bit distribution algorithm is used. Bands numbered 0 to 10 are processed with a four-bit allocation code (selection of any of 15 quantization scales); for bands 11 to 22 the choice is reduced to 3 bits (one of 7 scales); bands 23 to 26 allow the selection of one of 3 scales (a two-bit code); and bands 27 to 31 (above 20 kHz) are not transmitted. If the quantization scales selected for all subframes of a frame are the same, the scale number is transmitted only once.
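This band grouping maps directly to a small lookup function (my own sketch of the grouping described above, not the standard's actual allocation tables):

```python
def alloc_code_bits(band):
    """Width of the bit-allocation code for a Layer II subband,
    following the band grouping described in the text."""
    if band <= 10:
        return 4      # choice of one of 15 quantization scales
    if band <= 22:
        return 3      # one of 7 scales
    if band <= 26:
        return 2      # one of 3 scales
    return 0          # bands 27..31 (above 20 kHz) are not transmitted

widths = [alloc_code_bits(b) for b in (0, 10, 11, 22, 23, 26, 27, 31)]
assert widths == [4, 4, 3, 3, 2, 2, 0, 0]
```

Spending fewer allocation bits on the upper bands reflects the fact that they rarely carry fine spectral detail worth many quantization scales.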

Another significant difference in the second-level algorithm is that not all scaling factors are transmitted over the communication channel. If the scale factors of three consecutive subframes differ by more than 2 dB for no more than 10% of the time, only one set of factors is transmitted, which saves bits. If rapid changes of the audio level occur in a given band, two or all three sets of scaling factors are transmitted. Accordingly, the decoder must remember the numbers of the selected quantization scales and scaling factors and apply them, if necessary, to the subsequent subframe.

Third-level coder. The Level III encoder uses an advanced encoding algorithm with an additional DCT.

The main disadvantage of second-level encoders, ineffective processing of rapidly changing transitions and jumps in sound level, is eliminated by introducing two types of DCT blocks: "long" blocks of 18 samples and "short" blocks of 6 samples. The mode is selected adaptively by switching window functions in each of the 32 frequency bands. Long blocks provide better frequency resolution for a stationary signal, while short blocks improve the processing of fast transients. One frame can contain both long and short blocks, but the total number of DCT coefficients does not change, since three short blocks are transmitted instead of one long block. The following enhancements are also used to improve encoding.

■ Non-uniform quantization (the quantizer raises the samples to the 3/4 power before quantization to improve the signal-to-noise ratio; accordingly, the decoder raises them to the 4/3 power for inverse linearization).

■ Unlike encoders of the first and second levels, at the third level scaling factors are assigned not to each of the 32 frequency bands of the BF, but to scaling bands - sections of the spectrum not associated with these bands and approximately corresponding to the critical bands.

■ Entropy coding of quantized coefficients using the Huffman code.

■ The presence of a “bit reservoir” - a reserve that the encoder creates during periods of a stationary input signal.
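The non-uniform quantization mentioned in the first item can be sketched as a matched pair of power laws (my own illustration of the 3/4 and 4/3 exponents stated above):

```python
def powerlaw_encode(x):
    """Compress a sample's magnitude with a 3/4 power before the
    uniform quantizer (improves the effective signal-to-noise ratio)."""
    return (abs(x) ** 0.75) * (1 if x >= 0 else -1)

def powerlaw_decode(y):
    """Decoder side: the 4/3 power undoes the compression exactly."""
    return (abs(y) ** (4.0 / 3.0)) * (1 if y >= 0 else -1)

x = 0.5
roundtrip = powerlaw_decode(powerlaw_encode(x))
assert abs(roundtrip - x) < 1e-12
```

Because small magnitudes are expanded relative to large ones before uniform quantization, the quantization noise is distributed more favourably across the signal's dynamic range.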

The third-level encoder processes the stereo signal more fully in the joint stereo (MS Stereo) format. While lower-level encoders operate only in intensity coding mode, where the left and right channels in bands above 2 kHz are encoded as one signal (but with independent scaling factors), a third-level encoder can also operate in sum-difference mode, providing a higher degree of compression of the difference channel. The stereo signal is decomposed into the average of the two channels and their difference, and the difference signal is encoded at a lower bit rate. This slightly increases encoding quality in the normal situation where the channels are in phase. But it also leads to a sharp deterioration when out-of-phase signals are encoded; a phase shift is almost always present in recordings digitized from audio cassettes, and is also found on CDs, especially if the CD itself was at some point recorded from an audio tape.
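The sum-difference decomposition described above is a simple, exactly invertible transform (my own sketch):

```python
def ms_encode(left, right):
    """Mid/side stereo: when the channels are in phase, the mid (average)
    carries most of the energy, so the side (difference) channel can be
    encoded at a lower bit rate."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Exact inverse: L = M + S, R = M - S."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

L, R = [0.5, 0.75], [0.25, 0.25]
m, s = ms_encode(L, R)
assert ms_decode(m, s) == (L, R)      # lossless before quantization
```

For an out-of-phase pair (R ≈ -L) the mid channel collapses toward zero and the side channel carries all the energy, which is exactly why this mode degrades badly on phase-shifted material.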

Within the third level, stereo signal encoding is possible by three more different methods.

■ Joint Stereo (MS/IS Stereo) introduces another stereo simplification technique that improves encoding quality at particularly low bit rates. For some frequency ranges, not even the difference signal is retained but only the ratio of the signal powers in the two channels; clearly, even fewer bits are needed to encode this information. Unlike all the other modes, this method loses phase information, but at very low speeds the bits saved in favour of the average signal outweigh this loss. This mode is used by default for high frequencies at 96 kbit/s and below (it is practically not used by other high-quality encoders). As already mentioned, phase information is lost in this mode; in addition, any out-of-phase signal component is also lost.

■ Dual Channel - each channel receives exactly half the stream and is encoded separately as a mono signal. The method is recommended mainly in cases where different channels contain fundamentally different signals, for example, text in different languages. This mode is installed in some encoders upon request.

■ Stereo - each channel is encoded separately, but the encoder may decide to give one channel more space than the other. This can be useful when, after discarding the part of the signal that lies below the threshold of audibility or is completely masked, the code does not fill the entire volume allocated to a given channel, so the encoder can use this space to encode the other channel. This, for example, avoids encoding "silence" in one channel while there is a signal in the other. This mode is used at speeds above 192 kbit/s; it is also applicable at lower speeds.

The main Level III encoders used are XingTech encoders, FhG IIS encoders, and ISO source code based encoders.

XingTech encoders do not have high encoding quality, but are quite suitable for encoding electronic music. Their speed makes them ideal encoders for music that does not require high-quality encoding.

FhG IIS encoders are known for the highest encoding quality at low and medium speeds, thanks to a psychoacoustic model best suited to such speeds. Of the console encoders in this group, the most preferable is l3enc 2.61. The mp3enc 3.1 encoder is also in use, but no one has seriously tested it yet. Other encoders, such as AudioActive or MP3 Producer, have significant shortcomings, mainly limited customization options and an underdeveloped interface.

The remaining encoders trace their origins to the ISO source code. There are two main directions of development: optimizing the code for speed and optimizing the algorithm for quality. The first direction is best represented by the BladeEnc encoder, which uses the original ISO model but with many code optimizations; the second is represented by mpegEnc.

The MP3Pro encoder was announced in July 2001 by Coding Technologies together with Thomson Multimedia and the Fraunhofer Institute. The MP3Pro format is a development of Level III (MP3). MP3Pro is backward (fully) and forward (partially) compatible with MP3, meaning files encoded with MP3Pro can be played in conventional players. However, the sound quality is then noticeably worse than in a dedicated player. This is because MP3Pro files contain two audio streams, while conventional players recognize only one of them, i.e. the regular MPEG-1 Layer 3 stream.

MP3Pro uses a new technology, SBR (Spectral Band Replication), designed to convey the upper frequency range. Previous technologies based on psychoacoustic models share a common drawback: they all work efficiently only from about 128 kbit/s upward. At lower speeds various problems begin: either the frequency range must be cut to transmit the sound, or encoding produces various artifacts. The new SBR technology complements the use of psychoacoustic models. A slightly narrower range of frequencies than usual is encoded (i.e., with the "highs" cut off), and the upper frequencies are recreated by the decoder itself based on information about the lower-frequency components. Thus, SBR technology actually operates not so much at the compression stage as at the decoding stage. The second data stream mentioned above is precisely the minimal information used during playback to restore the high frequencies. It is not yet known for certain exactly what information this stream carries; however, studies have shown that it describes the average power in several frequency bands of the upper range.