What Is the MPEG Video Compression Standard?

By Kulanthai Chandrarajh ck4@doc.ic.ac.uk



There are many important ramifications of the technologies incorporated into the MPEG specifications, but what receives the most attention is the video compression system. Robust video compression and transport approaches are essential for operation over emerging ATM-based broadband ISDN. This paper discusses the MPEG video compression standard.

1. Introduction

Most physical entities convey some type of "information" and need a fixed number of parameters for this purpose. In many instances, however, this fixed number is prohibitively large for storage and transmission purposes. The compression process attempts to represent the entity by employing fewer than the total set of parameters. Compression techniques can be divided into two categories.
Lossless compression.
Lossy compression.
If all the information is conveyed using the subset of parameters, the compression is called lossless. On the other hand, if less than the complete information is conveyed, it is termed lossy compression.

Video signals are spatio-temporal signals or, simply stated, a sequence of time-varying images. The information they convey is "visual". A monochromatic still image can be mathematically represented by x(h,v), where x is the intensity value at the horizontal location h and vertical location v. The monochromatic video signal can be represented by x(h,v,t), where x is the intensity value at horizontal location h, vertical location v and time t. Figure 1 shows these representations of the video signal.

A color video signal is merely a superposition of the intensity distributions of the three color primaries (R, G, B), or equivalently of one luminance component (Y) and two chrominance components (U, V). The conversion is shown in Table 1.

Y = 0.299R' + 0.587G' + 0.114B'
U = -0.147R' - 0.289G' + 0.436B'
V = 0.615R' - 0.515G' - 0.100B'

Table 1
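The conversion in Table 1 can be sketched in code. This is a minimal illustration, assuming standard BT.601 coefficients and gamma-corrected inputs in the range 0..1:

```python
def rgb_to_yuv(r, g, b):
    """Convert gamma-corrected R'G'B' values (0..1) to YUV using the
    standard BT.601 weights (an assumption; other standards use
    slightly different coefficients)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luminance
    u = -0.147 * r - 0.289 * g + 0.436 * b  # blue-difference chrominance
    v = 0.615 * r - 0.515 * g - 0.100 * b   # red-difference chrominance
    return y, u, v

# White (1, 1, 1) has full luminance and essentially zero chrominance:
print(rgb_to_yuv(1.0, 1.0, 1.0))
```

Note that the chrominance components vanish for any gray input, since the coefficients in each chroma row sum to zero.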

After a brief description of analog video signals, this paper examines the transition to digital video and the need for video compression. Finally, the main part of this paper analyzes the MPEG video compression standard.

2. MPEG Compression Standard

The most common form of the video signal in use today is still analog. This signal is obtained through a process known as scanning. This section discusses the analog representation of the video signal and its disadvantages, and describes the motivation for a digital representation. After establishing the need for compression of the video signal, it describes the MPEG compression technique for video signals.

2.1 Analog Video Signal

The analog signal is obtained through a process known as scanning. This is shown in Figure 2. Scanning records the intensity values of the spatio-temporal signal only in the h direction. This signal is coupled with the horizontal and vertical synchronization pulses to yield the complete video signal. Scanning can be either progressive or interlaced. Progressive scanning scans all the horizontal lines to form the complete frame. In interlaced scanning, the even and the odd horizontal lines of a picture are scanned separately, yielding the two fields of a picture. There are three main analog video standards.
In the composite standard, the luminance and the two chrominance components are encoded together as a single signal. This is in contrast to the component standard, where the three components are coded as three distinct signals. S-Video consists of separate Y and C analog video signals.

Today, technology is attempting to integrate the video, computer and telecommunication industries on a single multimedia platform. The video signal is required to be scalable, platform independent, able to provide interactivity, and robust. Analog video unfortunately fails to address these requirements. Moving to digital not only eliminates most of the problems mentioned above but also opens the door to a whole range of digital video processing techniques that can make the picture sharper.

2.2 Digital Video Signal

To digitize the spatio-temporal signal x(h,v,t), the component form of the analog signal is usually sampled in all three directions. Each sample point in a frame is called a pixel. Sampling in the horizontal direction yields the pixels per line, which defines the horizontal resolution of the picture. Vertical resolution is controlled by sampling vertically. Temporal sampling determines the frame rate.

Digital video too has its share of bottlenecks. The most important one is the huge bandwidth requirement: despite being digital, the signal still needs to be stored and transmitted. The logical solution to this problem is digital video compression.
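The bandwidth problem can be made concrete with a quick back-of-the-envelope calculation. This sketch assumes, for illustration, an ITU-R 601-style frame of 720x576 pixels at 25 frames/s, 8 bits per sample and 4:2:0 chroma subsampling:

```python
# Raw (uncompressed) data rate of a digital video signal.
width, height, fps, bits = 720, 576, 25, 8

luma_samples = width * height                       # one Y sample per pixel
chroma_samples = 2 * (width // 2) * (height // 2)   # U and V at half resolution
bits_per_frame = (luma_samples + chroma_samples) * bits
bits_per_sec = bits_per_frame * fps

print(f"{bits_per_sec / 1e6:.1f} Mbit/s")  # 124.4 Mbit/s
```

Well over 100 Mbit/s for standard-definition video makes it clear why compression is unavoidable for both storage and transmission.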

2.3 MPEG Compression Standard

Compression aims at lowering the total number of parameters required to represent the signal, while maintaining good quality. These parameters are then coded for transmission or storage. A result of compressing digital video is that it becomes available as computer data, ready to be transmitted over existing communication networks.

There are many different redundancies present in video signal data.
Spatial redundancy occurs because neighboring pixels in each individual frame of a video signal are related. The pixels in consecutive frames of the signal are also related, leading to temporal redundancy. The human visual system does not treat all visual information with equal sensitivity, leading to psychovisual redundancy. Finally, not all parameters occur with the same probability in an image; as a result, they do not require an equal number of bits to code them (coding redundancy, exploited by Huffman coding).
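Coding redundancy can be illustrated with a toy Huffman coder. The sketch below uses only the Python standard library; the symbol data is invented for illustration. Frequent symbols receive shorter codewords:

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code: merge the two least-frequent subtrees
    until one tree remains, prefixing '0'/'1' at each merge."""
    freq = Counter(symbols)
    # Heap entries: (frequency, tiebreaker, {symbol: partial code})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

data = list("aaaabbc")
code = huffman_code(data)
# 'a' occurs most often, so it gets the shortest codeword:
assert len(code["a"]) <= len(code["b"]) <= len(code["c"])
```

MPEG uses fixed variable-length-code tables rather than building the tree per image, but the principle is the same: likely values cost fewer bits.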

There are several different compression standards in use today (e.g. CCITT Recommendation H.261). MPEG, which stands for Moving Picture Experts Group, is a joint committee of the ISO and IEC. It has been responsible for the MPEG-1 (ISO/IEC 11172) and MPEG-2 (ISO/IEC 13818) standards in the past and is currently developing the MPEG-4 standard. MPEG standards are generic and universal. There are three main parts in the MPEG-1 and MPEG-2 specifications, namely Systems, Video and Audio. The Video part defines the syntax and semantics of the compressed video bitstream. The Audio part defines the same for the audio bitstream, while the Systems part specifies the method of combining one or more video and audio elementary streams into a single stream. The MPEG-2 standard contains a fourth part called DSM-CC, which defines a set of protocols for the retrieval and storage of MPEG data. We shall now examine the structure of a non-scalable video bitstream in some detail to understand the video compression.

The video bitstream consists of video sequences. Each video sequence consists of a variable number of groups of pictures (GOP). A GOP contains a variable number of pictures, Figure 3.

Mathematically, each picture is a union of the pixel values of the luminance and the two chrominance components. The picture can also be subsampled at a lower resolution in the chrominance domain because the human eye is less sensitive to high-frequency color shifts (there are more rods than cones on the retina). There are three formats:
  1. 4:4:4---the chrominance and luminance planes are sampled at the same resolution.
  2. 4:2:2---the chrominance planes are subsampled at half resolution in the horizontal direction.
  3. 4:2:0---the chrominance information is sub-sampled at half the rate both vertically and horizontally.
These formats are shown in Format.fig.
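The savings from each format can be tallied with a small sketch (the 720x576 frame size is an assumption for illustration):

```python
# Total samples per frame (Y + U + V) for each chroma format.
def samples_per_frame(width, height, fmt):
    luma = width * height
    if fmt == "4:4:4":      # chroma at full resolution
        chroma = 2 * luma
    elif fmt == "4:2:2":    # chroma halved horizontally
        chroma = 2 * (width // 2) * height
    elif fmt == "4:2:0":    # chroma halved both horizontally and vertically
        chroma = 2 * (width // 2) * (height // 2)
    else:
        raise ValueError(fmt)
    return luma + chroma

for fmt in ("4:4:4", "4:2:2", "4:2:0"):
    print(fmt, samples_per_frame(720, 576, fmt))
```

Relative to 4:4:4, the 4:2:0 format carries only half as many samples per frame before any actual compression has been applied.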

Pictures can be divided into three main types based on their compression schemes.
I or Intra pictures.
P or Predicted pictures.
B or Bidirectional pictures.
The frames that can be predicted from previous frames are called P-frames. But what happens if transmission errors occur in a sequence of P-frames? To avoid the propagation of transmission errors and to allow periodic resynchronization, a complete frame which does not rely on information from other frames is transmitted approximately once every 12 frames. These stand-alone frames are "intra coded" and are called I-frames. The coding technique for I pictures falls in the category of transform coding. Each picture is divided into non-overlapping 8x8 pixel blocks. Four of these blocks are additionally arranged into a bigger block of size 16x16, called a macroblock. The DCT is applied to each 8x8 block individually, Figure 4. This transform converts the data into a series of coefficients which represent the magnitudes of the cosine functions at increasing frequencies. The quantization process allows the high-energy, low-frequency coefficients to be coded with a greater number of bits, while using fewer or zero bits for the high-frequency coefficients. Retaining only a subset of the coefficients reduces the total number of parameters needed for representation by a substantial amount. The quantization process also helps the encoder output bitstreams at a specified bitrate.
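The transform and quantization steps can be illustrated with a naive, unoptimized 8x8 DCT. This is the textbook 2-D DCT-II definition, not production encoder code; the quantizer step size of 16 is an arbitrary choice for the demonstration:

```python
import math

def dct_2d(block):
    """Naive 8x8 forward DCT (the transform applied to each block)."""
    N = 8
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            cu = math.sqrt(0.5) if u == 0 else 1.0
            cv = math.sqrt(0.5) if v == 0 else 1.0
            s = sum(block[y][x]
                    * math.cos((2 * x + 1) * v * math.pi / (2 * N))
                    * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                    for y in range(N) for x in range(N))
            out[u][v] = 0.25 * cu * cv * s
    return out

# A flat (constant) block has all of its energy in the DC coefficient:
flat = [[128] * 8 for _ in range(8)]
coeffs = dct_2d(flat)
print(round(coeffs[0][0]))  # 1024: DC term = 8 * average sample value
print(round(coeffs[0][1]))  # 0: no higher-frequency content

# Coarse quantization: divide by a step size and round. Small
# coefficients (here, every AC term) quantize to zero.
quantized = [[int(round(c / 16.0)) for c in row] for row in coeffs]
```

Smooth image regions behave much like this flat block: after quantization only a handful of low-frequency coefficients survive, which is what makes the subsequent entropy coding so effective.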

The DCT coefficients are coded using a combination of two special coding schemes: run-length and Huffman. The coefficients are scanned in a zigzag pattern to create a 1-D sequence. MPEG-2 provides an alternative scanning method. The resulting 1-D sequence usually contains a large number of zeros due to the DCT and the quantization process. Each non-zero coefficient is associated with a pair of pointers. First, its position in the block, indicated by the number of zeros between itself and the previous non-zero coefficient. Second, its coefficient value. Based on these two pointers, it is given a variable-length code from a lookup table. This is done in a manner so that a highly probable combination gets a code with fewer bits, while the unlikely ones get longer codes. However, since spatial redundancy is limited, I pictures provide only moderate compression.
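The zigzag scan and the (run-of-zeros, value) pairing can be sketched as follows; the three coefficient values in the demo block are made up for illustration:

```python
def zigzag_order(n=8):
    """(row, col) pairs in zigzag scan order for an n x n block:
    diagonals of increasing frequency, alternating direction."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))

def run_length(seq):
    """Encode non-zero values as (run-of-zeros, value) pairs; trailing
    zeros are left for an implicit end-of-block marker."""
    pairs, run = [], 0
    for v in seq:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    return pairs

# A sparse quantized block: only three low-frequency coefficients survive.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0] = 52, -3, 2
seq = [block[r][c] for r, c in zigzag_order()]
print(run_length(seq))  # [(0, 52), (0, -3), (0, 2)]
```

Because the zigzag path visits low frequencies first, the 61 trailing zeros of this block collapse into a single end-of-block code rather than 61 separate symbols.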

The P and B pictures are where MPEG derives its maximum compression efficiency. This is done by a technique called motion compensation (MC) based prediction, which exploits the temporal redundancy. Since consecutive frames are closely related, it is assumed that the current picture can be modelled as a translation of the picture at the previous time instant. It is then possible to accurately "predict" the data of one frame based on the data of a previous frame. In P pictures, each 16x16 macroblock is predicted from a macroblock of the previously encoded I or P picture. Since frames are snapshots in time of a moving object, the macroblocks in the two frames may not correspond to the same spatial location. The encoder searches the previous frame (for P-frames, or the frames before and after for B-frames) in half-pixel increments for other macroblock locations that are a close match to the information contained in the current macroblock. The displacements in the horizontal and vertical directions of the best-match macroblock from the cosited macroblock are called motion vectors, Figure 5.
If no matching macroblock is found in the search region, the macroblock is intra coded and its DCT coefficients are encoded. If a matching macroblock is found, a motion vector is transmitted instead, together with the coefficients of the (usually small) prediction error.
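The search itself can be sketched as an exhaustive block-matching loop that minimizes the sum of absolute differences (SAD). This is a full-pixel toy version; real MPEG encoders refine to half-pixel positions and use much faster search strategies:

```python
# Exhaustive block-matching motion search over a small window.
def sad(ref, cur, bx, by, dx, dy, n=4):
    """Sum of absolute differences between the current block at (bx, by)
    and the reference block displaced by (dx, dy)."""
    return sum(abs(ref[by + dy + y][bx + dx + x] - cur[by + y][bx + x])
               for y in range(n) for x in range(n))

def motion_search(ref, cur, bx, by, radius=2, n=4):
    """Return the (dx, dy) displacement with the lowest SAD."""
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if not (0 <= bx + dx and bx + dx + n <= len(ref[0])
                    and 0 <= by + dy and by + dy + n <= len(ref)):
                continue  # candidate block falls outside the frame
            cost = sad(ref, cur, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best[1]  # the motion vector

# Toy frames: a bright 4x4 patch moves one pixel right between frames.
W = H = 8
ref = [[0] * W for _ in range(H)]
cur = [[0] * W for _ in range(H)]
for y in range(2, 6):
    for x in range(1, 5):
        ref[y][x] = 200       # patch at columns 1..4 in the reference
        cur[y][x + 1] = 200   # patch at columns 2..5 in the current frame
print(motion_search(ref, cur, bx=2, by=2))  # (-1, 0)
```

The vector (-1, 0) says "copy the block one pixel to the left in the reference frame", which reproduces the current block exactly, so almost nothing else needs to be transmitted.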

The motion vectors can also be used for motion prediction in the case of corrupted data, and sophisticated decoder algorithms can use these vectors for error concealment (refer to Article 1).

For B pictures, MC prediction and interpolation are performed using reference frames present on either side of the picture, Figure 6.

Compared to I and P pictures, B pictures provide the maximum compression. There are other advantages to B pictures: they are themselves never used for predictions and hence do not propagate errors. MPEG-2 allows for both frame-based and field-based MC. Field-based MC is especially useful when the video signal includes fast motion.
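The dependency structure of I, P and B pictures forces the encoder to transmit pictures out of display order: a B picture cannot be decoded until both of its reference frames have arrived. A sketch, assuming a common I B B P B B P B B pattern (the trailing B pictures of a real GOP would also need the next GOP's I picture):

```python
# A GOP in display order; digits are display positions.
display = ["I0", "B1", "B2", "P3", "B4", "B5", "P6", "B7", "B8"]

def coded_order(gop):
    """Emit each anchor (I or P) before the B pictures that depend on it."""
    out, pending_b = [], []
    for pic in gop:
        if pic[0] in "IP":       # anchor frame: emit it, then the waiting Bs
            out.append(pic)
            out.extend(pending_b)
            pending_b = []
        else:                    # B frame: must wait for the next anchor
            pending_b.append(pic)
    return out + pending_b

print(coded_order(display))
# ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5', 'B7', 'B8']
```

The decoder buffers the anchors and restores display order, which is why B pictures add latency as well as compression efficiency.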

3. Concluding Remarks

The first step in compression is to translate the information in the picture into the frequency domain. The red, green and blue intensity information in each pixel is translated into Y and (U,V). The pixels are grouped together into rectangular areas called blocks, and groups of blocks called macroblocks. These blocks are then translated into frequency information using the DCT. The coefficients of the blocks are scanned in a zigzag order to increase the runs of zero coefficients.

The MPEG compression algorithm is a clever combination of a number of diverse tools, each of which exploits a particular data redundancy. Spatial, temporal, psychovisual and coding redundancies were discussed in this paper. The end result is that the coded video needs a far lower bandwidth compared to the original, while maintaining extremely good quality. Currently, the technology is gearing up for an exciting phase with the advent of HDTV and DVD. Video compression is a key factor in these new technologies, and MPEG has become the industry standard.
There are many advantages in choosing MPEG.
It guarantees a means for universal interoperability.
It reduces the cost of video compression codecs by triggering mass production of ASICs.