VC-6 differs fundamentally from most industry mezzanine codecs because it is based on repeatable S-tree structures rather than block-based DCT or wavelet transforms. It is also not limited to three data planes (such as RGB or YUV) but can handle up to 255 separate data planes, each of which can be a different size or resolution. Each data plane is compressed separately by creating a hierarchy of different resolutions of the image and then encoding the residual differences between them with a low-complexity approach. These simple tree structures provide intrinsic capabilities that are well suited to modern computing techniques such as massive parallelism.
VC-6 RGB Planes example
Each plane of each source image (I0) is down sampled by ½ in height and width a number of times to produce lower resolution versions (I-1, I-2, I-3…) giving us what is referred to in the SMPTE standard as the plane stack, and each resolution is called an echelon. The number of echelons generated is configurable and is one way the balance between compression quality and computational complexity can be controlled.
The reverse process is then carried out, starting with the lowest echelon. Assuming, for example, that the lowest echelon is -3, I-3 is up sampled to generate B-2, a reconstructed version of I-2. Then B-2 is up sampled to give B-1, and B-1 is up sampled to give B0.
The differences between the source echelons and the reconstructed echelons, i.e. the errors introduced by the down sampling and up sampling processes, are calculated for each echelon apart from the lowest one, which has no reconstructed version. These are referred to as the residuals (R0, R-1, R-2…).
The lowest echelon and the residuals from the higher echelons are then encoded giving E-3, E-2, E-1 and E0. It is this encode process that provides the data compression.
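The plane stack and residual calculation described above can be sketched in a few lines of Python. This is an illustrative sketch, not the method of any particular encoder: the 2x2 averaging downsampler is an assumption (the standard does not define one), nearest-neighbour upsampling is used because it is the simplest of the listed upsamplers, and the function names are hypothetical. Because nothing is quantised here, each prediction is formed by upsampling the source echelon below, which is assumed to match what a decoder reconstructs in the lossless case.

```python
def downsample(img):
    """Halve width and height by averaging each 2x2 block (assumed downsampler)."""
    h, w = len(img), len(img[0])
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) // 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]


def upsample(img):
    """Double width and height by nearest-neighbour replication."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def build_plane_stack(plane, num_echelons):
    """Return [I0, I-1, I-2, ...] with num_echelons entries."""
    stack = [plane]
    for _ in range(num_echelons - 1):
        stack.append(downsample(stack[-1]))
    return stack


def residuals(stack):
    """Return [R0, R-1, ...]: each echelon minus the upsampled echelon below it."""
    out = []
    for upper, lower in zip(stack, stack[1:]):
        predicted = upsample(lower)
        out.append([[s - p for s, p in zip(srow, prow)]
                    for srow, prow in zip(upper, predicted)])
    return out


# A single 4x4 plane with three echelons: I0 (4x4), I-1 (2x2), I-2 (1x1).
plane = [[10, 12, 20, 22],
         [14, 16, 24, 26],
         [30, 32, 40, 42],
         [34, 36, 44, 46]]
stack = build_plane_stack(plane, 3)
res = residuals(stack)
print(stack[2])  # lowest echelon
print(res[1])    # R-1
print(res[0])    # R0
```

In this example the residuals are small compared with the source values, which is what makes them cheaper to encode than the echelons themselves.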
Decoding unsurprisingly is a very similar process but in reverse. The VC-6 bitstream contains a compressed version of the lowest resolution echelon, and compressed sets of residuals, or the errors that are introduced when the base resolution image is successively up-sampled.
If a lower resolution version is all that is required, then it is not necessary to decode and process any of the higher echelons, which makes the decoding process quicker and reduces the resources, such as memory, that it requires.
The downsampler is not defined by the VC-6 standard, which allows the VC-6 encoder designer to trade off compression performance against computational efficiency and speed.
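The early-exit property of decoding can be illustrated with a small sketch. It assumes nearest-neighbour upsampling and residuals that have already been entropy-decoded into plain integer arrays; the function names and the `levels` parameter are hypothetical and are not part of the VC-6 syntax.

```python
def upsample(img):
    """Double width and height by nearest-neighbour replication."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add(a, b):
    """Element-wise sum of two equally sized 2D arrays."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def decode(base, residuals_low_to_high, levels=None):
    """Reconstruct from the lowest echelon upwards.

    `levels` limits how many upsample-and-correct steps are run;
    None means decode to full resolution.
    """
    img = base
    for res in residuals_low_to_high[:levels]:
        img = add(upsample(img), res)
    return img


base = [[28]]                                   # lowest echelon
r_minus1 = [[-15, -5], [5, 15]]                 # residuals for the 2x2 echelon
r0 = [[-3, -1, -3, -1], [1, 3, 1, 3],
      [-3, -1, -3, -1], [1, 3, 1, 3]]           # residuals for the 4x4 echelon

half = decode(base, [r_minus1, r0], levels=1)   # never touches r0
full = decode(base, [r_minus1, r0])
print(half)
print(full)
```

Requesting `levels=1` stops after the first correction step, so the full-resolution residuals are neither read nor processed.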
The choice of upsampler is, however, constrained by the standard and is signalled in the VC-6 bitstream. Currently the VC-6 standard defines 4 possible upsamplers, including:
- Nearest neighbor
- Standard Nonlinear 9-Transformer
It is the encode stage that provides the data compression, and consists of the following processes:
Directional decomposition -> Quantization -> S-Tree and Entropy Coding -> Range Encoding
The lowest resolution image and the residual images for the higher layers are transformed into four directional versions, using directional transforms to generate A (average), H (horizontal), V (vertical) and D (diagonal) versions or echelons that are ½ the resolution. This is done both to produce data structures with a greater probability of sequences of zeros, which can be efficiently encoded, and to allow any necessary quantisation of the data to be targeted so as to minimise visually perceptible video compression artifacts.
VC-6 supports lossless compression which, if selected, means that no quantisation is applied to the picture data.
For lossy compression, however, a rate control function determines which elements to quantise, and to what degree, in order to achieve the required data compression whilst minimising the impact on the perceived quality of the image. To help achieve this, each directional echelon is divided into 16 x 16 tiles which, based on their characteristics, are put into groups or tile sets. Each tile set has the same level of quantisation applied to it.
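To make the idea of a directional decomposition concrete, the sketch below applies a Hadamard-style transform to each 2x2 block. This is an assumption for illustration only: the exact transform, its normalisation, and which difference is labelled H or V are defined by the standard and may differ from what is shown here. The function names are hypothetical.

```python
def directional_decompose(img):
    """Split an image into A, H, V, D planes at half resolution (2x2 blocks)."""
    A, H, V, D = [], [], [], []
    for y in range(0, len(img), 2):
        ra, rh, rv, rd = [], [], [], []
        for x in range(0, len(img[0]), 2):
            a, b = img[y][x], img[y][x + 1]
            c, d = img[y + 1][x], img[y + 1][x + 1]
            ra.append(a + b + c + d)   # average (unnormalised sum)
            rh.append(a - b + c - d)   # change along the horizontal direction
            rv.append(a + b - c - d)   # change along the vertical direction
            rd.append(a - b - c + d)   # diagonal change
        A.append(ra); H.append(rh); V.append(rv); D.append(rd)
    return A, H, V, D


def directional_recompose(A, H, V, D):
    """Invert directional_decompose exactly (each sum is a multiple of 4)."""
    out = []
    for ra, rh, rv, rd in zip(A, H, V, D):
        top, bottom = [], []
        for s, h, v, d in zip(ra, rh, rv, rd):
            top += [(s + h + v + d) // 4, (s - h + v - d) // 4]
            bottom += [(s + h - v - d) // 4, (s - h - v + d) // 4]
        out += [top, bottom]
    return out


# A flat block and a block that only varies horizontally.
img = [[5, 5, 1, 3],
       [5, 5, 1, 3]]
A, H, V, D = directional_decompose(img)
print(A, H, V, D)
```

Smooth or one-directional image content concentrates into A and a single directional plane, leaving the other planes full of zeros, which is the property the later sparse coding stage relies on.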
The number of tile sets is not defined by the standard. Instead, it is left to the VC-6 encoder to determine the optimum number, bearing in mind that each tile set has a cost because its quantisation parameters must be transmitted to the decoder in the VC-6 bitstream.
Grouping tiles of similar characteristics and applying the same degree of quantisation to them can help maintain a consistent appearance to identical or similar surfaces in the image.
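The following sketch shows the general shape of per-tile-set quantisation. Everything specific in it is an assumption made for illustration: the energy measure, the single threshold, the two step sizes, and the function names are hypothetical, and 2x2 tiles stand in for the 16 x 16 tiles to keep the example short. A real rate control function would choose the grouping and step sizes itself.

```python
def tile_energy(tile):
    """Sum of absolute values, used here as a stand-in tile characteristic."""
    return sum(abs(v) for row in tile for v in row)


def assign_tile_sets(tiles, threshold):
    """Put each tile into set 0 (low energy) or set 1 (high energy)."""
    return [0 if tile_energy(t) < threshold else 1 for t in tiles]


def quantise(tile, step):
    """Uniform quantisation that truncates towards zero."""
    return [[(abs(v) // step) * (1 if v >= 0 else -1) for v in row] for row in tile]


def dequantise(tile, step):
    """Approximate inverse of quantise."""
    return [[v * step for v in row] for row in tile]


tiles = [[[1, -2], [0, 3]],        # low-energy tile
         [[40, -37], [25, 8]]]     # high-energy tile
steps = {0: 4, 1: 2}               # one step size per tile set (hypothetical values)

sets = assign_tile_sets(tiles, threshold=50)
quantised = [quantise(t, steps[s]) for t, s in zip(tiles, sets)]
print(sets)
print(quantised)
```

Only the `steps` table and the per-tile set index need to reach the decoder, which is why the number of tile sets carries a signalling cost.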
S-Tree Representation and Entropy Encoding
Each quantised 16 x 16 tile is represented by an S-tree structure. The diagram below illustrates how elements in the 16 x 16 grid are grouped in sets of four and formed into a tree structure. When entropy coding is applied, this tree structure is a very efficient way of representing the 16 x 16 elements if the array is sparsely populated, i.e., a significant number of the elements are zero.
S-Tree structure for 16 x 16 grid
It is perhaps easier to describe how the entropy encoding of a sparsely populated array works by considering a 1D array as illustrated in the diagram below. This shows a sparsely populated 50x1 array with non-zero values in elements 9, 10, 11, 12, 20, 22, 23 and 24.
If the size of the array is known in advance, then the number of layers in the tree structure is implied; in this case it is 5. It is also assumed that any element that is not explicitly decoded has the value 0.
The tree structure is described using metadata or T codes where:
- 1 means jump to the first node in the layer above
- 0 means continue to the next node in the current layer unless no further nodes so backtrack to the
- X means don’t care; treat as zero
The process of decoding such a tree is called de-sparsification.
In this example the last three nodes in layer -3 are inactive because the array is known to have only 50 elements whereas the tree structure can represent 256 elements; hence they are represented by X in the T code.
A similar approach is adopted by VC-6 to efficiently represent the 2D 16 x 16 tiles.
Fifty element 1D array example
The final stage of compression involves the application of standard data compression techniques such as range encoding to produce the final VC-6 compressed bitstream.
The VC-6 bitstream consists of the following major parts:
- Primary Header
  - Fixed size of 168 bits giving the high-level stream information such as the format version, size, number of planes etc.
- Secondary Header
  - Variable size which is determined by some of the data values within the primary header and includes the heights and widths of each plane, and the number of echelons.
- Tertiary Header
  - Variable size which is determined by some of the data values within the primary and secondary headers.
The VC-6 bitstream can be organised in several different ways depending upon the application. Figure 8, for example, shows how it can be ordered so that, in a streaming type of application, the decoder can decode the lower resolution versions before it has received the entire image. If, however, fast access to one particular image plane is important, then the payload can be ordered by plane rather than by echelon (or resolution).
VC-6 bitstream structure
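The two payload orderings mentioned above amount to sorting the same set of chunks by different keys. The sketch below is purely illustrative: the `(plane, echelon)` tuples are a hypothetical stand-in for payload chunks and do not reflect the actual bitstream syntax.

```python
planes = ["R", "G", "B"]
echelons = [-2, -1, 0]   # -2 is the lowest resolution in this example

# One hypothetical payload chunk per (plane, echelon) pair.
chunks = [(p, e) for p in planes for e in echelons]

# Echelon-major: all planes of the lowest resolution arrive first (streaming).
by_echelon = sorted(chunks, key=lambda c: (c[1], planes.index(c[0])))

# Plane-major: everything needed for one plane is contiguous (fast plane access).
by_plane = sorted(chunks, key=lambda c: (planes.index(c[0]), c[1]))

print(by_echelon)
print(by_plane)
```

With the echelon-major order a decoder can display a low resolution picture after the first three chunks, whereas the plane-major order lets it reconstruct the whole R plane without reading any G or B data.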