H.264/AVC——新一代视频压缩编码标准及其应用前景


【摘要】本文介绍了新一代视频压缩编码标准H.264/AVC的体系结构、为提高压缩效率和网络亲和性所开发和采用的部分新技术,给出了新标准应用前景的初步展望。
【关键词】标准 视频 编码 网络

H.264/AVC是一个新的视频压缩编码标准,它是由两个国际标准化组织ITU-T和ISO/IEC的专家于2001年12月成立的联合专家组(JVT)于2003年3月起草完成的,分别被ITU-T和ISO/IEC命名为H.264和ISO/IEC 14496-10 “Advanced Video Coding”(AVC)。
H.264/AVC新标准公布后被普遍看好,业内人士誉之为“下一代视频压缩编码标准”。它最主要的特点有两个:
1.在同等图像质量条件下,视频压缩比是H.263和MPEG-4的2倍;
2.对于各种网络环境,特别是IP和无线网络具有良好的适应性。

一、H.264/AVC的体系结构
H.264/AVC提出了在视频编码层(Video Coding Layer, VCL)和网络提取层(Network Abstraction Layer, NAL)之间进行概念性分割的新思路,前者是视频内容的核心压缩内容之表述,后者是通过特定类型网络进行传输的表述,这样的结构便于信息的封装和对信息进行更好的优先级控制。
VCL中包括编码器与解码器,主要功能是视频数据的压缩编码和解码,包含运动补偿、变换编码、熵编码等单元,处理的是块、宏块和片的数据,并尽量做到与网络层独立。VCL是视频编码的核心,其中包含许多实现错误恢复的工具,并可按当前的网络情况调整编码参数。NAL则用于为VCL提供一个与网络无关的统一接口,负责对视频数据进行封装打包,使其能够在基于RTP/UDP/IP、H.323/M、MPEG-2传输流和H.320协议的网络中传送。
H.264/AVC定义了三个类(profile),各自支持不同的编码功能,并分别说明了对遵从该类的编码器和解码器的要求。其中,基本类(baseline profile)支持帧内和帧间编码(使用I片和P片)以及使用上下文自适应可变长度编码(CAVLC)的熵编码。主类(main profile)支持隔行视频、使用B片的帧间编码、使用加权预测的帧间编码以及使用基于上下文的算术编码(CABAC)的熵编码。扩展类(extended profile)不支持隔行视频和CABAC,但增加了能够在已编码码流(SP和SI片)之间转换的模式,并改善了错误恢复性能(数据分割)。
基本类的潜在应用领域包括视频电话、视频会议以及无线通信;主类的潜在应用领域包括电视广播和视频存储;扩展类可在流媒体应用中发挥独特的作用。不过,每个类都具有足够的灵活性支持范围广泛的各种应用,因此,上述这些应用的例子并非定论。
图 一 H.264/AVC的基本类、主类和扩展类
图一表示了三类之间的关系和标准所支持的编码工具。从图中可清楚地看出,基本类是扩展类的一个子集,但不是主类的子集。
对编解码器性能的限制是由一组级(Level)定义的,每个级的限制体现在诸如抽样率、图像尺寸、编码比特率以及内存要求等参数。

二、H.264/AVC对视频压缩效率的改进
为了提高视频压缩比,H.264/AVC开发和采用了一系列新的技术,主要有:
1. 采用小尺寸变换和可变块运动补偿。
小块变换:以往的视频压缩编码中,变换常以8×8像素的块为单位。H.264/AVC则采用4×4的小尺寸变换块。由于变换块尺寸变小,对运动物体的分割更为精确,图像变换的计算量减小,运动物体边缘的衔接误差也大为减少。当图像中有较大面积的平滑区域时,为了避免小尺寸变换带来的块间灰度差异,H.264/AVC可对帧内16×16宏块亮度数据的16个4×4块的直流(DC)系数再进行一次4×4变换,对色度数据的4个4×4块的DC系数(每个小块一个,共4个)再进行2×2变换。
整数变换:H.264/AVC不仅使图像变换块尺寸变小,而且这个变换是整数操作,而不是实数运算,即编码器和解码器的变换和反变换的精度相同,没有“反变换误差”。此外,整数变换减小了运算量和复杂度,有利于向定点DSP移植。
可变块运动补偿:在H.264/AVC中,每个16×16的宏块(MB)可分割成7种不同尺寸的块用于运动补偿,在图像的平坦区可采用16×16的块,而细节区则采用尺寸较小的块。这种灵活、精细的宏块分割更切合图像中实际运动物体的形状,于是每个宏块中可包含1、2、4、8或16个运动矢量。4×4整数变换的一个示意性实现见下面的代码。
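下面用 Python/NumPy 给出上文所述 4×4 整数变换的一个最简示意。这只是说明原理的草图:省略了量化、缩放以及标准中与反量化合并的整数反变换矩阵,函数名均为本文为演示而假设的。

import numpy as np

# H.264/AVC 4x4 前向核心变换矩阵 Cf(标准中定义的整数矩阵)
Cf = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]], dtype=np.int64)

def forward_4x4(block):
    """对4x4残差块做核心整数变换 Y = Cf · X · Cf^T(未含量化与缩放)。"""
    return Cf @ block @ Cf.T

def inverse_4x4(coeff):
    """示意性的反变换:直接用Cf的精确逆矩阵还原。
    实际标准在反变换中使用另一组含1/2的整数矩阵并与反量化合并缩放,这里从略。"""
    Ci = np.linalg.inv(Cf.astype(np.float64))
    return Ci @ coeff @ Ci.T

if __name__ == "__main__":
    x = np.random.randint(-255, 256, size=(4, 4))      # 假想的残差块
    y = forward_4x4(x)                                 # 整数变换系数
    x_rec = np.rint(inverse_4x4(y)).astype(np.int64)
    assert np.array_equal(x, x_rec)                    # 正、反变换精度一致,无“反变换误差”
    print(y)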
2. 四分之一样本精度的运动补偿。
在H.264/AVC中采用了1/4像素甚至1/8像素的运动估值,即真正的运动矢量的位移可能是以1/4甚至1/8像素为基本单位的。显然,运动矢量位移的精度越高,则帧间运动补偿后的残差越小,传输码率越低,即压缩比越高。
在H.264/AVC中,1/2像素位置的值用6抽头FIR滤波器(系数为(1,-5,20,20,-5,1)/32)内插获得;在得到1/2像素值之后,1/4像素值再通过线性内插获得,相应的示意代码见本条之后。
对于4:2:0的视频格式,亮度信号的1/4像素精度对应于色度分量的1/8像素运动矢量,因此需要对色度信号进行1/8像素精度的内插运算。
理论上,如果将运动补偿的精度增加一倍(例如从整像素精度提高到1/2像素精度),可有0.5bit/Sample的编码增益,但实际验证发现在运动矢量精度超过1/8像素后,系统基本上就没有明显增益了,因此,在H.264/AVC中,只采用了1/4像素精度的运动矢量模式,而不是采用1/8像素的精度。
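作为示意,下面的 Python/NumPy 片段按上述 6 抽头滤波器对一行亮度样本计算半像素位置的内插值,并用相邻整、半像素样本的平均得到 1/4 像素值。这只是原理草图:省略了二维分离滤波和标准规定的取整细节,边界扩展方式与函数名均为本文假设。

import numpy as np

H_FILTER = np.array([1, -5, 20, 20, -5, 1])   # H.264半像素6抽头滤波器

def half_pel_row(samples):
    """对一行整像素亮度样本计算相邻两点之间的半像素值(除以32并裁剪到0~255)。"""
    s = np.pad(samples.astype(np.int32), (2, 3), mode="edge")    # 简单的边界扩展
    halves = np.array([np.dot(H_FILTER, s[i:i + 6]) for i in range(len(samples))])
    return np.clip((halves + 16) >> 5, 0, 255)                   # (x+16)/32 四舍五入

def quarter_pel_row(samples):
    """1/4像素值:整像素与半像素样本的线性(均值)内插。"""
    halves = half_pel_row(samples)
    return (samples.astype(np.int32) + halves + 1) >> 1

if __name__ == "__main__":
    row = np.array([10, 12, 20, 40, 80, 120, 140, 150], dtype=np.uint8)
    print(half_pel_row(row))
    print(quarter_pel_row(row))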
3. 多参考图像运动补偿:在H.264/AVC中,可采用多个参考帧(例如5个,标准最多允许16个)进行运动补偿,即在编码器的缓存中存有多个已编码的参考帧,编码器从中选择编码效果更好的一个作为参考,并指出是哪个帧被用于预测,这样就可获得比只用前一个已编码帧作预测更好的编码效果,最终有效地改善视频图像质量。
4. 帧内编码的方向性空间预测,即参照已解码域的方向性空间预测,而不采用变换域的预测,从而提高了预测质量。
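下面以帧内4×4亮度块的三种基本预测模式(垂直、水平、DC)为例,给出方向性空间预测的一个简化示意。标准共定义了9种4×4帧内预测模式,这里省略了其余6种、参考像素可用性判断和边界处理,函数名为本文假设。

import numpy as np

def intra4x4_predict(mode, top, left):
    """对一个4x4块做方向性空间预测。
    top: 上方相邻的4个已重构像素; left: 左侧相邻的4个已重构像素。"""
    if mode == "V":                          # 垂直:每一列复制上方像素
        return np.tile(top, (4, 1))
    if mode == "H":                          # 水平:每一行复制左侧像素
        return np.tile(left.reshape(4, 1), (1, 4))
    dc = (top.sum() + left.sum() + 4) >> 3   # DC:上、左参考像素的均值
    return np.full((4, 4), dc, dtype=top.dtype)

def best_mode(block, top, left):
    """按SAD(绝对差之和)在三种模式中选出残差最小的预测模式。"""
    costs = {m: int(np.abs(block.astype(int) - intra4x4_predict(m, top, left).astype(int)).sum())
             for m in ("V", "H", "DC")}
    return min(costs, key=costs.get)

if __name__ == "__main__":
    top = np.array([100, 102, 104, 106]); left = np.array([100, 90, 80, 70])
    block = np.tile(top, (4, 1)) + 1         # 构造一个与垂直模式高度相似的块
    print(best_mode(block, top, left))       # 应输出 'V'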
5. 环内去块滤波:H.264/AVC定义了自适应消除块效应的滤波器,对预测环路内的水平和垂直块边缘进行处理,大大减小了方块效应。
除此之外,H.264/AVC开发和引入的新技术还有:分级块变换,短字长变换,精确匹配变换,加权预测,算术熵编码,上下文自适应熵编码,图像边界上的运动矢量,图像表示法与基准使用图像的能力退耦,基准序列与显示序列退耦等等,本文就不详述了。

三、H.264/AVC对传输环境适应性的改进
为了提高数据错误/丢失的鲁棒性和各种网络环境下操作的灵活性,H.264/AVC采用的一些关键设计包括:
1. 参数集结构:H.264/AVC引入了参数集(parameter set)的概念,每个参数集包含的信息可以用于大量的编码图像。一个序列参数集(sequence parameter set)包含用于一个完整视频序列(一个连续的已编码图像的集合)的参数,其中包括一个标识符(序列参数集标识符)、对帧号和图像顺序计数的限制、用于解码的参考帧数目(包括短期和长期参考帧)、已解码图像的宽度和高度,以及逐行或隔行(帧或帧/场)编码的选择。一个图像参数集(picture parameter set)包含用于一个序列中一个或多个解码图像的参数,其中包括一个标识符(图像参数集标识符)、一个所选序列参数集的标识符、选择VLC或CABAC熵编码的标记、所用片组的数目(和片组映射类型的定义)、可用于预测的列表0和列表1中参考图像的数目、初始量化参数,以及一个指示默认去块滤波器参数是否可修改的标记。
典型情况下,一个或多个序列参数集和图像参数集在解码片头和片数据之前送给解码器。一个编码的片头引用一个图像参数集标识符和这个“活动的”图像参数集。这个“活动的”图像参数集保持活动直到另一个图像参数集被另一个片头引用而激活。图像参数集以类似的方式引用一个序列参数集标识符而“激活”该序列参数集。被激活的参数集保持有效(即它的参数被用于全部连续编码图像)直到不同的序列参数集被激活。
2. 灵活的宏块顺序:灵活的宏块次序是H.264/AVC的一大特色,通过设置宏块次序映射表(MBAmap)来任意地指配宏块到不同的片组,FMO模式打乱了原宏块顺序,降低了编码效率,增加了时延,但增强了抗误码性能。FMO模式分割图像的模式各种各样,重要的有棋盘模式、矩形模式等。当然FMO模式也可以使一帧中的宏块顺序分割,使得分割后的片的大小小于无线网络的MTU尺寸。经过FMO模式分割后的图像数据分开进行传输,以棋盘模式为例,当一个片组的数据丢失时可用另一个片组的数据(包含丢失宏块的相邻宏块信息)进行错误掩盖。实验数据显示,当丢失率为(视频会议应用时)10%时,经错误掩盖后的图像仍然有很高的质量。
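作为示意,下面的 Python 片段按棋盘模式生成宏块到片组的映射表(即MBAmap),并检验:当其中一个片组整体丢失时,每个丢失宏块的四邻域中仍有属于另一片组的宏块可用于错误掩盖。这只是原理草图,函数名为本文假设。

def checkerboard_mbamap(mb_cols, mb_rows):
    """棋盘模式FMO:相邻宏块交替划入片组0和片组1,返回按行排列的MBAmap。"""
    return [[(x + y) % 2 for x in range(mb_cols)] for y in range(mb_rows)]

def concealable(mbamap, x, y, lost_group):
    """若宏块(x, y)属于丢失的片组,检查其上下左右是否存在未丢失片组的宏块。"""
    if mbamap[y][x] != lost_group:
        return True                      # 该宏块本身未丢失
    rows, cols = len(mbamap), len(mbamap[0])
    neighbours = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return any(0 <= nx < cols and 0 <= ny < rows and mbamap[ny][nx] != lost_group
               for nx, ny in neighbours)

if __name__ == "__main__":
    amap = checkerboard_mbamap(11, 9)    # 以QCIF为例:11x9个宏块
    assert all(concealable(amap, x, y, lost_group=1)
               for y in range(9) for x in range(11))
    print(amap[0])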
3. 任意的片顺序(ASO):基本类支持任意的片顺序(ASO),也就是说,一个编码帧中的片可以遵循任一解码顺序。如果在解码帧中任一片中的第一个宏块与同一图像中此前解码的片相比具有较小的宏块地址,这时将使用ASO。
4. 数据分割:通常情况下,一个宏块的数据是存放在一起而组成片的;数据分割则把一个片中语义相关的宏块数据重新组合成若干分割,再由分割来组装片。在H.264/AVC中有三种不同的数据分割:
(1)头信息分割:包含片中宏块的类型、量化参数和运动矢量,是片中最重要的信息;
(2)帧内信息分割:包含帧内CBP和帧内系数,帧内信息可以阻止错误的蔓延;
(3)帧间信息分割:包含帧间CBP和帧间系数,通常比前两个分割大得多。
帧内信息分割结合头信息解出帧内宏块,帧间信息分割结合头信息解出帧间宏块。帧间信息分割的重要性最低,对重同步没有贡献。当使用数据分割时,片中的数据根据其类型被保存到不同的缓存,同时片的大小也要调整,使得片中最大的分割小于MTU尺寸。
解码端若获得所有的分割,就可以完整重构片;解码端若发现帧内信息或帧间信息分割丢失,可用的头信息仍然有很好的错误恢复性能。这是因为宏块类型和宏块的运动矢量含有宏块的基本特征。
5. SP/SI同步切换图像。SP和SI片是特殊编码的片,它使视频解码器能够有效地在两个视频流之间切换和有效的随机存取。流媒体应用的共同需求是视频解码器要在几个编码流之间切换。例如,同一视频素材以多种码率编码,通过因特网传输,解码器试图以接收到的最高码率流进行解码,但是如果数据流量降低了又需要自动切换到低码率流。
SP片用于支持相似编码序列之间的切换(例如,以不同码率编码的同源序列),它不像I片那样增加码率。在切换点,有三个SP片,各用运动补偿预测编码(比I片更有效)。SP片A2可用参考图像A1解码,SP片B2可用参考图像B1解码。切换过程的关键是SP片AB2(称为切换SP片),它可用解码运动补偿参考图像A1的方式生成,去产生解码帧B2(即解码器输出帧是相同的,无论解码是从B1到B2,还是从A1到AB2)。在每一切换点需要一个额外的SP片(事实上以另一方向切换将需要另一个SP片BA2),但这可能比把帧A2和B2编码为I片更有效。

图 二 用SP片切换码流
图三是SP片A2编码过程的简图:从A2帧中减去A1′(A1的解码重构帧)的运动补偿预测,然后对残差进行编码。与“标准的”P片不同,这里的减法是在变换域(即块变换之后)进行的。SP片B2以同样的方式编码。此前已解码A1帧的解码器可对SP片A2解码。

图 三 SP片A2的编码
SP片AB2的编码如图四所示。B2帧(要切入的帧)先被变换;再以A1′(要切出的那个码流的已解码帧)为参考完成运动补偿预测,“MC”模块为B2帧的每个宏块寻找最佳匹配。该运动补偿预测经过变换后,从已变换的B2中减去(即对切换SP片而言,减法同样发生在变换域),所得残差被量化、编码和传输。

图 四 SP片AB2的编码
此前已解码出A1′的解码器可解码SP片AB2以产生B2′:对A1′作运动补偿(使用作为AB2一部分编码传输的运动矢量),将预测结果变换后加上已解码并标度(反量化)的残差,再对结果做反变换,即得到B2′。

图 五 SP片AB2的解码
如果流A和B是同一原始序列以不同比特率编码的不同版本,从A1”(SP片AB2)得出的B2的运动补偿预测将是完全有效的。结果表明,用SP片在同一序列的不同版本之间作切换比在切换点插入I片显著的更有效。
SP片的另一个应用是提供随机存取和类似VCR的功能。例如,一个SP片和切换SP片放在帧10的位置,解码器可以通过解码A0从A0直接快进到A10,然后解码切换SP片A0-10通过从A0预测去产生A10。

图 六 用SP片快进
SI片是扩展类支持的另一种用于切换的片。它的应用与SP片类似,不同之处在于预测是用4×4帧内预测方式、由此前已解码的重构样本形成的。SI片可用于从一个序列切换到另一个完全不同的序列,在这种情况下,由于两个序列之间不存在相关性,运动补偿预测不再有效。
6. 设置NAL单元:每个NAL单元是一个数据包,采用统一的数据格式,由单个字节的包头信息和多个字节的载荷数据组成,载荷可以是视频数据,也可以是组帧、逻辑信道信令、定时信息、序列结束信号等附加信息。包头中包含存储标志和类型标志:存储标志用于指示当前数据是否属于被参考的帧,类型标志用于指示载荷数据的类型。此外,还可在码流中插入冗余编码图像,以进一步增强抗误码能力。
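下面的 Python 片段按上述格式解析NAL单元的单字节包头,其中 forbidden_zero_bit、nal_ref_idc、nal_unit_type 分别占1、2、5比特。这只是示意,不涉及起始码和防竞争字节的处理,函数名为本文假设。

def parse_nal_header(first_byte):
    """解析NAL单元的第一个字节,返回其中的三个字段。"""
    return {
        "forbidden_zero_bit": (first_byte >> 7) & 0x01,  # 必须为0
        "nal_ref_idc":        (first_byte >> 5) & 0x03,  # 非0表示该单元被后续图像参考
        "nal_unit_type":      first_byte & 0x1F,         # 载荷类型,如7=SPS、8=PPS、5=IDR片
    }

if __name__ == "__main__":
    print(parse_nal_header(0x67))   # nal_ref_idc=3, type=7,通常是一个序列参数集(SPS)
    print(parse_nal_header(0x41))   # nal_ref_idc=2, type=1,非IDR片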

四、应用前景
面对网络与多媒体日益广泛的应用,人们对媒体信息的消费需求不断增加,统一的国际标准是使多媒体信息和技术产品在全球范围内通用的必要基础。
目前,在信息资源中对视频信息的开发和利用是很不够的。众所周知,视频信息具有直观性、确定性、高效性等特点。在话音、数据和视频构成的多媒体信息流中视频流将逐步成为主要组成部分。只要看看目前数字电视、网络电视、手机电视、移动电视、会议电视、可视电话、网上教育、网上医疗、网上游戏、视频点播、彩信等多种视频业务的蓬勃发展,就可以知道下一代网络中信息流的主角定将是多媒体视讯流。
H.264/AVC标准的开发目标是实现多媒体业务在各个领域的应用,涉及面非常广泛,不同的应用对应的码率、分辨率、质量和服务也不同。
H.264/AVC标准使运动图像压缩技术上升到了一个更高的阶段,在较低带宽上提供高质量的图像传输是H.264/AVC的应用亮点,因此,H.264/AVC将对诸如数字卫星广播、数字视频存储以及互联网传播等一系列技术进行改进,以提高视频质量,扩展多媒体业务的应用范围。
H.264/AVC的基本类无需使用版权,具有开放的性质。它不仅比H.263和MPEG-4节约了50%的码率,而且对IP和无线网络传输具有更好的支持功能。它引入了面向IP包的编码机制,有利于网络中的分组传输,支持网络中视频的流媒体传输。这对目前因特网传输多媒体信息、移动网中传输宽带信息等都具有重要意义。
H.264/AVC提供包传输网络中处理包丢失所需的工具,以及在易误码的无线网中处理比特误码的工具, 能够更好地处理信息包和数据丢失,具有较强的抗误码特性,可适应丢包率高、干扰严重的无线信道中的视频传输。
H.264/AVC支持不同网络资源下的分级编码传输,在所有码率下都能持续提供较高的视频质量。既能工作在低延时模式以适应实时通信的应用(如视频会议),又能很好地工作在没有延时限制的应用,如视频存储和以服务器为基础的视频流式应用。
H.264/AVC可满足多种应用的需求,目前主要应用在以下领域:基于电缆、卫星、Modem、DSL、无线及移动网络等信道的数字电视广播、可视电话、视频会议、实时监控、流式多媒体业务、低比特率下的移动多媒体通信如彩信、手机电视等,以及视频数据在光学或磁性设备上的存储等。有关专家认为,H.264/AVC最终将取代目前已获得广泛应用的MPEG-2标准。

参考文献
Iain E. G. Richardson, H.264 and MPEG-4 Video Compression, John Wiley & Sons Ltd, 2003

多媒体通信系统(3.4.4)

3.4.4 Content-Based Image Retrieval
To address their challenges, multimedia signal-processing methods must allow efficient access to processing and retrieval of content in general, and visual content in particular. This is required across a large range of applications, in medicine, entertainment, consumer industry, broadcasting, journalism, art and e-commerce. Therefore, methods originating from numerous research areas, that is, signal processing, pattern recognition, computer vision, database organization, human-computer interaction and psychology, must contribute to achieving the image-retrieval goal. An example of image retrieval is:
Given: A query.
Retrieve: All images that have similar content to that of the query.
为应对这些挑战,多媒体信号处理方法必须支持对一般内容、尤其是视觉内容的高效处理与复现(检索)。这对很大范围的应用都是必需的,例如医学、娱乐、消费工业、广播、新闻业、艺术以及电子商务。因此,源于信号处理、模式识别、计算机视觉、数据库组织、人机交互和心理学等众多研究领域的方法,都必须为实现图像复现的目标作出贡献。图像复现的一个例子是:
给定:一个查询
检索:所有与查询具有相似内容的图像。
Image-retrieval methods face several challenges when addressing this goal [3.68]. These challenges, which are summarized in Table 3.1, cannot be addressed by text-based image retrieval systems, which have had an unsatisfactory performance so far. In these systems, the query keywords are matched with keywords that have been associated to each image. Because of difficult automatic selection of the relevant keywords, time consuming and subjective manual annotation is required. Moreover, the vocabulary is limited and must be expanded as new applications emerge.
对于给定的这个目标,图像复现法面临几个挑战。这些挑战列在表3.1中,基于文本的图像复现系统无法解决,迄今为止,它的性能不能令人满意。在这些系统中,查询关键词与已经对每一图像建立关联的关键词相匹配。由于自动选择相应关键词的困难,因此要耗费时间并需要个人人工注解。此外,当有新的应用时,还必须对有限的词汇进行扩展。
To improve performance and address these problems, content-based image retrieval methods have been proposed. These methods have generally focused on using low-level features such as color, texture and shape layout, for image retrieval, mainly because such features can be extracted automatically or semiautomatically.
为改善性能解决问题,已经提出了基于内容的图像复现法。这些方法通常聚焦于低级特征,例如色彩、纹理和外形轮廓,对于图像复现,这些特征能够自动或半自动地提取。
Texture-Based Methods
Statistical and syntactic texture description methods have been proposed. Methods based on spatial frequencies, co-occurrence matrixes and multiresolution methods have been frequently employed for texture description because of their efficiency [3.69]. Methods based on spatial frequencies evaluate the coefficients of the autocorrelation function of the texture. Co-occurrence matrixes identify repeated occurrences of gray level pixel configurations within the texture.
已经提出了统计的和句法(结构)的纹理描述法。其中,基于空间频率的方法、共生矩阵法及多分辨率法由于效率高而常被用于纹理描述。基于空间频率的方法估计纹理自相关函数的系数;共生矩阵法则识别纹理内灰度级像素组态的重复出现。
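作为示意,下面的 Python/NumPy 片段对一幅灰度图像统计给定位移下的灰度共生矩阵,并由它得到对比度、能量两个常用纹理特征。这只是原理草图,未做方向合并等处理,量化级数与函数名为本文假设。

import numpy as np

def cooccurrence_matrix(img, dx=1, dy=0, levels=8):
    """统计位移(dx, dy)下灰度对(i, j)出现的次数,返回归一化的共生矩阵。"""
    q = img.astype(np.int32) * levels // 256          # 量化到levels个灰度级
    h, w = q.shape
    glcm = np.zeros((levels, levels), dtype=np.float64)
    for y in range(max(0, -dy), h - max(0, dy)):
        for x in range(max(0, -dx), w - max(0, dx)):
            glcm[q[y, x], q[y + dy, x + dx]] += 1
    return glcm / glcm.sum()

def texture_features(glcm):
    """由共生矩阵计算对比度和能量(角二阶矩)两个纹理特征。"""
    i, j = np.indices(glcm.shape)
    return {"contrast": ((i - j) ** 2 * glcm).sum(), "energy": (glcm ** 2).sum()}

if __name__ == "__main__":
    img = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
    print(texture_features(cooccurrence_matrix(img, dx=1, dy=0)))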
Table 3.1 Image retrieval challenges [3.68].
Query types: Color based / shape based / color and shape based; quantitative, for example, find all images with 30% amount of red
Query forms: Query by example, for example, image region / image / sketch / other examples
Various content: For example, natural scenes / head-and-shoulder images / MRIs
Matching types: Object to object / image to image / object to image (application specific)
Precision levels: Exact versus similarity-based match
Presentation of results: Application specific
Multiresolution methods describe the texture characteristics at coarse-to-fine resolutions. A major problem that is associated with most texture description methods is their sensitivity to scale, that is, the texture characteristics may disappear at low resolutions or may contain a significant amount of noise at high resolutions [3.70, 3.71, 3.72].
多分辨率法以由粗到细的分辨率描述纹理特征。大多数纹理描述法的一个主要问题是对尺度敏感,即在低分辨率时纹理特征可能消失,而在高分辨率时又可能包含大量噪声。
Shape-Based Methods
Describing quantitatively the shape of an object is a difficult task. Several contour-based and region-based shape description methods have been proposed. Chain codes, geometric border representations, Fourier transforms of the boundaries, polygonal representations and deformable (active) models are some of the boundary-based shape methods that have been employed for shape description. Simple scalar region descriptors, moments, region decompositions and region neighborhood graphs are region-based methods that have been proposed for the same task [3.73, 3.74]. Contour-based and region-based methods are developed in either the spatial or transform domains, yielding different properties of the resulting shape descriptors. The main problems that are associated with shape description methods are high sensitivity to scale, difficult shape description of objects and high subjectivity of the retrieved shape results.
定量地描述一个对象的形状是一项困难的任务。已经提出了若干基于轮廓和基于区域的形状描述法。链码、几何边界表示法、边界的傅立叶变换、多边形表示法以及可变形(主动)模型是一些已用于形状描述的基于边界的方法;简单标量区域描述符、矩、区域分解和区域邻接图则是为同一任务提出的基于区域的方法。基于轮廓和基于区域的方法可以在空间域或变换域中建立,所得形状描述符具有不同的性质。形状描述法的主要问题是对尺度高度敏感、物体形状难以描述以及检索到的形状结果带有很强的主观性。
Color-Based Methods
Color description methods are generally color histogram based, dominant color based and color moment based [3.75, 3.76]. Description methods that employ color histograms use a quantitative representation of the distribution of color intensities. Description methods that employ dominant colors use a small number of color ranges to construct an approximate representation of color distribution. Description methods that use color moments employ statistical measures of the image characteristics in terms of color.
色彩描述法通常基于色彩直方图、基于支配色、基于色矩。使用色彩直方图的描述法用色强度分布的定量表示。使用支配色的描述法用少量的色彩范围构造色彩分布的近似表示。用色矩的描述法用图像特征在色彩上的统计度量。
The performance of these methods typically depends on the color space, quantization, and distance measures employed for evaluation of the retrieved results. The main problem that is associated with histogram-based and dominant-color-based methods is their inability to allow the localization of an object with the image. A solution to address this problem is to apply color segmentation, which allows both image-to-image matching and object localization. The main problem of color-moment-based methods is their complexity, which makes their application to browsing or other image-retrieval functionalities difficult.
这些方法的性能特别依赖于色彩空间、量化和检索结果的评估使用的距离测量。基于直方图和基于支配色法的主要问题是它们无法对图像中的对象定位。解决这个问题的办法是采用色彩分割,它既考虑图像对图像的匹配,又考虑对象定位。基于色矩法的主要问题是它们的复杂性,使它们在浏览及其它图像复现应用中产生功能性困难。
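作为示意,下面的 Python/NumPy 片段计算图像在RGB空间量化后的色彩直方图,并用直方图交(histogram intersection)这一常见距离度量对查询图像与图像库做相似度排序。这只是原理草图,色彩空间、量化级数与函数名均为本文假设。

import numpy as np

def color_histogram(img, bins=4):
    """把RGB各通道量化成bins级,统计归一化的联合色彩直方图(长度bins**3)。"""
    q = img.astype(np.int32) * bins // 256
    codes = q[..., 0] * bins * bins + q[..., 1] * bins + q[..., 2]
    hist = np.bincount(codes.ravel(), minlength=bins ** 3).astype(np.float64)
    return hist / hist.sum()

def intersection(h1, h2):
    """直方图交:越接近1表示色彩分布越相似。"""
    return np.minimum(h1, h2).sum()

def retrieve(query, database, top_k=3):
    """按与查询图像的直方图相似度对图像库降序排序,返回前top_k个下标。"""
    hq = color_histogram(query)
    scores = [intersection(hq, color_histogram(img)) for img in database]
    return np.argsort(scores)[::-1][:top_k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    db = [rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8) for _ in range(5)]
    print(retrieve(db[2], db))   # 排在最前的应是图像2本身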
Examples of content-based image and video-retrieval systems are included in Table 3.2. Some or all of the limitations of these systems are the following [3.68]:
 Few query types are supported
 Limited set of low-level features
 Difficult access to visual objects
 Results partially match user's expectations
 Limited interactivity with the user
 Limited system interoperability
 Scalability problems
基于内容的图像和视频复现系统的例子列于表3.2中。这些系统的局限如下:
 支持的查询类型少
 低级特征量有限
 难以接入视觉对象
 结果与使用者期望部分匹配
 与使用者互动有限
 系统互操作性有限
 可扩缩性问题
Table 3.2 Examples of content-based image and video-retrieval systems [3.68].
Color and text: WebSeek (I, V; Columbia University), Picasso (I; University of Florence), Chabot (I; University of California, Berkeley), * (I; University of Toronto)
Color, texture and shape: QBIC (I; IBM), PhotoBook (I; MIT), BlobWorld (I; University of California, Berkeley), VIR (I, V; Virage)
Color, shape and scale: Nefertiti (I; National Research Council of Canada)
Color, texture, shape and spatial location: NeTra (I; University of California, Santa Barbara), Digital storyboard (I; Kodak)
Color, texture and motion: WebClip (V; Columbia University), Jacob (I, V; University of Palermo), * (V; IMAX)
N/A: * (V; NASA)
(I = image, V = video. * No name has been adopted for the corresponding system.)

多媒体通信系统(3.4.3)

3.4.3 Video Signal Processing
Digital video has many advantages over conventional analog video, including bandwidth compression, robustness against channel noise interactivity and ease of manipulation. Digital-video signals come in many formats. Broadband TV signals are digitized with ITU-R 601 format, which has 30/25 fps, 720 pixels by 488 lines per frame, 2:1 interlaced, 4:3 aspect ratio, and 4:2:2 chroma sample. With the advent of high-definition digital-video, standardization efforts between the TV and PC industries have resulted in the approval of 18 different digital video formats in the United States. Exchange of video signals between TV and PCs requires effective format conversion. Some commonly used interframe/field filters for format conversion, for example, ITU~ R 601 to the Source Input Format (SIF) and vice versa and 3:2 pull-down to display 24 Hz motion pictures in 60 Hz format, have been reviewed [3.57]. As for video filters, they can be classified as interframe/field (spatial), motion-adaptive and motion-compensated filters [3.58]. Spatial filters are easiest to implement. However, they do not make use of the high temporal correlation in the video signals. Motion-compensated filters require highly accurate motion estimation between successive views. Other more sophisticated format conversion methods include motion-adaptive field-rate doubling and deinterlacing [3.59] as well as motion compensated frame rate conversion [3.58].
与传统的模拟视频相比,数字视频有很多优点,包括带宽压缩、抗信道噪声、交互性和易于操作。数字视频信号有很多格式。广播电视信号以ITU-R 601格式数字化,帧频为30/25 fps,每帧720象素、488线,2:1隔行,4:3宽高比,4:2:2色度抽样。随着高清晰度数字视频的出现,美国TV和PC行业之间的标准化努力的结果是批准了18种数字视频格式。TV和PC之间视频信号交换需要进行格式转换。一些共用的格式转换帧间/场滤波器已经接受评审,例如,ITU-R 601到SIF(源输入格式)及其反向转换、在60 Hz格式中3:2下降到24 Hz动画显示。视频滤波器可以分类为帧间/场滤波器(空间)、运动自适应和运动补偿滤波器。空间滤波器最容易实现。但是,它们不能用于时间关联度高的视频信号。运动补偿滤波器需要相邻图像之间非常精确的运动估计。其它更复杂的格式转换方法包括运动自适应场频倍增和去隔行以及运动补偿帧频转换。
Video signals suffer from several degradations and artifacts. Some of these degradations may be acceptable under certain viewing conditions. However, they become objectionable for freeze-frame or printing from video applications. Some filters are adaptive to scene content in that they aim to preserve spatial and temporal edges while removing the noise. Examples of edge-preserving filters include median, weighted median, adaptive linear mean square error and adaptive weighted-averaging filtering [3.58]. Deblocking filters can be classified as those that do require a model of the degradation process (inverse, constrained, least square, and Wiener filtering) and those that do not (contrast adjustment by histogram specification and unsharp masking). Deblocking filters smooth intensity variations across block boundaries. Video frames also contain significant amounts of temporal redundancy. Namely, successive frames generally have large overlaps with each other. Assuming that frames are shifted by subpixel amounts with respect to each other, it is possible to exploit this redundancy to obtain a high-resolution reference image (mosaic) of the regions covered in multiple views [3.60]. High-resolution reconstruction methods employ least-squares estimation, back projection, or projection onto convex sets methods based on a simple instantaneous camera model or a more sophisticated camera model including motion blur [3.61].
视频信号受到劣化和认为干扰。某些劣化在一定条件下可以接受。但是,凝结帧就令人讨厌了。某些滤波器适用于景物内容,在那种情况下,它们是用来在保持空间和时间边沿的同时去除噪声。边沿保持滤波器的例子包括中值、加权中值、自适应线性均方差以及自适应加权平均滤波。分解块滤波器分为需要劣化过程模型的(反转、受迫、最小平方、维纳滤波)和不需要劣化过程模型的(直方图规范调节对比度和模糊掩蔽)。分解块滤波器平滑时间冗余总量的变化强度。一般地,连续帧通常有大量的相互重叠。假设子象素数量相互关联的帧移动,利用这种冗余就能够获得多重图像区域覆盖下的图像(马赛克)的高分辨率基准。高分辨率重构法采用最小平方估计、后向投影以及基于简单的即时摄像机模型或更复杂的包括运动模糊的摄像机模型的投影自弯曲调整法。
One of the challenges in digital video processing is to decompose a video sequence into its elementary parts (shots and objects). A video sequence is a collection of shots, a shot is a group of frames and each frame is composed of synthetic or natural visual objects. Thus, temporal segmentation generally refers to finding shot boundaries, spatial segmentation corresponds to extraction of visual objects in each frame and object tracking means establishing correspondences between the boundaries of objects in successive frames.
数字视频处理的课题之一是把图像序列分解为基本单元(镜头与对象)。图像序列是镜头的集合,一个镜头是一组帧,每一帧是由合成或自然的视频对象组成的。因此,时间分割一般涉及寻找镜头边界,空间分割对应于从每一帧里提取视频对象,对象跟踪就是使相继帧中对象的边界相互一致。
Temporal segmentation methods edit effects as cuts, dissolves, fades and wipes. Thresholding and clustering using histogram-based similarity methods have been found effective for detection of cuts [3.62]. Detection of special effects with high accuracy requires customized methods in most cases and is a current research topic. Segmentation of objects by means of chroma keying is relatively easy and is commonly employed. However, automatic methods based on color, texture and motion similarity often fail to capture semantically meaningful objects [3.63]. Semiautomatic methods, which aim to help a human operator perform interactive segmentation by tracking boundaries of a manual initial segmentation, are usually required for object-based video editing applications. Object-tracking algorithms, which can be classified as boundary region or model-based tracking methods, can be based on 2D or 3D object representations. Effective motion analysis is an essential part of digital video processing and remains an active research topic.
时间分割法编辑特技如切换、叠化、淡变和划变。已经发现用基于直方图的相似法确定阈值和分组可以有效地检测切换。对特技的高准确度检测在大多数情况下需要定制法,这是目前的研究课题。用色键分割对象相对容易,是目前普遍采用的方法。但是,基于色彩、纹理和运动相似性的自动法捕获有意义的对象时常常失败。半自动法,它的目标是通过跟踪人工初步分割的边界帮助人们完成交互分割,通常需要基于对象的视频编辑软件。对象跟踪算法,可以分为边界区域或基于模型跟踪法,可以以2D及3D对象表示为基础。有效地运动分析是数字视频分析的基本部分并仍是活跃的研究课题。
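作为示意,下面的 Python/NumPy 片段用基于直方图相似度加阈值判决的方法检测镜头切换:相邻两帧灰度直方图的差异超过阈值即判为一次切换。这只是原理草图,阈值、直方图级数等参数为本文假设。

import numpy as np

def frame_histogram(frame, bins=32):
    """计算一帧灰度图像的归一化直方图。"""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def detect_cuts(frames, threshold=0.5):
    """相邻帧直方图的L1距离超过threshold时,判定在该处发生镜头切换。"""
    hists = [frame_histogram(f) for f in frames]
    return [i for i in range(1, len(hists))
            if np.abs(hists[i] - hists[i - 1]).sum() > threshold]

if __name__ == "__main__":
    shot_a = [np.full((36, 44), 60, dtype=np.uint8) for _ in range(5)]    # 较暗的镜头
    shot_b = [np.full((36, 44), 200, dtype=np.uint8) for _ in range(5)]   # 较亮的镜头
    print(detect_cuts(shot_a + shot_b))   # 应输出[5],即第5帧处发生切换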
Storage and archiving of digital video in shared disks and servers in large volumes, browsing of such databases in real time and retrieval across switched and packet networks pose many new challenges, one of which is efficient and effective description of content. The simplest method to index content is by assigning manually or semiautomatically the content to programs, shots and visual objects [3.64]. It is of interest to browse and search for content using compressed data because almost all video data will likely be stored in compressed format [3.65]. Video-indexing systems may employ a frame-based, scene-based or object-based video representation. The basic components of a video-indexing system are temporal segmentation, analysis of indexing features and visual summarization. The temporal-segmentation step extracts shots, scenes and/or video objects. The analysis step computes content-based indexing features for the extracted shots, scenes, or objects. Content-based features may be generic or domain dependent. Commonly used generic indexing features include color histograms, type of camera- motion direction and magnitude of dominant object motion entry and exit instances of objects of interest and shape features for objects [3.66, 3.67]. Domain-dependent feature extraction requires a priori knowledge about the video source, such as new programs, particular sitcoms, sportscasts and particular movies. Content-based browsing can be facilitated by a visual summary of the contents of a program, much like a visual table of contents. Among the proposed visual summarization methods are story boards, visual posters and mosaic-based summaries.
数字视频的存档在共享盘和服务器中要占据大量空间,对这样的数据库的实时浏览和通过包交换网络的重现引出许多新的挑战,其中之一是内容描述的效率和有效性。检索内容最简单的办法是人工或半人工地给节目、镜头和视频对象做目录。重要的是,用压缩的数据对内容进行浏览和检索,因为几乎所有的视频数据可能都是用压缩格式保存的。视频索引系统可以用基于帧、基于景物以及基于对象的视频表示。视频索引系统的基本组成是时间分割、索引特征分析和画面摘要。时间分割阶段是提取镜头、景物和/或视频对象。分析阶段是对提取的镜头、景物以及视频对象计算基于内容的索引特征。基于内容的特征可以随类或域而定。一般用类索引特征包括色彩直方图、摄像机运动的类型、主要对象的运动方向和幅度、重要对象进出的情况、对象的形状特征。与域有关的特征摘要需要关于视频源的知识,例如新节目、一部连续剧、比赛实况转播和一部电影。基于内容的浏览可以借助于节目内容的视频摘要,它很像内容的视频表格。被推荐的视频摘要法有故事板、视频海报和基于马赛克的摘要。

多媒体通信系统(3.4.2)

3.4.2 Speech, Audio and Acoustic Processing for Multimedia
The primary advances in speech and audio signal processing that contributed to multimedia applications are in the areas of speech and audio signal compression, speech synthesis, acoustic processing, echo control and network echo cancellation.
语音和音频信号处理的改进对多媒体应用的贡献在下述范围:语音和音频信号压缩、语音合成、声学处理、回声控制以及网络回声消除。
Figure 3.2 Block diagram for audio-assisted head and shoulder video [3.36]. ~1998 IEEE.
Speech and audio signal compression Signal compression techniques aim at efficient digital representation and reconstruction of speech and audio signals for storage and playback as well as transmission in telephony and networking.
语音和音频信号压缩 信号压缩技术的目标是为电话和网络中的存储、重放和传输进行语音和音频信号的有效的数字表示和重建。
Signal-analysis techniques such as Linear Predictive Coding (LPC) [3.37], and all-pole autoregressive modeling [3.38] and Fourier analysis [3.39], played a central role in signal representation. For compression, VQ [3.40, 3.41] marks a major advance. These techniques are built upon rigorous mathematical frameworks that have become part of the important bases of digital signal processing. Incorporation of knowledge and models of psychophysics in hearing have been proven as beneficial for speech and audio processing. Techniques such as noise shaping [3.42] and explicit use of auditory masking in the perceptual audio coder [3.43] have been found very useful. Today, excellent speech quality can be obtained at less than 8 Kb/s, which forms the basis for cellular as well as Internet telephony. The fundamental structure of the Code- Excited Linear Prediction (CELP) coder is ubiquitous in supporting speech coding at 4 to 16 Kb/s, encompassing such standards as G.728 [3.44], G.729 [3.45], G.723.1, IS-54 [3.46], IS-136 [3.47], GSM [3.48] and FS-1016 [3.491. CD or near-CD-quality stereo audio can be achieved at 64 to 128 Kb/s, less than one twelfth of the original CD rate, and is ready for such applications as Internet audio (streaming and multicasting) and digital radio (digital audio broadcast). Advances in audio-coding standards are supported in MPEG activities.
信号分析技术例如线性预测编码(LPC)、全极点自回归模型和傅立叶分析在信号表示中扮演着主要角色。对于压缩,VQ标志着一个重要进步。这些技术都建立在严格的数学框架之上,并已成为数字信号处理的重要基础的一部分。语音和音频信号处理已经从听觉的精神物理学知识与模型的结合中获得了益处。噪声频谱成型一类的技术和听觉屏蔽在知觉音频编码器中的直接应用已被发现非常有用。今天,在低于8 Kb/s的条件下已能获得极好的音质,这已成为蜂窝以及因特网电话的基础。码激励线性预测编码器的基本结构已经普遍用于支持4~16 Kb/s速率的语音编码,包括G.728 [3.44], G.729 [3.45], G.723.1, IS-54 [3.46], IS-136 [3.47], GSM [3.48] and FS-1016 [3.49]等标准。在64 ~128 Kb/s可达到CD或接近CD质量的立体声,速率低于CD码率的十二分之一,已经用于因特网音频(流和多播)和数字广播(数字音频广播)。MPEG支持音频编码标准的改进。
Speech synthesis The area of speech synthesis includes generation of speech from unlimited text, voice conversion and modification of speech attributes such as time scaling and articulatory mimic [3.50]. Text-to-speech conversion takes text as input and generates human-like speech as output [3.51]. Key problems in this area include conversion of text into a sequence of speech inputs (in terms of phonemes, dyades or syllables), generation of the associated prosodic structure and intonation and methods to concatenate and reconstruct the sound waveform. Voice conversion refers to the technique of changing one person”s voice to another, from person A to person B or from male to female and vice versa. It is useful to be able to change the time scale of a signal (to speed up or slow down the speech signal which changes the pitch) or to change the mode of the speech (making it sound happy or sad) [3.52]. Many of these signal-processing techniques have appeared in animation and computer graphics applications.
语音合成 语音合成的范围包括来自无约束文本语音的产生、话音语音特征例如时间尺度的转换和修改以及拟声。文本到语音转换以文本为输入,以产生的类人语音为输出。这个领域的关键问题包括文本变换到语音输入序列(术语叫音素或音节)、建立语法结构与音调的关联以及连接和重建声音波形的方法。话音转换涉及到把一个人的声音变为另一个人的技术,从人A到人B以及从男到女等等。能够改变信号的时间尺度(语音信号的快速或慢速以改变音调)或者语音的模式(欢快或悲愁的声音)是非常有用的。许多这些信号处理技术已经出现在动画和计算机图形这些应用中。
Acoustic processing and echo control Sound pickup and playback is an important area of multimedia processing. In sound recording, interference, such as ambient noise and reverberation, degrade the quality. The idea of acoustic signal processing and echo control is to allow straightforward high-quality sound pickup and playback in applications, such as a duplex device like a speakerphone, a sound source-tracking apparatus like microphone arrays, teleconferencing systems with stereo input and output, hands-free cellular phones and home theatre with 3D sound.
声学处理和回声控制 拾音和重放是多媒体处理的一个重要领域。录音时,环境噪声和回响之类的干扰使录音质量劣化。声学信号处理和回声控制是想在应用中能够获得高质量的拾音和重放,这些应用包括耳麦之类的双工设备、麦克风阵列之类的声源跟踪设备、立体声输入输出的远程会议系统、不用手的蜂窝电话以及3D声音的家庭影院等。
Signal processing for acoustic echo control includes modeling of reverberation, design of dereverberation algorithms, echo suppression, double-talk detection and adaptive acoustic echo cancellation, which is still a challenging problem in stereo full-duplex communication environments [3.53].
声学回声控制的信号处理包括回响模型、消回响算法设计、回声抑制、双方讲话检测以及适应回声消除,这仍然是立体声全双工通信环境中富有挑战性的问题。
Example 3.3 For typical environments, the system modeling time for reverberation is of the order of 100 ms. This, at a sampling rate of 16 KHz, translates into an echo-canceling filter of 1600 taps, requiring seconds to converge.
例3.3 在典型环境中,回响的系统建模时间约为100 ms量级。在16 KHz的抽样频率下,这相当于一个1600抽头的回声消除滤波器,需要数秒钟才能收敛。
For sound pickup, acoustic processing aims at the design of transducers or transducer arrays to achieve a durable directionality (beam steering and width control) as well as noise resistance. Understanding of near and far-field acoustics is important in achieving the required response in specific applications [3.54]. Various 1D and 2D microphone arrays have been shown in teleconferencing and auditorium applications with good results [3.55].
对于拾音,声学处理的目标是设计耐用的指向性(束调整和宽度控制)及抗噪声换能器或换能器阵列。在特殊应用中为得到所需要的响应必须掌握近场和远场声学特征。各种1D和2D麦克风阵列已经在远程会议和礼堂中获得良好的应用。
Network echo cancellation In telephony, both near-end and far-end echo exists due to the hybrid coil that is necessary for two-wire and four-wire conversions. Network echo can be so severe that it hampers telephone conversation. Network echo cancellers were invented to correct the problem in the late 1960s, based on the Least Mean Squares (LMS) adaptive echo cancelation algorithm [3.56]. The network echo delay is of the order of 16 ms, typically requiring a filter with 128 taps at a sampling rate of 8 KHz.
网络回声消除 在电话系统中,由于二/四线变换所需的混合线圈,同时存在近端和远端回声。网络回声可能严重到妨碍电话交谈。为解决这一问题,20世纪60年代后期发明了基于最小均方(LMS)自适应回声消除算法的网络回声消除器。网络回声延迟约为16 ms量级,在8 KHz抽样频率下通常需要一个128抽头的滤波器。
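下面用 Python/NumPy 给出LMS自适应回声消除的一个最简示意:用128抽头滤波器辨识一个模拟的回声路径,并从近端信号中减去回声估计。这只是原理草图,回声路径、步长等参数均为本文假设。

import numpy as np

def lms_echo_canceller(far_end, mic, taps=128, mu=0.002):
    """LMS自适应回声消除:far_end为远端参考信号,mic为带回声的近端信号,
    返回误差信号,即减去回声估计后的输出。"""
    w = np.zeros(taps)                              # 自适应滤波器系数
    err = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        x = far_end[n - taps + 1:n + 1][::-1]       # 最近taps个远端样本(按时间倒序)
        y = w @ x                                   # 回声估计
        err[n] = mic[n] - y
        w += 2 * mu * err[n] * x                    # LMS系数更新
    return err

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    far = rng.standard_normal(16000)                          # 远端参考信号(白噪声示意)
    echo_path = 0.5 * np.exp(-np.arange(128) / 20.0)          # 假设的指数衰减回声路径
    mic = np.convolve(far, echo_path)[:len(far)]              # 近端信号只含回声
    out = lms_echo_canceller(far, mic)
    print(np.sum(mic[4000:] ** 2) / np.sum(out[4000:] ** 2))  # 收敛后回声被明显抑制(比值远大于1)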

多媒体通信系统(3.4)

3.4 Challenges of Multimedia Information Processing
Novel communications and networking technologies are critical for a multimedia database system to support interactive dynamic interfaces. A truly integrated media system must connect with individual users and content-addressable multimedia databases. This will be a logical connection through computer networks and data transfer.
新的通信和网络技术对于多媒体数据库系统支持交互式动态接口至关重要。一个真正的集成媒体系统必须把各个用户与内容可寻址的多媒体数据库连接起来,这将是一种通过计算机网络和数据传输实现的逻辑连接。
To advance the technologies of indexing and retrieval of visual information in large archives, multimedia content-based indexing would complement the text-based search. Multimedia systems must successfully combine digital video and audio, text, animation, graphics and knowledge about such information units and their interrelationships in real time.
为了推进大型档案库中视觉信息的索引与检索技术,基于内容的多媒体索引将成为基于文本检索的补充。多媒体系统必须能够实时地把数字视频和音频、文本、动画、图形以及有关这些信息单元及其相互关系的知识成功地组合在一起。
The operations of filtering, sampling, spectrum analysis and signal representation are basic to all of signal processing. Understanding these operations in the multidimensional (mD) Case has been a major activity since 1975 [3.15, 3.16, 3.17]. More key results since that time have been directed at the specific applications of image and video processing, medical imaging, and array processing. Unfortunately, there remains considerable cross-fertilization among the application areas.
滤波、抽样、频谱分析以及信号表示等运算是整个信号处理的基础。自1975年以来,理解多维(mD)情形下的这些运算一直是一项主要工作。此后,更多的关键成果面向图像和视频处理、医学成像以及阵列处理等具体应用。遗憾的是,各应用领域之间仍存在大量的交叉重复。
Algorithms for processing mD signals can be grouped into four categories:
 Separable algorithms that use 1D operators to process the rows and columns of a multidimensional array
 Nonseparable algorithms that borrow their derivation from their 1D counterparts
 mD algorithms that are significantly different from their 1D counterparts
 mD algorithms that have no 1D counterparts.
mD信号处理算法可以归纳为四类:
 用1D算子依次处理多维阵列的行和列的可分离算法
 从对应的1D算法借用其推导方法的不可分离算法
 与1D对应算法显著不同的mD算法
 没有1D对应算法的mD算法
Separable algorithms operate on the rows and columns of an mD signal sequentially. They have been widely used for image processing because they invariably require less computation than nonseparable algorithms. Examples of separable procedures include mD Discrete Fourier Transforms (DFTs), DCTs and Fast Fourier Transform (FFT)-based spectral estimation using the periodogram. In addition, separable Finite Impulse Response (FIR) filters can be used in separable filter banks, wavelet representations for mD signals and decimators and interpolators for changing the sampling rate.
可分离算法依次对mD信号的行和列进行运算。由于它们所需的计算量总是少于不可分离算法,因此一直被广泛用于图像处理。可分离方法的例子包括mD离散傅立叶变换(DFT)、DCT,以及基于快速傅立叶变换(FFT)、采用周期图的频谱估计。另外,可分离的有限冲激响应(FIR)滤波器可用于可分离滤波器组、mD信号的小波表示以及改变抽样速率的抽取器和内插器。
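下面的 Python/NumPy 片段演示可分离算法的典型做法:二维DFT按“先对每一行做1D FFT、再对每一列做1D FFT”的行列分解实现,并与直接的二维FFT结果对照。这只是原理示意。

import numpy as np

def dft2_row_column(x):
    """用可分离的行列分解计算二维DFT:先按行做1D FFT,再按列做1D FFT。"""
    rows = np.fft.fft(x, axis=1)      # 对每一行做一维FFT
    return np.fft.fft(rows, axis=0)   # 再对每一列做一维FFT

if __name__ == "__main__":
    x = np.random.default_rng(0).standard_normal((8, 8))
    assert np.allclose(dft2_row_column(x), np.fft.fft2(x))   # 与直接二维FFT一致
    print("row-column decomposition matches fft2")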
The second category contains algorithms that are uniquely mD in that they cannot be decomposed into a repetition of 1D procedures. These can usually be derived by repeating the corresponding 1D derivation in an mD setting. Upsampling and downsampling are some examples. As in the 1D case, bandlimited multidimensional signals can be sampled on periodic lattices with no loss of information. Most 1D FIR filtering and FFT-based spectrum analysis algorithms also generalize straightforwardly to any mD lattice [3.18]. Convolutions can be implemented efficiently using the mD DFT either on whole arrays or on subarrays. The window method for FIR filter design can be easily extended, and the FFT algorithm can be decomposed into a vector-radix form, which is slightly more efficient than the separable row/column approach for evaluating multidimensional DFTs [3.19, 3.20]. Nonseparable decimators and interpolators have also been derived that may eventually be used in subband image and video coders [3.21]. Another major area of research has been spectral estimation. Most of the modern spectral estimators, such as the maximum entropy method, require a new formulation based on constrained optimization. This is because their 1D counterparts depend on factorization properties of polynomials [3.22]. An interesting case is the maximum likelihood method, where the 2D version was developed first and then adapted to the 1D situation [3.23].
第二类是唯一不能分解为重复1D规程的mD算法。它们通常通过在一个mD框架内重复相应的1D推导而推导出来。升抽样和降抽样就是它们的例子。如同1D的情况下,带限多维信号可以信息无损地按照周期晶格抽样。大多数1D FIR滤波和基于FFT的频谱分析算法也直接归为任一mD晶格。用mD DFT对阵列或子阵列都可以有效地进行卷积运算。FIR滤波器设计的窗口法易于扩展,FRI的算法可以分解为矢量基数形式,它比用分离的行/列逼近法求多维DFT的值效率稍高一些。不可分离的抽值器和内插器也已被导出,可最终用于子带图像和视频编码器。研究的另一个主要领域已经是频谱估计。最新的频谱估计器,例如最大熵法,需要一种基于强迫优化的新的表述。这是因为它们的1D副本依赖于多项式的因数分解性质。一种有趣的情况是最大似然法,首先开发出来的是2D版本,然后才被采用于1D。
There are also mD algorithms that have no 1D counterparts, especially algorithms that perform inversion and computer imaging. One of these is the operation of recovering an mD distribution from a finite set of its projections, equivalently inverting a discretized Radon transform. This is the mathematical basis of computed tomography and positron emission tomography.
也有mD算法不存在1D副本,特别是进行反转和计算机图像的算法。其中之一是从它的投影的有限集合复原一个mD分布,等价地反转一个离散Radon变换。这是计算机层析成像和正电子层析成像的数学基础。
Another imaging method, developed first for geophysical applications, is Fourier integration. Finally, signal recovery methods unlike the 1D case are possible, The mD signals with finite support can be recovered from the amplitudes of their Fourier transforms or from threshold crossings [3.24].
另一种最初为地球物理应用开发的成像方法是傅立叶积分法。最后,与1D情况不同,mD情形下的信号恢复成为可能:具有有限支撑的mD信号能够从其傅立叶变换的幅值或者从过阈值点中恢复出来。
3.4.1 Pre and Postprocessing
In multimedia applications, the equipment used for capturing data, such as the camera, should be cheap, making it affordable for a large number of users. The quality of such equipment drops when compared to their more expensive and professional counterparts. It is mandatory to use a preprocessing step prior to coding in order to enhance the quality of the final pictures and to remove the noise that will affect the performance of compression algorithms. Solutions have been proposed in the field of image processing to enhance the quality of images for various applications [3.25, 3.26]. A more appropriate approach would be to take into account the characteristics of the coding scheme when designing such operators. In addition, pre- and postprocessing operators are extensively used in order to render the input or output images in a more appropriate format for the purpose of coding or display.
在多媒体应用中,用于采集数据的设备,例如摄像机,或许很便宜,很多人都能买得起。这样的设备与那些价格高的专业设备相比质量差。必须在编码之前进行处理,以提高最终图片的质量和去掉噪波,否则将影响压缩算法的性能。用于改善各种应用中图像质量的图像处理领域已经有了解决方案。适当的办法是在设计这样的处理器时考虑编码方案的特性。另外,为了在编码或显示时以比较适当的格式输入或输出图像,也广泛地使用预处理器和后处理器。
Mobile communications is an important class of applications in multimedia. Terminals in such applications are usually subject to different motions, such as tilting and jitter, translating into a global motion in the scene due to the motion of the camera. This component of the motion can be extracted by appropriate methods detecting the global motion in the scene and can be seen as a preprocessing stage. Results reported in the literature show an important improvement of the coding performance when a global motion estimation is used [3.27].
在多媒体中移动通信是一类重要应用。这类应用终端一般处于不同的运动中,例如倾斜和抖动,由于摄像机运动而转化过来的景物的全向运动。可以用适当的方法通过检测现场的全向运动提取这个运动分量, 并把它作为预处理步骤。文献报告显示,采用全向运动估计时编码性能得到重大改善。
It is normal to expect a certain degree of distortion of the decoded images for very low-bit-rate applications. However, an appropriate coding scheme introduces the distortions in areas that are less annoying to the users. An additional stage could be added as a postprocessing operator to further reduce the distortion due to compression. Solutions were proposed in order to reduce the blocking artifacts appearing at high compression ratios [3.28, 3.29, 3.30, 3.31, 3.32, 3.33]. The same types of approaches have been used in order to improve the quality of decoded signals in other coding schemes, reducing different kinds of artifacts, such as ringing, blurring and mosquito noise [3.34, 3.35].
一般认为,在很低比特率场合中解码图像会有一定程度的失真。然而,在某些情况下,为了减少用户的烦恼以适当的编码方案引入失真。可以附加一个步骤作为后处理器,以进一步减小压缩带来的失真。已经提出了解决高压缩比时出现阻塞问题的方法。同样的方法也已经在其它编码方案中用于改善解码信号的质量,减小各类噪声,例如振铃、斑点和哼声。
Recently, advances in postprocessing mechanisms have been studied to improve lip synchronization of head-and-shoulder video coding at a very low bit rate by using the knowledge of decoded audio in order to correct the positions of the lips of the speaker [3.36], Figure 3.2 shows an example of the block diagram of such a postprocessing operation.
最近,对改善在很低比特率时头肩像视频编码的唇同步问题的后处理机制的研究已经取得进展,这种机制运用解码音频的知识校正讲话者的唇位,图3.2显示了一例这类后处理过程的框图。

多媒体通信系统(3.3)

3.3 Signal-Processing Elements
Many classical signal-processing procedures have become deeply embedded in the multidimensional fields. A key driver is optimization for representation of multimedia components, as well as the associated storage and delivery requirements. The optimization procedures range from very simple to sophisticated. Some of the principal techniques are the following:
 Nonlinear analog (video and audio) mapping
 Quantization of the analog signal
 Statistical characterization
 Motion representation and models
 3D representations
 Color processing
许多经典的信号处理方法已经深深地植根于多维领域。一个关键的推动因素是对多媒体分量表示方式的优化,以及与之相关的存储和传输需求。优化方法从非常简单到十分复杂不等。一些主要技术如下所示:
 非线性模拟(视频和音频)映射
 模拟信号的量化
 统计描述
 运动表示和模型
 3D表示
 色彩处理
A nonlinear analog (video and audio) mapping procedure may be purely analog. Its intention may be the desire to enhance the delivery process. It could also be introduced to mask the limitations of various components of the overall multimedia chain. Typical constraints are introduced by bandwidth limitations and constrained dynamic range in the display terminal.
非线性模拟(视频和音频)映射过程可以是完全模拟的。其目的可以是期望提高传输性能。也可能是用于掩饰整个多媒体链中各个方面的缺陷。典型的约束是显示终端的带宽限制和动态范围的限制。
Quantization of the analog signal is fundamental to any digital representation that has originated in the analog world. The quantization process is an inherently lossy procedure and fundamentally noninvertible. This classical signal-processing element still remains the basic constraint in limiting performance, although not very exciting compared with other multimedia issues [3.5]. Quantization techniques comprise a whole field by themselves. The major relevant issues include uniform and nonuniform techniques and adaptive and nonadaptive procedures [3.6].
模拟信号的量化对任何源于模拟世界的数字表示来说都是基础性的。量化处理是一种固有的有损过程,从根本上说是不可逆的。尽管与其它多媒体课题相比它并不那么引人注目,这个经典的信号处理环节仍然是限制性能的基本约束[3.5]。量化技术本身就构成一个完整的领域,主要相关课题包括均匀和非均匀量化技术以及自适应和非自适应方法[3.6]。
Statistical concepts and applications are directly and indirectly strongly embedded in processing components associated with multimedia. This relevant field is part of classical signal processing, and we can only highlight the major categories. A spectral analysis is fundamental to the entire range of image models for filtering and algorithm design. The procedures are critical to both visual and audio data components [3.7, 3.8]. Statistical redundancy is the basic concept upon which the entire field of data compression is based. Mathematical extension of the concept leads to optimum transform for decorrelation. This in turn leads to the entire field of modem transform-coding technology [3.9]. Model-based representations, primarily for compression, are determined from assumed or derived statistical models. The classes of transform-coding algorithms are based on this technology [3.10]. The utility of Fourier transform and its discrete extensions such as Discrete Cosine Transform (DCT), wavelets and others are based on the principle that these transforms asymptotically approach the optimum transform, assuming a reason- able statistical behavior [3.11]. Visual and audio models are fundamental to the relevant multimedia representations, primarily compression procedures. These models are based on fundamental statistical representations of the elementary components, including their evaluation by the human observer [3.12, 3.13].
统计概念及其应用直接或间接地深入于与多媒体有关的处理。这一相关领域是经典的信号处理的一部分,我们只需突出它的主要范畴。对于滤波和算法设计的图像模型的整个范围,谱分析是基本原理。对于视频和音频两者的数据分量这个规程都是必不可少的。在整个基于数据压缩的全部领域中统计冗余都是基本概念。这个概念的数学扩展导出了去相关的最优变换。进而导出了整个调制解调器的变换编码技术。基于模型的表示法(压缩的基础)正是源于统计模型。上层变换编码算法就是基于这项技术。傅立叶变换以及它的离散扩展,例如离散余弦变换、小波变换等基于该原理的变换,这些变换逐渐接近于最优变换,它们的应用采用一种合理的统计行为。相应的多媒体表示法,基本压缩规程,视频和音频模型是基本原理。这些模型基于基本分量包括它们的主观评价的统计表示的基本原理。
The models are:
 Implementation of motion detection and associated compensation in subsequent image frames can significantly reduce the required bandwidth. Successful prediction of image segment locations in future frames reduces the required information update to the required motion vectors. Thus, under this condition, the associated update information is dramatically reduced.
 Combining the presence of motion in video segments with the limitations for human visual systems provides additional bandwidth-reduction potentials. Because the human vision deteriorates when observing moving areas, image blur associated with these regions becomes significantly less noticeable. Consequently, additional image compression can be introduced in segments that contain motion, with minimal noticeable effect.
这些模型是:
 在后继图像帧中实现运动检测和相应的补偿,能够显著减少所需带宽。若能成功预测图像块在后续帧中的位置,需要更新的信息就减少为所需的运动矢量。因而在这种条件下,相关的更新信息被大大减少。
 视频块中运动的存在,结合人类视觉系统的局限,提供了进一步减小带宽的可能性。由于人类视觉在观看运动区域时变得更糟,与这些区域相关的图像模糊就显著的变得不那么引人注意。因此,在包含运动的块中就能够以最小的可觉察效果进一步压缩图像。
Human vision is basically 3D. Efficient representation of a 3D signal is a major challenge of multimedia. The most common 3D techniques are based on 2D display techniques. The 3D scene is projected onto two dimensions in the rendering phase of the multimedia chain. The proper hierarchy of object elements and behavior maintains the 3D illusion. The relevant processes include shadowing consideration and preserving the proper hidden body behavior. The required processing resources are still significant. A substantial industry produces various processing components, such as chip sets and graphics boards, to develop solutions for many diverse applications including desktop computing. The associated technology is very effective in high-end applications. Virtual reality models using large screens are impressive even though the presentation remains 2D. In 3D representations, the stereo projection is the best known. The same 3D scene is recorded from two slightly different perspectives, essentially replicating our eyes. The two separate recordings are subsequently presented to the eyes separately. Unlike the early stereo film-based recordings, modern techniques are heavily dependent on digital processing, which corrects for camera-projection inaccuracies, resulting in significantly enhanced stereo display.
人类视觉基本上是三维的。3D信号的有效表示法是多媒体的一个主要课题。大多数3D技术基于2D显示技术。在多媒体的重现阶段,3D景物被投射为二维。对象元素和行为的恰当的层次保持着3D幻觉。有关的处理包括适当的遮蔽和保留身体形态的适当隐藏。所需的处理资源依然重要。制造业生产了各种处理器件,例如芯片和图像板卡,为许多不同的应用包括桌面计算机开发了解决方案。在高端应用中有关技术非常有效。尽管表达依然是2D的,使用大屏幕的虚拟现实模型仍给人留下深刻印象。在3D表示法中,最著名的是立体投影。它模拟我们的眼睛,以两个略微不同的透视点记录同一3D景物。然后这两个记录分别呈现给我们的眼睛。与早期立体电影不同,现代技术强烈地依赖数字处理校正镜头误差,显著地增强了立体显示的效果。
Projection techniques comprise an effective group to recreate multidimensionality from individual projections through the original object. Although this technology has been used very effectively in medical applications, its utility to multimedia applications is not likely to be useful in the near future. The primary limitations are complexity and lack of easy real-time implementation [3.14].
投影技术是通过对原初对象的各个投影的有效组合再现多维影像。尽管这项技术已经在医学上得到非常广泛的应用,但是在近期内在多媒体方面的应用还不会那么多。主要限制在于复杂和难以实时运行。
For efficient representation of color processing, modeling and communication applications, color plays a very important role. The correlation properties among color planes are used in image and video compression algorithms.
对于色彩处理、模型和通信应用的有效表示,色彩具有非常重要的作用。在图像和视频压缩算法中运用了彩色平面中的相关特性。

多媒体通信系统(3.2下)

Digital video Video is composed of a series of still-image frames and produces the illusion of movement by quickly displaying frames one after another. The Human Visual System (HVS) accepts anything more than 20 Frames Per Second (fps) as smooth motion. Television and video are usually distinguished. Television is often associated with the concept of broadcast or cable delivery of programs, whereas video allows more user interactivity, such as recording, editing and viewing at a user-selected time.
数字视频 视频是由一系列静止图像画面组成的,这些画面快速地一幅接一幅地显示就产生了运动的幻觉。人类视觉系统(HVS)把任一超过20帧每秒(fps)的运动画面都看作是流畅的。电视和视频通常是有区别的。电视常常与节目的广播、电缆传送等概念相联系,而视频更偏重于与使用者的互动,例如按使用者选择的时间录像、编辑以及观看等。
The biggest challenges posed by digital video are the massive volume of data involved and the need to meet the real-time constraints on retrieval, delivery and display. The solution entails the compromise in the presentation quality and video compression. As for the compromise in the presentation quality, instead of video with full frame, full fidelity and full motion, one may reduce the image size, use less bits to represent colors, or reduce the frame rate. To reduce the massive volume of digital video data, compression techniques with high compression ratios are required. In addition to throwing away the spatial and color similarities of individual images, the temporal redundancies between adjacent video frames are eliminated.
数字视频带来的最大挑战是巨量的数据以及检索、传输和显示的实时性要求。解决办法需要在表达质量上折衷并进行视频压缩。在表达质量上的折衷,是指不再追求全帧尺寸、全保真度和全运动的视频,而是减小图像尺寸、用较少的比特表示色彩,或者降低帧速率。为了削减数字视频的巨量数据,需要高压缩比的压缩技术:除了丢弃单幅图像内部空间和色彩上的相似性之外,还要消除相邻视频帧之间的时间冗余。
Digital audio Sound waves generate air pressure oscillations that stimulate the human auditory system. The human ear is an example of a transducer. It transforms sound waves to signals recognizable by brain neurons. As with other audio transducers, two important considerations are frequency response and dynamic range. Frequency response refers to the range of frequencies that a medium can reproduce accurately. The frequency range of human hearing is between 20 Hz and 20 KHz. Dynamic range describes the spectrum of the softest to the loudest sound-amplitude levels that a medium can reproduce. Human hearing can accommodate a dynamic range greater than a factor of millions. Sound amplitudes are perceived in logarithmic ratio rather than linearly. Humans perceive sounds across the entire range of 120 dB, the upper limit of which will be painful to humans. Sound waves are characterized in terms of frequency (Hz), amplitude (dB) and phase (degree), whereas frequencies and amplitudes are perceived as pitch and loudness, respectively. Pure tone is a sine wave. Sound waves are additive. In general, sounds are represented by a sum of sine waves. Phase refers to the relative delay between two waveforms. Distortion can result from phase shifts.
数字音频 声波产生气压振动刺激人的听觉系统。人耳是一种典型的传感器。它把声波变换为大脑神经元可辨识的信号。就像其它音频传感器一样,必须考虑频率响应和动态范围这两个重要因素。频率响应是指媒介能够正确重现的频率范围。人类听觉的频率范围是在20 Hz与20 KHz之间。动态范围描述媒介能够重现的从最轻微到最响的声音振幅级。人类听觉能够适应的动态范围超过数百万倍。对声音振幅的感知是对数性的而不是线性的。人类感知声音的范围达到120 dB,上限是使人感到疼痛。声波可以用术语频率(Hz)、振幅(dB)和相位(degree)来表征,频率和振幅分别被感知为音调和响度。纯音是正弦波。声波是合成的。一般的,声音表现为多个正弦波的叠加。相位是指两个波形之间的相对延迟。相位变化将造成失真。
Digital audio systems are designed to make use of the range of human hearing. The frequency response of a digital audio system is determined by the sampling rate, which in turn is determined by the Nyquist theorem.
数字音频系统是按照人类听觉范围设计的。数字音频系统的频率响应取决于抽样速率,而抽样速率是由奈奎斯特定理确定的。
Example 3.1 The sampling rate of Compact Disk (CD) quality audio is 44.1 KHz. Thus, it can accommodate the highest frequency of human hearing, namely, 20 KHz. Telephone quality sound adopts an 8 KHz sampling rate. This can accommodate the most sensitive frequency of human hearing, up to 4 KHz.
例3.1 CD质量音频的抽样频率是44.1 KHz。因此,它能够满足人类听觉的最高频率,即20 KHz。电话质量的声音采用8 KHz的抽样频率,它能够满足人类听觉最敏感的4 KHz以下的频率。
Digital audio aliasing is introduced when one attempts to record frequencies that exceed half the sampling rate. A solution is to use a low-pass filter to eliminate frequencies higher than the Nyquist rate. The quantization interval, or the difference in value between two adjacent quantization levels, is a function of the number of bits per sample and determines the dynamic range. One bit yields 6 dB of dynamic range. For example, 16 bits audio contributes 96 dB of the dynamic range found in CD-grade audio, which is nearly the dynamic range of human hearing. The quantized samples can be encoded in various formats, such as Pulse Code Modulation (PCM), to be stored or transmitted. Quantization noise occurs when the bit number is too small. Dithering, which adds white noise to the input analog signals, may be used to reduce quantization noise. In addition, a low-pass filter can be employed prior to the digital-to-analog (D/A) stage to smooth the stairstep effect resulting from the combination of a low sampling rate and quantization. Figure 3.1 summarizes the basic steps for processing digital audio signals [3.4].
在录音频率超过抽样频率的一半时,会造成数字音频混叠失真。解决办法是用低通滤波器滤除高于奈奎斯特频率的成分。量化阶距,即两个相邻的量化电平之间的差值,是每个抽样的比特数的函数,它决定了动态范围。一个比特支持6 dB的动态范围。例如,用于CD级音频的16比特音频提供96 dB的动态范围,已经几乎是人类听觉的整个动态范围。量化样本可以用诸如脉冲编码调制(PCM)等各种格式编码、存储或传输。在比特数过少时会产生量化噪声。在输入模拟信号中加入白噪声,即抖动,可以降低量化噪声。另外,在数模转换(D/A)级前用低通滤波器可以平滑因低的抽样频率和量化造成的阶梯效应。图3.1概括了处理数字音频信号的基本步骤[3.4]。
The quality of digital audio is characterized by the sampling rate, the quantization interval and the number of channels. The higher the sampling rate, the more bits per sample and the more channels means the higher the quality of the digital audio and the higher the storage and bandwidth requirements.
数字音频的质量可以用抽样频率、量化电平和声道数来表征。抽样频率越高、每个抽样的比特数越多、声道越多,数字音频的质量越高、所要求的存储和带宽越高。
Example 3.2 A 44.1 KHz sampling rate, 16-bit quantization and stereo audio reception produce CD-quality audio, but require a bandwidth of 44,100x16x2=1.4 Mb/s. Telephone-quality audio, with a sampling rate of 8 KHz, 8-bit quantization and mono audio reception, needs only a data throughput of 8,000x8x1=64 Kb/s. Digital audio compression or a compromise in quality can be applied to reduce the file size.
例3.2 抽样频率为44.1 KHz、16比特量化的立体声接收,产生CD质量的声音,但需要44,100x16x2=1.4 Mb/s的带宽;电话质量的声音,抽样频率为8 KHz、8比特量化、单声道接收,需要的数据流量仅为8,000x8x1=64 Kb/s。数字音频压缩以及在质量上的折衷可以用来减小文件大小。
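下面的 Python 片段按上述公式计算未压缩数字音频的码率,以及由量化比特数决定的动态范围(每比特约6 dB),可用来验证例3.2中的数值。这只是示意,函数名为本文假设。

def audio_bitrate(sample_rate_hz, bits_per_sample, channels):
    """未压缩PCM音频的码率(bit/s)= 抽样率 x 量化比特数 x 声道数。"""
    return sample_rate_hz * bits_per_sample * channels

def dynamic_range_db(bits_per_sample):
    """量化决定的动态范围:每比特约贡献6 dB。"""
    return 6.02 * bits_per_sample

if __name__ == "__main__":
    print(audio_bitrate(44_100, 16, 2))      # 1411200 bit/s,约1.4 Mb/s(CD质量)
    print(audio_bitrate(8_000, 8, 1))        # 64000 bit/s,即64 Kb/s(电话质量)
    print(round(dynamic_range_db(16), 1))    # 约96 dB,与CD级音频一致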
Integrated media systems will only achieve their potential if they are truly integrated in three key ways: integration of content, integration with human users and integration with other media systems. First, such systems must successfully combine digital video and audio, text, animation and graphics and knowledge about such information units and their inter-relationships in real time. Second, they must integrate with the individual user by cooperatively interactive multi- dimensional dynamic interfaces. Third, integrated media systems must connect with other such systems and content-addressable multimedia databases, both logically (information sharing) and physically (information networking, compression and delivery).
集成媒体系统仅能实现它们的潜能,如果它们真的在三个关键方式上集成:内容集成、使用者集成以及与其它媒体系统集成。首先,这样的系统必须成功地集成数字视频和音频、文本、动画以及图形,还有关于这些信息单元以及它们之间的关系的实时的消息。其次,它们必须通过协同的多维动态接口与各个使用者集成。第三,集成媒体系统必须与其它类似系统以及目录可寻址的多媒体数据库,在逻辑上(信息共享)和物理上(信息联网、压缩和传输)连接。

多媒体通信系统(3.2上)

3.2 Digital Media
Digital media take advantage of advances in computer-processing techniques and inherit their strength from digital signals. The following distinguishing features make them superior to the analog media:
 Robustness–The quality of digital media will not degrade as copies are made. They are most stable and more immune to the noises and errors that occur during processing and transmission. Analog signals suffer from signal-path attenuation and generation loss (as copies are made) and are influenced by the characteristics of the medium itself.
 Seamless integration–This involves the integration of different media through digital storage and processing and transmission technologies, regardless of the particular media properties. Therefore, digital media eliminate device dependency in an integrated environment and allow easy data composition of nonlinear editing.
 Reusability and interchangeability–With the development of standards for the common exchange formats, digital media have greater potential to be reused and shared by multiple users.
 Ease of distributed potential–Thousands of copies may be distributed electronically by a simple command.
数字媒体发挥与计算机处理技术的亲缘优势,并继承了数字信号的长处。下述特征是它们比模拟媒体的优越之处:
鲁棒性——数字媒体的质量不会因复制而劣化。对于处理和传输过程中产生的噪声和差错,它们具有最强的稳定性和更不受影响。模拟信号由于信道衰落和复制损耗而受损伤,而且受媒介特性的影响。
无缝集成——这包括不同媒体通过数字存储、处理和传输技术的集成,与各种媒体的性质无关。因此,数字媒体解除了在集成环境里的器件依赖性,轻松实现非线性编辑的数据合成。
可复用性和互换性——随着公共交换格式标准的开发,数字媒体具有更大的让多重用户再生和共享的能力。
易于分配的能力——一条简单的指令就能分配数千拷贝。
Digital image Digital images are captured directly by a digital camera or indirectly by scanning a photograph with a scanner. They are displayed on the screen or printed.
数字图像 数字图像是用数字摄像机直接捕获的,或者使用扫描仪扫描照片间接捕获。它们显示在屏幕上或者打印出来。
Digital images are composed of a collection of pixels that are arranged as a 2D matrix. This 2D or spatial representation is called the image resolution. Each pixel consists of three components: red (R), green (G) and blue (B). On a screen, each component of a pixel corresponds to a phosphor. A phosphor glows when excited by an electron gun. Various combinations of different RGB intensities produce different colors. The number of bits to represent a pixel is called the color depth, which decides the actual number of colors available to represent a pixel. Color depth is in turn determined by the size of the video buffer in the display circuitry.
数字图像是排列为2D矩阵的像素的集合。这种2D或空间表示法叫做图像分辨率。每个像素由三个分量组成:红(R)、绿(G)、蓝(B)。在屏幕上,像素的每一分量对应于一种荧光粉。荧光粉在受到电子枪激发时发光。各种RGB亮度的不同组合产生不同的色彩。代表一个像素的比特数叫做色深度,它决定了可用于表示一个像素的实际色彩数量。反过来,色深度取决于显示电路中视频缓冲器的大小。
The resolution and color depth determine the presentation quality and the size of image storage. The more pixels and the more colors there are means the better the quality and the larger the volume. To reduce the storage requirement, three different approaches can be used:
 Index color–This approach reduces the storage size by using a limited number of bits with a color lookup table (or color palette) to represent a pixel. Dithering can be applied to create additional colors by blending colors from the palette. This is a technique taking advantage of the fact that the human brain perceives the media color when two different colors are adjacent to one another. With palette optimization and color dithering, the range of the overall color available is still considerable, and the storage is reduced.
 Color subsampling–Humans perceive color as brightness, hue and saturation rather than as RGB components. Human vision is more sensitive to variation in the luminance (or brightness) than in the chrominance (or color difference). To take advantage of such differences in the human eye, light can be separated into the luminance and chrominance components instead of the RGB components. The color subsampling approach shrinks the file size by down-sampling the chrominance components, that is, using less bits to represent the chrominance components while having the luminance component unchanged.
 Spatial reduction–This approach, known as data compression, reduces the size by throwing away the spatial redundancy within the images.
分辨率和色深度决定了表达质量和图像存储量的大小。像素越多、色彩越多,意味着质量越好,数据量也越大。要减小所需的存储量,可采用三种途径:
 色彩索引——这种减小存储量的办法是用色彩检索表(或者调色板)来表示像素以限制比特数。抖动可以用于通过在调色板上混合色彩来创造更多的色彩。这种技术利用了这样一种事实:当两种色彩相距很近时人脑感知的是中间色彩。通过优化调色板和色彩抖动,可用色彩总数依然可观,而存储减小了。
 色彩二次抽样——人类是以亮度、色调和饱和度而不是RGB分量来感知色彩的,而且人类视觉对亮度变化比对色度变化更敏感。利用人眼的这种差异,可以把光分解为亮度和色度分量以代替RGB分量。色彩二次抽样技术对色度分量降抽样,即用较少的比特表示色度分量而保持亮度分量不变,从而缩小文件大小(见本列表之后的示意代码)。
 空间压缩——这种叫做数据压缩,通过丢弃图像中的空间冗余来减小文件。
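作为示意,下面的 Python/NumPy 片段把一幅RGB图像转换到亮度/色度(YCbCr)表示,再按常见的4:2:0方式对色度分量在水平和垂直方向各降抽样一半,并粗略比较前后的数据量。这只是原理草图,转换系数取ITU-R BT.601的常用近似值,函数名为本文假设。

import numpy as np

def rgb_to_ycbcr(img):
    """按BT.601系数把RGB图像(float,0~255)转换为Y、Cb、Cr三个分量。"""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.169 * r - 0.331 * g + 0.500 * b + 128
    cr =  0.500 * r - 0.419 * g - 0.081 * b + 128
    return y, cb, cr

def subsample_420(chroma):
    """4:2:0色度降抽样:水平、垂直方向各取一半(这里简单地对2x2块取平均)。"""
    h, w = chroma.shape
    return chroma.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

if __name__ == "__main__":
    img = np.random.randint(0, 256, size=(16, 16, 3)).astype(np.float64)
    y, cb, cr = rgb_to_ycbcr(img)
    cb2, cr2 = subsample_420(cb), subsample_420(cr)
    full = 3 * y.size                       # 4:4:4的样本总数
    sub  = y.size + cb2.size + cr2.size     # 4:2:0的样本总数
    print(sub / full)                       # 0.5,即数据量减半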

多媒体通信系统(第3章3.1)

Multimedia Communication Systems
Chapter 3 Multimedia Processing in Communications
Chapter Overview
Multimedia has at its very core the field of signal-processing technology. With the exploding growth of the Internet, the field of multimedia processing in communications is becoming more and more exciting. Although multimedia leverages numerous disciplines, signal processing is the most relevant. Some of the basic concepts, such as spectral analysis, sampling theory and partial differential equations, have become the fundamental building blocks for numerous applications and, subsequently, have been applied in such diverse areas as transform coding, display technology and neural networks. The diverse signal-processing algorithms, concepts and applications are interconnected and, in numerous instances, appear in various reincarnated forms.
信号处理技术是多媒体的核心基础领域。随着互联网的爆炸性增长,通信中的多媒体处理领域越来越令人兴奋。虽然多媒体领域学科众多,信号处理却是最关键和最具实质性的。一些基本概念,例如频谱分析、抽样理论以及偏微分方程,已经成为许多应用的理论基础,而且在诸如变换编码、显示技术、神经网络等众多领域得到了应用。各种信号处理算法、概念和应用互相关联,而且在很多情况下呈现出不同的形式。
This chapter is organized as follows. First, we present and analyze digital media and signal processing elements. To address the challenges of multimedia signal processing while providing higher interactivity levels with the media and increased capabilities to access a wide range of applications, multimedia signal-processing methods must allow efficient access to processing and retrieval of multimedia content. Then, we review audio and video coding. During the last decade new digital audio and video applications have emerged for network, wireless, and multimedia computing systems and face such constraints as reduced channel bandwidth, limited storage capacity and low cost. New applications have created a demand for high-quality digital audio and video delivery. In response to this need, considerable research has been devoted to the development of algorithms for perceptually transparent coding of high-fidelity multimedia.
本章安排如下。首先,介绍和分析数字媒体和信号处理原理。与媒体更高的互动水平和提高接入各种应用的能力,这些要求对多媒体信号处理提出了挑战,多媒体信号处理方法必须提高访问效率以处理和恢复多媒体内容。然后,我们回顾音频和视频编码。近十年来,尽管面临诸如减小通道带宽、限制存储容量以及降低价格等等约束,新的数字音频和视频应用仍然在网络、无线和多媒体计算机系统中脱颖而出。 已经出现的新应用要求高质量的数字音频和视频传输。为了满足这些需要,已经有相当多的研究投入到高保真度多媒体的感知透明编码(perceptually transparent coding)的开发中来。
Next, we describe a general framework for image copyright protection through digital watermarking. In particular, we present the main features of an efficient watermarking scheme and discuss robustness issues. The watermarking technique that has been proposed is to hide secret information in the signal so as to discourage unauthorized copying or to attest the origin of the media. Data embedding and watermarking algorithms embed text, binary streams, audio, image or video in a host audio, image or video signal. The embedded data is perceptually inaudible or invisible to maintain the quality of the source data.
接着,我们叙述采用数字水印进行图像版权保护的一般框架,特别介绍一种高效水印方案的要点,并讨论鲁棒性问题。所提出的水印技术在信号中隐藏秘密信息,以阻止未经授权的拷贝或证明媒体的来源。数据嵌入和水印算法把文本、二进制流、音频、图像或视频嵌入到宿主音频、图像或视频信号之中;为保持源数据的质量,嵌入的数据在听觉或视觉上是不可察觉的。
We also review the key attributes of neural processing essential to intelligent multimedia processing. The objective is to show why NNs are a core technology for efficient representation for audio-visual information. Also, we will demonstrate how the adaptive NN technology presents a unified solution to a broad spectrum of multimedia applications (image visualization, tracking of moving objects, subject-based retrieval, face-based indexing and browsing and so forth).
我们还回顾对智能多媒体处理至关重要的神经处理的关键特性,目的是说明为什么神经网络(NN)是高效表示视听信息的一项核心技术。我们还将展示自适应NN技术如何为范围广泛的多媒体应用(图像可视化、运动对象跟踪、基于主题的检索、基于人脸的索引与浏览等等)提供一种统一的解决方案。
Finally, this chapter concludes with a discussion of recent large-scale integration programmable processors designed for multimedia processing, such as real-time compression and decompression of audio and video as well as the next generation of computer graphics. Because the target of these processors is to handle audio and video in real time, the promising capability must be increased compared to that of conventional microprocessors, which were designed to handle mainly texts, figures, tables and photographs. To clarify the advantages of a high-speed multimedia processing capability, we define these chips as multimedia processors. Recent general-purpose microprocessors for workstations and personal computers use special built-in hardware for multimedia processing.
最后,本章以讨论近年来为多媒体处理(例如音频和视频的实时压缩和解压缩以及下一代计算机图形)而设计的大规模集成可编程处理器作为结束。由于这类处理器的目标是实时处理音频和视频,与主要为处理文本、图形、表格和照片而设计的传统微处理器相比,其处理能力必须大幅提升。为了突出高速多媒体处理能力这一优势,我们把这类芯片定义为多媒体处理器。最新的工作站和个人计算机通用微处理器也为多媒体处理加入了专门的内置硬件。
3.1 Introduction
Multimedia signal processing is more than simply putting together text, audio, images and video. It is the integration and interaction among these different media that creates new systems and new research challenges and opportunities. Although multimedia leverages numerous disciplines, signal processing is the most relevant. Some of the basic concepts, such as spectral analysis, sampling theory and partial differential equation theory, have become the fundamental building blocks for numerous applications and, subsequently, have been reinvented in such diverse areas as transform coding, display technology and NNs. The diverse signal-processing algorithms, concepts and applications are interconnected.
多媒体信号处理并不是简单地把文本、音频、图像和视频放到一起。它是把这些不同的媒体集成和融合为一种新的系统,一种新的挑战和机遇。虽然多媒体领域学科众多,信号处理却是最具实质性的。一些基本概念,例如频谱分析、抽样理论以及偏微分方程理论,已经成为许多应用的理论基础,然后在诸如变换编码、显示技术、神经网络等众多领域得到了重新确立。各种信号处理算法、概念和应用是相互关联的。
The term “multimedia” represents many different concepts. It includes basic elementary components, such as different audio types. These basic components may originate from many diverse sources (individuals or synthetic). For audio, the synthetics may be traditional musical presentation. One may also argue that multimedia is based on the extended visual experience, which includes representation of the real world, as well as its model, through a synthetic representation.
“多媒体”一词代表很多不同的概念。它包括一些基本成分,例如不同的音频类型。这些基本成分可能来自于许多不同的源(单个的或合成的)。对于音频,合成的源或许是传统的音乐演出。有人可能会争辩说多媒体是基于扩展的视觉经验,它包括真实世界的表象,以及通过人造表象对它的模仿。
The “multimedia” technologies have dramatically changed and will keep changing. However, it is erroneous to favor advances simply because the final product is based on better technology.
“多媒体”技术已经发生了巨大的变化,而且仍将继续变化。然而,仅仅因为最终产品基于更好的技术就偏爱这种进步,是错误的。
Multimedia consists of {multimedia data} + { set of instructions } . Multimedia data is informally considered as the collection of the three multimedia data, that is, multisource, multitype and multiformat data [3.1]. The interactions among the multimedia components consist of complex relationships without which multimedia could be a simple set of visual, audio and often data [3.2].
多媒体由{多媒体数据}+{指令集}组成。一种非正式的说法是,多媒体数据是多源、多类型和多格式这三种意义上的多媒体数据的集合[3.1]。多媒体各成分之间的交互包含复杂的关系,如果没有这些关系,多媒体就只是视觉、音频以及一般数据的简单集合[3.2]。
We define multimedia signal processing as the representation, interpretation, encoding and decoding of multimedia data using signal-processing tools. The goal of multimedia signal processing is effective and efficient access, manipulation, exchange and storage of multimedia content for various multimedia applications [3.3].
我们把多媒体信号处理定义为使用信号处理工具对多媒体数据的表示、解释、编码和解码。多媒体信号处理的目的是使各种多媒体应用有效地访问、交换和存储多媒体内容[3.3]。
The Technical Committee (TC) on MMSP is the youngest TC in the IEEE Signal Processing (SP) society. It took them a long time to raise some questions like the following:
 What is multimedia signal processing all about?
 What impact has signal processing brought to multimedia technologies?
 Where are the multimedia technologies now?
MMSP技术委员会(TC)是IEEE信号处理(SP)学会中最年轻的技术委员会。他们花了很长时间才提出下面这样一些问题:
 关于多媒体信号处理的一切是什么?
 信号处理对多媒体技术带来什么冲击?
 现在多媒体技术在什么地方?
Multimedia signal-processing technologies will play major roles in the multimedia-network age. Researchers today working in this area have the privilege of selecting the future direction of MMSP technologies, so what they are doing will deeply influence our future society.
多媒体信号处理技术在多媒体网络时代将扮演主要角色。今天工作在这个领域的研究人员拥有选择MMSP技术未来方向的特权,因此他们现在所做的工作将深刻影响我们未来的社会。