Multimedia Communication Systems (3.4.2)

3.4.2 Speech, Audio and Acoustic Processing for Multimedia
The primary advances in speech and audio signal processing that contributed to multimedia applications are in the areas of speech and audio signal compression, speech synthesis, acoustic processing, echo control and network echo cancellation.
Figure 3.2 Block diagram for audio-assisted head and shoulder video [3.36]. ©1998 IEEE.
Speech and audio signal compression Signal compression techniques aim at efficient digital representation and reconstruction of speech and audio signals for storage and playback as well as transmission in telephony and networking.
Signal-analysis techniques such as Linear Predictive Coding (LPC) [3.37], all-pole autoregressive modeling [3.38] and Fourier analysis [3.39] have played a central role in signal representation. For compression, vector quantization (VQ) [3.40, 3.41] marks a major advance. These techniques are built upon rigorous mathematical frameworks that have become part of the foundations of digital signal processing. Incorporating knowledge and models of the psychophysics of hearing has proven beneficial for speech and audio processing. Techniques such as noise shaping [3.42] and explicit use of auditory masking in the perceptual audio coder [3.43] have been found very useful. Today, excellent speech quality can be obtained at less than 8 kb/s, which forms the basis for cellular as well as Internet telephony. The fundamental structure of the Code-Excited Linear Prediction (CELP) coder is ubiquitous in supporting speech coding at 4 to 16 kb/s, encompassing such standards as G.728 [3.44], G.729 [3.45], G.723.1, IS-54 [3.46], IS-136 [3.47], GSM [3.48] and FS-1016 [3.49]. CD or near-CD-quality stereo audio can be achieved at 64 to 128 kb/s, roughly one twelfth of the original CD rate or less, and is ready for such applications as Internet audio (streaming and multicasting) and digital radio (digital audio broadcast). Advances in audio-coding standards are supported in MPEG activities.
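The relationship between the coded audio rates and the CD rate can be checked with a short calculation. The sketch below assumes the standard CD format (44.1 kHz sampling, 16 bits per sample, two channels), which the text does not state explicitly. The names `pcm_bit_rate` and `compression_ratio` are illustrative only.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels):
    """Bit rate in bits per second of uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample * channels


# Assumed CD parameters (not given in the text): 44.1 kHz, 16 bit, stereo.
CD_RATE_BPS = pcm_bit_rate(44_100, 16, 2)


def compression_ratio(coded_rate_bps):
    """How many times smaller the coded rate is than the assumed CD rate."""
    return CD_RATE_BPS / coded_rate_bps


if __name__ == "__main__":
    for coded in (64_000, 128_000):
        print(f"{coded // 1000} kb/s -> ratio {compression_ratio(coded):.1f}")
```

Under these assumptions the CD rate is 1,411,200 b/s, so 128 kb/s is about one eleventh of it and 64 kb/s about one twenty-second.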
Speech synthesis The area of speech synthesis includes generation of speech from unlimited text, voice conversion and modification of speech attributes such as time scaling and articulatory mimicry [3.50]. Text-to-speech conversion takes text as input and generates human-like speech as output [3.51]. Key problems in this area include conversion of text into a sequence of speech inputs (in terms of phonemes, dyads or syllables), generation of the associated prosodic structure and intonation, and methods to concatenate and reconstruct the sound waveform. Voice conversion refers to the technique of changing one person's voice to another, from person A to person B or from male to female and vice versa. It is also useful to be able to change the time scale of a signal (to speed up or slow down the speech signal, ideally without changing its pitch) or to change the mode of the speech (making it sound happy or sad) [3.52]. Many of these signal-processing techniques have appeared in animation and computer graphics applications.
Acoustic processing and echo control Sound pickup and playback is an important area of multimedia processing. In sound recording, interference such as ambient noise and reverberation degrades the quality. The idea of acoustic signal processing and echo control is to allow straightforward high-quality sound pickup and playback in applications such as duplex devices like speakerphones, sound source-tracking apparatus like microphone arrays, teleconferencing systems with stereo input and output, hands-free cellular phones and home theatre with 3D sound.
Signal processing for acoustic echo control includes modeling of reverberation, design of dereverberation algorithms, echo suppression, double-talk detection and adaptive acoustic echo cancellation. The last of these is still a challenging problem in stereo full-duplex communication environments [3.53].
Example 3.3 For typical environments, the system modeling time for reverberation is of the order of 100 ms. At a sampling rate of 16 kHz, this translates into an echo-canceling filter of 1600 taps, which requires seconds to converge.
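The tap count in Example 3.3 follows from multiplying the modeled echo duration by the sampling rate. A minimal sketch of that arithmetic, with the helper name `fir_taps` chosen here purely for illustration:

```python
def fir_taps(duration_ms, sample_rate_hz):
    """Number of FIR taps needed to span duration_ms at sample_rate_hz."""
    return duration_ms * sample_rate_hz // 1000


# 100 ms of reverberation sampled at 16 kHz, as in Example 3.3.
taps = fir_taps(100, 16_000)
print(taps)  # 1600
```

Because an adaptive filter has to adjust every one of these coefficients, longer filters generally take longer to converge, which is consistent with the convergence time of seconds quoted in the example.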
For sound pickup, acoustic processing aims at the design of transducers or transducer arrays to achieve a desirable directionality (beam steering and width control) as well as noise resistance. Understanding of near- and far-field acoustics is important in achieving the required response in specific applications [3.54]. Various 1D and 2D microphone arrays have been demonstrated in teleconferencing and auditorium applications with good results [3.55].
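Beam steering with a microphone array can be illustrated by the delays that a simple delay-and-sum beamformer would apply. The sketch below is not taken from the text. It assumes a uniform linear (1D) array, a far-field plane wave, a sound speed of 343 m/s, and an angle measured from broadside toward the higher-numbered microphones. The function name `steering_delays` is hypothetical.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees C


def steering_delays(num_mics, spacing_m, angle_deg):
    """Delays in seconds to apply to each microphone of a uniform linear
    array so that a far-field plane wave from angle_deg adds coherently.

    angle_deg is measured from broadside (0 degrees), positive toward the
    higher-numbered microphones, which then hear the wave first and
    therefore need the largest compensating delay.
    """
    step = spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND_M_S
    delays = [m * step for m in range(num_mics)]
    offset = min(delays)
    # Shift so that every delay is non-negative (causal).
    return [d - offset for d in delays]


if __name__ == "__main__":
    # Four microphones, 5 cm apart, beam steered 30 degrees off broadside.
    for m, d in enumerate(steering_delays(4, 0.05, 30.0)):
        print(f"mic {m}: {d * 1e6:.1f} microseconds")
```

Summing the delayed microphone signals reinforces sound from the steered direction relative to other directions. Near-field sources would need a different (spherical-wave) delay model, which this sketch does not cover.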
Network echo cancellation In telephony, both near-end and far-end echoes exist due to the hybrid coil that is necessary for two-wire to four-wire conversion. Network echo can be so severe that it hampers telephone conversation. Network echo cancellers were invented in the late 1960s to correct the problem, based on the Least Mean Squares (LMS) adaptive echo cancellation algorithm [3.56]. The network echo delay is of the order of 16 ms, typically requiring a filter with 128 taps at a sampling rate of 8 kHz.
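The LMS approach can be sketched as an adaptive FIR filter that estimates the echo from the far-end signal and subtracts it from the returned signal. The 128 taps correspond to 16 ms at 8 kHz. Everything else below is an assumption made for illustration and is not from the text: the simulated echo path, the white-noise far-end signal, the absence of near-end speech, the step size `mu`, and the function names.

```python
import random
from collections import deque


def lms_echo_canceller(far_end, returned, num_taps=128, mu=0.002):
    """Adapt an FIR echo estimate with the LMS rule.

    Returns (residual, weights), where residual is the returned signal
    after the estimated echo has been subtracted.
    """
    weights = [0.0] * num_taps
    # Most recent far-end sample first; the oldest sample drops off the right.
    history = deque([0.0] * num_taps, maxlen=num_taps)
    residual = []
    for x, d in zip(far_end, returned):
        history.appendleft(x)
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate
        step = mu * error
        # LMS update: move each weight along error times its input sample.
        weights = [w + step * h for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights


def simulate(num_samples=6000, num_taps=128, seed=0):
    """Run the canceller against a made-up echo path (illustrative only)."""
    rng = random.Random(seed)
    # Hypothetical echo path: random impulse response with decaying envelope.
    echo_path = [rng.uniform(-1.0, 1.0) * 0.9 ** k for k in range(num_taps)]
    far_end = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    returned = [
        sum(echo_path[k] * far_end[n - k] for k in range(min(num_taps, n + 1)))
        for n in range(num_samples)
    ]
    residual, weights = lms_echo_canceller(far_end, returned, num_taps)
    return returned, residual, echo_path, weights


if __name__ == "__main__":
    returned, residual, _, _ = simulate()
    echo_energy = sum(v * v for v in returned[-500:])
    left_over = sum(v * v for v in residual[-500:])
    print(f"residual/echo energy over last 500 samples: {left_over / echo_energy:.2e}")
```

The step size is kept well below the stability limit for a unit-variance input and 128 taps. A practical canceller would also need double-talk handling and would often normalize the step by the input power, neither of which is shown here.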