TOWARDS A REAL-TIME IMPLEMENTATION OF LOUDNESS ENHANCEMENT
ALGORITHMS ON A MOTOROLA DSP 56600
ADNAN H. SABUWALA
A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
UNIVERSITY OF FLORIDA
I would like to thank a myriad of persons without whose help this thesis would not have been
possible. Of all these, I am immensely indebted to my professor and advisor, Dr. Harris, without
whose help I would not be writing this thesis. He is an epitome of friendliness and generosity.
He has been not only a mentor, but also has offered me a helping hand during the course of
my graduate studies. Words are not enough to describe his emotional, financial and technical
support and I would like to express my sincere gratitude towards him for the same.
I would like to express my sincere thanks to Dr. Principe and Dr. Rangarajan for agreeing to
be on my thesis committee and providing me with helpful hints at every stage of my thesis.
I would also like to thank Mark, Bill and Kaustubh for their valuable hints. I would also
like to thank Marc for providing me with a valuable research topic and for his thoughts and
suggestions on the same.
And finally, I would like to thank my parents, Hatim and Fatema Sabuwala, for their unpar-
alleled love and affection and for the belief they have shown in me.
TABLE OF CONTENTS
ABSTRACT . . . . . .
1 INTRODUCTION . .......
1.1 What Is a DSP? .............
1.2 Inside a Digital Cell Phone .......
1.3 Loudness Enhancement Algorithms .
1.3.1 Critical Band Concept .....
1.3.2 Warped Filter Implementation .
1.3.3 Vowels . . . .
1.4 Chapter Organization and Structure .
2 DSP ARCHITECTURAL DETAILS .
Overview . . . . . .
Central Architecture . .........
Data Arithmetic Logic Unit . .....
2.3.1 Data ALU architecture . ...
2.3.2 Data ALU Registers . .....
2.3.3 MAC Unit . .........
2.3.4 Data ALU Accumulator Registers .
2.3.5 Accumulator Shifter . .....
2.3.6 Bit Field Unit . ........
2.3.7 Data Shifter/Limiter . .....
2.3.8 Data ALU Arithmetic . ....
Address Generation Unit . ......
Program Control Unit . ........
Program Patch Logic . ........
PLL and Clock Oscillator . ......
Expansion Port (Port A) . .....
JTAG Test Access Port and On-Chip Emulator
On-Chip Memory . ..
Peripherals . ......
Summary . .......
3 BUILDING BLOCKS AND IMPLEMENTATION ISSUES .. ...
3.1 Basic Block Diagram .......................
3.1.1 Introduction to Linear Prediction .. .........
3.1.2 Bandwidth Expansion .. ..............
3.2 Autocorrelation ................... ........
3.3 Levinson-Durbin Recursion ................... .
3.4 FIR and IIR Filters .. ....................
3.5 Scaling FIR Coefficients .. .................
3.6 Warped Filter Implementation .. ...............
4 LMS: THE SOLUTION TO IMPLEMENTATION ISSUES .. ...
4.1 The Solution . . . . . . .
4.2 Least Mean Squares (LMS) Algorithm ..............
4.3 Linear Prediction Using LMS .. ..............
4.4 Experimental Results .. ...................
5 CONCLUSIONS AND FUTURE WORK .. ........
A ASSEMBLY CODE FOR LEVINSON-DURBIN .. ......
B ASSEMBLY CODE FOR IIR AND FIR FILTERS .. ......
C ASSEMBLY CODE FOR LMS ALGORITHM .. .......
D ASSEMBLY CODE FOR AUTOCORRELATION .. ......
E ASSEMBLY CODE FOR MODIFIED SIGNED LMS ALGORITHM
REFERENCES . . . . . . .
BIOGRAPHICAL SKETCH .. . .............
Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science
TOWARDS A REAL-TIME IMPLEMENTATION OF LOUDNESS ENHANCEMENT
ALGORITHMS ON A MOTOROLA DSP 56600
Adnan H. Sabuwala
Chairman: John G. Harris
Major Department: Electrical and Computer Engineering
Most cellular phone companies with audio speaker capabilities focus on reducing
current drain to extend battery life. None of these companies concentrates on modifying the speech
signal itself to make it sound louder in noisy listener environments without adding
energy. Such algorithms have been described in the literature by Boillot and form the backbone of this
thesis. The current project focuses on taking a step towards running these algorithms in real-time
on a 16-bit fixed point Motorola DSP 56600. Implementation of the autocorrelation, Levinson-
Durbin, FIR, and IIR filters in assembly for the Motorola DSP 56600 has been investigated
in the thesis. The challenges and alternate solutions to circumvent the challenges have been
described, and experimental results have been presented. Results indicate that the modified
signed LMS algorithm, which can be considered to be a blend between the LMS and signed LMS
algorithms, turns out to be an elegant solution to circumvent the challenges in implementing the
loudness enhancement algorithms in real-time.
In this thesis we provide a real-time implementation of algorithms that increase the perceived
loudness of a cellular phone in noisy listener environments. The work presented in this thesis
is a step towards achieving a real-time working product based on the loudness enhancement
algorithms which have been earlier developed and simulated in MATLAB by Marc Boillot.
This algorithm widens the bandwidth of the formant of voiced phonemes in speech. This work,
which has been funded by Motorola Inc., involves the use of a DSP simulator to simulate the
algorithms on a Motorola DSP 56600, which is a ROM-based 16-bit fixed-point CMOS Digital Signal Processor.
This chapter briefly describes the basic structure of a digital cell phone and some basics about
digital signal processors (DSPs). Later in the chapter we discuss the loudness enhancement
algorithms and present the structure of the remaining chapters. The loudness algorithms present
the motivation behind this thesis.
1.1 What Is a DSP?
A digital signal processor (DSP) can be classified as a variant of a microprocessor-one that
is fast and powerful enough to process data in real time. The real-time capability of the DSP
makes it ideal for applications where delays must be minimized.
Vocoders attempt to achieve high data compression while simultaneously maintaining the
signal quality. Most vocoders use psychoacoustic criteria incorporated in their bit compression
schemes. Bandwidth expansion techniques have been used in vocoders to alleviate quantization
noise and residual effects of the vocoding process. Two of the typically defined internal vocoder
filtering operations which use this technique are the perceptual noise spectral shaping and the
adaptive post-filtering. The Code Excited Linear Prediction (CELP) algorithm has been shown
to achieve high quality speech coding at low bit rates. The perceptual noise weighting filter
effectively generates a noise error function which retains the formant pole locations and elevates
the allowable noise in tonal regions. Applied to the excitation signal, it alters the flat spectrum
of the excitation to that of human hearing sensitivity. The adaptive post filtering operation
typically attempts to suppress quantization noise in the valley regions by amplification of the
less sensitive formant regions. It consists of four basic filtering steps: long-term filtering, short-
term filtering, tilt compensation, and adaptive gain control. Of these, the short-term filter is
used to improve the overall quality of the synthesized speech and to alleviate quantization noise.
Older analog cellular phones usually suffered from poor speech quality and annoying echoes.
However, in newer digital cell phones, the DSP takes a real-world signal, like speech, and performs
mathematical computations on it to improve the sound (referred to as "speech enhancement").
The DSP compresses the data (one's voice), removes the background noise (referred to as "noise
cancellation") and eliminates the echoes (referred to as "echo cancellation") so that one's voice
can be transmitted faster. The result is a clear sound, with no annoying echoes.
One task that a DSP does is to take a digital signal and process it to improve the signal.
The improvement is in the form of clearer speech, sharper images, or faster data. This ability
to process signals without requiring additional energy can make new breakthroughs in cellular
phone technology where a longer battery life is one of the prime concerns of the consumer.
In the next section we look at the internal structure of a digital cellular phone and try to
explain the functionalities of each of the components that make up the phone.
1.2 Inside a Digital Cell Phone
On a "complexity per cubic inch" scale, cell phones are among the most intricate devices
that people handle on a daily basis. Modern digital
cell phones can process millions of calculations per second in order to compress and decompress
the voice stream. They can transmit and receive on hundreds of channels, switching channels in
sync with the base stations as the phone moves between cells.
The DSP, a chip rated in Millions of Instructions Per Second (MIPS), and the vocoder handle all the compression and decompression
of speech signals. The microprocessor and memory units handle the housekeeping chores for the
keyboard and display, deal with command and control signalling with the base station and also
coordinate the rest of the functions on the board. The RF and power section handles issues of
power management and recharging and also deals with the hundreds of channels. And finally,
the RF transmitter and receiver amplifiers handle the signals coming in and out of the antenna.
In the following section, we present and describe the "loudness enhancement algorithms" that
form the backbone of this thesis report. In this section, a few technical terms which might be of
interest in the remainder of this thesis have been described.
1.3 Loudness Enhancement Algorithms
There is a vast world market for small hand-held devices and cellular phones. The major
concern of any such manufacturing company is that of designing low-cost devices with limited
power consumption. Battery life can be considerably increased by saving on power consumption.
However, a majority of these companies focus on building devices with better speaker design and
on using more efficient power amplifiers, which can reduce the current drain and thus increase
the battery life considerably. None of these companies have tried to address energy conservation
schemes which operate directly on the speech signal which serves as the input and output to
the modern digital cellular phones. Therefore, an important step towards power savings can
be to address this issue by focusing on the design of algorithms which operate directly on
the speech signal and try to increase loudness perception over a cellular phone without actually
increasing the signal energy. This section describes the basis of such loudness enhancement
algorithms, presents their development, and defines basic terminology
associated with them. As we shall see in the following sections, these algorithms exploit the
psychoacoustic nature of the human auditory system to achieve loudness enhancement, and a
novel warped filter implementation of the same is developed. It is critical that these algorithms
be implementable in real-time so that a full working product can be made realizable.
1.3.1 Critical Band Concept
In this section, we present the basis of the "loudness enhancement algorithms." A brief
overview of the critical band concept and its significance towards loudness enhancement is presented.
Loudness can be defined as the human perception of the intensity of a speech signal. Loud-
ness is a function of the sound intensity as well as the frequency and quality of the speech
signal. Loudness can be evaluated based on the ISO-532B standard, which is a graphical evaluation
procedure for calculating the loudness of a complex sound. Loudness models have also been
developed in the literature, first by Zwicker and then further improved by Moore and Glasberg.
Moore's and Zwicker's models are very similar at moderate and normal sound levels. However,
at lower frequencies and at sound levels close to quiet, Moore's model outperforms Zwicker's
model. The algorithms developed in this work are based on Moore's model, which uses
excitation patterns obtained from auditory filters.
The loudness level was introduced as a mechanism for measuring loudness. By
definition, the loudness level of a sound is the pressure level of a 1-kHz tone which is equally loud
as the sound being tested. The loudness level is measured with a unit called the "phon." Sounds
with equal phon levels are equally loud, and the contours of such equal loudness, known as the
equal-loudness curves, are shown in Figure 1.2.
The phon, however, does not provide a measure of the loudness scale and hence, another unit
called the "sone" was introduced. A sone value of 1 corresponds to the loudness exhibited by a
1-kHz tone at an intensity of 40 dB sound pressure level. A 10-phon increase is approximately
equivalent to a doubling of the sone value.
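This phon-to-sone relation can be captured in a small sketch (the function name is ours; the mapping itself is the standard one, valid for levels above roughly 40 phons):

```python
def phons_to_sones(phons):
    """Convert loudness level (phons) to loudness (sones).

    Uses the standard anchoring: 40 phons corresponds to 1 sone,
    and every 10-phon increase doubles the sone value.
    """
    return 2.0 ** ((phons - 40.0) / 10.0)

# e.g. a 60-phon sound is perceived as four times as loud as a 40-phon sound
```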
The critical band is an extremely important concept in auditory theory. A critical band can
be regarded as the bandwidth within which sudden perceptual changes can be noticed.
In view of this, the critical band concept forms the basis of the loudness enhancement algo-
rithm. The overall loudness of a speech signal is obtained by summing the loudness over each
of the critical bands.
1.3.2 Warped Filter Implementation
In this section, we outline the implementation of a warped filter structure which was used in
achieving loudness enhancement of the input speech signal without actually increasing its energy.
However, we reserve the majority of the development of this structure for future chapters of
this thesis.
Loudness enhancement involves increasing the bandwidth beyond a critical bandwidth as
indicated in the previous section. However, to achieve this bandwidth expansion without actually
changing the formant locations of the spectral envelope of speech, we make use of the
DSP technique of evaluating the spectral response over a circle with radius larger than 1.
We know from DSP theory that when the transfer function of a stable system is evaluated over a
circle of radius larger than 1, the resulting system is not only stable, but, because the
poles are pulled farther away from the circle boundary, the formants are bandwidth
expanded. Evaluating over a circle of radius larger than unity is equivalent to
scaling the k-th coefficient of the system by the k-th power of a radius term. This provides a
fixed bandwidth increase independent of formant frequency. However, the critical
bandwidth increases with frequency, and we would like a non-linear expansion of
bandwidth with frequency to account for this. This non-linear expansion is achieved by
the use of a warped filter structure. The details of this implementation are reserved for
later chapters.
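The coefficient-scaling view of bandwidth expansion can be sketched as follows (a toy all-pole example of our own; `gamma` plays the role of the reciprocal of the evaluation radius):

```python
import numpy as np

def expand_bandwidth(a, gamma):
    """Scale the k-th denominator coefficient of an all-pole filter
    1/A(z) by gamma**k, i.e. replace A(z) with A(z/gamma).

    For gamma < 1 every pole p moves to gamma*p, farther from the
    unit circle, which widens the formant bandwidth without moving
    the formant (angular) location."""
    a = np.asarray(a, dtype=float)
    return a * gamma ** np.arange(len(a))

# A toy second-order resonator with poles at radius 0.99, angle pi/4
r, theta = 0.99, np.pi / 4
a = np.array([1.0, -2 * r * np.cos(theta), r * r])
poles = np.roots(expand_bandwidth(a, 0.95))
# pole radius shrinks from 0.99 to 0.99 * 0.95; the angle is unchanged
```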
It should be noted that in Boillot, the loudness enhancement algorithms which were
presented and developed in MATLAB worked only on the "voiced" sections of speech. A brief
section on vowels is as such presented next to illustrate the significance of working with voiced
speech for these algorithms.
Vowels can be characterized typically using the first three formant locations. The important
acoustic cues associated with vowels are the formant frequency locations, their bandwidths,
amplitude and duration. The formant hypothesis (resulting from the classic study of Peterson
and Barney, as cited in Hillenbrand et al. ) states that the first two to three formant locations
provide for vowel discrimination while the second and third formants help in discerning vowel
intelligibility. Thus, in essence, vowel formant locations are an acoustic cue in vowel perception.
Furthermore, studies have shown that a change in formant locations can considerably affect the
phonetic quality of vowels, whereas a change in formant bandwidths or spectral tilt will not
affect phonetic quality. This is a useful feature of vowel characteristics, since in the current
project we are interested in altering the formant bandwidths and not the formant locations to
improve the loudness of the speech signal.
Vowels are typically spectrally smooth and have high energy. The majority of the signal
power is contained in them, and a vast majority of it is unmasked. Also, as indicated in
the previous section, the alteration of formant bandwidths does not degrade vowel identification
and intelligibility. Loudness analysis indicates that the peak loudness is produced by vowels
in speech. Moreover, the intelligibility of speech is determined by the vowel-consonant-
vowel transitions rather than the steady-state regions of vowels. These observations suggest
that a loudness enhancement scheme which preserves energy would best work on vowel sections
of speech. In view of these observations, a bandwidth expansion technique which increases the
bandwidth of vowel regions of speech moderately should lead to an increase in loudness perception
without actually degrading the signal. Such a technique is called formant expansion. Results
have shown that increasing the bandwidth on a linear scale will increase loudness.
Thus, in this section we saw that vowels carry the majority of the speech signal power,
are abundant in any typical sentence, have smooth spectral shapes, have broad bandwidths
which increase with increasing frequency and finally can be sufficiently bandwidth expanded
without degrading the intelligibility. Therefore, they form ideal candidates on which the loudness
enhancement techniques can be performed.
1.4 Chapter Organization and Structure
In this section, we present the chapter structure and a brief description of each chapter in
the remainder of this thesis.
In chapter 2, we shall discuss the architecture of the Motorola DSP 56600 which has been
used for implementation of the developed algorithms in real-time. The chapter shall discuss
the details of the processor and also provide some applications of the processor. In chapter 3,
the basic building blocks for the implementation of the loudness enhancement algorithms are
described. These include the autocorrelation, LPC, FIR, and IIR filter blocks, which shall be
discussed in greater detail. We will also talk about the warped filter implementation in this
chapter. We will present the difficulties and challenges encountered in the implementation of
the algorithm in real-time and also provide a basic description of the FIR scaling from a binary
mathematical point of view. In chapter 4, we will present an alternate method for circumventing
the challenges we encountered in implementation of the loudness enhancement algorithms in
real-time. Experimental results are also described in the chapter and the total time taken for
the algorithms to run in the real world is tabulated. Chapter 5 shall be the final chapter of the
current thesis and shall bring out the conclusions of the experimental results. It shall focus on
providing a step towards future work that can be done to make this product completely realizable.
DSP ARCHITECTURAL DETAILS
In this chapter, we will discuss the architectural details and programmable modes of the
16-bit DSP 56600 which has been used for the implementation of the loudness enhancement
algorithms. The chapter begins with a small overview of the DSP followed by the various
architectural components of the DSP chip.
The current project involved the use of the DSP 56600 family chip. The DSP56600 family of
16-bit high performance Digital Signal Processors (DSPs) is designed specifically for low-power
digital handset cellular applications. These chips are capable of performing a wide variety of
fixed-point DSP algorithms. Each DSP in the family architecture contains a central processing
module which is common to various other family members. Besides this, a variety of other highly
integrated and cost-effective DSP devices can be built around this core based upon a library of
modules containing memories and peripherals. The main advantage of this DSP is that it can
provide very high execution speeds in a real-time, Input/Output (I/O) intensive environment
which most of the state-of-the-art DSP applications require.
Digital Signal Processing is the arithmetic processing of real-time signals which have been
sampled at regular intervals and digitized. Examples of such processing include:
Correlation: comparing two signals
Rectifying, amplifying and/or transforming a signal
Figure 2.1: Analog Signal Processing
All of the above functions have traditionally been performed using analog circuits. With
recent developments in the semiconductor industry, it has been possible to obtain the processing
power necessary to perform these and other functions using DSPs. Figure 2.1 shows an example
of analog signal processing. The circuit in the diagram shows a filter implementation for con-
trolling an actuator. Since an ideal filter is impossible to design, an engineer has to design it
for an acceptable response considering temperature variations, component aging, fluctuations in
power supply, and component accuracy. The resultant circuit has low noise immunity, requires
adjustments and is difficult to modify.
The equivalent circuit using a DSP is shown in Figure 2.2. The application requires the use
of an Analog-to-Digital (A/D) and Digital-to-Analog (D/A) converter in addition to the DSP.
However, even with these additional parts, the total component count can be much lower using
a DSP than the analog counterpart. This is mainly due to the high integration of components
available with the use of a DSP.
In summary, the advantages of using DSPs as compared to analog-only circuits include the following:
Figure 2.2: Digital Signal Processing
Built-in self-test possible
Stable, deterministic performance
Filter adjustments not needed
Wide variety of applications
High noise immunity
In the following sections, we describe the architecture of the Motorola DSP 56600 and also
detail each of the components in the architecture.
2.2 Central Architecture
This section describes the DSP 56600 core, a member of Motorola's family of programmable
CMOS DSPs. Low power dissipation, low cost, high performance and high integration are the
design priorities for this DSP core. Some of the major core features are the following:
* 60 Million Instructions Per Second (MIPS) with a 60-MHz clock at 2.7V
* Fully pipelined 16x16-bit parallel Multiplier-Accumulator (MAC)
* 40-bit parallel barrel shifter
* Highly parallel instruction set
* Position Independent Code (PIC) support
* Unique DSP addressing modes
* Nested hardware DO loops
* Fast auto-return interrupts
* On-chip 16-stage hardware stack with stack extension
* On-chip support for software patching and enhancements
* On-chip PLL
* On-Chip Emulation (OnCE) module
* Address tracing for debugging
* JTAG port compatible with the IEEE Standard Test Access Port and Boundary-Scan Ar-
chitecture (IEEE 1149.1)
Low-power features of the DSP 56600 core include the following:
* Very low power CMOS design (<0.7 mA/MIPS at 2.7 V and <0.5 mA/MIPS at 1.8 V)
* Low power Wait standby mode
* Ultra-low power Stop mode
* Power management units for further power reduction
* Fully static logic, with operation frequency down to DC
The DSP core provides the following functional blocks:
* Data Arithmetic Logic Unit (Data ALU)
* Address Generation Unit (AGU)
* Program Control Unit (PCU)
* Program Patch Logic
* PLL and Clock Oscillator
* Expansion Port (Port A)
* JTAG Test Access Port and On-Chip Emulation (OnCE) module
Besides this, each member of the DSP 56600 family provides its own set of on-chip periph-
erals for enhanced functionality. The following buses have been implemented for providing data
exchange between the blocks of the DSP core:
* Peripheral I/O Expansion Bus (PIO_EB) to peripherals
* Program Memory Expansion Bus (PM_EB) to Program ROM
* X Memory Expansion Bus (XM_EB) to X Memory
* Y Memory Expansion Bus (YM_EB) to Y Memory
* Global Data Bus (GDB) between Program Control Unit and other core structures
* Program Data Bus (PDB) for carrying program data throughout the core
* X Memory Data Bus (XDB) for carrying X data throughout the core
* Y Memory Data Bus (YDB) for carrying Y data throughout the core
* Program Address Bus (PAB) for carrying program memory addresses throughout the core
* X Memory Address Bus (XAB) for carrying X memory addresses throughout the core
* Y Memory Address Bus (YAB) for carrying Y memory addresses throughout the core
Excepting the Program Data Bus (PDB), all internal buses on the DSP 56600 core are 16-bit
buses. The PDB is a 24-bit bus. The block diagram of the DSP 56603 which is a member of
the DSP 56600 family of DSPs is shown in Figure 2.3. It illustrates the core blocks of the DSP
56600 and also shows representative peripherals for the chip implementation.
[Figure: block diagram of the DSP 56603 showing the core (16-bit Data ALU with 40-bit MAC, two 40-bit accumulators, and 40-bit barrel shifter; program interrupt controller, program decode controller, and program address generator; PLL; OnCE and JTAG), on-chip memories (Program RAM 16.5K x 24, X Memory RAM 8192 x 16, Y Memory RAM 8192 x 16, Bootstrap ROM 3072 x 24), and representative peripherals (triple timer, host interface HI08, SSI, and GPIO), connected by the address and data buses and the external memory interface.]
Figure 2.3: DSP 56603 Block Diagram
In the following sections, we describe each of the functional blocks of the DSP 56600 core. A
brief description of blocks that are not relevant to this project is also provided.
2.3 Data Arithmetic Logic Unit
This section presents the operation and architecture of the Data ALU, which is the heart of
the arithmetic and logical operations of the DSP core. In addition, it also presents the arithmetic
and rounding performed by the Data ALU.
2.3.1 Data ALU architecture
The Data ALU is primarily responsible for performing the arithmetic and logical operations
on data operands in the DSP core. The Data ALU registers can be read over the X Data Bus
(XDB) and the Y Data Bus (YDB) either as 16-bit or 32-bit operands. The source operands are
always the Data ALU registers themselves and can be either 16, 32, or 40 bits. The results are
stored in an accumulator. The operations are performed in 2 clock cycles in a pipelined fashion
so that a new instruction can be initiated every clock cycle, yielding an effective execution
rate of 1 clock cycle per instruction. Another feature is that the destination register can be
used as a source for the next instruction without any conflicts. The major components of the
Data ALU, which is shown in Figure 2.4, are as follows:
* Four 16-bit input registers
* A parallel, fully pipelined MAC
* Two 32-bit accumulator registers
* Two 8-bit accumulator extension registers
* A Bit Field Unit (BFU) with a 40-bit barrel shifter
* An accumulator shifter
* Two data bus shifter/limiter circuits
Figure 2.4: Data ALU Block Diagram
2.3.2 Data ALU Registers
X1, XO, Y1 and YO are four 16-bit general purpose data registers. These registers can either
be viewed as four separate 16-bit registers or two 32-bit registers formed by the concatenation of
X1:XO and Y1:YO, respectively. X1 is the most significant word in X and similarly Y1 is the most
significant word in Y. As can be seen in Figure 2.4 these registers serve as input buffers between
the XDB or YDB and the MAC unit or barrel shifter. These registers are used as Data ALU
source operands allowing new operands to be loaded for the next instruction while the register
contents are used by the current instruction. Besides this, they can also be used to read back
out onto the XDB or YDB.
2.3.3 MAC Unit
The heart of the arithmetic processing unit of the DSP 56600 core is the Multiplier-
Accumulator (MAC). It performs all the calculations on data operands. In the case of arithmetic
operations, it accepts as many as 3 inputs and outputs one 40-bit result of the form Extension:Most
Significant Product:Least Significant Product (EXT:MSP:LSP). The MAC unit operates independently
of and in parallel with the XDB and YDB activity. It executes 16-bit x 16-bit, parallel,
fractional multiplies between two's complement, signed, unsigned or mixed operands. The 32-bit
product is right justified and added to the 40-bit contents of either the A or B accumulator.
The resultant 40-bit sum is stored back in the same accumulator. The MAC operation
is fully pipelined and takes 2 clock cycles to complete. In the first clock cycle, the multiply
operation is performed and the product is stored in the pipeline register. In the second clock
cycle, the product is added to or subtracted from the accumulator. In the case of a pure multiply
operation (MPY) being specified, the MAC clears the contents of the accumulator and then adds the
product to it during the second clock cycle. A 40-bit result can also be stored as
a 16-bit operand. In such a case, the LSP can either be truncated or rounded into the MSP.
Rounding is performed if specified in the DSP instruction (e.g., MACR). The rounding can be
either convergent rounding (round-to-nearest-even) or two's complement rounding. The type of
rounding is specified by the Rounding Mode bit (RM) in the Status Register (SR). The bit in
the accumulator that is rounded is specified by the Scaling Mode bits (S0 and S1) in the SR. It
is possible to saturate the arithmetic unit's result going into the accumulator so that it fits
in 32 bits (MSP:LSP). This process is called arithmetic saturation. It is activated by the Arithmetic
Saturation Mode (SM) bit in the SR. This mode is typically used for algorithms which
cannot take advantage of the Extension Accumulator (EXT).
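As an illustration of the fractional multiply-accumulate described above (not the silicon datapath itself, just Python integers mimicking the arithmetic; the function name is ours):

```python
def frac_mac(acc, x, y):
    """One multiply-accumulate step on 16-bit fractional (Q15) operands.

    x and y are signed 16-bit integers interpreted as Q15 fractions.
    Their product, left-shifted one bit (the fractional multiply),
    is added to a 40-bit accumulator laid out as EXT:MSP:LSP.
    """
    product = (x * y) << 1            # 32-bit fractional (Q31) product
    acc += product
    # wrap to 40 bits, preserving sign, as the hardware accumulator would
    acc &= (1 << 40) - 1
    if acc >= 1 << 39:
        acc -= 1 << 40
    return acc

# 0.5 * 0.5 in Q15: 0x4000 * 0x4000; the fractional product is 0.25 in Q31
acc = frac_mac(0, 0x4000, 0x4000)   # acc == 0x20000000
```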
2.3.4 Data ALU Accumulator Registers
There are six Data ALU registers, viz. A2, A1, A0, B2, B1 and B0. Taken together they
form two general purpose 40-bit accumulators A and B, with each one of them having three
concatenated registers, A2:A1:A0 and B2:B1:B0, respectively. A1 or B1 stores the 16-bit MSP,
A0 or B0 stores the 16-bit LSP, while the 8-bit EXT is stored in A2 or B2.
Reading the A or B accumulators over the XDB or YDB buses is protected against overflow
by substituting a limiting constant for the data that is being transferred. The content of A or
B is not affected if limiting occurs. Only the value that is transferred over the XDB or YDB
is limited. This process is commonly referred to as transfer saturation and is different from the
Arithmetic Saturation mode that was described in Section 2.3.3.
The overflow protection is performed after the contents of the accumulator have been shifted
according to the scaling mode. Shifting and limiting are performed only when the entire 40-bit
accumulator is specified as the source for a parallel data move over the XDB or YDB. Shifting
and limiting are not used when only an individual register within an accumulator (A1, A0, A2,
B1, B0 or B2) is specified as the source for a parallel data move. The A and B accumulators serve
as buffer registers between the Arithmetic Unit and the XDB or YDB buses. These registers can
be used as both Data ALU source and destination operands.
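The transfer-saturation behavior described above (limit the value placed on the bus while leaving the accumulator untouched) can be sketched as follows (the function name and return convention are ours):

```python
def limit_transfer(acc40):
    """Return the value driven onto a 16-bit bus when a signed 40-bit
    accumulator is read with transfer saturation.

    If the accumulator fits in the signed 32-bit MSP:LSP range, the
    16-bit MSP is transferred unchanged; otherwise a limiting constant
    (the maximum positive or negative 16-bit value) is substituted.
    The accumulator itself is never modified."""
    if acc40 > 0x7FFFFFFF:            # extension byte in use, positive
        return 0x7FFF                 # largest positive 16-bit constant
    if acc40 < -0x80000000:           # extension byte in use, negative
        return -0x8000                # most negative 16-bit constant
    return acc40 >> 16                # MSP, transferred unchanged
```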
2.3.5 Accumulator Shifter
The accumulator shifter is an .,-vnchronous parallel shifter with a 40-bit input and a 40-
bit output that is implemented immediately before the MAC accumulator input. The source
accumulator shifting operations are:
* No shift (unmodified)
* 16-bit Right Shift (Arithmetic) for DMAC
* Force to zero
2.3.6 Bit Field Unit
The Bit Field Unit (BFU) contains a 40-bit parallel bidirectional shifter with a 40-bit input
and a 40-bit output, a mask generation unit, and a logic unit. The BFU is used in the following
operations:
* Multibit Left Shift (Arithmetic or Logical) for ASL, LSL
* Multibit Right Shift (Arithmetic or Logical) for ASR, LSR
* 1-bit Rotate (Right or Left) for ROR, ROL
* Bit Field Merge, Insert, and Extract for MERGE, INSERT, EXTRACT, and EXTRACTU
* Count Leading Bits for CLB
* Fast Normalization for NORMF
* Logical operations for AND, OR, EOR, and NOT
2.3.7 Data Shifter/Limiter
The data shifter/limiter circuits provide special post-processing on data read from the A
and B accumulators out to the XDB or YDB buses. There are two independent shifter/limiter
circuits, one for the XDB bus and the other for the YDB bus. Each consists of a shifter followed
by a limiter circuit.
The data shifters in the shifter/limiter units can perform the following data shift operations:
* Scale up: shift data one bit to the left
* Scale down: shift data one bit to the right
* No scaling: pass the data unshifted
Each data shifter has a 16-bit output with overflow indication. These shifters permit dynamic
scaling of fixed-point data without modifying the program code. The data shifters are controlled
using the Scaling Mode bits (SO and S1) in the SR.
In the DSP 56600 core, the Data ALU accumulators A and B have eight extension bits.
Limiting occurs when the extension bits are in use and either A or B is the source being read
over the XDB or YDB. The limiters in the DSP 56600 core place a shifted and limited value
on XDB or YDB without changing the contents of the A or B registers. Having two limiters
allows two-word operands to be limited independently in the same instruction cycle. The two
data limiters can also be combined to form one 32-bit data limiter for long-word operands.
If the contents of the selected source accumulator can be represented without overflow in the
destination operand size (i.e., the signed integer portion of the accumulator is not in use), the
data limiter is disabled, and the operand is not modified. However, if the contents of the selected
source accumulator cannot be represented without overflow in the destination operand size, the
data limiter substitutes a limited data value having maximum magnitude (saturated) and having
the same sign as the source accumulator contents:
$7FFF for 16-bit positive numbers
$7FFF FFFF for 32-bit positive numbers
$8000 for 16-bit negative numbers
$8000 0000 for 32-bit negative numbers
This process is called transfer saturation. The value in the accumulator register is not shifted
or limited and can be reused within the Data ALU. When limiting does occur, a flag is set and
latched in the SR.
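A rough Python model of one shifter/limiter on a 16-bit read may help fix the idea: the value is first shifted per the scaling mode, then limited to $7FFF or $8000 on overflow. The function and its integer Q-format view are illustrative assumptions, not the exact datapath:

```python
def limit_word(acc, scale_shift=0):
    """Shift an accumulator value per the scaling mode, then limit it to a
    signed 16-bit word on the bus. Illustrative sketch, not the silicon."""
    v = acc << scale_shift if scale_shift >= 0 else acc >> -scale_shift
    if v > 0x7FFF:       # positive overflow -> $7FFF
        return 0x7FFF
    if v < -0x8000:      # negative overflow -> $8000
        return -0x8000
    return v             # in range: transferred unchanged
```

The accumulator itself is untouched; only the transferred copy is limited, matching the transfer-saturation behavior described above.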
2.3.8 Data ALU Arithmetic
The DSP 56600 core uses a fractional data representation for all Data ALU operations. The
decimal points are all aligned and are left-justified.
The most negative number that can be represented is -1.0. The internal representation is
$8000 for words and $8000 0000 for long words. The most positive word is $7FFF, or 1 - 2^{-15},
and the most positive long word is $7FFF FFFF, or 1 - 2^{-31}. These limitations apply to
all data stored in memory and to data stored in the Data ALU input buffer registers. The
extension registers associated with the accumulators allow for word growth, so that the most
positive number with word growth that can be used is 256 and the most negative number with
word growth is -256.
To maintain alignment of the binary point, when a word operand is written to accumulator A
or B, the operand is written to the most significant accumulator register (Al or B1), and its MSB
is automatically sign extended through the accumulation extension register (A2 or B2). The least
significant accumulator register (AO or BO) is automatically cleared. When a long-word operand
is written to an accumulator, the least significant word of the operand is written to the least
significant accumulator register. The number representation for integers lies between plus and minus 2^{N-1},
while the fractional representation is limited to numbers between plus and minus 1. To convert from an integer to
a fractional number, the integer must be multiplied by a scaling factor so that the result will
always lie between plus and minus 1. The representation of integer and fractional numbers is the same if the
numbers are added or subtracted, but differs when the numbers are multiplied or divided.
The key difference is in the alignment of the 2N-1 bit product. In fractional multiplication,
the 2N-1 significant product bits should be left-aligned, and a 0 filled into the LSB to maintain
fractional representation. In integer multiplication, the 2N-1 significant product bits should
be right-aligned, and the sign bit duplicated to maintain integer representation. Since the DSP
56600 core incorporates a fractional array multiplier, it always aligns the 2N-1 significant
product bits to the left. In addition, the DSP 56600 core provides two rounding modes,
viz. convergent rounding and two's-complement rounding. The type of rounding is selected
by the Rounding Mode (RM) bit in the SR.
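The fractional product alignment and convergent rounding can be sketched in Python as follows; the Q15/Q31 integer view and the function names are illustrative assumptions, not the DSP instructions themselves:

```python
def frac_mul(a_q15, b_q15):
    """Fractional (Q15 x Q15 -> Q31) multiply: the 2N-1 significant product
    bits are left-aligned by one extra shift, with a 0 filled into the LSB."""
    return (a_q15 * b_q15) << 1

def convergent_round(q31):
    """Round a Q31 value to Q15, with ties going to the nearest even LSB
    (convergent rounding, as opposed to two's-complement rounding)."""
    low, hi = q31 & 0xFFFF, q31 >> 16
    if low > 0x8000 or (low == 0x8000 and (hi & 1)):
        hi += 1
    return hi
```

For example, 0.5 times 0.5 in Q15 (0x4000 times 0x4000) yields 0x20000000, which reads as 0.25 in Q31; a plain integer multiply would leave the same product right-aligned instead.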
2.4 Address Generation Unit
The Address Generation Unit (AGU) performs the effective address calculations using integer
arithmetic necessary to address the data operands in memory and contains the registers used
to generate these addresses. It implements four types of arithmetic: linear, modulo, multiple
wrap-around modulo, and reverse-carry. The AGU operates in parallel with other chip-resources
to minimize address-generation overhead.
The AGU is divided into two halves, each with its own Address Arithmetic Logic Unit (Ad-
dress ALU). Each Address ALU has four sets of register triplets, and each register triplet is
composed of an address register, an offset register, and a modifier register. The two Address
ALUs are identical. Each contains a 16-bit full adder (called an offset adder).
A second full adder (called a modulo adder) adds the result of the first full adder to a modulo
value that is stored in its respective modifier register. A third full adder (called a reverse-carry
adder) is also provided. The offset adder and reverse-carry adder are in parallel and share
common inputs. The only difference between them is that their carries propagate in opposite
directions. Test logic determines which of the three summed results of the full adders is output.
Each Address ALU can update one address register from its respective address register file
during one instruction cycle. The contents of the associated modifier register specifies the type
of arithmetic to be used in the address register update calculation. The modifier value is decoded
in the Address ALU.
Since the modulo-addressing modifier type has been used in the current project, a brief
description of the same is provided below.
In this type of modifier arithmetic mode, address modification is performed modulo M, where
M ranges from 2 to +32,768. Modulo M arithmetic causes the address register value to remain
within an address range of size M, defined by a lower and upper address boundary.
The value m = M - 1 is stored in the modifier register. The lower boundary (base address)
value must have zeros in the k LSBs, where 2^k >= M, and therefore must be a multiple of 2^k. The
upper boundary is the lower boundary plus the modulo size minus one (base address + M - 1).
Since M <= 2^k, once M is chosen, a sequential series of memory blocks, each of length 2^k, is
created where these circular buffers can be located. If M < 2^k, there is a space of 2^k - M
unused locations between sequential buffers.
The address pointer is not required to start at the lower address boundary or to end on
the upper address boundary; it can initially point anywhere within the defined modulo address
range. Neither the lower nor the upper boundary of the modulo region is stored; only the size
of the modulo region is stored in Mn. The boundaries are determined by the contents of Rn.
Assuming the (Rn)+ indirect addressing mode is used, if the address register pointer increments
past the upper boundary of the buffer (base address + M - 1), it wraps around through the base
address (lower boundary). Alternatively, assuming that the (Rn)- addressing mode is used, if
the address decrements past the lower boundary (base address), it wraps around through
base address + M - 1 (upper boundary).
If an offset, Nn, is used in the address calculations, the 16-bit absolute value |Nn| must be
less than or equal to M for proper modulo addressing. If |Nn| > M, the result is data dependent
and unpredictable, except for the special case where Nn = P x 2^k, a multiple of the block size,
where P is a positive integer. For this special case, when using the (Rn)+Nn addressing mode,
the pointer, Rn, jumps linearly to the same relative address in a new buffer, which is P blocks
forward in memory. Similarly, for (Rn)-Nn, the pointer jumps P blocks backward in memory.
This technique is useful in sequentially processing multiple tables or N-dimensional arrays.
The range of values for Nn is -32, 768 to +32, 767. The modulo arithmetic unit automatically
wraps around the address pointer by the required amount. This type of address modification is
useful for creating circular buffers for FIFO queues, delay lines and sample buffers up to 32,767
words long, as well as for decimation, interpolation, and waveform generation. The special case
of (Rn)+Nn modulo M with Nn = P x 2^k is useful for performing the same algorithm on multiple
blocks of data in memory, for example, when performing parallel Infinite Impulse Response (IIR)
filtering.
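The modulo-addressing update can be mimicked in Python as below; the parameter names follow the Rn/Nn/M convention above, but the function itself is an illustrative sketch of the wrap-around arithmetic, not the AGU hardware:

```python
def modulo_update(rn, nn, base, M):
    """Advance address register rn by offset nn within a circular buffer of
    size M based at 'base' (which must be a multiple of 2**k, 2**k >= M)."""
    return base + (rn - base + nn) % M

# A 5-word circular buffer based at address 64 (k = 3, since 2**3 >= 5):
r = 67
r = modulo_update(r, 1, 64, 5)   # advances to 68 (the upper boundary)
r = modulo_update(r, 1, 64, 5)   # wraps around to the base address, 64
```

Decrementing past the base address likewise wraps around through the upper boundary, as the text describes for the (Rn)- mode.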
2.5 Program Control Unit
The Program Control Unit (PCU) performs instruction prefetch, instruction decoding, hard-
ware DO loop control and exception processing. The PCU implements a seven-stage pipeline
and controls the different processing states of the DSP 56600 core. The PCU consists of three
hardware blocks:
* Program Decode Controller (PDC)
* Program Address Generator (PAG)
* Program Interrupt Controller (PIC)
The PDC decodes the 24-bit instruction loaded into the instruction latch and generates all
signals necessary for pipeline control. The PAG contains all the hardware needed for program
address generation, system stack and loop control. The PIC arbitrates among all interrupt
requests (internal interrupts as well as the five external requests IRQA, IRQB, IRQC, IRQD,
and NMI), and generates the appropriate interrupt vector addresses.
The PCU implements its functions using the following registers:
* PC-Program Counter Register
* LA-Loop Address Register
* LC-Loop Counter Register
* VBA-Vector Base Address Register
* OMR-Operating Mode Register
* SC-Stack Counter Register
The PCU also includes a hardware System Stack (SS).
2.6 Program Patch Logic
The Program Patch Logic (PPL) block provides the core user a way to fix the program code
in the on-chip ROM without generating a new mask. Implementing the code correction is done
by replacing a piece of ROM-based code with a patch program stored in RAM. The PPL consists
of four Patch Address Registers (PAR1-PAR4) and four patch address comparators. Each PAR
points to a starting location in the ROM code where the program flow is to be changed. The
PC register in the PCU is compared to each PAR. When an address of a fetched instruction
is identical to an address stored in one of the PARs, the Program Data Bus (PDB) is forced
to a corresponding JMP instruction, replacing the instruction that otherwise would have been
fetched from the ROM.
2.7 PLL and Clock Oscillator
The DSP 56600 core features a Phase Locked Loop (PLL) clock oscillator in its central
processing module. The PLL allows the processor to operate at a high internal clock frequency
using a low frequency clock input. The clock generator in the core is composed of two main
blocks: the PLL, which performs the clock input division, frequency multiplication, and skew
elimination, and the Clock Generator (CLKGEN), which performs low-power division and clock pulse generation.
2.8 Expansion Port (Port A)
Port A is the memory expansion port and is used for both program and data memory. It
provides an easy to use, low part-count connection with fast or slow static memories and with
I/O devices. The Port A data bus is 24 bits wide with a separate 16-bit address bus capable of
a sustained rate of one memory access per two clock cycles. External memory can be as large
as 64 K x 24-bit program memory space, depending on chip configuration. An internal wait
state generator can be programmed to insert as many as thirty-one wait states if access to slower
memory or I/O device is required. For power-sensitive applications and applications that do not
require external memory, Port A can be fully disabled.
2.9 JTAG Test Access Port and On-Chip Emulator (OnCE)
The DSP 56600 core provides a dedicated user-accessible Test Access Port (TAP) that is fully
compatible with the IEEE Standard Test Access Port and Boundary-Scan Architecture (IEEE
1149.1). The test logic includes a Test Access Port consisting of four dedicated signal pins,
a 16-state controller, and three test data registers. A boundary scan register links all device
signal pins into a single shift register. The test logic, implemented using static logic design, is
independent of the device system logic. The On-Chip Emulation (OnCE) module provides a
means of interacting with the DSP 56600 core and its peripherals non-intrusively so that a user
can examine registers, memory, or on-chip peripherals. This facilitates hardware and software
development on the core processor.
2.10 On-Chip Memory
The memory space of the DSP 56600 core is partitioned into program memory space, X
data memory space, and Y data memory space. The data memory space is divided into X data
memory and to Y data memory in order to work with the two Address ALUs and to feed two
operands simultaneously to the Data ALU. Memory space typically includes internal RAM and
ROM and can be expanded off-chip under software control. Both internal and external memory
configuration is specific to each member of the DSP 56600 family. The total on-chip and external
memory for the DSP 56602 and DSP 56603, which belong to the DSP 56600 family, is tabulated
in Table 2.1.
Table 2.1: On-Chip and External Memory

Device      On-chip Data Memory    On-chip Program Memory   External Data/Program Memory
DSP 56602   25K x 16-bit X-RAM     5K x 24-bit RAM          64K x 24-bit
            6K x 16-bit X-ROM      34K x 24-bit ROM
            25K x 16-bit Y-RAM
            8K x 16-bit Y-ROM
DSP 56603   8K x 16-bit X-RAM      16K x 24-bit RAM         64K x 24-bit
            8K x 16-bit Y-RAM      3K x 24-bit ROM
Each member of the DSP 56600 family can be configured with its own set of on-chip peripherals
for communicating with external devices or memory, as well as for providing additional
on-chip functionality.
In this chapter, we have presented a description of the architectural details of the DSP
56600 core and have also discussed the sections relevant to this project in greater detail. The next
chapter outlines the basic building blocks that form the backbone of the loudness enhancement
algorithms.
3 BUILDING BLOCKS AND IMPLEMENTATION ISSUES
This chapter describes in greater detail the basic building blocks of the loudness enhancement
algorithm implementation on the Motorola DSP 56600. A block diagram representation of the
system setup is shown and thereafter each of the blocks in the diagram is described in greater
detail. We also discuss the warped filter implementation of the loudness enhancement algorithms.
In the current project, we have not implemented the warped filter structure for DSP simulations.
3.1 Basic Block Diagram
In this section, we shall present the fundamentals behind linear prediction which is the most
important step in the bandwidth expansion technique for the loudness enhancement algorithms.
This section will motivate the basic block diagram that can be used to represent the building
blocks of the loudness enhancement algorithms. A brief section on the warped linear prediction
is also presented.
3.1.1 Introduction to Linear Prediction
Linear prediction is the most well-known technique for modelling acoustical speech behavior.
Linear prediction makes use of the fact that speech varies very slowly with time with
fairly stationary characteristics, that is, it is quasi-stationary. Linear prediction developed from
models of speech production based on linear mathematical principles.
The linear model assumes that a glottal excitation source stimulates a vocal tract model,
whose output in turn passes through a lip radiation model. The overall model can be represented by the
following equation

S(z) = E(z)G(z)V(z)L(z) \qquad (3.1)

where E(z) represents the excitation, G(z) represents the glottal shaping, V(z) represents the
vocal tract model, and L(z) represents the lip radiation model. The glottal excitation is the
quasi-periodic pulse train of air generated by the vibration of the vocal cords in response to air flow
from the lungs. An all-pole filter can be used to represent the linear speech production model,
and is represented by the following equation

G(z)V(z)L(z) = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} \qquad (3.2)
The all-zero filter A(z) is referred to as the inverse filter (sometimes also called the analysis
filter). This filter is used in the analysis model E(z) = S(z)A(z). The reciprocal of A(z) is called
the all-pole model and is used in the all-pole speech synthesis S(z) = E(z)/A(z).
Linear prediction of speech is based on the concept that the parameters of the speech produc-
tion model vary very slowly over time and that in any interval of long enough duration, the speech
waveform can be represented by a linear combination of its past values. The Linear Predictive
Coding (LPC) model has been well understood since the early 1970's and can be described by
the following equation
s(n) = \sum_{k=1}^{p} a_k s(n-k) + G u(n) \qquad (3.3)
where u(n) is the normalized glottal excitation and G is the excitation gain. Eq. 3.3 leads to the
following transfer function
H(z) = \frac{S(z)}{G U(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} = \frac{1}{A(z)} \qquad (3.4)
The LPC analysis equations provide a means of evaluating the prediction error. The prediction
error is used as a minimization criterion in finding the optimal filter coefficients a_k which
best represent the speech signal in a mean squared error sense. The prediction error is basically
a measurement criterion which indicates how close the synthetic representation of speech is to
the true speech signal. Let us define s(n) as the synthetic representation of speech. Thus, s(n)
represents a linear combination of previous speech samples.
s(n) = ais(n 1) + as(n 2) + ... + aps(n p) (3.5)
The prediction error is then given as
e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k s(n-k) \qquad (3.6)
which leads to the following transfer function
A(z) = \frac{E(z)}{S(z)} = 1 - \sum_{k=1}^{p} a_k z^{-k} \qquad (3.7)
In the case when the speech s(n) is actually generated using Eq. 3.5, the prediction error e(n)
equals the scaled glottal excitation Gu(n). The main purpose of linear prediction is then to find
a set of optimal coefficients a_k which minimize the mean squared error. The set of equations
which must be solved in order to determine the optimal set of predictor coefficients is known
as the set of normal equations and is given as
\phi(i, 0) = \sum_{k=1}^{p} a_k \phi(i, k), \quad 1 \le i \le p \qquad (3.8)

where \phi(i, k) represents the short-term covariances of the speech signal. These equations can be
solved using the autocorrelation method shown below

\sum_{k=1}^{p} r(|i - k|) a_k = r(i), \quad 1 \le i \le p \qquad (3.9)
where rk is the autocorrelation at lag k.
It is imperative to recall that the inverse filter coefficients \alpha_k are related to the predictor
coefficients a_k by the following relation

\alpha_k = -a_k \quad \text{for } k = 1, 2, \ldots, p \qquad (3.10)
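For reference, the normal equations of Eq. 3.9 can be solved directly with a dense linear solver. The Python/numpy sketch below (the function name is ours) stands in for the DSP routine developed later:

```python
import numpy as np

def lpc_from_autocorr(r, p):
    """Solve the normal equations R a = r (Eq. 3.9) for the predictor
    coefficients a_k, building the Toeplitz matrix R from the lags r(k)."""
    R = np.array([[r[abs(i - k)] for k in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:p + 1])
```

For an AR(1)-like lag sequence r = [1, 0.9, 0.81] and p = 2, the solution is a = [0.9, 0], recovering a single 0.9 predictor tap.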
3.1.2 Bandwidth Expansion
As described earlier in Chapter 1, an LPC technique for loudness enhancement is to alter the
formant bandwidths. Such a technique can be described by the following equation

A(r e^{j\omega}) = \sum_{k=0}^{p} a_k r^{-k} e^{-j\omega k} \qquad (3.11)
This procedure is based on the McCandless procedure and provides us with a way of evaluating
the z-transform over a circle with radius larger than or smaller than the unit circle r = 1. For
the case 0 < r < 1, the evaluation is on a circle with radius smaller than unity. The poles are
therefore closer to the circle than before and the contribution of the poles effectively increases.
Also, stability is a concern for the inverse filter 1/A(z), since the analytic expression for the same
may not have all its poles lying inside the circle of radius r.
For the case of r > 1, the evaluation of the z-transform is on a circle farther away from the
unit circle. The contribution of the poles decreases leading to a decrease in pole resonance peaks
and also a corresponding expansion of pole bandwidths. Moreover, the analytic expression for
the inverse filter has all its poles guaranteed to lie within the circle of radius r and hence, stability
is not a concern. Translating the evaluation of the z-transform on a circle with radius r > 1
back into the filter coefficients terms, we find that this method of bandwidth expansion simply
requires a scaling of the LPC coefficients by a power series of r. The bandwidth broadening
technique can be put in the following filter form
H(z) = \frac{A(z/\gamma)}{A(z/\beta)} \qquad (3.12)

where the bandwidth expansion factors γ and β set the level of bandwidth adjustment. Results
have shown that the optimal values are γ = 0.8 and β = 0.4.
Eq. 3.12 suggests the use of FIR and IIR filter structures for the computation of the bandwidth
expanded, loudness enhanced speech output. The numerator corresponds to an FIR analysis filter
structure whose coefficients are the LPC coefficients scaled by a power series with common ratio
γ. The denominator corresponds to an IIR synthesis filter structure whose coefficients are the
LPC coefficients scaled by a power series with common ratio β.
Thus, in the computation of the bandwidth expanded, loudness enhanced speech from the
original speech samples, we need to perform four basic steps:
1. Compute the autocorrelation coefficients
2. Use the autocorrelation coefficients to compute the LPC coefficients (using the Levinson-Durbin recursion)
3. Use the LPC coefficients and γ to build the FIR analysis structure and filter the original speech
4. Use the LPC coefficients and β to build the IIR synthesis structure and filter the output from
the previous stage using it
These steps can be more clearly elicited in Figure 3.1.
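The four steps above can also be prototyped end to end in a few lines of Python/numpy. This is an illustrative floating-point reference only: a dense solve stands in for the Levinson-Durbin recursion, and direct-form loops stand in for the assembly filters, with γ and β as given above:

```python
import numpy as np

def enhance_frame(x, p=4, gamma=0.8, beta=0.4):
    """Autocorrelation -> LPC -> gamma-scaled FIR analysis -> beta-scaled
    IIR synthesis, mirroring steps 1-4 above. Floating-point sketch only."""
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, r[1:])                    # predictor coefficients
    num = np.concatenate(([1.0], -a * gamma ** np.arange(1, p + 1)))
    den = np.concatenate(([1.0], -a * beta ** np.arange(1, p + 1)))
    e = np.convolve(x, num)[:n]                      # FIR analysis stage
    y = np.zeros(n)
    for i in range(n):                               # IIR synthesis stage
        m = min(p, i)
        y[i] = e[i] - np.dot(den[1:m + 1], y[i - 1::-1][:m])
    return y
```

Because β < 1 pulls the synthesis poles well inside the unit circle, the IIR stage remains stable for any frame whose autocorrelation matrix is positive definite.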
In the next few sections, we shall describe each of the blocks in Figure 3.1 in further detail.
Also, we shall present some results which show that the assembly output matches the MATLAB
output for these blocks.
3.2 Autocorrelation
In this section, we describe the first block of Figure 3.1, which computes the autocorrelation
of the input speech samples. This is an important step towards computing the linear predictive
coefficients.
The input speech has been sampled at 16 kHz and the autocorrelation block operates on 180-sample
windows with 50% overlap (a new window every 5.625 ms of speech). For the
current project, speech samples from the TIMIT database are chosen for evaluation purposes.
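In Python terms, the windowing and lag computation performed by this block could look like the sketch below; the frame length, hop and lag count follow the text, while the function name is ours:

```python
import numpy as np

def frame_autocorr(x, frame=180, hop=90, p=4):
    """Autocorrelation lags 0..p for 180-sample frames taken with 50%
    overlap (a hop of 90 samples), one row of lags per frame."""
    rows = []
    for start in range(0, len(x) - frame + 1, hop):
        w = x[start:start + frame]
        rows.append([np.dot(w[:frame - k], w[k:]) for k in range(p + 1)])
    return np.array(rows)
```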
Figure 3.1: Block Diagram for the Loudness Enhancement Algorithm (Autocorrelation -> LPC -> FIR Analysis Filter -> IIR Synthesis Filter)
The autocorrelation for the 180-sample window is computed using the assembly code listed in the appendix.
The autocorrelation of the speech sample window is then used in the subsequent LPC block
to compute the linear predictive coefficients using the Levinson-Durbin recursion.
3.3 Levinson-Durbin Recursion
From Eq. 3.9, it is clear that the basic problem of finding the linear predictive coefficients is
that of solving the matrix equation Ra = r. Here, R is the autocorrelation matrix, the vector a is related
to the linear prediction coefficients by Eq. 3.10, and r is the autocorrelation vector. In 1947,
Levinson  published an algorithm for solving the problem Ax = b in which A is Toeplitz,
symmetric, and positive definite, and b is arbitrary. The autocorrelation equations in Eq. 3.9
are of this form, with b having a special relationship to the elements of the matrix A. In 1959,
Durbin  published a slightly more efficient algorithm for this special case. This algorithm is
referred to as the Levinson-Durbin recursion in speech processing.
The Levinson-Durbin recursion can be stated by the following set of equations:

E^{(0)} = r(0) \qquad (3.13)

k_i = \frac{r(i) - \sum_{j=1}^{i-1} a_j^{(i-1)} r(i-j)}{E^{(i-1)}}, \quad 1 \le i \le p \qquad (3.14)

a_i^{(i)} = k_i \qquad (3.15)

a_j^{(i)} = a_j^{(i-1)} - k_i a_{i-j}^{(i-1)}, \quad 1 \le j \le i-1 \qquad (3.16)

E^{(i)} = (1 - k_i^2) E^{(i-1)} \qquad (3.17)
The first step consists of the initialization of the error term, which is done in Eq. 3.13. Thereafter,
the ith reflection coefficient is computed in Eq. 3.14. The next step involves the computation of
the ith predictive coefficient, and the previous coefficients (if any) are updated using the update
rule defined by Eq. 3.16. Finally, the last step involves the computation of the error term, and
the algorithm progresses recursively until all the linear prediction coefficients have been found.
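A direct floating-point transcription of Eqs. 3.13 through 3.17 is given below for reference; this is the algorithm the fixed-point assembly must approximate, not the assembly itself:

```python
def levinson_durbin(r, p):
    """Levinson-Durbin recursion: returns predictor coefficients a_1..a_p
    and the final prediction error E(p) from autocorrelation lags r(0..p)."""
    a = [0.0] * (p + 1)
    E = r[0]                                        # Eq. 3.13
    for i in range(1, p + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / E  # Eq. 3.14
        new_a = a[:]
        new_a[i] = k                                # Eq. 3.15
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]          # Eq. 3.16
        a = new_a
        E = (1.0 - k * k) * E                       # Eq. 3.17
    return a[1:], E
```

Note the division by E in Eq. 3.14; this is exactly the step that is awkward on a fixed-point DSP, as discussed below.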
As can be seen from the above set of equations, implementation of the Levinson-Durbin
recursion in assembly for a fixed-point DSP can be a challenge. Eq. 3.14 which calculates the
reflection coefficients needs a division to be performed in every recursion. The built-in division
routine written for the Motorola DSP 56600 provides for 32-bit dividends and 16-bit divisors.
As a result, the quotient is restricted to the [-1, 1) range. However, it is impossible to guarantee that
the numerator in Eq. 3.14 will always be less than or equal to the denominator. We, therefore,
have to look for other ways of getting around the division step. One solution is to write a
separate subroutine for the DSP which performs division in the conventional way of subtracting
the divisor from the dividend until the difference is smaller than the divisor itself. The remainder
can then be divided to compute the fractional part of the quotient, and the number of times we
need to subtract the divisor gives us the integer part of the quotient. However, this whole
process needs a large number of memory registers (9 registers, 2 accumulators, 4 data ALU input
registers), and we also have a trade-off between complexity and accuracy. Since the algorithm
relies on the accuracy of the pole locations rather than on the LPC coefficient values themselves,
it makes sense to consider an approximation algorithm for computing the coefficients: the coefficients
themselves may not be exact, but the pole locations will still be very close to the original
pole locations. Such an approximation algorithm is dealt with in Chapter 4.
The assembly code for the Levinson-Durbin recursion is listed in Appendix A.
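The subtraction-based division just outlined can be sketched in Python for positive operands as follows; the function name and the 15-bit fractional precision are illustrative choices:

```python
def divide_by_subtraction(dividend, divisor, frac_bits=15):
    """Quotient by repeated subtraction: subtract the divisor out for the
    integer part, then shift-and-subtract for frac_bits fractional bits."""
    q_int = 0
    while dividend >= divisor:       # integer part of the quotient
        dividend -= divisor
        q_int += 1
    q_frac = 0
    for _ in range(frac_bits):       # fractional part, one bit per step
        dividend <<= 1
        q_frac <<= 1
        if dividend >= divisor:
            dividend -= divisor
            q_frac |= 1
    return q_int + q_frac / (1 << frac_bits)
```

This removes the [-1, 1) quotient restriction of the built-in routine, at the cost of the extra cycles and registers noted above.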
3.4 FIR and IIR Filters
The linear prediction block as described in the previous section outputs the LPC coefficients
which form the filter coefficients for the FIR analysis filter and IIR synthesis filter with proper
scaling by corresponding power series. Building a filter (FIR or IIR) in assembly requires input
scaling and other issues to be taken care of to avoid overflow and underflow problems. Such
issues have been discussed in greater detail below. The FIR and IIR filters in assembly are found
in Appendix B.
The most basic type of filter in DSP is the FIR filter. By definition, a filter is classified as
FIR if it has the following transfer function
H(z) = \frac{b_0 z^{N-1} + b_1 z^{N-2} + \cdots + b_{N-2} z + b_{N-1}}{z^M} \qquad (3.18)

where b_k \in \mathbb{R}; N, M \in \mathbb{Z}; N > 0; z \in \mathbb{C}
This is referred to as an N-tap FIR filter. In general, an FIR filter can be either causal or
non-causal. However, FIR filters are always stable and that is the chief reason they are widely
used. The difference equation which results from the above transfer function when N = M is
y(n) = b_0 x(n) + b_1 x(n-1) + \cdots + b_{N-2} x(n-N+2) + b_{N-1} x(n-N+1) \qquad (3.19)
This is the familiar result of discrete convolution of the filter with the input data. The equations
above are the idealized, mathematical representations of an FIR filter because the arithmetic
operations of addition, subtraction, multiplication, and division are performed over the field of
real numbers (R, +, x), i.e., in the real number system. In practice, both the data values and
the coefficients are constrained to be fixed-point rationals. While this set is closed, it is not
"bit-bounded", i.e., the number of bits required to represent a value in the fixed-point rationals
can be arbitrarily large. In a practical system, one is limited to a finite number of bits in the
words used for the filter input, coefficients, and filter output. Most current DSPs provide ALUs
and memory architectures to support 16-bit, 24-bit, or 32-bit wordlengths, however, one may
implement arbitrarily long lengths by customizing the multiplications and additions in software
and utilizing more processor cycles and memory. The final choices, however, are governed by
many aspects of the design such as required speed, power consumption, SNR, cost and others.
There are generally two methods of operating on fixed-point data viz. integer and fractional.
The integer method represents data as integers and performs integer arithmetic. The fractional
method assumes the data are fixed-point rationals bounded between -1 and +1. Except for an
extra left shift performed in fractional multiplies, these two methods can be considered equivalent.
3.5 Scaling FIR Coefficients
Consider an FIR filter with N coefficients b_0, b_1, \ldots, b_{N-1}, b_i \in \mathbb{R}. In fixed-point arithmetic,
a binary word can be interpreted as an unsigned or signed fixed-point rational. Although there
are a number of situations in which the filter coefficients could be the same sign and thus could
be represented using unsigned values, let us assume that they are not and hence we must utilize
signed fixed-point rationals for our coefficients. Thus, we must find a way of representing, or
more accurately, of scaling the filter coefficients using signed fixed-point rationals.
Since a signed fixed-point rational is of the form B_i/2^b, where B_i and b are integers, -2^{M-1} \le
B_i \le 2^{M-1} - 1, and M is the wordlength used for the coefficients, we determine the estimate b'_i
of coefficient b_i by choosing a value for b and then determining B_i as

B_i = \mathrm{round}(b_i \cdot 2^b) \qquad (3.20)

b'_i = B_i / 2^b \qquad (3.21)
In general, b'_i is only an estimate of b_i because of the rounding operation. This approximation is
called coefficient quantization. The quantization error can be determined by the following

e_i = b'_i - b_i = B_i/2^b - b_i \qquad (3.22)
The question that arises then is how do we choose b? In order to answer this, note that the
maximum error e_{i,\max} a quantized coefficient can have will be one-half of the bit being rounded:

e_{i,\max} = 2^{-b}/2 = 2^{-(b+1)} \qquad (3.23)
It is now easy to see that lacking any additional criteria, the ideal value for b is the maximum
it can be since that will result in the least amount of coefficient quantization error. However,
b is from the integers, and the integers can go to infinity. Again, considering the coefficient
wordlength to be M bits, the maximum magnitude a signed two's-complement value can have is
2^{M-1} - 1. Therefore, we must be careful not to choose a value for b which will produce a B_i that
has a magnitude larger than 2^{M-1} - 1. When a value becomes too large to be represented by
the representation we have chosen, then we say that an overflow has occurred. Thus, to avoid
overflow, the value of b that will not overflow the largest-magnitude coefficient can be computed as

b = \lfloor \log_2((2^{M-1} - 1)/\max_i |b_i|) \rfloor \qquad (3.24)
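Eqs. 3.20, 3.21 and 3.24 combine into a small quantization routine; the Python below is an illustrative sketch (names are ours) for an M-bit coefficient word:

```python
import math

def quantize_coeffs(b, M=16):
    """Choose the largest scale b (Eq. 3.24) that avoids overflowing an
    M-bit signed word, then round (Eq. 3.20) and reconstruct (Eq. 3.21)."""
    scale = math.floor(math.log2((2 ** (M - 1) - 1) / max(abs(c) for c in b)))
    B = [round(c * 2 ** scale) for c in b]       # Eq. 3.20: scaled integers
    b_hat = [Bi / 2 ** scale for Bi in B]        # Eq. 3.21: quantized values
    return scale, B, b_hat
```

For dyadic coefficients such as [0.5, -0.25, 0.125] the quantization is exact, with scale 15 for a 16-bit word.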
In summary, we see that the ideal value for b is the maximum value which can be used with-
out overflowing the coefficients since that provides the minimum coefficient quantization error.
However, adding two J-bit values requires J + 1 bits in order to maintain precision and avoid
overflow. This can be easily extended to a sum of multiple values and we find that the sum of
N J-bit values requires J + [log2 N] bits to maintain precision and avoid overflow.
Let us consider an N-tap FIR filter which has L-bit data values and M-bit coefficients. Then
using the above relations, the final N-term sum required at each time interval n,
y(n) = b'_0 x(n) + b'_1 x(n-1) + \cdots + b'_{N-1} x(n-N+1) \qquad (3.25)
requires L + M + \lceil \log_2 N \rceil bits in order to maintain precision and avoid overflow. Most processors
and hardware components provide the ability to multiply two M-bit values together to form a
2M-bit result. Most general purpose and some DSP processors provide an accumulator that is the
same width as the multiplier output. Some DSP processors provide a 2M + G-bit accumulator,
where G denotes "guard bits." Therefore, another criterion in the design of FIR filters is that the
final convolution sum fit within the accumulator. To put it algebraically, we require that
2M + ceil(log2 N) <= 2M + G (3.26)
assuming that the coefficient wordlength and the data wordlength are the same (M bits). The key
point here is that the number of bits required for the filter output increases with the length of the
filter. For situations where we don't have guard bits (G = 0), we see that we immediately have
problems even for a 2-tap filter. This is precisely why the guard bits are provided: they
guard against overflow when performing summations. However, even though the accumulator
may have guard bits, it is still possible to overflow the accumulator if ceil(log2 N) > G, i.e., if we
attempt to use a filter that is longer than 2^G taps. In the current project, we have 8
guard bits and are using a 4th-order FIR filter. Therefore, we can be assured that the accumulator
will not overflow.
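The guard-bit condition of Eq. 3.26 reduces to ceil(log2 N) <= G, which can be checked directly (a minimal sketch; the function name is ours):

```python
import math

def accumulator_can_overflow(n_taps: int, guard_bits: int) -> bool:
    """Eq. 3.26: an N-term convolution sum fits in a 2M+G-bit
    accumulator iff ceil(log2 N) <= G."""
    return math.ceil(math.log2(n_taps)) > guard_bits

# The current project's case: 8 guard bits, a 4th-order (5-tap) FIR filter.
assert not accumulator_can_overflow(5, 8)
# With no guard bits (G = 0), even a 2-tap filter can overflow.
assert accumulator_can_overflow(2, 0)
```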
Consider the convolution sum shown in Eq. 3.19. The signs of x(n - i) which make the terms
b_i x(n - i) all positive will result in the largest output. This occurs when sgn(x(n - i)) = sgn(b_i).
Therefore the convolution sum can be rewritten as
y(n) = sum_i b_i x(n - i)
     = sum_i b_i (sgn(b_i)) |x(n - i)|
     = sum_i |b_i| |x(n - i)| (3.27)
If we let X_MAX denote the maximum magnitude of x(n), then the maximum sum represented
above would be
y_MAX <= sum_i |b_i| X_MAX = a X_MAX (3.28)
where a = sum_i |b_i| represents the coefficient area. Using the scaled representation format for
the coefficients (B_i = b_i 2^b_b), this can equivalently be written as
y_MAX = (sum_i |B_i|) X_MAX / 2^b_b (3.29)
Similarly, the maximum value accumulated using the scaled coefficients is
Y_MAX = 2^b_b a X_MAX (3.30)
For an A-bit accumulator for storing the output with L-bit data wordlength and coefficient area
a, the maximum value for the coefficient scale factor b_b is
b_b <= A - L - ceil(log2 a) (3.31)
To summarize, we need to maximize b_b to reduce quantization error; we need to constrain
b_b so that the coefficient with the largest magnitude is representable; and finally we need to
constrain b_b so that overflows in the convolution sum are avoided. Taking these three criteria
into consideration, the value of b_b that we seek is given by
b_b = min( floor(log2((2^(M-1) - 1)/max_i |b_i|)), A - L - ceil(log2 a) ) (3.32)
This section provided a binary mathematical point of view towards coefficient scaling to avoid
overflows in FIR filters.
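The scale-factor selection of Eq. 3.32 can be sketched as a small helper; the coefficient values in the example are illustrative only, not the project's filter:

```python
import math

def coeff_scale_factor(coeffs, m_bits, acc_bits, data_bits):
    """Eq. 3.32: the largest power-of-two scale factor b_b that keeps
    the biggest scaled coefficient representable in m_bits AND keeps
    the worst-case convolution sum inside the acc_bits-wide accumulator."""
    area = sum(abs(c) for c in coeffs)          # a = sum_i |b_i|
    largest = max(abs(c) for c in coeffs)
    b_repr = math.floor(math.log2((2 ** (m_bits - 1) - 1) / largest))
    b_acc = acc_bits - data_bits - math.ceil(math.log2(area))
    return min(b_repr, b_acc)

# Illustrative values: 16-bit coefficients and data, 40-bit accumulator.
bb = coeff_scale_factor([0.9, -0.3, 0.12, 0.05], 16, 40, 16)
# Here the representability bound (15) is tighter than the overflow bound (23).
assert bb == 15
```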
Figure 3.2 shows a zoomed in version of an overlay graph of the MATLAB and assembly
output for the autocorrelation of 180 sample windows for the sentence "She had your dark suit
in greasy wash water all year." This sentence was taken from the TIMIT database. The blue
solid line shows MATLAB output while the red dotted line shows the assembly output.
The outputs match each other within the precision of the hardware, and as such it is difficult
to discern the two plots from each other. Figure 3.3 shows another such overlay plot for the
FIR output of a single phoneme (783 samples) being passed through an FIR analysis filter. The
IIR output of the same phoneme passed through an IIR synthesis filter with bandwidth
expansion factor gamma = 0.909 does not match the MATLAB output exactly but is slightly off from
it. The difference is so small that it is not of much significance.
3.6 Warped Filter Implementation
In this section, we present a brief overview of the warped filter structure implementation
for the loudness enhancement algorithms. The warped filter technique is used to increase the
bandwidth on a critical band scale instead of a linear band scale.
Figure 3.2: Overlay of MATLAB and Assembly output for autocorrelation
As we saw in Sec. 3.1.2, the
LPC pole placement technique leads to a linear fixed increase in bandwidth independent of the
frequency. However, as discussed in Chapter 1, the loudness enhancement technique involves
increasing the bandwidth on a critical band scale. This requires an additional degree of freedom
for bandwidth adjustment. The all-pass warping factor a provides this additional degree of freedom.
Eq. 3.11 shows how the z-transform can be evaluated over a circle with radius r for a given
set of LPC coefficients. The radius determines the amount of bandwidth expansion and this is
fixed over the entire frequency scale. However, it would be desirable to introduce some kind of
non-linearity in the bandwidth expansion based on the critical band concept for the human auditory
system.
Figure 3.3: Overlay of MATLAB and Assembly output for FIR filter
This non-linearity is introduced by warping the frequency scale. Warping refers to
alteration of the frequency scale. In simpler terms, it refers to a stretching or compression of
the frequency scale. Warping can be represented by a functional one-to-one mapping of the unit
circle onto itself. The mapping function itself lies in the z domain, and the following mappings
define the relation between the z domain and the warped z (referred to as z~) domain:
z~ = f(z)
z = g(z~) (3.34)
The bilinear transform is one such one-to-one mapping which is easily invertible too. It
corresponds to the first-order all-pass filter as shown below:
z~^(-1) = (z^(-1) - a)/(1 - a z^(-1)),    -1 < a < 1 (3.35)
All-pass systems pass all frequencies with unit magnitude.
They are mainly used to compensate for group-delay distortions. In the case of warped
filter structures, the ability of all-pass systems to distort the phase is used favorably to alter the
frequency scale. The factor a is the dispersive delay element and sets the degree of frequency warping. The
dispersive elements inject frequency dependence of digital filter outputs thereby resulting in a
non-uniform frequency resolution. The z-transform in the warped domain with respect to the
warped frequency scale is the same as the z-transform in the normal frequency domain. The
warped filter structures can be found in greater detail in .
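The effect of the all-pass mapping of Eq. 3.35 on the frequency axis can be sketched numerically. The closed-form phase expression below is the standard one for a first-order all-pass section; the sample value of a is an illustration, not the thesis's choice:

```python
import math

def warped_freq(omega: float, a: float) -> float:
    """Normalized frequency (radians) seen through the first-order
    all-pass of Eq. 3.35; a = 0 gives the identity mapping, and 0 and
    pi are fixed points of the mapping for any |a| < 1."""
    return omega + 2.0 * math.atan2(a * math.sin(omega),
                                    1.0 - a * math.cos(omega))

assert abs(warped_freq(1.0, 0.0) - 1.0) < 1e-12   # a = 0: no warping
assert abs(warped_freq(math.pi, 0.5) - math.pi) < 1e-9
# A positive a stretches the low-frequency end of the axis:
assert warped_freq(0.5, 0.5) > 0.5
```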
In the next chapter, we will discuss how we modified the original algorithm to overcome the
challenges encountered in implementation of the Levinson-Durbin recursion.
LMS: THE SOLUTION TO IMPLEMENTATION ISSUES
In the previous chapter, we discussed the various building blocks to be implemented for the
bandwidth expansion technique for loudness enhancement. These were the autocorrelation, LPC,
FIR and IIR filter blocks respectively. We also presented some overlay graphs which showed that
each of these blocks worked perfectly well. However, we had overflow problems in the division
routine in the LPC block, as was described in that chapter. The current chapter deals with
finding a solution to this issue and with implementing that solution in assembly for
the Motorola DSP 56600.
4.1 The Solution
In the previous chapter, we saw that to compute the filter coefficients for the FIR and IIR
filters, we needed to first compute the linear predictive coefficients (LPC). These coefficients
then appropriately scaled by the radius terms constituted the filter coefficients. The Levinson-
Durbin algorithm as shown in Eqs. 3.13-3.17 involved computation of the reflection coefficients
k_i. This involved the use of a division routine for the fixed-point Motorola DSP 56600. The
main problem associated with the division routine was the overflow upon division which led to
recursive errors as the Levinson-Durbin is a recursive method of computing the coefficients. This
obviously affects the pole locations for the FIR and IIR filters and since these pole locations are
crucial in determining the formant locations for the re-synthesized speech, it becomes necessary
to be able to obtain the linear predictive coefficients with greater accuracy. Moreover, if we can
avoid the division routine in the computation of the LPC coefficients, then we can guarantee the
solution to converge without overflow problems. The LMS is one such elegant algorithm which
makes use of a feedforward structure for estimating the filter coefficients. The details about
the LMS algorithm are provided in the coming sections and we shall show that this algorithm
performs considerably better than the Levinson-Durbin on a fixed-point DSP.
4.2 Least Mean Squares (LMS) Algorithm
In 1960, Widrow and Hoff developed a widely used algorithm called the Least Mean Square
(LMS) algorithm. The method of steepest descent uses a fixed gradient in the recursive computa-
tion of the Wiener filter for stochastic inputs. In contrast, the LMS uses a "stochastic
gradient" in this computation, and hence the LMS is an important member of the family of stochastic
gradient algorithms. The most salient features of the LMS algorithm are its simplicity and the
fact that it requires neither knowledge of the pertinent correlation functions nor any matrix inversion.
This simplicity has made the LMS algorithm a standard against which other linear adaptive
filtering algorithms are benchmarked.
The LMS algorithm consists of two basic steps:
1. A filtering process involving the computation of a filter output in response to a specified
input and then generating an estimation error by computing the difference between a
desired signal and the output of the filter.
2. An adaptive process wherein the parameters of the filter (filter coefficients) are automatically
adjusted based on the estimation error.
These two steps together can be depicted by the feedback loop shown in Figure 4.1.
First, we have the transversal filter which is responsible for the filtering process and next we
have an adaptive weight control block which performs an adaptive control mechanism on the filter
coefficients. The transversal filter consists of an Mth-order feedforward structure with M - 1 delay
elements, M tap inputs and M weights for each of the inputs. During the filtering process, the
desired response d(n) is provided for processing besides the input vector u(n). The transversal
filter produces an estimate dest(n) for the desired signal. Based on this estimate we can compute
an estimation error e(n). The estimation error along with the input u(n) are then applied to
the adaptive control mechanism to estimate the new set of tap weights for the transversal filter.
Figure 4.1: LMS Block Diagram
The LMS algorithm uses the product u(n - k)e*(n) as an estimate for the kth element in the
gradient vector, nabla J(n), that characterizes the method of steepest descent .
Stability might be a concern since the LMS filter involves feedback. In this context, a mean-
ingful criterion is to require that
J(n) -> J(infinity) as n -> infinity (4.1)
where J(n) is the mean-square error produced by the LMS filter at time n and its final value
J(infinity) is a constant. For the LMS algorithm to satisfy this criterion, the step-size parameter mu
has to satisfy a certain condition related to the spectral content of the tap inputs.
The difference between the final value J(infinity) and the minimum value J_min attained by the
Wiener-Hopf solution is called the excess mean-square error J_ex(infinity). This difference represents
the excess price paid for using the adaptive LMS approach for computing the filter weights as
compared to a deterministic approach as in the method of steepest descent. The ratio of J_ex(infinity)
to J_min is called the misadjustment. It is a measure of how far the solution of the LMS filter
approach is away from the Wiener solution. However, the misadjustment can be controlled by
the proper choice of the step-size parameter mu. The misadjustment is related to the step-size as
M = (mu/2) tr[R] (4.2)
where R is the autocorrelation matrix of the tap inputs.
The LMS filter is simple in implementation but at the same time is very strong in delivering
high performance due to its ability to adapt to the external environment. However, we have to
pay special attention to the proper choice of the step-size parameter mu. The LMS algorithm can be
derived from the steepest descent algorithm by replacing the gradient vector by its instantaneous
estimate. The derivation can be found in greater detail in . The LMS algorithm in its final
form comprises the following three equations:
y(n) = w^H(n) u(n) (4.3)
e(n) = d(n) - y(n) (4.4)
w(n + 1) = w(n) + mu u(n) e*(n) (4.5)
Eq. 4.3 computes the filter output and represents the filtering process. In Eq. 4.4, the error
is estimated on the basis of the current desired signal and finally the filter tap weights are
updated in Eq. 4.5. These equations represent the LMS algorithm in its complex form. We find
that the LMS algorithm requires 2M + 1 complex multiplications and 2M complex additions per
iteration, where M is the number of tap weights used in the transversal filter. In other words,
the computational complexity of the LMS algorithm is O(M), which is much easier to implement
in a DSP as opposed to the complex Levinson-Durbin algorithm as shown in Chapter 3. The
LMS algorithm can be used for a wide variety of applications. Some of the most commonly used
applications of the LMS are adaptive noise cancellation, adaptive beamforming, adaptive line
enhancement and linear prediction. In the next section, we describe the application of the LMS
algorithm in the determination of linear prediction coefficients of speech.
4.3 Linear Prediction Using LMS
Recalling from Chapter 3, linear prediction of speech involves the estimation of the current
speech sample using previous speech samples. As shown in Eq. 3.5, s(n) is the estimate for the
current speech sample based on the previous speech samples s(n-1), s(n-2), ..., s(n-p) for a
linear predictor of order p. Based on the discussion in the previous section, if we delay the input
sequence by one sample and feed the resulting vector u(n-1) as input to the transversal filter
with a desired response d(n) = u(n), then the resulting tap weight vectors for a suitable step-size
parameter mu and an appropriate number of passes (which depends on the speed of the DSP)
would closely match the true linear predictive coefficients. Clearly, this is an elegant approach
to finding the linear predictive coefficients as compared to the Levinson-Durbin recursion within
the limitations of the fixed-point DSP.
Figure 4.2 shows the transversal filter structure with a feedback loop for computing the linear
predictive coefficients adaptively. The input u(n) is first delayed by one sample and then followed
by the transversal feedforward structure. Also, the current input u(n) serves as the desired signal
and the filter tap weights (linear predictive coefficients) are updated accordingly.
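The structure just described can be sketched in a few lines (a real-valued sketch of Eqs. 4.3-4.5 under our own parameter choices; the test signal below is a synthetic cosine, not the thesis's TIMIT data):

```python
import math

def lms_predict(u, p=4, mu=0.2, passes=20, mu_scale=0.99):
    """One-sample-delayed LMS predictor: the desired signal is u(n),
    the taps are u(n-1)..u(n-p), so the converged weights approximate
    the linear prediction coefficients."""
    w = [0.0] * p
    for _ in range(passes):                        # several passes per frame
        for n in range(p, len(u)):
            x = [u[n - 1 - k] for k in range(p)]   # delayed tap vector
            y = sum(wk * xk for wk, xk in zip(w, x))        # output, Eq. 4.3
            e = u[n] - y                                    # error, Eq. 4.4
            w = [wk + mu * xk * e for wk, xk in zip(w, x)]  # update, Eq. 4.5
        mu *= mu_scale                             # shrink the step size each pass
    return w

# A cosine obeys u(n) = 2cos(w0) u(n-1) - u(n-2) exactly, so a 2-tap
# predictor should recover those two coefficients.
u = [math.cos(1.0 * n) for n in range(500)]
w = lms_predict(u, p=2, mu=0.05, passes=5)
# w[0] ~ 2cos(1) ~ 1.08, w[1] ~ -1
```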
4.4 Experimental Results
In this section, we present the results obtained from simulating the LMS algorithm for com-
putation of the linear predictive coefficients on the Motorola DSP 56600. The assembly code
for the LMS algorithm can be found in Appendix C. Sentences were taken from the TIMIT
database. Speech is sampled at 16 kHz and sentences are broken into 180-sample windows with
50% overlap. A set of four linear predictive coefficients is computed for each frame.
Figure 4.2: Linear Prediction using LMS
We use the LMS algorithm with an initial step-size of mu = 0.2 with an automatic update of mu for each
pass of the frame. For each pass, mu is scaled down by a factor of 0.99. This ensures we have a
smaller misadjustment towards the latter passes. The idea is to converge faster to the correct
solution in the earlier passes and then to achieve better accuracy using a smaller step-size in
the latter passes. We perform 20 passes for each frame, which still leaves us with enough time
to perform computations on the current frame before the next frame arrives for the Motorola
DSP 56600 running at a 60 MHz clock. The number of clock cycles for each of the Autocorrelation,
Levinson-Durbin, LMS, Modified signed LMS, FIR and IIR filter blocks, along with their
durations for each frame of data, are tabulated in Table 4.1.
Figure 4.3 shows the LPC value tracks (which are simply the negative of the weight values)
for 260 frames of speech, each frame being 180 samples long, for the sentence "She had your dark
suit in greasy wash water all year."
Figure 4.3: LPC value tracks for 260 frames of 180 samples each with 20 passes, for the sentence "She had your dark suit in greasy wash water all year"
These LPC values do not match the true LPC values exactly, as is expected. However, if we
look at the variation in pole locations with each pass and the final location of the poles, we
see that the poles match the original poles very closely. This is highly desirable as the poles
are crucial in determining the formant locations rather than the LPC values themselves. The
pole tracks for the first pass for the first frame in the sentence under consideration are shown in
Figure 4.4 followed by Figure 4.5 which shows the pole location variations in the second pass. It
is clear from the figures that the poles begin to stabilize with increasing number of passes.
For the first few frames of speech (typically silence), the coefficient updates resulting from the
LMS algorithm are too small to be accurately represented by the limited 16-bit precision of
the DSP. As such, for the first few frames, the coefficients stay at zero on the DSP, in contrast
to MATLAB, which uses a 64-bit floating-point representation. Due to this underflow problem,
the coefficients do not match exactly but are close to each other. Table 4.2 shows the mean and
variance of the true pole locations and the assembly-simulated pole locations.
Figure 4.4: Pole value tracks for the first frame over 176 iterations on the first pass
An overlay plot for the true LPC pole locations and the assembly pole locations for all the
20 passes for the 201st frame are shown in Figure 4.6.
It is clear from the above results that the LMS algorithm turns out to be an elegant approach
to finding the linear predictive coefficients circumventing the division overflow problem which led
to recursive errors in the computation of the coefficients. Besides, the LMS also requires fewer
clock cycles per pass for each frame as compared to the Levinson-Durbin recursion. Moreover,
we see that the whole algorithm can be completely executed on each frame of data in 5.62 ms
Table 4.1: Clock cycles for Levinson-Durbin, LMS, FIR and IIR filter blocks
Block Number of Clock Cycles Execution Time
Autocorrelation 105372 1.756ms
Levinson-Durbin 25075 0.418ms
LMS 310985 5.18ms
Modified signed LMS 371663 6.19ms
FIR 12423 0.21ms
IIR 13637 0.23ms
Overall using LMS 337045 5.62ms
Overall using modified signed LMS 397723 6.63ms
Overall using Levinson-Durbin 156507 2.61ms
which still leaves us with plenty of time to account for the external data interface operations for
a 180-sample frame of speech being sampled at 16 kHz.
However, from Table 4.2, we see that although the variances in the pole radii and angles are
very small, the mean pole locations using LMS are slightly off from the true pole locations.
This suggests that the variance in pole locations is not a good measure of performance. There-
fore, we looked at the root mean squared error values of the pole radii and angles, which tell us
how far we are from the true pole locations when using the LMS algorithm to compute the
linear prediction coefficients. Also, since the variances in pole locations were so small, the updates
in the weight values started to fall below the minimum machine precision of the hardware. This
led us to the development of the modified signed LMS algorithm, which is a blend between the
LMS algorithm and the signed LMS algorithm. The modified signed LMS can be described by
the following weight update equation:
w(n + 1) = w(n) + sgn(mu u(n) e*(n)) max(eps, |mu u(n) e*(n)|) (4.6)
As can be seen from this equation, the signed LMS algorithm (in which every update has magnitude eps) is a special case of the
modified signed LMS. The objective in using the modified signed LMS was to force an update
equal to the minimum precision of the hardware (eps) when an underflow occurred.
Figure 4.5: Pole value tracks for the first frame over 176 iterations on the second pass
Whenever
the update is large enough to be correctly represented using the 16-bit DSP, the modified signed
LMS algorithm switches back to the LMS algorithm. This algorithm turned out to be a very good
replacement for the Levinson-Durbin recursion and also exhibited very low root mean squared
error values for the pole locations as compared to the LMS algorithm. Table 4.3 shows the root
mean squared error values for the radius and angle locations for the 4 poles corresponding to the
4 LPC coefficients which have been computed in the current project for each frame of speech
samples. The table clearly shows that the modified signed LMS algorithm yields pole-location
errors roughly half those of the LMS algorithm. The assembly code for the modified signed
LMS is included in Appendix E.
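The weight update of Eq. 4.6 can be sketched as follows (real-valued; the eps value is an assumption standing in for the 16-bit fractional machine precision, and all names are ours):

```python
import math

EPS = 2.0 ** -15   # assumed smallest positive step in 16-bit fractional arithmetic

def modified_signed_lms_step(w, x, e, mu, eps=EPS):
    """One weight update of the modified signed LMS, Eq. 4.6: the step
    magnitude is clamped below at eps so it never underflows to zero;
    for large updates this reduces to the plain LMS step."""
    out = []
    for wk, xk in zip(w, x):
        upd = mu * xk * e
        if upd != 0.0:
            upd = math.copysign(max(eps, abs(upd)), upd)
        out.append(wk + upd)
    return out

# A tiny gradient still moves the weight by +/- eps instead of vanishing:
w = modified_signed_lms_step([0.0], [1e-9], 1e-9, mu=0.2)
assert w[0] == EPS
# A large gradient falls back to the ordinary LMS update:
w = modified_signed_lms_step([0.0], [0.5], 0.1, mu=0.2)
assert abs(w[0] - 0.01) < 1e-12
```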
Table 4.2: Mean and Variance of pole locations
Pole       mean r    mean theta (radians)   var(r)       var(theta)
True1      0.536      0.7055                0            0
True2      0.536     -0.7055                0            0
True3      0.4346     2.1336                0            0
True4      0.4346    -2.1336                0            0
Assembly1  0.5719     0.6835                6.46x10^-6   4.438x10^-6
Assembly2  0.5719    -0.6835                6.46x10^-6   4.438x10^-6
Assembly3  0.4799     2.1738                1.51x10^-5   5.395x10^-6
Assembly4  0.4799    -2.1738                1.51x10^-5   5.395x10^-6
Figure 4.7 shows the LPC value tracks obtained using the modified signed LMS for 260 frames
of speech, each frame being 180 samples long, for the sentence "She had your dark suit in greasy
wash water all year." When compared to Figure 4.3, we see that the linear prediction coefficients
from assembly match the true values from MATLAB more closely.
Table 4.4 illustrates a comparison of the root MSE for the modified signed LMS algorithm
with the number of passes for each frame. The table indicates that with as few as 10 passes,
the modified signed LMS algorithm yields pole locations which are quite close to the true pole
locations. This can lead to additional reduction in execution time.
These results clearly indicate that the modified signed LMS is a very good replacement for the
Levinson-Durbin recursion for computing the linear prediction coefficients. The total execution
time for the overall algorithm using the modified signed LMS in place of the Levinson-Durbin
recursion is tabulated in Table 4.1.
Figure 4.6: Pole value tracks for true LPC values and LMS-simulated LPC values, for all 20 passes of the 201st frame
Table 4.3: Comparison of root MSE values of pole locations using LMS and modified signed LMS
Value     Root MSE for LMS   Root MSE for modified signed LMS
r1        0.0408             0.0205
r2        0.0408             0.0205
r3        0.04               0.0177
r4        0.04               0.0177
theta1    0.0321             0.0184
theta2    0.0321             0.0184
theta3    0.0522             0.0388
theta4    0.0522             0.0388
Table 4.4: Comparison of root MSE values of pole locations using the modified signed LMS
algorithm vs. number of passes
Root MSE for modified signed LMS (rows correspond to increasing numbers of passes):
r4        theta1    theta2
0.0214    0.5415    0.5415
0.0188    0.0218    0.0218
0.0185    0.0210    0.0210
0.0177    0.0184    0.0184
Figure 4.7: Overlay plot showing LPC values from MATLAB and LPC values from assembly using the modified signed LMS
CONCLUSIONS AND FUTURE WORK
The primary goal of this thesis has been to take a step towards implementing the loudness
enhancement techniques in real-time on a Motorola DSP 56600, which is a 16-bit fixed-point
DSP. Experimental results revealed the challenges in implementing the Levinson-Durbin recur-
sion in a fixed point DSP and the LMS algorithm was presented as an elegant solution to this
problem. We also showed that the FIR and IIR filtering processes needed input scaling to prevent
overflows and underflows and a binary mathematical discussion was presented. Results indicate
that the LMS performs very well in comparison to the Levinson-Durbin recursion within the
limitations of the underlying hardware. An analysis of the computation time in terms of number
of clock cycles was also presented. This analysis is crucial to ensure that the implemented algo-
rithm can run in real-time. Results showed that the bandwidth expansion technique which has
been implemented leaves us with sufficient time to process a single frame before the next frame
arrives. Sentences were taken from the TIMIT database, which is a testing standard for most
speech enhancement and recognition systems. Input speech is sampled at 16 kHz and is
broken up into 180-sample windows with 50% overlap. These frames are then processed by the
Motorola DSP 56600 running at a 60 MHz clock, taking a total of 5.62 ms to process a single
frame.
Since this has been a step towards implementing the whole loudness enhancement algorithm
described in , we have focused our efforts on implementing the preliminary form of a linear
bandwidth expansion as developed in . This leaves us with the scope of implementing the
warped filter structure which incorporates the psychoacoustic nature of the human auditory sys-
tem to achieve a non-linear bandwidth expansion. Besides this, we can also focus on improving
the efficiency of the current algorithm to make it run faster on the DSP. Another area of research
that can be investigated is the comparison between the Levinson-Durbin recursion implemen-
tation as opposed to the LMS implementation when a separate custom subroutine to perform
40-bit division has been written for the current DSP. The two algorithms can be compared for
the complexity of implementation versus the accuracy of the solution. Also, we can investigate
writing dynamic algorithms for preventing underflow for the initial frames of the LMS when the
coefficient updates are very small and also preventing overflow for the latter frames when the
coefficients themselves increase beyond the [-1,1) range.
These algorithms running in real-time will form an important component of most state-of-the-art
cellular phone technology in years to come. The most important advantage of these
algorithms is the ability to increase loudness at the same energy level, thereby saving considerably
on battery life, which for a consumer is one of the most important factors in deciding
to buy one particular model over another.
ASSEMBLY CODE FOR LEVINSON-DURBIN
This appendix contains the assembly Levinson-Durbin recursion code which we have imple-
mented for the Motorola DSP 56600.
MATLAB code for Durbin recursion:
k = quant16(rr(2)/rr(1));
a0 = 1; ar(2) = quant16(rr(2)/rr(1)*a0);
e = quant16((1 - (rr(2)/rr(1))^2)*rr(1));
k = quant16((rr(i) - sum(ar(2:i-1)'.*rr(i-1:-1:2)))/e);
ar(2:i-1) = quant16(ar(2:i-1) - k*fliplr(ar(2:i-1)));
; Begin Durbin Algorithm
abs a a,b
eor x0,b #acoeffs+1,r4
L1 move a0,a
clr b a,x:(r1)+ a,y0
macr -y0,y0,b #2,r7
move b,x1 a,y:(r4)+
mpyr x1,x0,a #2,n7
; outer do loop (note: alpha = y1)
clr a x:(r2)-,x0 y:(r5)+,y0
; inner do loop #1 (note: r7 = i)
mac x0,y0,a x:(r2)-,x0 y:(r5)+,y0
; back to outer do loop (note: error = a)
macr -x0,x0,b r4,r6
mpyr x1,y1,b (r6)-n6
; inner do loop #2 (note: r7 = (i
macr x0,y0,a y:(r6)-,y0
; end of inner do loop #2
move x:(r3)-,x0 y:(r5)+n5,y0
; end of inner do loop #3
; inner do loop #3 (note: r7
; end of outer do loop
ASSEMBLY CODE FOR IIR AND FIR FILTERS
This appendix lists the assembly code for the IIR and FIR filters that have been implemented
for the Motorola DSP 56600.
x0,y0,a x:(r0)+,x0 y:(r4)-,y0
ASSEMBLY CODE FOR LMS ALGORITHM
This section lists the assembly code for the LMS algorithm implemented for the Motorola
; This program calculates the LPC coefficients using the LMS
nk equ 4
mu equ 0.2 ;0.2/4
scaler equ 0.99
win equ 176
len equ 180
npass equ 20
nframes equ 260
samp ds nk
newsamp ds nk
data ds len
desired ds win
input_ptr dc 1
input_ptrl dc 1
output_ptr dc 1
acoeffs ds nk
clr b x:(r0)+,x0 y:(r4)+,y0
mac x0,y0,b x:(r0)+,x0 y:(r4)+,y0
ASSEMBLY CODE FOR AUTOCORRELATION
This section lists the assembly code for the autocorrelation of speech using Motorola DSP
; input x:input_ptr boblib.io
; stated in command script
length is 2*M-1 and 0 index
;inner mac loop
; Second half of correlation
; Can be commented out since
; xcorrindex goes 1:WINLEN/2
; first half of correlation
; correlation vector returned by acorr()
ASSEMBLY CODE FOR MODIFIED SIGNED LMS ALGORITHM
In this section, we present the assembly code for computing the linear prediction coefficients
using the Modified Signed LMS algorithm.
; This program calculates the LPC coefficients using the modified
signed LMS algorithm
nk equ 4
mu equ 0.2 ;0.2/4
scaler equ 0.99
win equ 176
len equ 180
npass equ 20
nframes equ 260
samp ds nk
newsamp ds nk
data ds len
desired ds win
input_ptr dc 1
input_ptrl dc 1
output_ptr dc 1
acoeffs ds nk
clr b x:(r0)+,x0 y:(r4)+,y0
mac x0,y0,b x:(r0)+,x0 y:(r4)+,y0
move x:(r0)+,x0 y:(r4)+,a
 A. Agrawal and W. Len. Aspects of voiced speech parameters on the intelligibility of Peterson-
Barney words. J. Acoust. Soc. Am., 57(1):217-222, 1975.
 M. A. Boillot. A warped filter implementation for the loudness enhancement of speech. PhD
dissertation, University of Florida, May 2002.
 J. Durbin. Efficient estimation of parameters in moving-average models. Biometrika, 46:306
 C. Galand, J. Menez, and M. Rosso. Adaptive code excited linear prediction. IEEE Trans-
actions on Signal Processing, 40(6):1317-1326, 1992.
 W. Hartmann. Signals, Sound, and Sensation. Springer, New York, 1998.
 S. Haykin. Adaptive Filter Theory. Prentice-Hall Inc., Upper Saddle River, New Jersey.
 J. Hillenbrand, L. Getty, M. Clark, and K. Wheeler. Acoustic characteristics of American
English vowels. J. Acoust. Soc. Am., 97(5):3099-3111, 1995.
 N. Levinson. The Wiener RMS (root mean square) error criterion in filter design and
prediction. Journal of Mathematical Physics, 25:261-278, 1947.
 J. Markel and A. Gray. Linear Prediction of Speech. Springer-Verlag, Berlin, New York.
 M. S. Martinez, A. Black, and A. Kondoz. Effects of finite-precision conversion on linear
predictive coefficients. IEEE Proc.-Vis. Image Signal Process., 147(5):415-422, 2000.
 S. McCandless. An algorithm for automatic formant extraction using linear predictive spec-
tra. IEEE Trans. on Acoustics, Speech and Signal Proc., ASSP-22:135-141, 1974.
 Motorola Inc. DSP 56600 16-bit Digital Signal Processor Family Manual, Austin, Texas.
 L. Rabiner and B. Juang. Fundamentals of Speech Recognition. Prentice-Hall Inc.,
Englewood-Cliffs, New Jersey, 1993.
 E. Zwicker and H. Fastl. Psychoacoustics: Facts and Models. Springer, Berlin, New York, 1998.
Adnan H. Sabuwala was born in Bombay, India, on 18th November 1978. He completed his
schooling from the Versova Welfare High School and joined the Sathaye College, Vile Parle, for
his high school studies. In July of 1996 he was admitted to the Indian Institute of Technology,
Bombay (IIT-B), to the Department of Electrical Engineering. He graduated with a B.Tech
degree from the IIT in August 2000 and joined the Department of Electrical and Computer
Engineering at the University of Florida in Fall 2000. Since January 2001, he has been working
as a research assistant for Dr. John G. Harris in the Computational Neuro-Engineering Lab
where he completed his master's thesis on "Towards a Real-Time Implementation of Loudness
Enhancement Algorithms on a Motorola DSP 56600."