Extraction of Information Dimensions for Phonemes and Its Application to Speech Detection & Segmentation
Chang-Young Lee ∗
Division of Information-System Engineering, Dongseo University, Busan 617-716, Korea (Received 30 April 2013 : revised 29 May 2013 : accepted 15 July 2013)
As an application of dimensional analysis in the field of chaos and fractals, we estimate information dimensions for speech phonemes. By constructing phase space vectors from the time-series data of speech signals, we calculate the natural measure and Shannon’s information. The information dimension is then obtained as the slope in the plot of the information versus space division order.
A notable feature of the result is the drastic change in the information dimension at the onset of speech production. We expect that the information dimension might be utilized as a valuable tool for speech detection and segmentation.
PACS numbers: 43.72.Dv, 43.72.Ew
Keywords: Strange attractor, Natural measure, Information dimension, Speech detection, Speech segmentation
I. INTRODUCTION
As a method of communication between man and machine, speech recognition provides a very effective interface. Speech input to a machine is about twice as fast as information entry by a skilled typist [1]. Speech recognition is now so familiar that many applications, such as handheld mobile devices, make use of state-of-the-art recognition technology [2].
For several decades, the subject of chaos [3] has been of much interest in numerous academic fields. For this reason, it has been called ‘one of the greatest discoveries of the 20th century’. Even though enormous progress has been made in this field for various systems and problems, the subject is still said to be ‘unsettled’. One of the reasons is that the subject matter is too diverse for systematic integration [4]. The study of fractals, geometrical objects related to chaos, has revealed hidden structures inherent in nature. An important topic in conjunction with fractals is dimensional analysis, which concisely measures the complexity of geometrical objects or chaotic dynamical systems.
∗
E-mail: [email protected]
As an application of the theory of chaos and fractals, we construct strange attractors via reconstruction of phase space vectors [5] and calculate information dimensions for speech phonemes from them. It might be inferred that the extracted parameters are related in some way to the mechanical structure and dynamical movements of the human vocal tract, considered as a dynamical system.
It is known that a random generator enters into the speech production of most consonants and reflects in some way the dynamics of the vocal tract [6]. When we consider it as a chaotic dynamical system, an important parameter characterizing the system is the Lyapunov exponent [7]. Its concept has been generalized to such a degree that it applies to many interesting dynamical systems in mathematics and science, and it has become one of the keys to understanding the chaotic behavior underlying a system. However, the calculation procedure is not directly applicable to time-series data such as speech signals. Therefore, we turn to the so-called information dimension, which serves as a lower bound for the fractal dimension that is usually computed by the box-counting procedure [8].
The organization of this paper is as follows. After a brief review of the procedure for extracting the information dimension in Section II, its application to speech signals is given in Section III. Experimental results and implications regarding speech detection and segmentation are given in Section IV, followed by concluding remarks in Section V.
II. EXTRACTION OF INFORMATION DIMENSION
Given a time-series data

$$x_0,\ x_1,\ x_2,\ \cdots \tag{1}$$

we construct two-dimensional (2D) phase space vectors

$$v_0 = (x_0, x_T), \quad v_1 = (x_{\tau}, x_{\tau+T}), \quad v_2 = (x_{2\tau}, x_{2\tau+T}), \ \cdots. \tag{2}$$

Though the phase space vectors are generally formed in a larger-dimensional space, our study will be performed in 2D for the sake of simplicity. In other terms, the vectors of Eq. (2) might be viewed as projections of larger-dimensional ones onto 2D subspace.
In Eq. (2), τ is a kind of sampling period that enters in conjunction with the analog-to-digital (A/D) conversion frequency, and it therefore cannot be large. The parameter T is expected to have an important effect on the subsequent results. If T is too small, the first and second components of the phase space vectors have nearly equal values, and the resulting trajectory of vectors forms a nearly straight line. If T is too large, on the other hand, there is no correlation between the two components, and we would fail to find any structure hidden in the time-series data.
In this paper, the values of τ = 3 and T = 5 are adopted.
The investigation as to the effect of τ and T will not be pursued in the present study.
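The construction of Eq. (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the author's code: the function name `delay_embed` and the rule of stopping once the second index would run past the end of the data are assumptions.

```python
def delay_embed(x, tau=3, T=5):
    """Build 2D phase space vectors v_k = (x[k*tau], x[k*tau + T]) as in Eq. (2).

    tau and T default to the values adopted in the text. Dropping vectors
    whose second index would exceed the data is an assumed stopping rule.
    """
    return [(x[i], x[i + T]) for i in range(0, len(x) - T, tau)]


# Toy series 0, 1, 2, ... so each component shows the index it came from.
vectors = delay_embed(list(range(20)))
print(vectors[:3])  # [(0, 5), (3, 8), (6, 11)]
```

With T small relative to the time scale of the signal, the two components are nearly equal and the vectors crowd along the diagonal, as described above.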
We divide the phase space into an array of squares of size $s = 2^{-k}$ and label each square by $B_{ij}$. Then we define the natural measure by

$$\mu_{ij} = \frac{1}{n} \sum_{k=0}^{n-1} I[v_k \in B_{ij}] \tag{3}$$
where the indicator function I[·] is 1 if its argument is true and 0 otherwise.
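As a rough sketch of Eq. (3), the natural measure can be estimated by counting how many vectors fall into each box. The helper name `natural_measure`, the use of the bounding rectangle of the data as the region to divide, and the convention of assigning points on the upper edge to the last box are assumptions made for illustration.

```python
from collections import Counter


def natural_measure(vectors, k):
    """Estimate mu_ij of Eq. (3) on a (2**k x 2**k) grid over the bounding rectangle."""
    m = 2 ** k
    xs = [v[0] for v in vectors]
    ys = [v[1] for v in vectors]
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)

    def cell(value, lo, hi):
        if hi == lo:              # degenerate axis: everything falls in box 0
            return 0
        idx = int((value - lo) / (hi - lo) * m)
        return min(idx, m - 1)    # assumed convention: upper edge goes to the last box

    counts = Counter((cell(x, x_lo, x_hi), cell(y, y_lo, y_hi)) for x, y in vectors)
    n = len(vectors)
    return {box: c / n for box, c in counts.items()}


# Four toy vectors on a 2 x 2 grid: two share the lower-left box.
mu = natural_measure([(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.9, 0.2)], k=1)
print(mu)
```

Only occupied boxes are stored, which matches the later statement that sums run over the boxes containing vectors.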
In Eq. (3), the number of vectors n used to estimate the natural measure should in principle be infinite in order for the measured quantities to be ergodic [9]. However, it is limited in practice since we are dealing with digital signals of finite length. The actual situation is even worse: the frame length n should be small for the analysis to be meaningful on a short-term basis; otherwise, the analysis is blurred by averaging over time. For short-term analysis, the time duration is usually taken to be tens of milliseconds [10]. For a 16 kHz sampling rate, n = 512 is usually taken as the analysis frame length, in conjunction with the fast Fourier transform (FFT), which requires n to be a power of 2. This frame length corresponds to a duration of 32 ms and is quite common in speech processing technology [11]. We adopt this convention in this paper.
The next step is to calculate the Shannon’s information

$$I(s) = -\sum_{ij} \mu_{ij} \log_2(\mu_{ij}) \tag{4}$$
for square size s [12]. The sum is over all the boxes encompassing the vectors. We then plot the resultant data and obtain, by linear regression, the best line (linear function) expressed by
$$I(s) = I_0 + D_I \log_2 \frac{1}{s} \tag{5}$$

with $D_I$ the slope of the line.
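Eqs. (4) and (5) amount to an entropy computation followed by an ordinary least-squares line fit. The sketch below illustrates both steps; the function names and the sample information values are hypothetical and serve only to show the mechanics.

```python
import math


def shannon_information(measures):
    """Eq. (4): I = -sum(mu * log2(mu)) over the occupied boxes."""
    return -sum(mu * math.log2(mu) for mu in measures if mu > 0)


def fit_line(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Four equally likely boxes carry 2 bits of information.
print(shannon_information([0.25, 0.25, 0.25, 0.25]))  # 2.0

# Hypothetical I(s) values at log2(1/s) = 1, 2, 3, 4 (made-up numbers).
log_inv_s = [1, 2, 3, 4]
infos = [1.5, 2.5, 3.5, 4.5]
I0, D_I = fit_line(log_inv_s, infos)
print(I0, D_I)
```

In this made-up example the information grows by one bit per halving of the mesh size, so the fitted slope, which plays the role of the information dimension, is 1.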
In order to scrutinize the structure of the phase space vectors and extract more information from it, we should look into the space more closely by zooming in. $D_I$ characterizes the amount of additional information gained by increasing the resolution of the phase space, i.e., by decreasing the mesh size s. For this reason, the quantity $D_I$ is called the ‘information dimension’. It provides a lower bound for the fractal dimension of the given geometry as commonly obtained by the box-counting procedure [13].
Fig. 1. Three illustrative distributions of phase space vectors. The information dimensions are close to (a) 0, (b) 1, and (c) 2, respectively.
Fig. 2. An example of the Shannon’s information versus the space division order. The slope of the line gives the information dimension.
To aid the understanding of $D_I$, we enumerate some special cases. Fig. 1(a) shows the case of an information dimension close to 0. Most vectors are agglomerated into a point-like lump. In addition, a small fraction of the vectors is distributed relatively far from this lump, making the volume of the ‘encompassing’ container very large compared to the lump size. The result is reminiscent of an atom, where the electron cloud makes the atomic size large while the atomic mass is concentrated in the tiny region of the nucleus. The fractal dimension of such an entity is close to zero. Because of its importance to our study and for later reference, we state the two requirements for a small information dimension. In order for the phase space vectors to have an information dimension close to zero, the following conditions should be met:
1. Almost all the vectors are agglomerated in a point-like region.
2. Some minor vectors should occupy a relatively large space volume.
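A back-of-the-envelope estimate shows why these two conditions lead to a small slope. As an idealization (an assumption made for illustration, not part of the original analysis), suppose that a fraction $1-\epsilon$ of the vectors stays inside a single box at every resolution, while the remaining fraction $\epsilon$ is spread evenly over the $4^k$ boxes of a $(2^k \times 2^k)$ division, and ignore the overlap between the lump's box and this background. Eq. (4) then gives

$$I = -(1-\epsilon)\log_2(1-\epsilon) - 4^k \cdot \frac{\epsilon}{4^k}\log_2\frac{\epsilon}{4^k} = -(1-\epsilon)\log_2(1-\epsilon) - \epsilon\log_2\epsilon + 2\epsilon k ,$$

so the information grows by only $2\epsilon$ bits per division order k, and the slope tends to zero as $\epsilon \to 0$.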
Figure 1(b) shows the case of an information dimension close to 1. The two components of the vectors are so strongly correlated that the distribution is needle-shaped. This behavior happens, e.g., when the original pattern has some periodicity. Fig. 1(c) shows vectors that are evenly distributed over the space. They form a genuinely two-dimensional distribution, for which the information dimension is close to 2.
Fig. 3. The waveform of a Korean word ‘Gawi-Bawi-Bo’ pronounced by a young female speaker.
We consider the smallest rectangle (2-cube) that contains all 512 vectors of an analysis frame and divide it into a ($2^k \times 2^k$) array. After calculating the natural measure and Shannon’s information by the procedures of Eqs. (3) and (4), we increase k by 1 and repeat the calculation. Fig. 2 shows the results obtained for the data of Figs. 1(b) and (c).
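The whole procedure (bounding rectangle, $2^k \times 2^k$ division, natural measure, Shannon's information, slope) can be put together as follows. This is a sketch under stated assumptions: the function names, the synthetic "needle" and "cloud" data sets standing in for Figs. 1(b) and (c), and the fixed random seed are all illustrative choices, and the slope is fitted against the division order k for k = 1 to 5.

```python
import math
import random
from collections import Counter


def information(vectors, k):
    """Shannon information (Eq. 4) of the natural measure (Eq. 3) at order k."""
    m = 2 ** k
    xs = [v[0] for v in vectors]
    ys = [v[1] for v in vectors]
    x_lo, y_lo = min(xs), min(ys)
    x_span = (max(xs) - x_lo) or 1.0   # avoid division by zero on a flat axis
    y_span = (max(ys) - y_lo) or 1.0
    counts = Counter(
        (min(int((x - x_lo) / x_span * m), m - 1),
         min(int((y - y_lo) / y_span * m), m - 1))
        for x, y in vectors
    )
    n = len(vectors)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def information_dimension(vectors, orders=(1, 2, 3, 4, 5)):
    """Least-squares slope of the information versus the division order k."""
    infos = [information(vectors, k) for k in orders]
    mean_k = sum(orders) / len(orders)
    mean_i = sum(infos) / len(infos)
    sxy = sum((k - mean_k) * (i - mean_i) for k, i in zip(orders, infos))
    sxx = sum((k - mean_k) ** 2 for k in orders)
    return sxy / sxx


n = 512
needle = [(i / n, i / n) for i in range(n)]                # perfectly correlated components
rng = random.Random(0)
cloud = [(rng.random(), rng.random()) for _ in range(n)]   # uncorrelated components

d_needle = information_dimension(needle)
d_cloud = information_dimension(cloud)
print(d_needle, d_cloud)
```

The needle gives a slope of essentially 1. The estimate for the cloud can be expected to come out below 2, because 512 vectors cannot populate the finest meshes.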
Since we have only a rather small number (512) of vectors in a frame, the information contained in the space is soon exhausted as we increase k. Therefore, a data plot such as Fig. 2 tends to flatten as the mesh size is reduced. Though more elaborate methods may be devised and applied to circumvent this problem [14], we content ourselves with the result obtained for five divisions, i.e., k = 1 to 5. The values of the information dimension were found to range from 1.2 to 1.4 for most phonemes (as will be seen in Fig. 5 below), which are close to those of many well-known two-dimensional strange attractors [15].
Fig. 4. The trajectories of the phase space vectors constructed from the time-series data of the three sections of Fig. 3 according to Eq. (2) with τ = 3 and T = 5.
III. APPLICATION TO SPEECH
Several words were pronounced by a young female speaker and sampled at 16 kHz with 16-bit quantization after low-pass filtering. Fig. 3 shows the waveform of the Korean word ‘Gawi-Bawi-Bo’, which means ‘rock-paper-scissors’. For comparison of information dimensions, three sections (frames) are delimited by vertical lines, each consisting of 512 data points corresponding to a duration of 32 ms.
Figure 4 shows the trajectories of the phase space vectors constructed from the time-series data of the three sections of Fig. 3 according to Eq. (2) with τ = 3 and T = 5.
We consider the smallest ‘n-cube’ that contains all the phase space vectors. It is divided into a mesh array (rectangles in our 2D case), natural measures are computed for various mesh sizes, and the information dimensions are extracted according to the procedures described in Section II. Fig. 5 shows the information dimension profile for the waveform of Fig. 3, calculated frame by frame in a semi-continuous manner. We notice sharp drops at two locations corresponding to the onsets of speech production of the syllables ‘Bawi’ and ‘Bo’ after short breath-holdings.
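A frame-by-frame profile of this kind can be sketched with a sliding window. The text does not state the step between successive frames, so the hop of 128 samples below is purely an assumption, as are the function name `frame_profile` and the toy feature used in the example; any per-frame quantity, such as an information dimension estimate, could be passed as `feature`.

```python
def frame_profile(signal, feature, frame_len=512, hop=128):
    """Evaluate feature(frame) on overlapping frames of the signal.

    frame_len = 512 follows the text; hop = 128 is an assumed step size.
    Returns a list of (start_index, feature_value) pairs.
    """
    return [
        (start, feature(signal[start:start + frame_len]))
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


# Toy usage: peak value per frame of a ramp signal.
signal = list(range(1024))
profile = frame_profile(signal, feature=max)
print(profile)  # [(0, 511), (128, 639), (256, 767), (384, 895), (512, 1023)]
```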
This result is remarkable in that the drops in information dimension at the silence-speech boundaries are sharp. It is to be compared with the profile of the energy, which is defined by

$$E = \sum_{k=0}^{n-1} x_k^2 \tag{6}$$
where the sum is over a frame of length n. This quantity should not be confused with energy in the physical sense, for which the signal amplitude is actually pressure and a proportionality constant is needed for accurate quantification. In speech processing technology, the energy estimated by Eq. (6) is commonly adopted for speech detection. Fig. 6 shows the energy profile for the waveform of Fig. 3. We see that the energy valleys at the two locations of breath-holding are not as sharp as the drops in Fig. 5.
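The frame energy of Eq. (6) is a plain sum of squared samples. The sketch below shows it on a toy signal; the function names and the use of consecutive non-overlapping frames are assumptions for illustration.

```python
def frame_energy(frame):
    """Eq. (6): sum of squared samples over one frame."""
    return sum(x * x for x in frame)


def energy_profile(signal, frame_len=512):
    """Energy of consecutive non-overlapping frames (the framing is an assumption)."""
    return [
        frame_energy(signal[i:i + frame_len])
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


# Toy signal: one silent frame followed by one frame of unit-amplitude samples.
signal = [0] * 512 + [1, -1] * 256
print(energy_profile(signal))  # [0, 512]
```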
Another parameter of interest in speech processing is the zero crossing rate (ZCR), which is the rate of sign changes along the signal, i.e., the rate at which the signal changes from positive to negative or back. In mathematical terms, it is defined by
z = 1 n
n−1
X
k=0