Acoustics-Methods for the calculation of the articulation index of speech
		    
	
		    
				National Standard of the People's Republic of China
Acoustics - Methods for the calculation of the articulation index of speech
GB/T 15485-1995
1 Subject content and scope of application
Speech intelligibility tests, in which, for example, a group of speakers and listeners is organized to conduct speech perception tests, are complicated and time-consuming. A computable physical measure that is highly correlated with speech intelligibility has therefore been developed. This measure is called the articulation index, AI for short. AI is the effective proportion of the normal speech signal that is available to a listener for understanding speech under given speech channel and noise conditions. It is a weighted fraction that can be calculated from the measured or estimated speech spectrum and the effective masking spectrum of the noise present at the listener's ear.
This standard describes the method and steps for calculating the articulation index AI and gives the functional relationship between AI and syllable intelligibility. The calculation method is based on the average results for male and female speakers and listeners, with normal speech and hearing, who speak Mandarin Chinese. The data in this method do not apply to children. The method for calculating AI described in this standard is partly based on the American National Standard ANSI S3.5-1969, "Methods for the calculation of the articulation index".
2 Terminology
2.1 Articulation index
An index, derived from a large number of speech intelligibility tests and relying on band additivity, that is used to calculate the intelligibility of a given speech transmission system. It takes values between 0 and 1.
2.2 Long-term root mean square (rms) sound pressure spectrum
The functional relationship between the sound pressure amplitude of a speech signal and frequency. It can be measured by a variety of methods. In 1/1 octave band and 1/3 octave band analysis of normal continuous speech, a stable long-term spectrum can be obtained by using an integration time of 15. The value measured in this way is called the long-term rms sound pressure. It differs from rms values measured with a shorter integration time, such as 1/8 s, the average duration of a speech sound.
2.3 Spectral level
The spectral level of a signal at a given frequency is the sound pressure level within a 1 Hz bandwidth centered on that frequency, expressed in decibels re 20 μPa.
When the measurement filter bandwidth is Δf (Hz), the spectral level of continuous speech equals the measured band sound pressure level minus 10 lg Δf.
2.4 1/1 octave band and 1/3 octave band spectrum
When the measurement uses 1/1 octave band or 1/3 octave band filters, the functional relationship between the band sound pressure level, in decibels, and the center or boundary frequencies of the 1/1 octave or 1/3 octave bands is called the 1/1 octave band or 1/3 octave band spectrum.
Note: The rms sound pressure level of a filter band is referred to the geometric center frequency of the band. The upper and lower cutoff frequencies of a filter band are the frequencies, above and below the frequency of maximum response, at which the filter's response to a sinusoidal signal is 3 dB below the maximum. This standard requires that the slope of the attenuation curve of the filters used shall not be less than 18 dB per octave.
Approved by the State Administration of Technical Supervision on July 3, 1995 and implemented on February 1, 1996.
2.5 Speech peak
The rms value of the speech signal averaged over 1/8 s exceeds the long-term rms value by 12 dB or more for about 1 percent of the time. The long-term rms value plus 12 dB is therefore taken as the peak amplitude of speech that contributes to intelligibility.
2.6 Overall level
The sound pressure level measured with C weighting. The long-term overall rms sound pressure level is approximately equal to the arithmetic mean of the peak sound pressure levels of the individual words in a phonetically balanced word list, minus 3 dB. For the measurement, the sound level meter is placed at the position of the microphone of the communication system or, for face-to-face conversation, at the position of the listener. The measurement should be made under quiet conditions, with the sound level meter set to slow response and C weighting. The phonetically balanced words should be spoken normally.
2.7 Peak clipping
The effect that occurs when the instantaneous voltage applied to the input of an amplifier exceeds a defined limit of its linear gain.
2.8 Threshold of audibility for sounds having a continuous spectrum
The minimum sound pressure level of a continuous-spectrum sound that evokes an auditory sensation in 50% of trials in a listening test conducted in a quiet environment.
2.9 Phonetically balanced (PB) word test
The PB word test uses monosyllabic word lists, each containing 75 syllables (words). The syllables are carefully selected so that the proportions of the basic speech sounds they contain are roughly the same as in everyday spoken language.
2.10 Band perception level
The difference between the band sound pressure level and the threshold of audibility for that band, expressed in decibels.
3 Calculation method
The following two methods can be used to calculate AI.
20-band method: This method is based on the measured or estimated speech spectral level and noise spectral level in 20 contiguous frequency bands of equal clarity (see Table 1). When, in a quiet environment, the speech peak spectral level is 30 dB or more above the hearing threshold, the speech components in each of these bands contribute equally to speech intelligibility.
Table 1 The 20 frequency bands of equal clarity for Chinese speech
Band number    Band boundaries (Hz)
1              200~400
2              400~550
3              550~730
4              730~900
5              900~1 020
6              1 020~1 150
7              1 150~1 270
8              1 270~1 400
9              1 400~1 520
10             1 520~1 700
11             1 700~1 900
12             1 900~2 100
13             2 100~2 400
14             2 400~2 700
15             2 700~3 000
16             3 000~3 400
17             3 400~4 000
18             4 000~4 700
19             4 700~5 700
20             5 700~7 000
The bandwidth and center frequency columns of Table 1 are not reproduced in this text.
1/3 octave band and 1/1 octave band method: This method is derived from the 20-band method, except that speech and noise are measured or estimated in 1/3 octave or 1/1 octave bands.
The following sections describe in detail the calculation of AI by the above methods.
Note: ① The parameters used in the calculation (the idealized speech spectrum, the hearing threshold, and the maximum permissible sound level of unclipped speech) are given in Figure 1 and Table 5. These functions are based on the standard Chinese speech spectrum and on the national standard GB 4983, free-field pure-tone equal-loudness contours. The curves of the maximum permissible sound level of speech clipped by 12 dB and 24 dB in Table 6 lie respectively above and below the maximum permissible sound level of unclipped speech (refer to Figure 1).
② The 1/1 octave band method is less sensitive to changes in the speech and noise spectra than the 20-band method and the 1/3 octave band method, and is therefore less accurate. When the energy of the masking noise is clearly concentrated within one octave or less, the 1/1 octave band method cannot be used. In this case the 1/3 octave band method should be used, and the 20-band method is preferred.
③ When reporting results in the literature, state the AI calculation method used, i.e. AI (20-band), AI (1/3 octave band) or AI (1/1 octave band).
3.1 20-band method
3.1.1 Step 1
Plot the known or estimated speech peak spectral level at the listener's ear on the calculation diagram (see Figure 1). The speech peak spectral level can be approximated by the algebraic sum of the following quantities.
3.1.1.1 The frequency response of the system being evaluated, in decibels. The frequency response at each frequency is the difference between the sound pressure level at the listener's ear and the sound pressure level of the speaker's voice at the microphone at that frequency.
Note: Care must be taken to ensure that the frequency response fully reflects the characteristics of the transmitting and receiving transducers of the system.
3.1.1.2 Application of the idealized speech spectrum
a. The idealized spectrum in Figure 1 applies when the long-term overall rms sound pressure level is 65 dB. When the measured or estimated long-term overall rms sound pressure level differs from 65 dB, the curve is moved up or down by the difference between the two.
Note: The idealized spectrum in Figure 1 strictly applies at a distance of 1 m from the speaker's head in an environment that is essentially free of reverberation and noise. Under these conditions, the shape of the spectrum is quite accurate when the speech spectrum is measured at distances between 2.54 cm and 1 m from the speaker. The speech level can therefore be measured (or estimated) close to the speaker's head and converted to the value at 1 m by the inverse square law, assuming an equivalent point sound source 0.6 cm behind the speaker's lips. The derived value is compared with 65 dB to obtain the required adjustment of the idealized speech spectral level.
b. When the speech is reproduced by a loudspeaker in a room that is not anechoic (a non-free sound field), the speech level should be corrected according to Table 2.
Table 2 Correction of the speech sound level from loudspeakers in reverberant or semi-reverberant rooms
Columns: overall speech sound level (dB); correction to be subtracted from the speech sound level (dB). The table values are not reproduced in this text.
The correction values given in Table 2 do not apply to speech reproduced through headphones or by loudspeakers in a free sound field. The correction is based on experimental results showing that, in such rooms, an increase in the speech sound level causes a decrease in intelligibility. For example, if the long-term overall speech sound level of a sound reinforcement system in a reverberant room is 95 dB, the idealized spectrum, corrected for the system frequency response as described in 3.1.1.1, is raised by 26 dB, that is, 95 dB - 65 dB - 4 dB (the correction value from Table 2). This gives the effective speech spectrum of the sound reinforcement system.
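The level adjustment described in 3.1.1.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the 1 m reference distance is taken from the lips, and the Table 2 correction must be supplied by the caller because the table values are not available here.

```python
import math

REFERENCE_LEVEL_DB = 65.0   # overall level for which the idealized spectrum is drawn
SOURCE_OFFSET_M = 0.006     # equivalent point source 0.6 cm behind the lips


def level_at_one_metre(measured_db: float, distance_from_lips_m: float) -> float:
    """Convert a level measured close to the lips to the level at 1 m.

    Uses the inverse square law with distances counted from the assumed
    point source (hypothetical helper, not part of the standard).
    """
    r_measured = distance_from_lips_m + SOURCE_OFFSET_M
    r_target = 1.0 + SOURCE_OFFSET_M
    return measured_db - 20.0 * math.log10(r_target / r_measured)


def spectrum_shift_db(overall_level_db: float, table2_correction_db: float = 0.0) -> float:
    """Amount by which the idealized spectrum is moved up (+) or down (-)."""
    return overall_level_db - REFERENCE_LEVEL_DB - table2_correction_db


# Worked example from the text: 95 dB - 65 dB - 4 dB
shift = spectrum_shift_db(95.0, table2_correction_db=4.0)
```

For speech heard directly or through headphones, the Table 2 correction is left at zero, so the shift is simply the measured overall level minus 65 dB.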
3.1.2 Step 2
Add the corrected spectrum of the steady-state noise reaching the listener to Figure 1. The rms sound pressures of the several noise sources, such as the ambient noise in the listener's area and the noise reaching the listener through the speech transmission system, should be added together.
3.1.2.1 Corrected noise spectrum
When the band perception level of the noise exceeds 80 dB, the effectiveness of the noise as a masker increases faster than the normal rate. This increased masking can be taken into account in the AI calculation by adding a correction to the noise sound pressure level. If the noise perception level exceeds 80 dB at the center frequency of a band (the vertical lines in Figure 1), the noise sound pressure level at that frequency is increased by the appropriate amount given in Table 3. The noise perception level is found by subtracting the hearing threshold spectral level from the noise spectral level.
Table 3 Correction for the nonlinear growth of masking
Columns: band perception level (dB); correction to be added to the noise sound pressure level (dB). Of the table values, only a single entry, 13, is reproduced in this text.
3.1.3 Step 3
Plot the effective masking spectral level of the noise on the calculation diagram (Figure 1). The effective masking spectrum at any frequency is the largest of the noise spectrum, the corrected noise spectrum, and the extended masking spectrum at that frequency. The extended masking spectrum is drawn as follows.
3.1.3.1 First determine the starting points of the extended masking spectrum. For each maximum of the noise spectrum or the corrected noise spectrum, find the level on the vertical axis that is 3 dB below the maximum, and draw a horizontal line at that level to intersect the noise spectrum. The rightmost intersection points are called "starting points".
Note: ① When the noise spectrum has only one maximum or peak, there is only one starting point. ② If the noise spectrum has a peak at ? 000 Hz or higher, the starting point is set at 5 700 Hz, 3 dB lower.
3.1.3.2 From each starting point, drop vertically by 57 dB and then draw a straight line to the left with a slope of 10 dB/oct. This line is the low-frequency part of the extended masking spectrum.
3.1.3.3 From each starting point, draw a horizontal line of a given length to the right and then a line falling at a given slope. The length of the horizontal part and the slope of the falling part depend on the frequency of the starting point and on the maximum spectral level of the noise at that frequency, as given in Table 4. These lines represent the high-frequency part of the extended masking spectrum.
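Two of the operations in steps 2 and 3 lend themselves to a short sketch: combining several noise sources into one level, and taking the largest of the candidate spectra as the effective masking spectrum. The Python below is a minimal illustration, not the standard's procedure in full. It assumes the sources are uncorrelated, so that mean-square pressures add, and it takes the corrected and extended masking spectra as already-known inputs, since Tables 3 and 4 are not available here. All function names are hypothetical.

```python
import math
from typing import Sequence


def combine_levels_db(levels_db: Sequence[float]) -> float:
    """Combine levels from several sources at one frequency.

    Assumption: the sources are uncorrelated, so their mean-square
    pressures (10 ** (L / 10)) are summed before converting back to dB.
    """
    return 10.0 * math.log10(sum(10.0 ** (level / 10.0) for level in levels_db))


def effective_masking_spectrum(
    noise_db: Sequence[float],
    corrected_noise_db: Sequence[float],
    extended_masking_db: Sequence[float],
) -> list[float]:
    """Per band, take the largest of the three candidate spectra (step 3)."""
    return [
        max(n, c, e)
        for n, c, e in zip(noise_db, corrected_noise_db, extended_masking_db)
    ]


# Two equal, uncorrelated sources raise the level by about 3 dB.
combined = combine_levels_db([60.0, 60.0])
```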
Figure 1 AI calculation diagram for the 20-band method
[Horizontal axis: frequency (Hz), marked with the band numbers and the average center frequencies of the 20 equal-clarity bands for male and female voices (see Table 1). Curves shown: the maximum permissible sound level for (a) speech peak-clipped by 24 dB, (b) speech peak-clipped by 12 dB, and (c) unclipped speech, where speech peaks above these curves do not contribute to intelligibility; the idealized speech spectrum of Mandarin Chinese for male and female voices, that is, the long-term spectrum (rms) + 12 dB for a long-term overall rms level of 65 dB; and the hearing threshold spectral level for continuous-spectrum sound.]
Table 4 High-frequency part of the extended masking spectrum (upward extension of masking)
Rows: maximum spectral level of the noise, or the spectral level after correction, whichever is higher (dB re 20 μPa). Only the row labels 76~85 and 56~65 are reproduced in this text.
Columns: starting-point frequency ranges 50~800 Hz, 800~1 600 Hz, 1 600~2 400 Hz, 2 400~3 200 Hz and 3 200~7 000 Hz, each with entries A and B. The entries are not reproduced in this text.
Note: 1) A is the length (Hz) of the horizontal line drawn to the right from the starting point; 2) B is the slope (dB/oct) of the line drawn downward from the right-hand end of the horizontal line; 3) see 3.1.3 for the determination of the starting-point frequency.
3.1.4 Step 4
The center frequency of each of the 20 bands is marked in Figure 1. At each center frequency, determine the difference in decibels between the speech peak spectral level and the effective masking spectral level. When the difference is 0 or less, it is set to 0. When the speech peak spectral level exceeds the effective masking level by 30 dB or more, the difference is set to 30.
Note: ① Where the hearing threshold curve on the calculation diagram (Figure 1) lies above the effective masking level, the hearing threshold is taken as the minimum noise spectrum. ② Where the speech peak curve exceeds the maximum permissible sound level marked in Figure 1, the maximum permissible sound level is taken as the speech peak level.
3.1.5 Step 5
Add the 20 differences obtained in step 4 and divide the sum by 600. The result is the articulation index of the given communication system for the given noise conditions and speech level.
3.1.6 Example
An illustrative example of calculating AI by the 20-band method is shown in Figure 2.
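Steps 4 and 5 reduce to a clamp, a sum and a division, which the following Python sketch illustrates. The function name is hypothetical, and the inputs are assumed to be the speech peak spectral level and the effective masking spectral level read at the 20 band center frequencies, with the notes to step 4 (hearing threshold floor, maximum permissible level cap) already applied.

```python
from typing import Sequence

NUM_BANDS = 20
MAX_DIFFERENCE_DB = 30.0


def articulation_index_20_band(
    speech_peak_db: Sequence[float],
    effective_masking_db: Sequence[float],
) -> float:
    """AI by the 20-band method: sum of clamped differences divided by 600."""
    if len(speech_peak_db) != NUM_BANDS or len(effective_masking_db) != NUM_BANDS:
        raise ValueError("expected one value per band for all 20 bands")
    total = 0.0
    for speech, masking in zip(speech_peak_db, effective_masking_db):
        # Differences below 0 count as 0, differences above 30 dB count as 30.
        total += min(max(speech - masking, 0.0), MAX_DIFFERENCE_DB)
    return total / (NUM_BANDS * MAX_DIFFERENCE_DB)  # 20 * 30 = 600


# Invented illustrative levels, not taken from Figure 2:
speech = [50.0] * 20
masking = [40.0] * 10 + [60.0] * 5 + [10.0] * 5
ai = articulation_index_20_band(speech, masking)
```

The divisor 600 is the largest possible sum (20 bands at 30 dB each), so the result always lies between 0 and 1.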
Figure 2 Example of calculating AI by the 20-band method
[Horizontal axis: frequency (Hz), marked with the band numbers and the average center frequencies of the 20 equal-clarity bands for male and female voices (see Table 1). Curves shown: the speech peak spectral level, that is, the long-term spectrum (rms) + 12 dB (long-term overall rms level 95 dB); the noise spectral level (overall level 113 dB); the noise masking level after correction (see Table 3); a, upward extension of masking; b, downward extension of masking.]
Readings at the band center frequencies of Figure 2 and calculation of AI: for each band number, the table lists the difference between the speech peak and the noise masking level or hearing threshold. Only the entries 17 and 26 are reproduced in this text. The 20 differences sum to 239, so AI = 239/600 = 0.40.
3.2 1/1 octave band method and 1/3 octave band method
3.2.1 Step 1
Determine the speech sound pressure level reaching the listener's ear in each band of the bandpass filters used.
Note: The center frequencies of the 1/3 octave band and 1/1 octave band filters are given in Table 5.
The band sound pressure level of the speech peak can be approximated by the algebraic sum of the following values:
3.2.1.1 The frequency response of the system being evaluated, in decibels. The frequency response at each center frequency is the difference between the band sound pressure level at the listener's ear and the band sound pressure level at the speaker's microphone. Care should be taken to ensure that the frequency response fully reflects the characteristics of the transmitting and receiving transducers of the entire system.
3.2.1.2 Application of the idealized speech spectrum
a. The idealized speech spectrum is shifted by the difference between the measured or estimated long-term overall (rms) sound level of the speech and 65 dB, i.e. the difference is added to or subtracted from the idealized speech spectrum value of the corresponding band in Table 5.
b. When the speech is reproduced by loudspeakers in a room that is not anechoic, the long-term overall sound level of the speech should be reduced by the value given in Table 2. The correction values in Table 2 do not apply to speech reproduced through headphones or by loudspeakers in a free sound field.
Note: The idealized speech spectrum in Table 5 strictly applies at a distance of 1 m from the speaker's lips in an environment that is essentially free of reverberation and noise.
3.2.2 Step 2
Calculate the band sound level of the steady-state noise reaching the listener's ear.
Note: When the perception level of a 1/1 octave band or 1/3 octave band exceeds 84 dB at the center frequency of a band that contributes significantly to intelligibility, the speech and noise spectra are converted to spectral levels (see 2.3) and the results are plotted on Figure 1. AI is then calculated by the 20-band method described in 3.1.1 to 3.1.5. The conversion to spectral level is necessary so that nonlinear and extended masking effects can be taken into account in the calculation of AI. These effects become significant when the band perception level exceeds 84 dB.
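The conversion from a band level to a spectral level mentioned in the note above follows the definition in 2.3: the band level minus ten times the base-10 logarithm of the bandwidth in hertz. The Python sketch below illustrates it. The bandwidth helper assumes ideal fractional-octave filters whose band edges lie at the center frequency multiplied and divided by 2^(1/(2n)); real filters may differ, and both function names are hypothetical.

```python
import math


def fractional_octave_bandwidth_hz(center_hz: float, fraction: int) -> float:
    """Bandwidth of an ideal 1/fraction octave band (assumption: ideal band edges).

    fraction=1 gives a 1/1 octave band, fraction=3 a 1/3 octave band.
    """
    half_step = 2.0 ** (1.0 / (2.0 * fraction))
    return center_hz * (half_step - 1.0 / half_step)


def spectral_level_db(band_level_db: float, bandwidth_hz: float) -> float:
    """Spectral level (level per 1 Hz) from a measured band level."""
    return band_level_db - 10.0 * math.log10(bandwidth_hz)


# A 70 dB reading in the 1/3 octave band centered on 1 000 Hz (illustrative values)
bandwidth = fractional_octave_bandwidth_hz(1000.0, 3)
level = spectral_level_db(70.0, bandwidth)
```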
Table 5 Idealized speech spectrum + 12 dB, effective hearing threshold spectrum, and maximum permissible sound pressure level of continuous-spectrum sound without clipping
Key: A is the idealized speech spectrum + 12 dB; B is the effective hearing threshold spectrum; C is the maximum permissible sound level of continuous-spectrum sound without clipping; W is the weight.
The center frequency columns and a number of cells are not reproduced in this text. The surviving entries are listed below in row order, with missing cells marked (lost).
Part 1: center frequencies of the 20 equal-clarity bands (spectral level)
A         B         C
48.0      -16.0     105
(lost)    -16.0     103
43.0      -16.0     101
39.0      -17.0     (lost)
37.0      -17.5     (lost)
36.0      -19.0     (lost)
34.0      -20.0     (lost)
32.0      -22.0     (lost)
30.0      -24.0     (lost)
28.0      -26.0     (lost)
26.0      -28.5     99
24.0      -29.0     (lost)
22.0      -30.0     99
(lost)    -30.0     100
16.5      -29.0     102
Part 2: 1/3 octave band center frequencies (band sound pressure level)
C         W
138       0.000 4
135       0.001 0
132       0.001 0
129       0.001 4
130       0.003 4
132       0.002 4
135       0.002 0
Part 3: 1/1 octave band center frequencies (band sound pressure level)
A         B         C         W
74.0      14.0      (lost)    (lost)
62.0      10.0      (lost)    (lost)
(lost)    (lost)    140       0.002 4
3.2.3 Step 3
Determine the difference (D), in decibels, between the speech band sound pressure level and the noise band sound pressure level at the center frequency of each band. If the difference is 0 or less, it is set to 0. If the speech band sound pressure level exceeds the noise band sound pressure level by 30 dB or more, the difference is set to 30.
Note: ① Where the hearing threshold exceeds the noise band level, the hearing threshold is taken as the minimum equivalent noise band level. ② Where the speech peak exceeds the maximum permissible sound level, the maximum permissible sound level is taken as the speech peak sound level.
3.2.4 Step 4
Multiply the difference (D) obtained in step 3 (3.2.3) by the weight (W) listed in Table 5.
3.2.5 Step 5
Add the values of D × W. The result is the AI of the speech system for the given noise conditions and the specified speech sound level.
4 The influence of various factors on AI
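Before the individual factors are discussed, the weighted-band calculation of 3.2.3 to 3.2.5 above can be summarized in a short Python sketch. It is an illustration only. The function name is hypothetical, the weights must be taken from the complete Table 5 (the numbers used below are invented so that the arithmetic is easy to follow), and the inputs are assumed to already respect the notes to step 3.

```python
from typing import Sequence

MAX_DIFFERENCE_DB = 30.0


def articulation_index_weighted(
    speech_peak_band_db: Sequence[float],
    noise_band_db: Sequence[float],
    weights: Sequence[float],
) -> float:
    """AI by the 1/3 or 1/1 octave band method: sum over bands of D * W."""
    if not (len(speech_peak_band_db) == len(noise_band_db) == len(weights)):
        raise ValueError("all three sequences need one entry per band")
    ai = 0.0
    for speech, noise, weight in zip(speech_peak_band_db, noise_band_db, weights):
        # Step 3: clamp the difference D to the range 0..30 dB.
        difference = min(max(speech - noise, 0.0), MAX_DIFFERENCE_DB)
        # Steps 4 and 5: weight the difference and accumulate.
        ai += difference * weight
    return ai


# Invented two-band example: D = 20 and 0, W = 0.01 and 0.02
example_ai = articulation_index_weighted([70.0, 60.0], [50.0, 65.0], [0.01, 0.02])
```

If the weights of a complete table sum to 1/30, the result reaches 1 when every band difference is 30 dB, which mirrors the division by 600 in the 20-band method.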
4.1 Factors evaluated by AI
The score of a speech intelligibility test is affected by many conditions imposed on the transmitted speech signal. These are still not completely understood and cannot all be determined. However, when the AI principle is applied, some factors can be estimated quantitatively, whether they occur individually or in combination. The exceptions are listed in 4.2. The factors that can be estimated are as follows.
4.1.1 Steady noise masking
AI can predict the influence of broadband continuous noise and of noise with a bandwidth greater than 200 Hz in the frequency range 200~7 000 Hz.
4.1.2 Non-steady noise masking
Non-steady noise affects speech intelligibility only during part of the time, called the on period. When the noise is not steady-state but its on-off cycle is known, AI is first calculated as if the noise were steady-state, and the resulting AI value is then corrected using Figure 3 to obtain the effective AI value. This method applies only when the noise level during the on period of the cycle is more than 20 dB higher than during the off period.
4.1.2.1 Noise interruption rate
For a communication system with noise that has a definite on-off cycle, the effective AI calculated according to 4.1.2 and Figure 3 should be further adjusted according to the function given in Figure 4. The horizontal axis of Figure 4 is the number of noise interruptions per second, and the vertical axis is the effective AI for the given parameters.
4.1.3 Frequency distortion of speech signals
Frequency distortion is unequal gain at different frequencies during signal transmission, which also affects speech intelligibility. These effects can be accounted for quite accurately by AI. Different emphasis is given to the following components when calculating AI:
a. high-frequency components of the speech signal;
b. low-frequency components;
c. mid-frequency components.
However, if the long-term spectrum of the speech signal is very irregular, that is, the spectrum has a series of peaks and valleys whose average peak-to-valley slope exceeds 18 dB/oct, AI is not very reliable for estimating speech intelligibility.
4.1.4 Amplitude distortion of speech signals
The calculated AI can estimate the effect of pronounced symmetrical peak clipping on speech intelligibility. The following steps can be used.
4.1.4.1 Step 1
Determine from Figure 5 the increase in the long-term rms level of speech produced in a system by a given amount of clipping followed by amplification after clipping.
Figure 3 Correction for different noise-on time fractions
[Horizontal axis: fraction of the time during which the noise is on, from 0.1 to 1.0.]
Note: The ordinate is the correction to be added to the AI value calculated for steady-state noise masking, for different noise-on time fractions. The corrected AI value cannot exceed 1.0.
Figure 4 Relationship between effective AI and interruption rate (read the AI correction value from Figure 3)
[Horizontal axis: number of noise interruptions per second.]
Note: Figure 4 shows the relationship between the effective AI value and the number of masking interruptions. The parameter on each curve is the AI value calculated for steady-state noise masking and corrected according to Figure 3.
4.1.4.2 Step 2
Add the result of step 1 to the speech peak level (long-term rms level of unclipped speech + 12 dB). This gives the peak level, at the listener's ear, of the speech that has been clipped and then amplified.
Note: The amplification after clipping is defined as the amount of amplification applied in the communication system so that the peak-to-peak amplitude of the clipped speech reaches the same level as that of unclipped speech. The speech peak level is defined as the amplitude exceeded 1 percent of the time. If the amplification after clipping, in decibels, is not equal to the amount of clipping applied to the signal, the increase in long-term rms level obtained in step 1 must be reduced by the difference between the peak clipping and the amplification after clipping.
4.1.4.3 Step 3
Plot the results of step 2 on the AI calculation diagram and calculate AI as described above. Note that the maximum permissible sound level on the AI calculation diagram is higher for clipped speech than for unclipped speech.
Note: In general, peak clipping is used only when the speech signal is relatively free of noise before clipping and noise is present at the listening position or is mixed in after clipping.
4.1.5 Reverberation
Reverberation in a room will cause speech intelligibility to degrade. The extent of the reduction is related to the reverberation time of the room. Reverberation time is defined as the time required for a steady sound to drop by 60 dB after the sound source has stopped. When the reverberation time is known, Figure 6 can be used to correct the AI calculated for a given communication system. This correction is added to the AI value obtained after correction from Table 2. 
4.1.6 Speech level
A very weak or very strong speech level degrades speech intelligibility. A given AI value is accurate if, other factors remaining constant, the speech level stays between 50 dB and 85 dB (long-term rms sound pressure level measured at 1 m from the speaker's lips). If very weak or very strong speech levels are used in the communication system, the effective speech level rather than the measured speech level must be entered in the AI calculation diagram. The relationship between the actual speech level and the effective speech level is shown in Figure 7.
4.1.7 Visual cues
Visual cues obtained by watching the speaker's lips and face are very helpful for speech intelligibility, especially in the presence of noise. An AI value can be adjusted to an "effective AI" that reflects the effect of visual cues on speech intelligibility for listeners who are not trained in lip reading (see Figure 8).
4.2 Factors that AI cannot estimate
Many factors that affect speech communication systems cannot be estimated by the current AI, in particular the following.
4.2.1 Speaker gender
As mentioned above, this method is based on the average results of intelligibility tests with male and female speakers. The speech intelligibility obtained when an individual male or female speaker uses the speech communication system may therefore differ slightly from the estimated value.
4.2.2 Multiple transmission channels
The quantitative effect on speech intelligibility of a mixed speech signal, received by the listener both directly from the speaker and simultaneously from a loudspeaker, is not clear. AI may therefore not be applicable to such a system.
4.2.3 Combination of multiple factors
When several distortions occur in combination, such as clipping together with interrupted noise and reverberation, there has not been sufficient testing to show that AI remains valid for every possible combination of factors.
4.2.4 Asymmetric clipping, frequency deviation and attenuation
This standard applies only to communication systems in which asymmetric clipping does not exceed 3 dB, the signal frequency deviation does not exceed 50 Hz, and there are no significant changes in attenuation.