WO2002029781A2 - Speech to data converter - Google Patents
Speech to data converter
- Publication number
- WO2002029781A2 (PCT/US2001/042526)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- frame
- data
- fft
- fft frame
- speech
- Prior art date
Links
- 238000000034 method Methods 0.000 claims abstract description 33
- 230000007704 transition Effects 0.000 claims description 10
- 230000005540 biological transmission Effects 0.000 claims description 6
- 238000010183 spectrum analysis Methods 0.000 abstract description 18
- 238000001228 spectrum Methods 0.000 abstract description 6
- 238000013519 translation Methods 0.000 description 15
- 230000014616 translation Effects 0.000 description 15
- 238000010586 diagram Methods 0.000 description 12
- 230000008569 process Effects 0.000 description 6
- 238000004458 analytical method Methods 0.000 description 4
- 230000008901 benefit Effects 0.000 description 3
- 230000009467 reduction Effects 0.000 description 3
- 230000003595 spectral effect Effects 0.000 description 3
- 230000006835 compression Effects 0.000 description 2
- 238000007906 compression Methods 0.000 description 2
- 238000006243 chemical reaction Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/02—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
Definitions
- the present invention relates generally to speech technology and in particular to the transmission of speech. Still more particularly, the present invention relates to an improved method for transmitting speech using a small amount of data.
- devices which employ techniques to compress voiced speech by extracting parameters that relate to a model of human speech generation are typically called vocoders. Such devices are composed of an encoder, which analyzes the incoming speech to extract the relevant parameters, and a decoder, which resynthesizes the speech using the parameters it receives over the transmission channel. In order to be accurate, the model must be constantly changing. Thus the speech is divided into blocks of time, or analysis frames, during which the parameters are calculated. The parameters are then updated for each new frame.
- the function of the vocoder is to compress the digitized speech signal into a low bit rate signal by removing all of the natural redundancies inherent in speech.
- Method and apparatus for reducing the amount of data sent when transmitting speech by obtaining the spectrum content of digital speech. First, analog speech is converted to digital speech. Then the digital speech is divided into frames and a spectrum analysis is performed on the frames. Frames with similar spectra are combined. Then a second spectrum analysis is performed in predetermined steps. The data from the spectrum analysis of each frame is compressed and sent to a receiver. The receiver uses the data to reconstruct the frame. The frame is combined with other frames to reproduce the digital signal. Then the digital signal is played back, thereby reproducing the analog speech.
- Figure 1A depicts a block diagram of a speech to data converter in accordance with a preferred embodiment of the present invention
- Figure 1B is a block diagram of an analog to digital converter commonly used
- Figure 1C is a block diagram of a bandpass filter commonly used
- Figure 1D depicts a block diagram of a digital signal divided into frames in accordance with a preferred embodiment of the present invention
- Figure 1E depicts a block diagram of a series of frames after a spectral analysis has been performed in accordance with a preferred embodiment of the present invention
- Figure 1F depicts a block diagram of a series of frames after a frame comparison has been performed in accordance with a preferred embodiment of the present invention
- Figure 2A depicts a block diagram of a frame before a spectral analysis has been performed in accordance with a preferred embodiment of the present invention
- Figure 2B depicts a block diagram of the results of a spectral analysis in accordance with a preferred embodiment of the present invention
- Figure 2C depicts a block diagram of the results of a spectral analysis where the amplitude is measured on a 0 to 100 unit scale in accordance with a preferred embodiment of the present invention
- Figure 2D depicts a block diagram of the results of a spectral analysis where the amplitude is measured on a 0 to 16 unit scale in accordance with a preferred embodiment of the present invention
- Figures 3A-3H depict a block diagram of a sine wave for a specific frequency and a resultant sine wave for a frame in accordance with a preferred embodiment of the present invention
- Figure 4 depicts a flow chart showing the steps in the speech to data conversion in accordance with a preferred embodiment of the present invention.
- Figure 5 depicts a flow chart showing the steps of a universal translator in accordance with a preferred embodiment of the present invention.
- FIG. 1A depicts a block diagram of a speech to digital converter 102 in accordance with the present invention.
- Analog signal 104 is received at encoder 106.
- encoder 106 converts analog signal 104 into digital signal 108. This is done using an analog to digital converter and the process is common in the art.
- the analog signal is sampled at a rate of 500,000 to 1,000,000 times per second at 8 bits per sample.
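The sampling and 8-bit quantization step described above can be sketched as follows. This is a minimal illustration only; the function name, the signal representation as a Python callable, and the 440 Hz test tone are our assumptions, not taken from the patent:

```python
import math

def sample_signal(signal, duration_s, rate_hz, bits=8):
    """Sample a continuous signal (a function of time in seconds)
    at rate_hz and quantize each sample to the given bit depth."""
    levels = 2 ** bits                      # 256 levels for 8 bits
    samples = []
    for n in range(int(duration_s * rate_hz)):
        t = n / rate_hz
        x = signal(t)                       # analog value in [-1.0, 1.0]
        # map [-1, 1] onto integer codes 0 .. levels-1
        code = min(levels - 1, int((x + 1.0) / 2.0 * levels))
        samples.append(code)
    return samples

# a 440 Hz test tone sampled at 500,000 samples/s for 1 ms
tone = lambda t: math.sin(2 * math.pi * 440 * t)
digital = sample_signal(tone, 0.001, 500_000)
```

In practice this step is performed by an analog to digital converter chip rather than software; the sketch only makes the rate and bit-depth arithmetic concrete.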
- the frequency range is limited from about 50 Hz to 10,000 Hz. This range is greater than the system requires, but further narrowing of the frequency range will be done as explained below.
- the frequency range of digital signal 108 is further narrowed to a range from about 75 Hz to about 3,000 Hz, as depicted in Figure 1C.
- the narrowing of the frequency range is done with bandpass filter 107, and the process is common in the art. Different frequency ranges can be used for different purposes.
- digital signal 108 is divided into frames 110 using a frame rate of about 150 frames per second.
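The framing step can be sketched directly from the numbers given (the function name and the toy signal below are our assumptions; only the 150 frames-per-second rate comes from the patent):

```python
def divide_into_frames(samples, sample_rate_hz, frame_rate_hz=150):
    """Split a digital signal into fixed-length analysis frames.
    Each frame holds sample_rate_hz / frame_rate_hz samples."""
    frame_len = sample_rate_hz // frame_rate_hz   # samples per frame
    return [samples[i:i + frame_len]
            for i in range(0, len(samples), frame_len)]

# toy example: 3,000 samples at a (hypothetical) 1,500 Hz sample rate,
# framed at 150 frames/s -> 10 samples per frame
frames = divide_into_frames(list(range(3000)), 1500, 150)
```

At the rates the patent actually states (hundreds of thousands of samples per second, 150 frames per second) each frame would hold several thousand samples; the toy numbers are chosen only to keep the example small.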
- a spectrum analysis is performed on frames 110 and frames with similar spectra are combined.
- a fast Fourier transform is used to perform the spectrum analysis and generate peaks similar to Figure 1E. While the preferred embodiment uses a range set at 50 Hz steps between 75 Hz and 3,000 Hz (3 kHz), almost any range can be used.
- the amplitude of each step is evaluated in 4 bits (16 levels) and once the analysis of each frame 110 is completed, all amplitudes below a level of 2 are deleted and the peak amplitudes are stored with their frequencies.
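The first-pass peak selection (4-bit amplitudes, delete everything below level 2, keep the strongest peaks with their frequencies) can be sketched as follows. The function name, the dictionary representation, and the example spectrum are our illustrative assumptions:

```python
def select_peaks(bin_amplitudes, max_peaks=5, floor=2):
    """bin_amplitudes maps frequency (Hz) to an amplitude already
    quantized to 4 bits (0-15). Drop every bin below `floor`, then
    keep the `max_peaks` strongest bins with their frequencies."""
    survivors = {f: a for f, a in bin_amplitudes.items() if a >= floor}
    ranked = sorted(survivors.items(), key=lambda fa: fa[1], reverse=True)
    return dict(ranked[:max_peaks])

# hypothetical 4-bit spectrum of one frame (50 Hz bins)
spectrum = {75: 1, 125: 7, 175: 3, 225: 15, 275: 2, 325: 9, 375: 4, 425: 6}
peaks = select_peaks(spectrum)
```

Here the 75 Hz bin (amplitude 1, below the floor of 2) is discarded, and only the five strongest remaining bins survive, matching the five-peak maximum stated below.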
- Figure 1E shows a series of frames 110 after a spectrum analysis has been performed on each one. Marks 116 illustrate the highest amplitudes for each frame.
- a maximum of five peaks is stored; however, any number of peaks can be used.
- each frame is compared to the next frame. For example, frame 118 in Figure 1E is compared to frame 120. If the data is relatively close, then the two frames are combined. When one frame is different from the previous frame or from the general range of a series of frames, or the total number of frames added together exceeds 15, the frame is ended and the total number of frames added together is converted into a time slice. It is this time slice that will be analyzed for the data to be transmitted.
- if frame 118 and frame 120 are similar, a check is done to see whether 15 frames have already been combined into one frame that is similar to frames 118 and 120. If fewer than 15 frames have been combined into one, then frames 118 and 120 are combined into frame 122, shown in Figure 1F.
- frame 122 is compared to frame 124 to see if they are similar. If they are not, the length of frame 122 is recorded in the frame header. The process starts over and frame 124 is compared to frame 126.
- Frame 128 represents a silence frame. Silence has a special length. Silence accounts for most of the speech signal, whether between words or sentences or while listening. Silence frames are given a length indicator of 16 frames, while speech frames use a maximum length of 15 frames. Only silence frames can have a length of 16 frames. This simplifies the data stream considerably, as any frame with a frame length of 16 does not require additional processing. However, if a silence frame were shorter than 16 frames, it would have to be analyzed and would require additional processing.
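The run-length merging of similar frames, with speech runs capped at 15 and silence runs at 16, can be sketched as follows. The function, the predicate interface, and the use of an empty dict for silence are our assumptions for illustration:

```python
def merge_frames(frame_peaks, similar, max_run=15, silence_run=16):
    """Run-length merge consecutive similar frames.

    frame_peaks: list of per-frame peak dicts ({} stands for silence).
    similar:     predicate deciding whether two frames may merge.
    Returns (peaks, length) pairs; speech runs are capped at 15 frames
    and silence runs at 16, matching the scheme in the text."""
    runs = []
    for peaks in frame_peaks:
        if runs:
            prev, length = runs[-1]
            cap = silence_run if not prev else max_run
            if similar(prev, peaks) and length < cap:
                runs[-1] = (prev, length + 1)
                continue
        runs.append((peaks, 1))
    return runs

same = lambda a, b: a == b          # toy similarity: exact match
runs = merge_frames([{}] * 20 + [{225: 15}] * 3, same)
```

Twenty silence frames become a full run of 16 plus a shorter run of 4, and the three matching speech frames merge into a single run of 3, which is then analyzed as one time slice.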
- frame 122 is analyzed for its spectral content area.
- a spectrum analysis is performed on frame 122.
- Figure 2A represents frame 122 before a spectral analysis is performed.
- the power distribution of frame 122 can be determined. The power distribution of each frame determines the sound for a particular frame. The better the power distribution is documented, the better the reproduction will be.
- an FFT is used to do the spectrum analysis.
- an FFT is able to take all the power from a given range, for example 100 Hz, and represent the power distribution in terms of a sine wave. While almost any range can be set, in the preferred embodiment the range is set at 100 Hz steps between 75 Hz and 3,000 Hz (3 kHz). The bottom frequency is set at 75 Hz because most of the power distribution below 75 Hz is due to noise. The upper frequency is set at 3 kHz because human speech seldom extends above that area. A frequency lower than 3 kHz could be used, but pitch and timbre would be lost. Other step sizes may be used so long as the steps are close enough to replicate speech at an acceptable level of quality. The larger the frequency steps, the fewer the data points sent. However, the fewer the data points sent, the poorer the reproduction quality at the receiver.
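The per-band amplitude measurement can be sketched as follows. For clarity this uses a direct DFT rather than an FFT (an FFT computes the same coefficients faster); the function names, the bin-selection rule, and the test tone are our assumptions:

```python
import cmath
import math

def band_amplitudes(frame, sample_rate_hz, lo=75, hi=3000, step=100):
    """Estimate the amplitude in each `step`-wide band between lo and hi
    by evaluating the DFT bin nearest each band centre."""
    n = len(frame)
    bands = {}
    for band_lo in range(lo, hi, step):       # 75, 175, ..., 2975 -> 30 bands
        centre = band_lo + step / 2
        k = round(centre * n / sample_rate_hz)    # nearest DFT bin
        coeff = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))
        bands[band_lo] = 2 * abs(coeff) / n       # amplitude of that sine
    return bands

# a pure 1,100 Hz tone at a (hypothetical) 6,000 Hz sample rate
sr, n = 6000, 60
frame = [math.sin(2 * math.pi * 1100 * t / sr) for t in range(n)]
bands = band_amplitudes(frame, sr)
```

The 1,100 Hz tone shows up with amplitude 1.0 in the band starting at 1,075 Hz, and essentially zero elsewhere, illustrating how each 100 Hz step captures the power in its range as a single sine-wave amplitude.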
- the amplitude is first plotted on an amplitude scale ranging from 0 to 100. Then the highest single amplitude is stored as the frame's absolute amplitude. For example, the highest amplitude in Figure 2B is 39, found at the 1,075 Hz frequency mark. Next, the range from 0 to the absolute amplitude is divided into 16 steps. For example, because the maximum amplitude for frame 122 is 39 on the 0 to 100 unit scale, the absolute amplitude is set at 39, as shown in Figure 2C. Then the range from 0 to 39 is divided into 16 units. Each unit is equal to 2.4375 units on the 0 to 100 scale.
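The relative quantization for frame 122 checks out arithmetically: 39 / 16 = 2.4375. A minimal sketch (function name and example amplitudes are ours):

```python
def quantize_relative(amplitudes, levels=16):
    """Quantize band amplitudes (on the 0-100 scale) relative to the
    frame's absolute amplitude, as described for frame 122."""
    absolute = max(amplitudes)            # e.g. 39 for frame 122
    unit = absolute / levels              # 39 / 16 = 2.4375
    return absolute, [min(levels - 1, int(a / unit)) for a in amplitudes]

absolute, codes = quantize_relative([39, 19.5, 0, 2.4375])
```

The peak amplitude maps to the top code, half the peak maps to the middle of the scale, and anything under one unit quantizes to 0, so only the absolute amplitude needs to be sent at full precision.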
- the amplitude data for each of the 100 Hz steps will be included as part of the frame data. Up to 30 amplitudes will be required for each frame. However, in one embodiment, all amplitudes less than 2 units are eliminated. This results in most of the amplitude data being 0's.
- the number of bits for each frame will be about 6 bits to identify the frame start, 4 bits for the frame length, 6 bits for the absolute amplitude, and 4 bits for each of up to 30 amplitudes. The result is a maximum of 136 bits per frame. If an increased reduction in data is required, then all amplitudes less than 2 units are eliminated. By using compression algorithms it will be possible to reduce the frame data by 60% to 80%, to about 25 to 55 bits per frame. With an average frame rate of about 18 to 20 frames per second, an average of 1,000 bits per second should be achievable.
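The bit budget above can be checked directly (the variable names are ours; the figures are the patent's):

```python
# Frame bit budget as stated in the text.
frame_start  = 6    # bits identifying the frame start
frame_length = 4    # bits for the frame length (1-15; 16 marks silence)
absolute_amp = 6    # bits for the absolute amplitude
per_band     = 4    # bits for each band amplitude
max_bands    = 30   # 100 Hz steps between 75 Hz and 3,000 Hz

max_bits = frame_start + frame_length + absolute_amp + per_band * max_bands
# 6 + 4 + 6 + 120 = 136 bits per frame before compression

avg_bitrate = 20 * 50   # ~20 frames/s at ~50 bits/frame ≈ 1,000 bits/s
```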
- the data is sent to receiver 103 by conventional means known in the art.
- Upon receipt of the data, receiver 103 reads the frame length and absolute amplitude from the frame header and assigns each amplitude to the correct frequency. Receiver 103 knows the frequency steps are in 100 Hz increments and, as depicted in Figures 3A-3H, creates a sine wave for each frequency step and corresponding amplitude. If played back, a single sine wave would merely produce a tone. To reproduce speech, the sine wave for each frequency step in the frame must be reproduced and combined with the other sine waves in the frame to form a resultant sine wave. For example, Figure 3B represents the sine wave recreated from the amplitude that corresponds to the frequency of 175 Hz. Figure 3C depicts that recreated sine wave combined with the sine wave generated from the data that corresponds to 75 Hz.
- Figure 3D represents the resultant sine wave from the combination of the two sine waves shown in Figure 3C.
- the process is repeated as shown in Figures 3F-3H until all of the sine waves for the frame have been recreated and a resultant wave has been produced.
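The receiver-side resynthesis, one sine wave per band summed into a resultant wave, can be sketched as follows. The function name and the choice of band-centre frequencies are our assumptions:

```python
import math

def synthesize_frame(band_amps, n_samples, sample_rate_hz, step=100):
    """Rebuild a frame as the sum of one sine wave per frequency band.
    band_amps maps each band's lower edge (Hz) to its decoded amplitude;
    each band is synthesized at its centre frequency."""
    out = [0.0] * n_samples
    for band_lo, amp in band_amps.items():
        freq = band_lo + step / 2                  # band centre frequency
        for t in range(n_samples):
            out[t] += amp * math.sin(2 * math.pi * freq * t / sample_rate_hz)
    return out

# one band alone just produces a tone (here 125 Hz at a toy 500 Hz rate)
wave = synthesize_frame({75: 1.0}, 4, 500)
```

With a single band the output is a plain tone; only when all of a frame's bands are summed does the resultant wave approximate the original frame.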
- the resultant wave is similar to the wave in Figure 2A.
- played back alone, each frame would produce an unrecognizable sound or beep.
- digital signal 108 is replicated. To eliminate the flutter caused by discontinuities, each frame is faded into the next, or some other process known in the art is used.
- Block 402 illustrates the production and propagation of analog signal 104.
- Block 404 depicts encoder 106 receiving analog signal 104.
- Block 406 illustrates encoder 106 converting analog signal 104 into digital signal 108.
- Block 408 depicts digital signal 108 being divided into frames 110.
- Block 410 illustrates the number of frames 110 being reduced by combining similar frames 110.
- Block 412 depicts a spectrum analysis being performed on frame 110.
- Block 414 illustrates the frame data being sent to receiver 103.
- Block 416 depicts receiver 103 receiving the frame data.
- Block 418 illustrates receiver 103 reconstructing the frame based on the frame data sent.
- Block 420 depicts the reconstructed frames being combined to reproduce digital signal 108.
- Block 422 illustrates the reconstructed digital signal 108 being played back as analog signal 104.
- Figure 5 depicts the use of a universal translator.
- Block 502 illustrates a first user speaking a first language into a first translation system.
- The first translation system is any translation system that can convert text in one language into text in a single transition language. It is preferable that the translation system also have speech recognition, but it is not required.
- Block 504 depicts converting the first language speech into first language text.
- Block 506 illustrates the text in first language being converted into text in a transition language. In the preferred embodiment the transition language is English, however, any language could be used.
- Block 508 depicts the text of transition language being transmitted to a second translation system. The requirements for the second translation system are the same as those for the first translation system.
- Block 510 illustrates the second translation system receiving the text of transition language.
- Block 512 depicts the second translation system converting the transition language text into second language text.
- Block 514 illustrates the second language text being delivered to the second user.
- the second language text could be translated back into the transition language, then back into first language text. This would allow the first user to see how the translations affected the first user's meaning.
- Each translation system only has to convert a language into the transition language or the transition language into the user's language. Consequently, each translation system can be focused on delivering a true translation of grammar and vocabulary instead of handling more than one language pair. As a result, a much more accurate speech translation system can be developed.
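The pivot structure above can be illustrated with a deliberately tiny sketch. The word tables and function are invented for illustration only; a real system would use full translation engines on each side of the transition language:

```python
# Toy pivot translation: each system only knows its own language
# and the transition language (English, as in the preferred embodiment).
FR_TO_EN = {"bonjour": "hello", "monde": "world"}   # first system's table
EN_TO_DE = {"hello": "hallo", "world": "welt"}      # second system's table

def translate(text, table):
    """Word-for-word lookup; unknown words pass through unchanged."""
    return " ".join(table.get(word, word) for word in text.split())

pivot  = translate("bonjour monde", FR_TO_EN)   # first system -> English
german = translate(pivot, EN_TO_DE)             # second system -> German
```

The point of the structure is visible even in the toy: adding a new language requires only one new table to and from the transition language, not a table for every language pair.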
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Telephonic Communication Services (AREA)
- Reduction Or Emphasis Of Bandwidth Of Signals (AREA)
- Transmission Systems Not Characterized By The Medium Used For Transmission (AREA)
Abstract
Description
Claims
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2002533275A JP2004515800A (en) | 2000-10-05 | 2001-10-05 | A device that converts voice to data |
EP01979957A EP1410379A2 (en) | 2000-10-05 | 2001-10-05 | Speech to data converter |
US10/398,642 US20040049377A1 (en) | 2001-10-05 | 2001-10-05 | Speech to data converter |
KR10-2003-7004924A KR20030063357A (en) | 2000-10-05 | 2001-10-05 | Speech to data converter |
CA002425137A CA2425137A1 (en) | 2000-10-05 | 2001-10-05 | Speech to data converter |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US23816600P | 2000-10-05 | 2000-10-05 | |
US60/238,166 | 2000-10-05 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2002029781A2 true WO2002029781A2 (en) | 2002-04-11 |
WO2002029781A3 WO2002029781A3 (en) | 2002-08-22 |
Family
ID=22896760
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2001/042526 WO2002029781A2 (en) | 2000-10-05 | 2001-10-05 | Speech to data converter |
Country Status (5)
Country | Link |
---|---|
EP (1) | EP1410379A2 (en) |
JP (1) | JP2004515800A (en) |
KR (1) | KR20030063357A (en) |
CA (1) | CA2425137A1 (en) |
WO (1) | WO2002029781A2 (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4435831A (en) * | 1981-12-28 | 1984-03-06 | Mozer Forrest Shrago | Method and apparatus for time domain compression and synthesis of unvoiced audible signals |
US4741037A (en) * | 1982-06-09 | 1988-04-26 | U.S. Philips Corporation | System for the transmission of speech through a disturbed transmission path |
US4864503A (en) * | 1987-02-05 | 1989-09-05 | Toltran, Ltd. | Method of using a created international language as an intermediate pathway in translation between two national languages |
US5450522A (en) * | 1991-08-19 | 1995-09-12 | U S West Advanced Technologies, Inc. | Auditory model for parametrization of speech |
US5615301A (en) * | 1994-09-28 | 1997-03-25 | Rivers; W. L. | Automated language translation system |
US5765131A (en) * | 1986-10-03 | 1998-06-09 | British Telecommunications Public Limited Company | Language translation system and method |
US6138089A (en) * | 1999-03-10 | 2000-10-24 | Infolio, Inc. | Apparatus system and method for speech compression and decompression |
US6167374A (en) * | 1997-02-13 | 2000-12-26 | Siemens Information And Communication Networks, Inc. | Signal processing method and system utilizing logical speech boundaries |
-
2001
- 2001-10-05 KR KR10-2003-7004924A patent/KR20030063357A/en not_active Application Discontinuation
- 2001-10-05 WO PCT/US2001/042526 patent/WO2002029781A2/en not_active Application Discontinuation
- 2001-10-05 JP JP2002533275A patent/JP2004515800A/en not_active Withdrawn
- 2001-10-05 EP EP01979957A patent/EP1410379A2/en not_active Withdrawn
- 2001-10-05 CA CA002425137A patent/CA2425137A1/en not_active Abandoned
Also Published As
Publication number | Publication date |
---|---|
EP1410379A2 (en) | 2004-04-21 |
KR20030063357A (en) | 2003-07-28 |
CA2425137A1 (en) | 2002-04-11 |
WO2002029781A3 (en) | 2002-08-22 |
JP2004515800A (en) | 2004-05-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101510424B (en) | Method and system for encoding and synthesizing speech based on speech primitive | |
US4821324A (en) | Low bit-rate pattern encoding and decoding capable of reducing an information transmission rate | |
US20070106513A1 (en) | Method for facilitating text to speech synthesis using a differential vocoder | |
US6678655B2 (en) | Method and system for low bit rate speech coding with speech recognition features and pitch providing reconstruction of the spectral envelope | |
JP2006099124A (en) | Automatic voice/speaker recognition on digital radio channel | |
CN1552059A (en) | Method and apparatus for speech reconstruction in a distributed speech recognition system | |
US7970607B2 (en) | Method and system for low bit rate voice encoding and decoding applicable for any reduced bandwidth requirements including wireless | |
TWI281657B (en) | Method and system for speech coding | |
US7359853B2 (en) | Method and system for low bit rate voice encoding and decoding applicable for any reduced bandwidth requirements including wireless | |
Gomez et al. | Recognition of coded speech transmitted over wireless channels | |
US10490196B1 (en) | Method and system for low bit rate voice encoding and decoding applicable for any reduced bandwidth requirements including wireless | |
KR0155315B1 (en) | Celp vocoder pitch searching method using lsp | |
US20040049377A1 (en) | Speech to data converter | |
EP1410379A2 (en) | Speech to data converter | |
Crochiere et al. | A Variable‐Band Coding Scheme for Speech Encoding at 4.8 kb/s | |
JP3328945B2 (en) | Audio encoding device, audio encoding method, and audio decoding method | |
WO1991006945A1 (en) | Speech compression system | |
Chazan et al. | Low bit rate speech compression for playback in speech recognition systems | |
JPH0235994B2 (en) | ||
CN115938354A (en) | Audio identification method and device, storage medium and electronic equipment | |
Tan et al. | Distributed speech recognition standards | |
JP2002076904A (en) | Method of decoding coded audio signal, and decoder therefor | |
Keeler et al. | Comparison of the intelligibility of predictor coefficient and formant coded speech | |
Viswanathan et al. | Towards a minimally redundant linear predictive vocoder | |
GB2266213A (en) | Digital signal coding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A2 Designated state(s): CA JP KR US |
|
AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR |
|
AK | Designated states |
Kind code of ref document: A3 Designated state(s): CA JP KR US |
|
AL | Designated countries for regional patents |
Kind code of ref document: A3 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
WWE | Wipo information: entry into national phase |
Ref document number: 2002533275 Country of ref document: JP Ref document number: 2425137 Country of ref document: CA Ref document number: 1020037004924 Country of ref document: KR |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2001979957 Country of ref document: EP |
|
WWP | Wipo information: published in national office |
Ref document number: 1020037004924 Country of ref document: KR |
|
WWE | Wipo information: entry into national phase |
Ref document number: 10398642 Country of ref document: US |
|
WWP | Wipo information: published in national office |
Ref document number: 2001979957 Country of ref document: EP |
|
WWW | Wipo information: withdrawn in national office |
Ref document number: 2001979957 Country of ref document: EP |