WO2016165334A1 - Voice processing method and apparatus, and terminal device - Google Patents

Voice processing method and apparatus, and terminal device

Info

Publication number
WO2016165334A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
video
user
measured
standard
Prior art date
Application number
PCT/CN2015/095521
Other languages
English (en)
French (fr)
Inventor
阮卫东
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2016165334A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B 19/00 Teaching not covered by other main groups of this subclass
    • G09B 19/06 Foreign languages
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis

Definitions

  • This document relates to smart terminal technologies, for example, to a voice processing method and apparatus, and a terminal device.
  • Intelligent terminals have become widespread, and learning a foreign language on a smart terminal is convenient. Users can find many audio and video learning resources on the network: typically, audio and video resources can be downloaded through applications such as Youku and opened offline for study, or opened online directly within the application.
  • Embodiments of the present invention provide a voice processing method and apparatus, and a terminal device, to avoid the problem that a learner teaching himself a language on an intelligent terminal cannot judge whether his pronunciation is accurate.
  • An embodiment of the present invention discloses a voice processing method, where the method includes: playing audio and video; acquiring a voice signal input by a user, storing the acquired voice signal as measured audio, and pausing playback of the audio and video; acquiring audio data of the audio and video before the pause time and storing it as standard audio; comparing the measured audio with the standard audio to obtain their similarity; and displaying the similarity of the measured audio to the standard audio to the user.
  • Optionally, comparing the measured audio with the standard audio to obtain their similarity includes: pre-processing the stored measured audio and standard audio respectively; extracting feature parameters of the measured audio from the pre-processed measured audio and feature parameters of the standard audio from the pre-processed standard audio; and calculating the difference between the two sets of feature parameters to obtain the similarity between the measured audio and the standard audio.
  • Optionally, the feature parameters of the measured audio include one or more of the following: linear predictive cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), and accent-sensitive cepstral coefficients (ASCC).
  • Optionally, acquiring the audio data of the audio and video before the pause time includes: when playback is paused for the first time, acquiring all audio data played before the pause time; when playback is not paused for the first time, acquiring the audio data from the end of the most recently stored segment up to the pause time.
  • Optionally, after the similarity is displayed to the user, the method further includes: pausing playback of the audio and video upon receiving a user-initiated pause instruction, and continuing playback upon receiving a user-initiated play instruction.
  • An embodiment of the present invention also discloses a voice processing apparatus, which includes:
  • an identification unit configured to acquire a voice signal input by the user while audio and video are playing, store the acquired voice signal as measured audio, and instruct playback of the audio and video to pause when the voice signal input by the user is acquired;
  • an audio acquiring unit configured to acquire the audio data of the audio and video before the pause time and store the acquired audio data as standard audio;
  • a comparison unit configured to compare the measured audio with the standard audio to obtain the similarity between the measured audio and the standard audio;
  • a display unit configured to display the similarity between the measured audio and the standard audio to the user.
  • Optionally, the comparison unit includes:
  • a pre-processing module configured to pre-process the stored measured audio and standard audio respectively;
  • an audio feature extraction module configured to extract feature parameters of the measured audio from the pre-processed measured audio, and to extract feature parameters of the standard audio from the pre-processed standard audio;
  • an audio comparison module configured to calculate the difference between the feature parameters of the measured audio and those of the standard audio, obtaining the similarity between the measured audio and the standard audio.
  • Optionally, the feature parameters of the measured audio include one or more of the following: linear predictive cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), and accent-sensitive cepstral coefficients (ASCC).
  • Optionally, the foregoing apparatus further includes:
  • a control unit configured to, after the display unit displays the similarity between the measured audio and the standard audio to the user, pause playback of the audio and video upon receiving a user-initiated pause instruction, and continue playback upon receiving a user-initiated play instruction.
  • Optionally, the audio acquiring unit acquiring the audio data of the audio and video before the pause time includes: when playback is paused for the first time, acquiring all audio data played before the pause time; when playback is not paused for the first time, acquiring the audio data from the most recently stored segment up to the pause time.
  • An embodiment of the invention further discloses a terminal device, which comprises at least the voice processing apparatus described above.
  • Embodiments of the present invention also disclose a computer-readable storage medium storing computer-executable instructions for performing the voice processing method described above when executed by a processor.
  • With the technical solution of the present application, when learning a foreign language using the abundant language-teaching programs on the network, a user can promptly compare the similarity between his own pronunciation and the pronunciation in the teaching program, correct his pronunciation, and improve self-study efficiency. The user gains better learning interactivity and avoids the problem in existing passive learning of being unable to confirm the correctness of one's own pronunciation.
  • FIG. 1 is a flowchart of voice processing according to an embodiment of the present invention;
  • FIG. 2 is a schematic structural diagram of a voice processing apparatus according to an embodiment of the present invention.
  • FIG. 3 is a structural block diagram of a comparison unit in a voice processing device according to an embodiment of the present invention.
  • An embodiment of the invention provides a voice processing method that can improve the effectiveness of foreign-language learning through voice comparison. As shown in FIG. 1, the method includes the following steps:
  • Step S101: Play streaming audio and video; while it plays, acquire the voice signal input by the user, save the acquired voice signal, and pause playback of the streaming audio and video.
  • The streaming audio and video being played may be local or online.
  • The acquired voice signal input by the user serves as the measured audio and can be saved as such.
  • Step S102: Acquire the audio data of the streaming audio and video before the pause time, and save the audio data as standard audio.
  • The saved audio data of the streaming audio and video before the pause time serves as the standard audio and can be cached or saved to a file.
  • Step S103: Using voice comparison technology, compare the voice signal input by the user with the saved audio data to obtain their similarity (that is, the similarity between the measured audio and the standard audio); a pronunciation score may also be given to the voice signal input by the user.
  • Step S104: Display to the user the similarity result between the voice signal input by the user and the saved audio data.
  • After the similarity result is displayed, subsequent operations may follow. If the similarity is high, the user may choose to continue playing the streaming audio and video for learning; a user-initiated streaming play instruction is then received and playback continues according to the instruction. If the similarity is low, the user may choose to input the voice signal again, for example to pronounce the word again; a user-initiated streaming pause instruction is then received and playback is paused, and when the user pronounces again, the similarity between the newly input voice signal and the standard audio is compared anew, the comparison process being as described above.
  • The method provided by this embodiment may be performed by a terminal device, such as a mobile phone or a tablet computer.
  • In step S101, the user's pronunciation is written into a buffer, which can be realized with the recording functionality provided by the Android system. The Android system provides two APIs for recording: android.media.AudioRecord and android.media.MediaRecorder.
  • AudioRecord mainly supports record-while-playing (AudioRecord + AudioTrack) and real-time processing of audio; various audio containers can be implemented in code. Its output, however, is raw PCM data, which cannot be played by a player if saved directly as an audio file, so code must first be written to encode and compress the PCM data. With this approach it is possible, for example, to record with the AudioRecord class and wrap the output in a WAV container; a 20 s recording yields a WAV audio file of roughly 3.5 MB.
  • The WAV format has high recording quality, but its compression ratio is low and files are large. The AAC format offers better sound quality and smaller files than mp3, with lossy compression, and is generally playable on Apple systems or on Android SDK 4.1.2 (API 16) and above. MediaRecorder, by contrast, integrates recording, encoding, and compression, but cannot process audio in real time.
  • The amr format has a comparatively high compression ratio but relatively poor recording quality, and is mostly used for voice and call recording. Since the user's recordings are not long, to obtain a better comparison result this embodiment preferably uses AudioRecord recording, with WAV as the recording file format; a sketch of this approach is given below.
  • In the process of acquiring the audio data of the streaming audio and video before the pause time, if the streaming audio and video is paused for the first time, all audio data of the streaming audio and video played before the pause time can be acquired; of course, how much audio data can be obtained depends on the size of the buffer allocated by the system. When playback is not paused for the first time, the audio data from the most recently buffered segment up to the pause time may be acquired. That is, once the user, satisfied with the previous pronunciation-comparison result, chooses to continue playing the streaming audio and video for learning, the terminal device starts to acquire new audio data.
  • An Android terminal can obtain the audio being played by the Android system as follows.
  • The bottom layer of the Android audio system is the /dev/eac device file, and AudioHardwareInterface, the hardware abstraction layer, is the interface between AudioFlinger and the audio hardware.
  • Downwards, AudioFlinger accesses AudioHardware to output audio data and control audio parameters; upwards, AudioFlinger provides services through the IAudioFlinger interface. AudioFlinger therefore plays a linking role in the audio framework of the Android system, and system audio can be obtained through AudioFlinger.
  • AudioFlinger is implemented in the frameworks/base/services/audioflinger/ directory of the native code, where the thread function AudioFlinger::MixerThread::threadLoop() can be found. Its function is to mix the audio data of all tracks and write the mixed audio data to the underlying audio device. Searching for mOutput->write in the thread's code locates the point at which the mixed audio data is written to the buffer and handed to hardware-related code for processing.
  • The audio data is stored in mMixBuffer as PCM (Pulse Code Modulation) data; writing the audio data in this buffer to a specified file, such as /data/wav.raw, yields the audio data the system is currently playing. A sketch of loading such a dump for later comparison follows.
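  • For illustration, assuming such a headerless PCM dump exists (the path /data/wav.raw and a 16-bit little-endian sample layout are assumptions here, not framework guarantees), loading it into sample form for later comparison might look like this:

    import java.io.ByteArrayOutputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class PcmLoader {
        /** Reads a headerless 16-bit little-endian PCM dump into a sample array. */
        public static short[] loadRawPcm(String path) throws IOException {
            ByteArrayOutputStream all = new ByteArrayOutputStream();
            try (InputStream in = new FileInputStream(path)) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    all.write(buf, 0, n);
                }
            }
            byte[] bytes = all.toByteArray();
            short[] samples = new short[bytes.length / 2];
            for (int i = 0; i < samples.length; i++) {
                int lo = bytes[2 * i] & 0xFF;   // low byte, unsigned
                int hi = bytes[2 * i + 1];      // high byte carries the sign
                samples[i] = (short) ((hi << 8) | lo);
            }
            return samples;
        }
    }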
  • In step S103, an existing audio comparison technique can be used to compare the user's pronunciation (the measured audio) with the teaching audio (the standard audio).
  • In this embodiment, audio comparison is performed by extracting audio feature parameters and comparing audio similarity. The process mainly includes the following operations: pre-processing the buffered measured audio and standard audio respectively; computing and extracting the feature parameters of each; and calculating the difference between the feature parameters of the measured audio and those of the standard audio to obtain their similarity.
  • The feature parameters involved in this embodiment may be Mel Frequency Cepstrum Coefficients (MFCC), Linear Predictive Cepstrum Coefficients (LPCC), Accent Sensitive Cepstrum Coefficients (ASCC), and the like, which are not particularly limited here.
  • This embodiment further provides a voice processing apparatus, as shown in FIG. 2, including:
  • an identification unit 201 configured to acquire the user's pronunciation (the measured audio) while streaming audio and video is playing, store the acquired pronunciation, and send a streaming pause instruction to the control unit 205;
  • an audio acquiring unit 202 configured to acquire the audio data (the standard audio) of the streaming audio and video before the pause time, where the streaming audio and video may be local or online;
  • a comparison unit 203 configured to obtain the teaching audio from the audio acquiring unit 202 and the user's pronunciation from the identification unit 201, compare the user's pronunciation with the teaching audio using voice comparison technology to obtain their similarity (that is, compare the measured audio with the standard audio), and pass the comparison result to the display unit 204;
  • a display unit 204 configured to display the comparison result to the user, that is, to display the similarity between the measured audio and the standard audio;
  • a control unit 205 configured to control playback of the local or online streaming audio and video.
  • In one application scenario, the user downloads a foreign-language teaching program (audio, or audio and video) from the Internet and plays it to start learning. After the teacher in the program pronounces a word, the voice processing apparatus buffers the audio of the teacher's pronunciation; the user can immediately repeat the word, whereupon the apparatus recognizes the user's voice, pauses playback of the program, and buffers the user's pronunciation. The apparatus then compares the user's pronunciation with the teacher's and gives a comparison result; if the similarity is high, the user will generally choose to continue playing the program and keep learning.
  • The voice processing apparatus may further include a control unit configured such that, after the display unit displays the similarity between the measured audio and the standard audio to the user, playback of the streaming audio and video is paused if a user-initiated pause instruction is received, and continues if a user-initiated play instruction is received.
  • In another application scenario, the user plays a foreign-language teaching program (audio, or audio and video) online. After the teacher in the program pronounces a word, the apparatus buffers the teacher's audio; the user can immediately repeat it, and the apparatus recognizes the user's voice, pauses the program, and buffers the user's pronunciation. After the repetition ends, the apparatus compares the user's pronunciation with the teaching audio and gives a comparison result. If the similarity between the user's pronunciation and the teacher's is high, the user will generally continue playing the program to keep learning; if it is low, the user can pronounce the word again and the apparatus repeats the comparison on the new pronunciation.
  • The comparison unit 203 includes a pre-processing module 301, an audio feature extraction module 302, and an audio comparison module 303.
  • The pre-processing module 301 is configured to pre-process the voice signals; pre-processing generally includes digitization of the voice signal, pre-emphasis, and endpoint detection, applied to the measured audio and the standard audio respectively.
  • The audio feature extraction module 302 is configured to compute and extract key parameters reflecting the characteristics of the voice signal, that is, to compute and extract the feature parameters of the measured audio and of the standard audio respectively; commonly used feature parameters include linear predictive cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), and accent-sensitive parameters (ASCC).
  • The audio comparison module 303 is configured to reflect the similarity between the user's voice signal and the teaching audio by calculating the difference between their extracted feature parameters.
  • Since MFCC better reflects speech characteristics in practical applications and has good noise robustness, and has proved to be one of the most successful feature descriptions in speech recognition, the embodiment of the present invention takes MFCC as the feature parameter to illustrate the specific comparison process.
  • MFCC parameters combine the auditory perception characteristics of the human ear with the mechanism of speech production. The Mel frequency can be expressed by the following formula, where lg denotes the base-10 logarithm:
  • Mel(f) = 2595 × lg(1 + f/700)
  • The extraction of MFCC parameters includes the following steps:
  • Step (1): pre-emphasis. The sampled digital speech signal s(n) is passed through a high-pass filter H(z) = 1 - a×z^(-1), with 0.9 < a < 1.0 (a is generally about 0.95); the pre-emphasized signal is S'(n) = s(n) - a×s(n-1). Because of the effect of the vocal cords and lips during phonation, the amplitude of high-frequency formants is lower than that of low-frequency formants; the purpose of pre-emphasis is to cancel the effects of the vocal cords and lips and to compensate the high-frequency part of the speech signal.
  • Step (2): frame blocking. A frame is generally 10-20 ms; to avoid missing signal at the window boundaries, consecutive frames overlap, the frame shift usually being half the frame length, which keeps the characteristics from changing too much between frames.
  • Step (3): short-time energy. The short-time energy represents the volume, that is, the amplitude of the sound; fine noise in the speech signal can be filtered out according to this energy value. When the energy of a frame is below a predetermined threshold, the frame is treated as silence.
  • Step (4): windowing. Each frame is multiplied by a window function, with values outside the window set to 0, to eliminate signal discontinuities that may arise at the two ends of each frame; the Hamming window, w(n) = 0.54 - 0.46×cos(2πn/(N-1)) within the window, is commonly used.
  • Step (5): fast Fourier transform. The windowed frames are usually passed through an FFT (Fast Fourier Transform) to obtain the spectral parameters of each frame.
  • Step (6): triangular band-pass filtering. The spectral parameters of each frame are passed through a set of N triangular band-pass filters (N generally 20 to 30); the output of each band is logarithmized to obtain each output's log energy Ek, and the N parameters are then cosine transformed to obtain the Mel-scale cepstrum parameters of order L.
  • In the comparison unit described above, besides Mel-frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC) and accent-sensitive parameters (ASCC) may also be used to extract the feature parameters of the voice signal.
  • When the audio acquiring unit acquires the audio data of the streaming audio and video before the pause time, if the streaming audio and video is paused for the first time, all audio data played before the pause time is acquired; how much audio data can be obtained also depends on the size of the buffer allocated to the audio acquiring unit. When playback is not paused for the first time, the audio data from the most recently buffered segment up to the pause time may be acquired; that is, once the user, satisfied with the previous pronunciation-comparison result, chooses to continue playing the streaming audio and video for learning, the terminal device starts acquiring new audio data.
  • Since the voice processing apparatus described above can be placed in a terminal device, this embodiment further provides a terminal device that includes at least the voice processing apparatus; the working principle of the voice processing apparatus is as described above and is not repeated here.
  • An embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions for performing the voice processing method described in the foregoing embodiments when executed by a processor.
  • Although streaming audio and video is taken as an example in the above embodiments, embodiments of the present invention are not limited to streaming audio and video; audio and video in other formats may also be used, without limitation here.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A voice processing method and apparatus, and a terminal device. The method includes: playing audio and video; acquiring a voice signal input by a user, storing the voice signal input by the user as measured audio, and pausing playback of the audio and video (S101); acquiring audio data of the audio and video before the pause time, and storing the audio data as standard audio (S102); comparing the measured audio with the standard audio to obtain a similarity between the measured audio and the standard audio (S103); and displaying the similarity of the measured audio to the standard audio to the user (S104). Also disclosed are a voice processing apparatus and a terminal device containing the voice processing apparatus. The technical solution can promptly compare the similarity between the user's pronunciation and the teaching pronunciation, correct the pronunciation, and improve self-study efficiency.

Description

Voice processing method and apparatus, and terminal device
Technical Field
This document relates to smart terminal technologies, for example, to a voice processing method and apparatus, and a terminal device.
Background
Intelligent terminals have become widespread, and learning a foreign language on a smart terminal is convenient. Users can find many audio and video learning resources on the network: typically, audio and video resources can be downloaded through applications such as Youku and opened offline for study, or opened online directly within the application.
When learning a foreign language with video resources from sites such as Youku or with audio resources from other language-learning websites, after the teacher in the video or audio teaches the pronunciation of a phonetic symbol or word, the learner can repeat it, but cannot determine whether his own pronunciation is exactly the same as the teacher's and can only judge by feel; the learning effect is therefore poor.
Summary
Embodiments of the present invention provide a voice processing method and apparatus, and a terminal device, to avoid the problem that a learner teaching himself a language on an intelligent terminal cannot judge whether his pronunciation is accurate.
To this end, an embodiment of the present invention discloses a voice processing method, the method including:
playing audio and video;
acquiring a voice signal input by a user, storing the voice signal input by the user as measured audio, and pausing playback of the audio and video;
acquiring the audio data of the audio and video before the pause time, and storing the audio data as standard audio;
comparing the measured audio with the standard audio to obtain a similarity between the measured audio and the standard audio;
displaying the similarity of the measured audio to the standard audio to the user.
Optionally, in the above method, comparing the measured audio with the standard audio to obtain the similarity between them includes:
pre-processing the stored measured audio and standard audio respectively;
extracting feature parameters of the measured audio from the pre-processed measured audio, and extracting feature parameters of the standard audio from the pre-processed standard audio;
calculating the difference between the feature parameters of the measured audio and the feature parameters of the standard audio to obtain the similarity between the measured audio and the standard audio.
Optionally, in the above method, the feature parameters of the measured audio include one or more of the following:
linear predictive cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), and accent-sensitive parameters (ASCC).
Optionally, in the above method, acquiring the audio data of the audio and video before the pause time includes:
when playback of the audio and video is paused for the first time, acquiring all audio data of the audio and video played before the pause time;
when playback is not paused for the first time, acquiring the audio data of the audio and video from the most recently stored segment up to the pause time.
Optionally, in the above method, after the similarity between the measured audio and the standard audio is displayed to the user, the method further includes:
pausing playback of the audio and video upon receiving a user-initiated audio and video pause instruction;
continuing playback of the audio and video upon receiving a user-initiated audio and video play instruction.
An embodiment of the present invention also discloses a voice processing apparatus, the apparatus including:
an identification unit configured to acquire a voice signal input by a user while audio and video are playing, store the acquired voice signal as measured audio, and instruct playback of the audio and video to pause when the voice signal input by the user is acquired;
an audio acquiring unit configured to acquire the audio data of the audio and video before the pause time, and store the acquired audio data as standard audio;
a comparison unit configured to compare the measured audio with the standard audio to obtain a similarity between the measured audio and the standard audio;
a display unit configured to display the similarity between the measured audio and the standard audio to the user.
Optionally, in the above apparatus, the comparison unit includes:
a pre-processing module configured to pre-process the stored measured audio and standard audio respectively;
an audio feature extraction module configured to extract feature parameters of the measured audio from the pre-processed measured audio, and to extract feature parameters of the standard audio from the pre-processed standard audio;
an audio comparison module configured to calculate the difference between the feature parameters of the measured audio and the feature parameters of the standard audio, obtaining the similarity between the measured audio and the standard audio.
Optionally, in the above apparatus, the feature parameters of the measured audio include one or more of the following:
linear predictive cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), and accent-sensitive parameters (ASCC).
Optionally, the above apparatus further includes:
a control unit configured to, after the display unit displays the similarity between the measured audio and the standard audio to the user, pause playback of the audio and video upon receiving a user-initiated audio and video pause instruction, and continue playback upon receiving a user-initiated audio and video play instruction.
Optionally, in the above apparatus, the audio acquiring unit acquiring the audio data of the audio and video before the pause time includes:
when playback of the audio and video is paused for the first time, acquiring all audio data of the audio and video played before the pause time;
when playback is not paused for the first time, acquiring the audio data from the most recently stored segment up to the pause time.
An embodiment of the present invention further discloses a terminal device including at least the voice processing apparatus described above.
An embodiment of the present invention further discloses a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, are used to perform the voice processing method described above.
With the technical solution of the present application, when learning a foreign language using the abundant language-teaching programs on the network, a user can promptly compare the similarity between his own pronunciation and the pronunciation in the teaching program, correct his pronunciation, and improve self-study efficiency; the user gains better learning interactivity and avoids the problem in existing passive learning of being unable to confirm the correctness of one's own pronunciation.
Brief Description of the Drawings
FIG. 1 is a flowchart of voice processing according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a voice processing apparatus according to an embodiment of the present invention;
FIG. 3 is a structural block diagram of a comparison unit in a voice processing apparatus according to an embodiment of the present invention.
Detailed Description
It should be noted that, in case of no conflict, the embodiments of the present application and the features in the embodiments may be combined with one another arbitrarily.
An embodiment of the present invention provides a voice processing method that can improve the effectiveness of foreign-language learning through voice comparison. As shown in FIG. 1, the method includes the following steps:
Step S101: Play streaming audio and video; while the streaming audio and video plays, acquire the voice signal input by the user, save the acquired voice signal, and pause playback of the streaming audio and video.
Here, the streaming audio and video being played may be local or online.
The acquired voice signal input by the user serves as the measured audio; in this step, the voice signal input by the user may be saved as the measured audio.
Step S102: Acquire the audio data of the streaming audio and video before the pause time, and save the audio data as standard audio.
Here, the saved audio data of the streaming audio and video before the pause time may serve as the standard audio, cached or saved to a file.
Step S103: Using voice comparison technology, compare the voice signal input by the user with the saved audio data to obtain their similarity (that is, the similarity between the measured audio and the standard audio); a pronunciation score may also be given to the voice signal input by the user.
Step S104: Display to the user the similarity result between the voice signal input by the user and the saved audio data.
In addition, after the similarity result is displayed to the user as described above, subsequent operations may be performed. If the similarity is high, the user may choose to continue playing the streaming audio and video for learning; a user-initiated streaming play instruction is then received and playback continues according to the instruction. If the similarity is low, the user may choose to input the voice signal again, for example to pronounce the word again; a user-initiated streaming pause instruction is then received and playback is paused, and when the user chooses to pronounce again, the similarity between the newly input voice signal and the standard audio can be compared anew, the specific comparison process being as described above.
The method provided by this embodiment may be performed by a terminal device, which may be a mobile phone, a tablet computer, or the like.
The following takes a terminal running the Android system as an example to describe a specific implementation of the above method.
In step S101, the user's pronunciation is written into a buffer, which can be achieved with the recording functionality provided by the Android system. The Android system provides two APIs for recording: android.media.AudioRecord and android.media.MediaRecorder.
1. AudioRecord
AudioRecord mainly supports record-while-playing (AudioRecord + AudioTrack) and real-time processing of audio. This approach permits real-time processing of speech, and various audio containers can be implemented in code; its output, however, is raw PCM data, which cannot be played by a player if saved directly as an audio file, so code must first be written to encode and compress the PCM data. With this, it is possible, for example, to record with the AudioRecord class and wrap the output in a WAV container; a 20 s recording yields a WAV audio file of roughly 3.5 MB.
2. MediaRecorder
MediaRecorder already integrates recording, encoding, and compression, and supports a small number of recording audio formats, such as .aac (API 16), .amr, and .3gp. Since MediaRecorder integrates most of this functionality, the relevant interfaces can be called directly with little code; however, MediaRecorder cannot process audio in real time, and its output formats are limited (it cannot output mp3 files, for example). Recording with the MediaRecorder class can output an amr file; a 20 s recording yields an amr audio file of roughly 33 KB (a short sketch follows below).
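By way of illustration only (the output path and stop timing are placeholders, not taken from the patent), recording to an AMR file with MediaRecorder can look like the following; the RECORD_AUDIO permission is required:

    import android.media.MediaRecorder;
    import java.io.IOException;

    public class AmrRecorder {
        private MediaRecorder recorder;

        /** Starts an AMR-NB recording to outputPath. */
        public void start(String outputPath) throws IOException {
            recorder = new MediaRecorder();
            recorder.setAudioSource(MediaRecorder.AudioSource.MIC);
            recorder.setOutputFormat(MediaRecorder.OutputFormat.AMR_NB);
            recorder.setAudioEncoder(MediaRecorder.AudioEncoder.AMR_NB);
            recorder.setOutputFile(outputPath);
            recorder.prepare();   // encoding and compression are handled internally
            recorder.start();
        }

        /** Stops the recording and releases the recorder. */
        public void stop() {
            recorder.stop();
            recorder.release();
            recorder = null;
        }
    }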
Comparing the two recording approaches: the WAV format has high recording quality but a low compression ratio and large files; the AAC format, relative to mp3, has better sound quality and smaller files with lossy compression, and is generally playable on Apple systems or on Android SDK 4.1.2 (API 16) and above; the amr format has a comparatively high compression ratio but relatively poor recording quality, and is mostly used for voice and call recording. Since the user's recordings are not long, to obtain a better comparison result this embodiment preferably uses AudioRecord recording, with WAV as the recording file format.
It should be noted that, in the process of acquiring the audio data of the streaming audio and video before the pause time, if the streaming audio and video is paused for the first time, all audio data of the streaming audio and video played before the pause time can be acquired; of course, how much audio data can be obtained also depends on the buffer size allocated by the system. When playback is not paused for the first time, the audio data from the most recently buffered segment of the streaming audio and video up to the pause time may be acquired; that is, once the user, satisfied with the previous pronunciation-comparison result, chooses to continue playing the streaming audio and video for learning, the terminal device starts acquiring new audio data.
In step S102, an Android terminal can obtain the audio the Android system is playing as follows. The bottom layer of the Android audio system is the /dev/eac device file, and AudioHardwareInterface, the hardware abstraction layer interface, is the interface between AudioFlinger and the audio hardware. Downwards, AudioFlinger accesses AudioHardware to output audio data and control audio parameters; upwards, AudioFlinger provides services through the IAudioFlinger interface. AudioFlinger therefore plays a linking role in the audio framework of the Android system, and system audio can be obtained through AudioFlinger.
AudioFlinger is implemented in the frameworks/base/services/audioflinger/ directory of the native code. In the file AudioFlinger.cpp, the thread function AudioFlinger::MixerThread::threadLoop() can be found; it mixes the audio data of all tracks and writes the mixed audio data to the underlying audio device. Searching for mOutput->write in this thread's code locates the following code:
mLastWriteTime = systemTime();
mInWrite = true;
mBytesWritten += mixBufferSize;
int bytesWritten = (int) mOutput->write(mMixBuffer, mixBufferSize);
if (bytesWritten < 0) mBytesWritten -= mixBufferSize;
mNumWrites++;
mInWrite = false;
It is at this point in the code that the mixed audio data is written into the buffer and passed on to hardware-related code for processing. The audio data is stored in mMixBuffer and is PCM (Pulse Code Modulation) audio data; writing the audio data in this buffer to a specified file, such as /data/wav.raw, yields the audio data the system is currently playing.
On an Android terminal, after an audio and video pause instruction is received, mediaPlayer.pause can be used to pause playback; a minimal sketch of this control flow follows.
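As a minimal sketch of that control flow (assuming an already prepared android.media.MediaPlayer; class and method names here are illustrative):

    import android.media.MediaPlayer;

    public class PlaybackController {
        private final MediaPlayer mediaPlayer;

        public PlaybackController(MediaPlayer mediaPlayer) {
            this.mediaPlayer = mediaPlayer;
        }

        /** Called when the user's voice is detected or a pause instruction arrives. */
        public void onPauseInstruction() {
            if (mediaPlayer.isPlaying()) {
                mediaPlayer.pause();   // the playback position is retained
            }
        }

        /** Called when the user chooses to continue learning. */
        public void onPlayInstruction() {
            if (!mediaPlayer.isPlaying()) {
                mediaPlayer.start();   // resumes from the paused position
            }
        }
    }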
In step S103, an existing audio comparison technique can be used to compare the user's pronunciation (the measured audio) with the teaching audio (the standard audio).
In this embodiment, audio comparison is implemented by extracting audio feature parameters and comparing audio similarity. The process mainly includes the following operations:
pre-processing the buffered measured audio and standard audio respectively;
computing over the pre-processed measured audio to extract the feature parameters of the measured audio, and computing over the pre-processed standard audio to extract the feature parameters of the standard audio;
calculating the difference between the feature parameters of the measured audio and the feature parameters of the standard audio to obtain the similarity between the measured audio and the standard audio.
The feature parameters involved in this embodiment may be Mel Frequency Cepstrum Coefficients (MFCC), Linear Predictive Cepstrum Coefficients (LPCC), Accent Sensitive Cepstrum Coefficients (ASCC), and so on, without particular limitation here; a sketch of one possible similarity computation is given below.
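The embodiment does not fix a particular distance measure between the two sets of feature parameters. One common choice for comparing frame sequences of different lengths is dynamic time warping (DTW); the sketch below assumes DTW over feature frames (e.g., MFCC vectors), and the mapping of the normalized distance to a 0-100 score is invented purely for illustration:

    public class AudioComparator {
        /** Euclidean distance between two feature vectors (e.g., MFCC frames). */
        private static double dist(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                s += d * d;
            }
            return Math.sqrt(s);
        }

        /** Accumulated DTW distance between two frame sequences, normalized by length. */
        public static double dtwDistance(double[][] x, double[][] y) {
            int n = x.length, m = y.length;
            double[][] c = new double[n + 1][m + 1];
            for (double[] row : c) java.util.Arrays.fill(row, Double.POSITIVE_INFINITY);
            c[0][0] = 0;
            for (int i = 1; i <= n; i++) {
                for (int j = 1; j <= m; j++) {
                    double d = dist(x[i - 1], y[j - 1]);
                    c[i][j] = d + Math.min(c[i - 1][j - 1], Math.min(c[i - 1][j], c[i][j - 1]));
                }
            }
            return c[n][m] / (n + m);
        }

        /** Maps the normalized distance to a 0-100 similarity score (illustrative scale). */
        public static int similarityScore(double[][] measured, double[][] standard) {
            double d = dtwDistance(measured, standard);
            return (int) Math.round(100.0 / (1.0 + d));
        }
    }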
This embodiment further provides a voice processing apparatus, as shown in FIG. 2, including:
an identification unit 201 configured to acquire the user's pronunciation (the measured audio) while streaming audio and video is playing, store the acquired pronunciation, and send a streaming pause instruction to the control unit 205;
an audio acquiring unit 202 configured to acquire the audio data (the standard audio) of the streaming audio and video before the pause time, where the streaming audio and video may be local or online;
a comparison unit 203 configured to obtain the teaching audio from the audio acquiring unit 202 and the user's pronunciation from the identification unit 201, compare the user's pronunciation with the teaching audio using voice comparison technology to obtain their similarity (that is, compare the measured audio with the standard audio), and pass the comparison result to the display unit 204;
a display unit 204 configured to display the comparison result to the user, that is, to display the similarity between the measured audio and the standard audio;
a control unit 205 configured to control playback of the local or online streaming audio and video.
The working process of the above voice processing apparatus is described below with reference to specific application scenarios.
In one application scenario, the user downloads a foreign-language teaching program (audio, or audio and video) from the Internet and plays it to start learning. After the teacher in the program pronounces a word, the voice processing apparatus buffers the audio of the teacher's pronunciation; the user can immediately repeat the word, whereupon the apparatus recognizes the user's voice, pauses playback of the program, and buffers the user's pronunciation. After the repetition ends, the apparatus compares the user's pronunciation with the teacher's and gives a comparison result. If the similarity between the user's pronunciation and the teacher's is high, the user will generally choose to continue playing the program and keep learning; if it is low, the user may choose to pronounce the word again, and the apparatus performs the above comparison again on the new pronunciation. The voice processing apparatus may further include a control unit configured such that, after the display unit displays the similarity between the measured audio and the standard audio to the user, playback of the streaming audio and video is paused if a user-initiated streaming pause instruction is received, and continues if a user-initiated streaming play instruction is received.
In another application scenario, the user chooses to play a foreign-language teaching program (audio, or audio and video) online. After the teacher in the program pronounces a word, the voice processing apparatus buffers the teacher's audio; the user can immediately repeat it, and the apparatus recognizes the user's voice, pauses the program, and buffers the user's pronunciation. After the repetition ends, the apparatus compares the user's pronunciation with the teaching audio and gives a comparison result. If the similarity between the user's pronunciation and the teacher's is high, the user will generally continue playing the program to keep learning; if it is low, the user can pronounce the word again and the apparatus repeats the above comparison on the new pronunciation.
Referring to FIG. 3, the process by which the comparison unit 203 performs voice comparison is described below. The comparison unit 203 includes a pre-processing module 301, an audio feature extraction module 302, and an audio comparison module 303.
The pre-processing module 301 is configured to pre-process the voice signals; pre-processing generally includes digitization of the voice signal, pre-emphasis, endpoint detection, and so on, applied mainly to the measured audio and the standard audio respectively.
The audio feature extraction module 302 is configured to compute and extract key parameters reflecting the characteristics of the voice signal, that is, to compute and extract the feature parameters of the measured audio and of the standard audio respectively; commonly used feature parameters include linear predictive cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), and accent-sensitive parameters (ASCC).
The audio comparison module 303 is configured to reflect the similarity between the user's voice signal and the teaching audio by calculating the difference between the extracted feature parameters of the user's voice signal and those of the teaching audio.
Since MFCC better reflects speech characteristics in practical applications and has good noise robustness, and has proved to be one of the most successful feature descriptions in the field of speech recognition, the embodiment of the present invention takes MFCC as the feature parameter to explain the specific comparison process.
MFCC parameters combine the auditory perception characteristics of the human ear with the mechanism of speech production. The Mel frequency can be expressed by the following formula, where lg denotes the base-10 logarithm:
Mel(f) = 2595 × lg(1 + f/700)
The extraction of MFCC parameters includes the following steps:
Step (1): pre-emphasis
The sampled digital speech signal s(n) is passed through a high-pass filter H(z) = 1 - a×z^(-1), with 0.9 < a < 1.0 (a is generally about 0.95). The signal after pre-emphasis by this high-pass filter is S'(n) = s(n) - a×s(n-1). Because of the effect of the vocal cords and lips during phonation, the amplitude of high-frequency formants is lower than that of low-frequency formants; the purpose of pre-emphasis is to cancel the effects of the vocal cords and lips and to compensate the high-frequency part of the speech signal.
Step (2): frame blocking
A frame is generally 10-20 ms. To avoid the window boundaries missing part of the signal, consecutive frames must overlap when the frame is shifted; the frame shift is usually half the frame length, that is, the next frame starts half a frame after the previous one, which prevents the characteristics from changing too much from frame to frame.
Step (3): short-time energy
The short-time energy represents the volume, that is, the amplitude of the sound, and some fine noise in the speech signal can be filtered out according to this energy value. When the energy of a frame is below a predetermined threshold, the frame is treated as silence.
Step (4): windowing
Speech varies continuously over long spans and cannot be processed without stable characteristics, so each frame is multiplied by a window function, with values outside the window set to 0; the aim is to eliminate signal discontinuities that may arise at the two ends of each frame. Common window functions include the rectangular window, the Hamming window, and the Hanning window; given its frequency-domain characteristics, the Hamming window is usually adopted, defined within the window as w(n) = 0.54 - 0.46×cos(2πn/(N-1)), where N is the frame length.
Step (5): fast Fourier transform (FFT)
Since the speech signal changes rapidly and unstably in the time domain, it is usually transformed into the frequency domain for observation, where its spectrum varies slowly over time. The windowed frames are therefore passed through an FFT (Fast Fourier Transform) to obtain the spectral parameters of each frame.
Step (6): triangular band-pass filtering
The spectral parameters of each frame are passed through a Mel-scale filter bank composed of N triangular band-pass filters (N is generally 20-30), and the output of each frequency band is logarithmized to obtain the log energy Ek of each output, k = 1, 2, ..., N. These N parameters are then cosine transformed to obtain the Mel-scale cepstrum parameters of order L. A compact code sketch of this front end follows.
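To make steps (1) to (6) concrete, the following is a compact sketch of the front end: pre-emphasis, framing with a half-frame shift, Hamming windowing, magnitude spectrum, log Mel filter-bank energies, and a cosine transform. The parameter choices (a = 0.95, N = 26 filters, L = 13 coefficients) and the naive DFT in place of an optimized FFT are assumptions for illustration, not the patent's reference implementation:

    public class MfccFrontEnd {
        static final double A = 0.95;     // pre-emphasis coefficient, step (1)
        static final int N_FILTERS = 26;  // triangular filters, within the typical 20-30
        static final int N_CEPS = 13;     // order L of the cepstrum

        static double mel(double f) { return 2595.0 * Math.log10(1.0 + f / 700.0); }
        static double melInv(double m) { return 700.0 * (Math.pow(10.0, m / 2595.0) - 1.0); }

        /** Step (1): S'(n) = s(n) - a*s(n-1). */
        static double[] preEmphasize(double[] s) {
            double[] out = new double[s.length];
            out[0] = s[0];
            for (int n = 1; n < s.length; n++) out[n] = s[n] - A * s[n - 1];
            return out;
        }

        /** Steps (2) and (4): frames with a half-frame shift, each Hamming-windowed. */
        static double[][] frames(double[] s, int frameLen) {
            int shift = frameLen / 2;
            int count = Math.max(0, (s.length - frameLen) / shift + 1);
            double[][] out = new double[count][frameLen];
            for (int f = 0; f < count; f++)
                for (int n = 0; n < frameLen; n++)
                    out[f][n] = s[f * shift + n]
                            * (0.54 - 0.46 * Math.cos(2 * Math.PI * n / (frameLen - 1)));
            return out;
        }

        /** Step (5): magnitude spectrum via a naive DFT (an FFT would be used in practice). */
        static double[] magnitudeSpectrum(double[] frame) {
            int bins = frame.length / 2 + 1;
            double[] mag = new double[bins];
            for (int k = 0; k < bins; k++) {
                double re = 0, im = 0;
                for (int n = 0; n < frame.length; n++) {
                    double ang = -2 * Math.PI * k * n / frame.length;
                    re += frame[n] * Math.cos(ang);
                    im += frame[n] * Math.sin(ang);
                }
                mag[k] = Math.hypot(re, im);
            }
            return mag;
        }

        /** Step (6): log energies Ek through Mel-spaced triangular filters, then a DCT. */
        static double[] mfcc(double[] mag, double sampleRate) {
            int nfft = (mag.length - 1) * 2;
            double[] edges = new double[N_FILTERS + 2];   // filter edge/center bins
            double maxMel = mel(sampleRate / 2);
            for (int i = 0; i < edges.length; i++)
                edges[i] = melInv(maxMel * i / (N_FILTERS + 1)) * nfft / sampleRate;
            double[] logE = new double[N_FILTERS];
            for (int f = 0; f < N_FILTERS; f++) {
                double e = 0;
                for (int k = (int) edges[f]; k < edges[f + 2] && k < mag.length; k++) {
                    double w = k < edges[f + 1]
                            ? (k - edges[f]) / (edges[f + 1] - edges[f] + 1e-9)
                            : (edges[f + 2] - k) / (edges[f + 2] - edges[f + 1] + 1e-9);
                    if (w > 0) e += w * mag[k] * mag[k];
                }
                logE[f] = Math.log(Math.max(e, 1e-10));   // floor to avoid log(0)
            }
            double[] ceps = new double[N_CEPS];           // cosine transform of N log energies
            for (int l = 0; l < N_CEPS; l++)
                for (int f = 0; f < N_FILTERS; f++)
                    ceps[l] += logE[f] * Math.cos(Math.PI * l * (f + 0.5) / N_FILTERS);
            return ceps;
        }
    }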
It should also be noted that, in the above comparison unit, besides Mel-frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC) and accent-sensitive parameters (ASCC) may also be used to extract the feature parameters of the voice signal.
In addition, when the above audio acquiring unit acquires the audio data of the streaming audio and video before the pause time, if the streaming audio and video is paused for the first time, all audio data of the streaming audio and video played before the pause time is acquired; of course, how much audio data can be acquired also depends on the buffer size allocated to the audio acquiring unit. When playback is not paused for the first time, the audio data from the most recently buffered segment of the streaming audio and video up to the pause time may be acquired; that is, once the user, satisfied with the previous pronunciation-comparison result, chooses to continue playing the streaming audio and video for learning, the terminal device starts acquiring new audio data.
Since the above voice processing apparatus can be placed in a terminal device, this embodiment also separately provides a terminal device that includes at least the above voice processing apparatus; the working principle of the voice processing apparatus is as described above and is not repeated here.
An embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, are used to perform the voice processing method described in the above embodiments.
Although streaming audio and video is taken as an example in the above embodiments, embodiments of the present invention are not limited to streaming audio and video; audio and video in other formats may also be used, without limitation here.
Those of ordinary skill in the art will understand that all or part of the steps of the above method may be completed by a program instructing relevant hardware, the program being storable in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc. Optionally, all or part of the steps of the above embodiments may also be implemented with one or more integrated circuits. Accordingly, the modules/units in the above embodiments may be implemented in the form of hardware or in the form of software functional modules. The present application is not limited to any specific combination of hardware and software.
This document is described with reference to flowcharts and/or block diagrams of methods, apparatuses, terminal devices, and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
The above are only embodiments of the present invention and are not intended to limit the protection scope of the present invention.

Claims (12)

  1. A voice processing method, comprising:
    playing audio and video;
    acquiring a voice signal input by a user, storing the voice signal input by the user as measured audio, and pausing playback of the audio and video;
    acquiring audio data of the audio and video before the pause time, and storing the audio data as standard audio;
    comparing the measured audio with the standard audio to obtain a similarity between the measured audio and the standard audio;
    displaying the similarity of the measured audio to the standard audio to the user.
  2. The method according to claim 1, wherein comparing the measured audio with the standard audio to obtain the similarity between the measured audio and the standard audio comprises:
    pre-processing the stored measured audio and standard audio respectively;
    extracting feature parameters of the measured audio from the pre-processed measured audio, and extracting feature parameters of the standard audio from the pre-processed standard audio;
    calculating the difference between the feature parameters of the measured audio and the feature parameters of the standard audio to obtain the similarity between the measured audio and the standard audio.
  3. The method according to claim 2, wherein the feature parameters of the measured audio comprise at least one of linear predictive cepstral coefficients, Mel-frequency cepstral coefficients, and accent-sensitive parameters.
  4. The method according to any one of claims 1 to 3, wherein acquiring the audio data of the audio and video before the pause time comprises:
    when playback of the audio and video is paused for the first time, acquiring all audio data of the audio and video played before the pause time;
    when playback is not paused for the first time, acquiring the audio data of the audio and video from the most recently stored segment up to the pause time.
  5. The method according to claim 4, wherein, after the similarity between the measured audio and the standard audio is displayed to the user, the method further comprises:
    pausing playback of the audio and video upon receiving a user-initiated audio and video pause instruction;
    continuing playback of the audio and video upon receiving a user-initiated audio and video play instruction.
  6. A voice processing apparatus, comprising:
    an identification unit configured to acquire a voice signal input by a user while audio and video are playing, store the acquired voice signal as measured audio, and instruct playback of the audio and video to pause when the voice signal input by the user is acquired;
    an audio acquiring unit configured to acquire audio data of the audio and video before the pause time, and store the acquired audio data as standard audio;
    a comparison unit configured to compare the measured audio with the standard audio to obtain a similarity between the measured audio and the standard audio;
    a display unit configured to display the similarity between the measured audio and the standard audio to the user.
  7. The apparatus according to claim 6, wherein the comparison unit comprises:
    a pre-processing module configured to pre-process the stored measured audio and standard audio respectively;
    an audio feature extraction module configured to extract feature parameters of the measured audio from the pre-processed measured audio, and to extract feature parameters of the standard audio from the pre-processed standard audio;
    an audio comparison module configured to calculate the difference between the feature parameters of the measured audio and the feature parameters of the standard audio to obtain the similarity between the measured audio and the standard audio.
  8. The apparatus according to claim 7, wherein the feature parameters of the measured audio comprise at least one of linear predictive cepstral coefficients, Mel-frequency cepstral coefficients, and accent-sensitive parameters.
  9. The apparatus according to any one of claims 6 to 8, further comprising:
    a control unit configured to, after the display unit displays the similarity between the measured audio and the standard audio to the user, pause playback of the audio and video upon receiving a user-initiated audio and video pause instruction, and continue playback upon receiving a user-initiated audio and video play instruction.
  10. The apparatus according to claim 9, wherein the audio acquiring unit acquiring the audio data of the audio and video before the pause time comprises:
    when playback of the audio and video is paused for the first time, acquiring all audio data of the audio and video played before the pause time;
    when playback is not paused for the first time, acquiring the audio data from the most recently stored segment up to the pause time.
  11. A terminal device, comprising at least the voice processing apparatus according to any one of claims 6 to 10.
  12. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, are used to perform the method according to any one of claims 1 to 5.
PCT/CN2015/095521 2015-09-17 2015-11-25 Voice processing method and apparatus, and terminal device WO2016165334A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510595881.0 2015-09-17
CN201510595881.0A CN106548785A (zh) 2015-09-17 2015-09-17 Voice processing method and apparatus, and terminal device

Publications (1)

Publication Number Publication Date
WO2016165334A1 (zh)

Family

ID=57126170

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/095521 WO2016165334A1 (zh) 2015-09-17 2015-11-25 Voice processing method and apparatus, and terminal device

Country Status (2)

Country Link
CN (1) CN106548785A (zh)
WO (1) WO2016165334A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111107442A (zh) * 2019-11-25 2020-05-05 北京大米科技有限公司 音视频文件的获取方法、装置、服务器及存储介质

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107864410B (zh) * 2017-10-12 2023-08-25 庄世健 一种多媒体数据处理方法、装置、电子设备以及存储介质
CN108881652B (zh) * 2018-07-11 2021-02-26 北京大米科技有限公司 回音检测方法、存储介质和电子设备
CN109859773A (zh) * 2019-02-14 2019-06-07 北京儒博科技有限公司 一种声音的录制方法、装置、存储介质及电子设备
CN110263334A (zh) * 2019-06-06 2019-09-20 深圳市柯达科电子科技有限公司 一种辅助外语学习的方法和可读存储介质
CN113066487A (zh) * 2019-12-16 2021-07-02 广东小天才科技有限公司 一种矫正口音的学习方法、***、设备及存储介质
CN110890095A (zh) * 2019-12-26 2020-03-17 北京大米未来科技有限公司 语音检测方法、推荐方法、装置、存储介质和电子设备
CN113971969B (zh) * 2021-08-12 2023-03-24 荣耀终端有限公司 一种录音方法、装置、终端、介质及产品

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1450445A (zh) * 2002-04-09 2003-10-22 无敌科技股份有限公司 可携式电子装置的语言跟读及发音矫正***与方法
US20040166480A1 (en) * 2003-02-14 2004-08-26 Sayling Wen Language learning system and method with a visualized pronunciation suggestion
TW200506764A (en) * 2003-08-05 2005-02-16 Wen-Fu Peng Interactive language learning method with speech recognition
CN101145283A (zh) * 2006-09-12 2008-03-19 董明 具有发音质量评价的嵌入式语言教学机
CN101630448A (zh) * 2008-07-15 2010-01-20 上海启态网络科技有限公司 语言学习客户端及***

Also Published As

Publication number Publication date
CN106548785A (zh) 2017-03-29

Similar Documents

Publication Publication Date Title
WO2016165334A1 (zh) Voice processing method and apparatus, and terminal device
US10789290B2 (en) Audio data processing method and apparatus, and computer storage medium
CN106898340B (zh) Song synthesis method and terminal
JP6790029B2 (ja) Device for managing voice profiles and generating speech signals
WO2020098115A1 (zh) Subtitle adding method and apparatus, electronic device, and computer-readable storage medium
US9552807B2 (en) Method, apparatus and system for regenerating voice intonation in automatically dubbed videos
US20130041669A1 (en) Speech output with confidence indication
EP2388780A1 (en) Apparatus and method for extending or compressing time sections of an audio signal
McLoughlin Speech and Audio Processing: a MATLAB-based approach
US9064489B2 (en) Hybrid compression of text-to-speech voice data
WO2020113733A1 (zh) Animation generation method and apparatus, electronic device, and computer-readable storage medium
US12027165B2 (en) Computer program, server, terminal, and speech signal processing method
CN110675886B (zh) Audio signal processing method and apparatus, electronic device, and storage medium
JP2000511651A (ja) Non-uniform time-scale modification of recorded audio signals
US20180130462A1 (en) Voice interaction method and voice interaction device
US8571873B2 (en) Systems and methods for reconstruction of a smooth speech signal from a stuttered speech signal
US8620670B2 (en) Automatic realtime speech impairment correction
US20220101864A1 (en) Training generative adversarial networks to upsample audio
CN112908292A (zh) Text-to-speech synthesis method and apparatus, electronic device, and storage medium
CN112420015A (zh) Audio synthesis method, apparatus, device, and computer-readable storage medium
US20220208174A1 (en) Text-to-speech and speech recognition for noisy environments
CN113948062B (zh) Data conversion method and computer storage medium
CN113223513A (zh) Voice conversion method, apparatus, device, and storage medium
WO2018179209A1 (ja) Electronic device, voice control method, and program
US20230230610A1 (en) Approaches to generating studio-quality recordings through manipulation of noisy audio

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15889030

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15889030

Country of ref document: EP

Kind code of ref document: A1