WO2019242414A1 - Voice processing method and apparatus, storage medium, and electronic device - Google Patents

Voice processing method and apparatus, storage medium, and electronic device

Info

Publication number
WO2019242414A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
voiceprint feature
signal
output
voice signal
Prior art date
Application number
PCT/CN2019/085543
Other languages
English (en)
French (fr)
Inventor
陈岩
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Publication of WO2019242414A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/0335: Pitch control
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26: Speech to text systems
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/22: Interactive procedures; Man-machine interfaces
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0264: Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques for comparison or discrimination for estimating an emotional state
    • G10L2015/223: Execution procedure of a spoken command

Definitions

  • the present application relates to the technical field of electronic devices, and in particular, to a voice processing method, device, storage medium, and electronic device.
  • an embodiment of the present application provides a voice processing method, including:
  • the voice signal to be output includes a voiceprint feature to be output corresponding to the voiceprint feature, and voice content to be output corresponding to the voice content;
  • an embodiment of the present application provides a voice processing apparatus, including:
  • A collection module configured to collect voice signals from the external environment;
  • An acquisition module configured to acquire voice content and voiceprint features included in the voice signal
  • a generating module configured to generate a voice signal to be output according to the voice content and the voiceprint feature, the voice signal to be output including a voiceprint feature to be output corresponding to the voiceprint feature, and voice content to be output corresponding to the voice content;
  • An output module is configured to output the voice signal to be output.
  • an embodiment of the present application provides a storage medium on which a computer program is stored, and when the computer program is run on a computer, the computer is caused to execute:
  • the voice signal to be output includes a voiceprint feature to be output corresponding to the voiceprint feature, and voice content to be output corresponding to the voice content;
  • an embodiment of the present application provides an electronic device including a processor and a memory, where the memory stores a computer program, and the processor calls the computer program to execute:
  • the voice signal to be output includes a voiceprint feature to be output corresponding to the voiceprint feature, and voice content to be output corresponding to the voice content;
  • FIG. 1 is a schematic flowchart of a voice processing method according to an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an electronic device acquiring voice content from a voice signal in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of voice interaction between an electronic device and a user in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of an electronic device and a user performing a voice interaction in a conference room scene according to an embodiment of the present application.
  • FIG. 5 is another schematic flowchart of a voice processing method according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a voice processing apparatus according to an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • FIG. 8 is another schematic structural diagram of an electronic device according to an embodiment of the present application.
  • an embodiment herein means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application.
  • the appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to independent or alternative embodiments that are mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
  • the embodiment of the present application provides a voice processing method.
  • the execution subject of the voice processing method may be the voice processing apparatus provided in the embodiments of the present application, or an electronic device integrating the voice processing apparatus.
  • the voice processing apparatus may be implemented in hardware or software.
  • the electronic device may be a smartphone, a tablet computer, a palmtop computer, a notebook computer, or a desktop computer.
  • An embodiment of the present application provides a voice processing method, including:
  • the voice signal to be output includes a voiceprint feature to be output corresponding to the voiceprint feature, and voice content to be output corresponding to the voice content;
  • the outputting the voice signal to be output includes:
  • the acquiring a voice signal in an external environment includes:
  • the acquiring a noise signal during the collection of the noisy voice signal according to the historical noise signal includes:
  • before the generating a voice signal to be output according to the voice content and the voiceprint feature, the method further includes:
  • the voice signal to be output is generated according to the voice content and the voiceprint feature.
  • determining whether the voiceprint feature matches a preset voiceprint feature includes:
  • the obtaining the similarity between the voiceprint feature and the preset voiceprint feature includes:
  • the feature distance is used as a similarity between the voiceprint feature and the preset voiceprint feature.
  • the method further includes:
  • the acquiring the voice content included in the voice signal includes:
  • FIG. 1 is a schematic flowchart of a voice processing method according to an embodiment of the present application. As shown in FIG. 1, the process of the voice processing method provided by the embodiment of the present application may be as follows:
  • the electronic device can collect voice signals from the external environment in many different ways. For example, when no microphone is externally connected to the electronic device, it can collect the voice in the external environment through its built-in microphone to obtain the voice signal; when an external microphone is connected, it can collect the voice in the external environment through the external microphone to obtain the voice signal.
  • when the electronic device collects a voice signal from the external environment through a microphone (which may be built-in or external), an analog microphone yields an analog voice signal, which the electronic device then needs to sample, for example at a sampling frequency of 16 kHz, to convert into a digital voice signal. A digital microphone, by contrast, directly yields a digitized voice signal that requires no conversion.
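  • As an illustration of the digitization step, the sketch below records a short mono clip at the 16 kHz sampling frequency mentioned above. It assumes the third-party Python sounddevice library; the three-second window and float32 format are illustrative choices, not part of the patent.

```python
# Minimal capture sketch (assumes the "sounddevice" package is installed).
import sounddevice as sd

SAMPLE_RATE = 16_000  # 16 kHz, the sampling frequency named in the text
DURATION_S = 3        # hypothetical capture window

# rec() returns a NumPy array of samples; blocking=True waits for the
# capture window to elapse before returning.
signal = sd.rec(int(DURATION_S * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                channels=1, dtype="float32", blocking=True)
```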
  • after the electronic device collects a voice signal from the external environment, it determines whether a voice parsing engine exists locally. If one exists, the electronic device inputs the collected voice signal into the local voice parsing engine for parsing to obtain the parsed speech text. Parsing the voice signal is the process of converting it from "audio" to "text".
  • the electronic device can select a speech parsing engine from the multiple speech parsing engines to perform speech parsing on the speech signal in the following manner:
  • the electronic device may randomly select a speech analysis engine from a plurality of local speech analysis engines to perform speech analysis on the collected speech signals.
  • the electronic device can select a speech parsing engine with the highest parsing success rate from multiple speech parsing engines to perform speech parsing on the collected speech signals.
  • the electronic device can select a speech parsing engine with the shortest analysis time from multiple speech parsing engines to perform speech parsing on the collected speech signals.
  • the electronic device may also select a speech parsing engine with a parsing success rate that reaches a preset success rate from the multiple speech parsing engines and has the shortest parsing time to perform speech parsing on the collected speech signals.
  • it should be noted that those skilled in the art may also select a speech parsing engine in ways not listed above, or may combine multiple speech parsing engines to parse the speech signal.
  • for example, the electronic device may parse the speech signal with two speech parsing engines simultaneously and, when the parsed texts obtained by the two engines are the same, use that text as the parsed text of the speech signal; as another example, the electronic device may parse with at least three engines and, when at least two of them obtain the same parsed text, use that text as the parsed text of the speech signal.
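  • A minimal sketch of the agreement strategy above, assuming each parsing engine is exposed as a callable that maps an audio signal to a transcript (the engine objects themselves are hypothetical):

```python
from collections import Counter

def parse_with_consensus(signal, engines, min_agree=2):
    """Parse `signal` with several engines and return a transcript that at
    least `min_agree` engines agree on, or None when there is no agreement
    (a caller could then fall back to a server-side engine)."""
    votes = Counter(engine(signal) for engine in engines)
    text, count = votes.most_common(1)[0]
    return text if count >= min_agree else None
```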
  • after the electronic device parses the voice signal and obtains its parsed speech text, it can extract the voice content included in the voice signal from that text. For example, referring to FIG. 2, a user says "What's the weather tomorrow"; the electronic device collects the voice signal corresponding to that utterance, parses it to obtain the corresponding parsed text, and extracts the voice content "What's the weather tomorrow" from the parsed text.
  • in addition, after the electronic device determines whether a voice parsing engine exists locally, if none exists, it sends the aforementioned voice signal to a server (a server providing a voice parsing service), instructs the server to parse the voice signal, and to return the parsed speech text obtained from it.
  • after receiving the parsed text returned by the server, the electronic device can extract the voice content included in the voice signal from it.
  • the first factor that determines voiceprint characteristics is the size of the acoustic cavities, including the throat, nasal cavity, and oral cavity.
  • the shape, size, and position of these organs determine the tension of the vocal cords and the range of sound frequencies. Therefore, although different people may say the same thing, the frequency distributions of their voices differ: some voices sound deep, others sonorous.
  • the second factor that determines voiceprint characteristics is the manner in which the vocal organs are manipulated.
  • the vocal organs include the lips, teeth, tongue, soft palate, and palatal muscles, whose interaction produces clear speech. The way they cooperate is acquired incidentally through interaction with the people around us: in learning to speak, by imitating the speech of different people around them, people gradually form their own voiceprint characteristics.
  • the user's emotion when speaking can also cause the voiceprint characteristics to change.
  • accordingly, in the embodiment of the present application, in addition to acquiring the voice content included in the collected voice signal, the electronic device also acquires the voiceprint features included in it.
  • the voiceprint features include, but are not limited to, at least one of the following feature components: spectral, cepstral, formant, pitch, reflection coefficient, tone, speech rate, emotional, prosodic, and rhythm feature components.
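  • As a sketch of how two of the listed components might be computed in practice, the snippet below extracts cepstral (MFCC) and pitch features with the third-party librosa library; the patent does not prescribe a library, and the parameter values are illustrative.

```python
import librosa
import numpy as np

def extract_voiceprint(path: str) -> np.ndarray:
    """Summarise a recording as one fixed-length feature vector."""
    y, sr = librosa.load(path, sr=16_000)               # resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # cepstral components
    f0 = librosa.yin(y, fmin=80, fmax=400, sr=sr)       # frame-wise pitch
    return np.concatenate([mfcc.mean(axis=1), [f0.mean()]])
```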
  • 103: Generate a voice signal to be output according to the acquired voice content and voiceprint features, where the voice signal to be output includes voiceprint features to be output corresponding to the aforementioned voiceprint features and voice content to be output corresponding to the aforementioned voice content.
  • after acquiring the voice content and voiceprint features included in the voice signal, the electronic device obtains the corresponding voice content to be output according to the preset correspondence among voice content, voiceprint features, and voice content to be output, together with the acquired voice content and voiceprint features.
  • the correspondence among voice content, voiceprint features, and voice content to be output can be set by those skilled in the art according to actual needs; modal particles that do not affect the semantics may be added to the voice content to be output.
  • for example, take a voiceprint feature that includes only an emotional feature component:
  • when a user says "What's the weather tomorrow" with a neutral emotion, the electronic device obtains "Tomorrow will be clear and sunny, great for going out" as the corresponding content to be output; when the user says "I'm unhappy" with a negative emotion, the electronic device obtains "Don't be unhappy, take me out to play".
  • the electronic device also obtains a corresponding voiceprint feature to be output according to a preset correspondence relationship between the voiceprint feature and the voiceprint feature to be output, and the obtained voiceprint feature.
  • the correspondence between the voiceprint features and the voiceprint features to be output can be set by those skilled in the art according to actual needs, and this application does not specifically limit this.
  • for example, the emotion to be output corresponding to a negative emotion can be set to a positive emotion, that corresponding to a neutral emotion to a neutral emotion, and that corresponding to a positive emotion to a positive emotion; a toy mapping of these correspondences is sketched below.
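  • The sketch assumes an upstream classifier has already reduced the emotional feature component to one of three labels; the reply strings follow the document's own examples.

```python
EMOTION_TO_OUTPUT = {
    "negative": "positive",  # comfort an unhappy user
    "neutral": "neutral",
    "positive": "positive",
}

REPLY_TABLE = {
    ("What's the weather tomorrow", "neutral"):
        "Tomorrow will be clear and sunny, great for going out",
    ("I'm unhappy", "negative"):
        "Don't be unhappy, take me out to play",
}

def plan_output(content: str, emotion: str) -> tuple:
    """Map recognised content and emotion to output content and emotion."""
    reply = REPLY_TABLE.get((content, emotion), content)
    return reply, EMOTION_TO_OUTPUT[emotion]
```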
  • after the electronic device obtains the voice content to be output corresponding to the aforementioned voice content and voiceprint features, and the voiceprint features to be output corresponding to the aforementioned voiceprint features, it performs speech synthesis according to them to obtain the voice signal to be output.
  • the voice signal to be output thus includes the voice content to be output corresponding to the aforementioned voice content and voiceprint features, and the voiceprint features to be output corresponding to the aforementioned voiceprint features.
  • after the electronic device generates the voice signal to be output, it outputs that signal as speech. For example, referring to FIG. 3 and taking a voiceprint feature that includes only an emotional feature component: when a user says "I'm unhappy" with a negative emotion, the electronic device obtains "Don't be unhappy, take me out to play" as the corresponding content to be output, and "positive emotion" as the corresponding voiceprint feature to be output.
  • the electronic device then performs speech synthesis based on "Don't be unhappy, take me out to play" and "positive emotion" to obtain the voice signal to be output.
  • when that signal is output, if the electronic device is regarded as a "person", this "person" says "Don't be unhappy, take me out to play" with a positive emotion to comfort the user.
  • as can be seen from the above, the electronic device in the embodiment of the present application can collect voice signals from the external environment, obtain the voice content and voiceprint features included in the collected voice signals, and then generate a voice signal to be output based on them.
  • the voice signal to be output includes voiceprint features to be output corresponding to the aforementioned voiceprint features and voice content to be output corresponding to the aforementioned voice content; finally, the generated voice signal to be output is output.
  • the electronic device can thus output a voice signal whose voiceprint features correspond to those included in the input voice signal, realizing voice output in different utterance modes and thereby improving the flexibility of the electronic device's voice interaction.
  • "outputting the generated voice signal to be output” includes:
  • when the electronic device outputs the generated voice signal to be output, it first obtains the loudness value (or volume value) of the aforementioned voice signal and uses it as the input loudness value. It then determines the output loudness value corresponding to that loudness value according to the preset correspondence between input and output loudness values, uses the output loudness value as the target loudness value of the voice signal to be output, and finally outputs the generated voice signal at the determined target loudness value.
  • the correspondence between the input loudness value and the output loudness value can be as follows:

Lout = k * Lin;

  • where Lout represents the output loudness value, Lin represents the input loudness value, and k is a coefficient that can be set by those skilled in the art according to actual needs: for example, when k is set to 1, the output loudness value equals the input loudness value, and when k is set to less than 1, the output loudness value is less than the input loudness value.
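  • A one-line sketch of the mapping, with k = 0.8 as an arbitrary example value:

```python
def target_loudness(input_loudness: float, k: float = 0.8) -> float:
    """Apply Lout = k * Lin: with k = 1 the reply matches the user's
    loudness, and with k < 1 it is quieter than the user."""
    return k * input_loudness
```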
  • in this way, determining the target loudness value of the output from the loudness value of the collected voice signal makes the electronic device's voice interaction fit the current scene better.
  • for example, referring to FIG. 4, the user carries the electronic device in a conference room.
  • when the user whispers a voice command, the electronic device also replies quietly, avoiding the situation where a fixed output volume disturbs others.
  • "collecting a voice signal in an external environment” includes:
  • it is easy to understand that all kinds of noise exist in the environment; for example, an office contains noise from running computers and from keyboard typing, so the electronic device can hardly collect a pure voice signal. The embodiment of the present application therefore further provides a solution for collecting a voice signal in a noisy environment.
  • when the electronic device is in a noisy environment and the user utters a voice signal, the electronic device collects a noisy voice signal from the external environment.
  • the noisy voice signal is formed by the combination of the voice signal uttered by the user and the noise signal in the external environment; if the user utters no voice signal, the electronic device collects only the noise signal in the external environment. The electronic device buffers the collected noisy voice signals and noise signals.
  • specifically, the electronic device takes the buffered noise signal of a preset duration that ends at the start time of the noisy voice signal as the historical noise signal (the preset duration can be chosen by a person skilled in the art according to actual needs; the embodiment of the present application does not specifically limit it, and it may, for example, be set to 500 milliseconds).
  • for example, if the electronic device begins collecting a noisy voice signal at 16:47:56.5 on June 13, 2018, it takes the 500-millisecond noise signal buffered from 16:47:56 to 16:47:56.5 on June 13, 2018 as the historical noise signal corresponding to the noisy voice signal.
  • after acquiring the historical noise signal corresponding to the noisy speech signal, the electronic device further acquires the noise signal during the collection of the noisy speech signal according to the acquired historical noise signal.
  • the electronic device can predict the noise distribution during the acquisition of the noisy speech signal based on the acquired historical noise signal, thereby obtaining the noise signal during the noisy speech signal acquisition.
  • since the noise variation over a continuous period is usually small,
  • the electronic device can use the acquired historical noise signal as the noise signal during the collection of the noisy speech signal.
  • if the historical noise signal is longer than the noisy speech signal, a segment with the same duration as the noisy speech signal can be intercepted from it as the noise signal during collection; if the historical noise signal is shorter than the noisy speech signal, it can be copied and multiple copies spliced together to obtain a noise signal with the same duration as the noisy speech signal, to serve as the noise signal during collection.
  • after acquiring the noise signal during the collection of the noisy voice signal, the electronic device first inverts the phase of the acquired noise signal, and then superimposes the inverted noise signal onto the noisy voice signal to cancel the noise component of the noisy voice signal.
  • this yields a noise-reduced voice signal, which is used as the voice signal collected from the external environment for subsequent processing.
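  • An idealised sketch of the anti-phase superposition, assuming the noise estimate and the noisy capture are sample-aligned NumPy arrays; the copy-and-splice branch mirrors the handling of a short historical noise signal described above.

```python
import numpy as np

def denoise(noisy: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """Invert the noise estimate and add it to the noisy capture so the
    noise component cancels, leaving the noise-reduced voice signal."""
    if len(noise) < len(noisy):
        noise = np.resize(noise, len(noisy))  # splice repeated copies
    return noisy + (-noise[: len(noisy)])
```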
  • obtaining a noise signal during the acquisition of a noisy voice signal according to the historical noise signal includes:
  • the electronic device obtains the historical noise signal, uses the historical noise signal as sample data, and performs model training according to a preset training algorithm to obtain a noise prediction model.
  • the training algorithm is a machine learning algorithm.
  • the machine learning algorithm can predict the data through continuous feature learning.
  • the electronic device can predict the current noise distribution based on the historical noise distribution.
  • machine learning algorithms can include decision tree algorithms, regression algorithms, Bayesian algorithms, neural network algorithms (which can include deep neural network algorithms, convolutional neural network algorithms, recurrent neural network algorithms, and the like), clustering algorithms, and so on. Which training algorithm is selected as the preset training algorithm for model training can be decided by those skilled in the art according to actual needs.
  • for example, suppose the preset training algorithm configured on the electronic device is a Gaussian mixture model algorithm (a regression algorithm).
  • the electronic device uses the historical noise signal as sample data and performs model training according to the Gaussian mixture model algorithm.
  • a Gaussian mixture model is thereby obtained (it includes multiple Gaussian units for describing the noise distribution), and this Gaussian mixture model is used as the noise prediction model.
  • the electronic device then takes the start time and end time of the noisy speech signal collection period as the input of the noise prediction model, feeds them into the model for processing, and obtains the noise signal that the model outputs for the noisy speech signal collection period.
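  • A sketch of the training/prediction split using scikit-learn's GaussianMixture; drawing samples from the fitted mixture as the noise estimate for the capture window is a simplification chosen for this sketch rather than the patent's wording.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_noise_model(history: np.ndarray, n_components: int = 4) -> GaussianMixture:
    """Fit a mixture of Gaussian units to historical noise samples."""
    return GaussianMixture(n_components=n_components).fit(history.reshape(-1, 1))

def predict_noise(model: GaussianMixture, n_samples: int) -> np.ndarray:
    """Draw a noise signal of the required length from the learned distribution."""
    samples, _ = model.sample(n_samples)
    return samples.ravel()
```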
  • in an embodiment, before "generating a voice signal to be output based on the acquired voice content and voiceprint characteristics", the method further includes:
  • the preset voiceprint feature may be a voiceprint feature previously enrolled by the owner, or a voiceprint feature previously enrolled by another user authorized by the owner. Determining whether the aforementioned voiceprint feature (that is, the voiceprint feature of the voice signal collected from the external environment) matches the preset voiceprint feature amounts to determining whether the user who issued the voice signal is the owner. If the aforementioned voiceprint feature does not match the preset voiceprint feature, the electronic device determines that the user who issued the voice signal is not the owner. If it does match, the electronic device determines that the user who issued the voice signal is the owner and, at this point, generates a voice signal to be output according to the obtained voice content and the aforementioned voiceprint feature. For details, refer to the foregoing related description, which is not repeated here.
  • in this way, the user who issued the voice signal is identified according to the voiceprint features of the voice signal, and the voice signal to be output is generated based on the obtained voice content and the aforementioned voiceprint features only when that user is the owner. This prevents the electronic device from responding erroneously to people other than the owner, improving the owner's experience.
  • determining whether the aforementioned voiceprint feature matches the preset voiceprint feature includes:
  • when the electronic device determines whether the aforementioned voiceprint feature matches the preset voiceprint feature, it can obtain the similarity between the two and determine whether the obtained similarity is greater than or equal to a first preset similarity (which can be set by those skilled in the art according to actual needs).
  • when the obtained similarity is greater than or equal to the first preset similarity, the electronic device determines that the obtained voiceprint feature matches the preset voiceprint feature; when it is less than the first preset similarity, the electronic device determines that the obtained voiceprint feature does not match the preset voiceprint feature.
  • the electronic device may obtain the distance between the aforementioned voiceprint feature and the preset voiceprint feature, and use the obtained distance as the similarity between the aforementioned voiceprint feature and the preset voiceprint feature.
  • a person skilled in the art may select any one of several feature distances (such as the Euclidean distance, the Manhattan distance, or the Chebyshev distance) to measure the distance between the aforementioned voiceprint feature and the preset voiceprint feature.
  • for example, the cosine distance between the aforementioned voiceprint feature and the preset voiceprint feature can be obtained; specifically, refer to the following formula:

e = (Σ f_i · g_i) / (√(Σ f_i²) · √(Σ g_i²)), with each sum taken over i = 1, ..., N;

  • where e represents the cosine distance between the aforementioned voiceprint feature and the preset voiceprint feature, f represents the aforementioned voiceprint feature, g represents the preset voiceprint feature, N represents the number of dimensions of the two features (the aforementioned voiceprint feature and the preset voiceprint feature have the same dimensionality), f_i represents the feature value of the i-th dimension of the aforementioned voiceprint feature, and g_i represents the feature value of the i-th dimension of the preset voiceprint feature.
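  • A direct sketch of the formula above, assuming both voiceprints are plain same-length numeric vectors:

```python
import numpy as np

def cosine_similarity(f, g) -> float:
    """e = sum(f_i * g_i) / (||f|| * ||g||) for same-dimension features."""
    f, g = np.asarray(f, dtype=float), np.asarray(g, dtype=float)
    return float(np.dot(f, g) / (np.linalg.norm(f) * np.linalg.norm(g)))
```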
  • the method further includes:
  • since voiceprint characteristics are closely related to the physiological characteristics of the human body, in daily life, if the user catches a cold, his or her voice becomes hoarse and the voiceprint characteristics change accordingly. In this case, even if the user who issued the voice signal is the owner, the electronic device cannot recognize him or her. There are many other situations that can cause the electronic device to fail to identify the owner, which are not enumerated here.
  • therefore, after the electronic device completes the judgment of voiceprint similarity, if the similarity between the aforementioned voiceprint feature and the preset voiceprint feature is less than the first preset similarity, it further judges whether the similarity is greater than or equal to a second preset similarity (the second preset similarity is configured to be smaller than the first preset similarity, and a suitable value can be chosen by those skilled in the art according to actual needs; for example, when the first preset similarity is set to 95%, the second preset similarity may be set to 75%).
  • when the judgment result is yes, that is, when the similarity between the aforementioned voiceprint feature and the preset voiceprint feature is less than the first preset similarity and greater than or equal to the second preset similarity, the electronic device further obtains the current position information.
  • when in an outdoor environment, the electronic device can use satellite positioning technology to obtain the current position information (the electronic device can identify whether it is currently outdoors or indoors from the strength of the received satellite positioning signal: when the strength is below a preset threshold, it determines that it is in an indoor environment, and when the strength is at or above the threshold, it determines that it is in an outdoor environment); when in an indoor environment, indoor positioning technology can be used to obtain the current location information.
  • after acquiring the current position information, the electronic device judges, according to the position information, whether it is currently within a preset position range.
  • the preset position range can be configured as positions the owner commonly occupies, such as home or the office.
  • if it is currently within the preset position range, the electronic device determines that the aforementioned voiceprint feature matches the preset voiceprint feature and that the user who issued the voice signal is the owner.
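  • The decision logic of this fallback can be summarised in a few lines; the 95%/75% thresholds follow the example values in the text, and `in_trusted_location` stands in for the preset-position check:

```python
def voiceprint_matches(similarity: float, in_trusted_location: bool,
                       first_preset: float = 0.95,
                       second_preset: float = 0.75) -> bool:
    """Accept a strong match outright; accept a weaker match only when the
    device is within a preset position range such as home or the office."""
    if similarity >= first_preset:
        return True
    return similarity >= second_preset and in_trusted_location
```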
  • referring to FIG. 5, in another embodiment, the voice processing method may include:
  • specifically, the electronic device takes the buffered noise signal of a preset duration that ends at the start time of the noisy voice signal as the historical noise signal (the preset duration can be chosen by a person skilled in the art according to actual needs; the embodiment of the present application does not specifically limit it, and it may, for example, be set to 500 milliseconds).
  • for example, if the electronic device begins collecting a noisy voice signal at 16:47:56.5 on June 13, 2018, it takes the 500-millisecond noise signal buffered from 16:47:56 to 16:47:56.5 on June 13, 2018 as the historical noise signal corresponding to the noisy voice signal.
  • after acquiring the historical noise signal corresponding to the noisy speech signal, the electronic device further acquires the noise signal during the collection of the noisy speech signal according to the acquired historical noise signal.
  • the electronic device can predict the noise distribution during the acquisition of the noisy speech signal based on the acquired historical noise signal, thereby obtaining the noise signal during the noisy speech signal acquisition.
  • since the noise variation over a continuous period is usually small,
  • the electronic device can use the acquired historical noise signal as the noise signal during the collection of the noisy speech signal.
  • if the historical noise signal is longer than the noisy speech signal, a segment with the same duration as the noisy speech signal can be intercepted from it as the noise signal during collection; if the historical noise signal is shorter than the noisy speech signal, it can be copied and multiple copies spliced together to obtain a noise signal with the same duration as the noisy speech signal, to serve as the noise signal during collection.
  • after acquiring the noise signal during the collection of the noisy voice signal, the electronic device first inverts the phase of the acquired noise signal, and then superimposes the inverted noise signal onto the noisy voice signal to cancel the noise component of the noisy voice signal.
  • this yields a noise-reduced voice signal, which is used as the voice signal to be processed in the subsequent steps.
  • after the electronic device obtains the speech signal to be processed, it first determines whether a speech parsing engine exists locally. If one exists, the electronic device inputs the aforementioned speech signal into the local speech parsing engine for parsing to obtain the parsed speech text. Parsing the speech signal is the process of converting it from "audio" to "text".
  • after the electronic device parses the voice signal and obtains its parsed text, it can extract the voice content included in the voice signal from that text. For example, referring to FIG. 2, a user says "What's the weather tomorrow"; the electronic device collects the voice signal corresponding to that utterance, parses it to obtain the corresponding parsed text, and extracts the voice content "What's the weather tomorrow" from the parsed text.
  • in addition, after the electronic device determines whether a voice parsing engine exists locally, if none exists, it sends the aforementioned voice signal to a server (a server providing a voice parsing service), instructs the server to parse the voice signal, and to return the parsed speech text obtained from it.
  • after receiving the parsed text returned by the server, the electronic device can extract the voice content included in the voice signal from it.
  • in addition to acquiring the voice content included in the aforementioned voice signal, the electronic device also acquires the voiceprint features included in it.
  • the voiceprint features include, but are not limited to, at least one of the following feature components: spectral, cepstral, formant, pitch, reflection coefficient, tone, speech rate, emotional, prosodic, and rhythm feature components.
  • next, the electronic device generates a voice signal to be output according to the acquired voice content and voiceprint features, where the voice signal to be output includes voiceprint features to be output corresponding to the aforementioned voiceprint features and voice content to be output corresponding to the aforementioned voice content.
  • after acquiring the voice content and voiceprint features included in the voice signal, the electronic device obtains the corresponding voice content to be output according to the preset correspondence among voice content, voiceprint features, and voice content to be output, together with the acquired voice content and voiceprint features.
  • the correspondence among voice content, voiceprint features, and voice content to be output can be set by those skilled in the art according to actual needs; modal particles that do not affect the semantics may be added to the voice content to be output.
  • for example, take a voiceprint feature that includes only an emotional feature component:
  • when a user says "What's the weather tomorrow" with a neutral emotion, the electronic device obtains "Tomorrow will be clear and sunny, great for going out" as the corresponding content to be output; when the user says "I'm unhappy" with a negative emotion, the electronic device obtains "Don't be unhappy, take me out to play".
  • the electronic device also obtains a corresponding voiceprint feature to be output according to a preset correspondence relationship between the voiceprint feature and the voiceprint feature to be output, and the obtained voiceprint feature.
  • the correspondence between the voiceprint features and the voiceprint features to be output can be set by those skilled in the art according to actual needs, and this application does not specifically limit this.
  • for example, the emotion to be output corresponding to a negative emotion can be set to a positive emotion, that corresponding to a neutral emotion to a neutral emotion, and that corresponding to a positive emotion to a positive emotion.
  • after the electronic device obtains the voice content to be output corresponding to the aforementioned voice content and voiceprint features, and the voiceprint features to be output corresponding to the aforementioned voiceprint features, it performs speech synthesis according to them to obtain the voice signal to be output.
  • the voice signal to be output thus includes the voice content to be output corresponding to the aforementioned voice content and voiceprint features, and the voiceprint features to be output corresponding to the aforementioned voiceprint features.
  • after generating the voice signal to be output, the electronic device first obtains the loudness value (or volume value) of the foregoing voice signal.
  • the electronic device uses this loudness value as the input loudness value, then determines the output loudness value corresponding to it according to the preset correspondence between input and output loudness values, uses the output loudness value as the target loudness value of the voice signal to be output, and finally outputs the generated voice signal at the determined target loudness value.
  • the correspondence between the input loudness value and the output loudness value can be as follows:

Lout = k * Lin;

  • where Lout represents the output loudness value, Lin represents the input loudness value, and k is a coefficient that can be set by those skilled in the art according to actual needs: for example, when k is set to 1, the output loudness value equals the input loudness value, and when k is set to less than 1, the output loudness value is less than the input loudness value.
  • an embodiment of the present application also provides a voice processing apparatus.
  • referring to FIG. 6, FIG. 6 is a schematic structural diagram of a voice processing apparatus 400 according to an embodiment of the present application.
  • the voice processing device is applied to an electronic device.
  • the voice processing apparatus includes a collection module 401, an acquisition module 402, a generation module 403, and an output module 404, as follows:
  • the collection module 401 is configured to collect voice signals from the external environment.
  • the acquisition module 402 is configured to acquire the voice content and voiceprint features included in the collected voice signal.
  • a generating module 403 is configured to generate a voice signal to be output according to the acquired voice content and voiceprint features, where the voice signal to be output includes voiceprint features to be output corresponding to the aforementioned voiceprint features and voice content to be output corresponding to the aforementioned voice content .
  • the output module 404 is configured to output the generated voice signal to be output.
  • the output module 404 may be configured to:
  • a voice signal to be output is output.
  • the collection module 401 may be configured to:
  • superimpose the acquired noise signal in anti-phase onto the noisy voice signal, and use the noise-reduced voice signal obtained from the superposition as the collected voice signal.
  • the collection module 401 may be configured to:
  • predict the noise signal during the collection of the noisy speech signal according to the noise prediction model.
  • the generating module 403 may be configured to:
  • a voice signal to be output is generated according to the acquired voice content and the aforementioned voiceprint feature.
  • the generating module 403 may be configured to:
  • the generating module 403 may be configured to:
  • the feature distance is used as the similarity between the voiceprint feature and the preset voiceprint feature.
  • the generating module 403 may be configured to:
  • the acquisition module 402 may be configured to:
  • the voice processing device 400 may be integrated in an electronic device, such as a mobile phone, a tablet computer, or the like.
  • the above modules can be implemented as independent entities, or can be arbitrarily combined, and implemented as the same or several entities.
  • the specific implementation of the above units can refer to the previous embodiments, and will not be repeated here.
  • an embodiment of the present application also provides an electronic device.
  • the electronic device 500 includes a processor 501 and a memory 502.
  • the processor 501 is electrically connected to the memory 502.
  • the processor 501 is the control center of the electronic device 500. It connects the various parts of the entire electronic device using various interfaces and lines, and performs the various functions of the electronic device 500 and processes data by running or loading a computer program stored in the memory 502 and calling data stored in the memory 502.
  • the memory 502 may be configured to store software programs and modules.
  • the processor 501 executes various functional applications and data processing by running computer programs and modules stored in the memory 502.
  • the memory 502 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system and a computer program required for at least one function (such as a sound playback function, an image playback function, etc.), and the storage data area may store data created according to the use of the electronic device, etc.
  • the memory 502 may include a high-speed random access memory, and may further include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 502 may further include a memory controller to provide the processor 501 with access to the memory 502.
  • in this embodiment, the processor 501 in the electronic device 500 loads the instructions corresponding to the processes of one or more computer programs into the memory 502, and runs the computer programs stored in the memory 502 to implement various functions, as follows:
  • the voice signal to be output includes voiceprint feature to be output corresponding to the aforementioned voiceprint feature, and voice content to be output corresponding to the aforementioned voice content;
  • the electronic device 500 may further include a display 503, a radio frequency circuit 504, an audio circuit 505, and a power source 506.
  • the display 503, the radio frequency circuit 504, the audio circuit 505, and the power supply 506 are electrically connected to the processor 501, respectively.
  • the display 503 may be used to display information input by the user or information provided to the user and various graphical user interfaces. These graphical user interfaces may be composed of graphics, text, icons, videos, and any combination thereof.
  • the display 503 may include a display panel.
  • the display panel may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD), or an organic light emitting diode (Organic Light-Emitting Diode, OLED).
  • the radio frequency circuit 504 may be used to transmit and receive radio frequency signals to establish wireless communication with a network device or other electronic device through wireless communication, and transmit and receive signals to and from the network device or other electronic device.
  • the audio circuit 505 may be used to provide an audio interface between the user and the electronic device through a speaker or a microphone.
  • the power source 506 may be used to power various components of the electronic device 500.
  • the power supply 506 may be logically connected to the processor 501 through a power management system, so as to implement functions such as management of charging, discharging, and power consumption management through the power management system.
  • the electronic device 500 may further include a camera, a Bluetooth module, and the like, and details are not described herein again.
  • the processor 501 may execute:
  • a voice signal to be output is output.
  • the processor 501 when collecting a voice signal in an external environment, the processor 501 may execute:
  • the acquired noise signal is superimposed in anti-phase onto the noisy voice signal, and the noise-reduced voice signal obtained from the superposition is used as the collected voice signal.
  • the processor 501 may execute:
  • the noise signal during the collection of the noisy speech signal is predicted according to the noise prediction model.
  • the processor 501 when generating a voice signal to be output based on the acquired voice content and voiceprint characteristics, the processor 501 may execute:
  • a voice signal to be output is generated according to the acquired voice content and the aforementioned voiceprint feature.
  • the processor 501 may further execute:
  • the processor 501 may execute:
  • the feature distance is used as the similarity between the voiceprint feature and the preset voiceprint feature.
  • the processor 501 may further execute:
  • the processor 501 when acquiring the voice content included in the collected voice signal, the processor 501 may execute:
  • An embodiment of the present application further provides a storage medium.
  • the storage medium stores a computer program; when the computer program runs on a computer, the computer is caused to execute the voice processing method of any of the foregoing embodiments, for example: collecting voice signals from the external environment; acquiring the voice content and voiceprint features included in the collected voice signals; generating a voice signal to be output according to the acquired voice content and voiceprint features, the voice signal to be output including voiceprint features to be output corresponding to the aforementioned voiceprint features and voice content to be output corresponding to the aforementioned voice content; and outputting the generated voice signal to be output.
  • the storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
  • the computer program may be stored in a computer-readable storage medium, for example in a memory of an electronic device, and executed by at least one processor in the electronic device; the execution process may include, for example, the flow of an embodiment of the voice processing method.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, or the like.
  • the voice processing device For the voice processing device according to the embodiment of the present application, its functional modules may be integrated into one processing chip, or each module may exist separately physically, or two or more modules may be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or in the form of software functional modules. If an integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A voice processing method, comprising: collecting a voice signal from the external environment (101); acquiring the voice content and voiceprint features included in the collected voice signal (102); generating a voice signal to be output according to the acquired voice content and voiceprint features, the voice signal to be output comprising voiceprint features to be output corresponding to the aforementioned voiceprint features and voice content to be output corresponding to the aforementioned voice content (103); and outputting the generated voice signal to be output (104). Also disclosed are a voice processing apparatus, a storage medium, and an electronic device.

Description

Voice processing method and apparatus, storage medium, and electronic device
This application claims priority to Chinese Patent Application No. 201810631577.0, entitled "Voice processing method and apparatus, storage medium, and electronic device" and filed with the Chinese Patent Office on June 19, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of electronic devices, and in particular to a voice processing method and apparatus, a storage medium, and an electronic device.
Background
At present, with the development of technology, the ways in which people interact with machines are becoming ever richer. In the related art, a user can interact with an electronic device such as a mobile phone or a tablet computer by voice. For example, the user can say "What's the weather tomorrow" to the electronic device, and the electronic device will query the weather information and output the queried weather information by voice.
Summary of the Invention
In a first aspect, an embodiment of the present application provides a voice processing method, including:
collecting a voice signal from the external environment;
acquiring the voice content and voiceprint features included in the voice signal;
generating a voice signal to be output according to the voice content and the voiceprint features, the voice signal to be output including voiceprint features to be output corresponding to the voiceprint features and voice content to be output corresponding to the voice content; and
outputting the voice signal to be output.
In a second aspect, an embodiment of the present application provides a voice processing apparatus, including:
a collection module configured to collect a voice signal from the external environment;
an acquisition module configured to acquire the voice content and voiceprint features included in the voice signal;
a generation module configured to generate a voice signal to be output according to the voice content and the voiceprint features, the voice signal to be output including voiceprint features to be output corresponding to the voiceprint features and voice content to be output corresponding to the voice content; and
an output module configured to output the voice signal to be output.
In a third aspect, an embodiment of the present application provides a storage medium on which a computer program is stored; when the computer program runs on a computer, the computer is caused to execute:
collecting a voice signal from the external environment;
acquiring the voice content and voiceprint features included in the voice signal;
generating a voice signal to be output according to the voice content and the voiceprint features, the voice signal to be output including voiceprint features to be output corresponding to the voiceprint features and voice content to be output corresponding to the voice content; and
outputting the voice signal to be output.
In a fourth aspect, an embodiment of the present application provides an electronic device including a processor and a memory, the memory storing a computer program, and the processor being configured to call the computer program to execute:
collecting a voice signal from the external environment;
acquiring the voice content and voiceprint features included in the voice signal;
generating a voice signal to be output according to the voice content and the voiceprint features, the voice signal to be output including voiceprint features to be output corresponding to the voiceprint features and voice content to be output corresponding to the voice content; and
outputting the voice signal to be output.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a voice processing method according to an embodiment of the present application.
FIG. 2 is a schematic diagram of an electronic device acquiring voice content from a voice signal in an embodiment of the present application.
FIG. 3 is a schematic diagram of voice interaction between an electronic device and a user in an embodiment of the present application.
FIG. 4 is a schematic diagram of voice interaction between an electronic device and a user in a conference room scene in an embodiment of the present application.
FIG. 5 is another schematic flowchart of a voice processing method according to an embodiment of the present application.
FIG. 6 is a schematic structural diagram of a voice processing apparatus according to an embodiment of the present application.
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
FIG. 8 is another schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to independent or alternative embodiments that are mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
An embodiment of the present application provides a voice processing method. The execution subject of the voice processing method may be the voice processing apparatus provided in the embodiments of the present application, or an electronic device integrating the voice processing apparatus, where the voice processing apparatus may be implemented in hardware or software. The electronic device may be a smartphone, a tablet computer, a palmtop computer, a notebook computer, a desktop computer, or a similar device.
An embodiment of the present application provides a voice processing method, including:
collecting a voice signal from the external environment;
acquiring the voice content and voiceprint features included in the voice signal;
generating a voice signal to be output according to the voice content and the voiceprint features, the voice signal to be output including voiceprint features to be output corresponding to the voiceprint features and voice content to be output corresponding to the voice content; and
outputting the voice signal to be output.
In an embodiment, outputting the voice signal to be output includes:
acquiring the loudness value of the voice signal;
determining, according to the loudness value, a target loudness value corresponding to the voice signal to be output; and
outputting the voice signal to be output at the target loudness value.
In an embodiment, collecting a voice signal from the external environment includes:
when a noisy voice signal is collected from the external environment, acquiring a historical noise signal corresponding to the noisy voice signal;
acquiring, according to the historical noise signal, the noise signal during the collection of the noisy voice signal; and
superimposing the noise signal in anti-phase onto the noisy voice signal, and using the noise-reduced voice signal obtained from the superposition as the voice signal.
In an embodiment, acquiring, according to the historical noise signal, the noise signal during the collection of the noisy voice signal includes:
performing model training with the historical noise signal as sample data to obtain a noise prediction model; and
predicting, according to the noise prediction model, the noise signal during the collection period.
In an embodiment, before generating the voice signal to be output according to the voice content and the voiceprint features, the method further includes:
judging whether the voiceprint features match preset voiceprint features; and
when the voiceprint features match the preset voiceprint features, generating the voice signal to be output according to the voice content and the voiceprint features.
In an embodiment, judging whether the voiceprint features match the preset voiceprint features includes:
acquiring the similarity between the voiceprint features and the preset voiceprint features;
judging whether the similarity is greater than or equal to a first preset similarity; and
when the similarity is greater than or equal to the first preset similarity, determining that the voiceprint features match the preset voiceprint features.
In an embodiment, acquiring the similarity between the voiceprint features and the preset voiceprint features includes:
acquiring the feature distance between the voiceprint features and the preset voiceprint features; and
using the feature distance as the similarity between the voiceprint features and the preset voiceprint features.
In an embodiment, after the step of judging whether the similarity is greater than or equal to the first preset similarity, the method further includes:
when the similarity is less than the first preset similarity and greater than or equal to a second preset similarity, acquiring the current position information;
judging, according to the position information, whether the device is currently within a preset position range; and
when it is currently within the preset position range, determining that the voiceprint features match the preset voiceprint features.
In an embodiment, acquiring the voice content included in the voice signal includes:
parsing the voice signal to obtain the corresponding parsed speech text, and extracting the voice content included in the voice signal from the parsed speech text.
请参照图1,图1为本申请实施例提供的语音处理方法的流程示意图。如图1所示,本申请实施例提供的语音处理方法的流程可以如下:
101、采集外部环境中的语音信号。
其中,电子设备可以通过多种不同方式来采集外部环境中的语音信号,比如,在电子设备未外接麦克风时,电子设备可以通过内置的麦克风对外部环境中的语音进行采集,得到语音信号;又比如,在电子设备外接有麦克风时,电子设备可以通过外接的麦克风对外部环境中的语音进行采集,得到语音信号。
其中,电子设备在通过麦克风(此处的麦克风可以是内置麦克风,也可以是外接麦克风)采集外部环境中的语音信号时,若麦克风为模拟麦克风,将采集到模拟的语音信号,此时电子设备需要对模拟的语音信号进行采样,以将模拟的语音信号转换为数字化的语音信号,比如,可以16KHz的采样频率进行采样;此外,若麦克风为数字麦克风,则电子设备将通过数字麦克风直接采集到数字化的语音信号,无需进行转换。
102. Acquire the voice content and voiceprint feature included in the collected voice signal.
After collecting the voice signal from the external environment, the electronic device judges whether a voice parsing engine exists locally. If one exists, the electronic device inputs the collected voice signal into the local voice parsing engine for voice parsing to obtain a voice parsing text. Voice parsing of a voice signal is the process of converting the voice signal from "audio" into "text".
In addition, when multiple voice parsing engines exist locally, the electronic device may select one of them to parse the voice signal in the following ways:
First, the electronic device may randomly select one voice parsing engine from the multiple local engines to parse the collected voice signal.
Second, the electronic device may select, from the multiple engines, the one with the highest parsing success rate to parse the collected voice signal.
Third, the electronic device may select, from the multiple engines, the one with the shortest parsing time to parse the collected voice signal.
Fourth, the electronic device may select, from the multiple engines, an engine whose parsing success rate reaches a preset success rate and whose parsing time is the shortest, to parse the collected voice signal.
It should be noted that a person skilled in the art may also select a voice parsing engine in ways not listed above, or may combine multiple engines to parse the voice signal. For example, the electronic device may parse the voice signal through two engines simultaneously and, when both produce the same voice parsing text, take that common text as the voice parsing text of the voice signal; or it may parse the voice signal through at least three engines and, when at least two of them produce the same voice parsing text, take that common text as the voice parsing text of the voice signal, as illustrated in the sketch below.
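As an illustration of the strategies above, the following Python sketch implements the fourth selection strategy and the two-out-of-N cross-check. The EngineStats bookkeeping is hypothetical; this application does not specify how success rates or parsing times are recorded.

    from collections import Counter
    from dataclasses import dataclass

    @dataclass
    class EngineStats:
        name: str
        success_rate: float       # fraction of past parses that succeeded
        avg_parse_time_ms: float  # average parsing time observed so far

    def pick_engine(engines, preset_success_rate=0.9):
        """Fourth strategy: among engines meeting the preset success rate,
        pick the one with the shortest parsing time; fall back to all
        engines when none qualifies."""
        qualified = [e for e in engines if e.success_rate >= preset_success_rate]
        pool = qualified if qualified else engines
        return min(pool, key=lambda e: e.avg_parse_time_ms)

    def agreed_transcript(transcripts):
        """Cross-check: accept a parsing text only if at least two engines agree."""
        text, votes = Counter(transcripts).most_common(1)[0]
        return text if votes >= 2 else None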
After parsing the voice signal to obtain its voice parsing text, the electronic device can extract the voice content included in the voice signal from that text. For example, referring to FIG. 2, when the user says "What will the weather be like tomorrow", the electronic device collects the voice signal corresponding to that utterance, parses it to obtain the corresponding voice parsing text, and extracts from the text the voice content "What will the weather be like tomorrow".
In addition, if the electronic device judges that no local voice parsing engine exists, it sends the voice signal to a server (a server providing voice parsing services), instructs the server to parse the voice signal, and receives the voice parsing text the server obtains by parsing the voice signal. After receiving the returned text, the electronic device can extract from it the voice content included in the voice signal.
It should be noted that in real life, everyone's voice has its own characteristics, and people who know each other well can tell one another apart by voice alone. These characteristics constitute the voiceprint feature, which is determined by multiple factors:
The first determinant of the voiceprint feature is the size of the vocal cavities, including the pharynx, nasal cavity, and oral cavity. The shape, size, and position of these organs determine the tension of the vocal cords and the range of voice frequencies. Therefore, even when different people say the same words, the frequency distributions of their voices differ, so some voices sound deep and others resonant.
The second determinant is the manner in which the vocal organs are manipulated. The vocal organs include the lips, teeth, tongue, soft palate, and palatal muscles, whose interaction produces clear speech. The way they cooperate is learned incidentally through interaction with the people around us; in learning to speak, a person gradually forms their own voiceprint feature by imitating the speaking styles of different people nearby.
In addition, the speaker's emotion when speaking also causes the voiceprint feature to change.
Accordingly, in this embodiment of the present application, in addition to acquiring the voice content included in the collected voice signal, the electronic device also acquires the voiceprint feature included in the collected voice signal.
The voiceprint feature includes, but is not limited to, at least one of the following feature components: a spectrum feature component, a cepstrum feature component, a formant feature component, a pitch feature component, a reflection coefficient feature component, a tone feature component, a speaking-rate feature component, an emotion feature component, a prosody feature component, and a rhythm feature component.
103. Generate a to-be-output voice signal according to the acquired voice content and voiceprint feature, the to-be-output voice signal including a to-be-output voiceprint feature corresponding to the aforementioned voiceprint feature and to-be-output voice content corresponding to the aforementioned voice content.
After acquiring the voice content and voiceprint feature included in the voice signal, the electronic device obtains the corresponding to-be-output voice content according to a preset correspondence among voice content, voiceprint feature, and to-be-output voice content, together with the acquired voice content and voiceprint feature. It should be noted that this correspondence may be set by a person skilled in the art according to actual needs; for instance, modal particles that do not affect the semantics may be added to the to-be-output voice content.
For example, taking the case where the voiceprint feature includes only an emotion feature component, when the user says "What will the weather be like tomorrow" in a neutral emotion, the electronic device obtains the corresponding to-be-output content "Tomorrow will be clear and sunny, great for going out"; and when the user says "I'm not happy" in a negative emotion, the electronic device obtains the corresponding to-be-output content "Don't be unhappy, take me out to play".
In addition, the electronic device also obtains the corresponding to-be-output voiceprint feature according to a preset correspondence between voiceprint features and to-be-output voiceprint features, together with the acquired voiceprint feature. This correspondence may likewise be set by a person skilled in the art according to actual needs, and this application imposes no specific limitation on it.
For example, taking the case where the voiceprint feature includes only an emotion feature component, the to-be-output emotion corresponding to a negative emotion may be set to a positive emotion, the to-be-output emotion corresponding to a neutral emotion to a neutral emotion, and the to-be-output emotion corresponding to a positive emotion to a positive emotion.
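By way of illustration only, the correspondences above can be realized as simple lookup tables. The tables below are hypothetical and merely mirror the two examples quoted above; the actual mapping is left to the implementer.

    # Hypothetical correspondence tables mirroring the examples above.
    REPLY_TABLE = {
        ("what will the weather be like tomorrow", "neutral"):
            "Tomorrow will be clear and sunny, great for going out",
        ("i'm not happy", "negative"):
            "Don't be unhappy, take me out to play",
    }
    # A negative input emotion maps to a positive output emotion, and so on.
    OUTPUT_EMOTION = {"negative": "positive", "neutral": "neutral", "positive": "positive"}

    def plan_response(voice_content: str, emotion: str):
        """Look up the to-be-output content and the to-be-output emotion."""
        reply = REPLY_TABLE.get((voice_content.lower(), emotion))
        return reply, OUTPUT_EMOTION.get(emotion, "neutral")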
After obtaining the to-be-output voice content corresponding to the aforementioned voice content and voiceprint feature, and the to-be-output voiceprint feature corresponding to the aforementioned voiceprint feature, the electronic device performs speech synthesis from the to-be-output voice content and the to-be-output voiceprint feature to obtain the to-be-output voice signal, which includes the to-be-output voice content corresponding to the aforementioned voice content and voiceprint feature, as well as the to-be-output voiceprint feature corresponding to the aforementioned voiceprint feature.
104. Output the generated to-be-output voice signal.
After generating the to-be-output voice signal, the electronic device outputs it as speech. For example, referring to FIG. 3, taking the case where the voiceprint feature includes only an emotion feature component, when the user says "I'm not happy" in a negative emotion, the electronic device obtains the corresponding to-be-output content "Don't be unhappy, take me out to play" and the corresponding to-be-output voiceprint feature "positive emotion". The electronic device then synthesizes speech from "Don't be unhappy, take me out to play" and "positive emotion" to obtain the to-be-output voice signal. When this signal is output, if the electronic device is regarded as a "person", this "person" says "Don't be unhappy, take me out to play" in a positive emotion, comforting the user.
As can be seen from the above, the electronic device of this embodiment can collect a voice signal from the external environment, acquire the voice content and voiceprint feature included in the collected voice signal, generate a to-be-output voice signal according to the acquired voice content and voiceprint feature (the to-be-output voice signal including a to-be-output voiceprint feature corresponding to the aforementioned voiceprint feature and to-be-output voice content corresponding to the aforementioned voice content), and finally output the generated signal. The electronic device can thus output a voice signal whose voiceprint feature corresponds to that of the input voice signal, realizing voice output in different vocal styles and thereby improving the flexibility of its voice interaction.
In an implementation, "outputting the generated to-be-output voice signal" includes:
(1) acquiring the loudness value of the aforementioned voice signal;
(2) determining, according to the acquired loudness value, a target loudness value corresponding to the to-be-output voice signal;
(3) outputting the to-be-output voice signal at the determined target loudness value.
When outputting the generated to-be-output voice signal, the electronic device first acquires the loudness value (or volume value) of the aforementioned voice signal and takes it as the input loudness value; it then determines, according to a preset correspondence between input and output loudness values, the output loudness value corresponding to the aforementioned loudness value, takes that output loudness value as the target loudness value corresponding to the to-be-output voice signal, and finally outputs the generated signal at the determined target loudness value.
The correspondence between input and output loudness values may be as follows:
Lout = k * Lin;
where Lout denotes the output loudness value, Lin denotes the input loudness value, and k is a proportionality coefficient that may be set by a person skilled in the art according to actual needs. For example, when k is set to 1, the output loudness equals the input loudness; when k is set to less than 1, the output loudness is lower than the input loudness.
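By way of illustration only, a minimal Python sketch of this mapping follows, paired with one common loudness measure (RMS level in decibels); the measure is an assumption, as this application does not fix one.

    import numpy as np

    def input_loudness_db(samples) -> float:
        """Estimate loudness as the RMS level in dB (one possible measure)."""
        x = np.asarray(samples, dtype=np.float64)
        rms = np.sqrt(np.mean(x * x))
        return 20.0 * np.log10(max(rms, 1e-12))  # guard against log(0)

    def target_loudness(l_in: float, k: float = 0.8) -> float:
        """Lout = k * Lin; with k <= 1 the reply is no louder than the user."""
        return k * l_in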
Thus, determining the target loudness of the to-be-output voice signal from the loudness of the collected voice signal makes the electronic device's voice interaction better fit the current scenario. For example, referring to FIG. 4, when the user, carrying the electronic device in a meeting room, speaks softly, the electronic device also gives its voice feedback softly, avoiding the situation where a fixed output volume disturbs others.
In an implementation, "collecting a voice signal from the external environment" includes:
(1) when a noisy voice signal is collected from the external environment, acquiring a historical noise signal corresponding to the noisy voice signal;
(2) acquiring, according to the historical noise signal, the noise signal during the collection of the noisy voice signal;
(3) superimposing the acquired noise signal on the noisy voice signal in anti-phase, and taking the denoised voice signal obtained by the superposition as the collected voice signal.
It is easy to understand that all kinds of noise exist in the environment; for example, an office contains noise from running computers, keyboard typing, and so on. When collecting voice signals, the electronic device therefore can hardly collect a clean voice signal. Accordingly, this embodiment of the present application further provides a solution for collecting voice signals in a noisy environment.
When the electronic device is in a noisy environment, if the user produces a voice signal, the electronic device collects a noisy voice signal from the external environment, formed by the combination of the user's voice signal and the environmental noise signal; if the user produces no voice signal, the electronic device collects only the environmental noise signal. The electronic device buffers the collected noisy voice signals and noise signals.
In this embodiment of the present application, when the electronic device collects a noisy voice signal from the external environment, it takes the start time of the noisy voice signal as an end time and acquires the noise signal of a preset duration collected before the noisy voice signal was received (the preset duration may be set to a suitable value by a person skilled in the art according to actual needs, and this embodiment imposes no specific limitation; for example, it may be set to 500 ms), taking that noise signal as the historical noise signal corresponding to the noisy voice signal.
For example, if the preset duration is configured as 500 milliseconds and the start time of the noisy voice signal is 16:47:56.500 on June 13, 2018, the electronic device acquires the 500-millisecond noise signal buffered between 16:47:56.000 and 16:47:56.500 on June 13, 2018, and takes it as the historical noise signal corresponding to the noisy voice signal.
After acquiring the historical noise signal corresponding to the noisy voice signal, the electronic device further acquires, according to the historical noise signal, the noise signal during the collection of the noisy voice signal.
For example, the electronic device may predict, from the acquired historical noise signal, the noise distribution during the collection of the noisy voice signal, thereby obtaining the noise signal during that period.
As another example, considering the stability of noise (noise usually changes little over a continuous interval), the electronic device may take the acquired historical noise signal as the noise signal during the collection of the noisy voice signal. If the historical noise signal is longer than the noisy voice signal, a noise segment of the same duration as the noisy voice signal may be cut from it; if it is shorter, the historical noise signal may be copied and multiple copies spliced to obtain a noise signal of the same duration as the noisy voice signal.
After acquiring the noise signal during the collection of the noisy voice signal, the electronic device first inverts the phase of the acquired noise signal, then superimposes the inverted noise signal on the noisy voice signal to cancel the noise portion of the noisy voice signal, obtains the denoised voice signal, and takes that denoised voice signal as the voice signal collected from the external environment for subsequent processing, for which reference may be made to the relevant description above; details are not repeated here.
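A minimal Python sketch of the splice-and-cancel variant described above follows. It assumes the historical noise segment has already been cut from the buffer and is time-aligned with the noisy signal, which is a strong simplification of a real noise-cancellation pipeline.

    import numpy as np

    def denoise(noisy: np.ndarray, history_noise: np.ndarray) -> np.ndarray:
        """Anti-phase superposition with a tiled or truncated noise estimate."""
        n = len(noisy)
        reps = -(-n // len(history_noise))        # ceiling division
        noise = np.tile(history_noise, reps)[:n]  # splice copies, then truncate
        return noisy - noise                      # adding the inverted noise cancels it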
In an implementation, "acquiring, according to the historical noise signal, the noise signal during the collection of the noisy voice signal" includes:
(1) performing model training with the historical noise signal as sample data to obtain a noise prediction model;
(2) predicting, according to the noise prediction model, the noise signal during the collection of the noisy voice signal.
After acquiring the historical noise signal, the electronic device takes it as sample data and performs model training according to a preset training algorithm to obtain the noise prediction model.
It should be noted that the training algorithm is a machine learning algorithm, which can predict data through continual feature learning; for example, the electronic device can predict the current noise distribution from the historical noise distribution. Machine learning algorithms may include decision tree algorithms, regression algorithms, Bayesian algorithms, neural network algorithms (including deep neural networks, convolutional neural networks, recurrent neural networks, and the like), clustering algorithms, and so on. Which training algorithm to use as the preset training algorithm may be chosen by a person skilled in the art according to actual needs.
For example, suppose the preset training algorithm configured on the electronic device is the Gaussian mixture model algorithm (a regression-type algorithm). After acquiring the historical noise signal, the electronic device takes it as sample data and performs model training according to the Gaussian mixture model algorithm, obtaining a Gaussian mixture model (the noise prediction model comprises multiple Gaussian components describing the noise distribution) as the noise prediction model. The electronic device then feeds the start and end times of the collection period of the noisy voice signal into the noise prediction model for processing, and the noise prediction model outputs the noise signal during the collection of the noisy voice signal.
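As a rough illustration of this Gaussian-mixture variant, the following Python sketch fits a mixture to fixed-length frames of the historical noise and samples frames to synthesize noise covering the collection period. It uses scikit-learn's GaussianMixture as a stand-in; the time-indexed query described above is not reproduced here, and the frame length is an arbitrary choice.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    FRAME = 256  # samples per frame; an arbitrary choice for this sketch

    def fit_noise_model(history_noise: np.ndarray, n_components: int = 4):
        """Train the noise prediction model on frames of the historical noise."""
        usable = len(history_noise) // FRAME * FRAME
        frames = history_noise[:usable].reshape(-1, FRAME)
        return GaussianMixture(n_components=n_components).fit(frames)

    def predict_noise(model: GaussianMixture, num_samples: int) -> np.ndarray:
        """Sample frames from the mixture as the predicted collection-period noise."""
        n_frames = -(-num_samples // FRAME)  # ceiling division
        frames, _ = model.sample(n_frames)
        return frames.reshape(-1)[:num_samples]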
In an implementation, before "generating the to-be-output voice signal according to the acquired voice content and voiceprint feature", the method further includes:
(1) judging whether the aforementioned voiceprint feature matches a preset voiceprint feature;
(2) when the aforementioned voiceprint feature matches the preset voiceprint feature, generating the to-be-output voice signal according to the acquired voice content and voiceprint feature.
The preset voiceprint feature may be a voiceprint feature pre-recorded by the device owner, or one pre-recorded by another user authorized by the owner. Judging whether the aforementioned voiceprint feature (that is, the voiceprint feature of the voice signal collected from the external environment) matches the preset voiceprint feature amounts to judging whether the user who produced the voice signal is the owner. If the aforementioned voiceprint feature does not match the preset voiceprint feature, the electronic device determines that the user who produced the voice signal is not the owner; if it matches, the electronic device determines that the user is the owner and then generates the to-be-output voice signal according to the acquired voice content and voiceprint feature, for which reference may be made to the relevant description above; details are not repeated here.
In this embodiment of the present application, before the to-be-output voice signal is generated, the identity of the user who produced the voice signal is verified according to the voiceprint feature of the voice signal, and the to-be-output voice signal is generated according to the acquired voice content and voiceprint feature only when that user is the owner. This prevents the electronic device from responding erroneously to people other than the owner, improving the owner's experience.
In an implementation, "judging whether the aforementioned voiceprint feature matches the preset voiceprint feature" includes:
(1) acquiring the similarity between the aforementioned voiceprint feature and the preset voiceprint feature;
(2) judging whether the acquired similarity is greater than or equal to a first preset similarity;
(3) when the acquired similarity is greater than or equal to the first preset similarity, determining that the aforementioned voiceprint feature matches the preset voiceprint feature.
When judging whether the aforementioned voiceprint feature matches the preset voiceprint feature, the electronic device may acquire the similarity between the two and judge whether the acquired similarity is greater than or equal to a first preset similarity (which may be set by a person skilled in the art according to actual needs). When the acquired similarity is greater than or equal to the first preset similarity, it is determined that the two voiceprint features match; when the acquired similarity is less than the first preset similarity, it is determined that they do not match.
The electronic device may acquire the distance between the aforementioned voiceprint feature and the preset voiceprint feature and take that distance as their similarity. Any feature distance (such as the Euclidean distance, Manhattan distance, or Chebyshev distance) may be chosen by a person skilled in the art according to actual needs to measure the distance between the two voiceprint features.
For example, the cosine distance between the aforementioned voiceprint feature and the preset voiceprint feature may be acquired, with reference to the following formula:
e = (f_1*g_1 + f_2*g_2 + ... + f_N*g_N) / (sqrt(f_1^2 + ... + f_N^2) * sqrt(g_1^2 + ... + g_N^2));
where e denotes the cosine distance between the aforementioned voiceprint feature and the preset voiceprint feature, N denotes the dimensionality of the two voiceprint features (both have the same dimensionality), f_i denotes the feature component of the i-th dimension of the aforementioned voiceprint feature, and g_i denotes the feature component of the i-th dimension of the preset voiceprint feature.
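In code, the formula reduces to a dot product over the two feature vectors divided by the product of their norms; a direct Python sketch:

    import numpy as np

    def cosine_similarity(f, g) -> float:
        """e = sum_i(f_i * g_i) / (||f|| * ||g||), per the formula above."""
        f = np.asarray(f, dtype=np.float64)
        g = np.asarray(g, dtype=np.float64)
        return float(np.dot(f, g) / (np.linalg.norm(f) * np.linalg.norm(g)))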
In an implementation, after "judging whether the acquired similarity is greater than or equal to the first preset similarity", the method further includes:
(1) when the acquired similarity is less than the first preset similarity and greater than or equal to a second preset similarity, acquiring current location information;
(2) judging, according to the location information, whether the device is currently within a preset location range;
(3) when the device is currently within the preset location range, determining that the aforementioned voiceprint feature matches the preset voiceprint feature.
It should be noted that since the voiceprint feature is closely related to the physiological characteristics of the human body, in daily life, if the user catches a cold with an inflamed throat, their voice becomes hoarse and the voiceprint feature changes accordingly. In that case, even if the user producing the voice signal is the owner, the electronic device cannot recognize them. There are also various other situations in which the electronic device cannot recognize the owner, which are not enumerated here.
To address such possible failures to recognize the owner, in this embodiment of the present application, after completing the judgment on the voiceprint similarity, if the similarity between the aforementioned voiceprint feature and the preset voiceprint feature is less than the first preset similarity, the electronic device further judges whether that similarity is greater than or equal to a second preset similarity (the second preset similarity is configured to be less than the first preset similarity and may be set to a suitable value by a person skilled in the art according to actual needs; for example, when the first preset similarity is set to 95%, the second preset similarity may be set to 75%).
If the judgment result is yes, that is, the similarity between the aforementioned voiceprint feature and the preset voiceprint feature is less than the first preset similarity and greater than or equal to the second preset similarity, the electronic device further acquires the current location information.
When in an outdoor environment, the electronic device may use satellite positioning technology to acquire the current location information; when in an indoor environment, it may use indoor positioning technology. (The electronic device may identify whether it is currently outdoors or indoors from the strength of the received satellite positioning signal; for example, when the received signal strength is below a preset threshold, it determines that it is indoors, and when the strength is greater than or equal to the preset threshold, it determines that it is outdoors.)
After acquiring the current location information, the electronic device judges, according to that information, whether it is currently within a preset location range. The preset location range may be configured as locations the owner frequents, such as home and the office.
When it determines that it is currently within the preset location range, the electronic device determines that the aforementioned voiceprint feature matches the preset voiceprint feature, and that the user producing the voice signal is the owner.
This avoids the possible failure to recognize the owner, achieving the goal of improving the owner's experience.
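Taken together, this embodiment's matching logic is a pair of threshold checks plus a location test. A Python sketch with the example thresholds of 95% and 75% quoted above:

    def voiceprint_matches(similarity: float, in_preset_location: bool,
                           first_threshold: float = 0.95,
                           second_threshold: float = 0.75) -> bool:
        """Accept outright above the first threshold; between the two
        thresholds, accept only when the device is within the preset
        location range (e.g. the owner's home or office)."""
        if similarity >= first_threshold:
            return True
        return similarity >= second_threshold and in_preset_location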
On the basis of the methods described in the above embodiments, the voice processing method of the present application is further introduced below. Referring to FIG. 5, the voice processing method may include:
201. When a noisy voice signal is collected from the external environment, acquire a historical noise signal corresponding to the noisy voice signal.
In this embodiment of the present application, when the electronic device collects a noisy voice signal from the external environment, it takes the start time of the noisy voice signal as an end time and acquires the noise signal of a preset duration collected before the noisy voice signal was received (the preset duration may be set to a suitable value by a person skilled in the art according to actual needs, and this embodiment imposes no specific limitation; for example, it may be set to 500 ms), taking that noise signal as the historical noise signal corresponding to the noisy voice signal.
For example, if the preset duration is configured as 500 milliseconds and the start time of the noisy voice signal is 16:47:56.500 on June 13, 2018, the electronic device acquires the 500-millisecond noise signal buffered between 16:47:56.000 and 16:47:56.500 on June 13, 2018, and takes it as the historical noise signal corresponding to the noisy voice signal.
202. Acquire, according to the historical noise signal, the noise signal during the collection of the noisy voice signal.
After acquiring the historical noise signal corresponding to the noisy voice signal, the electronic device further acquires, according to the historical noise signal, the noise signal during the collection of the noisy voice signal.
For example, the electronic device may predict, from the acquired historical noise signal, the noise distribution during the collection of the noisy voice signal, thereby obtaining the noise signal during that period.
As another example, considering the stability of noise (noise usually changes little over a continuous interval), the electronic device may take the acquired historical noise signal as the noise signal during the collection of the noisy voice signal. If the historical noise signal is longer than the noisy voice signal, a noise segment of the same duration as the noisy voice signal may be cut from it; if it is shorter, the historical noise signal may be copied and multiple copies spliced to obtain a noise signal of the same duration as the noisy voice signal.
203. Superimpose the acquired noise signal on the noisy voice signal in anti-phase, and take the denoised voice signal obtained by the superposition as the voice signal to be processed.
After acquiring the noise signal during the collection of the noisy voice signal, the electronic device first inverts the phase of the acquired noise signal, then superimposes the inverted noise signal on the noisy voice signal to cancel the noise portion of the noisy voice signal, obtains the denoised voice signal, and takes that denoised voice signal as the voice signal to be processed for subsequent processing.
204. Acquire the voice content and voiceprint feature included in the aforementioned voice signal.
After obtaining the voice signal to be processed, the electronic device first judges whether a voice parsing engine exists locally. If one exists, the electronic device inputs the aforementioned voice signal into the local voice parsing engine for voice parsing to obtain a voice parsing text. Voice parsing of a voice signal is the process of converting the voice signal from "audio" into "text".
After parsing the aforementioned voice signal to obtain its voice parsing text, the electronic device can extract from that text the voice content included in the voice signal. For example, referring to FIG. 2, when the user says "What will the weather be like tomorrow", the electronic device collects the voice signal corresponding to that utterance, parses it to obtain the corresponding voice parsing text, and extracts from the text the voice content "What will the weather be like tomorrow".
In addition, if the electronic device judges that no local voice parsing engine exists, it sends the aforementioned voice signal to a server (a server providing voice parsing services), instructs the server to parse the voice signal, and receives the voice parsing text the server obtains by parsing the voice signal. After receiving the returned text, the electronic device can extract from it the voice content included in the voice signal.
In this embodiment of the present application, in addition to acquiring the voice content included in the aforementioned voice signal, the electronic device also acquires the voiceprint feature included in it. The voiceprint feature includes, but is not limited to, at least one of the following feature components: a spectrum feature component, a cepstrum feature component, a formant feature component, a pitch feature component, a reflection coefficient feature component, a tone feature component, a speaking-rate feature component, an emotion feature component, a prosody feature component, and a rhythm feature component.
205. Generate a to-be-output voice signal according to the acquired voice content and voiceprint feature, the to-be-output voice signal including a to-be-output voiceprint feature corresponding to the aforementioned voiceprint feature and to-be-output voice content corresponding to the aforementioned voice content.
After acquiring the voice content and voiceprint feature included in the aforementioned voice signal, the electronic device obtains the corresponding to-be-output voice content according to a preset correspondence among voice content, voiceprint feature, and to-be-output voice content, together with the acquired voice content and voiceprint feature. It should be noted that this correspondence may be set by a person skilled in the art according to actual needs; for instance, modal particles that do not affect the semantics may be added to the to-be-output voice content.
For example, taking the case where the voiceprint feature includes only an emotion feature component, when the user says "What will the weather be like tomorrow" in a neutral emotion, the electronic device obtains the corresponding to-be-output content "Tomorrow will be clear and sunny, great for going out"; and when the user says "I'm not happy" in a negative emotion, the electronic device obtains the corresponding to-be-output content "Don't be unhappy, take me out to play".
In addition, the electronic device also obtains the corresponding to-be-output voiceprint feature according to a preset correspondence between voiceprint features and to-be-output voiceprint features, together with the acquired voiceprint feature. This correspondence may be set by a person skilled in the art according to actual needs, and this application imposes no specific limitation on it. For example, taking the case where the voiceprint feature includes only an emotion feature component, the to-be-output emotion corresponding to a negative emotion may be set to a positive emotion, the to-be-output emotion corresponding to a neutral emotion to a neutral emotion, and the to-be-output emotion corresponding to a positive emotion to a positive emotion.
After obtaining the to-be-output voice content corresponding to the aforementioned voice content and voiceprint feature, and the to-be-output voiceprint feature corresponding to the aforementioned voiceprint feature, the electronic device performs speech synthesis from the to-be-output voice content and the to-be-output voiceprint feature to obtain the to-be-output voice signal, which includes the to-be-output voice content corresponding to the aforementioned voice content and voiceprint feature, as well as the to-be-output voiceprint feature corresponding to the aforementioned voiceprint feature.
206. Acquire the loudness value of the aforementioned voice signal.
After generating the to-be-output voice signal, the electronic device first acquires the loudness value (or volume value) of the aforementioned voice signal.
207. Determine, according to the acquired loudness value, a target loudness value corresponding to the to-be-output voice signal.
208. Output the to-be-output voice signal at the determined target loudness value.
After acquiring the loudness value of the aforementioned voice signal, the electronic device takes it as the input loudness value, determines, according to the preset correspondence between input and output loudness values, the output loudness value corresponding to the aforementioned loudness value, takes that output loudness value as the target loudness value corresponding to the to-be-output voice signal, and then outputs the generated to-be-output voice signal at the determined target loudness value. The correspondence between input and output loudness values may be as follows:
Lout = k * Lin;
where Lout denotes the output loudness value, Lin denotes the input loudness value, and k is a proportionality coefficient that may be set by a person skilled in the art according to actual needs. For example, when k is set to 1, the output loudness equals the input loudness; when k is set to less than 1, the output loudness is lower than the input loudness.
In an embodiment, a voice processing device is also provided. Please refer to FIG. 6, which is a schematic structural diagram of the voice processing device 400 provided by an embodiment of the present application. The voice processing device is applied to an electronic device and includes a collection module 401, an acquisition module 402, a generation module 403, and an output module 404, as follows:
The collection module 401 is configured to collect a voice signal from the external environment.
The acquisition module 402 is configured to acquire the voice content and voiceprint feature included in the collected voice signal.
The generation module 403 is configured to generate a to-be-output voice signal according to the acquired voice content and voiceprint feature, the to-be-output voice signal including a to-be-output voiceprint feature corresponding to the aforementioned voiceprint feature and to-be-output voice content corresponding to the aforementioned voice content.
The output module 404 is configured to output the generated to-be-output voice signal.
In an implementation, the output module 404 may be configured to:
acquire the loudness value of the aforementioned voice signal;
determine, according to the acquired loudness value, a target loudness value corresponding to the to-be-output voice signal;
output the to-be-output voice signal at the determined target loudness value.
In an embodiment, the collection module 401 may be configured to:
when a noisy voice signal is collected from the external environment, acquire a historical noise signal corresponding to the noisy voice signal;
acquire, according to the historical noise signal, the noise signal during the collection of the noisy voice signal;
superimpose the acquired noise signal on the noisy voice signal in anti-phase, and take the denoised voice signal obtained by the superposition as the collected voice signal.
In an implementation, the collection module 401 may be configured to:
perform model training with the historical noise signal as sample data to obtain a noise prediction model;
predict, according to the noise prediction model, the noise signal during the collection of the noisy voice signal.
In an implementation, the generation module 403 may be configured to:
judge whether the aforementioned voiceprint feature matches a preset voiceprint feature;
when the aforementioned voiceprint feature matches the preset voiceprint feature, generate the to-be-output voice signal according to the acquired voice content and voiceprint feature.
In an implementation, the generation module 403 may be configured to:
acquire the similarity between the aforementioned voiceprint feature and the preset voiceprint feature;
judge whether the acquired similarity is greater than or equal to a first preset similarity;
when the acquired similarity is greater than or equal to the first preset similarity, determine that the aforementioned voiceprint feature matches the preset voiceprint feature.
In an implementation, the generation module 403 may be configured to:
acquire the feature distance between the voiceprint feature and the preset voiceprint feature;
take the feature distance as the similarity between the voiceprint feature and the preset voiceprint feature.
In an implementation, the generation module 403 may be configured to:
when the acquired similarity is less than the first preset similarity and greater than or equal to a second preset similarity, acquire current location information;
judge, according to the location information, whether the device is currently within a preset location range;
when the device is currently within the preset location range, determine that the aforementioned voiceprint feature matches the preset voiceprint feature.
In an implementation, the acquisition module 402 may be configured to:
parse the voice signal to obtain a corresponding voice parsing text, and extract, from the voice parsing text, the voice content included in the voice signal.
For the steps executed by each module in the voice processing device 400, reference may be made to the method steps described in the above method embodiments. The voice processing device 400 may be integrated into an electronic device, such as a mobile phone or tablet computer.
In specific implementation, the above modules may be realized as independent entities, or combined arbitrarily and realized as one or several entities. For the specific implementation of the above units, reference may be made to the preceding embodiments; details are not repeated here.
In an embodiment, an electronic device is also provided. Referring to FIG. 7, the electronic device 500 includes a processor 501 and a memory 502, the processor 501 being electrically connected to the memory 502.
The processor 501 is the control center of the electronic device 500. It connects all parts of the electronic device using various interfaces and lines, and executes the various functions of the electronic device 500 and processes its data by running or loading the computer program stored in the memory 502 and invoking the data stored in the memory 502.
The memory 502 may be used to store software programs and modules. The processor 501 executes various functional applications and data processing by running the computer programs and modules stored in the memory 502. The memory 502 may mainly include a program storage area and a data storage area: the program storage area may store the operating system, the computer programs required by at least one function (such as a sound playback function or an image playback function), and the like; the data storage area may store data created according to the use of the electronic device, and the like. In addition, the memory 502 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 502 may also include a memory controller to provide the processor 501 with access to the memory 502.
In this embodiment of the present application, the processor 501 in the electronic device 500 loads the instructions corresponding to the processes of one or more computer programs into the memory 502 according to the following steps, and runs the computer programs stored in the memory 502 to realize various functions, as follows:
collecting a voice signal from the external environment;
acquiring the voice content and voiceprint feature included in the collected voice signal;
generating a to-be-output voice signal according to the acquired voice content and voiceprint feature, the to-be-output voice signal including a to-be-output voiceprint feature corresponding to the aforementioned voiceprint feature and to-be-output voice content corresponding to the aforementioned voice content;
outputting the generated to-be-output voice signal.
Referring also to FIG. 8, in some implementations the electronic device 500 may further include a display 503, a radio frequency circuit 504, an audio circuit 505, and a power supply 506, each of which is electrically connected to the processor 501.
The display 503 may be used to display information entered by the user or provided to the user, as well as various graphical user interfaces, which may be composed of graphics, text, icons, video, and any combination thereof. The display 503 may include a display panel, which in some implementations may be configured in the form of a liquid crystal display (LCD) or an organic light-emitting diode (OLED) display.
The radio frequency circuit 504 may be used to transmit and receive radio frequency signals, so as to establish wireless communication with network devices or other electronic devices and to exchange signals with them.
The audio circuit 505 may be used to provide an audio interface between the user and the electronic device through a speaker and a microphone.
The power supply 506 may be used to supply power to the various components of the electronic device 500. In some embodiments, the power supply 506 may be logically connected to the processor 501 through a power management system, thereby realizing functions such as charge management, discharge management, and power consumption management through the power management system.
Although not shown in FIG. 8, the electronic device 500 may further include a camera, a Bluetooth module, and the like, which are not described here.
In some implementations, when outputting the generated to-be-output voice signal, the processor 501 may execute:
acquiring the loudness value of the aforementioned voice signal;
determining, according to the acquired loudness value, a target loudness value corresponding to the to-be-output voice signal;
outputting the to-be-output voice signal at the determined target loudness value.
In some implementations, when collecting a voice signal from the external environment, the processor 501 may execute:
when a noisy voice signal is collected from the external environment, acquiring a historical noise signal corresponding to the noisy voice signal;
acquiring, according to the historical noise signal, the noise signal during the collection of the noisy voice signal;
superimposing the acquired noise signal on the noisy voice signal in anti-phase, and taking the denoised voice signal obtained by the superposition as the collected voice signal.
In some implementations, when acquiring, according to the historical noise signal, the noise signal during the collection of the noisy voice signal, the processor 501 may execute:
performing model training with the historical noise signal as sample data to obtain a noise prediction model;
predicting, according to the noise prediction model, the noise signal during the collection of the noisy voice signal.
In some implementations, when generating the to-be-output voice signal according to the acquired voice content and voiceprint feature, the processor 501 may execute:
judging whether the aforementioned voiceprint feature matches a preset voiceprint feature;
when the aforementioned voiceprint feature matches the preset voiceprint feature, generating the to-be-output voice signal according to the acquired voice content and voiceprint feature.
In some implementations, when judging whether the aforementioned voiceprint feature matches the preset voiceprint feature, the processor 501 may further execute:
acquiring the similarity between the aforementioned voiceprint feature and the preset voiceprint feature;
judging whether the acquired similarity is greater than or equal to a first preset similarity;
when the acquired similarity is greater than or equal to the first preset similarity, determining that the aforementioned voiceprint feature matches the preset voiceprint feature.
In some implementations, when acquiring the similarity between the aforementioned voiceprint feature and the preset voiceprint feature, the processor 501 may execute:
acquiring the feature distance between the voiceprint feature and the preset voiceprint feature;
taking the feature distance as the similarity between the voiceprint feature and the preset voiceprint feature.
In some implementations, after judging whether the acquired similarity is greater than or equal to the first preset similarity, the processor 501 may further execute:
when the acquired similarity is less than the first preset similarity and greater than or equal to a second preset similarity, acquiring current location information;
judging, according to the location information, whether the device is currently within a preset location range;
when the device is currently within the preset location range, determining that the aforementioned voiceprint feature matches the preset voiceprint feature.
In some implementations, when acquiring the voice content included in the collected voice signal, the processor 501 may execute:
parsing the voice signal to obtain a corresponding voice parsing text, and extracting, from the voice parsing text, the voice content included in the voice signal.
An embodiment of the present application also provides a storage medium storing a computer program which, when run on a computer, causes the computer to execute the voice processing method of any of the above embodiments, for example: collecting a voice signal from the external environment; acquiring the voice content and voiceprint feature included in the collected voice signal; generating a to-be-output voice signal according to the acquired voice content and voiceprint feature, the to-be-output voice signal including a to-be-output voiceprint feature corresponding to the aforementioned voiceprint feature and to-be-output voice content corresponding to the aforementioned voice content; and outputting the generated to-be-output voice signal. In this embodiment of the present application, the storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not detailed in one embodiment, reference may be made to the related descriptions of other embodiments.
It should be noted that, for the voice processing method of the embodiments of the present application, a person of ordinary skill in the art can understand that all or part of the flow of implementing the voice processing method of the embodiments of the present application can be completed by controlling the relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, for example in the memory of an electronic device, and be executed by at least one processor within the electronic device, and the execution process may include the flow of the embodiments of the voice processing method. The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.
For the voice processing device of the embodiments of the present application, its functional modules may be integrated into one processing chip, or each module may exist physically alone, or two or more modules may be integrated into one module. The integrated module may be realized in the form of hardware or in the form of a software functional module. If the integrated module is realized in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The voice processing method and device, storage medium, and electronic device provided by the embodiments of the present application have been introduced in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the method of the present application and its core idea. Meanwhile, a person skilled in the art will, according to the idea of the present application, make changes in the specific implementation and application scope. In summary, the content of this specification should not be understood as a limitation on the present application.

Claims (20)

  1. A voice processing method, comprising:
    collecting a voice signal from the external environment;
    acquiring voice content and a voiceprint feature included in the voice signal;
    generating a to-be-output voice signal according to the voice content and the voiceprint feature, the to-be-output voice signal comprising a to-be-output voiceprint feature corresponding to the voiceprint feature and to-be-output voice content corresponding to the voice content; and
    outputting the to-be-output voice signal.
  2. The voice processing method according to claim 1, wherein outputting the to-be-output voice signal comprises:
    acquiring a loudness value of the voice signal;
    determining, according to the loudness value, a target loudness value corresponding to the to-be-output voice signal; and
    outputting the to-be-output voice signal at the target loudness value.
  3. The voice processing method according to claim 1, wherein collecting the voice signal from the external environment comprises:
    when a noisy voice signal is collected from the external environment, acquiring a historical noise signal corresponding to the noisy voice signal;
    acquiring, according to the historical noise signal, a noise signal during the collection of the noisy voice signal; and
    superimposing the noise signal on the noisy voice signal in anti-phase, and taking the denoised voice signal obtained by the superposition as the voice signal.
  4. The voice processing method according to claim 3, wherein acquiring, according to the historical noise signal, the noise signal during the collection of the noisy voice signal comprises:
    performing model training with the historical noise signal as sample data to obtain a noise prediction model; and
    predicting, according to the noise prediction model, the noise signal during the collection period.
  5. The voice processing method according to claim 1, wherein, before generating the to-be-output voice signal according to the voice content and the voiceprint feature, the method further comprises:
    judging whether the voiceprint feature matches a preset voiceprint feature; and
    when the voiceprint feature matches the preset voiceprint feature, generating the to-be-output voice signal according to the voice content and the voiceprint feature.
  6. The voice processing method according to claim 5, wherein judging whether the voiceprint feature matches the preset voiceprint feature comprises:
    acquiring a similarity between the voiceprint feature and the preset voiceprint feature;
    judging whether the similarity is greater than or equal to a first preset similarity; and
    when the similarity is greater than or equal to the first preset similarity, determining that the voiceprint feature matches the preset voiceprint feature.
  7. The voice processing method according to claim 6, wherein acquiring the similarity between the voiceprint feature and the preset voiceprint feature comprises:
    acquiring a feature distance between the voiceprint feature and the preset voiceprint feature; and
    taking the feature distance as the similarity between the voiceprint feature and the preset voiceprint feature.
  8. The voice processing method according to claim 6, wherein, after the step of judging whether the similarity is greater than or equal to the first preset similarity, the method further comprises:
    when the similarity is less than the first preset similarity and greater than or equal to a second preset similarity, acquiring current location information;
    judging, according to the location information, whether the device is currently within a preset location range; and
    when the device is currently within the preset location range, determining that the voiceprint feature matches the preset voiceprint feature.
  9. The voice processing method according to claim 1, wherein acquiring the voice content included in the voice signal comprises:
    parsing the voice signal to obtain a corresponding voice parsing text, and extracting, from the voice parsing text, the voice content included in the voice signal.
  10. A voice processing device, comprising:
    a collection module, configured to collect a voice signal from the external environment;
    an acquisition module, configured to acquire voice content and a voiceprint feature included in the voice signal;
    a generation module, configured to generate a to-be-output voice signal according to the voice content and the voiceprint feature, the to-be-output voice signal comprising a to-be-output voiceprint feature corresponding to the voiceprint feature and to-be-output voice content corresponding to the voice content; and
    an output module, configured to output the to-be-output voice signal.
  11. A storage medium having a computer program stored thereon, wherein, when the computer program is run on a computer, the computer is caused to execute:
    collecting a voice signal from the external environment;
    acquiring voice content and a voiceprint feature included in the voice signal;
    generating a to-be-output voice signal according to the voice content and the voiceprint feature, the to-be-output voice signal comprising a to-be-output voiceprint feature corresponding to the voiceprint feature and to-be-output voice content corresponding to the voice content; and
    outputting the to-be-output voice signal.
  12. An electronic device, comprising a processor and a memory, the memory storing a computer program, wherein the processor is configured, by invoking the computer program, to execute:
    collecting a voice signal from the external environment;
    acquiring voice content and a voiceprint feature included in the voice signal;
    generating a to-be-output voice signal according to the voice content and the voiceprint feature, the to-be-output voice signal comprising a to-be-output voiceprint feature corresponding to the voiceprint feature and to-be-output voice content corresponding to the voice content; and
    outputting the to-be-output voice signal.
  13. The electronic device according to claim 12, wherein, when outputting the to-be-output voice signal, the processor is configured to execute:
    acquiring a loudness value of the voice signal;
    determining, according to the loudness value, a target loudness value corresponding to the to-be-output voice signal; and
    outputting the to-be-output voice signal at the target loudness value.
  14. The electronic device according to claim 12, wherein, when collecting the voice signal from the external environment, the processor is configured to execute:
    when a noisy voice signal is collected from the external environment, acquiring a historical noise signal corresponding to the noisy voice signal;
    acquiring, according to the historical noise signal, a noise signal during the collection of the noisy voice signal; and
    superimposing the noise signal on the noisy voice signal in anti-phase, and taking the denoised voice signal obtained by the superposition as the voice signal.
  15. The electronic device according to claim 14, wherein, when acquiring, according to the historical noise signal, the noise signal during the collection of the noisy voice signal, the processor is configured to execute:
    performing model training with the historical noise signal as sample data to obtain a noise prediction model; and
    predicting, according to the noise prediction model, the noise signal during the collection period.
  16. The electronic device according to claim 12, wherein, before generating the to-be-output voice signal according to the voice content and the voiceprint feature, the processor is further configured to execute:
    judging whether the voiceprint feature matches a preset voiceprint feature; and
    when the voiceprint feature matches the preset voiceprint feature, generating the to-be-output voice signal according to the voice content and the voiceprint feature.
  17. The electronic device according to claim 16, wherein, when judging whether the voiceprint feature matches the preset voiceprint feature, the processor is configured to execute:
    acquiring a similarity between the voiceprint feature and the preset voiceprint feature;
    judging whether the similarity is greater than or equal to a first preset similarity; and
    when the similarity is greater than or equal to the first preset similarity, determining that the voiceprint feature matches the preset voiceprint feature.
  18. The electronic device according to claim 17, wherein, when acquiring the similarity between the voiceprint feature and the preset voiceprint feature, the processor is configured to execute:
    acquiring a feature distance between the voiceprint feature and the preset voiceprint feature; and
    taking the feature distance as the similarity between the voiceprint feature and the preset voiceprint feature.
  19. The electronic device according to claim 17, wherein, after the step of judging whether the similarity is greater than or equal to the first preset similarity, the processor is further configured to execute:
    when the similarity is less than the first preset similarity and greater than or equal to a second preset similarity, acquiring current location information;
    judging, according to the location information, whether the device is currently within a preset location range; and
    when the device is currently within the preset location range, determining that the voiceprint feature matches the preset voiceprint feature.
  20. The electronic device according to claim 12, wherein, when acquiring the voice content included in the voice signal, the processor is configured to execute:
    parsing the voice signal to obtain a corresponding voice parsing text, and extracting, from the voice parsing text, the voice content included in the voice signal.
PCT/CN2019/085543 2018-06-19 2019-05-05 Voice processing method and device, storage medium, and electronic device WO2019242414A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810631577.0A CN108922525B (zh) 2018-06-19 2018-06-19 Voice processing method and device, storage medium, and electronic device
CN201810631577.0 2018-06-19

Publications (1)

Publication Number Publication Date
WO2019242414A1 (zh)

Family

ID=64421230

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/085543 WO2019242414A1 (zh) 2019-05-05 Voice processing method and device, storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN108922525B (zh)
WO (1) WO2019242414A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108922525B (zh) 2018-06-19 2020-05-12 Oppo广东移动通信有限公司 Voice processing method and device, storage medium, and electronic device
CN109817196B (zh) 2019-01-11 2021-06-08 安克创新科技股份有限公司 Noise cancellation method, apparatus, system, device, and storage medium
CN110288989A (zh) 2019-06-03 2019-09-27 安徽兴博远实信息科技有限公司 Voice interaction method and system
CN110400571B (zh) 2019-08-08 2022-04-22 Oppo广东移动通信有限公司 Audio processing method and device, storage medium, and electronic device
CN110767229B (zh) 2019-10-15 2022-02-01 广州国音智能科技有限公司 Voiceprint-based audio output method, apparatus, device, and readable storage medium
CN110634491B (zh) 2019-10-23 2022-02-01 大连东软信息学院 Tandem feature extraction system and method for general voice tasks in voice signals
CN111933138B (zh) 2020-08-20 2022-10-21 Oppo(重庆)智能科技有限公司 Voice control method, apparatus, terminal, and storage medium
CN115497480A (zh) 2021-06-18 2022-12-20 海信集团控股股份有限公司 Voice cloning method, apparatus, device, and medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103165131A (zh) 2011-12-17 2013-06-19 Voice processing system and voice processing method
CN103259908A (zh) 2012-02-15 2013-08-21 Mobile terminal and intelligent control method thereof
CN103838991A (zh) 2014-02-20 2014-06-04 Information processing method and electronic device
CN105488227A (zh) 2015-12-29 2016-04-13 Electronic device and method thereof for processing audio files based on voiceprint features
CN106128467A (zh) 2016-06-06 2016-11-16 Voice processing method and device
US20170069317A1 (en) * 2015-09-04 2017-03-09 Samsung Electronics Co., Ltd. Voice recognition apparatus, driving method thereof, and non-transitory computer-readable recording medium
CN107729433A (zh) 2017-09-29 2018-02-23 Audio processing method and device
CN207149252U (zh) 2017-08-01 2018-03-27 Voice processing system
CN108922525A (zh) 2018-06-19 2018-11-30 Voice processing method and device, storage medium, and electronic device

Also Published As

Publication number Publication date
CN108922525B (zh) 2020-05-12
CN108922525A (zh) 2018-11-30

Similar Documents

Publication Publication Date Title
WO2019242414A1 (zh) Voice processing method and device, storage medium, and electronic device
US20200365155A1 (en) Voice activated device for use with a voice-based digital assistant
CN110136692B (zh) Speech synthesis method, apparatus, device, and storage medium
US20130211826A1 (en) Audio Signals as Buffered Streams of Audio Signals and Metadata
CN110473546B (zh) Media file recommendation method and device
KR20190042918A (ko) Electronic device and operating method thereof
CN108806684B (zh) Position prompting method and device, storage medium, and electronic device
WO2021008538A1 (zh) Voice interaction method and related device
CN107799126A (zh) Voice endpoint detection method and device based on supervised machine learning
CN111583944A (zh) Voice changing method and device
CN108711429B (zh) Electronic device and device control method
CN111508511A (zh) Real-time voice changing method and device
CN110265011B (zh) Interaction method for an electronic device and electronic device thereof
CN108962241B (zh) Position prompting method and device, storage medium, and electronic device
CN112840396A (zh) Electronic device for processing user utterances and control method thereof
WO2022057759A1 (zh) Voice conversion method and related device
WO2022147692A1 (zh) Voice instruction recognition method, electronic device, and non-transitory computer-readable storage medium
CN117059068A (zh) Voice processing method and device, storage medium, and computer device
WO2019242415A1 (zh) Position prompting method and device, storage medium, and electronic device
WO2019228140A1 (zh) Instruction execution method and device, storage medium, and electronic device
CN109064720B (zh) Position prompting method and device, storage medium, and electronic device
KR102114365B1 (ko) Speech recognition method and device
CN108989551B (zh) Position prompting method and device, storage medium, and electronic device
CN114154636A (zh) Data processing method, electronic device, and computer program product
CN111696566B (zh) Voice processing method, apparatus, and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19823564; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19823564; Country of ref document: EP; Kind code of ref document: A1)