CN112151072A - Voice processing method, apparatus and medium - Google Patents

Voice processing method, apparatus and medium

Info

Publication number
CN112151072A
Authority
CN
China
Prior art keywords
voice
speech
information
user
units
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010850493.3A
Other languages
Chinese (zh)
Other versions
CN112151072B (en)
Inventor
叶一川
刘恺
周盼
曹赫
郎勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN202010850493.3A priority Critical patent/CN112151072B/en
Priority claimed from CN202010850493.3A external-priority patent/CN112151072B/en
Publication of CN112151072A publication Critical patent/CN112151072A/en
Application granted granted Critical
Publication of CN112151072B publication Critical patent/CN112151072B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/78 - Detection of presence or absence of voice signals
    • G10L 25/84 - Detection of presence or absence of voice signals for discriminating voice from noise
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225 - Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Embodiments of the invention provide a voice processing method, a voice processing apparatus, and a device for voice processing. The method specifically comprises the following steps: performing voice activity detection on voice to obtain a plurality of voice units corresponding to the voice; outputting a voice entry and a voice recognition result corresponding to each of the plurality of voice units; receiving correction information of a user for a target voice unit; and correcting the target voice unit according to the correction information. Embodiments of the invention can improve the efficiency of voice correction.

Description

Voice processing method, apparatus and medium
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a method and an apparatus for processing speech, and a machine-readable medium.
Background
With the development of communication technology and the popularization of audio devices such as Bluetooth headsets and smart speakers, the consumption of audio content, including radio stations, web podcasts, audio books, knowledge programs, etc., has kept growing in recent years. Acquiring sound by listening anytime and anywhere has become the choice of more and more users, and immersive audio experiences of news, learning, entertainment, music, and the like can easily be obtained in scenarios such as driving, commuting, and falling asleep.
At present, text designated by a user can be converted into speech and output to the user; alternatively, the voice input by the user may be converted into speech with a specified timbre and output to the user. In addition, in order to improve the quality of the speech, errors in the speech can be located by listening to the speech so that they can be corrected.
The inventors have discovered in practicing embodiments of the present invention that speech is typically stored in a speech file and that a user typically needs to listen to the entire speech file to determine errors in the speech. Listening to the whole voice file usually takes considerable time and cost, and thus the efficiency of voice correction is low.
Disclosure of Invention
In view of the above problems, embodiments of the present invention have been made to provide a speech processing method, a speech processing apparatus, and an apparatus for speech processing that overcome or at least partially solve the above problems and can improve the efficiency of voice correction.
In order to solve the above problem, the present invention discloses a speech processing method, comprising:
performing voice activity detection on voice to obtain a plurality of voice units corresponding to the voice;
outputting the voice entry and the voice recognition result respectively corresponding to the plurality of voice units;
receiving correction information of a user aiming at a target voice unit;
and correcting the target voice unit according to the correction information.
In another aspect, an embodiment of the present invention discloses a speech processing apparatus, including:
the detection module is used for carrying out voice activity detection on voice to obtain a plurality of voice units corresponding to the voice;
the output module is used for outputting the voice entries and the voice recognition results corresponding to the voice units respectively;
the receiving module is used for receiving correction information of a user aiming at the target voice unit; and
and the correction module is used for correcting the target voice unit according to the correction information.
In yet another aspect, an embodiment of the present invention discloses an apparatus for speech processing, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
performing voice activity detection on voice to obtain a plurality of voice units corresponding to the voice;
outputting the voice entry and the voice recognition result respectively corresponding to the plurality of voice units;
receiving correction information of a user aiming at a target voice unit;
and correcting the target voice unit according to the correction information.
One or more machine-readable media are also disclosed, having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform the foregoing methods.
The embodiment of the invention has the following advantages:
the embodiment of the invention respectively outputs the corresponding voice inlet and the voice recognition result aiming at a plurality of voice units in the voice, so that a user can carry out voice correction by taking the voice units as a unit and generate corresponding correction information. Compared with the situation that a user listens to the whole voice file, the voice unit and the comparison output of the voice recognition result thereof in the embodiment of the invention can help the user determine which voice unit to listen to.
In addition, in the case of positioning an error of a speech unit, the user may correct the target speech unit in which the error occurs to obtain corresponding correction information. The embodiment of the invention takes the voice unit as a unit for correction, so that the operation cost for correcting the whole voice file can be saved, and the efficiency of voice correction can be improved.
Drawings
FIG. 1 is a flowchart illustrating steps of a first embodiment of a speech processing method according to the present invention;
FIG. 2 is a flowchart illustrating steps of a second embodiment of a speech processing method according to the present invention;
FIG. 3 is a flowchart illustrating the steps of a third embodiment of a speech processing method;
FIG. 4 is a block diagram of a speech processing apparatus according to the present invention;
FIG. 5 is a block diagram of an apparatus 1300 for speech processing of the present invention; and
fig. 6 is a schematic structural diagram of a server according to the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The voice processing scenarios of the embodiment of the present invention may include: speech synthesis scenarios, sound changing scenarios, etc. In a speech synthesis scenario, text designated by the user can be converted into speech with a specified timbre; speech synthesis can be applied to fields such as news broadcasting, audio reading, teaching, medical treatment, customer service, and legal services. In a sound changing scenario, a first voice input by the user can be converted into a second voice with a specified timbre, and any of the speaking content, speech rate, pauses, and emotion of the first voice can be preserved in the second voice. It is to be understood that the embodiments of the present invention are not limited to specific speech processing scenarios.
Aiming at the technical problem of low efficiency of voice correction in the prior art, the embodiment of the invention provides a voice processing scheme, which specifically comprises the following steps: carrying out voice activity detection on voice to obtain a plurality of voice units corresponding to the voice; outputting the voice entry and the voice recognition result respectively corresponding to the plurality of voice units; receiving correction information of a user aiming at a target voice unit; and correcting the target voice unit according to the correction information.
VAD (Voice Activity Detection) can accurately detect valid speech and invalid speech (such as noise, laughter, crying, music, background voices, etc.) under stationary or non-stationary noise, and segment the speech according to the detection result; this segmentation breaks the speech into sentences, so that each segmented voice unit can be treated as an independent sentence.
According to the embodiment of the invention, a corresponding voice entry and voice recognition result are output for each of a plurality of voice units in the voice, so that a user can perform voice correction with the voice unit as the granularity and generate corresponding correction information. Compared with the case where a user listens to the whole voice file, the side-by-side output of the voice units and their voice recognition results in the embodiment of the invention can help the user determine which voice unit to listen to.
For example, if the speech includes N (N may be a natural number greater than 1) speech units, the embodiment of the present invention may present the N speech units in the following manner:
Speech unit 1    Speech entry 1    Speech recognition result 1
Speech unit 2    Speech entry 2    Speech recognition result 2
……
Speech unit N    Speech entry N    Speech recognition result N
The displayed voice recognition result can help the user to determine whether to listen to the corresponding voice unit, so that the voice unit can be screened, and the user can be helped to quickly locate the error of the voice unit.
The error types of a speech unit may include, but are not limited to: the voice recognition result does not match the source text before speech synthesis, the pronunciation information is wrong, or the emotion parameter is wrong. A mismatch between the voice recognition result and the source text before speech synthesis specifically includes: extra-read errors (e.g., a word read repeatedly or extra words read), missed-read errors, misread errors (e.g., a word read as another word), etc. It is to be understood that embodiments of the present invention are not limited to specific types of errors. In a speech synthesis scenario, a user can upload a source text to obtain the corresponding speech, and the uploaded source text can serve as the basis of the speech synthesis.
For example, if voice recognition result i includes a polyphone and the user suspects that the pronunciation information of the polyphone may be wrong, the user can listen to the corresponding voice unit. For another example, if voice recognition result j includes lyrical expressions, the user may expect emotional expression in them and can therefore listen to the corresponding voice unit.
In the case of positioning an error of a speech unit, a user may correct a target speech unit in which the error occurs to obtain corresponding correction information. The embodiment of the invention takes the voice unit as a unit for correction, so that the operation cost for correcting the whole voice file can be saved, and the efficiency of voice correction can be improved.
The voice processing method provided by the embodiment of the invention can be applied to application environments corresponding to the client and the server, wherein the client and the server are positioned in a wired or wireless network, and the client and the server perform data interaction through the wired or wireless network.
Optionally, the client may run on a terminal, where the terminal specifically includes but is not limited to: smart phones, tablet computers, electronic book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, car-mounted computers, desktop computers, set-top boxes, smart televisions, wearable devices, and the like.
The client may correspond to a website, or APP (Application). For example, the client may correspond to an application such as a speech processing APP.
Method embodiment one
Referring to fig. 1, a flowchart illustrating steps of a first embodiment of a processing method for a speech unit according to the present invention is shown, which may specifically include the following steps:
step 101, performing voice activity detection on voice to obtain a plurality of voice units corresponding to the voice;
Step 102, outputting voice entries and voice recognition results corresponding to the voice units respectively;
Step 103, receiving correction information of a user aiming at a target voice unit;
and step 104, correcting the target voice unit according to the correction information.
The method embodiment shown in fig. 1 may be executed by the client and/or the server; it is understood that the embodiment of the present invention does not impose any limitation on the specific execution subject of the method embodiment.
In step 101, the speech may represent the speech to be corrected in a speech processing scenario.
VAD can separate valid speech from invalid speech to make subsequent speech processing more efficient. If the VAD cuts off valid speech, it will cause speech loss; if the VAD passes invalid speech such as noise into the subsequent speech processing system, it will affect the accuracy of the speech processing.
In an alternative embodiment of the invention, a method of voice activity detection may comprise: a detection method based on voice features. The voice features specifically include: energy features, periodicity features, etc. For example, voice activity detection may be performed using a dual-threshold or four-threshold energy method, in which the energy thresholds of the speech frames are generally set empirically; such methods are simple and fast, but their detection accuracy is low.
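For illustration only (this sketch is not part of the original disclosure; the frame length, hop size, and threshold values are assumptions chosen for the example), a dual-threshold energy detector of the kind described above might look as follows in Python:

```python
import numpy as np

def frame_energies(signal, frame_len=400, hop=160):
    """Split a 1-D waveform (e.g. 16 kHz samples) into frames and return per-frame energy."""
    n_frames = max(0, 1 + (len(signal) - frame_len) // hop)
    if n_frames == 0:
        return np.zeros(0)
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return (frames.astype(float) ** 2).sum(axis=1)

def dual_threshold_vad(signal, low=1e-3, high=1e-2, frame_len=400, hop=160):
    """Flag frames as speech once energy rises above `high`, and keep flagging
    until it falls back below `low` (simple hysteresis). The thresholds are set
    empirically, which is why this family of methods is fast but not very accurate."""
    flags, in_speech = [], False
    for energy in frame_energies(signal, frame_len, hop):
        if not in_speech and energy > high:
            in_speech = True
        elif in_speech and energy < low:
            in_speech = False
        flags.append(in_speech)
    return np.array(flags)
```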
In another alternative embodiment of the present invention, a method of voice activity detection may comprise: a statistical model based approach, or a machine learning based detection approach.
The detection method based on machine learning converts voice activity detection into a binary classification problem, and the corresponding categories specifically include: a speech class and a non-speech class. The classifier can be trained according to the training data to learn different characteristics of the voice class and the non-voice class in the training data, so that the classifier has the capability of distinguishing the voice class from the non-voice class.
The classifier may include a mathematical model. A mathematical model is a scientific or engineering model constructed using mathematical logic methods and mathematical language; it is a mathematical structure that expresses, generally or approximately, the characteristics or quantitative dependencies of a given object system in mathematical language, and this structure is a relational structure described by means of mathematical symbols. The mathematical model may be one or a set of algebraic, differential, integral, or statistical equations, or a combination thereof, by which the interrelationships or causal relationships between the variables of the system are described quantitatively or qualitatively. In addition to models described by equations, there are also models described by other mathematical tools, such as algebra, geometry, topology, and mathematical logic. The mathematical model describes the behavior and characteristics of the system rather than its actual structure. The mathematical model can be trained using machine learning or deep learning methods; machine learning methods may include: linear regression, decision trees, random forests, etc., and deep learning methods may include: Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) networks, Gated Recurrent Units (GRU), and so on.
In an optional embodiment of the present invention, a classifier based on LSTM (Long Short-Term Memory network) may be used to determine the category corresponding to a speech segment in speech; the categories specifically include: a speech category, or a non-speech category.
The embodiment of the invention can utilize the classifier to intercept the voice fragments from the voice. If the voice segment corresponds to the voice category, the voice segment can be a voice unit; if the speech segment corresponds to a non-speech category, the speech segment may be a non-speech unit.
The embodiment of the invention can determine the corresponding category of the voice fragment according to the probability that the voice fragment belongs to a certain category. For example, if a first probability that a speech segment belongs to a non-speech category exceeds a first probability threshold, the speech segment corresponds to the non-speech category. For another example, if the second probability that the voice segment belongs to the voice category exceeds the second probability threshold, the voice segment corresponds to the voice category. The first probability threshold and the second probability threshold may be values between 0 and 1, which may be determined by one skilled in the art according to the actual application requirements, for example, the first probability threshold and the second probability threshold may be values of 0.8, 0.9, and so on.
Optionally, if there is a non-speech unit between speech unit i and speech unit (i+1), the start position of the non-speech unit may correspond to the end position of speech unit i, and the end position of the non-speech unit may correspond to the start position of speech unit (i+1).
An LSTM has a long-term memory function and can be applied to long sequences, so it can process longer speech. Moreover, the LSTM adopts a gating mechanism, which can mitigate problems such as gradient explosion and gradient vanishing to a certain extent and can improve classification accuracy.
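For illustration only (this sketch is not part of the original disclosure; PyTorch, the feature dimension, the hidden size, and the 0.8 thresholds are assumptions of the example), an LSTM-based speech/non-speech classifier applying the probability thresholds described above could be sketched as:

```python
import torch
import torch.nn as nn

class VadLstmClassifier(nn.Module):
    """Per-frame speech / non-speech classifier over acoustic features (e.g. MFCCs)."""
    def __init__(self, feat_dim=40, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 2)        # class 0 = non-speech, class 1 = speech

    def forward(self, feats):                       # feats: (batch, frames, feat_dim)
        hidden, _ = self.lstm(feats)
        return self.head(hidden)                    # per-frame logits

def classify_segment(model, feats, p_nonspeech=0.8, p_speech=0.8):
    """Apply the probability thresholds described above to one feature segment
    (assumes a single segment in the batch)."""
    with torch.no_grad():
        probs = torch.softmax(model(feats), dim=-1).mean(dim=1)   # average over frames
    if probs[0, 0] > p_nonspeech:
        return "non-speech"
    if probs[0, 1] > p_speech:
        return "speech"
    return "undecided"

# Training would minimize a cross-entropy loss over labeled speech / non-speech frames,
# e.g. nn.CrossEntropyLoss()(model(feats).reshape(-1, 2), frame_labels.reshape(-1)).
```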
Optionally, the training data of the classifier specifically includes: the voice data corresponding to the voice processing scene, for example, the voice data corresponding to the voice synthesis scene, or the voice data corresponding to the sound change scene, etc. The embodiment of the invention can label the voice segments in the voice data to obtain the labeled voice units and the labeled non-voice units. Optionally, the embodiment of the present invention may perform framing processing on the voice data according to the requirement of voice recognition.
Optionally, noise may be included in the voice data, so that the classifier can learn to distinguish speech from non-speech when noise and speech are superimposed, giving the classifier better robustness in noisy environments.
In this embodiment of the present invention, optionally, the loss function of the classifier may include: cross entropy, squared error loss, etc. The loss function can be used to measure the degree of inconsistency between the predicted value f(x) of the classifier and the true value Y; in general, the smaller the loss function, the more robust the classifier.
Optionally, the determining the category corresponding to the voice specifically includes: determining a first probability that a speech segment belongs to a non-speech class using a long-short term memory network-based classifier; and if the first probability exceeds a first probability threshold and the number of the voice frames included in the voice segment exceeds a number threshold, determining a segmentation point according to the voice frames included in the voice segment.
The embodiment of the invention can determine the category corresponding to the voice segment by combining the first probability and the number of voice frames included in the voice segment. Assuming that the number threshold is P, when the first probability exceeds the first probability threshold, the position corresponding to the P-th voice frame in the voice segment may be used as a segmentation point; that is, the segment corresponding to the first P voice frames may be treated as a non-voice unit. Determining the category of the voice segment in combination with the number of voice frames can, to a certain extent, reduce the chance that speech with a slow speaking rate is misjudged as invalid speech, and can improve the accuracy of voice activity detection.
It should be noted that the classifier may be used to continue detecting the consecutive voice frames after the P-th voice frame. The consecutive voice frames after the P-th voice frame may form voice units or non-voice units. The number threshold can be determined by those skilled in the art according to the actual application requirements; for example, the number threshold may be a value such as 30.
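For illustration only (not part of the original disclosure; interpreting the number threshold as a run of consecutive high-probability frames is an assumption of this sketch), the segmentation-point logic described above might be written as:

```python
def split_on_nonspeech(nonspeech_probs, prob_threshold=0.8, count_threshold=30):
    """Scan per-frame non-speech probabilities and emit a segmentation point once
    `count_threshold` consecutive frames exceed `prob_threshold`; the frames before
    the split form a non-speech unit, and detection continues after the split."""
    split_points, run_start = [], None
    for i, prob in enumerate(nonspeech_probs):
        if prob > prob_threshold:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= count_threshold:
                split_points.append(i + 1)    # position right after the P-th frame
                run_start = None              # keep detecting the following frames
        else:
            run_start = None
    return split_points

# Example: with count_threshold=3, a run of three high-probability frames yields one split.
print(split_on_nonspeech([0.1, 0.9, 0.95, 0.9, 0.2], prob_threshold=0.8, count_threshold=3))  # [4]
```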
In the embodiment of the invention, the classifier can extract the characteristics of the training data or the voice to be corrected, and determines the category corresponding to the voice segment in the voice according to the extracted voice characteristics.
The speech features may include, but are not limited to, prosodic features, psychoacoustic features, and spectral features.
Prosodic features, also called supra-sound-quality features or suprasegmental features, refer to changes in the pitch, duration, and intensity of the speech apart from its sound quality. In the present embodiment, the prosodic features include, but are not limited to, pitch frequency, utterance duration, utterance amplitude, and utterance pace. In embodiments of the present invention, psychoacoustic features include, but are not limited to, formants, band energy distribution, harmonics-to-noise ratio, and short-time energy jitter.
The spectral features, also called vibration spectrum features, are patterns obtained by decomposing a complex oscillation into harmonic oscillations with different amplitudes and frequencies and arranging the amplitudes of these harmonic oscillations by frequency. The spectral features are fused with the prosodic and sound quality features to improve the noise robustness of the feature parameters. In the embodiment of the invention, MFCCs (Mel-Frequency Cepstral Coefficients), which reflect the auditory characteristics of the human ear, are used as the spectral features.
Optionally, statistical information of the speech features may also be used as the speech features, and the statistical information may include, but is not limited to: mean, variance, minimum, maximum, range, slope, etc.
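For illustration only (not part of the original disclosure; the librosa library, the 13-coefficient setting, and the function name are assumptions of the example), MFCC extraction together with the segment-level statistics listed above might be sketched as:

```python
import numpy as np
import librosa   # assumed feature-extraction library; the patent does not name one

def extract_features(wav_path, sr=16000, n_mfcc=13):
    """Return frame-level MFCCs plus segment-level statistics (mean, variance,
    minimum, maximum, range), mirroring the statistics listed above."""
    samples, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, frames)
    stats = np.concatenate([
        mfcc.mean(axis=1),
        mfcc.var(axis=1),
        mfcc.min(axis=1),
        mfcc.max(axis=1),
        mfcc.max(axis=1) - mfcc.min(axis=1),    # range
    ])
    return mfcc, stats
```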
In step 102, the voice entry may be used to listen to a voice corresponding to the voice unit, and the voice entry may be presented in the form of a voice playing control.
The speech recognition result may be text corresponding to the speech. The speech corresponding to the speech unit can be subjected to speech recognition to obtain a speech recognition result. Alternatively, in a speech synthesis scenario, the text before conversion corresponding to the speech unit may be used as a speech recognition result.
The embodiment of the invention can display the voice entry and the voice recognition result respectively corresponding to the plurality of voice units according to the preset display style. For example, a corresponding display area may be set for each speech unit, and the speech entry and the speech recognition result may be displayed in the display area for comparison and viewing. The user of the embodiment of the invention can be a user who obtains voice service, such as a foreground user; alternatively, the user of the embodiment of the present invention may be a user who provides a voice service, such as a technician in the background.
In step 103, the error types of a voice unit may include, but are not limited to: the voice recognition result does not match the source text before speech synthesis, the pronunciation information is wrong, or the emotion parameter is wrong. A mismatch between the voice recognition result and the source text before speech synthesis specifically includes: extra-read errors (e.g., a word read repeatedly or extra words read), missed-read errors, misread errors (e.g., a word read as another word), etc. It is to be understood that embodiments of the present invention are not limited to specific types of errors.
The correction information may include: error information, such as X extra words read, Y words missed, emotion mismatch, or lack of emotion.
Optionally, the correction information may further include: modification suggestion information for the error, such as indicating that "yes" should be pronounced in the fourth tone, or that the emotion should be sad, and so on.
Optionally, the correction information may further include: user speech. The user speech can be the pronunciation that the user considers accurate, and can correspond to part or all of the speech in the target voice unit. For example, the user speech may be speech with emotion, or speech with accurate pronunciation.
In this case, the user speech may be analyzed to obtain the corresponding correction information; for example, the corrected emotion parameters may be extracted from the user speech, or accurate pronunciation information may be extracted from the user speech.
Optionally, the correction information may include: corrected pronunciation information. When the target voice unit contains a pronunciation error, the speech can be corrected according to the corrected pronunciation information. For example, if the word rendered here as "section" is misread as "xu" in the speech of the target voice unit, corrected pronunciation information (e.g., "chang" with the appropriate tone) can be provided for that word.
In step 104, the target voice unit may be corrected according to the correction information. Optionally, in a speech synthesis scenario, speech synthesis may be performed on the source text corresponding to the target voice unit. Speech synthesis technology is a technology for generating artificial speech by mechanical or electronic means. TTS (Text To Speech) is a speech synthesis technology that converts text information generated by a computer or input from outside into intelligible and fluent spoken Chinese and outputs it.
Optionally, the speech synthesis result corresponding to the target voice unit may be fused with the speech corresponding to the non-target voice units to obtain a result speech; the result speech is the corrected speech and may be provided to the user. Non-target voice units are voice units not involved in the correction. It can be understood that, when there are multiple target voice units, the speech synthesis results corresponding to the multiple target voice units may be fused with the speech corresponding to the non-target voice units according to the positions or order of the target voice units in the speech.
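For illustration only (not part of the original disclosure; the pydub library and the (start_ms, end_ms, path) tuple layout are assumptions of the example), fusing the re-synthesized target units with the untouched speech of the non-target units might be sketched as:

```python
from pydub import AudioSegment   # assumed audio toolkit; the patent does not name one

def splice_corrected_units(original_path, corrections, out_path):
    """Replace the time ranges of corrected target voice units inside the original
    speech, keeping the speech of non-target units untouched.

    `corrections` is a list of (start_ms, end_ms, corrected_wav_path) tuples sorted
    by start_ms; the tuple layout is illustrative only."""
    original = AudioSegment.from_file(original_path)
    result, cursor = original[:0], 0
    for start_ms, end_ms, corrected_wav_path in corrections:
        result += original[cursor:start_ms]                    # speech of non-target units
        result += AudioSegment.from_file(corrected_wav_path)   # re-synthesized target unit
        cursor = end_ms
    result += original[cursor:]
    result.export(out_path, format="wav")
```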
To sum up, the method for processing a speech unit according to the embodiment of the present invention outputs a corresponding speech entry and a speech recognition result for each of a plurality of speech units in a speech, so that a user can perform speech correction in units of speech units. Compared with the situation that a user listens to the whole voice file, the voice unit and the comparison output of the voice recognition result thereof in the embodiment of the invention can help the user determine which voice unit to listen to.
In the case of positioning an error of a speech unit, a user may correct a target speech unit in which the error occurs to obtain corresponding correction information. The embodiment of the invention takes the voice unit as a unit for correction, so that the operation cost for correcting the whole voice file can be saved, and the efficiency of voice correction can be improved.
Method embodiment two
Referring to fig. 2, a flowchart illustrating steps of a second embodiment of a processing method for a speech unit according to the present invention is shown, which may specifically include the following steps:
step 201, performing voice activity detection on a voice to obtain a plurality of voice units corresponding to the voice;
step 202, outputting the voice entry and the voice recognition result respectively corresponding to the plurality of voice units;
with respect to the first embodiment of the method shown in fig. 1, the method of this embodiment may further include:
step 203, displaying the current pronunciation information of the polyphone in the voice recognition result; the current pronunciation information is obtained according to the polyphones and the context thereof;
step 204, displaying a pronunciation input interface corresponding to the polyphone according to the correction operation of the user on the current pronunciation information, so that the user can input the corrected pronunciation information;
and step 205, correcting the voice corresponding to the target voice unit according to the corrected pronunciation information.
In step 203, the pronunciation information represents the reading of the polyphone. Taking Chinese characters as an example, the pronunciation information may include: pinyin and tone.
A polyphone is a character with two or more pronunciations, where different pronunciations carry different meanings, different usages, and different parts of speech. The pronunciation serves to distinguish the part of speech and the meaning of the word; it differs according to the usage, and different usages give it different functions.
The embodiment of the invention obtains the current pronunciation information of the polyphone according to the polyphone in the text and the context thereof. The polyphones and the context thereof can correspond to language units such as words, phrases, sentences or paragraphs, and the accuracy of the current pronunciation information can be improved because richer language information represented by the language units is adopted in the process of determining the current pronunciation information.
In addition, once the current pronunciation information is determined, the embodiment of the invention can display the current pronunciation information of the polyphones in the text without being limited by conditions such as listening to the voice and the like. The embodiment of the invention can provide the current pronunciation information for the user under the condition of not listening to the voice so as to correct the pronunciation; therefore, the embodiment of the invention can save the time cost spent on listening the voice, and further can improve the efficiency of voice correction.
In this embodiment of the present invention, optionally, the voice recognition result may be displayed in a text region, and the current pronunciation information may be displayed in the region around the polyphone in the voice recognition result. For example, the current pronunciation information may be presented in the region above the polyphone. For instance, the text may include the word rendered here as "so", in which the character rendered as "yes" is a polyphone; the current pronunciation information "wei4" may then be shown above that character, where "4" indicates the fourth tone.
The context in embodiments of the invention may include: the preceding context and/or the following context. Optionally, the preceding context is typically the part before the polyphone, and the following context is typically the part after the polyphone.
The polyphones and the context thereof can correspond to language units such as words, phrases, sentences or paragraphs, and the accuracy of the current pronunciation information can be improved because richer language information represented by the language units is adopted in the process of determining the current pronunciation information.
In an optional embodiment of the present invention, the method may further include: and determining the current pronunciation information of the polyphones in the text according to the polyphones, the context of the polyphones and the labeled linguistic data containing the polyphones. The labeled corpus can represent the corpus for labeling the pronunciation information of the polyphone. The markup corpus may correspond to language units such as words, or phrases, or sentences, or paragraphs.
In the embodiment of the present invention, optionally, polyphones in the text may be detected according to the polyphone set. For example, words in the text are matched with the polyphonic character set to obtain polyphonic characters in the text that hit the polyphonic character set.
According to one embodiment, the labeled corpus can be located in a dictionary, and the current pronunciation information of polyphones in the text can be determined based on a dictionary matching mode.
According to another embodiment, the polyphones and the context thereof can be matched with the labeled corpus, and the current pronunciation information of the polyphones in the text can be obtained according to the pronunciation information of the polyphones in the target labeled corpus which is successfully matched.
According to yet another embodiment, the mathematical model may be trained based on the labeled corpus to obtain the data analyzer. The data analyzer may characterize a mapping between input data (polyphones and their context) and output data (current pronunciation information for polyphones).
The labeled corpus can represent the language environment, and the data analyzer can obtain the rule of the current pronunciation information of the polyphones in the specific language environment based on learning. Therefore, in the case of using the data analyzer, it is possible to determine the current pronunciation information of the polyphone according to the matching of the language environment between the polyphone and the context thereof as well as the markup corpus, instead of requiring the literal matching of the polyphone and the context thereof as well as the markup corpus.
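For illustration only (not part of the original disclosure; the lexicon entries and the cue-counting heuristic are assumptions of the example, standing in for the labeled corpus or trained data analyzer), determining the current pronunciation of a polyphone from its context might be sketched as:

```python
# A toy context-matching sketch; the lexicon content and scoring are illustrative only,
# and a trained data analyzer could replace this lookup.
POLYPHONE_LEXICON = {
    "为": [
        ({"因为", "为了"}, "wei4"),   # contexts labeled with the fourth-tone reading
        ({"作为", "为人"}, "wei2"),   # contexts labeled with the second-tone reading
    ],
}

def current_pronunciation(char, context, default=None):
    """Pick the pronunciation whose labeled contexts best match the polyphone's
    surrounding word, phrase, or sentence."""
    best, best_hits = default, 0
    for cues, pinyin in POLYPHONE_LEXICON.get(char, []):
        hits = sum(1 for cue in cues if cue in context)
        if hits > best_hits:
            best, best_hits = pinyin, hits
    return best

print(current_pronunciation("为", "因为下雨，活动取消了"))   # -> wei4
```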
In step 204, the correction operation may be used to trigger correction of the current pronunciation information. The correction operation may be a voice operation, a touch operation, or a mouse operation.
For example, the voice operation may be "correct the pronunciation of X word", and the pronunciation input interface corresponding to X word may be presented in response to the voice operation. An "X word" may characterize a polyphonic word.
For another example, the touch operation may be a click operation for the X word, and the pronunciation input interface corresponding to the X word may be displayed in response to the click operation.
For another example, the mouse operation may be a selection operation on the X word, and the pronunciation input interface corresponding to the X word may be displayed in response to the mouse selection operation.
The embodiment of the invention shows the pronunciation input interface corresponding to the polyphone, so that a user can input corrected pronunciation information.
According to an embodiment, the displaying of the pronunciation input interface corresponding to the polyphonic character may specifically include: displaying the pronunciation options corresponding to the polyphones for the user to select; for example, for the polyphone "yes", the pronunciation options of "wei 4", "wei 2", and the like are provided.
According to another embodiment, the displaying of the pronunciation input interface corresponding to the polyphone may specifically include: displaying the pronunciation input box corresponding to the polyphone for the user to input. The user can input corresponding pronunciation information, such as "wei 2" or the like, in the pronunciation input box.
In step 205, the corrected pronunciation information is used for correcting the voice of the target voice unit, so that the accuracy of the voice can be improved.
According to an embodiment, the corrected pronunciation information corresponds to the first target speech unit, and speech synthesis can be performed on the speech recognition result corresponding to the first target speech unit according to the corrected pronunciation information; and the voice synthesis result corresponding to the first target voice unit is fused with the voice corresponding to the non-first target voice unit to obtain a result voice, wherein the result voice can be corrected voice and can be provided for a user.
The embodiment of the invention carries out voice synthesis by taking the voice unit as a unit in the voice correction process, can save the operation cost of carrying out voice synthesis on the whole voice file, and can improve the efficiency of voice correction.
In the embodiment of the present invention, optionally, the corrected voice may be stored, so as to be listened or downloaded by a user.
In summary, the processing method of the speech unit according to the embodiment of the present invention obtains the current pronunciation information of the polyphone according to the polyphone in the speech recognition result and the context thereof. The polyphones and the context thereof can correspond to language units such as words, phrases, sentences or paragraphs, and the accuracy of the current pronunciation information can be improved because richer language information represented by the language units is adopted in the process of determining the current pronunciation information.
In addition, once the current pronunciation information is determined, the embodiment of the invention can display the current pronunciation information of the polyphones in the voice recognition result without being limited by conditions such as voice listening and the like. The embodiment of the invention can provide the current pronunciation information for the user under the condition of not listening to the voice so as to correct the pronunciation; therefore, the embodiment of the invention can save the time cost spent on listening, and further can improve the efficiency of voice correction.
Method embodiment three
Referring to fig. 3, a flowchart illustrating the steps of a third embodiment of the processing method for a speech unit of the present invention is shown, which may specifically include the following steps:
step 301, performing voice activity detection on a voice to obtain a plurality of voice units corresponding to the voice;
step 302, outputting the voice entry and the voice recognition result respectively corresponding to the plurality of voice units;
with respect to the first embodiment of the method shown in fig. 1, the method of this embodiment may further include:
Step 303, displaying the current emotion parameters of the language units in the voice recognition result;
Step 304, displaying an emotion input interface corresponding to the language unit according to the correction operation of the user on the current emotion parameter so that the user can input the corrected emotion parameter;
And step 305, correcting the voice corresponding to the target voice unit according to the corrected emotion parameters.
The embodiment of the invention provides the current emotion parameters of the language units in the voice recognition result for the user to carry out emotion correction so as to apply the emotion parameters required by the user to voice processing, thereby improving the accuracy of the voice processing and the satisfaction degree of the user on the voice processing result.
The language units of the embodiment of the invention can be words, phrases, sentences or the like. In other words, the embodiment of the present invention may show the current emotion parameters in units of linguistic units such as words, phrases, or sentences in the speech recognition result for the user to correct.
Emotion can represent a person's mental experience and feelings and is used to describe socially meaningful feelings, such as love of truth, appreciation of beauty, or grief and indignation at unrecognized talent. The embodiment of the invention can perform semantic analysis on the language unit to obtain the current emotion parameter. Optionally, an emotion classification model can be used to determine the emotion category to which the language unit belongs. The specific emotion categories can be determined by those skilled in the art according to the actual application requirements and are not described here.
The embodiment of the invention can identify the emotion parameters of sentences or words (for example, the emotion parameter of one sentence is anger, of another is dejection, of another is sobbing, and so on), and display the emotion parameters in the area around the corresponding sentence or word.
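For illustration only (not part of the original disclosure; the keyword lexicon is an assumption standing in for a trained emotion classification model), assigning a current emotion parameter to a sentence might be sketched as:

```python
# A toy keyword-based stand-in for the emotion classification model mentioned above;
# a real system would use a trained classifier, and this lexicon is illustrative only.
EMOTION_LEXICON = {
    "anger":     ["愤怒", "气愤", "岂有此理"],
    "dejection": ["低落", "唉", "无奈"],
    "sobbing":   ["哽咽", "哭", "泪"],
}

def current_emotion(sentence, default="neutral"):
    """Return the emotion category whose cue words appear most often in the sentence."""
    scores = {label: sum(sentence.count(word) for word in words)
              for label, words in EMOTION_LEXICON.items()}
    label, score = max(scores.items(), key=lambda item: item[1])
    return label if score > 0 else default

print(current_emotion("岂有此理，太气愤了！"))   # -> anger
```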
In an embodiment of the present invention, the displaying of the emotion input interface corresponding to the language unit may include: displaying the emotion options corresponding to the language units for the user to select, and determining the modified emotion parameters according to the emotion options selected by the user; alternatively, the emotion input box corresponding to the language unit may be displayed so that the user inputs the modified emotion parameters in the emotion input box.
In an optional embodiment of the present invention, the modified emotion parameters may be applied to speech synthesis to complete corresponding emotion migration; namely, according to the modified emotional parameters, the voice recognition result is subjected to voice synthesis. Therefore, the stiffness problem of the voice synthesis result can be avoided to a certain extent, the appearance of stable and stiffness sound similar to a robot can be reduced, and the naturalness and the emotion of the voice synthesis result can be improved.
In the embodiment of the invention, the modified emotion parameter corresponds to the second target voice unit, and the voice recognition result corresponding to the second target voice unit can be subjected to voice synthesis according to the modified emotion parameter; and the voice synthesis result corresponding to the second target voice unit is fused with the voice corresponding to the non-second target voice unit to obtain a result voice, wherein the result voice can be corrected voice and can be provided for a user.
The embodiment of the invention carries out voice synthesis by taking the voice unit as a unit in the voice correction process, can save the operation cost of carrying out voice synthesis on the whole voice file, and can improve the efficiency of voice correction.
It should be noted that the second method embodiment shown in fig. 2 and the third method embodiment shown in fig. 3 may be combined; that is, the embodiment of the present invention may provide correction of both the pronunciation information and the emotion parameter, and use the corrected pronunciation information and emotion parameter for speech synthesis to implement voice correction with the voice unit as the granularity. It can be understood that the embodiment of the present invention places no limitation on the correction order of the pronunciation information and the emotion parameter; the two corrections may be performed sequentially or simultaneously. Likewise, the embodiment of the present invention places no limitation on the order of speech synthesis according to the corrected pronunciation information and emotion parameters; the two may be performed sequentially or simultaneously.
It should be noted that, for simplicity of description, the method embodiments are described as a series of action combinations, but those skilled in the art should understand that the present invention is not limited by the described sequence of actions, because some steps may be performed in other orders or simultaneously according to the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by the present invention.
Device embodiment
Referring to fig. 4, a block diagram of a speech processing apparatus according to an embodiment of the present invention is shown, which may specifically include:
a detection module 401, configured to perform voice activity detection on a voice to obtain a plurality of voice units corresponding to the voice;
an output module 402, configured to output a voice entry and a voice recognition result corresponding to each of the multiple voice units;
a receiving module 403, configured to receive modification information of a user for a target speech unit;
and a correcting module 404, configured to correct the target speech unit according to the correction information.
Optionally, the detection module 401 may include:
the category determining module is used for determining the category corresponding to the voice segment in the voice by utilizing a classifier based on a long-term and short-term memory network; the above categories may include: a speech category, or a non-speech category.
Optionally, the category determining module may include:
the probability determination module is used for determining a first probability that the voice fragment belongs to the non-voice category by using a long-short term memory network-based classifier;
and a segmentation point determining module, configured to determine a segmentation point according to the voice frames included in the voice segment if the first probability exceeds a first probability threshold and the number of voice frames included in the voice segment exceeds a number threshold.
Optionally, the training data of the classifier may include: and voice data corresponding to the voice processing scene, wherein the voice data contains noise.
Optionally, the correction information may include: and (5) corrected pronunciation information.
Optionally, the apparatus may further include:
the first display module is used for displaying the current pronunciation information of the polyphone in the voice recognition result; the current pronunciation information is obtained according to the polyphones and the context thereof;
and the second display module is used for displaying the pronunciation input interface corresponding to the polyphone according to the correction operation of the user aiming at the current pronunciation information so that the user can input the corrected pronunciation information.
Optionally, the correction information may include: the modified emotional parameters;
the above apparatus may further include:
the third display module is used for displaying the current emotion parameters of the language units in the voice recognition result;
and the fourth display module is used for displaying the emotion input interface corresponding to the language unit according to the correction operation of the user on the current emotion parameter so that the user can input the corrected emotion parameter.
Optionally, the apparatus may further include:
and the fusion module is used for fusing the correction result corresponding to the target voice unit with the voice corresponding to the non-target voice unit to obtain the corrected voice.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 5 is a block diagram illustrating an apparatus 1300 for speech processing according to an example embodiment. For example, apparatus 1300 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and so forth.
Referring to fig. 5, apparatus 1300 may include one or more of the following components: a processing component 1302, a memory 1304, a power component 1306, a multimedia component 1308, an audio component 1310, an input/output (I/O) interface 1312, a sensor component 1314, and a communication component 1316.
The processing component 1302 generally controls overall operation of the device 1300, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing element 1302 may include one or more processors 1320 to execute instructions to perform all or part of the steps of the method described above. Further, the processing component 1302 can include one or more modules that facilitate interaction between the processing component 1302 and other components. For example, the processing component 1302 may include a multimedia module to facilitate interaction between the multimedia component 1308 and the processing component 1302.
The memory 1304 is configured to store various types of data to support operation at the device 1300. Examples of such data include instructions for any application or method operating on device 1300, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1304 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power supply component 1306 provides power to the various components of device 1300. Power components 1306 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for device 1300.
The multimedia component 1308 includes a screen between the device 1300 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1308 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the back-facing camera may receive external multimedia data when the device 1300 is in an operational mode, such as a capture mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 1310 is configured to output and/or input audio signals. For example, audio component 1310 includes a Microphone (MIC) configured to receive external audio signals when apparatus 1300 is in an operational mode, such as a call mode, a recording mode, and a voice data processing mode. The received audio signals may further be stored in the memory 1304 or transmitted via the communication component 1316. In some embodiments, the audio component 1310 also includes a speaker for outputting audio signals.
The I/O interface 1312 provides an interface between the processing component 1302 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 1314 includes one or more sensors for providing various aspects of state assessment for the device 1300. For example, the sensor assembly 1314 may detect an open/closed state of the device 1300 and the relative positioning of components, such as the display and keypad of the apparatus 1300; the sensor assembly 1314 may also detect a change in position of the apparatus 1300 or a component of the apparatus 1300, the presence or absence of user contact with the apparatus 1300, the orientation or acceleration/deceleration of the apparatus 1300, and a change in temperature of the apparatus 1300. The sensor assembly 1314 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 1314 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1314 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1316 is configured to facilitate wired or wireless communication between the apparatus 1300 and other devices. The apparatus 1300 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 1316 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1316 also includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 1300 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 1304 comprising instructions, executable by the processor 1320 of the apparatus 1300 to perform the method described above is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of a terminal, enable the terminal to perform a voice unit processing method, the method comprising: performing voice activity detection on voice to obtain a plurality of voice units corresponding to the voice; outputting voice entries and voice recognition results respectively corresponding to the plurality of voice units; receiving correction information from a user for a target voice unit; and correcting the target voice unit according to the correction information.
Fig. 6 is a schematic structural diagram of a server in an embodiment of the present invention. The server 1900, which may vary widely in configuration or performance, may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) that store applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a sequence of instructions operating on the server. Further, a central processor 1922 may be arranged to communicate with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
The embodiment of the invention discloses A1, a voice unit processing method, comprising the following steps:
performing voice activity detection on voice to obtain a plurality of voice units corresponding to the voice;
outputting voice entries and voice recognition results respectively corresponding to the plurality of voice units;
receiving correction information from a user for a target voice unit;
and correcting the target voice unit according to the correction information.
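By way of illustration only, the four steps above can be pictured as a small Python pipeline. The names used here (VoiceUnit, vad.detect, recognizer.transcribe, and so on) are assumptions made for the sketch and are not part of the disclosed embodiment; any voice activity detector and recognizer with similar interfaces could stand in.

from dataclasses import dataclass, field

@dataclass
class VoiceUnit:
    unit_id: int
    start: int                 # start sample index within the original audio
    end: int                   # end sample index within the original audio
    transcript: str = ""       # voice recognition result for this unit
    corrections: dict = field(default_factory=dict)

def process_speech(audio, vad, recognizer):
    """Steps 1-2: run voice activity detection, then build one entry per voice unit."""
    units = []
    for i, (start, end) in enumerate(vad.detect(audio)):   # hypothetical VAD interface
        units.append(VoiceUnit(i, start, end, recognizer.transcribe(audio[start:end])))
    return units

def correct_target_unit(units, unit_id, correction_info):
    """Steps 3-4: receive the user's correction information and apply it to the target unit."""
    target = next(u for u in units if u.unit_id == unit_id)
    target.corrections.update(correction_info)             # e.g. {"pronunciation": "chong2"}
    return target

In this sketch each voice unit carries its own entry (identifier and time span) together with its recognition result, which mirrors the per-unit output and per-unit correction described above.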
A2, the method of A1, the voice activity detection of speech comprising:
determining the category corresponding to a voice segment in the voice by using a classifier based on a long short-term memory (LSTM) network; the categories include: a speech category or a non-speech category.
A3, the method of A2, wherein determining the category corresponding to the voice includes:
determining a first probability that a voice segment belongs to the non-speech category by using the long short-term memory network-based classifier;
and if the first probability exceeds a first probability threshold and the number of the voice frames included in the voice segment exceeds a number threshold, determining a segmentation point according to the voice frames included in the voice segment.
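As a minimal sketch of the thresholding logic in A3, the following PyTorch fragment assumes a frame-level LSTM classifier; the network size, feature dimension, probability threshold, frame-count threshold, and the choice of averaging frame probabilities over the segment are all illustrative assumptions rather than the claimed implementation.

import torch
import torch.nn as nn

class VADClassifier(nn.Module):
    def __init__(self, n_features=40, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)            # index 0: speech, index 1: non-speech

    def forward(self, frames):                      # frames: (1, T, n_features)
        out, _ = self.lstm(frames)
        return torch.softmax(self.head(out), dim=-1)

def find_segmentation_point(frames, model, p_threshold=0.9, min_frames=30):
    """Return a frame index at which to split if the segment is confidently
    non-speech and long enough; otherwise return None."""
    probs = model(frames)[0]                        # (T, 2) per-frame class probabilities
    first_probability = probs[:, 1].mean().item()   # segment-level non-speech probability (assumed)
    if first_probability > p_threshold and frames.shape[1] > min_frames:
        return frames.shape[1] // 2                 # e.g. split at the middle frame of the segment
    return None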
A4, the method of A2, wherein the training data of the classifier includes: voice data corresponding to a voice processing scenario, wherein the voice data contains noise.
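A4 only requires that the training data contain noisy voice data from a voice processing scenario. One common way to prepare such data is to mix recorded noise into clean speech at a chosen signal-to-noise ratio; the sketch below is an assumed, generic augmentation step, not the method used in the embodiment.

import numpy as np

def mix_noise(clean, noise, snr_db):
    """Mix a noise clip into clean speech at a target signal-to-noise ratio (dB).
    Both inputs are assumed to be 1-D float arrays at the same sample rate."""
    noise = np.resize(noise, clean.shape)           # loop or trim the noise to match the speech
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise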
A5, the method of any of A1 to A4, wherein the correction information comprises: corrected pronunciation information.
A6, the method of A5, the method further comprising:
displaying the current pronunciation information of polyphones in the voice recognition result; the current pronunciation information is obtained according to the polyphones and the context thereof;
and displaying the pronunciation input interface corresponding to the polyphone according to the user's correction operation on the current pronunciation information, so that the user can input the corrected pronunciation information.
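A rough illustration of A6: the current pronunciation of a polyphone is derived from the character and its context, and the user may override it through the pronunciation input interface. The polyphone table, readings, and helper names below are hypothetical examples.

POLYPHONE_TABLE = {
    "重": {"default": "zhong4", "contexts": {"重复": "chong2", "重庆": "chong2"}},
    "行": {"default": "xing2", "contexts": {"银行": "hang2"}},
}

def current_pronunciation(char, context):
    """Derive the displayed (current) pronunciation from the polyphone and its context."""
    entry = POLYPHONE_TABLE.get(char)
    if entry is None:
        return None
    for word, reading in entry["contexts"].items():
        if word in context:
            return reading
    return entry["default"]

def apply_pronunciation_correction(corrections, char_index, corrected_reading):
    """Record the corrected pronunciation the user typed into the pronunciation input interface."""
    corrections.setdefault("pronunciation", {})[char_index] = corrected_reading
    return corrections

print(current_pronunciation("重", "重庆欢迎你"))    # -> "chong2"; the user could still override it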
A7, the method of any of A1 to A4, wherein the correction information comprises: corrected emotion parameters;
the method further comprises the following steps:
displaying the current emotion parameters of the linguistic units in the voice recognition result;
and displaying the emotion input interface corresponding to the language unit according to the user's correction operation on the current emotion parameters, so that the user can input the corrected emotion parameters.
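Similarly, A7 can be pictured as showing the current emotion parameters for a language unit and merging whatever the user enters in the emotion input interface. The parameter names and value ranges below are assumptions for illustration.

DEFAULT_EMOTION = {"emotion": "neutral", "intensity": 0.5, "speed": 1.0}

def show_current_emotion(unit_transcript, emotion=None):
    """Display the current emotion parameters of a language unit."""
    emotion = dict(emotion or DEFAULT_EMOTION)
    print(f"{unit_transcript}: {emotion}")
    return emotion

def correct_emotion(emotion, user_input):
    """Merge the values the user entered in the emotion input interface."""
    emotion.update(user_input)
    return emotion

params = show_current_emotion("今天天气不错")
params = correct_emotion(params, {"emotion": "happy", "intensity": 0.8})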
A8, the method of any one of A1 to A4, the method further comprising:
and fusing the correction result corresponding to the target voice unit with the voice corresponding to the non-target voice unit to obtain the corrected voice.
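For A8, the corrected audio of the target voice unit has to be spliced back between the unchanged non-target units. The following numpy sketch assumes 1-D float audio arrays and uses a simple crossfade at the leading boundary; the crossfade length and method are illustrative choices, not part of the disclosure.

import numpy as np

def fuse_corrected_unit(original, unit_start, unit_end, corrected_segment, crossfade=160):
    """Replace original[unit_start:unit_end] with corrected_segment, crossfading at the
    leading boundary to avoid an audible click."""
    head, tail = original[:unit_start], original[unit_end:]
    if len(head) >= crossfade and len(corrected_segment) >= crossfade:
        fade = np.linspace(0.0, 1.0, crossfade)
        corrected_segment = corrected_segment.copy()
        corrected_segment[:crossfade] = (
            head[-crossfade:] * (1.0 - fade) + corrected_segment[:crossfade] * fade
        )
        head = head[:-crossfade]
    return np.concatenate([head, corrected_segment, tail])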
The embodiment of the invention discloses B9, a speech processing apparatus, comprising:
the detection module is used for carrying out voice activity detection on voice to obtain a plurality of voice units corresponding to the voice;
the output module is used for outputting voice entries and voice recognition results respectively corresponding to the plurality of voice units;
the receiving module is used for receiving correction information from a user for the target voice unit;
and the correction module is used for correcting the target voice unit according to the correction information.
B10, the apparatus of B9, the detection module comprising:
the category determining module is used for determining the category corresponding to a voice segment in the voice by using a classifier based on a long short-term memory (LSTM) network; the categories include: a speech category or a non-speech category.
B11, the apparatus of B10, the category determination module comprising:
the probability determination module is used for determining a first probability that a voice segment belongs to the non-speech category by using the long short-term memory network-based classifier;
and the division point determining module is used for determining a division point according to the voice frames included by the voice segments if the first probability exceeds a first probability threshold and the number of the voice frames included by the voice segments exceeds a number threshold.
B12, the apparatus of B10, wherein the training data of the classifier comprises: voice data corresponding to a voice processing scenario, wherein the voice data contains noise.
B13, the apparatus according to any of B9 to B12, the correction information comprising: corrected pronunciation information.
B14, the apparatus of B13, the apparatus further comprising:
the first display module is used for displaying the current pronunciation information of the polyphone in the voice recognition result; the current pronunciation information is obtained according to the polyphones and the context thereof;
and the second display module is used for displaying the pronunciation input interface corresponding to the polyphone according to the user's correction operation on the current pronunciation information, so that the user can input the corrected pronunciation information.
B15, the apparatus according to any of B9 to B12, the correction information comprising: corrected emotion parameters;
the device further comprises:
the third display module is used for displaying the current emotion parameters of the language units in the voice recognition result;
and the fourth display module is used for displaying the emotion input interface corresponding to the language unit according to the correction operation of the user on the current emotion parameter so that the user can input the corrected emotion parameter.
B16, the apparatus according to any one of B9 to B12, further comprising:
and the fusion module is used for fusing the correction result corresponding to the target voice unit with the voice corresponding to the non-target voice unit to obtain the corrected voice.
The embodiment of the invention discloses C17, an apparatus for speech processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
performing voice activity detection on voice to obtain a plurality of voice units corresponding to the voice;
outputting voice entries and voice recognition results respectively corresponding to the plurality of voice units;
receiving correction information from a user for a target voice unit;
and correcting the target voice unit according to the correction information.
C18, the apparatus of C17, the voice activity detection of speech comprising:
determining the category corresponding to a voice segment in the voice by using a classifier based on a long short-term memory (LSTM) network; the categories include: a speech category or a non-speech category.
C19, the apparatus of C18, wherein determining the category corresponding to the voice includes:
determining a first probability that a voice segment belongs to the non-speech category by using the long short-term memory network-based classifier;
and if the first probability exceeds a first probability threshold and the number of the voice frames included in the voice segment exceeds a number threshold, determining a segmentation point according to the voice frames included in the voice segment.
C20, the apparatus of C18, wherein the training data of the classifier includes: voice data corresponding to a voice processing scenario, wherein the voice data contains noise.
C21, the apparatus according to any of C17 to C20, the correction information comprising: corrected pronunciation information.
C22, the apparatus of C21, wherein the apparatus is further configured to execute, by one or more processors, the one or more programs including instructions for:
displaying the current pronunciation information of polyphones in the voice recognition result; the current pronunciation information is obtained according to the polyphones and the context thereof;
and displaying the pronunciation input interface corresponding to the polyphone according to the user's correction operation on the current pronunciation information, so that the user can input the corrected pronunciation information.
C23, the apparatus according to any of C17 to C20, the correction information comprising: corrected emotion parameters;
the device is also configured to execute, by one or more processors, the one or more programs including instructions for:
displaying the current emotion parameters of the linguistic units in the voice recognition result;
and displaying the emotion input interface corresponding to the language unit according to the correction operation of the user on the current emotion parameter so that the user can input the corrected emotion parameter.
C24, the apparatus of any of C17 to C20, wherein the apparatus is further configured to execute, by one or more processors, the one or more programs including instructions for:
and fusing the correction result corresponding to the target voice unit with the voice corresponding to the non-target voice unit to obtain the corrected voice.
25. One or more machine-readable media having instructions stored thereon, which, when executed by one or more processors, cause an apparatus to perform a method as described in one or more of A1 to A8.
The foregoing has described in detail the voice processing method, apparatus, and medium provided by the present invention. Specific examples have been used herein to explain the principles and embodiments of the present invention, and the descriptions of the foregoing examples are only intended to help understand the method and its core ideas. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the ideas of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A method of speech processing, the method comprising:
performing voice activity detection on voice to obtain a plurality of voice units corresponding to the voice;
outputting voice entries and voice recognition results respectively corresponding to the plurality of voice units;
receiving correction information from a user for a target voice unit;
and correcting the target voice unit according to the correction information.
2. The method of claim 1, wherein the voice activity detection of the voice comprises:
determining the category corresponding to a voice segment in the voice by using a classifier based on a long short-term memory (LSTM) network; the categories include: a speech category or a non-speech category.
3. The method of claim 2, wherein the determining the category to which the speech corresponds comprises:
determining a first probability that a voice segment belongs to the non-speech category by using the long short-term memory network-based classifier;
and if the first probability exceeds a first probability threshold and the number of the voice frames included in the voice segment exceeds a number threshold, determining a segmentation point according to the voice frames included in the voice segment.
4. The method of claim 2, wherein the training data of the classifier comprises: voice data corresponding to a voice processing scenario, wherein the voice data contains noise.
5. The method according to any one of claims 1 to 4, wherein the correction information comprises: corrected pronunciation information.
6. The method of claim 5, further comprising:
displaying the current pronunciation information of polyphones in the voice recognition result; the current pronunciation information is obtained according to the polyphones and the context thereof;
and displaying the pronunciation input interface corresponding to the polyphone according to the user's correction operation on the current pronunciation information, so that the user can input the corrected pronunciation information.
7. The method according to any one of claims 1 to 4, wherein the correction information comprises: corrected emotion parameters;
the method further comprises the following steps:
displaying the current emotion parameters of the linguistic units in the voice recognition result;
and displaying the emotion input interface corresponding to the language unit according to the correction operation of the user on the current emotion parameter so that the user can input the corrected emotion parameter.
8. A speech processing apparatus, comprising:
the detection module is used for carrying out voice activity detection on voice to obtain a plurality of voice units corresponding to the voice;
the output module is used for outputting voice entries and voice recognition results respectively corresponding to the plurality of voice units;
the receiving module is used for receiving correction information from a user for the target voice unit; and
and the correction module is used for correcting the target voice unit according to the correction information.
9. An apparatus for speech processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
performing voice activity detection on voice to obtain a plurality of voice units corresponding to the voice;
outputting voice entries and voice recognition results respectively corresponding to the plurality of voice units;
receiving correction information from a user for a target voice unit;
and correcting the target voice unit according to the correction information.
10. One or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method of one or more of claims 1-7.
CN202010850493.3A 2020-08-21 Voice processing method, device and medium Active CN112151072B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010850493.3A CN112151072B (en) 2020-08-21 Voice processing method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010850493.3A CN112151072B (en) 2020-08-21 Voice processing method, device and medium

Publications (2)

Publication Number Publication Date
CN112151072A true CN112151072A (en) 2020-12-29
CN112151072B CN112151072B (en) 2024-07-02

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112786071A (en) * 2021-01-13 2021-05-11 国家电网有限公司客户服务中心 Data annotation method for voice segments of voice interaction scene
CN114267352A (en) * 2021-12-24 2022-04-01 北京信息科技大学 Voice information processing method, electronic equipment and computer storage medium

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101071339A (en) * 2007-06-18 2007-11-14 罗蒙明 Method for standarizing applied words during Chinese characterinput process
CN103000176A (en) * 2012-12-28 2013-03-27 安徽科大讯飞信息科技股份有限公司 Speech recognition method and system
CN103366742A (en) * 2012-03-31 2013-10-23 盛乐信息技术(上海)有限公司 Voice input method and system
CN103366731A (en) * 2012-03-31 2013-10-23 盛乐信息技术(上海)有限公司 Text to speech (TTS) method and system
EP2682931A1 (en) * 2012-07-06 2014-01-08 Samsung Electronics Co., Ltd Method and apparatus for recording and playing user voice in mobile terminal
CN105118502A (en) * 2015-07-14 2015-12-02 百度在线网络技术(北京)有限公司 End point detection method and system of voice identification system
CN105304080A (en) * 2015-09-22 2016-02-03 科大讯飞股份有限公司 Speech synthesis device and speech synthesis method
CN106409296A (en) * 2016-09-14 2017-02-15 安徽声讯信息技术有限公司 Voice rapid transcription and correction system based on multi-core processing technology
CN107346318A (en) * 2016-05-06 2017-11-14 腾讯科技(深圳)有限公司 Extract the method and device of voice content
CN107729313A (en) * 2017-09-25 2018-02-23 百度在线网络技术(北京)有限公司 The method of discrimination and device of multitone character pronunciation based on deep neural network
CN107945802A (en) * 2017-10-23 2018-04-20 北京云知声信息技术有限公司 Voice recognition result processing method and processing device
CN108346425A (en) * 2017-01-25 2018-07-31 北京搜狗科技发展有限公司 A kind of method and apparatus of voice activity detection, the method and apparatus of speech recognition
CN110349597A (en) * 2019-07-03 2019-10-18 山东师范大学 A kind of speech detection method and device
CN111128186A (en) * 2019-12-30 2020-05-08 云知声智能科技股份有限公司 Multi-phonetic-character phonetic transcription method and device
CN111145724A (en) * 2019-12-31 2020-05-12 出门问问信息科技有限公司 Polyphone marking method and device and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN107632980B (en) Voice translation method and device for voice translation
CN110634483B (en) Man-machine interaction method and device, electronic equipment and storage medium
US11514886B2 (en) Emotion classification information-based text-to-speech (TTS) method and apparatus
CN110210310B (en) Video processing method and device for video processing
CN111583944A (en) Sound changing method and device
CN107274903B (en) Text processing method and device for text processing
CN112037756A (en) Voice processing method, apparatus and medium
CN110992942B (en) Voice recognition method and device for voice recognition
CN113409764B (en) Speech synthesis method and device for speech synthesis
CN107909995B (en) Voice interaction method and device
CN108628819B (en) Processing method and device for processing
CN110990534B (en) Data processing method and device for data processing
CN113488022B (en) Speech synthesis method and device
US20240022772A1 (en) Video processing method and apparatus, medium, and program product
CN110930977B (en) Data processing method and device and electronic equipment
CN108346424B (en) Speech synthesis method and device, and device for speech synthesis
CN113409765B (en) Speech synthesis method and device for speech synthesis
CN112151072B (en) Voice processing method, device and medium
CN114067781A (en) Method, apparatus and medium for detecting speech recognition result
CN112151072A (en) Voice processing method, apparatus and medium
CN113889105A (en) Voice translation method and device for voice translation
CN113674731A (en) Speech synthesis processing method, apparatus and medium
CN108364631B (en) Speech synthesis method and device
CN112837668A (en) Voice processing method and device for processing voice
Ramli et al. Emolah: a Malay language spontaneous speech emotion recognition on iOS platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant