CN112151072B - Voice processing method, device and medium - Google Patents

Voice processing method, device and medium

Info

Publication number
CN112151072B
Authority
CN
China
Prior art keywords
voice
speech
user
units
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010850493.3A
Other languages
Chinese (zh)
Other versions
CN112151072A (en)
Inventor
叶一川
刘恺
周盼
曹赫
郎勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN202010850493.3A priority Critical patent/CN112151072B/en
Publication of CN112151072A publication Critical patent/CN112151072A/en
Application granted granted Critical
Publication of CN112151072B publication Critical patent/CN112151072B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/84 - Detection of presence or absence of voice signals for discriminating voice from noise
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 - Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The embodiment of the invention provides a voice processing method, a voice processing apparatus and a device for voice processing. The method specifically comprises the following steps: performing voice activity detection on voice to obtain a plurality of voice units corresponding to the voice; outputting voice entries and voice recognition results respectively corresponding to the voice units; receiving correction information of a user for a target voice unit; and correcting the target voice unit according to the correction information. The embodiment of the invention can improve the efficiency of pronunciation correction.

Description

Voice processing method, device and medium
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a method and apparatus for processing speech, and a machine readable medium.
Background
With the development of communication technology, audio devices such as Bluetooth headsets and smart speakers have grown rapidly in recent years, and so has the consumption of audio content, including radio stations, network podcasts, audiobooks, knowledge programs, and the like. Because audio can be acquired by listening anytime and anywhere, more users choose it, and immersive news, learning, entertainment, music and other listening experiences can easily be obtained while driving, commuting, falling asleep and in other scenarios.
At present, text specified by a user can be converted into speech and output to the user; or the speech input by the user can be converted into speech conforming to a specific timbre and output to the user. In order to improve the quality of the speech, errors in the speech are currently located by listening to the speech, so that the errors can be corrected.
The inventors have found, in practicing embodiments of the present invention, that speech is typically stored in a single speech file, and the user typically needs to listen to the entire speech file to determine errors in the speech. Listening to the entire voice file generally takes considerable time and cost, which in turn makes voice correction inefficient.
Disclosure of Invention
In view of the foregoing, embodiments of the present invention have been made to provide a speech processing method, a speech processing apparatus, and a device for speech processing that overcome or at least partially solve the foregoing problems, and the embodiments of the present invention can improve the efficiency of pronunciation correction.
In order to solve the above problems, the present invention discloses a voice processing method, comprising:
Performing voice activity detection on voice to obtain a plurality of voice units corresponding to the voice;
outputting voice entries and voice recognition results respectively corresponding to the voice units;
receiving correction information of a user for a target voice unit;
And correcting the target voice unit according to the correction information.
In another aspect, an embodiment of the present invention discloses a speech processing apparatus, including:
The detection module is used for detecting voice activity of voice to obtain a plurality of voice units corresponding to the voice;
The output module is used for outputting voice entries and voice recognition results respectively corresponding to the voice units;
the receiving module is used for receiving correction information of a user for a target voice unit; and
And the correction module is used for correcting the target voice unit according to the correction information.
In yet another aspect, an embodiment of the present invention discloses an apparatus for speech processing, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
Performing voice activity detection on voice to obtain a plurality of voice units corresponding to the voice;
outputting voice entries and voice recognition results respectively corresponding to the voice units;
receiving correction information of a user for a target voice unit;
And correcting the target voice unit according to the correction information.
One or more machine-readable media are also disclosed in embodiments of the invention, having stored thereon instructions that, when executed by one or more processors, cause an apparatus to perform the aforementioned method.
The embodiment of the invention has the following advantages:
The embodiment of the invention outputs the corresponding voice entry and voice recognition result respectively for a plurality of voice units in the voice, so that a user can perform voice correction in units of voice units and generate corresponding correction information. Compared with a process in which the user listens to the entire voice file, the comparison output of the voice units and their voice recognition results in the embodiment of the invention can help the user determine which voice unit to listen to, and the voice units can be screened to a certain extent, so that the operation cost of listening to the voice can be saved and the efficiency of voice correction can be improved.
In addition, under the condition that the error of the voice unit is located, the user can correct the target voice unit with the error to obtain corresponding correction information. Because the embodiment of the invention corrects the voice unit, the operation cost for correcting the whole voice file can be saved, and the efficiency of voice correction can be improved.
Drawings
FIG. 1 is a flowchart illustrating steps of a first embodiment of a speech processing method according to the present invention;
FIG. 2 is a flowchart illustrating steps of a second embodiment of a speech processing method according to the present invention;
FIG. 3 is a flowchart illustrating steps of a third embodiment of a speech processing method of the present invention;
FIG. 4 is a block diagram of a speech processing apparatus of the present invention;
fig. 5 is a block diagram of an apparatus 1300 for speech processing according to the present invention; and
Fig. 6 is a schematic structural diagram of a server according to the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
The speech processing scenarios of the embodiment of the present invention may include: a speech synthesis scenario, a voice-changing scenario, etc. In the speech synthesis scenario, text specified by the user can be converted into speech conforming to a specified timbre; the speech synthesis scenario can be applied to fields such as news broadcasting, audiobook reading, teaching, medical treatment, customer service, legal scenarios, and the like. In the voice-changing scenario, a first voice input by the user can be converted into a second voice conforming to a specified timbre, and any of the characteristics of the first voice such as speaking content, speech rate, pauses and emotion can be retained in the second voice. It will be appreciated that embodiments of the present invention are not limited to a particular speech processing scenario.
Aiming at the technical problem of low efficiency of voice correction in the conventional technology, the embodiment of the invention provides a voice processing scheme, which specifically comprises the following steps: performing voice activity detection on voice to obtain a plurality of voice units corresponding to the voice; outputting voice entries and voice recognition results respectively corresponding to the voice units; receiving correction information of a user for a target voice unit; and correcting the target voice unit according to the correction information.
VAD (Voice Activity Detection) can accurately distinguish valid speech from invalid speech (such as noise, laughter, crying, music, background voices, etc.) under stationary or non-stationary noise, and segment the speech according to the detection results. The segmentation realizes sentence breaking of the speech, and each segmented voice unit can be treated as an independent sentence.
The embodiment of the invention outputs the corresponding voice entry and voice recognition result respectively for a plurality of voice units in the voice, so that a user can perform voice correction in units of voice units and generate corresponding correction information. Compared with a process in which the user listens to the entire voice file, the comparison output of the voice units and their voice recognition results in the embodiment of the invention can help the user determine which voice unit to listen to, and the voice units can be screened to a certain extent, so that the operation cost of listening to the voice can be saved and the efficiency of voice correction can be improved.
For example, if N (N may be a natural number greater than 1) speech units are included in speech, the embodiment of the present invention may display N speech units in the following manner:
Speech unit 1 Speech entry 1 Speech recognition result 1
Speech unit 2 Speech entry 2 Speech recognition result 2
……
Speech unit N Speech entry N Speech recognition result N
The displayed voice recognition result can help the user to determine whether to listen to the corresponding voice unit, so that the voice unit can be screened, and the user can be helped to quickly locate errors of the voice unit.
The error types of the speech units may include, but are not limited to: the speech recognition result does not match the source text before speech synthesis, the pronunciation information is wrong, the emotion parameters are wrong, and the like. The mismatch between the speech recognition result and the source text before speech synthesis specifically includes: multi-read errors (e.g., repeatedly reading a word), missed-read errors, misread errors (e.g., reading one word as another), and the like. It will be appreciated that embodiments of the present invention are not limited to a particular type of error. In the speech synthesis scenario, the user may upload the source text to obtain the corresponding speech, and the source text may serve as a basis for speech synthesis.
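As an illustration of how such mismatches between the source text and a unit's recognition result could be located automatically, the following is a minimal sketch; it is not part of the claimed method, the character-level alignment is an assumption suited to Chinese text, and the function name is made up for illustration.

```python
import difflib

def find_mismatches(source_text: str, recognition_result: str):
    """Align the pre-synthesis source text with the recognition result of one
    speech unit and report multi-read (inserted), missed-read (deleted) and
    misread (substituted) spans."""
    matcher = difflib.SequenceMatcher(a=source_text, b=recognition_result)
    mismatches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":       # present only in the recognition result -> multi-read
            mismatches.append(("multi-read", recognition_result[j1:j2]))
        elif tag == "delete":     # present only in the source text -> missed-read
            mismatches.append(("missed-read", source_text[i1:i2]))
        elif tag == "replace":    # differing spans -> possible misread
            mismatches.append(("misread", source_text[i1:i2], recognition_result[j1:j2]))
    return mismatches

# Example: the recognition result repeats one character and drops another.
print(find_mismatches("今天天气很好", "今天天天气很"))
```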
For example, the speech recognition result i includes a polyphone, and the user considers that the pronunciation information of the polyphone may be wrong, so the user can listen to the corresponding speech unit. As another example, the speech recognition result j includes an expression of emotion, and the user considers that there may be a problem with the expression of that emotion, so the user can listen to the corresponding speech unit.
Under the condition that the error of the voice unit is located, the user can correct the target voice unit with the error to obtain corresponding correction information. Because the embodiment of the invention corrects the voice unit, the operation cost for correcting the whole voice file can be saved, and the efficiency of voice correction can be improved.
The voice processing method provided by the embodiment of the invention can be applied to application environments corresponding to the client and the server, wherein the client and the server are positioned in a wired or wireless network, and the client and the server perform data interaction through the wired or wireless network.
Alternatively, the client may run on a terminal, which specifically includes, but is not limited to: smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, car computers, desktop computers, set top boxes, smart televisions, wearable devices, and the like.
The client may correspond to a website, or APP. For example, the client may correspond to an application such as a speech processing APP.
Method embodiment one
Referring to fig. 1, a flowchart illustrating steps of a first embodiment of a voice processing method according to the present invention is shown, which may specifically include the following steps:
Step 101, performing voice activity detection on voices to obtain a plurality of voice units corresponding to the voices;
Step 102, outputting voice entries and voice recognition results respectively corresponding to the plurality of voice units;
Step 103, receiving correction information of a user for a target voice unit;
Step 104, correcting the target voice unit according to the correction information.
The first embodiment of the method shown in fig. 1 may be executed by a client and/or a server, and it should be understood that the embodiment of the present invention is not limited to the specific execution body of the embodiment of the method.
In step 101, the speech may represent the speech to be modified in the speech processing scenario.
The VAD may separate valid speech from invalid speech to make subsequent speech processing more efficient. If the VAD mistakenly cuts out valid speech, part of the speech will be lost; if the VAD passes invalid audio, such as noise, into the subsequent speech processing system, the accuracy of the speech processing will be affected.
In an alternative embodiment of the present invention, a method of voice activity detection may comprise: a detection method based on voice features. The voice features specifically include: energy features, periodicity features, etc. Alternatively, the voice activity detection may be performed using an energy double-threshold or energy four-threshold method, where the thresholds on voice frame energy are typically set empirically; such methods are simple and fast, but their detection accuracy is relatively low.
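A minimal sketch of the energy double-threshold idea follows; the frame length and the two thresholds are illustrative values chosen for this sketch, not values prescribed by the embodiment, and in practice they would be tuned empirically as noted above.

```python
import numpy as np

def energy_double_threshold_vad(samples, frame_len=400, low=0.01, high=0.05):
    """Toy double-threshold detection: a segment starts when frame energy rises
    above the high threshold and ends when it falls back below the low threshold."""
    samples = np.asarray(samples, dtype=np.float64)
    n_frames = len(samples) // frame_len
    energies = np.array([
        np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2)
        for i in range(n_frames)
    ])
    segments, start = [], None
    for i, e in enumerate(energies):
        if start is None and e > high:       # rising edge: speech likely begins
            start = i
        elif start is not None and e < low:  # falling edge: speech likely ends
            segments.append((start * frame_len, i * frame_len))
            start = None
    if start is not None:
        segments.append((start * frame_len, n_frames * frame_len))
    return segments  # list of (start_sample, end_sample) pairs
```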
In another alternative embodiment of the present invention, a method of voice activity detection may comprise: a statistical model-based method, or a machine learning-based detection method.
The machine learning-based detection method converts voice activity detection into a binary classification problem, and the corresponding classes specifically include: a speech class and a non-speech class. The classifier can be trained on training data to learn the different characteristics of the speech class and the non-speech class in the training data, so that the classifier has the ability to discriminate between the speech class and the non-speech class.
The classifier may comprise a mathematical model. A mathematical model is a scientific or engineering model constructed using mathematical logic and mathematical language; it is a mathematical structure that expresses, in a generalized or approximate way, the characteristics or quantitative dependencies of a certain object system, the structure being a relational structure described by means of mathematical symbols. The mathematical model may be one or a set of algebraic, differential, integral or statistical equations and combinations thereof, by which the interrelationships or causal relationships between the variables of the system are described quantitatively or qualitatively. In addition to models described by equations, there are models described by other mathematical tools, such as algebra, geometry, topology, mathematical logic, etc. The mathematical model describes the behavior and characteristics of the system rather than the actual structure of the system. The mathematical model may be trained by machine learning methods, deep learning methods, and the like. The machine learning methods may include: linear regression, decision trees, random forests, etc.; the deep learning methods may include: Convolutional Neural Networks (CNN), Long Short-Term Memory networks (LSTM), Gated Recurrent Units (GRU), etc.
In an alternative embodiment of the present invention, a classifier based on LSTM (Long Short-Term Memory network) may be used to determine the class corresponding to the speech segment in the speech; the categories specifically include: a phonetic class, or a non-phonetic class.
The embodiment of the invention can intercept the voice fragments from the voice by using the classifier. If the voice segment corresponds to the voice category, the voice segment can be a voice unit; if the speech segment corresponds to a non-speech category, the speech segment may be a non-speech unit.
The embodiment of the invention can determine the category corresponding to the voice fragment according to the probability that the voice fragment belongs to a certain category. For example, if the first probability that the speech segment belongs to the non-speech class exceeds the first probability threshold, the speech segment corresponds to the non-speech class. For another example, if the second probability that the speech segment belongs to the speech class exceeds the second probability threshold, the speech segment corresponds to the speech class. The first probability threshold and the second probability threshold may be values between 0 and 1, which may be determined by those skilled in the art according to practical application requirements, for example, the first probability threshold and the second probability threshold may be values of 0.8, 0.9, etc.
Alternatively, if there is a non-speech unit between speech unit i and speech unit (i+1), the starting position of the non-speech unit may correspond to the ending position of speech unit i, and the ending position of the non-speech unit may correspond to the starting position of speech unit (i+1).
LSTM has a long-term memory function and can adapt to long sequences, and thus can handle longer speech. In addition, LSTM adopts a gating mechanism, which can alleviate problems such as gradient explosion and gradient vanishing to a certain extent, and thus can improve the classification accuracy.
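The following is a minimal sketch, assuming PyTorch, of such an LSTM-based frame classifier; the feature dimension, hidden size and probability threshold are illustrative assumptions, and a real classifier would be trained (for example with the cross-entropy loss mentioned later) on the labeled speech and non-speech data described below.

```python
import torch
import torch.nn as nn

class LstmVadClassifier(nn.Module):
    """Frame-level speech / non-speech classifier (illustrative architecture)."""
    def __init__(self, feat_dim=40, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 2)    # two classes: non-speech (0), speech (1)

    def forward(self, frames):            # frames: (batch, time, feat_dim)
        out, _ = self.lstm(frames)
        return self.fc(out)               # per-frame logits, shape (batch, time, 2)

model = LstmVadClassifier()
feats = torch.randn(1, 100, 40)                # features for 100 frames
probs = torch.softmax(model(feats), dim=-1)    # per-frame class probabilities
non_speech_prob = probs[0, :, 0]               # the "first probability" in the text
is_non_speech = non_speech_prob > 0.9          # first probability threshold (assumed 0.9)
```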
Optionally, the training data of the classifier specifically includes: and voice data corresponding to a voice processing scene, such as voice data corresponding to a voice synthesis scene, voice data corresponding to a sound changing scene, and the like. The embodiment of the invention can label the voice fragments in the voice data to obtain the labeled voice units and the labeled non-voice units. Optionally, the embodiment of the invention can perform framing processing on the voice data according to the voice recognition requirement.
Optionally, noise can be included in the voice data, so that the classifier can learn to distinguish speech from non-speech in the case where noise and speech are superimposed, and therefore the classifier has better robustness in a noisy environment.
In the embodiment of the present invention, optionally, a loss function (loss function) of the classifier may include: cross entropy, square error loss, etc. The loss function can be used to measure the degree of inconsistency between the predicted value f (x) and the true value Y of the classifier, and the smaller the loss function, the better the robustness of the classifier.
Optionally, the determining the category corresponding to the voice specifically includes: determining a first probability that the speech segment belongs to the non-speech class by using a classifier based on a long short-term memory network; and if the first probability exceeds a first probability threshold and the number of voice frames included in the speech segment exceeds a number threshold, determining a segmentation point according to the voice frames included in the speech segment.
The embodiment of the invention can combine the first probability and the number of the voice frames included in the voice fragments to determine the category corresponding to the voice fragments. Assuming that the number threshold is P, if the first probability exceeds the first probability threshold, the position corresponding to the P-th speech frame in the speech segment may be used as a segmentation point, that is, the speech segment corresponding to the first P speech frames may be used as a non-speech unit. The above-mentioned combination of the number of the voice frames determines the category corresponding to the voice segment, so that the situation that the voice with slow speaking rate is misjudged as invalid voice can be reduced to a certain extent, and the accuracy of voice activity detection can be improved.
It should be noted that, the classifier may be used to continue to detect the consecutive voice frames after the P-th voice frame. Consecutive speech frames following the P-th speech frame may be speech units or non-speech units. The number threshold may be determined by one skilled in the art according to the actual application requirements, for example, the number threshold is a value such as 30.
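Continuing the sketch above, the rule of combining the first probability with the number threshold P could be realised as follows; treating the probability per frame and using P = 30 are assumptions made for illustration rather than values prescribed by the embodiment.

```python
def find_segmentation_point(non_speech_probs, prob_threshold=0.9, count_threshold=30):
    """Place a segmentation point only after `count_threshold` consecutive frames
    whose non-speech probability exceeds `prob_threshold`; otherwise return None,
    so that slow speech with short pauses is not misjudged as invalid speech."""
    run = 0
    for i, p in enumerate(non_speech_probs):
        if p > prob_threshold:
            run += 1
            if run == count_threshold:
                return i + 1      # position just after the P-th non-speech frame
        else:
            run = 0
    return None
```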
In the embodiment of the invention, the classifier can perform feature extraction on training data or the voice to be corrected, and determine the category corresponding to the voice fragment in the voice according to the extracted voice feature.
The speech features may include, but are not limited to, prosodic features, tonal features, and spectral features.
The prosodic features, also called supra-segmental features, refer to the variations in pitch, duration and intensity of speech apart from the voice quality features. In this embodiment the prosodic features include, but are not limited to, pitch frequency, voicing duration, voicing amplitude, and voicing speed. In the present invention the voice quality features include, but are not limited to, formants, band energy distribution, harmonic signal-to-noise ratio, and short-time energy jitter.
The spectral features, also called vibration spectrum features, refer to the pattern formed by decomposing a complex oscillation into harmonic oscillations with different amplitudes and different frequencies and arranging the amplitudes of these harmonic oscillations by frequency. The spectral features can be fused with the prosodic features and voice quality features to enhance the anti-noise effect of the feature parameters. In the embodiment of the present invention, the spectral features adopt MFCCs (Mel-Frequency Cepstral Coefficients), which can reflect the auditory characteristics of the human ear.
Alternatively, statistics of the above voice features may also be used as the voice features, where the statistics may include, but are not limited to: mean, variance, minimum, maximum, range, slope, etc.
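A brief sketch of extracting MFCC spectral features for a segment and summarising them with the statistics listed above; it assumes the third-party librosa library, and the sampling rate and number of coefficients are illustrative choices rather than values specified by the embodiment.

```python
import numpy as np
import librosa

def segment_feature_vector(path):
    """Load one speech segment, compute MFCCs, and summarise each coefficient with
    mean, variance, minimum, maximum and range to obtain a fixed-length vector."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)
    return np.concatenate([
        mfcc.mean(axis=1),
        mfcc.var(axis=1),
        mfcc.min(axis=1),
        mfcc.max(axis=1),
        mfcc.max(axis=1) - mfcc.min(axis=1),             # range
    ])
```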
In step 102, the voice entry may be used to listen to the voice corresponding to the voice unit, and the voice entry may be in the form of a voice playing control.
The speech recognition result may be text corresponding to speech. The voice corresponding to the voice unit can be subjected to voice recognition so as to obtain a voice recognition result. Or in the speech synthesis scene, the pre-conversion text corresponding to the speech unit can be used as a speech recognition result.
The embodiment of the invention can display the voice entries and the voice recognition results corresponding to the voice units respectively in a preset display mode. For example, a corresponding display area may be set for each voice unit, in which the voice entry and the voice recognition result are displayed for comparison. The user of the embodiment of the invention may be a user who obtains the voice service, such as a foreground user; or the user of the embodiment of the invention may be a user who provides the voice service, such as a background technician.
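A minimal sketch of the per-unit comparison display described above; the field names and the console rendering are purely illustrative, since the embodiment does not prescribe a concrete data layout.

```python
from dataclasses import dataclass

@dataclass
class SpeechUnitRow:
    """One row of the comparison display: a playable voice entry plus the
    speech recognition result shown beside it."""
    index: int
    audio_path: str          # audio backing the voice entry / play control
    recognition_text: str

def build_display(unit_audio_paths, recognition_results):
    return [
        SpeechUnitRow(i + 1, path, text)
        for i, (path, text) in enumerate(zip(unit_audio_paths, recognition_results))
    ]

for row in build_display(["unit_1.wav", "unit_2.wav"], ["今天天气很好", "适合出门散步"]):
    print(f"speech unit {row.index} | play: {row.audio_path} | {row.recognition_text}")
```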
In step 103, the error types of the speech units may include, but are not limited to: the speech recognition result does not match the source text before speech synthesis, the pronunciation information is wrong, the emotion parameters are wrong, and the like. The mismatch between the speech recognition result and the source text before speech synthesis specifically includes: multi-read errors (e.g., repeatedly reading a word), missed-read errors, misread errors (e.g., reading one word as another), and the like. It will be appreciated that embodiments of the present invention are not limited to a particular type of error.
The correction information may include: error information, such as an extra X word was read, a Y word was missed, the emotion is inappropriate, or emotion is lacking.
Alternatively, the correction information may further include: a modification suggestion for the error information, for example, that a certain word should be pronounced in the fourth tone, or that the emotion should be grief and indignation, etc.
Or the correction information may further include: the user voice can be the pronunciation which is considered to be accurate by the user, and the user voice can be part or all of the pronunciation in the target voice unit. For example, the user's voice may be voice with emotion, or the user's voice may be voice with accurate pronunciation.
In this case, the user speech may be analyzed to obtain corresponding correction information, for example, corresponding corrected emotion parameters may be extracted from the user speech, or accurate speech information may be extracted from the user speech.
Optionally, the correction information may include: corrected pronunciation information; in the case that the target speech unit has a pronunciation error, the speech can be corrected according to the corrected pronunciation information. For example, if a word in the target speech unit is misread, corrected pronunciation information (such as the correct pinyin and tone) may be provided for that word.
In step 104, the target speech unit may be modified according to the modification information. Optionally, in a speech synthesis scenario, speech synthesis may be performed on the source text corresponding to the target speech unit. Speech synthesis is a technology for generating artificial speech by mechanical or electronic means. TTS (Text To Speech) is a speech synthesis technology that converts text information generated by a computer itself or input from the outside into audible, fluent spoken output.
Optionally, the speech synthesis result corresponding to the target speech unit may be fused with the speech corresponding to the non-target speech unit to obtain a result speech, where the result speech may be corrected speech and may be provided to the user. Non-target speech units may characterize speech units that do not involve modification. It can be understood that, in the case that the target speech units are plural, the speech synthesis results corresponding to the plural target speech units and the speech corresponding to the non-target speech units may be respectively fused according to the positions or the sequences of the target speech units in the speech.
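A minimal sketch of this fusion step; it assumes each unit's waveform is available as a NumPy array in its original order and that corrected target units have been re-synthesised separately, which is one possible data layout rather than the one prescribed by the embodiment.

```python
import numpy as np

def fuse_result_speech(unit_waveforms, corrections):
    """Rebuild the result speech: each target unit's waveform is replaced by its
    re-synthesised version and all units are concatenated in their original order.
    `corrections` maps a unit index to the corrected waveform."""
    fused = [corrections.get(i, wav) for i, wav in enumerate(unit_waveforms)]
    return np.concatenate(fused)

# Example: three units, of which unit 1 is the target unit that was corrected.
units = [np.zeros(16000), np.ones(16000), np.zeros(8000)]
result_speech = fuse_result_speech(units, {1: 0.5 * np.ones(16000)})
```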
In summary, the voice processing method of the embodiment of the present invention outputs the corresponding voice entry and voice recognition result respectively for a plurality of voice units in the voice, so that the user can perform voice correction in units of voice units. Compared with a process in which the user listens to the entire voice file, the comparison output of the voice units and their voice recognition results in the embodiment of the invention can help the user determine which voice unit to listen to, and the voice units can be screened to a certain extent, so that the operation cost of listening to the voice can be saved and the efficiency of voice correction can be improved.
Under the condition that the error of the voice unit is located, the user can correct the target voice unit with the error to obtain corresponding correction information. Because the embodiment of the invention corrects the voice unit, the operation cost for correcting the whole voice file can be saved, and the efficiency of voice correction can be improved.
Method embodiment II
Referring to fig. 2, a flowchart illustrating steps of a second embodiment of a voice processing method according to the present invention is shown, which may specifically include the following steps:
step 201, performing voice activity detection on voice to obtain a plurality of voice units corresponding to the voice;
Step 202, outputting voice entries and voice recognition results respectively corresponding to the plurality of voice units;
With respect to the first embodiment of the method shown in fig. 1, the method of this embodiment may further include:
Step 203, displaying the current pronunciation information of the polyphones in the voice recognition result; the current pronunciation information is obtained according to the polyphones and the context thereof;
step 204, displaying the pronunciation input interface corresponding to the polyphones according to the correction operation of the user on the current pronunciation information, so that the user can input the corrected pronunciation information;
step 205, correcting the voice corresponding to the target voice unit according to the corrected pronunciation information.
In step 203, the pronunciation information characterizes the reading of the polyphones. Taking a Chinese character as an example, the pronunciation information may include: pinyin and tone.
A polyphone is a character that has two or more items of pronunciation information; different pronunciations differ in meaning, in usage, and often in part of speech. The pronunciation information serves to distinguish the part of speech and the meaning of a word; since the pronunciation differs under different usage conditions, the pronunciation information also serves to distinguish usage.
According to the embodiment of the invention, the current pronunciation information of the polyphones is obtained according to the polyphones and the context thereof in the text. The polyphones and the contexts thereof can correspond to language units such as words, phrases, sentences, paragraphs and the like, and the accuracy of the current pronunciation information can be improved because the more abundant language information characterized by the more language units is adopted in the process of determining the current pronunciation information.
In addition, once the current pronunciation information is determined, the embodiment of the invention can display the current pronunciation information of the polyphones in the text, and can not be limited by conditions such as listening of voice. The embodiment of the invention can provide the current pronunciation information for the user under the condition of not listening to the voice so as to carry out pronunciation correction for the user; therefore, the embodiment of the invention can save the time cost spent on listening to the voice, and further can improve the efficiency of voice correction.
In the embodiment of the invention, optionally, the voice recognition result can be displayed in a text region, and the current pronunciation information can be displayed in a surrounding region of the polyphone in the voice recognition result. For example, the current pronunciation information may be presented in the region above the polyphone. For example, the text includes "为了" (in order to), where "为" is a polyphone, so the current pronunciation information "wei4" can be displayed above "为", where "4" represents the fourth tone.
The context of the embodiments of the present invention may include: the preceding text, and/or the following text. Optionally, the preceding text is typically the part before the polyphone, and the following text is typically the part after the polyphone.
The polyphones and the contexts thereof can correspond to language units such as words, phrases, sentences, paragraphs and the like, and the accuracy of the current pronunciation information can be improved because the more abundant language information characterized by the more language units is adopted in the process of determining the current pronunciation information.
In an alternative embodiment of the present invention, the method may further include: and determining current pronunciation information of the polyphones in the text according to the polyphones and the context thereof and the labeling corpus containing the polyphones. The labeling corpus can represent the corpus for labeling the pronunciation information of the polyphones. The labeling corpus may correspond to language units such as words, or phrases, or sentences, or paragraphs.
In the embodiment of the invention, optionally, the polyphones in the text can be detected according to the polyphone set. For example, words in the text are matched against a set of polyphones to obtain polyphones in the text that hit the set of polyphones.
According to one embodiment, the labeling corpus may be located in a dictionary, and current pronunciation information of the polyphones in the text may be determined based on a dictionary matching manner.
According to another embodiment, the polyphones and the context thereof can be matched with the labeling corpus, and current pronunciation information of the polyphones in the text can be obtained according to the pronunciation information of the polyphones in the successfully matched target labeling corpus.
According to yet another embodiment, the data model may be trained based on the labeling corpus to obtain a data analyzer. The data analyzer may characterize the mapping between input data (polyphones and their context) and output data (polyphones current pronunciation information).
The labeling corpus can represent language environment, and the data analyzer can obtain the current pronunciation information rule of the polyphones under the specific language environment based on learning. Therefore, in the case of using the data analyzer, the matching of the polyphones and the contexts thereof with the labeling corpus in terms of the words may not be required, but the current pronunciation information of the polyphones may be determined according to the matching of the language environment between the polyphones and the contexts thereof with the labeling corpus.
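As one possible realisation of the dictionary / annotated-corpus matching described above, the third-party pypinyin library resolves many polyphone readings from the surrounding phrase; the sketch below relies on that library and is not the embodiment's own data analyzer. It assumes the input text consists of Chinese characters only.

```python
from pypinyin import pinyin, Style

def current_pronunciations(text, polyphone_set):
    """Return {polyphone: pinyin-with-tone} for every polyphone occurring in the
    text; pypinyin picks the reading from the phrase context where it can."""
    readings = pinyin(text, style=Style.TONE3)   # one reading per character
    return {
        ch: readings[i][0]
        for i, ch in enumerate(text)
        if ch in polyphone_set
    }

# "为" is a polyphone; inside "为了" its reading should resolve to the fourth tone.
print(current_pronunciations("为了学习", {"为"}))
```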
In step 204, a correction operation may be used to trigger correction of the current pronunciation information. The correction operation may be a voice operation or a touch operation or a mouse operation.
For example, the voice operation may be "correct pronunciation of the X-word", and then the pronunciation input interface corresponding to the X-word may be displayed in response to the voice operation. An "X-word" may represent a polyphone.
For another example, the touch operation may be a click operation for the X-word, and then the pronunciation input interface corresponding to the X-word may be displayed in response to the click operation.
For another example, the mouse operation may be a selection operation for the X word, and the pronunciation input interface corresponding to the X word may then be displayed in response to the selection operation.
The embodiment of the invention displays the pronunciation input interface corresponding to the polyphones, and can be used for a user to input corrected pronunciation information.
According to an embodiment, the foregoing displaying the pronunciation input interface corresponding to the polyphone may specifically include: displaying pronunciation options corresponding to the polyphone for selection by the user; for example, for the polyphone "为", pronunciation options such as "wei4" and "wei2" are provided.
According to another embodiment, the foregoing displaying the pronunciation input interface corresponding to the polyphone may specifically include: displaying a pronunciation input box corresponding to the polyphone for user input. The user can input the corresponding pronunciation information, such as "wei2", in the pronunciation input box.
In step 205, the corrected pronunciation information is used for correcting the target speech unit, so that the accuracy of the speech can be improved.
According to an embodiment, the corrected pronunciation information corresponds to the first target voice unit, and then voice synthesis can be performed on the voice recognition result corresponding to the first target voice unit according to the corrected pronunciation information; and fusing the voice synthesis result corresponding to the first target voice unit with the voice corresponding to the non-first target voice unit to obtain a result voice, wherein the result voice can be corrected voice and can be provided for a user.
According to the embodiment of the invention, the voice unit is used as a unit for voice synthesis in the voice correction process, so that the operation cost for performing voice synthesis on the whole voice file can be saved, and the efficiency of voice correction can be improved.
In the embodiment of the invention, optionally, the corrected voice can be saved for the user to listen to or download.
In summary, according to the voice processing method of the embodiment of the invention, the current pronunciation information of the polyphone is obtained according to the polyphone and its context in the voice recognition result. The polyphone and its context can correspond to language units such as words, phrases, sentences and paragraphs; since richer language information characterized by more language units is adopted in the process of determining the current pronunciation information, the accuracy of the current pronunciation information can be improved.
In addition, once the current pronunciation information is determined, the embodiment of the invention can display the current pronunciation information of the polyphones in the voice recognition result, and can not be limited by conditions such as voice listening and the like. The embodiment of the invention can provide the current pronunciation information for the user under the condition of not listening to the voice so as to carry out pronunciation correction for the user; therefore, the embodiment of the invention can save the time cost spent in listening, and further can improve the efficiency of voice correction.
Method example III
Referring to fig. 3, a flowchart illustrating steps of a third embodiment of a voice processing method according to the present invention is shown, which may specifically include the following steps:
Step 301, performing voice activity detection on voice to obtain a plurality of voice units corresponding to the voice;
Step 302, outputting voice entries and voice recognition results respectively corresponding to the plurality of voice units;
With respect to the first embodiment of the method shown in fig. 1, the method of this embodiment may further include:
step 303, displaying the current emotion parameters of the language unit in the voice recognition result;
Step 304, displaying an emotion input interface corresponding to the language unit according to the correction operation of the user on the current emotion parameters so as to enable the user to input corrected emotion parameters;
And 305, correcting the voice corresponding to the target voice unit according to the corrected emotion parameters.
The embodiment of the invention provides the current emotion parameters of the language unit in the voice recognition result for the user to carry out emotion correction so as to apply the emotion parameters required by the user to voice processing, thereby improving the accuracy of voice processing and the satisfaction degree of the user on the voice processing result.
The language units of the embodiments of the present invention may be words, or phrases, or sentences, etc. In other words, the embodiment of the invention can display the current emotion parameters by taking the language units such as words, phrases, sentences and the like in the voice recognition result as units so as to be corrected by the user.
Emotion characterizes a person's mental experience and feelings and is used to describe emotions with social meaning, such as love, appreciation of beauty, grief and indignation, etc. The embodiment of the invention can perform semantic analysis on the language units to obtain the current emotion parameters. Alternatively, an emotion classification model can be used to determine the emotion category to which a language unit belongs. Specific emotion categories may be determined by those skilled in the art according to actual application requirements, and will not be described herein.
The embodiment of the invention can identify the emotion parameters of sentences or words (for example, the emotion parameter of one sentence is grief and indignation, that of another is somber, and that of another is choked with sobs) and display the emotion parameters in the surrounding areas of the corresponding sentences or words.
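A toy sketch of producing a current emotion parameter for each sentence; a real implementation would use the semantic analysis or the emotion classification model mentioned above, and the cue-word lexicon here is purely illustrative.

```python
# Illustrative lexicon only; a trained emotion classification model would be used in practice.
EMOTION_LEXICON = {
    "悲愤": ["愤怒", "气愤", "岂有此理"],
    "悲伤": ["难过", "哭", "泪"],
    "喜悦": ["高兴", "开心", "太好了"],
}

def current_emotion(sentence, default="中性"):
    """Pick the emotion category whose cue words appear in the sentence, to show how
    a per-sentence current emotion parameter could be produced and displayed."""
    for emotion, cues in EMOTION_LEXICON.items():
        if any(cue in sentence for cue in cues):
            return emotion
    return default

print(current_emotion("听到这个消息他难过得哭了"))   # expected: 悲伤
```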
In the embodiment of the present invention, the emotion input interface for displaying the language unit may include: displaying emotion options corresponding to the language units for users to select, and determining modified emotion parameters according to the emotion options selected by the users; or the emotion input box corresponding to the language unit can be displayed, so that the user inputs the corrected emotion parameters in the emotion input box.
In an alternative embodiment of the invention, the modified emotion parameters can be applied to speech synthesis to complete corresponding emotion migration; that is, according to the modified emotion parameters, the voice recognition result is synthesized. Thus, the problem of dullness of the voice synthesis result can be avoided to a certain extent, that is, the occurrence of stable and stiff sound similar to a robot can be reduced, and the naturalness and emotion of the voice synthesis result can be improved.
In the embodiment of the invention, the modified emotion parameters correspond to the second target voice unit, and then voice synthesis can be performed on the voice recognition result corresponding to the second target voice unit according to the modified emotion parameters; and fusing the voice synthesis result corresponding to the second target voice unit with the voice corresponding to the non-second target voice unit to obtain a result voice, wherein the result voice can be corrected voice and can be provided for a user.
According to the embodiment of the invention, the voice unit is used as a unit for voice synthesis in the voice correction process, so that the operation cost for performing voice synthesis on the whole voice file can be saved, and the efficiency of voice correction can be improved.
It should be noted that the second method embodiment shown in fig. 2 and the third method embodiment shown in fig. 3 may be combined; that is, the embodiment of the present invention may provide correction of both pronunciation information and emotion parameters and use the corrected pronunciation information and emotion parameters for speech synthesis, so as to implement speech correction in units of speech units. It can be appreciated that the embodiment of the present invention does not limit the order in which the pronunciation information and the emotion parameters are corrected; they can be corrected in either order or simultaneously. In addition, the embodiment of the present invention does not limit the order in which speech synthesis is performed according to the corrected pronunciation information and emotion parameters; these steps can likewise be performed in either order or simultaneously.
It should be noted that, for simplicity of description, the method embodiments are described as a series of action combinations, but those skilled in the art should appreciate that the embodiments of the present invention are not limited by the order of actions described, as some steps may be performed in another order or simultaneously in accordance with the embodiments of the present invention. Further, those skilled in the art should understand that the embodiments described in the specification are all preferred embodiments and that the actions involved are not necessarily required by the embodiments of the present invention.
Device embodiment
Referring to fig. 4, a block diagram illustrating a voice processing apparatus according to an embodiment of the present invention may specifically include:
The detection module 401 is configured to perform voice activity detection on voice to obtain a plurality of voice units corresponding to the voice;
An output module 402, configured to output a voice entry and a voice recognition result corresponding to the plurality of voice units respectively;
A receiving module 403, configured to receive correction information of a target speech unit from a user;
and the correction module 404 is configured to correct the target speech unit according to the correction information.
Alternatively, the detection module 401 may include:
The class determining module is used for determining the class corresponding to the voice fragment in the voice by using a classifier based on the long-short-term memory network; the categories may include: a phonetic class, or a non-phonetic class.
Optionally, the above-mentioned category determining module may include:
the probability determining module is used for determining a first probability that the voice fragment belongs to the non-voice category by using a classifier based on the long-short-term memory network;
And the segmentation point determining module is used for, if the first probability exceeds a first probability threshold and the number of voice frames included in the voice fragment exceeds a number threshold, determining the segmentation point according to the voice frames included in the voice fragment.
Optionally, the training data of the classifier may include: and voice data corresponding to the voice processing scene, wherein the voice data comprises noise.
Optionally, the correction information may include: corrected pronunciation information.
Optionally, the apparatus may further include:
the first display module is used for displaying the current pronunciation information of the polyphones in the voice recognition result; the current pronunciation information is obtained according to the polyphones and the context thereof;
The second display module is used for displaying the pronunciation input interface corresponding to the polyphone according to the correction operation of the user on the current pronunciation information so as to enable the user to input the corrected pronunciation information.
Optionally, the correction information may include: the modified emotion parameters;
the apparatus may further include:
the third display module is used for displaying the current emotion parameters of the language unit in the voice recognition result;
And the fourth display module is used for displaying the emotion input interface corresponding to the language unit according to the correction operation of the user on the current emotion parameters so as to enable the user to input the corrected emotion parameters.
Optionally, the apparatus may further include:
And the fusion module is used for fusing the correction result corresponding to the target voice unit and the voice corresponding to the non-target voice unit to obtain corrected voice.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
Fig. 5 is a block diagram illustrating an apparatus 1300 for speech processing according to an example embodiment. For example, apparatus 1300 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 5, apparatus 1300 may include one or more of the following components: a processing component 1302, a memory 1304, a power component 1306, a multimedia component 1308, an audio component 1310, an input/output (I/O) interface 1312, a sensor component 1314, and a communication component 1316.
The processing component 1302 generally controls overall operation of the apparatus 1300, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing element 1302 may include one or more processors 1320 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 1302 can include one or more modules that facilitate interactions between the processing component 1302 and other components. For example, the processing component 1302 may include a multimedia module to facilitate interaction between the multimedia component 1308 and the processing component 1302.
The memory 1304 is configured to store various types of data to support operations at the device 1300. Examples of such data include instructions for any application or method operating on the apparatus 1300, contact data, phonebook data, messages, pictures, videos, and the like. The memory 1304 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply assembly 1306 provides power to the various components of the device 1300. The power supply components 1306 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 1300.
The multimedia component 1308 includes a screen between the device 1300 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1308 includes a front-facing camera and/or a rear-facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 1300 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 1310 is configured to output and/or input audio signals. For example, the audio component 1310 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 1300 is in an operational mode, such as a call mode, a recording mode, and a voice data processing mode. The received audio signals may be further stored in the memory 1304 or transmitted via the communication component 1316. In some embodiments, the audio component 1310 also includes a speaker for outputting audio signals.
The I/O interface 1312 provides an interface between the processing component 1302 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 1314 includes one or more sensors for providing status assessment of various aspects of the apparatus 1300. For example, the sensor assembly 1314 may detect the on/off state of the device 1300 and the relative positioning of components, such as the display and keypad of the apparatus 1300. The sensor assembly 1314 may also detect a change in position of the apparatus 1300 or of one of its components, the presence or absence of user contact with the apparatus 1300, the orientation or acceleration/deceleration of the apparatus 1300, and a change in temperature of the apparatus 1300. The sensor assembly 1314 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. The sensor assembly 1314 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1314 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1316 is configured to facilitate wired or wireless communication between the apparatus 1300 and other devices. The apparatus 1300 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 1316 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1316 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 1300 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, such as memory 1304, including instructions executable by processor 1320 of apparatus 1300 to perform the above-described method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
A non-transitory computer-readable storage medium is also provided whose instructions, when executed by a processor of a terminal, cause the terminal to perform a method for processing voice units, the method comprising: performing voice activity detection on voice to obtain a plurality of voice units corresponding to the voice; outputting voice entries and voice recognition results respectively corresponding to the voice units; receiving correction information from a user for a target voice unit; and correcting the target voice unit according to the correction information.
Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present invention. The server 1900 may vary considerably in configuration or performance and may include one or more central processing units (CPUs) 1922 (e.g., one or more processors), memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) that store applications 1942 or data 1944. The memory 1932 and the storage medium 1930 may be transitory or persistent. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations stored in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings and described above, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed; any modifications, equivalent substitutions, and improvements made within the spirit and scope of the invention are intended to be included within the scope of the invention.
An embodiment of the invention discloses A1, a method for processing voice units, the method comprising the following steps:
performing voice activity detection on voice to obtain a plurality of voice units corresponding to the voice;
outputting voice entries and voice recognition results respectively corresponding to the voice units;
receiving correction information from a user for a target voice unit;
and correcting the target voice unit according to the correction information.
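For illustration only, the following Python sketch mirrors the A1 flow under assumed helper names; the energy-based `detect_voice_activity` stand-in, the `entry` strings, and the `recognize` callable are placeholders introduced here, not the disclosed implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class VoiceUnit:
    audio: list                       # samples belonging to this voice unit
    entry: str = ""                   # voice entry: playback handle for auditioning the unit
    transcript: str = ""              # voice recognition result for the unit
    correction: Optional[dict] = None # correction information supplied by the user

def detect_voice_activity(speech: list, frame: int = 160) -> List[list]:
    """Placeholder VAD: split the speech on runs of low-energy frames."""
    units, current = [], []
    for i in range(0, len(speech), frame):
        chunk = speech[i:i + frame]
        if sum(abs(s) for s in chunk) / max(len(chunk), 1) > 0.01:
            current.extend(chunk)
        elif current:
            units.append(current)
            current = []
    if current:
        units.append(current)
    return units

def process_speech(speech: list, recognize: Callable[[list], str]) -> List[VoiceUnit]:
    """Steps 1-2: obtain voice units by VAD, then output an entry and a recognition result per unit."""
    units = [VoiceUnit(audio=seg) for seg in detect_voice_activity(speech)]
    for i, unit in enumerate(units):
        unit.entry = f"/play/unit/{i}"
        unit.transcript = recognize(unit.audio)
    return units

def correct_unit(units: List[VoiceUnit], target: int, correction: dict) -> None:
    """Steps 3-4: receive the user's correction for the target unit and record it."""
    units[target].correction = correction
```

In this sketch the correction is merely recorded; clauses A5 to A8 below describe what the correction information may contain and how the corrected unit is fused back into the overall voice.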
A2, the method according to A1, wherein the voice activity detection comprises:
determining a category corresponding to a voice segment in the voice by using a classifier based on a long short-term memory network; the category comprises: a speech class, or a non-speech class.
A3, the method according to A2, wherein the determining the category corresponding to the voice comprises the following steps:
determining a first probability that the voice segment belongs to the non-speech class by using the classifier based on the long short-term memory network;
if the first probability exceeds a first probability threshold and the number of voice frames included in the voice segment exceeds a number threshold, determining a segmentation point according to the voice frames included in the voice segment.
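As a rough, non-authoritative illustration of A2 and A3, the sketch below runs a small LSTM classifier over frame features, reads the per-frame non-speech probability, and emits a segmentation point only when that probability exceeds a probability threshold while the current segment already holds more than a threshold number of speech frames. The feature dimension, layer size, and threshold values are assumptions, not parameters from the disclosure.

```python
import torch
import torch.nn as nn

class VADClassifier(nn.Module):
    """LSTM-based speech / non-speech classifier over per-frame features (illustrative sizes)."""
    def __init__(self, feat_dim: int = 40, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # index 0: non-speech class, index 1: speech class

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(frames)                  # frames: (batch, T, feat_dim)
        return torch.softmax(self.head(out), dim=-1)

def find_segmentation_points(frames: torch.Tensor,
                             model: VADClassifier,
                             prob_threshold: float = 0.8,
                             count_threshold: int = 30) -> list:
    """Emit a segmentation point when the non-speech probability exceeds the first
    probability threshold and the segment already contains enough speech frames."""
    with torch.no_grad():
        probs = model(frames.unsqueeze(0))[0]       # (T, 2) per-frame class probabilities
    cut_points, speech_frames = [], 0
    for t in range(probs.shape[0]):
        non_speech_prob = probs[t, 0].item()        # "first probability": non-speech class
        if non_speech_prob <= prob_threshold:
            speech_frames += 1                      # frame counted toward the current segment
        elif speech_frames > count_threshold:
            cut_points.append(t)                    # cut after the accumulated speech frames
            speech_frames = 0
    return cut_points
```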
A4, the method according to A2, wherein the training data of the classifier comprises: voice data corresponding to the voice processing scene, wherein the voice data comprises noise.
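A4 only requires that the training data be scene-matched voice data containing noise; one common way to construct such data (an assumption made here for illustration, not the disclosed procedure) is to mix recorded scene noise into clean utterances at a target signal-to-noise ratio:

```python
import numpy as np

def mix_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float = 10.0) -> np.ndarray:
    """Mix scene noise into clean speech at the requested SNR (illustrative augmentation)."""
    noise = np.resize(noise, clean.shape)                    # tile/trim noise to the same length
    speech_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Hypothetical usage: pair each clean utterance with noise recorded in the target scene.
# noisy = mix_noise(clean_utterance, office_noise, snr_db=5.0)
```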
A5, the method according to any one of A1 to A4, the correction information including: corrected pronunciation information.
A6, the method according to A5, the method further comprising:
displaying the current pronunciation information of a polyphone in the voice recognition result; the current pronunciation information is obtained according to the polyphone and its context;
and displaying a pronunciation input interface corresponding to the polyphone according to a correction operation of the user on the current pronunciation information, so that the user can input the corrected pronunciation information.
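To make the A6 flow concrete, here is a minimal sketch assuming a toy context-to-reading lookup; the dictionary entries, pinyin notation, and function names are illustrative assumptions, not part of the disclosure. The current reading of a polyphone is derived from the polyphone and its context and displayed, and whatever the user enters through the pronunciation input interface replaces it.

```python
# Toy context-aware readings for a few polyphones (illustrative entries only).
POLYPHONE_READINGS = {
    "行": {"银行": "hang2", "行走": "xing2"},
    "重": {"重要": "zhong4", "重复": "chong2"},
}

def current_pronunciation(char: str, context: str) -> str:
    """Derive the current pronunciation of a polyphone from the polyphone and its context."""
    for cue, reading in POLYPHONE_READINGS.get(char, {}).items():
        if cue in context:
            return reading
    return "unknown"

def apply_pronunciation_correction(readings: dict, char: str, corrected: str) -> dict:
    """Store the corrected pronunciation entered in the pronunciation input interface."""
    readings[char] = corrected
    return readings

# Display flow: current_pronunciation("行", "他们一行五人") returns "unknown" with this toy
# dictionary, so the user corrects it: apply_pronunciation_correction({}, "行", "xing2").
```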
A7, the method according to any one of A1 to A4, wherein the correction information comprises: corrected emotion parameters;
the method further comprises the steps of:
displaying the current emotion parameters of the language units in the voice recognition result;
and displaying an emotion input interface corresponding to the language unit according to the correction operation of the user on the current emotion parameters, so that the user can input the corrected emotion parameters.
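A hedged sketch of the A7 flow follows; the emotion field names and value ranges are assumptions made for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class LanguageUnit:
    text: str
    emotion: dict = field(default_factory=lambda: {"emotion": "neutral", "intensity": 0.5})

def show_current_emotion(unit: LanguageUnit) -> dict:
    """Display step: return the current emotion parameters of the language unit for the UI."""
    return dict(unit.emotion)

def input_corrected_emotion(unit: LanguageUnit, corrected: dict) -> None:
    """Emotion input interface: overwrite the parameters with the user's corrected values."""
    unit.emotion.update(corrected)

# e.g. input_corrected_emotion(unit, {"emotion": "happy", "intensity": 0.8}); the corrected
# parameters can later drive re-synthesis of the recognition result (see claim 1 below).
```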
A8, the method according to any one of A1 to A4, the method further comprising:
fusing the correction result corresponding to the target voice unit with the voice corresponding to the non-target voice units to obtain corrected voice.
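A8's fusion can be pictured with the simple concatenation below, a simplification that assumes each voice unit carries sample-level audio kept in its original order:

```python
from typing import List, Sequence

def fuse_corrected_voice(unit_audio: List[Sequence[float]],
                         target_index: int,
                         corrected_audio: Sequence[float]) -> list:
    """Fuse the correction result of the target unit with the voice of the non-target units."""
    fused: list = []
    for i, audio in enumerate(unit_audio):
        # Non-target units are kept as-is; the corrected audio replaces the target unit.
        fused.extend(corrected_audio if i == target_index else audio)
    return fused
```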
An embodiment of the invention discloses B9, a voice processing apparatus, the apparatus comprising:
The detection module is used for detecting voice activity of voice to obtain a plurality of voice units corresponding to the voice;
The output module is used for outputting voice entries and voice recognition results respectively corresponding to the voice units;
The receiving module is used for receiving correction information from a user for a target voice unit;
and the correction module is used for correcting the target voice unit according to the correction information.
B10, the apparatus of B9, the detection module comprising:
The category determining module is used for determining the category corresponding to the voice segment in the voice by using a classifier based on a long short-term memory network; the category comprises: a speech class, or a non-speech class.
B11, the apparatus of B10, the category determining module comprising:
the probability determining module is used for determining a first probability that the voice segment belongs to the non-speech class by using the classifier based on the long short-term memory network;
and the segmentation point determining module is used for determining a segmentation point according to the voice frames included in the voice segment if the first probability exceeds a first probability threshold and the number of voice frames included in the voice segment exceeds a number threshold.
B12, the apparatus of B10, wherein the training data of the classifier comprises: voice data corresponding to the voice processing scene, wherein the voice data comprises noise.
B13, the apparatus according to any one of B9 to B12, the correction information including: corrected pronunciation information.
B14, the apparatus of B13, the apparatus further comprising:
the first display module is used for displaying the current pronunciation information of the polyphones in the voice recognition result; the current pronunciation information is obtained according to the polyphones and the context thereof;
And the second display module is used for displaying the pronunciation input interface corresponding to the polyphone according to the correction operation of the user on the current pronunciation information so as to enable the user to input the corrected pronunciation information.
B15, the apparatus according to any one of B9 to B12, wherein the correction information comprises: corrected emotion parameters;
The apparatus further comprises:
The third display module is used for displaying the current emotion parameters of the language unit in the voice recognition result;
And the fourth display module is used for displaying the emotion input interface corresponding to the language unit according to the correction operation of the user on the current emotion parameters so as to enable the user to input the corrected emotion parameters.
B16, the apparatus of any one of B9 to B12, the apparatus further comprising:
and the fusion module is used for fusing the corrected result corresponding to the target voice unit and the voice corresponding to the non-target voice unit to obtain corrected voice.
An embodiment of the invention discloses C17, a device for voice processing, the device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs comprising instructions for:
performing voice activity detection on voice to obtain a plurality of voice units corresponding to the voice;
outputting voice entries and voice recognition results respectively corresponding to the voice units;
receiving correction information from a user for a target voice unit;
and correcting the target voice unit according to the correction information.
C18, the apparatus of C17, wherein the voice activity detection comprises:
determining a category corresponding to a voice segment in the voice by using a classifier based on a long short-term memory network; the category comprises: a speech class, or a non-speech class.
C19, the device according to C18, wherein the determining the category corresponding to the voice comprises:
determining a first probability that the voice segment belongs to the non-speech class by using the classifier based on the long short-term memory network;
if the first probability exceeds a first probability threshold and the number of voice frames included in the voice segment exceeds a number threshold, determining a segmentation point according to the voice frames included in the voice segment.
C20, the apparatus of C18, wherein the training data of the classifier comprises: voice data corresponding to the voice processing scene, wherein the voice data comprises noise.
C21, the apparatus of any one of C17 to C20, the correction information comprising: corrected pronunciation information.
C22, the device of C21, wherein the one or more programs further comprise instructions, executable by the one or more processors, for:
displaying the current pronunciation information of a polyphone in the voice recognition result; the current pronunciation information is obtained according to the polyphone and its context;
and displaying the pronunciation input interface corresponding to the polyphone according to the correction operation of the user on the current pronunciation information, so that the user can input the corrected pronunciation information.
C23, the apparatus according to any one of C17 to C20, wherein the correction information comprises: corrected emotion parameters;
the one or more programs further comprise instructions, executable by the one or more processors, for:
displaying the current emotion parameters of the language units in the voice recognition result;
and displaying an emotion input interface corresponding to the language unit according to the correction operation of the user on the current emotion parameters, so that the user can input the corrected emotion parameters.
C24, the device of any one of C17 to C20, wherein the one or more programs further comprise instructions, executable by the one or more processors, for:
fusing the correction result corresponding to the target voice unit with the voice corresponding to the non-target voice units to obtain corrected voice.
25. One or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method of one or more of A1-A8.
The foregoing has described in detail a speech processing method, a speech processing apparatus, and a device for speech processing. Specific examples are presented herein to illustrate the principles and embodiments of the present invention and to aid understanding of the method and its core concepts. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope in accordance with the ideas of the present invention. In view of the above, the content of this description should not be construed as limiting the present invention.

Claims (7)

1. A method of speech processing, the method comprising:
performing voice activity detection on voice to obtain a plurality of voice units corresponding to the voice;
outputting voice entries and voice recognition results respectively corresponding to the voice units; the voice entry is a voice playback control for listening to the voice corresponding to the voice unit;
receiving correction information from a user for a target voice unit;
correcting the target voice unit according to the correction information;
the correction information comprises: corrected emotion parameters;
the method further comprises the steps of:
displaying the current emotion parameters of the language units in the voice recognition result;
displaying an emotion input interface corresponding to the language unit according to the correction operation of the user on the current emotion parameters, so that the user can input the corrected emotion parameters;
performing voice synthesis on the voice recognition result according to the corrected emotion parameters;
the voice activity detection for the voice comprises:
determining the category corresponding to the voice segment in the voice by using a classifier based on a long short-term memory network; the category comprises: a speech class, or a non-speech class;
the determining the category corresponding to the voice comprises the following steps:
determining a first probability that the voice segment belongs to the non-speech class by using the classifier based on the long short-term memory network;
if the first probability exceeds a first probability threshold and the number of voice frames included in the voice segment exceeds a number threshold, determining a segmentation point according to the voice frames included in the voice segment.
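Claim 1 additionally re-synthesizes the recognition result with the corrected emotion parameters. Purely as a hedged sketch of that step, `tts` below stands in for any emotion-controllable text-to-speech engine; neither the callable nor its keyword arguments are APIs named in the disclosure.

```python
from typing import Callable

def resynthesize(recognition_result: str, emotion: dict, tts: Callable[..., bytes]) -> bytes:
    """Run speech synthesis on the recognition result using the corrected emotion parameters."""
    return tts(text=recognition_result,
               emotion=emotion.get("emotion", "neutral"),
               intensity=emotion.get("intensity", 0.5))

# e.g. audio = resynthesize("今天天气不错", {"emotion": "happy", "intensity": 0.8}, tts=my_tts)
```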
2. The method of claim 1, wherein the training data of the classifier comprises: voice data corresponding to the voice processing scene, wherein the voice data comprises noise.
3. The method according to any one of claims 1 to 2, wherein the correction information includes: corrected pronunciation information.
4. A method according to claim 3, characterized in that the method further comprises:
displaying the current pronunciation information of the polyphones in the voice recognition result; the current pronunciation information is obtained according to the polyphones and the context thereof;
and displaying the pronunciation input interface corresponding to the polyphone according to the correction operation of the user on the current pronunciation information so as to enable the user to input the corrected pronunciation information.
5. A speech processing apparatus, comprising:
The detection module is used for detecting voice activity of voice to obtain a plurality of voice units corresponding to the voice;
The output module is used for outputting voice entries and voice recognition results respectively corresponding to the voice units; the voice entry is a voice playback control for listening to the voice corresponding to the voice unit;
The receiving module is used for receiving the correction information of the user for the target voice unit; and
The correction module is used for correcting the target voice unit according to the correction information;
the correction information comprises: corrected emotion parameters;
The apparatus further comprises:
The third display module is used for displaying the current emotion parameters of the language unit in the voice recognition result;
The fourth display module is used for displaying an emotion input interface corresponding to the language unit according to the correction operation of the user on the current emotion parameters so as to enable the user to input the corrected emotion parameters;
performing voice synthesis on the voice recognition result according to the corrected emotion parameters;
the detection module comprises:
The category determining module is used for determining the category corresponding to the voice segment in the voice by using a classifier based on a long short-term memory network; the category comprises: a speech class, or a non-speech class;
the category determining module comprises:
the probability determining module is used for determining a first probability that the voice segment belongs to the non-speech class by using the classifier based on the long short-term memory network;
and the segmentation point determining module is used for determining a segmentation point according to the voice frames included in the voice segment if the first probability exceeds a first probability threshold and the number of voice frames included in the voice segment exceeds a number threshold.
6. An apparatus for speech processing comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
performing voice activity detection on voice to obtain a plurality of voice units corresponding to the voice;
outputting voice entries and voice recognition results respectively corresponding to the voice units; the voice entry is a voice playback control for listening to the voice corresponding to the voice unit;
receiving correction information from a user for a target voice unit;
correcting the target voice unit according to the correction information;
the correction information comprises: corrected emotion parameters;
displaying the current emotion parameters of the language units in the voice recognition result;
displaying an emotion input interface corresponding to the language unit according to the correction operation of the user on the current emotion parameters, so that the user can input the corrected emotion parameters;
performing voice synthesis on the voice recognition result according to the corrected emotion parameters;
the voice activity detection for the voice comprises:
determining the category corresponding to the voice segment in the voice by using a classifier based on a long short-term memory network; the category comprises: a speech class, or a non-speech class;
the determining the category corresponding to the voice comprises the following steps:
determining a first probability that the voice segment belongs to the non-speech class by using the classifier based on the long short-term memory network;
if the first probability exceeds a first probability threshold and the number of voice frames included in the voice segment exceeds a number threshold, determining a segmentation point according to the voice frames included in the voice segment.
7. One or more machine readable media having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform the method of one or more of claims 1-4.
CN202010850493.3A 2020-08-21 2020-08-21 Voice processing method, device and medium Active CN112151072B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010850493.3A CN112151072B (en) 2020-08-21 2020-08-21 Voice processing method, device and medium

Publications (2)

Publication Number Publication Date
CN112151072A CN112151072A (en) 2020-12-29
CN112151072B true CN112151072B (en) 2024-07-02

Family

ID=73888204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010850493.3A Active CN112151072B (en) 2020-08-21 2020-08-21 Voice processing method, device and medium

Country Status (1)

Country Link
CN (1) CN112151072B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112786071A (en) * 2021-01-13 2021-05-11 国家电网有限公司客户服务中心 Data annotation method for voice segments of voice interaction scene
CN114267352B (en) * 2021-12-24 2023-04-14 北京信息科技大学 Voice information processing method, electronic equipment and computer storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103366731A (en) * 2012-03-31 2013-10-23 盛乐信息技术(上海)有限公司 Text to speech (TTS) method and system
CN103366742A (en) * 2012-03-31 2013-10-23 盛乐信息技术(上海)有限公司 Voice input method and system
CN110349597A (en) * 2019-07-03 2019-10-18 山东师范大学 A kind of speech detection method and device
CN111128186A (en) * 2019-12-30 2020-05-08 云知声智能科技股份有限公司 Multi-phonetic-character phonetic transcription method and device

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101071339A (en) * 2007-06-18 2007-11-14 罗蒙明 Method for standarizing applied words during Chinese characterinput process
KR102023157B1 (en) * 2012-07-06 2019-09-19 삼성전자 주식회사 Method and apparatus for recording and playing of user voice of mobile terminal
CN103000176B (en) * 2012-12-28 2014-12-10 安徽科大讯飞信息科技股份有限公司 Speech recognition method and system
CN105118502B (en) * 2015-07-14 2017-05-10 百度在线网络技术(北京)有限公司 End point detection method and system of voice identification system
CN105304080B (en) * 2015-09-22 2019-09-03 科大讯飞股份有限公司 Speech synthetic device and method
CN107346318B (en) * 2016-05-06 2021-01-12 腾讯科技(深圳)有限公司 Method and device for extracting voice content
CN106409296A (en) * 2016-09-14 2017-02-15 安徽声讯信息技术有限公司 Voice rapid transcription and correction system based on multi-core processing technology
CN108346425B (en) * 2017-01-25 2021-05-25 北京搜狗科技发展有限公司 Voice activity detection method and device and voice recognition method and device
CN107729313B (en) * 2017-09-25 2021-09-17 百度在线网络技术(北京)有限公司 Deep neural network-based polyphone pronunciation distinguishing method and device
CN107945802A (en) * 2017-10-23 2018-04-20 北京云知声信息技术有限公司 Voice recognition result processing method and processing device
CN111145724B (en) * 2019-12-31 2022-08-19 出门问问信息科技有限公司 Polyphone marking method and device and computer readable storage medium

Also Published As

Publication number Publication date
CN112151072A (en) 2020-12-29

Similar Documents

Publication Publication Date Title
US11887590B2 (en) Voice enablement and disablement of speech processing functionality
US20200349943A1 (en) Contact resolution for communications systems
US10339166B1 (en) Systems and methods for providing natural responses to commands
CN107632980B (en) Voice translation method and device for voice translation
US20210366462A1 (en) Emotion classification information-based text-to-speech (tts) method and apparatus
EP3824462B1 (en) Electronic apparatus for processing user utterance and controlling method thereof
CN110210310A (en) A kind of method for processing video frequency, device and the device for video processing
US20230176813A1 (en) Graphical interface for speech-enabled processing
CN107909995B (en) Voice interaction method and device
CN107274903B (en) Text processing method and device for text processing
CN112037756A (en) Voice processing method, apparatus and medium
CN112151072B (en) Voice processing method, device and medium
CN108628819B (en) Processing method and device for processing
KR20210042523A (en) An electronic apparatus and Method for controlling the electronic apparatus thereof
US11404050B2 (en) Electronic apparatus and method for controlling thereof
CN112036174B (en) Punctuation marking method and device
WO2018079294A1 (en) Information processing device and information processing method
CN110930977B (en) Data processing method and device and electronic equipment
CN114067781A (en) Method, apparatus and medium for detecting speech recognition result
CN113891150A (en) Video processing method, device and medium
CN113889105A (en) Voice translation method and device for voice translation
US10649725B1 (en) Integrating multi-channel inputs to determine user preferences
CN112837668A (en) Voice processing method and device for processing voice
EP4350690A1 (en) Artificial intelligence device and operating method thereof
CN115116442B (en) Voice interaction method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant