CN111326144A - Voice data processing method, device, medium and computing equipment - Google Patents


Info

Publication number
CN111326144A
CN111326144A (application CN202010129408.4A)
Authority
CN
China
Prior art keywords
vocabulary
initial text
voice
text
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010129408.4A
Other languages
Chinese (zh)
Other versions
CN111326144B (en)
Inventor
杨震 (Yang Zhen)
刘东 (Liu Dong)
李响 (Li Xiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202010129408.4A priority Critical patent/CN111326144B/en
Publication of CN111326144A publication Critical patent/CN111326144A/en
Application granted granted Critical
Publication of CN111326144B publication Critical patent/CN111326144B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/04: Segmentation; Word boundary detection
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling

Abstract

An embodiment of the invention provides a voice data processing method comprising the following steps: obtaining a speech segment and an initial text for the speech segment, and constructing a biased language model of the speech segment based on the initial text; determining a plurality of candidate word sequences for the speech segment based on the constructed biased language model; performing a forced alignment operation between each candidate word sequence and the acoustic features of the speech segment, so as to select a preferred word sequence from the candidates; and determining the annotation text of the speech segment based on the differences between the initial text and the preferred word sequence. Embodiments of the invention also provide a voice data processing apparatus, a medium, and a computing device.

Description

Voice data processing method, apparatus, medium and computing device
Technical Field
Embodiments of the invention relate to the field of computer technology, and in particular to a voice data processing method, apparatus, medium, and computing device.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
In the field of speech recognition, training and tuning a speech recognition system requires massive amounts of annotated speech data, i.e., speech data together with its corresponding annotation text. Both the quantity of annotated speech data and the accuracy of its annotation text strongly affect the training efficiency and the performance of the speech recognition system. Traditional approaches to acquiring annotated speech data suffer from heavy workload, high error rates, long turnaround time, and high demands on the professional skill of annotators. How to quickly acquire large amounts of annotated speech data while ensuring the accuracy of the annotation text is therefore an urgent problem.
Disclosure of Invention
In this context, embodiments of the present invention are intended to provide a voice data processing method, apparatus, medium, and computing device.
In a first aspect of embodiments of the invention, a voice data processing method is provided, including: obtaining a speech segment and an initial text for the speech segment, and constructing a biased language model of the speech segment based on the initial text; determining a plurality of candidate word sequences for the speech segment based on the constructed biased language model; performing a forced alignment operation between each candidate word sequence and the acoustic features of the speech segment, so as to select a preferred word sequence from the candidates; and determining the annotation text of the speech segment based on the differences between the initial text and the preferred word sequence.
In an embodiment of the invention, obtaining a speech segment and an initial text for the speech segment includes: acquiring speech data, and text data for that speech data, from existing data on the Internet; performing endpoint detection on the speech data and splitting it into a plurality of speech segments; and, for any one of the speech segments, determining from the text data an initial text for that segment, the initial text comprising a plurality of words arranged in a predetermined order.
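The endpoint detection and segmentation step can be pictured with a toy energy-based detector (a production system would use a trained VAD model; the frame length, energy threshold, and gap parameters below are arbitrary illustrative values):

```python
import math

def segment_by_energy(samples, rate, frame_ms=20, threshold=0.01, min_gap_frames=5):
    """Toy endpoint detection: frames whose RMS energy exceeds `threshold`
    count as speech; speech frames separated by fewer than `min_gap_frames`
    silent frames are merged into one segment.
    Returns a list of (start_sec, end_sec) pairs."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len

    def rms(i):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        return math.sqrt(sum(x * x for x in frame) / frame_len)

    segments, start, gap = [], None, 0
    for i in range(n_frames):
        if rms(i) > threshold:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap_frames:
                segments.append((start * frame_ms / 1000,
                                 (i - gap + 1) * frame_ms / 1000))
                start, gap = None, 0
    if start is not None:
        segments.append((start * frame_ms / 1000, n_frames * frame_ms / 1000))
    return segments
```

Each returned time range would become one speech segment, to be paired with its slice of the initial text.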
In another embodiment of the invention, the method further includes: before constructing the biased language model of the speech segment, determining whether any of the words contained in the initial text has a known confusable counterpart. If so, the confusable counterpart is inserted into the initial text at the same position as the original word.
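The confusable-word insertion can be sketched as follows; the confusion table here is a made-up example (a real system would derive it from pronunciation similarity):

```python
# Hypothetical confusion table: word -> acoustically confusable alternatives.
CONFUSABLE = {
    "two": ["too", "to"],
    "their": ["there"],
}

def expand_with_confusables(words, table=CONFUSABLE):
    """For each position in the initial text, keep the original word and
    insert any confusable alternatives at the same position, producing a
    per-slot list of candidate words."""
    return [[w] + table.get(w, []) for w in words]
```

Downstream, the biased language model can then assign probability mass to either alternative, and forced alignment decides which one the audio actually supports.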
In another embodiment of the invention, constructing the biased language model of the speech segment based on the initial text includes: for each word contained in the initial text, computing its biased 1- through N-gram probabilities, where N is an integer greater than 1; determining the state transition probability between any two of the words based on those biased n-gram probabilities; and constructing the biased language model from the state transition probabilities.
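As a minimal illustration of how state-transition probabilities can be estimated from n-gram counts of the initial text alone (shown here for bigrams only; the patent uses orders 1 through N):

```python
from collections import Counter, defaultdict

def bigram_transitions(words):
    """Estimate P(next_word | current_word) from the initial text only.
    Restricting the counts to this single text is what makes the
    resulting model 'biased' toward the expected transcription."""
    pair_counts = Counter(zip(words, words[1:]))
    left_counts = Counter(words[:-1])
    trans = defaultdict(dict)
    for (a, b), c in pair_counts.items():
        trans[a][b] = c / left_counts[a]
    return trans
```

The resulting transition table is exactly the edge-weight structure a small language-model graph needs.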
In another embodiment of the invention, computing the biased 1- through N-gram probabilities of a word includes: computing the word's frequency; computing, based on that frequency, the word's corrected 1- through N-gram probabilities within the initial text; and smoothing the corrected probabilities to adjust the probability distribution among them, thereby obtaining the word's biased 1- through N-gram probabilities.
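The smoothing step can be illustrated with simple additive (add-k) smoothing, standing in for whatever scheme the patent intends; `k` plays the role of an offset that redistributes probability mass:

```python
def smooth_counts(counts, vocab_size, k=0.5):
    """Add-k smoothing over word counts: every word in a vocabulary of
    `vocab_size` (seen or unseen) receives k pseudo-counts, so no
    probability collapses to zero.  Returns (probabilities of seen
    words, probability of each unseen word)."""
    total = sum(counts.values())
    denom = total + k * vocab_size
    seen = {w: (c + k) / denom for w, c in counts.items()}
    return seen, k / denom
```

A larger `k` flattens the distribution more aggressively, which is one way a per-language offset could adjust the probability mass between frequent and rare words.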
In another embodiment of the invention, adjusting the probability distribution among the corrected 1- through N-gram probabilities includes: adjusting the distribution by a first offset, the value of which differs between languages.
In another embodiment of the invention, computing the corrected 1- through N-gram probabilities of a word within the initial text includes: when the word's frequency exceeds a first threshold, marking the word as high-frequency; for a high-frequency word, computing its M-gram probability within the initial text, where M is an integer between 1 and N inclusive; when M is at most a second threshold, adding a second offset to the M-gram probability to obtain the corrected M-gram probability, the second offset being the ratio of the word's frequency to the total frequency of all high-frequency words in the initial text; and when M exceeds the second threshold, using the M-gram probability itself as the corrected M-gram probability.
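The correction rule just described can be written down directly; the two threshold values below are illustrative placeholders, not values from the patent:

```python
def corrected_mgram_prob(p_mgram, m, word_freq, total_hf_freq,
                         freq_threshold=5, order_threshold=2):
    """Correction-rule sketch: if the word is high-frequency (frequency
    above `freq_threshold`) and the n-gram order M is at most
    `order_threshold`, boost the M-gram probability by a second offset
    equal to the word's share of all high-frequency occurrences;
    otherwise leave the probability unchanged."""
    if word_freq > freq_threshold and m <= order_threshold:
        return p_mgram + word_freq / total_hf_freq
    return p_mgram
```

The effect is to pull low-order n-gram scores of frequent words upward, tilting decoding toward the vocabulary of the initial text.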
In another embodiment of the invention, determining the candidate word sequences of the speech segment based on the biased language model includes: obtaining a hidden Markov model, a context-dependent phone model, and a pronunciation dictionary (lexicon) model; connecting the hidden Markov model, the context-dependent phone model, the pronunciation dictionary model, and the biased language model in series, the output of each model serving as the input of the next, to build a decoding network; and feeding the speech segment into the decoding network so that it outputs a plurality of candidate word sequences for the segment.
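The serial composition can be pictured as plain function chaining, with toy stand-ins for the four models (a real decoder composes them as WFSTs, e.g. the classic H∘C∘L∘G construction with a library such as OpenFst):

```python
def chain(*stages):
    """Connect models in series: the output of each stage becomes the
    input of the next, mirroring the decoding-network construction."""
    def decode(x):
        for stage in stages:
            x = stage(x)
        return x
    return decode

# Toy stand-ins: acoustic frames -> HMM states -> context-dependent
# phones -> words -> scored word sequences.
hmm     = lambda frames: ["s1", "s2"]
context = lambda states: ["k", "ae", "t"]
lexicon = lambda phones: ["cat"]
grammar = lambda words:  [(w, 1.0) for w in words]

decoder = chain(hmm, context, lexicon, grammar)
```

Because the grammar stage here is the biased language model, the candidate sequences the network emits are steered toward the words of the initial text.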
In another embodiment of the invention, force-aligning the candidate word sequences with the acoustic features of the speech segment to determine a preferred word sequence includes: using the Viterbi algorithm to force-align each candidate word sequence with the acoustic features of the speech segment, yielding a state-sequence graph for the candidate sequences that is consistent with the acoustic features. The graph characterizes a set of state points and the transition probabilities between any two of them. The highest-probability path through the state points, the preferred path, is then found in the graph, and the preferred word sequence is determined from that path.
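A minimal Viterbi over a small state graph shows how the highest-probability path is recovered; real forced alignment scores HMM states against acoustic features, whereas the log-probabilities here are hand-made:

```python
def viterbi(obs_lp, trans_lp, init_lp):
    """obs_lp[t][s]: log-likelihood of frame t under state s;
    trans_lp[p][s]: log transition probability p -> s;
    init_lp[s]: log initial probability of state s.
    Returns the single highest-probability state path."""
    n = len(init_lp)
    dp = [init_lp[s] + obs_lp[0][s] for s in range(n)]
    back = []
    for t in range(1, len(obs_lp)):
        new_dp, ptr = [], []
        for s in range(n):
            p = max(range(n), key=lambda q: dp[q] + trans_lp[q][s])
            new_dp.append(dp[p] + trans_lp[p][s] + obs_lp[t][s])
            ptr.append(p)
        dp = new_dp
        back.append(ptr)
    path = [max(range(n), key=lambda s: dp[s])]
    for ptr in reversed(back):   # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    path.reverse()
    return path
```

Reading the word labels off the winning state path yields the preferred word sequence, with frame indices providing per-word timestamps.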
In a further embodiment of the invention, determining the annotation text of the speech segment based on the differences between the initial text and the preferred word sequence includes: obtaining the timestamp of each word in the initial text, and the timestamp of each word in the preferred word sequence. When the initial text and the preferred word sequence contain different words for the same timestamp, a difference between them is recorded. When the number of differences is below a third threshold, the differing words are removed from both the initial text and the preferred word sequence, giving a corrected initial text and a corrected preferred word sequence; either of these is then selected as the annotation text of the speech segment.
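The timestamp comparison and cleanup can be sketched as follows (the structure of the timestamped word lists and the threshold value are illustrative assumptions):

```python
def annotation_text(initial, preferred, diff_threshold=3):
    """initial / preferred: lists of (word, timestamp).  Words sharing a
    timestamp are compared; if the number of mismatches stays below
    `diff_threshold`, the mismatching words are dropped and the cleaned
    initial text becomes the annotation; otherwise the segment is
    rejected (None)."""
    pref_at = {ts: w for w, ts in preferred}
    diff_ts = [ts for w, ts in initial if pref_at.get(ts, w) != w]
    if len(diff_ts) >= diff_threshold:
        return None  # too unreliable: discard the speech segment
    bad = set(diff_ts)
    return [w for w, ts in initial if ts not in bad]
```

Returning `None` for heavily mismatched segments corresponds to the discard rule of the following embodiment.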
In a further embodiment of the invention, the method further includes: discarding the speech segment when the number of differences between the initial text and the preferred word sequence is greater than or equal to the third threshold.
In a second aspect of embodiments of the invention, a voice data processing apparatus is provided, comprising an acquisition module, a construction module, a candidate-determination module, a preference-determination module, and an annotation-determination module. The acquisition module obtains a speech segment and an initial text for the speech segment. The construction module builds a biased language model of the speech segment based on the initial text. The candidate-determination module determines a plurality of candidate word sequences for the segment based on the constructed biased language model. The preference-determination module force-aligns each candidate word sequence with the acoustic features of the segment to select a preferred word sequence from the candidates. The annotation-determination module determines the annotation text of the segment based on the differences between the initial text and the preferred word sequence.
In an embodiment of the invention, the acquisition module comprises: an acquisition submodule that acquires speech data, and text data for that speech data, from existing data on the Internet; a preprocessing submodule that performs endpoint detection on the speech data and splits it into a plurality of speech segments; and a text-extraction submodule that, for any one of the speech segments, determines from the text data an initial text for that segment, the initial text comprising a plurality of words arranged in a predetermined order.
In another embodiment of the invention, the apparatus further includes a confusion-processing module that, before the construction module builds the biased language model, determines whether any word in the initial text has a known confusable counterpart, and if so inserts that counterpart into the initial text at the same position as the original word.
In another embodiment of the invention, the construction module comprises: a first calculation submodule that, for each word contained in the initial text, computes the word's biased 1- through N-gram probabilities, where N is an integer greater than 1; a second calculation submodule that determines the state transition probability between any two of the words based on those biased n-gram probabilities; and a construction submodule that builds the biased language model from the state transition probabilities.
In another embodiment of the invention, the first calculation submodule comprises: a word-frequency calculation unit that computes the frequency of a word; a model-probability calculation unit that computes, based on that frequency, the word's corrected 1- through N-gram probabilities within the initial text; and a smoothing unit that smooths the corrected probabilities to adjust the probability distribution among them, yielding the word's biased 1- through N-gram probabilities.
In another embodiment of the invention, the smoothing unit adjusts the distribution by a first offset, the value of which differs between languages.
In another embodiment of the invention, the model-probability calculation unit operates as follows: when a word's frequency exceeds a first threshold, the word is marked as high-frequency; for a high-frequency word, its M-gram probability within the initial text is computed, where M is an integer between 1 and N inclusive; when M is at most a second threshold, a second offset, equal to the ratio of the word's frequency to the total frequency of all high-frequency words in the initial text, is added to the M-gram probability to obtain the corrected M-gram probability; when M exceeds the second threshold, the M-gram probability itself is used as the corrected M-gram probability.
In a further embodiment of the invention, the candidate-determination module comprises: a model-acquisition submodule that obtains a hidden Markov model, a context-dependent phone model, and a pronunciation dictionary model; a model-construction submodule that connects the hidden Markov model, the context-dependent phone model, the pronunciation dictionary model, and the biased language model in series, the output of each model serving as the input of the next, to build a decoding network; and a model-processing submodule that feeds the speech segment into the decoding network so that it outputs a plurality of candidate word sequences for the segment.
In a further embodiment of the invention, the preference-determination module uses the Viterbi algorithm to force-align each candidate word sequence with the acoustic features of the speech segment, yielding a state-sequence graph for the candidate sequences that is consistent with the acoustic features. The graph characterizes a set of state points and the transition probabilities between any two of them. The highest-probability path through the state points, the preferred path, is then found in the graph, and the preferred word sequence is determined from that path.
In a further embodiment of the invention, the annotation-determination module obtains the timestamp of each word in the initial text and the timestamp of each word in the preferred word sequence. When the initial text and the preferred word sequence contain different words for the same timestamp, a difference between them is recorded. When the number of differences is below a third threshold, the differing words are removed from both the initial text and the preferred word sequence, giving a corrected initial text and a corrected preferred word sequence; either of these is then selected as the annotation text of the speech segment.
In a further embodiment of the invention, the apparatus further comprises a discarding module that discards the speech segment when the number of differences between the initial text and the preferred word sequence is greater than or equal to the third threshold.
In a third aspect of embodiments of the invention, a medium is provided that stores computer-executable instructions which, when executed by a processor, implement the voice data processing method of any of the preceding embodiments.
In a fourth aspect of embodiments of the invention, a computing device is provided, comprising: a memory, a processor, and executable instructions stored on the memory and executable on the processor; when executing the instructions, the processor implements the voice data processing method of any of the preceding embodiments.
With the voice data processing method and apparatus of embodiments of the invention, speech segments need not be labeled manually: a biased language model is built from the initial text of the speech segment, a preferred word sequence is progressively screened out using that model, and the annotation text of the segment is finally determined by comparing the preferred word sequence with the initial text. The procedure is simple, highly automated, and can produce large amounts of high-quality annotated speech data in a short time, which can then be used to train and tune a speech recognition system and improve its recognition performance.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIGS. 1A and 1B schematically illustrate application scenarios of a voice data processing method and apparatus according to an embodiment of the present invention;
FIG. 2 schematically shows a flow diagram of a method of speech data processing according to one embodiment of the invention;
FIG. 3 schematically shows a flow chart of a method of processing speech data according to another embodiment of the invention;
FIG. 4 schematically illustrates a process diagram for building a bias language model according to one embodiment of the invention;
FIG. 5A schematically illustrates a partial diagram of the biased 1- through N-gram probabilities of each of a plurality of words, according to one embodiment of the present invention;
FIG. 5B schematically illustrates a diagram of a biased language model according to one embodiment of the invention;
FIG. 6 is a schematic diagram that schematically illustrates a sequence of candidate words, in accordance with an embodiment of the present invention;
FIG. 7 schematically illustrates a diagram of a linear FST of an initial text according to one embodiment of the invention;
FIG. 8 schematically illustrates a schematic diagram of a Viterbi forced alignment procedure in accordance with one embodiment of the invention;
FIG. 9 schematically shows a diagram of Viterbi forced alignment results according to one embodiment of the invention;
FIG. 10 schematically shows a block diagram of a speech data processing apparatus according to an embodiment of the present invention;
FIG. 11 schematically shows a schematic view of a computer-readable storage medium product according to an embodiment of the invention; and
FIG. 12 schematically shows a block diagram of a computing device according to an embodiment of the present invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to the embodiment of the invention, a voice data processing method, a voice data processing device, a voice data processing medium and a computing device are provided.
In this context, it is to be understood that the terms involved include: Automatic Speech Recognition (ASR), Language Model (LM), forced alignment, Voice Activity Detection (VAD), Finite-State Transducer (FST), Weighted Finite-State Transducer (WFST), word lattice, Hidden Markov Model (HMM), and so on. Speech recognition is the technology of converting speech data into corresponding text. A language model is a probability distribution over word sequences: for a sequence of a given length it produces a probability describing how likely the sequence is to occur in that language. Given a segment of speech and a segment of text, a forced alignment operation evaluates, from the perspective of the speech's acoustic features, how likely it is that the speech corresponds to the text. Voice activity detection detects whether speech is present in recorded data, and is typically used to locate the start and end of speech. A finite-state transducer describes a mapping between state sequences in one space and sequences in another. A weighted finite-state transducer is similar to an FST but additionally carries a weight on each transition edge, and is commonly used in the decoding module of conventional speech recognition systems. A word lattice is the set of all possible transcriptions of a piece of speech data, usually represented in FST form. A hidden Markov model is a standard way of modeling the acoustic model in a speech recognition system.
Moreover, any number of elements in the drawings are by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention.
Summary of The Invention
In the field of speech recognition, training and tuning a speech recognition system requires massive amounts of annotated speech data, i.e., speech data together with its corresponding annotation text. Both the quantity of annotated speech data and the accuracy of its annotation text strongly affect the training efficiency and the performance of the speech recognition system. Traditional approaches to acquiring annotated speech data suffer from heavy workload, high error rates, long turnaround time, and high demands on the professional skill of annotators. How to quickly acquire large amounts of annotated speech data while ensuring the accuracy of the annotation text is therefore an urgent problem.
To this end, embodiments of the present invention provide a voice data processing method and apparatus; the method may include an acquisition stage, a construction stage, and a determination stage. In the acquisition stage, a speech segment and an initial text for the speech segment are obtained. In the construction stage, a biased language model of the speech segment is built from the initial text. The determination stage divides into candidate determination, preference determination, and annotation determination: first, a plurality of candidate word sequences for the speech segment are determined from the constructed biased language model; each candidate sequence is then force-aligned with the acoustic features of the segment to select a preferred word sequence; finally, the annotation text of the segment is determined from the differences between the initial text and the preferred word sequence.
According to the technical solution of embodiments of the invention, speech segments need not be labeled manually: a biased language model is built from the initial text of the speech segment, a preferred word sequence is progressively screened out using that model, and the annotation text of the segment is finally determined by comparing the preferred word sequence with the initial text. The procedure is simple, highly automated, and can produce large amounts of high-quality annotated speech data in a short time, which can then be used to train and tune a speech recognition system and improve its recognition performance.
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Application scene overview
First, referring to fig. 1A and fig. 1B, an application scenario of the voice data processing method and the apparatus thereof according to the embodiment of the present invention is described in detail.
Fig. 1A and 1B schematically show application scenarios of a voice data processing method and a device thereof according to an embodiment of the present invention.
As shown in fig. 1A, the application scenario may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, and so forth.
The terminal devices 101, 102, 103 may be various electronic devices, which may have the same or different computing capabilities, including but not limited to smart speakers, smart phones, tablets, laptop and desktop computers, and the like. Various client applications, such as an application having a voice recognition function, etc., may be installed on the terminal apparatuses 101, 102, 103. Illustratively, as shown in fig. 1B, the terminal device 101 is a smart speaker, and the user controls a music playing application in the smart speaker to play music through a voice instruction.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The server 105 may be a server providing various services, such as a background management server (for example only) providing voice annotation data, a speech recognition model system, etc. to the terminal devices 101, 102, 103. The background management server may analyze and perform other processing on the received user request, and feed back a processing result (e.g., a webpage, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the voice data processing method provided by the embodiment of the present disclosure may be generally executed by the server 105. Accordingly, the voice data processing apparatus provided by the embodiment of the present disclosure may be generally disposed in the server 105. The voice data processing method provided by the embodiment of the present disclosure can also be executed by the terminal devices 101, 102, 103. Accordingly, the voice data processing apparatus provided by the embodiment of the present disclosure may be disposed in the terminal devices 101, 102, 103. Alternatively, the voice data processing method provided by the embodiment of the present disclosure may also be performed by other servers or server clusters capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the voice data processing apparatus provided by the embodiment of the present disclosure may also be disposed in other servers or server clusters capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number and types of terminal devices, networks, and servers in fig. 1A-1B are merely illustrative. There may be any number and any type of terminal devices, networks, and servers, depending on the actual needs.
Exemplary method
A voice data processing method according to an exemplary embodiment of the present invention is described below with reference to fig. 2 to 4, 5A, 5B, and 6 to 9 in conjunction with the application scenarios of fig. 1A and 1B. It should be noted that the above application scenarios are merely illustrated for the convenience of understanding the spirit and principles of the present invention, and the embodiments of the present invention are not limited in this respect. Rather, embodiments of the present invention may be applied to any scenario where applicable.
Fig. 2 schematically shows a flow chart of a speech data processing method according to an embodiment of the invention.
As shown in fig. 2, the method may include operations S210 to S250 as follows.
In operation S210, a voice segment and an initial text for the voice segment are acquired.
In operation S220, a biased language model of the speech segment is constructed based on the initial text.
The bias language model is obtained by adding a bias adjustment on top of a conventional language model, so that the resulting model better matches the statistical characteristics of actual language usage.
Then, in operation S230, a plurality of candidate vocabulary sequences of the speech segment are determined based on the constructed biased language model.
Each of the plurality of candidate vocabulary sequences represents one possible text corresponding to the speech segment, and together the candidate vocabulary sequences may constitute a word lattice of the speech segment. Each candidate vocabulary sequence may be composed of a plurality of words, each of which may be a complete word or a word fragment.
Next, in operation S240, a forced alignment operation is performed on the candidate vocabulary sequences and the acoustic features of the speech segments, respectively, to determine a preferred vocabulary sequence from the candidate vocabulary sequences.
Forcibly aligning each candidate vocabulary sequence with the acoustic features of the speech segment matches the candidate vocabulary sequence against the speech segment from the pronunciation perspective, so as to evaluate how likely the candidate vocabulary sequence is to correspond to the speech segment. The candidate vocabulary sequence most likely to correspond to the speech segment is then selected from the candidates as the preferred vocabulary sequence.
Next, in operation S250, a labeled text of the speech segment is determined based on a difference between the initial text and the preferred vocabulary sequence.
Those skilled in the art will understand that, according to the technical solution of the embodiments of the present invention, the speech segment does not need to be labeled manually; instead, a bias language model is constructed on the basis of the initial text of the speech segment, a preferred vocabulary sequence is progressively screened out according to the bias language model, and the labeled text of the speech segment is finally determined by comparing the preferred vocabulary sequence with the initial text. The process is simple, easy to implement, and highly automated; it can produce a large amount of high-quality speech annotation data in a short time for training and tuning a speech recognition model system, thereby improving its speech recognition performance.
In an embodiment of the present invention, the process of acquiring the voice segment and the initial text for the voice segment may include: voice data and text data for the voice data are acquired from existing data on the internet. Then, the voice data is subjected to endpoint detection and segmentation to obtain a plurality of voice segments. Next, for any one of the plurality of speech segments, determining an initial text for the any one of the speech segments from the text data, the initial text including: a plurality of words arranged in a predetermined order.
Illustratively, a large amount of speech-text data exists on the internet, comprising voice data and text data for that voice data, and a large part of the text data has been manually collated and therefore has a certain degree of accuracy. Sources of such speech-text data may include subtitles of online video websites, audio books, news audio with manuscripts, and the like. After the public speech-text data is obtained, VAD (voice activity detection) is applied to the voice data for endpoint detection and long-sentence segmentation, removing the parts without human voice or with strong background noise. After this step, a large number of segmented speech segments and the initial texts corresponding to them can be obtained, as shown in Table 1.
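The endpoint-detection-and-segmentation step can be sketched with a simple frame-energy gate. This is only an illustration of the idea (production systems use trained VAD models, as noted above); the function name, threshold, and silence-gap length are assumptions:

```python
import numpy as np

def energy_vad_segments(samples, rate, frame_ms=10, threshold=0.01, min_gap_frames=30):
    """Split audio into voiced segments using a frame-energy gate.

    Returns (start_sample, end_sample) pairs. A segment ends once a run
    of at least `min_gap_frames` low-energy frames is seen.
    """
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames.astype(np.float64) ** 2).mean(axis=1)
    voiced = energy > threshold

    segments, start, gap = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap_frames:  # long silence closes the segment
                segments.append((start * frame_len, (i - gap + 1) * frame_len))
                start, gap = None, 0
    if start is not None:  # audio ended while still voiced
        segments.append((start * frame_len, n_frames * frame_len))
    return segments
```

Applied to audio alternating between silence and speech, each voiced stretch becomes one segment to be paired with its initial text.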
Table 1 shows, by way of example, 5 speech segments obtained by performing endpoint detection and long-sentence segmentation on one piece of voice data, together with the initial texts extracted for these 5 speech segments from the text data for the voice data. Table 1 lists globally unique identification information for each speech segment and the corresponding initial text, each initial text comprising a plurality of words arranged in a predetermined order. For example, the initial text of speech segment BAC009S0916W0491 is { "to one", "add gold", "melt", "company", "loan", "more than ten thousand", "meta" }, a sequence of 7 tokens, in which some tokens are complete words, some are word fragments, and some may even contain wrongly written characters. The lexical composition, ordering, correctness, and completeness of each initial text all depend on the original text data. Since the voice data and the corresponding text data come from data already used and checked on the internet, the obtained correspondence between speech segment and initial text has a certain initial accuracy.
TABLE 1

| Identification information of speech segment | Initial text for the speech segment |
| BAC009S0916W0491 | Loan of more than one million yuan to a financial company |
| BAC009S0916W0492 | The company promises to loan Benxi and is responsible for repayment |
| BAC009S0916W0493 | The company encounters fund difficulty |
| BAC009S0916W0494 | There is a risk that the loan cannot be repaid as expected |
| BAC009S0916W0495 | The loaned staff is difficult to sleep and eat |
The above embodiment obtains the voice data and the corresponding text data from existing data on the internet to serve as the source and basis from which the labeled text is finally screened in the embodiments of the present invention. For a speech segment, the vocabulary sequences most likely to correspond to it may be screened from the initial text for that segment. However, recognition errors are inevitably present in the initial text. For example, a homophone may be misrecognized: "always" may be recognized as the homophonic "consistent", or "Benxi" may be confused with a homophone, as in the initial text of speech segment BAC009S0916W0492 in Table 1. To minimize the loss of data caused by such recognition errors, confusable words may be inserted into the collated initial text. Illustratively, the voice data processing method according to the embodiment of the present invention may further include: before the bias language model of the speech segment is constructed based on the initial text, determining whether any of the words contained in the initial text has an entry in a confusable-word dictionary. If so, the confusable words of that word are inserted into the initial text, at the same arrangement position in the initial text as the word itself.
For example, a confusable-word dictionary is preset, listing the confusable words of respective words. A confusable word may be another word that is homophonic with the word, another word that is synonymous with it, and so on. According to this embodiment, for the initial text { "company", "promise", "loan", "benxi", "general", "from", "company", "responsible", "repayment" } of speech segment BAC009S0916W0492, the words in the initial text are searched in the confusable-word dictionary in turn, and the word "benxi" is found to have a confusable homophone. That homophone is inserted into the initial text at the same arrangement position as "benxi", i.e., the fourth word position, yielding the initial text after confusion processing. The subsequent operations of constructing the bias language model, determining candidate vocabulary sequences, determining the preferred vocabulary sequence, and so on are performed on this confusion-processed initial text. Because the confusion-processed initial text covers the easily confused word sets as completely as possible, the situation in which the subsequent process cannot correct an initial text containing only the wrong word is avoided, so the correct labeled text can be screened out.
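The confusable-word insertion step can be sketched as follows. The dictionary contents, function name, and romanized tokens are hypothetical placeholders for the homophone/synonym tables described above:

```python
# Hypothetical confusion dictionary mapping a word to its confusable
# alternatives (homophones, synonyms). Real tables would be mined from
# pronunciation lexicons and thesauri.
CONFUSION_DICT = {"benxi": ["benxi_alt"]}

def expand_with_confusables(initial_text, confusion_dict):
    """Attach each word's confusable alternatives at the same position,
    so that later decoding can choose among them.

    The result is a list of alternative sets, one per word position,
    which is the lattice-like structure the bias language model is
    built over.
    """
    return [[word] + confusion_dict.get(word, []) for word in initial_text]
```

For the example text, position four then offers both "benxi" and its confusable alternative, while unambiguous positions keep a single candidate.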
Fig. 3 schematically shows a flowchart of a voice data processing method according to another embodiment of the present invention, which exemplarily illustrates an implementation of the above operation S220 of constructing the bias language model of the speech segment based on the initial text.
As shown in fig. 3, the method may include operations S221 to S223 as follows.
In operation S221, for each vocabulary among the plurality of vocabularies contained in the initial text, biased 1-gram through N-gram model probabilities of that vocabulary are calculated, where N is an integer greater than 1.

According to an embodiment of the present invention, operation S221 may calculate, for each vocabulary in the initial text, the biased 1-gram model probability of the vocabulary, the biased 2-gram model probability of the vocabulary, and so on up to the biased N-gram model probability of the vocabulary. The process may include: calculating the word frequency of the vocabulary; then, based on the word frequency, calculating the corrected 1-gram through corrected N-gram model probabilities of the vocabulary in the initial text; and then smoothing the corrected 1-gram through N-gram model probabilities to adjust the probability distribution among them, so as to obtain the biased 1-gram through N-gram model probabilities of the vocabulary.
Illustratively, adjusting the probability distribution among the corrected 1- to N-gram model probabilities may include: adjusting the distribution by a first offset, where the first offset may take different values for different languages.

For example, calculating the corrected 1- to N-gram model probabilities of a vocabulary in the initial text may include: when the word frequency of the vocabulary is greater than a first threshold, determining that the vocabulary is a high-frequency vocabulary. After determining that the vocabulary is high-frequency, the n-gram model probability of the vocabulary in the initial text is calculated, where n is an integer from 1 to N. When n is less than or equal to a second threshold, a second offset is added to the n-gram model probability to obtain the corrected n-gram model probability of the vocabulary in the initial text, the second offset being the ratio of the word frequency of the vocabulary to the total word frequency of high-frequency vocabularies in the initial text. When n is greater than the second threshold, the n-gram model probability itself is taken as the corrected n-gram model probability of the vocabulary in the initial text.
Operation S221 is explained below through a specific example. For a speech segment and the confusion-processed initial text for that segment, the word frequency of each distinct vocabulary in the initial text is counted. Vocabularies with word frequency greater than the first threshold are treated as high-frequency vocabularies, and those with word frequency less than or equal to the first threshold as non-high-frequency vocabularies.

For each vocabulary, the original n-gram model probability is calculated, representing the probability that the vocabulary appears given the context of the preceding n−1 vocabularies, for n from 1 to N. The corrected n-gram model probability of each vocabulary is then determined from its original n-gram model probability.
When calculating the corrected n-gram model probability of a high-frequency vocabulary, if n is less than or equal to the second threshold, the n-gram model probability is called a low-order model probability. In this case, according to the embodiment of the present invention, an additional probability is added on top of the original n-gram model probability via the second offset, to simulate what happens in real text, namely that the high-frequency vocabulary is more likely to occur in the subsequent decoding process; this yields the corrected n-gram model probability. If n is greater than the second threshold, the original n-gram model probability is used directly as the corrected n-gram model probability of the high-frequency vocabulary, with no bias added. For a non-high-frequency vocabulary, the original n-gram model probability is used directly as its corrected n-gram model probability. For example, let the second threshold be 1, let the original 1-gram model probability of a high-frequency vocabulary $w$ in the initial text be $p$, let $w$ occur $m_1$ times in the initial text, and let all high-frequency vocabularies in the initial text occur $m_2$ times in total. The corrected 1-gram model probability of $w$ is then:

$$p' = p + \frac{m_1}{m_2}$$
Further, the confusable words inserted during confusion processing may fail to contain the correct result; for example, the initial text contains "consistent" but the correct word "always" is never inserted. This situation may leave the correct result with a very low probability, or none at all, in the subsequently generated language model. Illustratively, embodiments of the present invention employ the Kneser-Ney smoothing strategy to mitigate this phenomenon: the probability space of the 1- to N-gram model probabilities is adjusted by subtracting a fixed value (i.e., the first offset) from the lower-order n-gram probabilities and redistributing it to the higher-order n-gram probabilities, thereby obtaining the biased n-gram model probabilities. With such a smoothing strategy, even vocabulary collocations that never appear in the initial text are assigned some probability. The smoothed biased n-gram model probability of a word $w_i$ can be calculated by the following formulas:

$$p_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\big(c_{KN}(w_{i-n+1}^{i}) - d,\, 0\big)}{\sum_{w'} c_{KN}(w_{i-n+1}^{i-1} w')} + \lambda(w_{i-n+1}^{i-1})\, p_{KN}(w_i \mid w_{i-n+2}^{i-1}) \qquad (1)$$

$$\lambda(w_{i-n+1}^{i-1}) = \frac{d}{\sum_{w'} c_{KN}(w_{i-n+1}^{i-1} w')}\, \big|\{w' : c_{KN}(w_{i-n+1}^{i-1} w') > 0\}\big| \qquad (2)$$

Formula (1) is a recursive equation, in which $w_{i-n+1}^{i-1}$ denotes the sequence from the (n−1)-th word preceding $w_i$ up to the word immediately before it, and the $c_{KN}$ function counts occurrence frequencies. The λ function is defined in formula (2), where the set-cardinality term counts the number of distinct words that follow the given prefix in the initial text. The fixed value $d$ subtracted in the numerator of formula (1) is the first offset of the Kneser-Ney smoothing strategy. In the embodiment of the present invention, the value of $d$ is not held constant as in conventional methods but may vary by language. When processing a language with relatively few common words, such as Chinese, $d$ is generally small, e.g., 0.1–0.3, so that the generated language model remains strongly biased and decoding searches along the paths set by the initial text. When processing a language with relatively many common words, such as English, $d$ is generally set larger, e.g., 0.3–0.5, so that the smoothed language model is not too severely bias-constrained and can still cover different word texts.
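The corrected 1-gram step described above (adding the "second offset" for high-frequency words) can be sketched as follows. The function name is an assumption, and the result is an unnormalized bias score rather than a true probability distribution:

```python
from collections import Counter

def corrected_unigram_probs(words, freq_threshold=1):
    """Corrected 1-gram probabilities as described in the text:
    a high-frequency word (count > freq_threshold) gets an extra offset
    equal to its share of all high-frequency occurrences (m1 / m2);
    other words keep their original relative frequency.
    """
    counts = Counter(words)
    total = len(words)
    # m2: total occurrences of all high-frequency words
    high_total = sum(c for c in counts.values() if c > freq_threshold)
    return {
        w: c / total + (c / high_total if c > freq_threshold else 0.0)
        for w, c in counts.items()
    }
```

For a toy text where "a" appears 3 times and "b" once, "a" receives 0.75 + 3/3 = 1.75 while "b" keeps 0.25, reproducing the preference for high-frequency words during decoding.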
Then, reference is continued to fig. 3. In operation S222, a state transition probability between any two vocabularies of the plurality of vocabularies is determined based on the biased 1-N meta grammar model probabilities of the vocabularies.
Next, in operation S223, a bias language model is constructed based on the state transition probability between any two vocabularies.
FIG. 4 schematically shows a process diagram for building a bias language model according to one embodiment of the invention. As shown in fig. 4, a speech segment and an initial text 410 for the speech segment are obtained, and the initial text 410 may be an initial text subjected to confusion processing, for example, the first five lines of words are words originally contained in the initial text, and the last line of words (in a dashed box) is a confusion word (including, for example, homophones and synonyms) additionally added to the initial text after the confusion processing. The initial text 410 is then subjected to a word frequency statistics operation 420 and a smoothing operation 430 to obtain a biased language model 440. The word frequency statistics operation 420 and the smoothing operation 430 are described in detail above, and are not described herein again. For example, the bias language model 440 is referred to in FIGS. 5A and 5B.
FIG. 5A schematically illustrates a partial view of biased 1-N meta grammar model probabilities for each of a plurality of words, according to one embodiment of the invention.
FIG. 5B schematically shows a diagram of a biased language model according to one embodiment of the invention.
After the biased 1- to N-gram model probabilities of the words in the initial text are calculated, they can be listed in text format, as shown in FIG. 5A. A fragment of the text format is shown in the figure: the first column gives the biased n-gram probability value, the second column the target vocabulary, and the third column the back-off cost. When a biased n-gram probability value of a word cannot be calculated because that n-gram combination of the word does not exist, it can be computed from the word's biased (n−1)-gram probability value together with the back-off cost attached to that (n−1)-gram probability. For example, for the word "difficulty", the 3-gram combination is {up, capital, difficulty}, the biased 3-gram probability value of the word is $p_1$, and its back-off cost is $p_2$. If the combination {encountered, committed, difficulty} does not appear in the initial text, the biased 4-gram probability value of "difficulty" cannot be computed directly, so $(p_1 + p_2)$ is used as the biased 4-gram probability value of "difficulty" in the initial text. It will be appreciated that, in addition to the fragment shown in FIG. 5A, the biased 1-gram through N-gram probability values of every word in the initial text would be listed.
Based on the biased 1- to N-gram model probabilities of all vocabularies in the initial text, the context of each vocabulary in the initial text can be obtained, and thus the state transition probability between any two vocabularies in the initial text can be calculated. Once the state transition probabilities between any two vocabularies are known, a bias language model in FST format as shown in fig. 5B can be obtained for subsequent WFST-based decoding. As shown in fig. 5B, each node corresponds to a state, and each edge between two nodes has a direction and is labeled with an input label, an output label, and a weight. According to embodiments of the present invention, the states may correspond to vocabularies, and the input label, output label, and weight may correspond to the preceding vocabulary, the following vocabulary, and the transition probability, respectively. In the example shown in fig. 5B, the edge from state 1 to state 5 is labeled "company:company/0.69315", where the first "company" is the input label, the second "company" is the output label, and the weight is 0.69315; both labels correspond to the following vocabulary, i.e., the vocabulary "company" corresponding to state 5.
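The conversion of state transition probabilities into weighted FST edges can be sketched as below. Using words as state identifiers and rounding to 5 decimals are simplifications of a real integer-state FST; the function name and arc tuple layout are assumptions:

```python
import math

def bigram_arcs(transition_probs):
    """Convert word-to-word transition probabilities into weighted FST
    arcs (src_state, dst_state, label, weight).

    The input and output labels coincide here (both are the following
    word), and the weight is the negative log probability, so p = 0.5
    yields 0.69315 as on the edge in fig. 5B.
    """
    return [
        (prev, nxt, nxt, round(-math.log(p), 5))
        for (prev, nxt), p in transition_probs.items()
    ]
```

The weight 0.69315 on the "company" edge in fig. 5B is consistent with this negative-log convention for a transition probability of 0.5.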
After the bias language model corresponding to each speech segment is obtained, a WFST-based decoding method is used to obtain the word lattice corresponding to each speech segment. According to an embodiment of the present invention, the process of determining the candidate vocabulary sequences of a speech segment based on the bias language model may include: obtaining a hidden Markov model, a context-dependent phoneme model, and a pronunciation dictionary model; connecting the hidden Markov model, the context-dependent phoneme model, the pronunciation dictionary model, and the bias language model in series, with the output of each model serving as the input of the next, to construct a decoding network; and then inputting the speech segment into the decoding network so that the decoding network outputs the plurality of candidate vocabulary sequences of the speech segment.
For an obtained speech segment, the above process aims to find the several most likely candidate vocabulary sequences corresponding to that segment, as expressed by the following formula (3):

$$\hat{W} = \arg\max_{W} \; p(Y \mid W)\, p(W) \qquad (3)$$

where $\hat{W}$ is the expected correct result, $Y$ is the input acoustic feature data, $W$ ranges over word sequences built from the initial text, and $p(W)$ is given by the bias language model constructed above. In an embodiment of the present invention, the complete decoding network may be a WFST composed of H, C, L and G, where H, C, L and G denote an HMM model, a context-dependent phoneme model, a pronunciation dictionary model, and the bias language model, respectively. The four models are connected in series, with the output of each serving as the input of the next (e.g., via the FST composition operation). For example, the first three models may be provided by a baseline speech recognition system, while the language model is the bias language model described above. After the acoustic model computes the distribution of transition probabilities for the speech segment, searching the HCLG decoding network yields a word lattice composed of the several possible recognition results for the speech. The word lattice may include multiple paths: a) paths with a relatively high probability of being correct, which may contain words from the initial text for the speech segment, and b) several other possible paths, which may contain the confusable words added in the preceding steps. These paths serve as the candidate vocabulary sequences described above, from which the correct target path, i.e., the preferred vocabulary sequence, is determined; the correct target path may occur in (a) as well as in (b).
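Equation (3), applied in the log domain to rescore candidate paths, can be sketched as follows. The candidate scores here are illustrative stand-ins for real acoustic-model and bias-language-model probabilities, and the function name is an assumption:

```python
import math

def best_candidate(candidates):
    """Pick the candidate word sequence maximizing
    log p(Y|W) + log p(W), the log-domain form of equation (3).

    `candidates` maps a word-sequence tuple to a pair
    (acoustic_prob, lm_prob).
    """
    return max(
        candidates,
        key=lambda w: math.log(candidates[w][0]) + math.log(candidates[w][1]),
    )
```

Note that a path with a weaker acoustic score can still win if the bias language model strongly favors it, which is exactly how the initial text steers decoding.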
FIG. 6 schematically illustrates candidate vocabulary sequences according to an embodiment of the present invention. For convenience of illustration, fig. 6 is cut into two parts at the dotted line; it should be understood that fig. 6 is in fact a whole obtained by joining the upper and lower parts at the dotted line.

Fig. 6 shows a plurality of candidate vocabulary sequences in the typical FST format; this word lattice in FST format, representing the speech segment, has the same graphical meaning as explained for fig. 5B, and the repeated parts are not described again. The labels marked on the edges representing state transitions include several possible confusable words. Starting from the leftmost initial state 0, transitions to the next state follow the connections defined by the edges, and the label on each traversed edge is appended to the result, until a final state such as 16/0.02832 is reached. The sequence formed by the labels on all traversed edges is the candidate vocabulary sequence corresponding to that path. A complete word lattice contains several paths from the initial state to a final state, i.e., several possible candidate vocabulary sequences.
After the candidate vocabulary sequences of the speech segment are determined, a preferred vocabulary sequence needs to be determined from among them. According to an embodiment of the present invention, forcibly aligning the candidate vocabulary sequences with the acoustic features of the speech segment to determine the preferred vocabulary sequence includes: performing a forced alignment operation between each candidate vocabulary sequence and the acoustic features of the speech segment using the Viterbi algorithm, to obtain a state sequence diagram structure for the candidate vocabulary sequences that conforms to the acoustic features. The state sequence diagram structure characterizes a plurality of state points and the transition probabilities between any two of them. A preferred path with the highest probability, composed of a sequence of state points, is then determined in the state sequence diagram structure, and the preferred vocabulary sequence is determined from the preferred path.
For example, after the candidate vocabulary sequences of the speech segment shown in fig. 6 are obtained, the initial text for the speech segment is converted into a linear FST: starting from the initial state, each edge in turn points to the next state and corresponds to the next vocabulary in the initial text, until the end of the text is reached. FIG. 7 schematically shows the linear FST of an initial text according to one embodiment of the present invention. It will be appreciated that for the initial text there is only one path from the initial state to the final state, i.e., one unique vocabulary sequence.
Then, a forced alignment operation is performed using the linear FST of the initial text and the word lattice generated in the previous step. Illustratively, a given acoustic feature is associated with a corresponding HMM state using a viterbi algorithm, the process being illustrated in fig. 8.
Fig. 8 schematically shows a Viterbi forced alignment process according to an embodiment of the present invention. A state sequence diagram structure is shown in which the horizontal axis represents the acoustic features of the input speech segment, with adjacent acoustic features separated by, e.g., 10 milliseconds, and the vertical axis represents the HMM states used in acoustic modeling. Generally, in an HMM-based acoustic model, the HMM states correspond to the smallest phoneme units. In this system, from each state it is possible to stay in the current state via a self-loop, to advance into the next state, or to skip one state directly into the state after it. There are several possible ways to align each frame to a state; one of them is represented by the dashed path in fig. 8, in which the first frame corresponds to state 2, the second to fourth frames correspond to state 3, and so on. The process of determining the preferred vocabulary sequence is thus to find the most likely path among the several possible alignment results.
For example, forced alignment is done using a forward-backward algorithm based on dynamic programming, defining the following variables:
Figure BDA0002395381860000201
Equation (4) defines, at time t, the optimal probability over all state paths that account for the acoustic feature sequence O_1 O_2 … O_t and end in hidden state i. Here O_t is the observation corresponding to a certain phoneme, λ denotes the HMM model parameters, and q_t denotes the hidden state at time t. By derivation and induction, one obtains:
δ_{t+1}(j) = max_{1 ≤ i ≤ N} [δ_t(i) · a_{ij}] · b_j(O_{t+1})   (5)
Equation (5) gives, at time t + 1, the optimal probability of the acoustic feature sequence O_1, O_2, …, O_t, O_{t+1} with the hidden state at time t + 1 being j: it is the best value of δ_t(i) multiplied by the transition probability a_{ij} from state i to state j, further multiplied by the probability b_j(O_{t+1}) of observing O_{t+1} in state j. Following this recurrence from the initial state, the system records at every forward step the state that achieves the optimal probability, until the acoustic features are exhausted.
For example, starting from coordinate position (2, 3) in fig. 8 (meaning that the second frame corresponds to state 3), when deriving the state information for the third frame, the three edges leaving (2, 3) are examined in turn; they correspond, respectively, to a self-loop staying in state 3, a transition into state 4, and a skip into state 5. Looking up the transition probabilities in the HMM model shows that the self-loop has the highest probability in this case, so the path from (2, 3) to (3, 3) is selected and the current position (3, 3) is recorded. When the final state 6 is reached, backtracking through all previously recorded positions yields the optimal state sequence for the speech: 1233456, i.e. the determined preferred vocabulary sequence.
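The recurrence of equations (4) and (5), together with the backtracking just described, can be sketched in a few lines. This is a toy illustration with made-up probabilities, not the patent's implementation:

```python
import numpy as np

def viterbi_align(log_init, log_trans, log_obs):
    """Viterbi recursion of equations (4)-(5): delta[t, j] is the best log
    probability of any state path that accounts for frames 0..t and ends in
    state j; psi records the argmax predecessor used for backtracking."""
    T, S = log_obs.shape
    delta = np.full((T, S), -np.inf)
    psi = np.zeros((T, S), dtype=int)
    delta[0] = log_init + log_obs[0]
    for t in range(1, T):
        # scores[i, j] = delta_t(i) + log a_ij, as in equation (5)
        scores = delta[t - 1][:, None] + log_trans
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_obs[t]   # + log b_j(O_{t+1})
    # backtrack the recorded optimal-probability states
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy example (made-up probabilities): two HMM states, three frames; frame 1
# matches state 0 and frames 2-3 match state 1, so the alignment is [0, 1, 1].
eps = 1e-10
log_init = np.log([1.0, eps])
log_trans = np.log([[0.5, 0.5], [eps, 1.0]])
log_obs = np.log([[0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
assert viterbi_align(log_init, log_trans, log_obs) == [0, 1, 1]
```

Working in log probabilities avoids the numerical underflow that multiplying many small probabilities would cause on long feature sequences.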
After the preferred vocabulary sequence is determined, the annotated text of the speech segment can be determined by comparison with the initial text. In another embodiment of the present invention, determining the annotated text of the speech segment based on the difference between the initial text and the preferred vocabulary sequence includes the following. On one hand, the time stamp information of each of the words in the initial text is acquired; on the other hand, the time stamp information of each of the words in the preferred vocabulary sequence is acquired. When the initial text and the preferred vocabulary sequence carry different words for the same time stamp information, a difference between the initial text and the preferred vocabulary sequence is recorded. When the number of such differences is less than a third threshold, the initial text and the preferred vocabulary sequence are sufficiently similar and of high accuracy, so the differing words can be removed from both to obtain a corrected initial text and a corrected preferred vocabulary sequence. One of the two is then selected as the annotated text of the speech segment for subsequent training and tuning of the speech recognition system.
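A minimal sketch of this timestamp-based comparison, assuming each sequence is a list of (word, start-time) pairs and an illustrative third threshold of 3 (the patent does not fix its value):

```python
def diff_positions(initial, preferred):
    """Indices where the two sequences carry different words at the same
    time stamp; each sequence is a list of (word, start_time) pairs."""
    return [i for i, ((w1, t1), (w2, t2)) in enumerate(zip(initial, preferred))
            if t1 == t2 and w1 != w2]

def choose_annotation(initial, preferred, third_threshold=3):
    diffs = diff_positions(initial, preferred)
    if len(diffs) >= third_threshold:
        return None                      # too many differences: discard segment
    # remove the differing words to obtain the corrected vocabulary sequence
    return [w for i, (w, _) in enumerate(preferred) if i not in diffs]

initial = [("one", 0.0), ("fund", 0.5), ("circle", 3.2)]
preferred = [("one", 0.0), ("fund", 0.5), ("yuan", 3.2)]
assert diff_positions(initial, preferred) == [2]
assert choose_annotation(initial, preferred) == ["one", "fund"]
assert choose_annotation(initial, preferred, third_threshold=1) is None
```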
Further, in another embodiment of the present invention, the voice data processing method may further include: when the number of differences between the initial text and the preferred vocabulary sequence is greater than or equal to the third threshold, the two differ too much and the accuracy is low, so the current speech segment is discarded and other speech segments are used instead to determine the corresponding annotated text.
For example, after the preferred vocabulary sequence of the speech segment is determined, the specific time at which each word occurs, i.e. the time stamp information of each word in the preferred vocabulary sequence, can be obtained by combining the frame-shift information on the abscissa of the state sequence diagram structure shown in fig. 8. Similarly, for the linear FST of the initial text shown in fig. 7, the time stamp information of each word in the initial text can be obtained in combination with the same frame-shift information; see fig. 9.
Fig. 9 schematically shows a Viterbi forced alignment result according to an embodiment of the invention. As shown in fig. 9, the portions of the speech segment and the words that differ substantially from the initial text are cut off based on the time stamp information, leaving a corrected speech segment, a corrected initial text, and a corrected preferred vocabulary sequence (the latter two may be collectively referred to as corrected text). For example, suppose the preferred vocabulary sequence is {one, fund, finance, company, loan, many thousand, yuan} and the initial text is {one, fund, finance, company, loan, many thousand, circle}. If the differing words "yuan" and "circle" occur from 3.2 s to 3.4 s, only the data from 0 to 3.2 s of the speech segment is kept, and the corrected text becomes {one, fund, finance, company, loan, many thousand}. When cutting out the differing portions of the text, common spoken-language phenomena that may be present in the speech segment, such as consecutive repetition and hesitation (e.g. the speaker saying "this, this, this"), are also taken into account. If such a phenomenon appears in the forced-aligned preferred vocabulary sequence but not in the initial text, the initial text is automatically modified to match the preferred vocabulary sequence, adapting to the consecutive repetitions that frequently occur in real spoken-language scenarios and improving the accuracy of the finally determined annotated text.
Illustratively, after the above processing is completed, the duration of the cut-out portion of the speech segment is counted. If the ratio of this duration to the total duration of the speech segment is greater than or equal to a set threshold, the speech segment is discarded; if the ratio is below the threshold, one of the corrected texts is taken as the annotated text of the speech segment.
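The trimming and duration-ratio check described above can be sketched as follows, with an assumed ratio threshold of 0.2 (the patent leaves the threshold unspecified):

```python
def trim_and_filter(samples, sample_rate, words_with_times, cut_start,
                    total_dur, max_cut_ratio=0.2):
    """Cut everything from cut_start seconds onward, then keep the segment
    only if the cut-out share of the total duration is below the threshold."""
    cut_dur = total_dur - cut_start
    if cut_dur / total_dur >= max_cut_ratio:
        return None                          # discard the whole speech segment
    kept_audio = samples[: int(cut_start * sample_rate)]
    kept_words = [w for w, t in words_with_times if t < cut_start]
    return kept_audio, kept_words

# The "yuan"/"circle" example above: a 3.4 s segment cut at 3.2 s loses only
# about 6% of its duration, so it is kept.
samples = [0.0] * 54400                       # 3.4 s at 16 kHz
words = [("one", 0.0), ("loan", 2.0), ("yuan", 3.2)]
audio, kept = trim_and_filter(samples, 16000, words, 3.2, 3.4)
assert len(audio) == 51200 and kept == ["one", "loan"]
assert trim_and_filter(samples, 16000, words, 1.0, 3.4) is None
```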
Further, according to an embodiment of the invention, after the annotated text of the speech segment is determined, a third-party speech recognition system can be used to review the result, ensuring that the selected annotated text has high accuracy.
The technical solution provided by the embodiments of the present invention obtains publicly available speech segments with relatively accurate initial texts from the internet, and determines the annotated text of each speech segment through a confusable-vocabulary processing procedure, a biased language model construction procedure, a candidate vocabulary sequence determination procedure, a forced alignment procedure, and a procedure of comparing and correcting the alignment result against the initial text, so as to support the training and tuning of a speech recognition system. The scheme greatly accelerates the acquisition of annotated text for speech data, improves the accuracy of speech annotation, and greatly reduces the cost of data acquisition; it is also applicable to data in multiple languages, such as Chinese, English, Japanese and Korean, as well as mixed-language data, providing strong data support for the training, tuning and iteration of speech recognition systems. Verification shows that the word error rate (WER) of the annotated text acquired by this scheme is below 1% and the sentence error rate (SER) is below 5%, noticeably better than the accuracy of speech data annotated with prior-art methods.
Exemplary devices
Having described the method of the exemplary embodiment of the present invention, next, a speech data processing apparatus of the exemplary embodiment of the present invention will be explained in detail with reference to fig. 10.
Fig. 10 schematically shows a block diagram of a speech data processing device according to an embodiment of the invention.
As shown in fig. 10, the voice data processing apparatus 1000 may include: an acquisition module 1010, a construction module 1020, a candidate determination module 1030, a preference determination module 1040, and an annotation determination module 1050.
The obtaining module 1010 is used for obtaining a voice segment and an initial text for the voice segment.
The construction module 1020 is configured to construct a biased language model of the speech segment based on the initial text.
The candidate determining module 1030 is configured to determine a plurality of candidate vocabulary sequences of the speech segment based on the constructed biased language model.
The preference determining module 1040 is configured to perform a forced alignment operation on the candidate vocabulary sequences and the acoustic features of the speech segments, respectively, to determine a preferred vocabulary sequence from the candidate vocabulary sequences.
The annotation determination module 1050 is used to determine the annotated text of the speech segment based on the difference between the initial text and the preferred vocabulary sequence.
In an embodiment of the present invention, the obtaining module 1010 may include: the system comprises an acquisition submodule, a preprocessing submodule and a text extraction submodule.
The acquisition submodule is used for acquiring voice data, and text data corresponding to the voice data, from existing data on the internet. The preprocessing submodule then performs endpoint detection and segmentation on the voice data to obtain a plurality of speech segments. Next, for any one of the plurality of speech segments, the text extraction submodule determines from the text data an initial text for that speech segment, the initial text comprising a plurality of words arranged in a predetermined order.
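The patent does not detail the endpoint-detection algorithm itself; a common energy-based sketch that would produce such segment boundaries (all thresholds here are assumed example values) is:

```python
import math
import numpy as np

def split_on_silence(samples, sample_rate, frame_ms=30, energy_thresh=1e-4,
                     min_silence_frames=10):
    """Energy-based endpoint detection sketch: frames whose mean energy falls
    below a threshold for long enough mark a segment boundary. Returns a list
    of (start_sample, end_sample) pairs."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = np.asarray(samples[: n_frames * frame_len]).reshape(n_frames, frame_len)
    voiced = (frames ** 2).mean(axis=1) > energy_thresh
    segments, start, silence_run = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i
            silence_run = 0               # a short pause stays inside the segment
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                segments.append((start * frame_len, (i - silence_run + 1) * frame_len))
                start, silence_run = None, 0
    if start is not None:
        segments.append((start * frame_len, n_frames * frame_len))
    return segments

# 1 s tone, 0.5 s silence, 1 s tone -> two speech segments
sr = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
segs = split_on_silence(tone + [0.0] * (sr // 2) + tone, sr)
assert len(segs) == 2
assert segs[0][0] == 0 and segs[1][0] == 24000
```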
In another embodiment of the present invention, the apparatus 1000 may further include a confusion processing module, configured to determine, before the construction module constructs the biased language model of the speech segment based on the initial text, whether any of the words contained in the initial text has a confusable counterpart. If so, the confusable word is inserted into the initial text at the same arrangement position as the word it may be confused with.
In another embodiment of the present invention, the building module 1020 may include: the device comprises a first calculation submodule, a second calculation submodule and a construction submodule.
The first calculation submodule is used for calculating the biased 1-N meta grammar model probability of any vocabulary in a plurality of vocabularies contained in the initial text, wherein N is an integer larger than 1. Then, the second calculation submodule is used for determining the state transition probability between any two vocabularies in the plurality of vocabularies based on the bias 1-N meta grammar model probabilities of the vocabularies respectively. And then, the construction submodule is used for constructing a bias language model based on the state transition probability between any two vocabularies.
In still another embodiment of the present invention, the first calculating sub-module may include: the device comprises a word frequency calculation unit, a model probability calculation unit and a smoothing processing unit.
The word frequency calculating unit is used for calculating the word frequency of any vocabulary. Then, the model probability calculating unit is used for calculating the corrected 1-N element grammar model probability of any vocabulary in the initial text based on the word frequency of the vocabulary. And the smoothing unit is used for smoothing the corrected 1-N meta-grammar model probabilities to adjust the probability distribution among the corrected 1-N meta-grammar model probabilities so as to obtain the biased 1-N meta-grammar model probabilities of any vocabulary.
In another embodiment of the present invention, the process of adjusting the probability distribution among the modified 1-N meta-grammar model probabilities by the smoothing unit may specifically include: and adjusting and correcting the probability distribution among the probabilities of the 1-N element grammar models through a first offset, wherein the first offset has different values for different languages.
In another embodiment of the present invention, the process by which the model probability calculating unit calculates the modified 1-N meta-grammar model probability of any vocabulary in the initial text may specifically include the following. When the word frequency of a vocabulary is greater than a first threshold, the vocabulary is determined to be a high-frequency vocabulary. After that, the M-gram model probability of the vocabulary in the initial text is calculated, where M is an integer greater than or equal to 1 and less than or equal to N. When M is less than or equal to a second threshold, a second offset is added to the M-gram model probability to obtain the modified M-gram model probability of the vocabulary in the initial text, where the second offset is the ratio of the word frequency of the vocabulary to the total word frequency of the high-frequency vocabularies in the initial text. When M is greater than the second threshold, the M-gram model probability itself is taken as the modified M-gram model probability of the vocabulary in the initial text.
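A sketch of this "second offset" correction, with assumed threshold values and an illustrative base probability (the patent does not specify either):

```python
from collections import Counter

def modified_ngram_prob(word, text_words, mgram_prob, m,
                        first_threshold=2, second_threshold=2):
    """If `word` is high-frequency (count > first_threshold) and the order m
    does not exceed the second threshold, boost the m-gram probability by the
    word's share of the total high-frequency word count (the second offset)."""
    counts = Counter(text_words)
    high_freq = {w: c for w, c in counts.items() if c > first_threshold}
    if word in high_freq and m <= second_threshold:
        offset = high_freq[word] / sum(high_freq.values())
        return mgram_prob + offset
    return mgram_prob

text = ["a"] * 5 + ["b"] * 4 + ["c"]         # "a" and "b" are high-frequency
assert abs(modified_ngram_prob("a", text, 0.1, 1) - (0.1 + 5 / 9)) < 1e-12
assert modified_ngram_prob("a", text, 0.1, 3) == 0.1   # m above second threshold
assert modified_ngram_prob("c", text, 0.1, 1) == 0.1   # not high-frequency
```

Boosting only low-order probabilities of high-frequency words biases the model toward the initial text's vocabulary without distorting the higher-order context statistics.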
In still another embodiment of the present invention, the candidate determining module 1030 may include: the model acquisition sub-module, the model construction sub-module and the model processing sub-module.
The model obtaining submodule is used for obtaining a hidden Markov model, a context-dependent phone model and a pronunciation dictionary model. The model construction submodule then connects the hidden Markov model, the context-dependent phone model, the pronunciation dictionary model and the biased language model in series, taking the output of each model as the input of the next, so as to construct a decoding network. The model processing submodule then inputs the speech segment into the decoding network, so that the decoding network outputs a plurality of candidate vocabulary sequences for the speech segment.
In yet another embodiment of the present invention, the preference determining module 1040 may include: an alignment sub-module, a preferred path determination sub-module, and a preferred sequence determination sub-module.
The alignment submodule is used for performing a forced alignment operation on the candidate vocabulary sequences and the acoustic features of the speech segment using the Viterbi algorithm, so as to obtain, for the candidate vocabulary sequences, a state sequence diagram structure consistent with the acoustic features. The state sequence diagram structure characterizes a plurality of state points and the transition probabilities between any two of them. The preferred path determination submodule then determines the path through the state points with the highest probability in the state sequence diagram structure. Next, the preferred sequence determination submodule determines the preferred vocabulary sequence from that preferred path.
In a further embodiment of the present invention, the label determining module 1050 may include: the device comprises a timestamp acquisition sub-module, a comparison sub-module, a modification sub-module and a label selection sub-module.
The time stamp obtaining submodule is used for obtaining the time stamp information of each word in the initial text on one hand, and the time stamp information of each word in the preferred vocabulary sequence on the other. The comparison submodule compares the words of the initial text and of the preferred vocabulary sequence at the same time stamp information to determine whether a difference exists between them. The correction submodule removes the differing words from the initial text and the preferred vocabulary sequence when the number of differences between them is less than a third threshold, so as to obtain a corrected initial text and a corrected preferred vocabulary sequence. The label selection submodule then selects either the corrected initial text or the corrected preferred vocabulary sequence as the labeled text of the speech segment.
In yet another embodiment of the present invention, the apparatus 1000 may further comprise a discarding module for discarding the speech segment when the number of times the difference exists between the initial text and the preferred vocabulary sequence is greater than or equal to a third threshold.
It should be noted that the implementation, solved technical problems, implemented functions, and achieved technical effects of each module/unit/subunit and the like in the apparatus part embodiment are respectively the same as or similar to the implementation, solved technical problems, implemented functions, and achieved technical effects of each corresponding step in the method part embodiment, and are not described herein again.
Exemplary Medium
Having described the method and apparatus of exemplary embodiments of the present invention, a medium for implementing a voice data processing method of exemplary embodiments of the present invention will be described.
An embodiment of the present invention provides a medium storing computer-executable instructions, which when executed by a processor, are configured to implement the voice data processing method according to any one of the above method embodiments.
In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a computing device to carry out the operational steps of the speech data processing method according to various exemplary embodiments of the invention described in the above section "exemplary methods" of this specification, when said program product is run on said computing device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Fig. 11 schematically shows a computer-readable storage medium product according to an embodiment of the present invention. As shown in fig. 11, a program product 110 for implementing a voice data processing method according to an embodiment of the present invention is described; it may employ a portable compact disc read-only memory (CD-ROM), include program code, and run on a computing device such as a personal computer. However, the program product of the present invention is not limited thereto, and in this document a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java or C++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Exemplary computing device
Having described the method, medium, and apparatus of exemplary embodiments of the present invention, a computing device for implementing a voice data processing method according to another exemplary embodiment of the present invention is described next.
An embodiment of the present invention further provides a computing device, including: the voice data processing system comprises a memory, a processor and executable instructions stored on the memory and executable on the processor, wherein the processor realizes the voice data processing method in any one of the above method embodiments when executing the instructions.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit", "module" or "system".
In some possible embodiments, a computing device for implementing a method of speech data processing according to the invention may comprise at least one processing unit, and at least one memory unit. Wherein the storage unit stores program code which, when executed by the processing unit, causes the processing unit to perform the operational steps in the voice data processing method according to various exemplary embodiments of the present invention described in the above section "exemplary method" of the present specification.
A computing device 120 for implementing a voice data processing method according to this embodiment of the present invention is described below with reference to fig. 12. The computing device 120 shown in FIG. 12 is only one example and should not impose any limitations on the functionality or scope of use of embodiments of the present invention.
As shown in fig. 12, computing device 120 is embodied in the form of a general purpose computing device. Components of computing device 120 may include, but are not limited to: the at least one processing unit 1201, the at least one memory unit 1202, and the bus 1203 connecting the various system components (including the memory unit 1202 and the processing unit 1201).
Bus 1203 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures.
The storage unit 1202 may include readable media in the form of volatile memory, such as Random Access Memory (RAM)12021 and/or cache memory 12022, and may further include Read Only Memory (ROM) 12023.
The storage unit 1202 may also include a program/utility 12025 having a set (at least one) of program modules 12024, such program modules 12024 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Computing device 120 may also communicate with one or more external devices 1204 (e.g., keyboard, pointing device, bluetooth device, etc.), may also communicate with one or more devices that enable a user to interact with computing device 120, and/or communicate with any devices (e.g., router, modem, etc.) that enable computing device 120 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 1205. Also, computing device 120 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) through network adapter 1206. As shown, network adapter 1206 communicates with other modules of computing device 120 over bus 1203. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computing device 120, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
It should be noted that although in the above detailed description several units/modules or sub-units/modules of the processing means of speech data are mentioned, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module according to embodiments of the invention. Conversely, the features and functions of one unit/module described above may be further divided into embodiments by a plurality of units/modules.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, nor is the division of aspects, which is for convenience only as the features in such aspects may not be combined to benefit. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A method of speech data processing, comprising:
acquiring a voice fragment and an initial text aiming at the voice fragment;
building a bias language model of the voice fragment based on the initial text;
determining a plurality of candidate vocabulary sequences of the speech segment based on the biased language model;
performing forced alignment operation on the candidate word sequences and the acoustic features of the voice segments respectively to determine a preferred word sequence from the candidate word sequences; and
determining an annotated text of the speech segment based on a difference between the initial text and the preferred vocabulary sequence.
2. The method of claim 1, wherein the obtaining a speech segment and initial text for the speech segment comprises:
acquiring voice data and text data aiming at the voice data from existing data of the Internet;
carrying out endpoint detection and segmentation on the voice data to obtain a plurality of voice segments; and
for any of the plurality of speech segments, determining an initial text for the any speech segment from the text data, the initial text comprising: a plurality of words arranged in a predetermined order.
3. The method of claim 2, further comprising:
determining, for any vocabulary in the plurality of vocabularies, whether there is a confusable vocabulary in the any vocabulary before the building of the biased language model of the speech segment based on the initial text; and
if yes, inserting the confusable words of any word into the initial text, wherein the confusable words of any word are arranged at the same positions in the initial text as the position of any word.
4. The method of claim 2, wherein said building a biased language model of the speech segment based on the initial text comprises:
calculating the bias 1-N meta-grammar model probability of any vocabulary in the plurality of vocabularies, wherein N is an integer greater than 1;
determining the state transition probability between any two vocabularies in the plurality of vocabularies based on the bias 1-N element grammar model probability of each vocabulary; and
and constructing the bias language model based on the state transition probability between any two vocabularies.
5. The method of claim 4, wherein said calculating biased 1-N meta-grammar model probabilities for said any vocabulary includes:
calculating the word frequency of any vocabulary;
calculating the probability of a corrected 1-N element grammar model of any vocabulary in the initial text based on the word frequency of the vocabulary; and
and smoothing the corrected 1-N meta-grammar model probabilities to adjust the probability distribution among the corrected 1-N meta-grammar model probabilities so as to obtain the biased 1-N meta-grammar model probabilities of any vocabulary.
6. The method of claim 5, wherein said adjusting the probability distribution between the modified 1-N meta-grammar model probabilities comprises:
and adjusting the probability distribution among the probabilities of the corrected 1-N element grammar models through a first offset, wherein the first offset has different values for different languages.
7. The method of claim 5, wherein said calculating a revised 1-N meta grammar model probability of said any vocabulary in said initial text comprises:
when the word frequency of any vocabulary is larger than a first threshold value, determining that the any vocabulary is a high-frequency vocabulary;
calculating the M-element grammar model probability of any vocabulary in the initial text, wherein M is an integer which is more than or equal to 1 and less than or equal to N;
when M is smaller than or equal to a second threshold value, adding a second offset to the M-ary grammar model probability to obtain a corrected M-ary grammar model probability of any vocabulary in the initial text, wherein the second offset is a ratio of the word frequency of any vocabulary to the total word frequency of high-frequency vocabularies in the initial text; and
and when M is larger than a second threshold value, taking the M-gram model probability as the corrected M-gram model probability of any vocabulary in the initial text.
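The correction rule of claim 7 is concrete enough to sketch directly. The threshold values below are hypothetical (the claim leaves them unspecified), and the caller is assumed to supply the raw M-gram probability and the frequency counts:

```python
def corrected_mgram_prob(word, m, mgram_prob, word_freq, total_hf_freq,
                         freq_threshold=5, order_threshold=2):
    """Corrected M-gram probability per the claimed scheme: for a
    high-frequency word (freq > first threshold) and low orders
    (M <= second threshold), add a second offset equal to the word's
    frequency divided by the total high-frequency word count; otherwise
    return the M-gram probability unchanged. Thresholds are illustrative."""
    is_high_freq = word_freq > freq_threshold
    if is_high_freq and m <= order_threshold:
        return mgram_prob + word_freq / total_hf_freq
    return mgram_prob
```

The effect is to boost low-order probabilities of words that dominate the initial text, so the decoder prefers them even in contexts the initial text never showed.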
8. A speech data processing apparatus, comprising:
an acquisition module, configured to acquire a speech segment and an initial text for the speech segment;
a construction module, configured to construct a biased language model of the speech segment based on the initial text;
a candidate determination module, configured to determine a plurality of candidate vocabulary sequences for the speech segment based on the biased language model;
a preference determination module, configured to perform a forced alignment operation on each of the plurality of candidate vocabulary sequences against the acoustic features of the speech segment, so as to determine a preferred vocabulary sequence from the plurality of candidate vocabulary sequences; and
an annotation determination module, configured to determine the annotated text of the speech segment based on the difference between the initial text and the preferred vocabulary sequence.
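The last two modules of claim 8 (preference determination and annotation determination) can be sketched together. `align_score` stands in for the forced-alignment scoring against acoustic features and is an assumed callable, not a real library API:

```python
def select_annotation(initial_text, candidates, align_score):
    """Pick the candidate vocabulary sequence whose forced-alignment score
    against the acoustic features is best (higher = better fit), then derive
    the annotated text from its word-level differences with the initial text."""
    preferred = max(candidates, key=align_score)
    # positions where the preferred sequence disagrees with the initial text
    diff = [(a, b) for a, b in zip(initial_text, preferred) if a != b]
    annotated = initial_text if not diff else preferred
    return preferred, diff, annotated
```

In practice the score would come from an acoustic model (e.g. per-frame log-likelihoods of a forced alignment); here any numeric scoring function demonstrates the selection logic.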
9. A medium storing computer-executable instructions which, when executed by a processor, implement:
the speech data processing method according to any one of claims 1 to 7.
10. A computing device, comprising: a memory, a processor, and executable instructions stored in the memory and executable on the processor, wherein the processor, when executing the instructions, implements:
the speech data processing method according to any one of claims 1 to 7.
CN202010129408.4A 2020-02-28 2020-02-28 Voice data processing method, device, medium and computing equipment Active CN111326144B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010129408.4A CN111326144B (en) 2020-02-28 2020-02-28 Voice data processing method, device, medium and computing equipment

Publications (2)

Publication Number Publication Date
CN111326144A true CN111326144A (en) 2020-06-23
CN111326144B CN111326144B (en) 2023-03-03

Family

ID=71167156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010129408.4A Active CN111326144B (en) 2020-02-28 2020-02-28 Voice data processing method, device, medium and computing equipment

Country Status (1)

Country Link
CN (1) CN111326144B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103366742A (en) * 2012-03-31 2013-10-23 盛乐信息技术(上海)有限公司 Voice input method and system
CN103714048A (en) * 2012-09-29 2014-04-09 国际商业机器公司 Method and system used for revising text
WO2017136016A1 (en) * 2016-02-05 2017-08-10 Google Inc. Re-recognizing speech with external data sources
CN107729321A (en) * 2017-10-23 2018-02-23 上海百芝龙网络科技有限公司 A kind of method for correcting error of voice identification result
CN108091328A (en) * 2017-11-20 2018-05-29 北京百度网讯科技有限公司 Speech recognition error correction method, device and readable medium based on artificial intelligence
CN108682420A (en) * 2018-05-14 2018-10-19 平安科技(深圳)有限公司 A kind of voice and video telephone accent recognition method and terminal device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113096648A (en) * 2021-03-20 2021-07-09 杭州知存智能科技有限公司 Real-time decoding method and device for speech recognition
CN113539241A (en) * 2021-07-28 2021-10-22 广州华多网络科技有限公司 Speech recognition correction method and corresponding device, equipment and medium
CN113539241B (en) * 2021-07-28 2023-04-25 广州华多网络科技有限公司 Speech recognition correction method and corresponding device, equipment and medium thereof
CN113948065A (en) * 2021-09-01 2022-01-18 北京数美时代科技有限公司 Method and system for screening error blocking words based on n-gram model
CN113948065B (en) * 2021-09-01 2022-07-08 北京数美时代科技有限公司 Method and system for screening error blocking words based on n-gram model

Also Published As

Publication number Publication date
CN111326144B (en) 2023-03-03

Similar Documents

Publication Publication Date Title
CN110473531B (en) Voice recognition method, device, electronic equipment, system and storage medium
CN109887497B (en) Modeling method, device and equipment for speech recognition
CN109635270B (en) Bidirectional probabilistic natural language rewrite and selection
CN110033760B (en) Modeling method, device and equipment for speech recognition
US8311825B2 (en) Automatic speech recognition method and apparatus
US8423351B2 (en) Speech correction for typed input
US8065149B2 (en) Unsupervised lexicon acquisition from speech and text
US20200082808A1 (en) Speech recognition error correction method and apparatus
CN111326144B (en) Voice data processing method, device, medium and computing equipment
CN113692616B (en) Phoneme-based contextualization for cross-language speech recognition in an end-to-end model
US8849668B2 (en) Speech recognition apparatus and method
JP2015094848A (en) Information processor, information processing method and program
CN110415679B (en) Voice error correction method, device, equipment and storage medium
WO2014048172A1 (en) Method and system for correcting text
CN112331229B (en) Voice detection method, device, medium and computing equipment
CN111611349A (en) Voice query method and device, computer equipment and storage medium
CN112216284B (en) Training data updating method and system, voice recognition method and system and equipment
CN112784581A (en) Text error correction method, device, medium and electronic equipment
CN112580340A (en) Word-by-word lyric generating method and device, storage medium and electronic equipment
CN113948066A (en) Error correction method, system, storage medium and device for real-time translation text
CN113225612B (en) Subtitle generating method, device, computer readable storage medium and electronic equipment
CN117099157A (en) Multitasking learning for end-to-end automatic speech recognition confidence and erasure estimation
JP4653598B2 (en) Syntax / semantic analysis device, speech recognition device, and syntax / semantic analysis program
US20220310097A1 (en) Reducing Streaming ASR Model Delay With Self Alignment
JP2011175046A (en) Voice search device and voice search method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant