CN113314105A - Voice data processing method, device, equipment and storage medium - Google Patents

Voice data processing method, device, equipment and storage medium

Info

Publication number: CN113314105A
Application number: CN202010082981.4A
Authority: CN (China)
Prior art keywords: data, acoustic, voice data, target, voting
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: 朱晓如, 曹元斌
Current and original assignee: Cainiao Smart Logistics Holding Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Application filed by Cainiao Smart Logistics Holding Ltd

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/26: Speech to text systems
    • G10L2015/025: Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Embodiments of the present application provide a voice data processing method, apparatus, device, and storage medium. The method includes: decoding voice data with a plurality of decoders, respectively, and determining a corresponding plurality of decoding results; determining acoustic processing data corresponding to the voice data according to the decoding results and a screening rule, where the acoustic processing data includes the voice data and corresponding annotation data; determining a frame alignment result corresponding to the voice data according to the acoustic processing data and a preset base acoustic parser; and returning the frame alignment result as training data. This can improve both the efficiency of voice annotation and the processing efficiency of the acoustic parser.

Description

Voice data processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer technology, and in particular to a voice data processing method and apparatus, an electronic device, and a storage medium.
Background
Acoustic model training plays a very important role in the speech recognition task and is closely tied to recognition accuracy. The accuracy of an acoustic model depends on how it is trained, and the quality of the training data affects the accuracy of the model, for example the sampling rate of the training data, the noise environment, the echo environment, and far-field conditions.
The training process of an acoustic model includes: collecting public voice data, manually annotating all of the voice data, and then training the acoustic model with the data features and annotations of the voice data. The trained acoustic model can then provide speech recognition services.
However, the above training method requires manually annotated speech data as training data, and a large amount of speech data must be collected and manually annotated to guarantee the training result of the model; consequently, the efficiency of data annotation is low, and the efficiency of model training is low as well.
Disclosure of Invention
Embodiments of the present application provide a voice data processing method to improve processing efficiency.
Correspondingly, embodiments of the present application also provide a data processing apparatus, an electronic device, and a storage medium to ensure the implementation and application of the method.
To solve the above problem, an embodiment of the present application discloses a voice data processing method, including: decoding voice data with a plurality of decoders, respectively, and determining a corresponding plurality of decoding results; determining acoustic processing data corresponding to the voice data according to the decoding results and a screening rule, where the acoustic processing data includes the voice data and corresponding annotation data; determining a frame alignment result corresponding to the voice data according to the acoustic processing data and a preset base acoustic parser; and returning the frame alignment result as training data.
To solve the above problem, an embodiment of the present application discloses an acoustic parser training method, including: decoding voice data with a plurality of decoders, respectively, and determining a corresponding plurality of decoding results; determining acoustic processing data corresponding to the voice data according to the decoding results and a screening rule, where the acoustic processing data includes the voice data and corresponding annotation data; and training an acoustic parser according to the acoustic processing data to obtain a trained acoustic parser.
To solve the above problem, an embodiment of the present application discloses a voice data processing apparatus, including: a decoding result acquisition module configured to decode voice data with a plurality of decoders and determine a corresponding plurality of decoding results; an acoustic processing data acquisition module configured to determine acoustic processing data corresponding to the voice data according to the decoding results and a screening rule, where the acoustic processing data includes the voice data and corresponding annotation data; an alignment result acquisition module configured to determine a frame alignment result corresponding to the voice data according to the acoustic processing data and a preset base acoustic parser; and a training data acquisition module configured to return the frame alignment result as training data.
To solve the above problem, an embodiment of the present application discloses an acoustic parser training apparatus, including: a decoding result acquisition module configured to decode voice data with a plurality of decoders and determine a corresponding plurality of decoding results; an acoustic processing data acquisition module configured to determine acoustic processing data corresponding to the voice data according to the decoding results and a screening rule, where the acoustic processing data includes the voice data and corresponding annotation data; and a parser generation module configured to train an acoustic parser according to the acoustic processing data to obtain a trained acoustic parser.
To solve the above problem, an embodiment of the present application discloses an electronic device, including: a processor; and a memory storing executable code that, when executed, causes the processor to perform a voice data processing method as described in one or more of the above embodiments.
To solve the above problem, embodiments of the present application also disclose one or more machine-readable media storing executable code that, when executed, causes a processor to perform a voice data processing method as described in one or more of the above embodiments.
Compared with the prior art, the embodiments of the present application have the following advantages:
In the embodiments of the present application, a plurality of decoders are used to decode and analyze the voice data during preprocessing, obtaining a plurality of decoding results. The corresponding voice data and annotation data are determined as acoustic processing data by using the decoding results and a screening rule, which improves the efficiency of voice annotation. A preset base acoustic parser then processes the acoustic processing data obtained through preprocessing and produces the frame alignment result corresponding to the voice data, so that the frame alignment result can be used as training data to train a corresponding acoustic parser, improving the processing efficiency of the acoustic parser.
Drawings
FIG. 1 is a system architecture diagram of a speech processing system according to one embodiment of the present application;
FIG. 2 is a system architecture diagram of a voting system in accordance with one embodiment of the present application;
FIG. 3 is a system architecture diagram of a speech processing system according to another embodiment of the present application;
FIG. 4 illustrates a method of processing voice data according to one embodiment of the present application;
FIG. 5 is a flow diagram of an acoustic parser training method according to an embodiment of the present application;
FIG. 6 is a block diagram of a data acquisition device according to an embodiment of the present application;
FIG. 7 is a block diagram of an acoustic parser training device in accordance with an embodiment of the present application;
FIG. 8 is a block diagram of an exemplary device according to one embodiment of the present application.
Detailed Description
To make the above objects, features, and advantages of the present application more comprehensible, the present application is described in further detail below with reference to the accompanying drawings and specific embodiments.
Embodiments of the present application can be applied to various voice data processing scenarios. Preprocessing screens the voice data so that higher-quality voice data is obtained, and determines the acoustic processing data corresponding to the voice data, where the acoustic processing data includes the voice data and the corresponding annotation data. A base acoustic parser is configured to process the acoustic processing data and obtain training data for training an acoustic parser.
In the embodiments of the present application, the acoustic parser may also be referred to as an acoustic model, an acoustic analyzer, an acoustic information mapping set, and so on. It is generally constructed from a mathematical model algorithm related to acoustic processing, such as a neural network algorithm, a deep neural network algorithm, or a hidden Markov model algorithm, and it can determine acoustic information from voice data in order to process that data. A mathematical model is a mathematical structure that expresses, exactly or approximately, the characteristics or quantitative dependencies of an object system in mathematical language; such a structure is a pure relational structure of a system drawn with mathematical symbols. Broadly, a mathematical model can be understood to include the various concepts, formulas, and theories of mathematics; in this sense, all of mathematics can be said to be a science of mathematical models, since these are abstracted from prototypes in the real world. In the narrow sense, a mathematical model is a mathematical relationship structure that reflects a specific problem or a specific system of things, and in this sense it may also be understood as the mathematical expressions relating the variables of a system.
The base acoustic parser is a high-order acoustic model: its model algorithm is complex and its decoding is slow, but its accuracy is high, so it is generally used in offline speech recognition processing. For example, on-line decoding of voice data is typically limited to a 5-layer deep neural network (DNN) model, while the base acoustic model can be more complex, for example an LAS (Listen, Attend and Spell) model, which is an encoder-decoder model with an attention mechanism added; the base acoustic model can be any of various attention- or Transformer-based models, where the Transformer is a model that processes sequences based only on attention structures. Therefore, the embodiments of the present application can process the voice data with the base acoustic model to obtain the frame alignment result of the voice data and use that result as training data for an on-line acoustic model (or on-line acoustic decoder), so that the on-line model can be trained quickly by exploiting the strengths of the base acoustic model. Moreover, the voice data can be organized by scene, so that the base acoustic model processes the voice data of each scene and produces training data for each scene; acoustic models for various scenes can then be trained for the voice services of those scenes. For example, acoustic models for scenes such as robot customer service, translation, or a voice robot can be trained so that acoustic information can be recognized and the voice service of the corresponding scene can be provided.
An acoustic model is usually used to map the acoustic features of speech to corresponding acoustic units such as phonemes and words; it can be a knowledge representation of differences in acoustics, phonetics, environmental variables, speaker gender, accent, and so on. In the embodiments of the present application, the preprocessing of the speech data is generally performed by a plurality of decoders, where a decoder converts input speech data into a character sequence according to a dictionary, an acoustic model, and a language model: it recognizes an acoustic feature sequence from the input speech and then, given that acoustic feature sequence, decodes it into the most likely words or sentences.
On the basis of the foregoing, the present application provides a speech processing system that screens voice data through preprocessing with decoders and then obtains a frame alignment result through a base acoustic parser; that result can be used as training data for other on-line acoustic parsers so as to train those on-line parsers.
As shown in the system architecture diagram of the speech processing system in fig. 1, the voice data may first be collected; it may be collected by various speech processing systems, and it may also be collected for a target scene so that the system can train an acoustic parser corresponding to that target scene. In the embodiments of the present application, the voice data can be provided by a voice service provider offering voice services, so that training data for a required scene is produced for each provider and the provider's on-line acoustic parser is trained.
The target scene may be any scene in which voice processing is performed, such as the various voice scenes of the Internet of Things. For example, in a home scene, the voice data related to the target scene may be voice data related to home life, such as voice data containing phrases for turning lights on and off, turning the television on and off, playing music, and turning on the water heater. In an e-commerce customer service scene, the voice data related to the target scene may be voice data related to e-commerce customer service work, such as voice data containing terms for finding a product, a product model, a product brand, and so on. In a shopping-guide scene for furniture products in a physical store, the voice data related to the target scene may be voice data related to furniture products, such as voice data containing words like table, chair, and sofa.
In one example, a voice data processing method includes:
Step 102: decode the voice data with a plurality of decoders, respectively, and determine a corresponding plurality of decoding results.
Step 104: determine acoustic processing data corresponding to the voice data according to the plurality of decoding results and a screening rule, where the acoustic processing data includes the voice data and the corresponding annotation data.
After the voice data is acquired, the voice data is screened and the corresponding annotation data is determined. During screening, the voice data can be decoded by a plurality of decoders to obtain a plurality of decoding results. The voice data is then classified by quality and screened according to the decoding results and a preset screening rule, and the screened voice data together with the corresponding annotation data is used as the acoustic processing data. Each decoder is configured to decode the voice data into a decoding result, which may include the acoustic feature information corresponding to the voice data and the annotation data of the voice data determined from that acoustic feature information.
A decoder can be built from an acoustic parser and a language decoder: the acoustic parser analyzes the voice data frame by frame to obtain the phoneme features corresponding to the voice data and combines those phoneme features into candidate words and their probabilities, and the language decoder decodes and combines the words and their probabilities into the sentence (the annotation data) formed by those words, as sketched below. The plurality of decoders may be acoustic decoders of different types; specifically, they may include an acoustic parser that is provided by a parser demander and is already serving the target scene, as well as a decoder derived from an acoustic parser trained to meet the decoding requirements. The screening rule classifies the voice data according to the plurality of decoding results; it can be a voting rule that determines the quality of the voice data by checking whether the voting result satisfies a threshold, and thereby determines how the voice data is screened and processed.
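To make this decoder structure concrete, the following Python sketch shows one possible shape of such a decoder. It is purely illustrative: AcousticParser, LanguageDecoder, and every method name here are assumptions made for illustration, not interfaces from this disclosure.

```python
# Illustrative sketch only: the acoustic parser and language decoder
# below are hypothetical stand-ins for the components described above.
from dataclasses import dataclass

@dataclass
class DecodingResult:
    text: str          # decoded sentence, a candidate for annotation data
    confidence: float  # overall decoding confidence

class Decoder:
    """Converts voice data into a character sequence (a decoding result)."""
    def __init__(self, acoustic_parser, language_decoder):
        self.acoustic_parser = acoustic_parser
        self.language_decoder = language_decoder

    def decode(self, voice_data: bytes) -> DecodingResult:
        # Frame-by-frame analysis yields phoneme features per frame.
        phoneme_features = self.acoustic_parser.analyze_frames(voice_data)
        # Combine phoneme features into candidate words with probabilities,
        # then let the language decoder assemble the most likely sentence.
        words = self.acoustic_parser.combine_phonemes(phoneme_features)
        text, confidence = self.language_decoder.best_sentence(words)
        return DecodingResult(text=text, confidence=confidence)
```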
For example, the voice data can be divided into three categories by quality: a first category, a second category, and a third category, with three corresponding thresholds: a first threshold, a second threshold, and a third threshold. Voting is performed on the plurality of decoding results to obtain the voting result corresponding to the voice data. If the voting result satisfies the first threshold, the corresponding voice data belongs to the first category; if it satisfies the second threshold, the second category; and if it satisfies the third threshold, the third category.
After the category of the voice data is determined, the voice data can be processed according to its category to obtain the acoustic processing data. For the first category, the corresponding annotation data can be determined from the plurality of decoding results; for the second category, the corresponding annotation data can be determined by manual annotation; and the third category can be discarded. After each category is processed accordingly, the screened voice data and the corresponding annotation data are used as the acoustic processing data.
For example, as shown in fig. 2, the voice data is decoded by three decoders, yielding three decoding results for each piece of voice data. The voice data is then classified by quality according to the three decoding results and the screening rule. Specifically, the screening rule can be understood as: if all three decoding results are the same, the voice data belongs to the first category; if exactly two of the three decoding results are the same, the second category; and if all three decoding results differ, the third category.
For the first category, the annotation data can be determined from the decoding results of the voice data. For the second category, the three decoding results are inconsistent, so the annotation data cannot be determined reliably from them; the voice data is therefore output for manual annotation, and the manually produced label is used as its annotation data. For the third category, all three decoding results differ, the data quality is poor, and the voice data is discarded. After the screened voice data and the corresponding annotation data are obtained, they are combined into the acoustic processing data. A minimal sketch of this rule follows.
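A minimal sketch of this three-decoder screening rule, assuming plain string equality between decoding results; the function and category names are illustrative, not taken from the disclosure.

```python
# Minimal sketch of the three-way screening rule, assuming three decoders.
from collections import Counter
from typing import Optional

def screen_by_voting(decoding_results: list[str]) -> tuple[str, Optional[str]]:
    """Return (category, annotation) for one utterance.

    'first'  -> all results agree; use the decoding result as annotation data
    'second' -> partial agreement; route to manual annotation
    'third'  -> no agreement; discard the utterance
    """
    counts = Counter(decoding_results)
    top_text, top_votes = counts.most_common(1)[0]
    if top_votes == len(decoding_results):  # all decoders agree
        return "first", top_text
    if top_votes >= 2:                      # e.g. two of three agree
        return "second", None               # annotation comes from a human
    return "third", None                    # discard: low-quality data

# Two of three decoders agree -> second category, manual annotation.
print(screen_by_voting(["turn on the light", "turn on the light", "turn off"]))
# -> ('second', None)
```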
As shown in fig. 1, the voice data processing method further includes:
Step 106: determine the frame alignment result corresponding to the voice data according to the acoustic processing data and the preset base acoustic parser.
Step 108: return the frame alignment result as training data.
The acoustic processing data is analyzed by a pre-trained base acoustic parser to obtain the frame alignment result. The base acoustic parser is a pre-trained acoustic parser; it aligns the voice data in the acoustic processing data with the annotation data in the acoustic processing data to obtain the frame alignment result. The frame alignment result includes the alignment between the annotation data and the one or more frames corresponding to the acoustic features in the voice data. After the frame alignment result is obtained, it is returned as training data, that is, data used to train the acoustic parser corresponding to the target scene.
The frame alignment result may be obtained as follows:
Segment the voice data in the acoustic processing data into frames to obtain the corresponding audio frames, and then analyze the audio frames with the base acoustic parser to convert them into acoustic features and the frames corresponding to those features. The acoustic features are data used to train the acoustic parser of the target scene; an acoustic feature comprises one or more phonemes, and each phoneme comprises one or more audio frames. For example, the acoustic features of audio containing "jin" may be acoustic data such as spectral information composed of the multiple phonemes "j", "i", and "n". Next, segment the annotation data in the acoustic processing data to obtain the corresponding acoustic units, which are likewise data used to train the acoustic parser of the target scene; for example, the annotation "jin" segments into the units "j", "i", and "n". Finally, align the frames corresponding to the acoustic features with the acoustic units through the base acoustic parser to obtain the frame alignment result, as sketched below.
In this embodiment, a plurality of decoders are used to decode and analyze the voice data during preprocessing, obtaining a plurality of decoding results; the corresponding voice data and annotation data are determined as acoustic processing data by using the decoding results and the screening rule, which improves the efficiency of voice annotation.
The acoustic processing data may also include public voice data and its corresponding annotation data. Public voice data that already carries annotation data can be fed directly into the base acoustic parser as acoustic processing data, and the corresponding frame alignment result is obtained and used as training data. As shown in fig. 3, the frame alignment result output by the base acoustic parser can serve as training data for an on-line acoustic parser and other acoustic models, so that they obtain better training data; in the manner of transfer learning, the output of the complex base acoustic parser is used as the input of a relatively simple on-line acoustic parser to improve the training effect of the on-line parser.
Therefore, the frame alignment result output by the base acoustic parser can be fed as training data into a simpler acoustic parser for training, and the trained parser can be deployed on-line to provide services such as on-line real-time speech recognition and processing. In some scenes, the voice data given to the base acoustic parser can be the voice data of a target scene; the resulting frame alignment result is then training data for that target scene, and the acoustic parser trained from it can process the voice data of the target scene and provide more specialized services for each scene.
For example, an acoustic parser obtained by the method of this embodiment may be applied to a customer service scene: voice data related to the customer service scene is preprocessed to obtain acoustic processing data, and, based on the base acoustic parser, the corresponding acoustic parser is trained with that acoustic processing data. If the scene is bank customer service, the parser can be trained with historical recordings of bank customer service calls as the target voice data. Once the parser is on-line, it analyzes the voice data of the customer service scene to obtain a customer service voice analysis result, the corresponding reply content is determined from that result, and the reply is fed back to the user.
An acoustic parser obtained by the method of this embodiment can also be applied to a home control scene and trained on voice data related to home control, for example historical voice data collected by a smart speaker as the target voice data. Once on-line, the parser analyzes the user's voice to obtain a home voice analysis result, and that result is used to control electronic devices in the home, such as the air conditioner or the lighting.
An acoustic parser obtained by the method of this embodiment can also be applied to a shopping-guide robot in a shopping mall and trained on voice data related to the shopping-guide scene. Once on-line, the parser analyzes the user's voice to obtain a shopping-guide voice analysis result, which is used to guide the user. For example, if the recognition result of the user's voice data is "find shoes", the floors and locations of shoe shops on the mall map can be shown to the user. In addition, in a complex environment such as a mall, the voice data is mixed with considerable noise, so the voice data can be denoised before the acoustic parser is trained in order to improve its data quality.
In some optional embodiments of the present application, the output results of the on-line acoustic parsers of each scene may also be collected, that is, the service voice data related to the services they provide. This service voice data can go through the same pre-screening processing, and the screened service voice data can be used to further optimize the acoustic parser of that scene, to optimize the base acoustic parser, and so on.
In this embodiment, the acoustic parser is optimized with service voice data related to its own service, so that the parser matches the target scene more closely and its recognition accuracy improves.
As shown in fig. 1, optionally, as an embodiment, the screening rule includes a voting rule, and determining the acoustic processing data corresponding to the voice data according to the plurality of decoding results and the screening rule in step 104 includes:
determining the annotation data of the screened voice data according to the plurality of decoding results and the voting rule; and
using the screened voice data and the corresponding annotation data as the acoustic processing data.
After the plurality of decoding results are determined, voting is performed on them to obtain the corresponding voting result, and the category of the voice data is determined from the voting result and the voting rule; the voice data is thus screened, the corresponding annotation data is determined, and the acoustic processing data is obtained. Part of the annotation data can be determined from the plurality of decoding results, and another part can be determined by manual annotation.
Optionally, as an embodiment, determining the annotation data of the screened voice data according to the plurality of decoding results and the voting rule includes:
voting according to the plurality of decoding results to obtain the corresponding target voting result;
when the target voting result of the target voice data satisfies the first threshold, obtaining the annotation data of the target voice data from the decoding results corresponding to the target voice data;
when the target voting result satisfies the second threshold, acquiring a manual annotation result for the target voice data; and
when the target voting result satisfies the third threshold, discarding the target voice data.
Voting is performed on the plurality of decoding results to obtain the target voting result. If the target voting result satisfies the first threshold, the target voice data is determined to be first-category voice data, and its annotation data is determined from the plurality of decoding results. If the target voting result satisfies the second threshold, the target voice data is determined to be second-category voice data, and its annotation data is determined by acquiring manually annotated data. If the target voting result satisfies the third threshold, the target voice data is determined to be third-category data and is discarded.
The first, second, and third thresholds may each be a single number or an interval. For example, they may be determined from preset voting ratios: if the two preset ratio cut-offs are 85% and 60%, then the first threshold is a voting ratio of at least 85%, the second threshold is the interval [60%, 85%), and the third threshold is a voting ratio below 60%, as in the following sketch.
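A minimal sketch of these interval-style thresholds, using the example cut-offs of 85% and 60%; the function name and category labels are illustrative.

```python
# Sketch of the interval thresholds from the example above.
def category_from_vote_ratio(ratio: float) -> str:
    if ratio >= 0.85:   # first threshold: at least 85%
        return "first"
    if ratio >= 0.60:   # second threshold: the interval [60%, 85%)
        return "second"
    return "third"      # third threshold: below 60%

assert category_from_vote_ratio(0.90) == "first"
assert category_from_vote_ratio(0.70) == "second"
assert category_from_vote_ratio(0.50) == "third"
```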
Optionally, as an embodiment, when voting according to the plurality of decoding results, it may first be determined whether any of the decoding results are identical. If the decoding results all differ, the semantic similarity between them may be computed, and decoding results whose similarity meets a preset similarity threshold may be treated as the same result for voting purposes. For example, for one piece of speech data the three decoders output: "I want to go out to go", "I want to go", and "I want to go out". The three results differ literally, but the similarity between the first and the third is high, so they can be treated as the same decoding result, and that result is used in the vote to obtain the target voting result. For speech data whose decoding results all differ but show semantic similarity, the annotation data may be determined by manual annotation to obtain the acoustic processing data.
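One possible merging step is sketched below. The token-overlap similarity used here is a simple stand-in chosen for illustration; the disclosure does not specify a particular similarity measure, and the threshold value is an assumption.

```python
# Sketch of similarity-based merging before voting; Jaccard word overlap
# is an illustrative stand-in for the semantic similarity measure.
def similarity(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def merge_similar(results: list[str], threshold: float = 0.9) -> list[str]:
    merged = []  # list of [representative_text, vote_count]
    for r in results:
        for rep in merged:
            if similarity(r, rep[0]) >= threshold:
                rep[1] += 1  # similar enough: counts as the same result
                break
        else:
            merged.append([r, 1])
    # Expand back into one entry per vote so it can feed a voting step.
    return [rep[0] for rep in merged for _ in range(rep[1])]

results = ["I want to go out to go", "I want to go", "I want to go out"]
print(merge_similar(results))
# The first and third results merge into one representative with two votes.
```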
Optionally, as an embodiment, before voting according to the plurality of decoding results, a corresponding weight value may be set for each of the plurality of decoders, and when voting is performed, a weighted voting result is determined from the decoding results and their weight values. The weighted voting result is compared with weighted thresholds to determine the category of the voice data. If the voice data is divided into three categories, the annotation data of the first category is determined from the plurality of decoding results and their weight values, the annotation data of the second category is determined by manual annotation, and the third category is discarded.
For example, for four decoders, the weight of the first decoder is 0.4, the weight of the second is 0.3, the weight of the third is 0.2, and the weight of the fourth is 0.1. After the four decoding results are obtained, voting is performed on them together with their weights to obtain the weighted voting result. For instance, for voice data A, if the decoding results of the first and second decoders are the same while those of the third and fourth decoders differ, the weighted voting result is 0.7 (0.4 + 0.3); voice data A is then classified by comparing this result with the preset weighted thresholds and processed accordingly, as in the sketch below.
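A sketch of this weighted voting, using the example weights above; the decoding strings are invented for illustration, and the weighted thresholds themselves are left to the preset configuration, as in the text.

```python
# Weighted voting sketch; the decoder weights follow the four-decoder
# example above. All names and strings are illustrative.
from collections import defaultdict

def weighted_vote(results: list[str], weights: list[float]) -> float:
    """Return the total weight behind the best-supported decoding result."""
    score = defaultdict(float)
    for text, weight in zip(results, weights):
        score[text] += weight
    return max(score.values())

weights = [0.4, 0.3, 0.2, 0.1]
# Voice data A: decoders 1 and 2 agree; decoders 3 and 4 each differ.
results = ["turn on the light", "turn on the light",
           "turn on the lamp", "turn off the light"]
print(weighted_vote(results, weights))  # 0.7, i.e. 0.4 + 0.3
# The 0.7 result is then compared with the preset weighted thresholds
# to classify voice data A.
```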
Optionally, as an embodiment, determining the acoustic processing data corresponding to the voice data according to the plurality of decoding results and the screening rule includes: determining the semantic similarity between the plurality of decoding results; counting the number of similarity values that meet a similarity threshold and, combined with the screening rule, determining the annotation data of the screened voice data; and using the screened voice data and the corresponding annotation data as the acoustic processing data.
Whether the plurality of decoding results have the same semantics is determined from the semantic similarity between them, and the number of semantically identical decoding results is determined from the number of similarity values that meet the similarity threshold. The voice data is classified according to this number and the corresponding screening rule, and different categories can be handled differently: if the voice data is divided into three quality categories from high to low, the annotation data of the first category can be determined from the plurality of decoding results, the annotation data of the second category can be determined by manual annotation, and the third category can be discarded. After the voice data is screened and the annotation data of the screened voice data is determined, it is used as the acoustic processing data from which the training data is obtained.
Optionally, as an embodiment, determining the frame alignment result corresponding to the voice data according to the acoustic processing data and the preset base acoustic parser in step 106 includes:
analyzing the voice data in the acoustic processing data with the base acoustic parser and determining the acoustic features; and
aligning the acoustic features with the acoustic units of the annotation data through the base acoustic parser to obtain the frame alignment result.
The voice data in the acoustic processing data is segmented into frames to obtain the corresponding audio frames, and the audio frames are analyzed by the base acoustic parser to obtain the acoustic features and their corresponding frames. The annotation information in the acoustic processing data is segmented to obtain the acoustic units, and the acoustic features and acoustic units are aligned through the base acoustic parser to obtain the frame alignment result.
Optionally, as an embodiment, the method shown in fig. 1 further includes:
acquiring the voice data set corresponding to a target scene to obtain target training data for the target scene, where the target training data is used to train the acoustic parser corresponding to the target scene.
When the acoustic parser of a target scene is trained, the voice data set corresponding to that scene is acquired and used as the voice data to be screened, annotated, and aligned, producing the target training data; the acoustic parser of the target scene is then trained with this data. In this way, training data is provided for the on-line acoustic parsers of various scenes, so that they can be trained better.
Optionally, as an embodiment, the method shown in fig. 1 further includes:
using the service data related to the acoustic parser of the target scene, generated after the parser has served in the target scene, as service voice data, where the service voice data is used to optimize the acoustic parser corresponding to the target scene.
After the trained acoustic parser serves on-line in the target scene, the service data related to it during that service is used as service voice data; this data is screened, annotated, and aligned in the same way to obtain the corresponding optimization data, and the trained acoustic parser is optimized with it. This further improves the match between the trained acoustic parser and the target scene and raises the recognition accuracy of the parser.
Optionally, as an embodiment, the method shown in fig. 1 further includes:
optimizing the base acoustic parser according to the service voice data.
The service voice data is also used as voice data to optimize the base acoustic parser. Continuous optimization of the base acoustic parser further improves its recognition accuracy and its alignment quality, which in turn improves the recognition accuracy of the acoustic parsers trained from it.
Optionally, as an embodiment, the plurality of decoders include a base decoder, and the base decoder is determined by the base acoustic parser.
Because the plurality of decoders include a base decoder determined by the base acoustic parser, one of the plurality of decoding results is tied to the base acoustic parser. The base acoustic parser is obtained through training on a large amount of data, so the accuracy of its decoding result is high, and including its decoding result among the plurality of decoding results improves the accuracy of the voice data screening.
Referring to fig. 4, a voice data processing method according to an embodiment of the present application is shown. The method includes:
Step 402: decode the voice data with a plurality of decoders, respectively, and determine a corresponding plurality of decoding results.
Step 404: vote according to the plurality of decoding results to obtain the corresponding target voting result.
Step 406: determine whether the voting result satisfies a threshold.
When the voting result of the voice data satisfies the first threshold, step 408 is performed; when the target voting result satisfies the second threshold, step 410 is performed; and when the target voting result satisfies the third threshold, step 412 is performed.
Step 408: obtain the annotation data of the target voice data according to the decoding result corresponding to the target voice data.
When the target voting result of the target voice data satisfies the first threshold, the decoding result corresponding to the target voice data can be used as its annotation data.
Step 410: acquire the annotation result of the target voice data.
When the target voting result satisfies the second threshold, the annotation result of the target voice data can be obtained by manual annotation or similar means.
Step 412: discard the target voice data.
When the target voting result satisfies the third threshold, the voice data is judged to be of poor quality, and the target voice data can be discarded.
Step 414: use the screened voice data and the corresponding annotation data as the acoustic processing data.
Step 416: analyze the voice data in the acoustic processing data with the base acoustic parser and determine the acoustic features.
Step 418: align the acoustic features with the acoustic units according to the base acoustic parser to obtain the frame alignment result.
Step 420: return the frame alignment result as training data.
In the scheme of this embodiment, the decoding results are screened through mechanisms such as voting: high-quality decoding results can be adopted directly as annotation data, low-quality voice data is discarded, and annotation data for voice data of ordinary quality is determined by manual annotation. Since not all voice data has to be annotated manually, the cost of manual annotation is reduced and the annotation efficiency of the voice data is improved. An end-to-end sketch of the flow follows.
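An end-to-end sketch of steps 402 to 420, composing the helpers sketched earlier (screen_by_voting, frame_align); the decoders, base_parser, and manual_annotate arguments are hypothetical stand-ins, not interfaces from the disclosure.

```python
def build_training_data(voice_corpus, decoders, base_parser, manual_annotate):
    """Illustrative composition of the flow in fig. 4."""
    training_data = []
    for voice_data in voice_corpus:
        # Step 402: decode with every decoder.
        results = [d.decode(voice_data).text for d in decoders]
        # Steps 404-406: vote and classify against the thresholds.
        category, annotation = screen_by_voting(results)
        if category == "second":    # step 410: manual annotation
            annotation = manual_annotate(voice_data)
        elif category == "third":   # step 412: discard poor-quality data
            continue
        # Step 414 pairs (voice_data, annotation) as acoustic processing
        # data; steps 416-418 produce the frame alignment result.
        alignment = frame_align(voice_data, annotation, base_parser)
        training_data.append(alignment)  # step 420: return as training data
    return training_data
```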
On the basis of the foregoing embodiments, this embodiment further provides an acoustic parser training method. As shown in fig. 5, the method includes:
Step 502: decode the voice data with a plurality of decoders, respectively, and determine a corresponding plurality of decoding results.
Step 504: determine the acoustic processing data corresponding to the voice data according to the plurality of decoding results and the screening rule, where the acoustic processing data includes the voice data and the corresponding annotation data.
Step 506: train the acoustic parser according to the acoustic processing data to obtain the trained acoustic parser.
In summary, a plurality of decoders are used to decode and analyze the voice data during preprocessing, obtaining a plurality of decoding results; the corresponding voice data and annotation data are determined as acoustic processing data by using the decoding results and the screening rule, which improves the efficiency of voice annotation and the processing efficiency of the acoustic parser.
Optionally, as an embodiment, the screening rule includes a voting rule, and determining the acoustic processing data corresponding to the voice data according to the plurality of decoding results and the screening rule in step 504 includes:
determining the annotation data of the screened voice data according to the plurality of decoding results and the voting rule; and
using the screened voice data and the corresponding annotation data as the acoustic processing data.
After the plurality of decoding results are determined, voting is performed on them to obtain the corresponding voting result, and the category of the voice data is determined from the voting result and the voting rule; the voice data is thus screened, the corresponding annotation data is determined, and the acoustic processing data is obtained. The annotation data of the voice data can be obtained directly from the decoding results; in other examples, part of it can be determined from the plurality of decoding results and another part by manual annotation.
Optionally, as an embodiment, determining the annotation data of the screened voice data according to the plurality of decoding results and the voting rule includes:
voting according to the plurality of decoding results to obtain the corresponding target voting result; and
when the target voting result of the target voice data satisfies the first threshold, obtaining the annotation data of the target voice data from the decoding results corresponding to the target voice data.
Specifically, voting is performed on the plurality of decoding results to obtain the target voting result. If the target voting result satisfies the first threshold, the target voice data is determined to be first-category voice data, and its annotation data is determined from the plurality of decoding results.
Optionally, as an embodiment, determining the annotation data of the screened voice data according to the plurality of decoding results and the voting rule further includes:
when the target voting result satisfies the second threshold, acquiring the annotation result of the target voice data.
Specifically, if the target voting result satisfies the second threshold, the target voice data is determined to be second-category voice data, and its annotation data is determined by acquiring manually annotated data.
Optionally, as an embodiment, determining the annotation data of the screened voice data according to the plurality of decoding results and the voting rule further includes:
when the target voting result satisfies the third threshold, discarding the target voice data.
Specifically, if the target voting result satisfies the third threshold, the target voice data is determined to be voice data of poor quality, for example with excessive noise or a noisy background, and it is discarded.
Optionally, as an embodiment, the plurality of decoders include a base decoder, and the base decoder is determined by the base acoustic parser.
Specifically, because the base decoder is determined by the base acoustic parser, which is trained on a large amount of data, its decoding result is highly accurate, and including it among the plurality of decoding results improves the accuracy of the voice data screening.
Optionally, as an embodiment, the method further includes:
acquiring the voice data corresponding to the output results of the trained acoustic parser, and screening that voice data through the plurality of decoders to obtain training data for optimizing the acoustic parser.
After the trained acoustic parser serves on-line in the target scene, the voice data corresponding to its output results during service is screened through the plurality of decoders to obtain training data for optimizing the trained parser; this further improves the match between the parser and the target scene and raises the parser's recognition accuracy.
The method of this embodiment is based on the idea of transfer learning: the output of a pre-trained complex model (the base acoustic parser) is adopted as supervision information to train another, simpler network (the acoustic parser), which is one way to improve the effect of model training.
It should be noted that, for simplicity of description, the method embodiments are described as a series or combination of acts, but those skilled in the art will recognize that the embodiments are not limited by the order of the acts described, since some steps may occur in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are preferred embodiments, and the acts involved are not necessarily required by every embodiment of the application.
On the basis of the foregoing embodiments, this embodiment further provides a data acquisition apparatus, as shown in fig. 6, the apparatus includes:
a decoding result obtaining module 602, configured to use a plurality of decoders to decode the voice data respectively, and determine a plurality of corresponding decoding results.
An acoustic processing data obtaining module 604, configured to determine, according to the multiple decoding results and the screening rule, acoustic processing data corresponding to the voice data, where the acoustic processing data includes the voice data and corresponding annotation data.
An alignment result obtaining module 606, configured to determine, according to the acoustic processing data and a set basic acoustic parser, a frame alignment result corresponding to the speech data.
A training data obtaining module 608, configured to return the frame alignment result as training data.
In summary, a plurality of decoders are used to decode and analyze the voice data during preprocessing, obtaining a plurality of decoding results. The corresponding voice data and annotation data are determined as acoustic processing data by using the decoding results and the screening rule, which improves the efficiency of voice annotation. The acoustic processing data obtained through preprocessing is then processed by the preset base acoustic parser, which directly aligns the voice data with the annotation data to obtain the frame alignment result; the frame alignment result is used as training data to train the corresponding acoustic parser, improving the processing efficiency of the parser.
Optionally, as an embodiment, the filtering rule includes a voting rule, and the acoustic processing data obtaining module 604 includes:
a data voting acquisition submodule, configured to determine the annotation data of the screened voice data according to the plurality of decoding results and the voting rule; and
an acoustic processing data acquisition submodule, configured to use the screened voice data and the corresponding annotation data as the acoustic processing data.
Optionally, as an embodiment, the data vote acquisition sub-module includes:
a voting result acquisition submodule, configured to vote according to the plurality of decoding results to obtain the corresponding target voting result; and
an annotation data generation submodule, configured to obtain the annotation data of the target voice data according to the decoding result corresponding to the target voice data when the target voting result of the target voice data satisfies a first threshold.
Optionally, as an embodiment, the data voting acquisition sub-module further includes:
an annotation data acquisition submodule, configured to acquire the annotation result of the target voice data when the target voting result satisfies a second threshold.
Optionally, as an embodiment, the data voting acquisition sub-module further includes:
and the voice data discarding submodule is used for discarding the target voice data when the target voting result meets a third threshold value.
Optionally, as an embodiment, the alignment result obtaining module 606 includes:
a voice data analysis submodule, configured to analyze the voice data in the acoustic processing data according to the base acoustic parser and determine the acoustic features; and
an alignment result acquisition submodule, configured to align the acoustic features with the acoustic units of the annotation data according to the base acoustic parser to obtain the frame alignment result.
Optionally, as an embodiment, the apparatus further includes:
a scene data acquisition module, configured to acquire the voice data set corresponding to a target scene to obtain target training data for the target scene, where the target training data is used to train the acoustic parser corresponding to the target scene.
Optionally, as an embodiment, the apparatus further includes:
a service data acquisition module, configured to use the service data related to the acoustic parser of the target scene, generated after the parser serves in the target scene, as service voice data, where the service voice data is used to optimize the acoustic parser corresponding to the target scene.
Optionally, as an embodiment, the apparatus further includes:
an optimization processing module, configured to optimize the base acoustic parser according to the service voice data.
Optionally, as an embodiment, the plurality of decoders include a base decoder, and the base decoder is determined by the base acoustic parser.
On the basis of the above embodiments, this embodiment further provides an acoustic parser training apparatus. As shown in fig. 7, the apparatus includes:
a decoding result obtaining module 702, configured to use a plurality of decoders to decode the voice data respectively, and determine a plurality of corresponding decoding results.
An acoustic processing data obtaining module 704, configured to determine, according to the multiple decoding results and the screening rule, acoustic processing data corresponding to the voice data, where the acoustic processing data includes the voice data and corresponding annotation data.
The parser generation module 706 is configured to train an acoustic parser according to the acoustic processing data to obtain a trained acoustic parser.
In summary, a plurality of decoders are used to decode and analyze the voice data during preprocessing, obtaining a plurality of decoding results; the corresponding voice data and annotation data are determined as acoustic processing data by using the decoding results and the screening rule, which improves the efficiency of voice annotation and the processing efficiency of the acoustic parser.
Optionally, as an embodiment, the filtering rule includes a voting rule, and the acoustic processing data obtaining module 704 includes:
a data voting obtaining submodule, configured to determine the annotation data of the screened voice data according to the plurality of decoding results and the voting rule; and
an acoustic processing data obtaining submodule, configured to use the screened voice data and the corresponding annotation data as the acoustic processing data.
Optionally, as an embodiment, the data voting obtaining sub-module includes:
the voting result obtaining submodule is used for voting according to the decoding results to obtain corresponding target voting results;
and the labeled data obtaining submodule is used for obtaining labeled data of the target voice data according to a decoding result corresponding to the target voice data when a target voting result of the target voice data meets a first threshold value.
Optionally, as an embodiment, the data voting obtaining sub-module further includes:
and the labeling data labeling submodule is used for acquiring the labeling result of the target voice data when the target voting result meets a second threshold value.
Optionally, as an embodiment, the data voting obtaining sub-module further includes:
and the voice data deleting submodule is used for discarding the target voice data when the target voting result meets a third threshold value.
Optionally, as an embodiment, the plurality of decoders includes a basic decoder, and the basic decoder is determined by the basic acoustic analyzer.
Optionally, as an embodiment, the apparatus further includes:
and the service data acquisition module is used for acquiring the voice data corresponding to the output results of the trained acoustic analyzer and screening that voice data through the plurality of decoders, so that the screened data serves as training data for optimizing the acoustic analyzer; one possible wiring of this loop is sketched below.
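Under the same assumed interfaces, the following sketch shows one way the service data acquisition module's feedback loop could be wired: production audio handled by the deployed analyzer is re-screened through the decoder ensemble and any surviving pairs retrain the analyzer. build_acoustic_processing_data refers to the earlier sketch, and train_fn is an assumed incremental training routine; neither is prescribed by the embodiments.

```python
# Hypothetical continuous-optimization round over service voice data.
def optimization_round(analyzer, service_audio, decoders, vote_fn, train_fn):
    dataset, _ = build_acoustic_processing_data(service_audio, decoders, vote_fn)
    if dataset:                                  # retrain only if data survives screening
        analyzer = train_fn(analyzer, dataset)   # assumed incremental trainer
    return analyzer
```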
The present application further provides a non-transitory readable storage medium, where one or more modules (programs) are stored; when the one or more modules are applied to a device, they may cause the device to execute the instructions of the method steps in the present application.
Embodiments of the present application provide one or more machine-readable media having instructions stored thereon that, when executed by one or more processors, cause an electronic device to perform the methods described in one or more of the above embodiments. In the embodiments of the present application, such an electronic device includes various types of devices such as terminal devices and servers (clusters).
Embodiments of the present disclosure may be implemented as an apparatus, which may include an electronic device such as a terminal device or a server (cluster), using any suitable hardware, firmware, software, or any combination thereof, in a desired configuration. Fig. 8 schematically illustrates an example apparatus 800 that may be used to implement various embodiments described herein.
For one embodiment, fig. 8 illustrates an example apparatus 800 having one or more processors 802, a control module (chipset) 804 coupled to at least one of the processor(s) 802, a memory 806 coupled to the control module 804, a non-volatile memory (NVM)/storage 808 coupled to the control module 804, one or more input/output devices 810 coupled to the control module 804, and a network interface 812 coupled to the control module 804.
The processor 802 may include one or more single-core or multi-core processors, and the processor 802 may include any combination of general-purpose or special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In some embodiments, the apparatus 800 can be used as a terminal device, a server (cluster), or the like in the embodiments of the present application.
In some embodiments, the apparatus 800 may include one or more computer-readable media (e.g., the memory 806 or the NVM/storage 808) having instructions 814, and one or more processors 802 coupled with the one or more computer-readable media and configured to execute the instructions 814 to implement modules that perform the actions described in this disclosure.
For one embodiment, the control module 804 may include any suitable interface controller to provide any suitable interface to at least one of the processor(s) 802 and/or any suitable device or component in communication with the control module 804.
The control module 804 may include a memory controller module to provide an interface to the memory 806. The memory controller module may be a hardware module, a software module, and/or a firmware module.
The memory 806 may be used, for example, to load and store data and/or instructions 814 for the apparatus 800. For one embodiment, memory 806 may include any suitable volatile memory, such as suitable DRAM. In some embodiments, the memory 806 may comprise a double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, the control module 804 may include one or more input/output controllers to provide an interface to the NVM/storage 808 and input/output device(s) 810.
For example, the NVM/storage 808 may be used to store data and/or instructions 814. NVM/storage 808 may include any suitable non-volatile memory (e.g., flash memory) and/or may include any suitable non-volatile storage device(s) (e.g., one or more Hard Disk Drives (HDDs), one or more Compact Disc (CD) drives, and/or one or more Digital Versatile Disc (DVD) drives).
The NVM/storage 808 may include storage resources that are physically part of the device on which the apparatus 800 is installed, or it may be accessible by the device and may not necessarily be part of the device. For example, the NVM/storage 808 may be accessible over a network via the input/output device(s) 810.
Input/output device(s) 810 may provide an interface for the apparatus 800 to communicate with any other suitable device; the input/output devices 810 may include communication components, audio components, sensor components, and so forth. The network interface 812 may provide an interface for the apparatus 800 to communicate over one or more networks; the apparatus 800 may communicate wirelessly with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols, for example by accessing a wireless network based on a communication standard such as WiFi, 2G, 3G, 4G, or 5G, or a combination thereof.
For one embodiment, at least one of the processor(s) 802 may be packaged together with logic for one or more controller(s) (e.g., memory controller module) of the control module 804. For one embodiment, at least one of the processor(s) 802 may be packaged together with logic for one or more controller(s) of the control module 804 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 802 may be integrated on the same die with logic for one or more controller(s) of the control module 804. For one embodiment, at least one of the processor(s) 802 may be integrated on the same die with logic of one or more controllers of the control module 804 to form a system on a chip (SoC).
In various embodiments, the apparatus 800 may be, but is not limited to: a server, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.), or another terminal device. In various embodiments, the apparatus 800 may have more or fewer components and/or different architectures. For example, in some embodiments, the apparatus 800 includes one or more cameras, a keyboard, a liquid crystal display (LCD) screen (including a touch-screen display), a non-volatile memory port, multiple antennas, a graphics chip, an application-specific integrated circuit (ASIC), and speakers.
Where the apparatus is implemented as a detection device, a main control chip may serve as the processor or control module; sensor data, position information, and the like may be stored in the memory or the NVM/storage device; a sensor group may serve as the input/output devices; and the communication interface may include the network interface.
Since the device embodiments are substantially similar to the method embodiments, their description here is relatively brief; for relevant details, refer to the corresponding parts of the description of the method embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for parts that are the same or similar, the embodiments may be referred to one another.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other like elements in the process, method, article, or terminal that comprises the element.
The foregoing describes in detail the voice data processing method and apparatus, electronic device, and storage medium provided by the present application. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of these examples are intended only to aid understanding of the method and its core ideas. Meanwhile, a person skilled in the art may, according to the ideas of the present application, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (19)

1. A voice data processing method, the method comprising:
decoding voice data by adopting a plurality of decoders respectively, and determining a plurality of corresponding decoding results;
determining acoustic processing data corresponding to the voice data according to the plurality of decoding results and a screening rule, wherein the acoustic processing data comprises the voice data and corresponding annotation data;
determining a frame alignment result corresponding to the voice data according to the acoustic processing data and a set basic acoustic analyzer;
and returning the frame alignment result as training data.
2. The method of claim 1, wherein the screening rule comprises a voting rule, and wherein determining the acoustic processing data corresponding to the voice data according to the plurality of decoding results and the screening rule comprises:
determining the annotation data of the screened voice data according to the plurality of decoding results and the voting rule;
and taking the screened voice data and the corresponding annotation data as the acoustic processing data.
3. The method of claim 2, wherein determining the annotation data of the screened voice data according to the plurality of decoding results and the voting rule comprises:
voting according to the plurality of decoding results to obtain corresponding target voting results;
and when the target voting result of the target voice data meets a first threshold, obtaining the annotation data of the target voice data according to the decoding result corresponding to the target voice data.
4. The method of claim 3, wherein determining the annotation data of the screened voice data according to the plurality of decoding results and the voting rule further comprises:
and when the target voting result meets a second threshold, acquiring an annotation result of the target voice data.
5. The method of claim 3, wherein determining the annotation data of the screened voice data according to the plurality of decoding results and the voting rule further comprises:
and discarding the target voice data when the target voting result meets a third threshold.
6. The method of claim 1, wherein determining the frame alignment result corresponding to the voice data according to the acoustic processing data and the set basic acoustic analyzer comprises:
analyzing the voice data in the acoustic processing data according to the basic acoustic analyzer, and determining acoustic features;
and aligning, according to the basic acoustic analyzer, the acoustic features with the acoustic units of the annotation data to obtain the frame alignment result.
7. The method of claim 1, further comprising:
acquiring a voice data set corresponding to a target scene to obtain target training data corresponding to the target scene, wherein the target training data is used for training an acoustic analyzer corresponding to the target scene.
8. An acoustic analyzer training method, comprising:
decoding voice data by adopting a plurality of decoders respectively, and determining a plurality of corresponding decoding results;
determining acoustic processing data corresponding to the voice data according to the plurality of decoding results and a screening rule, wherein the acoustic processing data comprises the voice data and corresponding annotation data;
and training an acoustic analyzer according to the acoustic processing data to obtain the trained acoustic analyzer.
9. The method of claim 8, wherein the screening rule comprises a voting rule, and wherein determining the acoustic processing data corresponding to the voice data according to the plurality of decoding results and the screening rule comprises:
determining the annotation data of the screened voice data according to the plurality of decoding results and the voting rule;
and taking the screened voice data and the corresponding annotation data as the acoustic processing data.
10. The method of claim 9, wherein determining the annotation data of the screened voice data according to the plurality of decoding results and the voting rule comprises:
voting according to the plurality of decoding results to obtain corresponding target voting results;
and when the target voting result of the target voice data meets a first threshold, obtaining the annotation data of the target voice data according to the decoding result corresponding to the target voice data.
11. The method of claim 10, wherein determining the annotation data of the screened voice data according to the plurality of decoding results and the voting rule further comprises:
and when the target voting result meets a second threshold, acquiring an annotation result of the target voice data.
12. The method of claim 10, wherein determining the annotation data of the screened voice data according to the plurality of decoding results and the voting rule further comprises:
and discarding the target voice data when the target voting result meets a third threshold.
13. The method of claim 8, further comprising:
acquiring voice data corresponding to an output result of the trained acoustic analyzer, and screening the voice data through the plurality of decoders to serve as training data for optimizing the acoustic analyzer.
14. A voice data processing apparatus, characterized in that the apparatus comprises:
a decoding result acquisition module, configured to decode voice data with a plurality of decoders respectively and determine a plurality of corresponding decoding results;
an acoustic processing data acquisition module, configured to determine acoustic processing data corresponding to the voice data according to the plurality of decoding results and a screening rule, wherein the acoustic processing data comprises the voice data and corresponding annotation data;
an alignment result acquisition module, configured to determine, according to the acoustic processing data and a set basic acoustic analyzer, a frame alignment result corresponding to the voice data;
and a training data acquisition module, configured to return the frame alignment result as training data.
15. An acoustic analyzer training apparatus, comprising:
a decoding result obtaining module, configured to decode voice data with a plurality of decoders respectively and determine a plurality of corresponding decoding results;
an acoustic processing data obtaining module, configured to determine acoustic processing data corresponding to the voice data according to the plurality of decoding results and a screening rule, wherein the acoustic processing data comprises the voice data and corresponding annotation data;
and an analyzer generation module, configured to train an acoustic analyzer according to the acoustic processing data to obtain the trained acoustic analyzer.
16. An electronic device, comprising: a processor; and
a memory having executable code stored thereon which, when executed, causes the processor to perform the voice data processing method according to one or more of claims 1-7.
17. One or more machine-readable media having executable code stored thereon that, when executed, causes a processor to perform the voice data processing method according to one or more of claims 1-7.
18. An electronic device, comprising: a processor; and
a memory having executable code stored thereon that, when executed, causes the processor to perform the acoustic analyzer training method of one or more of claims 8-13.
19. One or more machine-readable media having executable code stored thereon that, when executed, causes a processor to perform the acoustic analyzer training method of one or more of claims 8-13.
CN202010082981.4A 2020-02-07 2020-02-07 Voice data processing method, device, equipment and storage medium Pending CN113314105A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010082981.4A CN113314105A (en) 2020-02-07 2020-02-07 Voice data processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010082981.4A CN113314105A (en) 2020-02-07 2020-02-07 Voice data processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113314105A true CN113314105A (en) 2021-08-27

Family

ID=77369802

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010082981.4A Pending CN113314105A (en) 2020-02-07 2020-02-07 Voice data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113314105A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108510990A (en) * 2018-07-04 2018-09-07 百度在线网络技术(北京)有限公司 Audio recognition method, device, user equipment and storage medium
CN109599095A (en) * 2018-11-21 2019-04-09 百度在线网络技术(北京)有限公司 A kind of mask method of voice data, device, equipment and computer storage medium
CN110097870A (en) * 2018-01-30 2019-08-06 阿里巴巴集团控股有限公司 Method of speech processing, device, equipment and storage medium
CN110503945A (en) * 2019-09-06 2019-11-26 北京金山数字娱乐科技有限公司 A kind of training method and device of speech processes model
CN110534095A (en) * 2019-08-22 2019-12-03 百度在线网络技术(北京)有限公司 Audio recognition method, device, equipment and computer readable storage medium
CN110610698A (en) * 2019-09-12 2019-12-24 上海依图信息技术有限公司 Voice labeling method and device

Similar Documents

Publication Publication Date Title
WO2021174757A1 (en) Method and apparatus for recognizing emotion in voice, electronic device and computer-readable storage medium
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
CN109272992B (en) Spoken language evaluation method and device for generating spoken language evaluation model
US20160337776A1 (en) Spatial error metrics of audio content
CN108281138B (en) Age discrimination model training and intelligent voice interaction method, equipment and storage medium
US9451304B2 (en) Sound feature priority alignment
CN110457432A (en) Interview methods of marking, device, equipment and storage medium
CN112951240B (en) Model training method, model training device, voice recognition method, voice recognition device, electronic equipment and storage medium
CN110097870B (en) Voice processing method, device, equipment and storage medium
CN114127849A (en) Speech emotion recognition method and device
CN111191445A (en) Advertisement text classification method and device
CN104700831B (en) The method and apparatus for analyzing the phonetic feature of audio file
CN116189671B (en) Data mining method and system for language teaching
CN115719058A (en) Content analysis method, electronic equipment and storage medium
CN113314105A (en) Voice data processing method, device, equipment and storage medium
US20220122596A1 (en) Method and system of automatic context-bound domain-specific speech recognition
CN110910905A (en) Mute point detection method and device, storage medium and electronic equipment
Chang et al. Using Machine Learning to Extract Insights from Consumer Data
Krokotsch et al. Generative adversarial networks and simulated+ unsupervised learning in affect recognition from speech
CN111883133A (en) Customer service voice recognition method, customer service voice recognition device, customer service voice recognition server and storage medium
Chan et al. A new speaker-diarization technology with denoising spectral-LSTM for online automatic multi-dialogue recording
CN111477248A (en) Audio noise detection method and device
Premalatha et al. Development of vanilla LSTM based stuttered speech recognition system using bald eagle search algorithm
Yoshida et al. Audio-visual voice activity detection based on an utterance state transition model
CN116386611B (en) Denoising method for teaching sound field environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination