CN114360517B - Audio processing method and device in complex environment and storage medium - Google Patents

Audio processing method and device in complex environment and storage medium

Info

Publication number
CN114360517B
CN114360517B (application CN202111551933.6A)
Authority
CN
China
Prior art keywords
neural network
audio
network model
training
training data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111551933.6A
Other languages
Chinese (zh)
Other versions
CN114360517A
Inventor
王伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iMusic Culture and Technology Co Ltd
Original Assignee
iMusic Culture and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iMusic Culture and Technology Co Ltd filed Critical iMusic Culture and Technology Co Ltd
Priority to CN202111551933.6A priority Critical patent/CN114360517B/en
Publication of CN114360517A publication Critical patent/CN114360517A/en
Application granted granted Critical
Publication of CN114360517B publication Critical patent/CN114360517B/en

Landscapes

  • Electrically Operated Instructional Devices (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an audio processing method, apparatus and storage medium for use in a complex environment. Audio training data are acquired, and a neural network model is trained with the audio training data and a word and sentence library. The neural network model is a deep neural network acoustic model that combines a time-delay recurrent neural network with a hidden Markov model, and it is trained together with a word and sentence library containing words and sentences commonly used in the dialogue scene, which improves speech recognition performance in noisy environments and increases robustness. Audio data of the person to be served are input into the trained neural network model to obtain output content, and the output content is played back to that person as speech, so the reply reaches the person to be served more accurately, the accuracy of communication is improved and errors are reduced.

Description

Audio processing method and device in complex environment and storage medium
Technical Field
The present invention relates to the field of audio processing, and in particular, to an audio processing method and apparatus in a complex environment, and a storage medium.
Background
With the rapid development of artificial intelligence, its application across industries has become increasingly widespread, and advanced artificial intelligence technology is applied in many scenes, especially in the service industry, for example milk tea shops, restaurants, snack bars and clothing shops. These places are crowded and noisy, and conversations between attendants and customers are often affected by background noise. Existing recognition methods usually achieve good results in quiet scenes, but because of the interference introduced by noise they cannot adequately solve the problem of recognizing service dialogue under high noise; conventional speech recognition models do not adapt to such conditions and their robustness is poor.
Disclosure of Invention
In view of the above, in order to solve at least one of the above technical problems, the present invention provides an audio processing method, an audio processing apparatus and a storage medium in a complex environment.
The embodiment of the invention adopts the technical scheme that:
an audio processing method in a complex environment, comprising:
acquiring audio training data;
training a neural network model through the audio training data and a word and sentence library; the neural network model is a deep neural network acoustic model combining a time-delay recurrent neural network with a hidden Markov model, and the word and sentence library comprises words or sentences commonly used in a dialogue scene;
and inputting audio data of a person to be served into the trained neural network model to obtain output content, and playing the output content to the person to be served by voice.
Further, the word and sentence library is determined by the following steps:
obtaining a dialogue corpus in a dialogue scene;
and performing intelligent template extraction and recognition on the dialogue corpus to obtain the word and sentence library.
Further, the training of the neural network model through the audio training data and the word and sentence library includes:
performing state clustering on the audio training data according to triphones to obtain state posteriors;
processing the audio training data according to the neural network model;
and training according to the processing result, the state posteriors and the word and sentence library.
Further, the performing state clustering on the audio training data according to triphones to obtain state posteriors includes:
performing state clustering on the audio training data according to a dictionary, a phoneme table and a keyword configuration file to obtain state posteriors; the keyword configuration file includes vocabulary of different domains, and the phoneme table includes pronunciation standards of different regions.
Further, the audio training data include real labels, and the training according to the processing result, the state posteriors and the word and sentence library comprises:
determining keywords according to the processing result and the state posteriors;
matching the keywords with the word and sentence library to determine a matching result;
and training the neural network model according to the matching result and the real labels.
Further, the training of the neural network model according to the matching result and the real labels includes:
updating the parameters of the neural network model iteratively through back propagation during training;
and when the number of iterative updates reaches a preset number, or the loss value calculated from the real labels and the matching result is smaller than a preset loss threshold, obtaining the trained neural network model from the iteratively updated parameters.
Further, the playing of the output content to the person to be served by voice includes:
converting the output content from text to speech and playing the speech to the person to be served.
An embodiment of the present invention further provides an audio processing apparatus in a complex environment, including:
the acquisition module is used for acquiring audio training data;
the training module is used for training a neural network model through the audio training data and a word and sentence library; the neural network model is a deep neural network acoustic model combining a time-delay recurrent neural network with a hidden Markov model, and the word and sentence library comprises words or sentences commonly used in a dialogue scene;
and the playing module is used for inputting audio data of a person to be served into the trained neural network model to obtain output content and playing the output content to the person to be served by voice.
An embodiment of the present invention further provides an audio processing apparatus in a complex environment, where the apparatus includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the method.
Embodiments of the present invention also provide a computer-readable storage medium, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the storage medium, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by a processor to implement the method.
The invention has the beneficial effects that: audio training data are acquired, and a neural network model is trained through the audio training data and a word and sentence library. The neural network model is a deep neural network acoustic model combining a time-delay recurrent neural network with a hidden Markov model, and it is trained together with a word and sentence library containing words or sentences commonly used in a dialogue scene, so the performance of speech recognition in a noisy environment is improved and robustness is increased. Audio data of the person to be served are input into the trained neural network model to obtain output content, which is played to the person to be served by voice, so the output content is delivered more accurately, the accuracy of communication with the person to be served is improved, and errors are reduced.
Drawings
FIG. 1 is a flowchart illustrating steps of an audio processing method under a complex environment according to the present invention;
FIG. 2 is a schematic diagram of TDNN and RNN according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present application better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
As shown in fig. 1, an embodiment of the present invention provides an audio processing method in a complex environment, including steps S100-S300:
and S100, acquiring audio training data.
Optionally, the complex environment refers to an environment with noise interference during communication; the embodiment of the present invention takes a scene in which a customer communicates with an attendant as an example. It should be noted that, in this case, the customer is the person to be served, the audio training data are audio collected in the scene where the customer communicates with the attendant, each question of the customer has a corresponding answer, and the attendant's answer is used as the real label.
And S200, training the neural network model through the audio training data and the word and sentence library.
In the embodiment of the invention, the neural network model is a deep neural network acoustic model combining a time-delay recurrent neural network with a hidden Markov model, and the word and sentence library comprises words or sentences commonly used in a dialogue scene.
Optionally, the creation of the word and sentence library in step S200 includes steps S211 to S212:
And S211, obtaining a dialogue corpus in a dialogue scene.
Optionally, taking a scene in which a customer communicates with an attendant as an example of the dialogue scene, a standard language specification is set for the attendant so that the attendant communicates according to it, and the content of the communication between the customer and the attendant is then collected to obtain the dialogue corpus.
S212, performing intelligent template extraction and recognition on the dialogue corpus to obtain the word and sentence library.
In the embodiment of the invention, the intelligent template is combined with speech recognition technology: standard commonly used words or sentences are extracted from the dialogue corpus, and the word and sentence library is then built from these commonly used words or sentences.
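For illustration only, the following is a minimal sketch of this step under the assumption that the dialogue corpus is a list of transcribed utterances: frequent words and sentences are counted and kept as the library. The jieba segmenter, the frequency thresholds and the function name build_word_sentence_library are choices made for the example and are not specified by this embodiment.

```python
# Minimal sketch: build a word and sentence library from dialogue transcripts.
# Assumes the corpus is a list of transcribed attendant/customer utterances;
# the frequency thresholds and the jieba segmenter are illustrative choices.
from collections import Counter
import jieba

def build_word_sentence_library(dialogue_corpus, min_word_count=5, min_sentence_count=3):
    word_counts = Counter()
    sentence_counts = Counter()
    for utterance in dialogue_corpus:
        sentence_counts[utterance.strip()] += 1
        word_counts.update(jieba.lcut(utterance))
    words = {w for w, c in word_counts.items() if c >= min_word_count and w.strip()}
    sentences = {s for s, c in sentence_counts.items() if c >= min_sentence_count}
    return {"words": words, "sentences": sentences}

# Example: library = build_word_sentence_library(["请问要加冰吗", "请问要加冰吗"])
```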
Optionally, step S200 includes steps S221 to S223:
S221, performing state clustering on the audio training data according to triphones to obtain state posteriors.
Optionally, a GMM-HMM acoustic model is constructed from monophones and/or triphones, a new HMM is initialized, and state clustering is performed on the audio training data according to the dictionary, the phoneme table and the keyword configuration file to obtain the state posteriors. It should be noted that the keyword configuration file includes vocabulary of different fields, and the phoneme table includes pronunciation standards of different regions; HMM is the hidden Markov model and GMM is the Gaussian mixture model.
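The posterior computation can be illustrated with a small sketch. It abstracts the triphone state clustering away and simply treats the clustered states as the hidden states of a GMM-HMM built with the hmmlearn library; the feature dimension, state count and mixture count are assumed values for the example.

```python
# Minimal sketch: obtain per-frame state posteriors from a GMM-HMM.
# The dictionary/phoneme-table/keyword-file driven clustering is abstracted
# away; the clustered states simply map onto the hidden states of one GMM-HMM.
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_gmm_hmm(features, n_states=3, n_mix=2):
    # features: (num_frames, feat_dim) acoustic features, e.g. MFCCs
    model = GMMHMM(n_components=n_states, n_mix=n_mix, covariance_type="diag", n_iter=20)
    model.fit(features)
    return model

def state_posteriors(model, features):
    # Posterior probability of each HMM state for every frame: (num_frames, n_states)
    return model.predict_proba(features)

# Example with random features standing in for real MFCC frames:
feats = np.random.randn(200, 13)
hmm = train_gmm_hmm(feats)
post = state_posteriors(hmm, feats)
```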
S222, processing the audio training data according to the neural network model.
Specifically, the GMM-HMM initialization model is used to determine which hidden Markov model state each frame of the audio training data corresponds to; the result is recorded as align-raw, and the Viterbi algorithm is used to force-align the states with the audio training data. The feature vectors corresponding to the speech frames of the audio training data are then used as the input of the neural network model, which processes the audio training data and determines the processing result. Optionally, the processing result is a word or a sentence; forward propagation is used during processing, and the softmax layer yields a pdf probability prediction corresponding to each feature vector of the audio training data, where each word or sentence corresponds to one pdf probability prediction.
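A minimal sketch of this alignment step is given below, assuming the GMM-HMM `hmm` from the previous sketch. hmmlearn's predict() performs Viterbi decoding, and a plain feed-forward network with a softmax stands in here for the full acoustic model shown in the next sketch; the layer sizes are illustrative.

```python
# Minimal sketch: Viterbi alignment of frames to HMM states, then using the
# aligned state ids as frame-level targets for a neural acoustic model.
import torch
import torch.nn as nn

def viterbi_align(hmm, features):
    # hmmlearn's predict() runs the Viterbi algorithm and returns the most
    # likely state sequence (the "align-raw" frame/state alignment).
    return hmm.predict(features)

feat_dim, num_pdfs = 13, 3
net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, num_pdfs))

frames = torch.randn(200, feat_dim)                    # acoustic feature frames
targets = torch.as_tensor(viterbi_align(hmm, frames.numpy()), dtype=torch.long)
logits = net(frames)
pdf_posteriors = torch.softmax(logits, dim=-1)         # per-frame pdf probabilities
loss = nn.functional.cross_entropy(logits, targets)    # frame-level training loss
```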
As shown in FIG. 2, optionally, when building the neural network model, context information is mainly modeled by a layered architecture in which each layer splices audio frames at a different time resolution, but the overall input context of the TDNN (time-delay neural network) is limited. To extend the TDNN training structure, in this architecture an additional RNN (recurrent neural network) layer is inserted in the middle of the TDNN; mixing the RNN with the TDNN yields the time-delay recurrent neural network, which makes better use of the contextual audio frames and further improves recognition accuracy. The labels t, t-n and t+n (n = 1, 2, ..., 6) in the figure correspond to the different time resolutions used when splicing audio frames.
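The following sketch illustrates one possible form of such a hybrid in PyTorch: TDNN layers realized as dilated 1-D convolutions over time, with a recurrent (LSTM) layer inserted in the middle of the stack. The layer widths, dilations, number of pdfs and the choice of an LSTM cell are assumptions for the example rather than parameters fixed by this embodiment.

```python
# Minimal sketch of the hybrid TDNN + RNN acoustic model described above.
import torch
import torch.nn as nn

class TDNNRNNAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, num_pdfs=2000):
        super().__init__()
        # TDNN front-end: each Conv1d splices frames at a different time resolution
        self.tdnn1 = nn.Conv1d(feat_dim, hidden, kernel_size=3, dilation=1, padding=1)
        self.tdnn2 = nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2, padding=2)
        # Recurrent layer inserted in the middle of the TDNN stack
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.tdnn3 = nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3, padding=3)
        self.output = nn.Linear(hidden, num_pdfs)

    def forward(self, feats):
        # feats: (batch, time, feat_dim)
        x = feats.transpose(1, 2)                  # (batch, feat_dim, time)
        x = torch.relu(self.tdnn2(torch.relu(self.tdnn1(x))))
        x, _ = self.rnn(x.transpose(1, 2))         # carry context across frames
        x = torch.relu(self.tdnn3(x.transpose(1, 2))).transpose(1, 2)
        return self.output(x)                      # per-frame pdf logits

model = TDNNRNNAcousticModel()
logits = model(torch.randn(4, 100, 40))            # 4 utterances, 100 frames each
```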
And S223, training according to the processing result, the state posteriors and the word and sentence library.
Optionally, step S223 includes steps S2231-S2233:
and S2231, determining the keywords according to the processing result and the posterior of the state.
Optionally, keyword matching is performed according to the processing result and the posterior of the state, and the keyword is determined.
And S2232, matching the keyword with the word library and sentence library to determine a matching result.
Optionally, a corresponding solution is searched from the word library and sentence library according to the keyword, and a keyword effective word and sentence is extracted to obtain a matching result.
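As a sketch of the matching step, the decoded keywords can be looked up in the word and sentence library built earlier; ranking candidate sentences by the number of matched keywords is an illustrative scoring choice, not one mandated by this embodiment.

```python
# Minimal sketch: look up decoded keywords in the word and sentence library and
# return the matching library entries ("valid words and sentences").
def match_keywords(keywords, library):
    matches = []
    for sentence in library["sentences"]:
        hits = [kw for kw in keywords if kw in sentence]
        if hits:
            matches.append((sentence, len(hits)))
    # Best match first: the sentence covering the most keywords
    return sorted(matches, key=lambda item: item[1], reverse=True)

# Example: match_keywords(["珍珠", "奶茶"], library) -> [("一杯珍珠奶茶", 2), ...]
```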
In the embodiment of the invention, combining the processing result with the dictionary, the phoneme table and the keyword configuration file and matching them against the word and sentence library enhances the audio data, which improves the performance of speech recognition in a noisy environment; comparing the output of the neural network model before and after this data enhancement shows a clear improvement in robustness. The time-delay recurrent neural network is combined with a deep neural network acoustic model based on the hidden Markov model, and the matching result is determined through keyword matching, which solves the problem of accurately matching keywords in each industry, improves the accuracy of information exchange between the attendant and the customer, reduces errors and improves service quality.
And S2233, training the neural network model according to the matching result and the real label.
Optionally, step S2233 includes steps S22331-S22332:
and S22331, back propagation and iterative updating of parameters of the neural network model in the training process.
Specifically, the parameters of the neural network model are continuously updated and updated in an iterative manner by back propagation in the training process so as to update the neural network model and improve the processing effect of the neural network model.
And S22332, when the number of iterative updates reaches a preset number, or the loss value calculated from the real labels and the matching result is smaller than a preset loss threshold, obtaining the trained neural network model from the iteratively updated parameters.
Optionally, when the number of iterative updates reaches the preset number, the iteration ends and the parameters from the last iterative update are taken as the final model parameters to obtain the trained neural network model; or the loss value is calculated from the real labels and the matching result through a loss function, the iteration ends when the loss value is smaller than the preset loss threshold, and the parameters from the last iterative update are taken as the final model parameters to obtain the trained neural network model. It should be noted that the preset number and the preset loss threshold may be adjusted as needed, and that training also ends when the loss value no longer changes or no longer decreases significantly; otherwise the updating continues.
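A minimal sketch of this training loop is shown below; the Adam optimizer, learning rate, cross-entropy loss and threshold values are assumptions made for the example.

```python
# Minimal sketch of the training loop: back-propagation with iterative parameter
# updates, stopping after a preset number of iterations or once the loss against
# the real labels falls below a preset threshold.
import torch

def train(model, batches, max_iters=10000, loss_threshold=0.05, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    for step, (feats, labels) in enumerate(batches):
        logits = model(feats)                          # (batch, time, num_pdfs)
        loss = criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
        optimizer.zero_grad()
        loss.backward()                                # back-propagate
        optimizer.step()                               # iterative parameter update
        if step + 1 >= max_iters or loss.item() < loss_threshold:
            break                                      # preset count or loss reached
    return model
```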
According to the embodiment of the invention, a deep neural network acoustic model based on a time-delay recurrent neural network and a hidden Markov model (namely the neural network model) is used. During training, the extracted preliminary result, the dictionary, the phoneme table and the keyword configuration file are compared and matched with the word and sentence library so as to extract the valid key words and sentences, and this optimized comparison and matching improves the accuracy of the extraction, so that the output of the trained neural network model contains no noise (or little noise) when used for voice playback and is clearer.
S300, inputting audio data of the person to be served into the trained neural network model to obtain output content, and playing the output content to the person to be served by voice.
Specifically, the output content of the neural network model is converted from text into speech and then played to the person to be served. It should be noted that the person to be served is the customer, so that the customer can obtain the required answer in a noisy complex environment, and the influence on communication caused by the attendant not hearing the customer clearly is avoided.
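For illustration, the text output can be synthesized and played back with an off-the-shelf engine; pyttsx3 is used here only as an example, as this embodiment does not name a specific text-to-speech component.

```python
# Minimal sketch: convert the model's text output to speech and play it back.
import pyttsx3

def play_output(text):
    engine = pyttsx3.init()
    engine.say(text)        # queue the reply text for synthesis
    engine.runAndWait()     # synthesize and play through the default audio device

# Example: play_output("好的，一杯珍珠奶茶，请稍等。")
```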
Optionally, when the trained neural network model extracts and recognizes the audio data of the person to be served, the text is segmented with a word-segmentation tool and each newly segmented word is looked up in the existing word and sentence library; segmented words that are found are added to the result. Two cases are handled for words that are not found: first, if the segmented text cannot be composed from phrases in the existing word and sentence library, the input is reprocessed or a preset default reply is output; second, if several rare words appear, the rare words are further segmented into shorter phrases, and the characters of longer phrases contained in the existing word and sentence library are arranged and combined in phrase order to obtain the specific words.
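A minimal sketch of this out-of-vocabulary handling is given below, assuming jieba as the word-segmentation tool and the library structure from the earlier sketch; the default reply text and the function name resolve_words are illustrative.

```python
# Minimal sketch: segment the recognized text, keep words found in the library,
# and fall back to a default reply when no library phrase covers the input.
import jieba

def resolve_words(text, library, default_reply="抱歉，请您再说一遍。"):
    tokens = jieba.lcut(text)
    known = [t for t in tokens if t in library["words"]]
    if not known:
        return default_reply   # no phrase in the library covers this input
    return known               # known phrases are recombined downstream
```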
An embodiment of the present invention further provides an audio processing apparatus in a complex environment, including:
the acquisition module is used for acquiring audio training data;
the training module is used for training the neural network model through the audio training data and the word and sentence library; the neural network model is a deep neural network acoustic model combining a time-delay recurrent neural network with a hidden Markov model, and the word and sentence library comprises words or sentences commonly used in a dialogue scene;
and the playing module is used for inputting audio data of the person to be served into the trained neural network model to obtain output content and playing the output content to the person to be served by voice.
The contents in the above method embodiments are all applicable to the present apparatus embodiment, the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.
The embodiment of the present invention further provides an audio processing apparatus in a complex environment, where the apparatus includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the audio processing method in the complex environment of the foregoing embodiment. Optionally, the device includes, but is not limited to, any smart terminal such as a mobile phone, a tablet computer, a computer, etc.
The contents in the above method embodiments are all applicable to the present apparatus embodiment, the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.
The embodiment of the present invention further provides a computer-readable storage medium, in which at least one instruction, at least one program, a code set, or a set of instructions is stored, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the audio processing method in a complex environment according to the foregoing embodiment.
Embodiments of the present invention also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the audio processing method in the complex environment of the foregoing embodiment.
The terms "first," "second," "third," "fourth," and the like (if any) in the description of the present application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that, in this application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form. Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application, which are essential or part of the technical solutions contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product, which is stored in a storage medium and includes multiple instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing programs, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (7)

1. An audio processing method under a complex environment, comprising:
acquiring audio training data;
training a neural network model through the audio training data and a word and sentence library; the neural network model is a deep neural network acoustic model combining a time-delay recurrent neural network with a hidden Markov model, and the word and sentence library comprises words or sentences commonly used in a dialogue scene;
inputting audio data of a person to be served into the trained neural network model to obtain output content, and playing the output content to the person to be served by voice;
wherein the word and sentence library is determined by the following steps:
obtaining a dialogue corpus in a dialogue scene;
performing intelligent template extraction and recognition on the dialogue corpus to obtain the word and sentence library;
the training of the neural network model through the audio training data and the word and sentence library comprises the following steps:
performing state clustering on the audio training data according to triphones to obtain state posteriors;
processing the audio training data according to the neural network model;
training according to the processing result, the state posteriors and the word and sentence library;
the audio training data comprise real labels, and the training according to the processing result, the state posteriors and the word and sentence library comprises the following steps:
determining keywords according to the processing result and the state posteriors;
matching the keywords with the word and sentence library to determine a matching result;
and training the neural network model according to the matching result and the real labels.
2. The audio processing method under a complex environment according to claim 1, wherein the performing state clustering on the audio training data according to triphones to obtain state posteriors comprises:
performing state clustering on the audio training data according to a dictionary, a phoneme table and a keyword configuration file to obtain state posteriors; the keyword configuration file includes vocabulary of different domains, and the phoneme table includes pronunciation standards of different regions.
3. The audio processing method under a complex environment according to claim 1, wherein the training of the neural network model according to the matching result and the real labels comprises:
updating the parameters of the neural network model iteratively through back propagation during training;
and when the number of iterative updates reaches a preset number, or the loss value calculated from the real labels and the matching result is smaller than a preset loss threshold, obtaining the trained neural network model from the iteratively updated parameters.
4. The audio processing method under a complex environment according to any one of claims 1 to 3, wherein the playing of the output content to the person to be served by voice comprises:
converting the output content from text to speech and playing the speech to the person to be served.
5. An audio processing apparatus under a complex environment, comprising:
the acquisition module is used for acquiring audio training data;
the training module is used for training a neural network model through the audio training data and a word and sentence library; the neural network model is a deep neural network acoustic model combining a time-delay recurrent neural network with a hidden Markov model, and the word and sentence library comprises words or sentences commonly used in a dialogue scene; wherein the word and sentence library is determined by the following steps: obtaining a dialogue corpus in a dialogue scene; performing intelligent template extraction and recognition on the dialogue corpus to obtain the word and sentence library; the training of the neural network model through the audio training data and the word and sentence library comprises the following steps: performing state clustering on the audio training data according to triphones to obtain state posteriors; processing the audio training data according to the neural network model; training according to the processing result, the state posteriors and the word and sentence library; the audio training data comprise real labels, and the training according to the processing result, the state posteriors and the word and sentence library comprises the following steps: determining keywords according to the processing result and the state posteriors; matching the keywords with the word and sentence library to determine a matching result; and training the neural network model according to the matching result and the real labels;
and the playing module is used for inputting audio data of the person to be served into the trained neural network model to obtain output content and playing the output content to the person to be served by voice.
6. An audio processing apparatus in a complex environment, the apparatus comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the method according to any one of claims 1-4.
7. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the method according to any one of claims 1-4.
CN202111551933.6A 2021-12-17 2021-12-17 Audio processing method and device in complex environment and storage medium Active CN114360517B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111551933.6A CN114360517B (en) 2021-12-17 2021-12-17 Audio processing method and device in complex environment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111551933.6A CN114360517B (en) 2021-12-17 2021-12-17 Audio processing method and device in complex environment and storage medium

Publications (2)

Publication Number Publication Date
CN114360517A (en) 2022-04-15
CN114360517B (en) 2023-04-18

Family

ID=81100109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111551933.6A Active CN114360517B (en) 2021-12-17 2021-12-17 Audio processing method and device in complex environment and storage medium

Country Status (1)

Country Link
CN (1) CN114360517B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065033A (en) * 2018-09-19 2018-12-21 华南理工大学 A kind of automatic speech recognition method based on random depth time-delay neural network model
CN109147774A (en) * 2018-09-19 2019-01-04 华南理工大学 A kind of improved Delayed Neural Networks acoustic model
CN109584896A (en) * 2018-11-01 2019-04-05 苏州奇梦者网络科技有限公司 A kind of speech chip and electronic equipment

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106550156A (en) * 2017-01-23 2017-03-29 苏州咖啦魔哆信息技术有限公司 A kind of artificial intelligence's customer service system and its implementation based on speech recognition
CN107680582B (en) * 2017-07-28 2021-03-26 平安科技(深圳)有限公司 Acoustic model training method, voice recognition method, device, equipment and medium
CN109427334A (en) * 2017-09-01 2019-03-05 王阅 A kind of man-machine interaction method and system based on artificial intelligence
CN108549637A (en) * 2018-04-19 2018-09-18 京东方科技集团股份有限公司 Method for recognizing semantics, device based on phonetic and interactive system
CN109410948A (en) * 2018-09-07 2019-03-01 北京三快在线科技有限公司 Communication means, device, system, computer equipment and readable storage medium storing program for executing
CN109410911A (en) * 2018-09-13 2019-03-01 何艳玲 Artificial intelligence learning method based on speech recognition
CN109599113A (en) * 2019-01-22 2019-04-09 北京百度网讯科技有限公司 Method and apparatus for handling information
CN110086946A (en) * 2019-03-15 2019-08-02 深圳壹账通智能科技有限公司 Intelligence chat sound control method, device, computer equipment and storage medium
CN110162610A (en) * 2019-04-16 2019-08-23 平安科技(深圳)有限公司 Intelligent robot answer method, device, computer equipment and storage medium
CN110827822A (en) * 2019-12-06 2020-02-21 广州易来特自动驾驶科技有限公司 Intelligent voice interaction method and device, travel terminal, equipment and medium
KR20210079666A (en) * 2019-12-20 2021-06-30 엘지전자 주식회사 Artificial intelligence apparatus for training acoustic model
CN111143535B (en) * 2019-12-27 2021-08-10 北京百度网讯科技有限公司 Method and apparatus for generating a dialogue model
CN111508498B (en) * 2020-04-09 2024-01-30 携程计算机技术(上海)有限公司 Conversational speech recognition method, conversational speech recognition system, electronic device, and storage medium
CN111683175B (en) * 2020-04-22 2021-03-09 北京捷通华声科技股份有限公司 Method, device, equipment and storage medium for automatically answering incoming call
CN112382290B (en) * 2020-11-20 2023-04-07 北京百度网讯科技有限公司 Voice interaction method, device, equipment and computer storage medium
CN112927682B (en) * 2021-04-16 2024-04-16 西安交通大学 Speech recognition method and system based on deep neural network acoustic model
CN113223504B (en) * 2021-04-30 2023-12-26 平安科技(深圳)有限公司 Training method, device, equipment and storage medium of acoustic model
CN113299294B (en) * 2021-05-26 2024-06-11 中国平安人寿保险股份有限公司 Task type dialogue robot interaction method, device, equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065033A (en) * 2018-09-19 2018-12-21 华南理工大学 A kind of automatic speech recognition method based on random depth time-delay neural network model
CN109147774A (en) * 2018-09-19 2019-01-04 华南理工大学 A kind of improved Delayed Neural Networks acoustic model
CN109584896A (en) * 2018-11-01 2019-04-05 苏州奇梦者网络科技有限公司 A kind of speech chip and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Dai Lirong; Zhang Shiliang; Huang Zhiying. Deep learning based speech recognition: current status and prospects. Journal of Data Acquisition and Processing, 2017, (02), full text. *

Also Published As

Publication number Publication date
CN114360517A (en) 2022-04-15

Similar Documents

Publication Publication Date Title
US10176804B2 (en) Analyzing textual data
CN109887497B (en) Modeling method, device and equipment for speech recognition
CN108847241A (en) It is method, electronic equipment and the storage medium of text by meeting speech recognition
CN112992125B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN111402862A (en) Voice recognition method, device, storage medium and equipment
CN112784696A (en) Lip language identification method, device, equipment and storage medium based on image identification
CN111881297A (en) Method and device for correcting voice recognition text
JP4499389B2 (en) Method and apparatus for generating decision tree questions for speech processing
CN111508497B (en) Speech recognition method, device, electronic equipment and storage medium
CN112349294B (en) Voice processing method and device, computer readable medium and electronic equipment
CN112133285B (en) Speech recognition method, device, storage medium and electronic equipment
JP3903993B2 (en) Sentiment recognition device, sentence emotion recognition method and program
CN114360517B (en) Audio processing method and device in complex environment and storage medium
Azim et al. Large vocabulary Arabic continuous speech recognition using tied states acoustic models
CN116978367A (en) Speech recognition method, device, electronic equipment and storage medium
JP4878220B2 (en) Model learning method, information extraction method, model learning device, information extraction device, model learning program, information extraction program, and recording medium recording these programs
JP5293607B2 (en) Abbreviation generation apparatus and program, and abbreviation generation method
CN112071304B (en) Semantic analysis method and device
JP4733436B2 (en) Word / semantic expression group database creation method, speech understanding method, word / semantic expression group database creation device, speech understanding device, program, and storage medium
CN113850290A (en) Text processing and model training method, device, equipment and storage medium
JP2006107353A (en) Information processor, information processing method, recording medium and program
JP4674609B2 (en) Information processing apparatus and method, program, and recording medium
KR100487718B1 (en) System and method for improving in-domain training data using out-of-domain data
CN117275458B (en) Speech generation method, device and equipment for intelligent customer service and storage medium
CN116434735A (en) Voice recognition method, and training method, device and equipment of acoustic model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant