CN114360517B - Audio processing method and device in complex environment and storage medium - Google Patents

Audio processing method and device in complex environment and storage medium

Info

Publication number
CN114360517B
CN114360517B (application CN202111551933.6A)
Authority
CN
China
Prior art keywords
neural network
audio
network model
training
training data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111551933.6A
Other languages
Chinese (zh)
Other versions
CN114360517A
Inventor
王伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iMusic Culture and Technology Co Ltd
Original Assignee
iMusic Culture and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iMusic Culture and Technology Co Ltd filed Critical iMusic Culture and Technology Co Ltd
Priority to CN202111551933.6A priority Critical patent/CN114360517B/en
Publication of CN114360517A publication Critical patent/CN114360517A/en
Application granted granted Critical
Publication of CN114360517B publication Critical patent/CN114360517B/en

Landscapes

  • Electrically Operated Instructional Devices (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an audio processing method, apparatus and storage medium for use in a complex environment. Audio training data are acquired, and a neural network model is trained with the audio training data and a word and sentence library. The neural network model is a deep neural network acoustic model that combines a time-delay recurrent neural network with a hidden Markov model, and it is trained together with a word and sentence library containing words and sentences commonly used in the dialogue scene, which improves speech recognition performance in noisy environments and increases robustness. Audio data of the person to be served are input into the trained neural network model to obtain output content, and the output content is played back to that person as speech, so the reply reaches the person to be served more accurately, the accuracy of communication is improved and errors are reduced.

Description

Audio processing method and device in complex environment and storage medium
Technical Field
The present invention relates to the field of audio processing, and in particular, to an audio processing method and apparatus in a complex environment, and a storage medium.
Background
With the rapid development of artificial intelligence, its application across industries has become increasingly widespread, and advanced artificial intelligence technology is applied in many scenes, especially in the service industry, for example milk tea shops, restaurants, snack bars and clothing shops. These places are crowded and noisy, and conversations between attendants and customers are often affected by background noise. Existing recognition methods usually achieve good results in quiet scenes, but because of the interference introduced by noise they cannot adequately solve the problem of recognizing service dialogue under high noise; conventional speech recognition models do not adapt to such conditions and their robustness is poor.
Disclosure of Invention
In view of the above, in order to solve at least one of the above technical problems, the present invention provides an audio processing method, an audio processing apparatus and a storage medium in a complex environment.
The embodiment of the invention adopts the technical scheme that:
an audio processing method in a complex environment, comprising:
acquiring audio training data;
training a neural network model through the audio training data and a word and sentence library; the neural network model is a deep neural network acoustic model combining a time-delay recurrent neural network with a hidden Markov model, and the word and sentence library comprises words or sentences commonly used in a dialogue scene;
and inputting audio data of a person to be served into the trained neural network model to obtain output content, and playing the output content to the person to be served by voice.
Further, the word and sentence library is determined by the following steps:
obtaining a dialogue corpus in a dialogue scene;
and performing intelligent template extraction and recognition on the dialogue corpus to obtain the word and sentence library.
Further, the training of the neural network model through the audio training data and the word and sentence library includes:
performing state clustering on the audio training data according to triphones to obtain state posteriors;
processing the audio training data according to the neural network model;
and training according to the processing result, the state posteriors and the word and sentence library.
Further, the performing state clustering on the audio training data according to triphones to obtain state posteriors includes:
performing state clustering on the audio training data according to a dictionary, a phoneme table and a keyword configuration file to obtain state posteriors; the keyword configuration file includes vocabulary of different domains, and the phoneme table includes pronunciation standards of different regions.
Further, the audio training data include real labels, and the training according to the processing result, the state posteriors and the word and sentence library comprises:
determining keywords according to the processing result and the state posteriors;
matching the keywords with the word and sentence library to determine a matching result;
and training the neural network model according to the matching result and the real labels.
Further, the training of the neural network model according to the matching result and the real labels includes:
updating the parameters of the neural network model iteratively through back propagation during training;
and when the number of iterative updates reaches a preset number, or the loss value calculated from the real labels and the matching result is smaller than a preset loss threshold, obtaining the trained neural network model from the iteratively updated parameters.
Further, the playing of the output content to the person to be served by voice includes:
converting the output content from text to speech and playing the speech to the person to be served.
An embodiment of the present invention further provides an audio processing apparatus in a complex environment, including:
the acquisition module is used for acquiring audio training data;
the training module is used for training a neural network model through the audio training data and a word and sentence library; the neural network model is a deep neural network acoustic model combining a time-delay recurrent neural network with a hidden Markov model, and the word and sentence library comprises words or sentences commonly used in a dialogue scene;
and the playing module is used for inputting audio data of a person to be served into the trained neural network model to obtain output content and playing the output content to the person to be served by voice.
An embodiment of the present invention further provides an audio processing apparatus in a complex environment, where the apparatus includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the method.
Embodiments of the present invention also provide a computer-readable storage medium, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the storage medium, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by a processor to implement the method.
The invention has the beneficial effects that: audio training data are acquired, and a neural network model is trained through the audio training data and a word and sentence library. The neural network model is a deep neural network acoustic model combining a time-delay recurrent neural network with a hidden Markov model, and it is trained together with a word and sentence library containing words or sentences commonly used in a dialogue scene, so the performance of speech recognition in a noisy environment is improved and robustness is increased. Audio data of the person to be served are input into the trained neural network model to obtain output content, which is played to the person to be served by voice, so the output content is delivered more accurately, the accuracy of communication with the person to be served is improved, and errors are reduced.
Drawings
FIG. 1 is a flowchart illustrating steps of an audio processing method under a complex environment according to the present invention;
FIG. 2 is a schematic diagram of TDNN and RNN according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present application better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
As shown in fig. 1, an embodiment of the present invention provides an audio processing method in a complex environment, including steps S100-S300:
and S100, acquiring audio training data.
Optionally, the complex environment refers to an environment with noise interference during communication; the embodiment of the present invention takes a scene in which a customer communicates with an attendant as an example. It should be noted that, in this case, the customer is the person to be served, the audio training data are audio collected in the scene where the customer communicates with the attendant, each question of the customer has a corresponding answer, and the attendant's answer is used as the real label.
And S200, training the neural network model through the audio training data and the word and sentence library.
In the embodiment of the invention, the neural network model is a deep neural network acoustic model combining a time-delay recurrent neural network with a hidden Markov model, and the word and sentence library comprises words or sentences commonly used in a dialogue scene.
Optionally, the creation of the word and sentence library in step S200 includes steps S211 to S212:
And S211, obtaining a dialogue corpus in a dialogue scene.
Optionally, taking a scene in which a customer communicates with an attendant as an example of the dialogue scene, a standard language specification is set for the attendant so that the attendant communicates according to it, and the content of the communication between the customer and the attendant is then collected to obtain the dialogue corpus.
S212, performing intelligent template extraction and recognition on the dialogue corpus to obtain the word and sentence library.
In the embodiment of the invention, the intelligent template is combined with speech recognition technology: standard commonly used words or sentences are extracted from the dialogue corpus, and the word and sentence library is then built from these commonly used words or sentences.
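For illustration only, the following is a minimal sketch of this step under the assumption that the dialogue corpus is a list of transcribed utterances: frequent words and sentences are counted and kept as the library. The jieba segmenter, the frequency thresholds and the function name build_word_sentence_library are choices made for the example and are not specified by this embodiment.

```python
# Minimal sketch: build a word and sentence library from dialogue transcripts.
# Assumes the corpus is a list of transcribed attendant/customer utterances;
# the frequency thresholds and the jieba segmenter are illustrative choices.
from collections import Counter
import jieba

def build_word_sentence_library(dialogue_corpus, min_word_count=5, min_sentence_count=3):
    word_counts = Counter()
    sentence_counts = Counter()
    for utterance in dialogue_corpus:
        sentence_counts[utterance.strip()] += 1
        word_counts.update(jieba.lcut(utterance))
    words = {w for w, c in word_counts.items() if c >= min_word_count and w.strip()}
    sentences = {s for s, c in sentence_counts.items() if c >= min_sentence_count}
    return {"words": words, "sentences": sentences}

# Example: library = build_word_sentence_library(["请问要加冰吗", "请问要加冰吗"])
```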
Optionally, step S200 includes steps S221 to S223:
S221, performing state clustering on the audio training data according to triphones to obtain state posteriors.
Optionally, a GMM-HMM acoustic model is constructed from monophones and/or triphones, a new HMM is initialized, and state clustering is performed on the audio training data according to the dictionary, the phoneme table and the keyword configuration file to obtain the state posteriors. It should be noted that the keyword configuration file includes vocabulary of different fields, and the phoneme table includes pronunciation standards of different regions; HMM is the hidden Markov model and GMM is the Gaussian mixture model.
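The posterior computation can be illustrated with a small sketch. It abstracts the triphone state clustering away and simply treats the clustered states as the hidden states of a GMM-HMM built with the hmmlearn library; the feature dimension, state count and mixture count are assumed values for the example.

```python
# Minimal sketch: obtain per-frame state posteriors from a GMM-HMM.
# The dictionary/phoneme-table/keyword-file driven clustering is abstracted
# away; the clustered states simply map onto the hidden states of one GMM-HMM.
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_gmm_hmm(features, n_states=3, n_mix=2):
    # features: (num_frames, feat_dim) acoustic features, e.g. MFCCs
    model = GMMHMM(n_components=n_states, n_mix=n_mix, covariance_type="diag", n_iter=20)
    model.fit(features)
    return model

def state_posteriors(model, features):
    # Posterior probability of each HMM state for every frame: (num_frames, n_states)
    return model.predict_proba(features)

# Example with random features standing in for real MFCC frames:
feats = np.random.randn(200, 13)
hmm = train_gmm_hmm(feats)
post = state_posteriors(hmm, feats)
```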
S222, processing the audio training data according to the neural network model.
Specifically, the GMM-HMM initialization model is used to determine which hidden Markov model state each frame of the audio training data corresponds to; the result is recorded as align-raw, and the Viterbi algorithm is used to force-align the states with the audio training data. The feature vectors corresponding to the speech frames of the audio training data are then used as the input of the neural network model, which processes the audio training data and determines the processing result. Optionally, the processing result is a word or a sentence; forward propagation is used during processing, and the softmax layer yields a pdf probability prediction corresponding to each feature vector of the audio training data, where each word or sentence corresponds to one pdf probability prediction.
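A minimal sketch of this alignment step is given below, assuming the GMM-HMM `hmm` from the previous sketch. hmmlearn's predict() performs Viterbi decoding, and a plain feed-forward network with a softmax stands in here for the full acoustic model shown in the next sketch; the layer sizes are illustrative.

```python
# Minimal sketch: Viterbi alignment of frames to HMM states, then using the
# aligned state ids as frame-level targets for a neural acoustic model.
import torch
import torch.nn as nn

def viterbi_align(hmm, features):
    # hmmlearn's predict() runs the Viterbi algorithm and returns the most
    # likely state sequence (the "align-raw" frame/state alignment).
    return hmm.predict(features)

feat_dim, num_pdfs = 13, 3
net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, num_pdfs))

frames = torch.randn(200, feat_dim)                    # acoustic feature frames
targets = torch.as_tensor(viterbi_align(hmm, frames.numpy()), dtype=torch.long)
logits = net(frames)
pdf_posteriors = torch.softmax(logits, dim=-1)         # per-frame pdf probabilities
loss = nn.functional.cross_entropy(logits, targets)    # frame-level training loss
```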
As shown in FIG. 2, optionally, when building the neural network model, context information is mainly modeled by a layered architecture in which each layer splices audio frames at a different time resolution, but the overall input context of the TDNN (time-delay neural network) is limited. To extend the TDNN training structure, in this architecture an additional RNN (recurrent neural network) layer is inserted in the middle of the TDNN; mixing the RNN with the TDNN yields the time-delay recurrent neural network, which makes better use of the contextual audio frames and further improves recognition accuracy. The labels t, t-n and t+n (n = 1, 2, ..., 6) in the figure correspond to the different time resolutions used when splicing audio frames.
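The following sketch illustrates one possible form of such a hybrid in PyTorch: TDNN layers realized as dilated 1-D convolutions over time, with a recurrent (LSTM) layer inserted in the middle of the stack. The layer widths, dilations, number of pdfs and the choice of an LSTM cell are assumptions for the example rather than parameters fixed by this embodiment.

```python
# Minimal sketch of the hybrid TDNN + RNN acoustic model described above.
import torch
import torch.nn as nn

class TDNNRNNAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, num_pdfs=2000):
        super().__init__()
        # TDNN front-end: each Conv1d splices frames at a different time resolution
        self.tdnn1 = nn.Conv1d(feat_dim, hidden, kernel_size=3, dilation=1, padding=1)
        self.tdnn2 = nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2, padding=2)
        # Recurrent layer inserted in the middle of the TDNN stack
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.tdnn3 = nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3, padding=3)
        self.output = nn.Linear(hidden, num_pdfs)

    def forward(self, feats):
        # feats: (batch, time, feat_dim)
        x = feats.transpose(1, 2)                  # (batch, feat_dim, time)
        x = torch.relu(self.tdnn2(torch.relu(self.tdnn1(x))))
        x, _ = self.rnn(x.transpose(1, 2))         # carry context across frames
        x = torch.relu(self.tdnn3(x.transpose(1, 2))).transpose(1, 2)
        return self.output(x)                      # per-frame pdf logits

model = TDNNRNNAcousticModel()
logits = model(torch.randn(4, 100, 40))            # 4 utterances, 100 frames each
```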
And S223, training according to the processing result, the state posteriors and the word and sentence library.
Optionally, step S223 includes steps S2231-S2233:
and S2231, determining the keywords according to the processing result and the posterior of the state.
Optionally, keyword matching is performed according to the processing result and the posterior of the state, and the keyword is determined.
And S2232, matching the keyword with the word library and sentence library to determine a matching result.
Optionally, a corresponding solution is searched from the word library and sentence library according to the keyword, and a keyword effective word and sentence is extracted to obtain a matching result.
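As a sketch of the matching step, the decoded keywords can be looked up in the word and sentence library built earlier; ranking candidate sentences by the number of matched keywords is an illustrative scoring choice, not one mandated by this embodiment.

```python
# Minimal sketch: look up decoded keywords in the word and sentence library and
# return the matching library entries ("valid words and sentences").
def match_keywords(keywords, library):
    matches = []
    for sentence in library["sentences"]:
        hits = [kw for kw in keywords if kw in sentence]
        if hits:
            matches.append((sentence, len(hits)))
    # Best match first: the sentence covering the most keywords
    return sorted(matches, key=lambda item: item[1], reverse=True)

# Example: match_keywords(["珍珠", "奶茶"], library) -> [("一杯珍珠奶茶", 2), ...]
```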
In the embodiment of the invention, combining the processing result with the dictionary, the phoneme table and the keyword configuration file and matching them against the word and sentence library enhances the audio data, which improves the performance of speech recognition in a noisy environment; comparing the output of the neural network model before and after this data enhancement shows a clear improvement in robustness. The time-delay recurrent neural network is combined with a deep neural network acoustic model based on the hidden Markov model, and the matching result is determined through keyword matching, which solves the problem of accurately matching keywords in each industry, improves the accuracy of information exchange between the attendant and the customer, reduces errors and improves service quality.
And S2233, training the neural network model according to the matching result and the real label.
Optionally, step S2233 includes steps S22331-S22332:
and S22331, back propagation and iterative updating of parameters of the neural network model in the training process.
Specifically, the parameters of the neural network model are continuously updated and updated in an iterative manner by back propagation in the training process so as to update the neural network model and improve the processing effect of the neural network model.
And S22332, when the number of iterative updates reaches a preset number, or the loss value calculated from the real labels and the matching result is smaller than a preset loss threshold, obtaining the trained neural network model from the iteratively updated parameters.
Optionally, when the number of iterative updates reaches the preset number, the iteration ends and the parameters from the last iterative update are taken as the final model parameters to obtain the trained neural network model; or the loss value is calculated from the real labels and the matching result through a loss function, the iteration ends when the loss value is smaller than the preset loss threshold, and the parameters from the last iterative update are taken as the final model parameters to obtain the trained neural network model. It should be noted that the preset number and the preset loss threshold may be adjusted as needed, and that training also ends when the loss value no longer changes or no longer decreases significantly; otherwise the updating continues.
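A minimal sketch of this training loop is shown below; the Adam optimizer, learning rate, cross-entropy loss and threshold values are assumptions made for the example.

```python
# Minimal sketch of the training loop: back-propagation with iterative parameter
# updates, stopping after a preset number of iterations or once the loss against
# the real labels falls below a preset threshold.
import torch

def train(model, batches, max_iters=10000, loss_threshold=0.05, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    for step, (feats, labels) in enumerate(batches):
        logits = model(feats)                          # (batch, time, num_pdfs)
        loss = criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
        optimizer.zero_grad()
        loss.backward()                                # back-propagate
        optimizer.step()                               # iterative parameter update
        if step + 1 >= max_iters or loss.item() < loss_threshold:
            break                                      # preset count or loss reached
    return model
```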
According to the embodiment of the invention, a deep neural network acoustic model based on a time-delay recurrent neural network and a hidden Markov model (namely the neural network model) is used. During training, the extracted preliminary result, the dictionary, the phoneme table and the keyword configuration file are compared and matched with the word and sentence library so as to extract the valid key words and sentences, and this optimized comparison and matching improves the accuracy of the extraction, so that the output of the trained neural network model contains no noise (or little noise) when used for voice playback and is clearer.
S300, inputting audio data of the person to be served into the trained neural network model to obtain output content, and playing the output content to the person to be served by voice.
Specifically, the output content of the neural network model is converted from text into speech and then played to the person to be served. It should be noted that the person to be served is the customer, so that the customer can obtain the required answer in a noisy complex environment, and the influence on communication caused by the attendant not hearing the customer clearly is avoided.
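For illustration, the text output can be synthesized and played back with an off-the-shelf engine; pyttsx3 is used here only as an example, as this embodiment does not name a specific text-to-speech component.

```python
# Minimal sketch: convert the model's text output to speech and play it back.
import pyttsx3

def play_output(text):
    engine = pyttsx3.init()
    engine.say(text)        # queue the reply text for synthesis
    engine.runAndWait()     # synthesize and play through the default audio device

# Example: play_output("好的，一杯珍珠奶茶，请稍等。")
```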
Optionally, when the trained neural network model extracts and recognizes the audio data of the person to be served, the text is segmented with a word-segmentation tool and each newly segmented word is looked up in the existing word and sentence library; segmented words that are found are added to the result. Two cases are handled for words that are not found: first, if the segmented text cannot be composed from phrases in the existing word and sentence library, the input is reprocessed or a preset default reply is output; second, if several rare words appear, the rare words are further segmented into shorter phrases, and the characters of longer phrases contained in the existing word and sentence library are arranged and combined in phrase order to obtain the specific words.
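A minimal sketch of this out-of-vocabulary handling is given below, assuming jieba as the word-segmentation tool and the library structure from the earlier sketch; the default reply text and the function name resolve_words are illustrative.

```python
# Minimal sketch: segment the recognized text, keep words found in the library,
# and fall back to a default reply when no library phrase covers the input.
import jieba

def resolve_words(text, library, default_reply="抱歉，请您再说一遍。"):
    tokens = jieba.lcut(text)
    known = [t for t in tokens if t in library["words"]]
    if not known:
        return default_reply   # no phrase in the library covers this input
    return known               # known phrases are recombined downstream
```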
An embodiment of the present invention further provides an audio processing apparatus in a complex environment, including:
the acquisition module is used for acquiring audio training data;
the training module is used for training the neural network model through the audio training data and the word and sentence library; the neural network model is a deep neural network acoustic model combining a time-delay recurrent neural network with a hidden Markov model, and the word and sentence library comprises words or sentences commonly used in a dialogue scene;
and the playing module is used for inputting audio data of the person to be served into the trained neural network model to obtain output content and playing the output content to the person to be served by voice.
The contents in the above method embodiments are all applicable to the present apparatus embodiment, the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.
The embodiment of the present invention further provides an audio processing apparatus in a complex environment, where the apparatus includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the audio processing method in the complex environment of the foregoing embodiment. Optionally, the device includes, but is not limited to, any smart terminal such as a mobile phone, a tablet computer, a computer, etc.
The contents in the above method embodiments are all applicable to the present apparatus embodiment, the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.
The embodiment of the present invention further provides a computer-readable storage medium, in which at least one instruction, at least one program, a code set, or a set of instructions is stored, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the audio processing method in a complex environment according to the foregoing embodiment.
Embodiments of the present invention also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the audio processing method in the complex environment of the foregoing embodiment.
The terms "first," "second," "third," "fourth," and the like (if any) in the description of the present application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that, in this application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form. Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application, which are essential or part of the technical solutions contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product, which is stored in a storage medium and includes multiple instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing programs, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (7)

1. An audio processing method under a complex environment, comprising:
acquiring audio training data;
training a neural network model through the audio training data and a word and sentence library; the neural network model is a deep neural network acoustic model combining a time-delay recurrent neural network with a hidden Markov model, and the word and sentence library comprises words or sentences commonly used in a dialogue scene;
inputting audio data of a person to be served into the trained neural network model to obtain output content, and playing the output content to the person to be served by voice;
wherein the word and sentence library is determined by the following steps:
obtaining a dialogue corpus in a dialogue scene;
performing intelligent template extraction and recognition on the dialogue corpus to obtain the word and sentence library;
the training of the neural network model through the audio training data and the word and sentence library comprises the following steps:
performing state clustering on the audio training data according to triphones to obtain state posteriors;
processing the audio training data according to the neural network model;
training according to the processing result, the state posteriors and the word and sentence library;
the audio training data comprise real labels, and the training according to the processing result, the state posteriors and the word and sentence library comprises the following steps:
determining keywords according to the processing result and the state posteriors;
matching the keywords with the word and sentence library to determine a matching result;
and training the neural network model according to the matching result and the real labels.
2. The audio processing method under a complex environment according to claim 1, wherein the performing state clustering on the audio training data according to triphones to obtain state posteriors comprises:
performing state clustering on the audio training data according to a dictionary, a phoneme table and a keyword configuration file to obtain state posteriors; the keyword configuration file includes vocabulary of different domains, and the phoneme table includes pronunciation standards of different regions.
3. The audio processing method under a complex environment according to claim 1, wherein the training of the neural network model according to the matching result and the real labels comprises:
updating the parameters of the neural network model iteratively through back propagation during training;
and when the number of iterative updates reaches a preset number, or the loss value calculated from the real labels and the matching result is smaller than a preset loss threshold, obtaining the trained neural network model from the iteratively updated parameters.
4. The audio processing method under a complex environment according to any one of claims 1 to 3, wherein the playing of the output content to the person to be served by voice comprises:
converting the output content from text to speech and playing the speech to the person to be served.
5. An audio processing apparatus under a complex environment, comprising:
the acquisition module is used for acquiring audio training data;
the training module is used for training a neural network model through the audio training data and a word and sentence library; the neural network model is a deep neural network acoustic model combining a time-delay recurrent neural network with a hidden Markov model, and the word and sentence library comprises words or sentences commonly used in a dialogue scene; wherein the word and sentence library is determined by the following steps: obtaining a dialogue corpus in a dialogue scene; performing intelligent template extraction and recognition on the dialogue corpus to obtain the word and sentence library; the training of the neural network model through the audio training data and the word and sentence library comprises the following steps: performing state clustering on the audio training data according to triphones to obtain state posteriors; processing the audio training data according to the neural network model; training according to the processing result, the state posteriors and the word and sentence library; the audio training data comprise real labels, and the training according to the processing result, the state posteriors and the word and sentence library comprises the following steps: determining keywords according to the processing result and the state posteriors; matching the keywords with the word and sentence library to determine a matching result; and training the neural network model according to the matching result and the real labels;
and the playing module is used for inputting audio data of the person to be served into the trained neural network model to obtain output content and playing the output content to the person to be served by voice.
6. An audio processing apparatus in a complex environment, the apparatus comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the method according to any one of claims 1-4.
7. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the method according to any one of claims 1-4.
CN202111551933.6A 2021-12-17 2021-12-17 Audio processing method and device in complex environment and storage medium Active CN114360517B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111551933.6A CN114360517B (en) 2021-12-17 2021-12-17 Audio processing method and device in complex environment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111551933.6A CN114360517B (en) 2021-12-17 2021-12-17 Audio processing method and device in complex environment and storage medium

Publications (2)

Publication Number Publication Date
CN114360517A (en) 2022-04-15
CN114360517B (en) 2023-04-18

Family

ID=81100109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111551933.6A Active CN114360517B (en) 2021-12-17 2021-12-17 Audio processing method and device in complex environment and storage medium

Country Status (1)

Country Link
CN (1) CN114360517B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065033A (en) * 2018-09-19 2018-12-21 华南理工大学 A kind of automatic speech recognition method based on random depth time-delay neural network model
CN109147774A (en) * 2018-09-19 2019-01-04 华南理工大学 A kind of improved Delayed Neural Networks acoustic model
CN109584896A (en) * 2018-11-01 2019-04-05 苏州奇梦者网络科技有限公司 A kind of speech chip and electronic equipment

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106550156A (en) * 2017-01-23 2017-03-29 苏州咖啦魔哆信息技术有限公司 A kind of artificial intelligence's customer service system and its implementation based on speech recognition
CN107680582B (en) * 2017-07-28 2021-03-26 平安科技(深圳)有限公司 Acoustic model training method, voice recognition method, device, equipment and medium
CN109427334A (en) * 2017-09-01 2019-03-05 王阅 A kind of man-machine interaction method and system based on artificial intelligence
CN108549637A (en) * 2018-04-19 2018-09-18 京东方科技集团股份有限公司 Method for recognizing semantics, device based on phonetic and interactive system
CN109410948A (en) * 2018-09-07 2019-03-01 北京三快在线科技有限公司 Communication means, device, system, computer equipment and readable storage medium storing program for executing
CN109410911A (en) * 2018-09-13 2019-03-01 何艳玲 Artificial intelligence learning method based on speech recognition
CN109599113A (en) * 2019-01-22 2019-04-09 北京百度网讯科技有限公司 Method and apparatus for handling information
CN110086946A (en) * 2019-03-15 2019-08-02 深圳壹账通智能科技有限公司 Intelligence chat sound control method, device, computer equipment and storage medium
CN110162610A (en) * 2019-04-16 2019-08-23 平安科技(深圳)有限公司 Intelligent robot answer method, device, computer equipment and storage medium
CN110827822A (en) * 2019-12-06 2020-02-21 广州易来特自动驾驶科技有限公司 Intelligent voice interaction method and device, travel terminal, equipment and medium
KR20210079666A (en) * 2019-12-20 2021-06-30 엘지전자 주식회사 Artificial intelligence apparatus for training acoustic model
CN111143535B (en) * 2019-12-27 2021-08-10 北京百度网讯科技有限公司 Method and apparatus for generating a dialogue model
CN111508498B (en) * 2020-04-09 2024-01-30 携程计算机技术(上海)有限公司 Conversational speech recognition method, conversational speech recognition system, electronic device, and storage medium
CN111683175B (en) * 2020-04-22 2021-03-09 北京捷通华声科技股份有限公司 Method, device, equipment and storage medium for automatically answering incoming call
CN112382290B (en) * 2020-11-20 2023-04-07 北京百度网讯科技有限公司 Voice interaction method, device, equipment and computer storage medium
CN112927682B (en) * 2021-04-16 2024-04-16 西安交通大学 Speech recognition method and system based on deep neural network acoustic model
CN113223504B (en) * 2021-04-30 2023-12-26 平安科技(深圳)有限公司 Training method, device, equipment and storage medium of acoustic model
CN113299294B (en) * 2021-05-26 2024-06-11 中国平安人寿保险股份有限公司 Task type dialogue robot interaction method, device, equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065033A (en) * 2018-09-19 2018-12-21 华南理工大学 A kind of automatic speech recognition method based on random depth time-delay neural network model
CN109147774A (en) * 2018-09-19 2019-01-04 华南理工大学 A kind of improved Delayed Neural Networks acoustic model
CN109584896A (en) * 2018-11-01 2019-04-05 苏州奇梦者网络科技有限公司 A kind of speech chip and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Dai Lirong; Zhang Shiliang; Huang Zhiying. Deep learning based speech recognition: current status and prospects. Journal of Data Acquisition and Processing, 2017, (02), full text. *

Also Published As

Publication number Publication date
CN114360517A (en) 2022-04-15

Similar Documents

Publication Publication Date Title
US10176804B2 (en) Analyzing textual data
CN109887497B (en) Modeling method, device and equipment for speech recognition
CN108847241A (en) It is method, electronic equipment and the storage medium of text by meeting speech recognition
CN112992125B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN111402862A (en) Voice recognition method, device, storage medium and equipment
CN112784696A (en) Lip language identification method, device, equipment and storage medium based on image identification
CN111881297A (en) Method and device for correcting voice recognition text
JP4499389B2 (en) Method and apparatus for generating decision tree questions for speech processing
CN111508497B (en) Speech recognition method, device, electronic equipment and storage medium
CN112349294B (en) Voice processing method and device, computer readable medium and electronic equipment
CN112133285B (en) Speech recognition method, device, storage medium and electronic equipment
JP3903993B2 (en) Sentiment recognition device, sentence emotion recognition method and program
CN114360517B (en) Audio processing method and device in complex environment and storage medium
Azim et al. Large vocabulary Arabic continuous speech recognition using tied states acoustic models
CN116978367A (en) Speech recognition method, device, electronic equipment and storage medium
JP4878220B2 (en) Model learning method, information extraction method, model learning device, information extraction device, model learning program, information extraction program, and recording medium recording these programs
JP5293607B2 (en) Abbreviation generation apparatus and program, and abbreviation generation method
CN112071304B (en) Semantic analysis method and device
JP4733436B2 (en) Word / semantic expression group database creation method, speech understanding method, word / semantic expression group database creation device, speech understanding device, program, and storage medium
CN113850290A (en) Text processing and model training method, device, equipment and storage medium
JP2006107353A (en) Information processor, information processing method, recording medium and program
JP4674609B2 (en) Information processing apparatus and method, program, and recording medium
KR100487718B1 (en) System and method for improving in-domain training data using out-of-domain data
CN117275458B (en) Speech generation method, device and equipment for intelligent customer service and storage medium
CN116434735A (en) Voice recognition method, and training method, device and equipment of acoustic model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant