CN114360517A - Audio processing method and device in complex environment and storage medium

Audio processing method and device in complex environment and storage medium

Info

Publication number
CN114360517A
Authority
CN
China
Prior art keywords
neural network
audio
network model
training
training data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111551933.6A
Other languages
Chinese (zh)
Other versions
CN114360517B (en)
Inventor
王伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iMusic Culture and Technology Co Ltd
Original Assignee
iMusic Culture and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iMusic Culture and Technology Co Ltd filed Critical iMusic Culture and Technology Co Ltd
Priority to CN202111551933.6A priority Critical patent/CN114360517B/en
Publication of CN114360517A publication Critical patent/CN114360517A/en
Application granted granted Critical
Publication of CN114360517B publication Critical patent/CN114360517B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Electrically Operated Instructional Devices (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an audio processing method, apparatus, and storage medium for use in a complex environment. Audio training data is acquired, and a neural network model is trained using the audio training data together with a word bank and sentence library. The neural network model is a deep neural network acoustic model combining a time-lag recurrent neural network with a hidden Markov model, and it is trained with a word bank and sentence library containing words and sentences commonly used in dialogue scenarios, so that speech recognition performance in noisy environments is improved and robustness is increased. The audio data of the person to be served is input into the trained neural network model to obtain output content, and the output content is played back to that person as speech, so that the output content is delivered more accurately, the accuracy of communication with the person to be served is improved, and errors are reduced.

Description

Audio processing method and device in complex environment and storage medium
Technical Field
The present invention relates to the field of audio processing, and in particular, to an audio processing method and apparatus in a complex environment, and a storage medium.
Background
At present, with the rapid development of artificial intelligence, its applications across industries are becoming ever more widespread, and advanced artificial intelligence technology is being applied in many scenarios, especially in the service industry, for example in tea shops, restaurants, fast-food shops, and clothing stores. These venues are crowded and noisy, and conversations between attendants and customers are often affected by background noise. Existing recognition methods usually achieve good recognition results in quiet scenes, but because of the interference introduced by noise they cannot properly handle service-dialogue recognition under high noise; conventional speech recognition models cannot adapt, and their robustness is poor.
Disclosure of Invention
In view of the above, in order to solve at least one of the above technical problems, an object of the present invention is to provide an audio processing method, an audio processing apparatus and a storage medium under a complex environment.
The embodiments of the invention adopt the following technical solution:
an audio processing method in a complex environment, comprising:
acquiring audio training data;
training a neural network model through the audio training data and the word bank and sentence library; the neural network model is a deep neural network acoustic model combining a time-lag recurrent neural network with a hidden Markov model, and the word bank and sentence library comprises commonly used words or sentences in a dialogue scene;
and inputting the audio data of the person to be served into the trained neural network model to obtain output content, and playing the output content to the person to be served as speech.
Further, the word bank and sentence library is determined by the following steps:
obtaining dialogue corpora in a dialogue scenario;
and carrying out intelligent template extraction and recognition according to the dialogue corpora to obtain the word bank and sentence library.
Further, the training of the neural network model through the audio training data and the word bank and sentence library includes:
carrying out state clustering on the audio training data according to triphones to obtain a posterior of a state;
processing the audio training data according to the neural network model;
and training according to the processing result, the posterior of the state and the word bank and sentence library.
Further, the performing state clustering on the audio training data according to triphones to obtain a posterior of a state includes:
carrying out state clustering on the audio training data according to a dictionary, a phoneme table, and a keyword configuration file to obtain a posterior of a state; the keyword configuration file includes lexical terms from different domains, and the phoneme table includes pronunciation standards for different regions.
Further, the audio training data includes real labels, and the training according to the processing result, the posterior of the state, and the word bank and sentence library includes:
determining keywords according to the processing result and the posterior of the state;
matching the keywords with the word bank and sentence library to determine a matching result;
and training the neural network model according to the matching result and the real label.
Further, the training the neural network model according to the matching result and the real label includes:
iteratively updating the parameters of the neural network model through back propagation during the training process;
and when the number of iterative updates reaches a preset count, or the loss value calculated from the real labels and the matching result is smaller than a preset loss threshold, obtaining the trained neural network model from the iteratively updated parameters.
Further, the playing of the output content to the person to be served as speech includes:
converting the output content into speech audio and playing it to the person to be served.
An embodiment of the present invention further provides an audio processing apparatus in a complex environment, including:
the acquisition module is used for acquiring audio training data;
the training module is used for training a neural network model through the audio training data and the word bank and sentence library; the neural network model is a deep neural network acoustic model combining a time-lag recurrent neural network with a hidden Markov model, and the word bank and sentence library comprises commonly used words or sentences in a dialogue scene;
and the playing module is used for inputting the audio data of the person to be served into the trained neural network model to obtain output content and playing the output content to the person to be served as speech.
An embodiment of the present invention further provides an audio processing apparatus in a complex environment, where the apparatus includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the method.
Embodiments of the present invention also provide a computer-readable storage medium, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the storage medium, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by a processor to implement the method.
The invention has the beneficial effects that: audio training data is acquired, and a neural network model is trained using the audio training data together with a word bank and sentence library, wherein the neural network model is a deep neural network acoustic model combining a time-lag recurrent neural network with a hidden Markov model and is trained with a word bank and sentence library containing words and sentences commonly used in dialogue scenarios, so that speech recognition performance in noisy environments is improved and robustness is increased; the audio data of the person to be served is input into the trained neural network model to obtain output content, which is played back to that person as speech, so that the output content is delivered more accurately, the accuracy of communication with the person to be served is improved, and errors are reduced.
Drawings
FIG. 1 is a flowchart illustrating steps of an audio processing method under a complex environment according to the present invention;
FIG. 2 is a diagram of TDNN and RNN according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
As shown in fig. 1, an embodiment of the present invention provides an audio processing method in a complex environment, including steps S100-S300:
and S100, acquiring audio training data.
Optionally, a complex environment refers to an environment in which communication is subject to noise interference; the embodiment of the present invention takes a scenario in which a customer communicates with an attendant as an example. In this case the customer is the person to be served, the audio training data is audio collected in scenes where customers communicate with attendants, each customer question has a corresponding answer, and the attendant's answer is used as the real label.
S200, training the neural network model through the audio training data and the word bank and sentence library.
In the embodiment of the invention, the neural network model is a deep neural network acoustic model combining a time-lag recurrent neural network with a hidden Markov model, and the word bank and sentence library comprises commonly used words or sentences in a dialogue scene.
Optionally, the construction of the word bank and sentence library used in step S200 includes steps S211 to S212:
and S211, acquiring the dialogue linguistic data in the dialogue scene.
Optionally, taking a scenario in which a customer communicates with an attendant as the dialogue scenario, a standard language specification is set for the attendant so that the attendant communicates according to that specification, and the content of conversations between customers and attendants is then collected to obtain the dialogue corpora.
S212, performing intelligent template extraction and recognition according to the dialogue corpora to obtain the word bank and sentence library.
In the embodiment of the invention, intelligent templates are combined with speech recognition technology: standard commonly used words and sentences are extracted from the dialogue corpora, and the word bank and sentence library is then built from these commonly used words and sentences.
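As an illustration only, the following Python sketch shows one simple, frequency-based way such a word bank and sentence library could be assembled from transcribed attendant utterances; the corpus format, thresholds, and whitespace tokenization are assumptions made for the example, not the intelligent-template method itself.

```python
from collections import Counter

def build_word_sentence_library(utterances, min_count=3, max_words=8):
    """utterances: transcribed attendant turns (strings), one per element."""
    sentence_counts = Counter(u.strip() for u in utterances if u.strip())
    word_counts = Counter(w for u in utterances for w in u.split())

    # Keep short sentences and words that recur often enough to act as "templates".
    sentences = {s for s, c in sentence_counts.items()
                 if c >= min_count and len(s.split()) <= max_words}
    words = {w for w, c in word_counts.items() if c >= min_count}
    return words, sentences
```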
Optionally, step S200 includes steps S221-S223:
S221, carrying out state clustering on the audio training data according to triphones to obtain the posterior of the state.
Optionally, a GMM-HMM acoustic model is constructed from monophones and/or triphones, a new HMM is initialized, and the audio training data is state-clustered according to the dictionary, the phoneme table, and the keyword configuration file to obtain the posterior of the state. The keyword configuration file includes vocabulary terms from different domains, and the phoneme table includes pronunciation standards for different regions; HMM stands for hidden Markov model and GMM for Gaussian mixture model.
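For illustration, the short sketch below obtains per-frame state posteriors from a GMM-HMM using the third-party hmmlearn package; the feature dimensions, state count, and random features are placeholders assumed for the example, and a real system would use triphone state tying driven by the dictionary, phoneme table, and keyword configuration file.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

feats = np.random.randn(500, 13)                 # e.g. 500 frames of 13-dim MFCCs (placeholder)
ghmm = GMMHMM(n_components=6, n_mix=2, covariance_type="diag", n_iter=20)
ghmm.fit(feats)                                  # EM training on the feature stream
state_posteriors = ghmm.predict_proba(feats)     # shape (500, 6): P(state | frame)
```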
S222, processing the audio training data according to the neural network model.
Specifically, the GMM-HMM initialization model is used to determine which hidden Markov model state corresponds to each part of the audio training data; the resulting alignment is labeled align-raw, and the Viterbi algorithm is used to force-align the states onto the audio training data. The feature vectors corresponding to the speech frames of the audio training data are then used as the input of the neural network model, which processes the audio training data and determines a processing result. Optionally, the processing result is a word or a sentence; forward propagation is used during processing, and pdf probability predictions corresponding to the feature vectors of the audio training data are obtained through the softmax layer, where each word or sentence corresponds to one pdf probability prediction.
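A minimal PyTorch sketch of this step is given below: forced-alignment pdf-ids serve as per-frame targets, and a softmax over the tied states yields the pdf probability predictions. The layer sizes, pdf count, and random tensors are placeholders assumed purely for illustration.

```python
import torch
import torch.nn as nn

num_pdfs = 2000                                  # number of tied HMM states (assumed)
frames = torch.randn(8, 40)                      # 8 spliced feature vectors of dim 40 (placeholder)
align = torch.randint(0, num_pdfs, (8,))         # per-frame pdf-ids from forced alignment

net = nn.Sequential(nn.Linear(40, 256), nn.ReLU(), nn.Linear(256, num_pdfs))
logits = net(frames)                             # forward propagation
loss = nn.functional.cross_entropy(logits, align)  # softmax + negative log-likelihood over pdf-ids
loss.backward()                                  # gradients for back propagation
```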
As shown in FIG. 2, optionally, when building the neural network model, context information is mainly modeled by a layered architecture in which each layer splices audio frames at different time resolutions, but the overall input context of the TDNN (time-delay neural network) is limited. To extend the TDNN training structure, an RNN (recurrent neural network) layer is added in the middle of the TDNN in this architecture; the combined RNN and TDNN form the time-lag recurrent neural network, which makes better use of contextual audio frames and further improves recognition accuracy. The entries t, t-n, t+n (n = 1, 2, ..., 6) in the figure correspond to the splicing of audio frames at different time resolutions.
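The following PyTorch sketch illustrates the idea of inserting a recurrent layer into the middle of a TDNN, with the TDNN layers realised as dilated 1-D convolutions over time; the splice widths, dilations, layer sizes, and the choice of a GRU are assumptions made for illustration and not the exact architecture of FIG. 2.

```python
import torch
import torch.nn as nn

class TDNNWithRNN(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, num_pdfs=2000):
        super().__init__()
        # TDNN layers as dilated 1-D convolutions over time (different splice widths)
        self.tdnn1 = nn.Conv1d(feat_dim, hidden, kernel_size=3, dilation=1)
        self.tdnn2 = nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)   # recurrent layer in the middle
        self.tdnn3 = nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3)
        self.out = nn.Linear(hidden, num_pdfs)                # per-frame pdf logits

    def forward(self, x):                        # x: (batch, time, feat_dim)
        h = torch.relu(self.tdnn1(x.transpose(1, 2)))
        h = torch.relu(self.tdnn2(h))
        h, _ = self.rnn(h.transpose(1, 2))       # recurrence carries longer-range context
        h = torch.relu(self.tdnn3(h.transpose(1, 2)))
        return self.out(h.transpose(1, 2))

logits = TDNNWithRNN()(torch.randn(2, 100, 40))  # (batch=2, remaining frames, 2000)
```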
S223, training according to the processing result, the posterior of the state, and the word bank and sentence library.
Optionally, step S223 includes steps S2231-S2233:
and S2231, determining the keywords according to the processing result and the posterior of the state.
Optionally, keyword matching is performed according to the processing result and the posterior of the state, and the keyword is determined.
S2232, matching the keywords with the word bank and sentence library to determine a matching result.
Optionally, a corresponding answer is searched for in the word bank and sentence library according to the keywords, and the key effective words and sentences are extracted to obtain the matching result.
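A toy Python sketch of this lookup is shown below; the overlap-count scoring and the example entries are illustrative assumptions rather than the matching rules of the embodiment.

```python
def match_sentence(keywords, sentence_library):
    """Return the library sentence sharing the most keywords, or None if nothing matches."""
    def score(sentence):
        return sum(1 for k in keywords if k in sentence)
    best = max(sentence_library, key=score)
    return best if score(best) > 0 else None

library = ["would you like it hot or iced", "your order will be ready shortly"]
print(match_sentence({"hot", "iced"}, library))   # -> "would you like it hot or iced"
```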
In the embodiment of the invention, the processing result, the dictionary, the phoneme table, and the keyword configuration file are combined and matched against the word bank and sentence library to enhance the audio data, which improves speech recognition performance in noisy environments; comparing the output of the neural network model before and after data enhancement shows that robustness is clearly improved. In addition, the time-lag recurrent neural network is combined with the deep neural network acoustic model based on the hidden Markov model, and the matching result is determined through keyword matching, which solves the problem of accurately matching keywords in each industry, improves the accuracy of communication between attendant and customer, reduces errors, and improves service quality.
S2233, training the neural network model according to the matching result and the real label.
Optionally, step S2233 includes steps S22331-S22332:
and S22331, propagating back in the training process to iteratively update the parameters of the neural network model.
Specifically, the parameters of the neural network model are continuously updated in an iterative manner in a back propagation manner in the training process so as to update the neural network model and improve the processing effect of the neural network model.
S22332, when the number of iterative updates reaches a preset count, or the loss value calculated from the real labels and the matching result is less than a preset loss threshold, obtaining the trained neural network model from the iteratively updated parameters.
Optionally, when the number of iterative updates reaches the preset count, the iteration ends and the parameters from the last update are taken as the final model parameters to obtain the trained neural network model; alternatively, a loss value is calculated with a loss function from the real labels and the matching result, the iteration ends when the loss value falls below the preset loss threshold, and the parameters from the last update are taken as the final model parameters to obtain the trained neural network model. The preset count and the preset loss threshold may be adjusted as needed. Training also ends when the loss value stops changing or no longer decreases significantly; otherwise updating continues.
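Stated as code, the stopping rule could look like the sketch below (PyTorch-style); the model, optimizer, loss function, and data iterator are placeholders assumed for the example.

```python
def train(model, optimizer, batches, loss_fn, max_iters=10000, loss_threshold=0.05):
    iters = 0
    for features, labels in batches:
        loss = loss_fn(model(features), labels)
        optimizer.zero_grad()
        loss.backward()                  # back propagation
        optimizer.step()                 # iterative parameter update
        iters += 1
        # stop when the preset iteration count or the preset loss threshold is reached
        if iters >= max_iters or loss.item() < loss_threshold:
            break
    return model
```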
In the embodiment of the invention, a deep neural network acoustic model (i.e., the neural network model) based on a time-lag recurrent neural network and a hidden Markov model is used. During training, the extracted preliminary result, the dictionary, the phoneme table, and the keyword configuration file are compared and matched against the word bank and sentence library to extract the key effective words and sentences, and this optimized comparison and matching improves the accuracy of extracting those words and sentences, so that the output of the trained neural network model contains no (or little) noise when played back as speech and is clearer.
S300, inputting the audio data of the person to be served into the trained neural network model to obtain output content, and playing the output content to the person to be served as speech.
Specifically, the output content of the neural network model is converted into speech audio and then played to the person to be served. Here the person to be served is the customer, so that the customer can obtain the required answer even in a noisy, complicated environment, and so that communication is not impaired when the attendant cannot hear the customer's speech clearly.
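For example, the playback step could be realised with an off-the-shelf text-to-speech engine; the sketch below uses the third-party pyttsx3 package purely as an assumed stand-in for whatever speech synthesis a deployment actually uses.

```python
import pyttsx3

def play_reply(output_text):
    engine = pyttsx3.init()          # initialise the local TTS engine
    engine.say(output_text)          # queue the model's output content as speech
    engine.runAndWait()              # block until playback finishes

play_reply("Your order will be ready shortly.")
```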
Optionally, when the trained neural network model extracts and recognizes the audio data of the person to be served, word segmentation is performed with a word segmentation tool, each newly segmented word is looked up in the original word bank and sentence library, and words that are contained there are added to the result. For words that are not contained there are two cases: first, if the segmented word cannot be composed from phrases in the original word bank and sentence library, the input is reprocessed or a preset default reply is output; second, if several rare words appear, all rare words are segmented into shorter phrases, and the characters of longer phrases contained in the original word bank and sentence library are arranged and combined according to the phrase order to obtain the specific words.
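A simplified sketch of the segmentation-and-lookup step follows, using the jieba segmenter; the fallback to a default reply for unmatched input is an assumption made for the example, and the recombination of rare words into longer phrases is omitted.

```python
import jieba

def lookup_words(utterance, word_library, default_reply="Sorry, could you say that again?"):
    words = jieba.lcut(utterance)                     # segment the recognised text
    known = [w for w in words if w in word_library]
    if not known:
        return default_reply                          # nothing composable from the library
    return known
```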
An embodiment of the present invention further provides an audio processing apparatus in a complex environment, including:
the acquisition module is used for acquiring audio training data;
the training module is used for training the neural network model through the audio training data and the word bank and sentence library; the neural network model is a deep neural network acoustic model combining a time-lag recurrent neural network with a hidden Markov model, and the word bank and sentence library comprises commonly used words or sentences in a dialogue scene;
and the playing module is used for inputting the audio data of the person to be served into the trained neural network model to obtain output content and playing the output content to the person to be served as speech.
The contents in the above method embodiments are all applicable to the present apparatus embodiment, the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.
The embodiment of the present invention further provides an audio processing apparatus in a complex environment, where the apparatus includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the audio processing method in the complex environment of the foregoing embodiment. Alternatively, the device includes, but is not limited to, any smart terminal such as a mobile phone, a tablet computer, a computer, and the like.
The contents in the above method embodiments are all applicable to the present apparatus embodiment, the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.
The embodiment of the present invention further provides a computer-readable storage medium, in which at least one instruction, at least one program, a code set, or a set of instructions is stored, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the audio processing method in a complex environment according to the foregoing embodiment.
Embodiments of the present invention also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the audio processing method in the complex environment of the foregoing embodiment.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form. Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes multiple instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method of the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing programs, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. An audio processing method under a complex environment, comprising:
acquiring audio training data;
training a neural network model through the audio training data and the word bank and sentence library; the neural network model is a deep neural network acoustic model combining a time-lag recurrent neural network with a hidden Markov model, and the word bank and sentence library comprises commonly used words or sentences in a dialogue scene;
and inputting the audio data of the person to be served into the trained neural network model to obtain output content, and playing the output content to the person to be served as speech.
2. The audio processing method under the complex environment according to claim 1, wherein: the word bank and sentence library is determined by the following steps:
obtaining dialogue corpora in a dialogue scenario;
and carrying out intelligent template extraction and recognition according to the dialogue corpora to obtain the word bank and sentence library.
3. The audio processing method under the complex environment according to claim 1, wherein: the training of the neural network model through the audio training data and the word bank and sentence library comprises the following steps:
carrying out state clustering on the audio training data according to triphones to obtain a posterior of a state;
processing the audio training data according to the neural network model;
and training according to the processing result, the posterior of the state and the word bank and sentence library.
4. The audio processing method under the complex environment according to claim 3, wherein: the state clustering is carried out on the audio training data according to the triphone to obtain the posterior of the state, and the method comprises the following steps:
carrying out state clustering on the audio training data according to a dictionary, a phoneme table, and a keyword configuration file to obtain a posterior of a state; the keyword configuration file includes lexical terms from different domains, and the phoneme table includes pronunciation standards for different regions.
5. The audio processing method under the complex environment according to claim 3, wherein: the audio training data comprises real labels, and the training is carried out according to the processing result, the posterior of the state and the word bank and sentence library, and comprises the following steps:
determining keywords according to the processing result and the posterior of the state;
matching the keywords with the word bank and sentence library to determine a matching result;
and training the neural network model according to the matching result and the real label.
6. The audio processing method under the complex environment according to claim 5, wherein: the training of the neural network model according to the matching result and the real label comprises:
iteratively updating the parameters of the neural network model through back propagation during the training process;
and when the number of iterative updates reaches a preset count, or the loss value calculated from the real labels and the matching result is smaller than a preset loss threshold, obtaining the trained neural network model from the iteratively updated parameters.
7. The audio processing method in a complex environment according to any one of claims 1 to 6, wherein: the playing of the output content to the person to be served as speech comprises:
converting the output content into speech audio and playing it to the person to be served.
8. An audio processing apparatus under a complex environment, comprising:
the acquisition module is used for acquiring audio training data;
the training module is used for training a neural network model through the audio training data and the word bank and sentence library; the neural network model is a deep neural network acoustic model combining a time-lag recurrent neural network with a hidden Markov model, and the word bank and sentence library comprises commonly used words or sentences in a dialogue scene;
and the playing module is used for inputting the audio data of the person to be served into the trained neural network model to obtain output content and playing the output content to the person to be served as speech.
9. An audio processing apparatus in a complex environment, the apparatus comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the method according to any one of claims 1-7.
10. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the method according to any one of claims 1 to 7.
CN202111551933.6A 2021-12-17 2021-12-17 Audio processing method and device in complex environment and storage medium Active CN114360517B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111551933.6A CN114360517B (en) 2021-12-17 2021-12-17 Audio processing method and device in complex environment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111551933.6A CN114360517B (en) 2021-12-17 2021-12-17 Audio processing method and device in complex environment and storage medium

Publications (2)

Publication Number Publication Date
CN114360517A (en) 2022-04-15
CN114360517B CN114360517B (en) 2023-04-18

Family

ID=81100109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111551933.6A Active CN114360517B (en) 2021-12-17 2021-12-17 Audio processing method and device in complex environment and storage medium

Country Status (1)

Country Link
CN (1) CN114360517B (en)

Patent Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106550156A (en) * 2017-01-23 2017-03-29 苏州咖啦魔哆信息技术有限公司 A kind of artificial intelligence's customer service system and its implementation based on speech recognition
CN107680582A (en) * 2017-07-28 2018-02-09 平安科技(深圳)有限公司 Acoustic training model method, audio recognition method, device, equipment and medium
CN109427334A (en) * 2017-09-01 2019-03-05 王阅 A kind of man-machine interaction method and system based on artificial intelligence
CN108549637A (en) * 2018-04-19 2018-09-18 京东方科技集团股份有限公司 Method for recognizing semantics, device based on phonetic and interactive system
CN109410948A (en) * 2018-09-07 2019-03-01 北京三快在线科技有限公司 Communication means, device, system, computer equipment and readable storage medium storing program for executing
CN109410911A (en) * 2018-09-13 2019-03-01 何艳玲 Artificial intelligence learning method based on speech recognition
CN109065033A (en) * 2018-09-19 2018-12-21 华南理工大学 A kind of automatic speech recognition method based on random depth time-delay neural network model
CN109147774A (en) * 2018-09-19 2019-01-04 华南理工大学 A kind of improved Delayed Neural Networks acoustic model
CN109584896A (en) * 2018-11-01 2019-04-05 苏州奇梦者网络科技有限公司 A kind of speech chip and electronic equipment
CN109599113A (en) * 2019-01-22 2019-04-09 北京百度网讯科技有限公司 Method and apparatus for handling information
CN110086946A (en) * 2019-03-15 2019-08-02 深圳壹账通智能科技有限公司 Intelligence chat sound control method, device, computer equipment and storage medium
CN110162610A (en) * 2019-04-16 2019-08-23 平安科技(深圳)有限公司 Intelligent robot answer method, device, computer equipment and storage medium
CN110827822A (en) * 2019-12-06 2020-02-21 广州易来特自动驾驶科技有限公司 Intelligent voice interaction method and device, travel terminal, equipment and medium
US20210193119A1 (en) * 2019-12-20 2021-06-24 Lg Electronics Inc. Artificial intelligence apparatus for training acoustic model
CN111143535A (en) * 2019-12-27 2020-05-12 北京百度网讯科技有限公司 Method and apparatus for generating a dialogue model
CN111508498A (en) * 2020-04-09 2020-08-07 携程计算机技术(上海)有限公司 Conversational speech recognition method, system, electronic device and storage medium
CN111683175A (en) * 2020-04-22 2020-09-18 北京捷通华声科技股份有限公司 Method, device, equipment and storage medium for automatically answering incoming call
CN112382290A (en) * 2020-11-20 2021-02-19 北京百度网讯科技有限公司 Voice interaction method, device, equipment and computer storage medium
CN112927682A (en) * 2021-04-16 2021-06-08 西安交通大学 Voice recognition method and system based on deep neural network acoustic model
CN113223504A (en) * 2021-04-30 2021-08-06 平安科技(深圳)有限公司 Acoustic model training method, device, equipment and storage medium
CN113299294A (en) * 2021-05-26 2021-08-24 中国平安人寿保险股份有限公司 Task type dialogue robot interaction method, device, equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
刘洋 (Liu Yang): "Research and implementation of intelligent template extraction and dialogue-script quality evaluation for service dialogues in restaurant scenarios", China Master's Theses Full-text Database, Information Science and Technology Series *
戴礼荣; 张仕良; 黄智颖 (Dai Lirong; Zhang Shiliang; Huang Zhiying): "Current status and prospects of deep-learning-based speech recognition technology" *
章月红等 (Zhang Yuehong et al.): "Stability of a class of stochastic inertial delayed neural networks", Applied Mathematics - A Journal of Chinese Universities (Series A) *

Also Published As

Publication number Publication date
CN114360517B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
US10176804B2 (en) Analyzing textual data
CN109887497B (en) Modeling method, device and equipment for speech recognition
CN106782560B (en) Method and device for determining target recognition text
CN111445898B (en) Language identification method and device, electronic equipment and storage medium
CN113707125B (en) Training method and device for multi-language speech synthesis model
CN112784696A (en) Lip language identification method, device, equipment and storage medium based on image identification
CN111881297A (en) Method and device for correcting voice recognition text
CN111414745A (en) Text punctuation determination method and device, storage medium and electronic equipment
JP4499389B2 (en) Method and apparatus for generating decision tree questions for speech processing
CN112767925A (en) Voice information identification method and device
CN112349294A (en) Voice processing method and device, computer readable medium and electronic equipment
CN112133285B (en) Speech recognition method, device, storage medium and electronic equipment
Rose et al. Integration of utterance verification with statistical language modeling and spoken language understanding
CN113051384A (en) User portrait extraction method based on conversation and related device
JP3903993B2 (en) Sentiment recognition device, sentence emotion recognition method and program
CN114360517B (en) Audio processing method and device in complex environment and storage medium
CN111508481A (en) Training method and device of voice awakening model, electronic equipment and storage medium
CN116978367A (en) Speech recognition method, device, electronic equipment and storage medium
CN115985320A (en) Intelligent device control method and device, electronic device and storage medium
CN112071304B (en) Semantic analysis method and device
JP4733436B2 (en) Word / semantic expression group database creation method, speech understanding method, word / semantic expression group database creation device, speech understanding device, program, and storage medium
JP2006107353A (en) Information processor, information processing method, recording medium and program
KR100487718B1 (en) System and method for improving in-domain training data using out-of-domain data
CN117275458B (en) Speech generation method, device and equipment for intelligent customer service and storage medium
JP4674609B2 (en) Information processing apparatus and method, program, and recording medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant