WO2021000403A1 - Voice matching method for an intelligent dialogue system, electronic device, and computer device

Voice matching method for an intelligent dialogue system, electronic device, and computer device

Info

Publication number
WO2021000403A1
WO2021000403A1 · PCT/CN2019/102841 · CN2019102841W
Authority
WO
WIPO (PCT)
Prior art keywords
text information
similarity
voice
extended
questions
Prior art date
Application number
PCT/CN2019/102841
Other languages
English (en)
French (fr)
Inventor
马力
程宁
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021000403A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/319Inverted lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3338Query expansion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/14Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142Hidden Markov Models [HMMs]
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/14Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142Hidden Markov Models [HMMs]
    • G10L15/148Duration modelling in HMMs, e.g. semi HMM, segmental models or transition probabilities
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225Feedback of the input speech

Definitions

  • This application relates to the field of intelligent technology, and in particular to a voice matching method, electronic device, and computer equipment of an intelligent dialogue system.
  • Intelligent voice interaction is a new-generation interaction mode based on voice input: the user speaks, and feedback results are returned.
  • Intelligent dialogue systems are increasingly built on data-driven frameworks, which reduce the cost of labor-intensive manual management of complex dialogues and tolerate the errors introduced by speech recognizers operating in noisy environments. They do so through an explicit Bayesian model that represents uncertainty, and through reward-driven policy optimization.
  • this application proposes a voice matching method, electronic device, and computer device for an intelligent dialogue system. By adding similarity technology to the precise policy learning of the POMDP model, the accuracy of the model in the voice system of an intelligent dialogue system can be effectively improved, making the voice matching result more accurate.
  • this application proposes a voice matching method of an intelligent dialogue system, which is applied to an electronic device.
  • the method includes the following steps: converting the acquired voice information input by the user terminal into corresponding text information; determining, by retrieval, N extended questions corresponding to the text information; performing similarity calculation on the N extended questions using a pre-trained POMDP model, and sorting the N extended questions in descending order of similarity according to the calculated similarities; generating an extended-question recognition intent corresponding to the text information according to the N extended questions sorted in descending order of similarity; and matching preset response words according to the extended-question recognition intent, returning the response words matching the text information, so that the user terminal plays the response voice.
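As a hedged sketch only, the claimed steps could be wired together as below. Every helper name here is invented, and a trivial word-overlap score stands in for the pre-trained POMDP similarity model, which the application does not disclose in implementable detail.

```python
# Hypothetical sketch of the claimed pipeline; all function names and the
# similarity scorer are assumptions, not the patent's actual implementation.

def speech_to_text(audio):
    # Stand-in for HMM-GMM based ASR plus error correction.
    return audio["transcript"]

def retrieve_extended_questions(text, knowledge_base):
    # Stand-in for the retrieval step: return candidate extended
    # questions that share at least one word with the query.
    words = set(text.split())
    return [q for q in knowledge_base if words & set(q.split())]

def pomdp_similarity(text, question):
    # Placeholder scorer standing in for the pre-trained POMDP model:
    # here, simple word overlap (Jaccard similarity).
    a, b = set(text.split()), set(question.split())
    return len(a & b) / len(a | b)

def match_response(intent, responses):
    # Match the recognized intent against preset response words.
    return responses.get(intent, "Sorry, no match.")

def voice_match(audio, knowledge_base, responses):
    text = speech_to_text(audio)
    candidates = retrieve_extended_questions(text, knowledge_base)
    ranked = sorted(candidates,
                    key=lambda q: pomdp_similarity(text, q),
                    reverse=True)           # descending similarity
    intent = ranked[0] if ranked else None  # top-ranked extended question
    return match_response(intent, responses)
```

The design point is the ordering: retrieval narrows the space to N candidates first, and only those are scored and ranked before the intent is matched to a response.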
  • the present application also provides an electronic device, which includes: a conversion module adapted to convert the acquired voice information input by the user terminal into corresponding text information; a retrieval module adapted to determine, by retrieval, the N extended questions corresponding to the text information; a similarity calculation module adapted to calculate the similarity of the N extended questions using a pre-trained POMDP model, and to sort the N extended questions in descending order of similarity according to the calculated similarities; a generating module adapted to generate the extended-question recognition intent corresponding to the text information according to the N extended questions sorted in descending order of similarity; and a matching module adapted to match preset response words according to the extended-question recognition intent and return the response words matching the text information, so that the user terminal plays the response voice.
  • this application also provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; when the processor executes the computer-readable instructions, the steps of the above method are implemented.
  • the present application also provides a non-volatile computer-readable storage medium on which computer-readable instructions are stored, and the computer-readable instructions implement the steps of the foregoing method when executed by a processor.
  • Figure 1 is an architecture diagram of an existing intelligent voice interaction system.
  • Figure 2 is a schematic diagram of the main components of the current traditional dialogue system.
  • Figure 3 is a POMDP flow chart.
  • FIG. 4 is a relationship diagram of the POMDP.
  • FIG. 5 is an optional application environment diagram of the electronic device of the embodiment of the present application.
  • FIG. 6 is a schematic diagram of the hardware architecture of the electronic device according to the first embodiment of the present application.
  • FIG. 7 is a schematic diagram of program modules of the electronic device according to the first embodiment of the present application.
  • FIG. 8 is a schematic flowchart of the voice method of the intelligent dialogue system according to the first embodiment of the present application.
  • FIG. 9 is a schematic flowchart of a voice method of an intelligent dialogue system according to a second embodiment of the present application.
  • FIG. 10 is a schematic flowchart of a voice method of an intelligent dialogue system according to a third embodiment of the present application.
  • FIG. 1 shows the architecture diagram of the existing intelligent voice interaction system.
  • the intelligent dialogue system allows users to use voice as the main, and usually the only, communication medium to interact with various information systems.
  • spoken dialogue systems (SDS) are mainly deployed in call center applications.
  • such systems can reduce the demand for human operators, thereby reducing costs.
  • Recently, the use of voice interfaces in mobile phones has become very common, for example Apple's Siri and Nuance's Dragon Go!, demonstrating the value of integrating natural dialogue and voice interaction into mobile products, applications, and services.
  • the words spoken by the client user are converted into text by ASR and then enter the dialogue system.
  • the specified content service is called and the text is output.
  • the content is converted into voice by TTS and then returned to the user on the client.
  • the intelligent dialogue platform generally consists of two parts: a question answering system based on natural language understanding, and a task-driven dialogue system.
  • the question and answer system based on natural language understanding focuses on one question and one answer, that is, to give accurate answers directly based on the user's question, which is an information retrieval process.
  • a knowledge base needs to be prepared in advance.
  • the knowledge base can contain one or more fields.
  • the task-driven dialogue system focuses on task-driven, multi-round dialogue: users come with a clear purpose and hope to obtain information or services that meet specific constraints, such as ordering food, booking tickets, or searching for music, movies, or a certain product.
  • the Spoken Dialogue System in academic literature generally refers to task-driven multi-round dialogue. Compared with the information retrieval of a question answering system, a task-driven dialogue system is a decision-making process, requiring the machine to constantly decide the optimal next action based on the current and contextual state during the dialogue. The fundamental difference between Q&A and dialogue is whether the system needs to maintain the user's state and run a decision-making process to complete the task.
  • FIG. 5 is a schematic diagram of an optional application environment of the electronic device 20 of the present application.
  • the electronic device 20 can communicate with the client 10 and the server 30 in a wired or wireless manner.
  • the electronic device 20 obtains the voice information input at the user terminal 10 through the interface 23, obtains the response voice from the server 30 according to the obtained voice information, and plays the response voice on the user terminal 10 through the interface, so as to realize the voice matching of the intelligent dialogue system.
  • the client 10 may be a virtual reality device, including glasses, a helmet, a handle, and the like.
  • the electronic device 20 may also be embedded in the client 10 or the server 30.
  • FIG. 6 is a schematic diagram of an optional hardware architecture of the electronic device 20 of the present application.
  • the electronic device 20 includes, but is not limited to, a memory 21, a processor 22, and an interface 23 that can communicate with each other through a system bus.
  • FIG. 6 only shows the electronic device 20 with components 21-23, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
  • the memory 21 includes at least one non-volatile computer-readable storage medium.
  • the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, etc.
  • the memory 21 may be an internal storage unit of the electronic device 20, such as a hard disk or a memory of the electronic device 20.
  • the memory may also be an external storage device of the electronic device 20, for example, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the electronic device 20.
  • the memory 21 may also include both an internal storage unit of the electronic device 20 and an external storage device thereof.
  • the memory 21 is generally used to store an operating system and various application software installed in the electronic device 20, such as the program code of the intelligent dialogue system 24.
  • the memory 21 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 22 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
  • the processor 22 is generally used to control the overall operation of the electronic device 20.
  • the processor 22 is used to run the program code or process data stored in the memory 21, for example, to run the intelligent dialogue system 24.
  • the interface 23 may include a wireless interface or a wired interface, and the interface 23 is generally used to establish a communication connection between the electronic device 20 and other electronic devices.
  • an embodiment of the present application proposes an electronic device 20 shown in FIG. 5, and referring to FIG. 7, it is a schematic diagram of program modules of the electronic device 20 according to the first embodiment of the present application.
  • the electronic device 20 includes a series of computer-readable instructions stored in the memory 21; when the computer-readable instructions are executed by the processor 22, the voice matching operation of the intelligent dialogue system of the various embodiments of the present application can be realized.
  • the electronic device 20 may be divided into one or more modules based on specific operations implemented by the various parts of the computer-readable instructions. For example, in FIG. 7, the electronic device 20 may be divided into a conversion module 201, a retrieval module 202, a similarity calculation module 203, a generation module 204, and a matching module 205. among them:
  • the conversion module 201 is adapted to convert the acquired voice information input by the user terminal into corresponding text information
  • the conversion module 201 is adapted to recognize the acquired voice information input by the user terminal through ASR (Automatic Speech Recognition) based on HMM-GMM (Hidden Markov Model-Gaussian Mixture Model), and to translate the voice information into corresponding pre-text information; the pre-text information is then corrected by an error correction algorithm to obtain the corrected text information.
  • ASR: Automatic Speech Recognition
  • HMM-GMM: Hidden Markov Model-Gaussian Mixture Model
  • HMM-GMM based ASR is well known to those skilled in the art and will not be described in detail here.
  • in addition to HMM-GMM based ASR, other speech recognition solutions exist; they are likewise not described here.
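The error correction algorithm itself is not specified in the application. As one hedged illustration only, each recognized word could be snapped to the closest entry in a domain vocabulary; the function name and the similarity cutoff below are assumptions.

```python
# Minimal sketch of post-ASR error correction: correct each recognized
# word to the closest vocabulary entry by string similarity.
from difflib import get_close_matches

def correct_text(pre_text, vocabulary):
    corrected = []
    for word in pre_text.split():
        # Keep the word if a sufficiently similar vocabulary entry
        # exists; otherwise leave it unchanged.
        match = get_close_matches(word, vocabulary, n=1, cutoff=0.7)
        corrected.append(match[0] if match else word)
    return " ".join(corrected)
```

For example, `correct_text("the best restuarant", ["the", "best", "restaurant"])` repairs the misrecognized last word.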
  • the retrieval module 202 is adapted to determine N extended questions corresponding to the text information through retrieval;
  • the retrieval module 202 retrieves, through elasticsearch, the preset database corresponding to the business scenario; for the original extended questions corresponding to the text information that are not retrieved, an inverted index lookup is performed through elasticsearch, and the N extended questions corresponding to the text information are determined.
  • when the retrieval module 202 retrieves, through elasticsearch, the marked-attention library corresponding to the business scenario in the preset database and the original extended question corresponding to the text information is retrieved, the original extended-question recognition intent corresponding to the text information is generated.
  • the similarity calculation module 203 is adapted to calculate the similarity of the N extended questions using a pre-trained POMDP model, and to sort the N extended questions in descending order of similarity according to the calculated similarities;
  • the similarity calculation module 203 uses a pre-trained POMDP model to calculate the similarity of the N extended questions, and compares the similarities of the N extended questions with a preset threshold one by one; if the N similarities are all greater than or equal to the preset threshold, the N extended questions are sorted in descending order of similarity.
  • the similarity calculation module 203 performs classification prediction on the N extended questions through a pre-trained LSTM+CRF model, and classifies the N extended questions according to the classification prediction results, so that the extended-question recognition intent is generated based on the classification result.
  • the generating module 204 is adapted to generate an extended question recognition intent corresponding to the text information according to the N extended questions sorted in descending order of similarity;
  • the matching module 205 is adapted to match the preset response words according to the extended-question recognition intent and return the response words matching the text information, so that the user terminal plays the response voice.
  • the matching module 205 matches the extended-question recognition intent against the preset response words to obtain the response words matching the text information; the response words are synthesized into the response voice corresponding to the voice information, and the response voice is returned, so that the user terminal plays it.
  • the electronic device 20 proposed in the embodiments of the present application can calculate the similarity of the N extended questions using a pre-trained POMDP model, sort the N extended questions in descending order of similarity according to the calculated similarities, determine the extended-question recognition intent corresponding to the text information, and match it with the preset response words, making the direct voice interaction with the user in the intelligent dialogue system more accurate and effectively improving the user's interactivity and experience.
  • this application also proposes a voice matching method for the intelligent dialogue system.
  • FIG. 8 is a schematic flowchart of the first embodiment of the voice method of the intelligent dialogue system of the present application.
  • the voice method of the intelligent dialogue system is applied to the electronic device 20.
  • the execution order of the steps in the flowchart shown in FIG. 8 can be changed, and some steps can be omitted.
  • Step S800 Convert the acquired voice information input by the user terminal into corresponding text information;
  • the user inputs voice on the user terminal, such as "the best restaurant".
  • the user voice input may be in different national languages or dialects of different regions.
  • the voice "the best restaurant" is converted into the text information "the best restaurant".
  • Step S801 Determine N extended questions corresponding to the text information by searching
  • the N extended questions corresponding to the text information "the best restaurant" are determined by searching, such as the best restaurant near the user terminal, the best restaurant in the current city, and the best-rated restaurant; there is no specific limitation here.
  • Step S802 Perform similarity calculation on the N extended questions using a pre-trained POMDP model, and sort the N extended questions in descending order of similarity according to the calculated similarities;
  • a pre-trained POMDP model is used to perform similarity calculation on the N extended questions, such as the best restaurant near the current user terminal, the best restaurant in the current city, and the best-rated restaurant. For example, the similarity of the best restaurant near the user terminal is 90%, the similarity of the best restaurant in the current city is 70%, and the similarity of the best-rated restaurant is 80%.
  • the N extended questions are then sorted in descending order of similarity: the best restaurant near the current user terminal, then the best-rated restaurant, then the best restaurant in the current city.
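The filtering and descending-order sorting can be sketched with the example similarities above. The threshold value is a hypothetical choice; only the three example scores come from the text.

```python
# Illustrative only: the similarity scores are the example values from
# the text, not output of an actual POMDP model.
similarities = {
    "best restaurant near the user terminal": 0.90,
    "best restaurant in the current city":    0.70,
    "best-rated restaurant":                  0.80,
}

THRESHOLD = 0.6  # hypothetical preset threshold

# Keep candidates at or above the threshold, sorted by descending similarity.
ranked = sorted(
    (q for q, s in similarities.items() if s >= THRESHOLD),
    key=similarities.get,
    reverse=True,
)
```

Here `ranked[0]`, the top-ranked extended question, is "best restaurant near the user terminal", matching the ordering described in the text.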
  • Step S803 According to the N extended questions sorted in descending order of similarity, generate an extended-question recognition intent corresponding to the text information;
  • the extended-question recognition intent corresponding to the text information is generated.
  • Step S804 Match the preset response words according to the extended-question recognition intent, and return the response words matching the text information, so that the user terminal plays the response voice.
  • the corresponding extended-question recognition intent is matched with the preset response words, and the response words matching the text information are returned, so that the user terminal plays the response voice.
  • the voice matching method, electronic device, and computer device of the intelligent dialogue system proposed in the embodiments of the present application can calculate the similarity of the N extended questions using the pre-trained POMDP model, sort the N extended questions in descending order of similarity according to the calculated similarities, determine the extended-question recognition intent corresponding to the text information, and match it with the preset response words, so that the direct voice interaction with the user in the intelligent dialogue system is more accurate, effectively improving the user's interactivity and experience.
  • the transition and observation probability functions are represented by appropriate stochastic models, here called the dialogue model.
  • the decision of which action to take in each round is determined by a second stochastic model encoding the policy P.
  • a reward is assigned to each step; this reward is designed to reflect the desired characteristics of the dialogue system.
  • the dialogue model and the policy model P can then be optimized by maximizing the expected cumulative sum of these rewards, either through online interaction with users or using a corpus of dialogues collected offline in similar domains.
  • This POMDP-based dialogue model combines two key ideas: confidence state tracking and reinforcement learning. These ideas are separable and have their own benefits. However, combining them can form a complete and well-founded mathematical framework, providing opportunities for further synergy gains. Compared with traditional methods, the potential advantages of this method can be summarized as follows:
  • the confidence state provides a clear representation of uncertainty, making the system more sensitive and robust to speech recognition errors.
  • Bayesian inference is used to update the posterior probability of the confidence state after each user input.
  • the design of the confidence state allows user behavior to be captured through the model prior, and the inference process can utilize the complete distribution of recognition hypotheses, such as confusion networks and N-best lists. Evidence is therefore integrated in each round, so the impact of a single error is significantly reduced, and, compared with traditional systems, the user's persistence is rewarded: if the user repeats something often enough, as long as the correct hypothesis keeps appearing in the N-best list, the system's confidence in what they say will increase over time.
  • the system can track all possible dialogue paths effectively and in parallel, and choose its next action not based on the most probable state but based on the probability distribution of all states.
  • if a recognition error occurs, the probability of the current most likely state will simply decrease, and the focus will switch to another state. Therefore, there is no need to backtrack or correct errors for specific conversations. This allows a simple homogeneous mapping from belief to action to be embedded in a powerful dialogue strategy.
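The Bayesian confidence-state update over N-best lists described above can be sketched as follows. The goals, priors, and the crude observation model are all invented for illustration; a real system would learn these quantities.

```python
# Minimal Bayesian confidence-state update over an N-best list.
# Goals, priors, and the confusion probability are illustrative only.

def update_belief(belief, nbest):
    """belief: {goal: P(goal)}; nbest: [(hypothesis, asr_confidence), ...].
    Assumes P(hypothesis | goal) is 1.0 on a match, else a small
    confusion probability -- a deliberately crude observation model."""
    CONFUSION = 0.1
    posterior = {}
    for goal, prior in belief.items():
        # Likelihood of the whole N-best list given this goal.
        likelihood = sum(
            conf * (1.0 if hyp == goal else CONFUSION)
            for hyp, conf in nbest
        )
        posterior[goal] = prior * likelihood
    total = sum(posterior.values())
    return {g: p / total for g, p in posterior.items()}

belief = {"best restaurant": 0.5, "best rest stop": 0.5}
# The correct hypothesis keeps appearing in the N-best list...
for _ in range(3):
    belief = update_belief(belief, [("best restaurant", 0.7),
                                    ("best rest stop", 0.3)])
# ...so confidence in "best restaurant" grows over the rounds.
```

This reproduces the behavior claimed above: repeated evidence accumulates, so a single recognition error cannot dominate the belief.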
  • FIG. 9 is a schematic flowchart of the second embodiment of the voice method of the intelligent dialogue system of the present application.
  • the voice method of the intelligent dialogue system is applied to the electronic device 20.
  • the execution order of the steps in the flowchart shown in FIG. 9 can be changed, and some steps can be omitted.
  • Step S900 Convert the acquired voice information input by the user terminal into corresponding text information;
  • the user inputs voice on the user terminal, such as "the best restaurant".
  • the user voice input may be in different national languages or dialects of different regions.
  • the voice "the best restaurant" is converted into the text information "the best restaurant".
  • Step S901 Perform, through elasticsearch, an inverted index lookup for the original extended questions corresponding to the text information that were not retrieved;
  • for an original extended question corresponding to the text information that was not determined by retrieval, an inverted index lookup is performed through elasticsearch. For example, if the original extended question for the text information "the best restaurant" is not retrieved, elasticsearch is used to look up "the best restaurant" in the database's inverted index.
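A toy inverted index illustrating the kind of lookup elasticsearch performs internally; the document contents and ids are invented for illustration, and a production index would also involve analysis, scoring, and ranking.

```python
# Toy inverted index: map each term to the set of documents containing it,
# then answer a query by intersecting the posting sets of its terms.
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: {doc_id, ...}}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def lookup(index, query):
    """Return ids of documents containing every query term."""
    sets = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*sets) if sets else set()

docs = {
    1: "best restaurant near the user terminal",
    2: "best restaurant in the current city",
    3: "best rated restaurant",
}
index = build_inverted_index(docs)
hits = lookup(index, "best restaurant")
```

Here `hits` contains all three candidate extended questions, since each contains both query terms.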
  • Step S902 Determine the N extended questions corresponding to the text information;
  • the N extended questions corresponding to the text information "the best restaurant" are determined, for example, the best restaurant near the user terminal, the best restaurant in the current city, and the best-rated restaurant; there is no specific limitation here.
  • Step S903 Perform similarity calculation on the N extended questions using a pre-trained POMDP model;
  • the pre-trained POMDP model is used to calculate the similarity of the N extended questions. This can be computed through the Bellman optimality equation of POMDPs: using a policy search method, the value function corresponding to each policy is first calculated, and the optimal policy is then obtained by finding the policy with the highest return value.
  • the optimal policy corresponds to the highest similarity; that is, this is how the pre-trained POMDP model described in this embodiment performs the similarity calculation on the N extended questions.
  • a pre-trained POMDP model is used to perform similarity calculation on the N extended questions, such as the best restaurant near the current user terminal, the best restaurant in the current city, and the best-rated restaurant. For example, the similarity of the best restaurant near the user terminal is 90%, the similarity of the best restaurant in the current city is 70%, and the similarity of the best-rated restaurant is 80%.
  • the partially observable Markov decision process is defined as a tuple (S, A, T, R, O, Z, γ, b0), where S is a set of states with s ∈ S; A is a set of actions with a ∈ A; T defines the transition probability P(s_{t+1} | s_t, a_t); R defines the expected reward r(s_t, a_t); O is a set of observations with o ∈ O; Z defines the observation probability P(o_{t+1} | s_{t+1}, a_t); γ is a discount factor; and b0 is the initial confidence state.
  • the operation of a POMDP is as follows: at each time step, the environment is in an unobserved state s_t. Since s_t is not known exactly, a distribution over possible states, called the confidence state b_t, is maintained, where b_t(s_t) indicates the probability of being in a specific state s_t. Based on b_t, the machine selects an action a_t, receives a reward r_t, and transitions to the (unobserved) state s_{t+1}, where s_{t+1} depends only on s_t and a_t. The machine then receives an observation o_{t+1}, which depends on s_{t+1} and a_t. This process is illustrated in Figure 3.
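One step of the process above can be sketched as a confidence-state (belief) update. The two-state model and all probabilities below are invented for illustration; only the update rule itself is standard.

```python
# Sketch of the POMDP belief update b_{t+1}(s') ∝ Z(o|s',a) Σ_s T(s'|s,a) b(s).
# States are 0 and 1; T[s][a] is the row P(·|s,a), Z[s2][a] is P(·|s2,a).
T = {0: {0: [0.9, 0.1]},   # transition probabilities (illustrative)
     1: {0: [0.2, 0.8]}}
Z = {0: {0: [0.8, 0.2]},   # observation probabilities (illustrative)
     1: {0: [0.3, 0.7]}}

def belief_update(b, a, o):
    new_b = [
        Z[s2][a][o] * sum(T[s][a][s2] * b[s] for s in (0, 1))
        for s2 in (0, 1)
    ]
    total = sum(new_b)          # normalizing constant P(o | b, a)
    return [p / total for p in new_b]

b = [0.5, 0.5]                   # initial confidence state b0
b = belief_update(b, a=0, o=0)   # take action 0, observe o = 0
```

After one step the belief concentrates on state 0, since the observation was more likely there; this is exactly the evidence integration described above.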
  • a policy π can be represented in several ways. The most common is a deterministic mapping from the confidence state to an action, π(b) ∈ A, or a stochastic distribution over actions, π(a|b). The two types of policies use the same symbol π, and the presence of the action in the symbol determines whether the policy is deterministic or stochastic. Note, however, that other definitions are possible, such as finite state controllers, or mappings from observation sequences of finite length to actions (see predictive state representations).
  • for a deterministic policy, the expected total discounted reward can be expressed recursively as V^π(b) = r(b, π(b)) + γ Σ_o P(o | b, π(b)) V^π(b'), where b' is the confidence state reached after taking action π(b) and receiving observation o.
  • a related quantity is the Q function Q^π(b, a), which gives the expected total discounted reward for taking action a in confidence state b and following π thereafter; for a stochastic policy, V^π(b) = Σ_a π(a|b) Q^π(b, a).
  • the optimal policy π* is the policy that maximizes V^π, yielding V*.
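Written out in standard notation consistent with the quantities above, the optimality conditions are:

```latex
V^{*}(b) = \max_{a}\Big[\, r(b,a) + \gamma \sum_{o} P(o \mid b, a)\, V^{*}(b') \,\Big],
\qquad
\pi^{*}(b) = \operatorname*{arg\,max}_{a} Q^{*}(b,a),
```

where b' is the confidence state reached from b after action a and observation o.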
  • the purpose of the POMDP model established above is to determine the various factors of the confidence state, and therefore the basic model required to represent these factors in the actual system; for example, for the N extended questions, Bayesian inference is used to update the posterior probability of the confidence state after each user input.
  • the user goal, such as "the best restaurant", contains all the information needed to complete the task.
  • the user's real intent refers to the intent that the user actually wants to express, rather than the intent recognized by the system.
  • the dialogue history is tracked, and the previous dialogue flow is used for corresponding training to determine the basic model required for these factors of "the best restaurant".
  • Step S904 Compare the similarities of the N extended questions with a preset threshold one by one, and sort the extended questions whose similarity is greater than the preset threshold in descending order of similarity;
  • the similarities of the N extended questions are compared with a preset threshold one by one; if the N similarities are all greater than or equal to the preset threshold, the N extended questions are sorted in descending order of the calculated similarity: the best restaurant near the current user terminal, then the best-rated restaurant, then the best restaurant in the current city.
  • Step S905: Generate the extended question recognition intent corresponding to the text information from the N extended questions sorted in descending order of similarity;
  • In one embodiment, the extended question recognition intent corresponding to the text information is determined by the descending similarity order, for example: the best restaurant near the current user terminal - the best rated restaurant - the best restaurant in the current city.
  • In that case, the generated extended question recognition intent is the best restaurant near the current user terminal.
  • Step S906: Match the extended question recognition intent against the preset response words to obtain the response words matching the text information;
  • In one embodiment, the extended question recognition intent corresponding to the descending-similarity ranking (the best restaurant near the current user terminal - the best rated restaurant - the best restaurant in the current city) is matched against the preset response words. For example, if the generated extended question recognition intent is the best restaurant near the current user terminal, it is matched against the preset response words, and through location matching and rating matching the response text for the best restaurant near the current user terminal is obtained,
  • such as "XX restaurant".
  • Step S907: Synthesize the response words into the response voice corresponding to the voice information;
  • In one embodiment, the response words for the best restaurant near the current user terminal - the best rated restaurant - the best restaurant in the current city are synthesized into the response voice corresponding to the voice information;
  • for example, the matching result for the best restaurant near the current user terminal is returned as the corresponding response voice, such as the spoken answer "XX restaurant".
  • Step S908: The response voice is returned, so that the user terminal plays the response voice.
  • The voice method of the intelligent dialogue system proposed in this embodiment of the application computes the similarity of the N extended questions using a pre-trained POMDP model, sorts the N extended questions in descending order of the computed similarity, determines the extended question recognition intent corresponding to the text information, and matches it against the preset response words, making direct voice interaction with the user in the intelligent dialogue system more accurate and effectively improving the user's interactivity and experience.
  • FIG. 10 is a schematic flowchart of the third embodiment of the voice method of the intelligent dialogue system of the present application.
  • the voice method of the intelligent dialogue system is applied to the electronic device 20.
  • the execution order of the steps in the flowchart shown in FIG. 10 can be changed, and some steps can be omitted.
  • Step S1000: Recognize the acquired voice information input by the user through HMM-GMM-based ASR, and transcribe the voice information into corresponding pre-text information;
  • Specifically, the user speaks into the user terminal, e.g., "the best restaurant".
  • The voice input may be in different national languages or regional dialects, so the transcribed pre-text information may be erroneous, e.g., "the best-worst restaurant". Further, the audio signal is processed through HMM-GMM: the first step recognizes frames as states (the hard part).
  • The second step combines the states into phonemes.
  • The third step combines the phonemes into words.
  • The first step is done by the GMM, and the latter two steps are done by the HMM.
  • HMM-GMM speech recognition is a well-known technique in the art and will not be detailed here.
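The three decoding stages above (frames to states, states to phonemes, phonemes to words) can be caricatured as below. Real systems score frames with GMM likelihoods and decode with HMM Viterbi search over trained models; the nearest-mean scoring, the state inventory, and the tiny pronunciation lexicon here are all invented purely for illustration.

```python
import numpy as np

def frames_to_states(frames, state_means):
    # Stage 1 (GMM-like): label each acoustic frame with the nearest state mean.
    return [int(np.argmin(np.abs(np.asarray(state_means) - f))) for f in frames]

def states_to_phonemes(states, states_per_phoneme=3):
    # Stage 2 (HMM): collapse runs of states into a phoneme sequence.
    phonemes = []
    for s in states:
        p = s // states_per_phoneme
        if not phonemes or phonemes[-1] != p:
            phonemes.append(p)
    return phonemes

def phonemes_to_words(phonemes, lexicon):
    # Stage 3 (HMM + lexicon): look the phoneme sequence up in a dictionary.
    return [w for w, pron in lexicon.items() if pron == phonemes]
```

For example, four one-dimensional "frames" near the means of states 0 and 3 decode to phonemes 0 and 1, which the toy lexicon maps to a word.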
  • Step S1001: Correct the pre-text information through an error correction algorithm to obtain the corrected text information;
  • Specifically, when the user conducts an intelligent voice conversation through the user terminal and says, e.g., "the best restaurant", which is mis-transcribed as "the best-worst restaurant", error correction is needed to restore the text information to "the best restaurant".
  • The aforementioned error correction algorithm may be a language model built with an LSTM, which performs further correction on the text produced by HMM-GMM speech recognition.
  • Error correction for speech recognition is a well-known technique in the art and will not be repeated here.
  • Step S1002: Search, through elasticsearch, the annotated intent library corresponding to the business scenario in the pre-built database;
  • Step S1003: Retrieve the original extended question corresponding to the text information;
  • Specifically, elasticsearch searches the annotated intent library of the business scenario already stored in the database to check whether an annotated extended question exists. If the original extended question corresponding to the text information, "the best restaurant", is retrieved, the annotated intent is returned through the annotated intent library of the business scenario, and step S1011 is executed directly.
  • Step S1004: Build, through elasticsearch, an inverted index for the original extended question corresponding to the text information that was not retrieved;
  • In one embodiment, if the original extended question for the text information "the best restaurant" is not retrieved, elasticsearch builds an inverted index over "the best restaurant" in the database.
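A minimal sketch of the inverted index that elasticsearch maintains internally: each term maps to the documents containing it, and a query is answered by intersecting the postings lists of its terms. The toy documents below are invented stand-ins for the extended-question library, and English tokens are used because whitespace splitting sidesteps Chinese word segmentation, which a real analyzer would handle.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing all query terms (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Invented stand-ins for stored extended questions.
docs = {
    1: "best restaurant near the user terminal",
    2: "best restaurant in the current city",
    3: "best rated restaurant",
}
index = build_inverted_index(docs)
```

Querying "best restaurant" hits all three candidates, while "best city" narrows to the current-city question only.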
  • Step S1005: Determine the N extended questions corresponding to the text information;
  • In one embodiment, N extended questions corresponding to the text information "the best restaurant" are determined, for example the best restaurant near the user terminal, the best restaurant in the current city, and the best rated restaurant; no specific limitation is imposed here.
  • Step S1006: Calculate the similarity of the N extended questions using the pre-trained POMDP model;
  • In one embodiment, the pre-trained POMDP model calculates the similarity of each of the N extended questions, such as the best restaurant near the current user terminal, the best restaurant in the current city, and the best rated restaurant.
  • For example, the similarity of the best restaurant near the user terminal is 90%, the similarity of the best restaurant in the current city is 70%, and the similarity of the best rated restaurant is 80%.
  • Step S1007: Compare the similarities of the N extended questions with a preset threshold one by one, and if all N similarities are greater than or equal to the preset threshold, sort the N extended questions in descending order of similarity;
  • In one embodiment, the similarities of the N extended questions are compared with the preset threshold one by one, and if all N similarities are greater than or equal to the threshold, the N extended questions are sorted in descending order of the calculated similarity;
  • for example: the best restaurant near the current user terminal - the best rated restaurant - the best restaurant in the current city.
  • Step S1008: Generate the extended question recognition intent corresponding to the text information from the N extended questions sorted in descending order of similarity;
  • In one embodiment, the extended question recognition intent corresponding to the text information is generated according to the descending similarity order.
  • Step S1009: If all N similarities are less than the preset threshold, perform classification prediction on the N extended questions through the pre-trained LSTM+CRF model;
  • Specifically, when step S1007 compares the similarities of the N extended questions with the preset threshold one by one and all N similarities are less than the threshold, the pre-trained LSTM+CRF model performs classification prediction on the N extended questions.
  • The LSTM+CRF model is well known to those skilled in the art and will not be repeated here.
  • In a concrete scenario, given a speech sequence, X denotes the information carried by one sentence.
  • The second sentence is then affected not only by its own information X1 but also by the hidden state h0 carried over from the first sentence.
  • An RNN can memorize information from the sequence itself, but by the design of its mechanism it easily suffers severe gradient explosion and gradient vanishing problems (information explosion and loss of later information), so it cannot memorize information over long spans, and its memory and computation-time requirements are also high.
  • The LSTM introduces three "gates", the forget gate, the input gate, and the output gate, to address these problems of the RNN.
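The three gates can be written out directly. The following is a generic LSTM cell in numpy, a sketch rather than the patent's trained model; the weight layout (four stacked gate blocks applied to the concatenated [h_prev, x]) is a common convention chosen here for compactness.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step: the forget gate drops part of the old memory,
    the input gate admits part of the new information, and the output
    gate decides what part of the merged memory is emitted."""
    H = h_prev.size
    z = W @ np.concatenate([h_prev, x]) + b      # four gate pre-activations
    f = sigmoid(z[0:H])                          # forget gate
    i = sigmoid(z[H:2 * H])                      # input gate
    g = np.tanh(z[2 * H:3 * H])                  # candidate memory
    o = sigmoid(z[3 * H:4 * H])                  # output gate
    c = f * c_prev + i * g                       # merge old and new memory
    h = o * np.tanh(c)                           # exposed hidden state
    return h, c
```

Because both the output gate and tanh are bounded, every component of the emitted hidden state stays strictly inside (-1, 1).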
  • Step S1010: Classify the N extended questions according to the classification prediction result, so that the extended question recognition intent is generated from the classification result.
  • In this embodiment, the N extended questions are classified according to the classification prediction result so that the extended question recognition intent is generated from the classification result; the subsequent matching step, which obtains the response words matching the text information, is then performed on this basis.
  • Step S1011: Match the extended question recognition intent against the preset response words to obtain the response words matching the text information;
  • Specifically, the extended question recognition intent corresponding to the descending-similarity ranking is matched against the preset response words, and the response words matching the text information are obtained: the best restaurant near the current user terminal - the best rated restaurant - the best restaurant in the current city.
  • If step S1010 was taken, the classification prediction result is matched against the preset response words to obtain the response words matching the text information.
  • Step S1012: Synthesize the response words into the response voice corresponding to the voice information, and return the response voice, so that the user terminal plays the response voice.
  • In one embodiment, the response words for the best restaurant near the current user terminal - the best rated restaurant - the best restaurant in the current city are synthesized into the response voice corresponding to the voice information, and the response voice is returned so that the user terminal plays it.
  • The voice method of the intelligent dialogue system proposed in this embodiment of the application computes the similarity of the N extended questions using a pre-trained POMDP model, sorts the N extended questions in descending order of the computed similarity, determines the extended question recognition intent corresponding to the text information, and matches it against the preset response words, making direct voice interaction with the user in the intelligent dialogue system more accurate and effectively improving the user's interactivity and experience.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

本申请公开了一种智能对话***的语音匹配方法、电子装置、计算机设备,通过采用预先训练好的POMDP模型对N个所述扩展问进行相似度计算,并根据计算的所述相似度对N个所述扩展问进行相似度递减排序,确定与所述文本信息相对应的扩展问识别意图,并与预置回应话术进行匹配,使得在智能对话***中与用户的直接语音交互更加精准,并有效提高用户的交互性以及体验性。

Description

智能对话***的语音匹配方法、电子装置、计算机设备
本申请要求于2019年7月3日提交中国专利局,专利名称为“智能对话***的语音匹配方法、电子装置、计算机设备”,申请号为201910593107.4的发明专利的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及智能技术领域,尤其涉及一种智能对话***的语音匹配方法、电子装置、计算机设备。
背景技术
智能语音交互是基于语音输入的新一代交互模式,通过说话就可以得到反馈结果。智能对话***是受到数据驱动的框架需求所推动,这种框架可以降低劳时费力的人工复杂对话管理的成本,并且可以抵御在噪声环境中运行的语音识别器所产生的错误。通过包含不确定性的显式贝叶斯模型,以及通过奖励驱动过程的策略优化。
发明人意识到在许多现实运营环境(如汽车等公共场所)中,将会话语音转换为单词的过程中仍有15%到30%的单词错误率。因此,解释和响应口头命令的***必须实施对话策略,以解决输入的不可靠性并提供错误检查和恢复机制,传统的基于流程图的确定性***构建起来很昂贵并且在操作中通常很脆弱,语音识别的准确率较低。
发明内容
有鉴于此,本申请提出一种智能对话***的语音匹配方法、电子装置、计算机设备,能够在智能对话***的语音***中,通过在POMDP模型的精 确策略学习的基础上加入相似度技术,有效地提升模型精度,使得语音匹配结果更加准确。
首先，为实现上述目的，本申请提出一种智能对话***的语音方法，应用于电子装置中，该方法包括步骤：将获取的用户端输入的语音信息转换为对应的文本信息；通过检索确定与所述文本信息相对应的N个扩展问；通过采用预先训练好的POMDP模型对N个所述扩展问进行相似度计算，并根据计算的所述相似度对N个所述扩展问进行相似度递减排序；根据相似度递减排序后的N个所述扩展问，生成与所述文本信息相对应的扩展问识别意图；根据所述扩展问识别意图与预置回应话术进行匹配，并返回与所述文本信息匹配的回应话术，使得用户端进行回应语音播放。
此外，为实现上述目的，本申请还提供一种电子装置，其包括：转换模块，适于将获取的用户端输入的语音信息转换为对应的文本信息；检索模块，适于通过检索确定与所述文本信息相对应的N个扩展问；相似度计算模块，适于通过采用预先训练好的POMDP模型对N个所述扩展问进行相似度计算，并根据计算的所述相似度对N个所述扩展问进行相似度递减排序；生成模块，适于根据相似度递减排序后的N个所述扩展问，生成与所述文本信息相对应的扩展问识别意图；匹配模块，适于根据所述扩展问识别意图与预置回应话术进行匹配，并返回与所述文本信息匹配的回应话术，使得用户端进行回应语音播放。
为实现上述目的,本申请还提供一种计算机设备,包括存储器、处理器以及存储在存储器上并可在所述处理器上运行的计算机可读指令,所述处理器执行所述计算机可读指令时实现上述方法的步骤。
为实现上述目的,本申请还提供非易失性计算机可读存储介质,其上存储有计算机可读指令,所述计算机可读指令被处理器执行时实现上述方法的步骤。
附图说明
图1是现有的智能语音交互***架构图;
图2是目前的传统对话***的主要组成部分示意图;
图3是POMDP流程图;
图4是POMDP的关系图;
图5是本申请实施例之电子装置一可选的应用环境图;
图6是本申请第一实施例之电子装置的硬件架构示意图;
图7是本申请第一实施例之电子装置的程序模块示意图;
图8是本申请第一实施例之智能对话***的语音方法的流程示意图;
图9是本申请第二实施例之智能对话***的语音方法的流程示意图;
图10是本申请第三实施例之智能对话***的语音方法的流程示意图。
本申请目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。
具体实施方式
为了便于理解本申请实施例的具体实施方式,下面先对现有的智能语音交互***架构进行描述。
图1所示的现有的智能语音交互***架构图中，智能对话***允许用户使用语音作为主要的、通常是唯一的通信媒介，与各种各样的信息***进行交互。传统上，SDS主要部署在呼叫中心应用中，应用中的***可以减少对操作人员的需求，从而降低成本。最近，移动电话中语音接口的使用已经变得很普遍，例如Apple的Siri和Nuance的Dragon Go!，展示了将自然的对话语音交互集成到移动产品、应用程序和服务中的价值。图1显示了整个智能对话***的整体框架。在整个语音交互流程中，对话部分起着承上启下的作用，客户端用户说的话经过ASR转为文本后进入对话***，在对话***中通过语义理解和对话决策后，调用指定的内容服务，输出文本内容，再经过TTS转换成语音后返回给客户端上的用户。
智能对话平台一般由两部分组成：基于自然语言理解的问答***和基于任务驱动的对话***。其中基于自然语言理解的问答***侧重于一问一答，即直接根据用户的问题给出精准的答案，是一个信息检索的过程。需要事先准备好一个知识库，知识库可以包含一个或多个领域，当有用户提问时，会根据用户提问的句子从知识库中找到语义匹配的答案；基于任务驱动的对话***侧重于任务驱动的多轮对话，指用户带着明确的目的而来，希望得到满足特定限制条件的信息或服务，例如：订餐，订票，查找音乐、电影或某种商品等。因为用户需求比较复杂，可能需要分成多轮陈述，在对话过程中不断修改或完善用户自己的需求意图。此外，当用户陈述的需求不够明确时，机器也可以通过询问、澄清或确认来帮助用户找到满意的结果。学术文献中所说的Spoken Dialogue System(SDS)一般特指任务驱动的多轮对话。相比问答***的信息检索，任务驱动的对话***是一个决策的过程，需要机器在对话过程中不断根据当前和上下文状态决策下一步应该采取的最优动作。问答和对话的根本区别在于是否需要维护用户状态和需要一个决策过程来完成任务。
结合图1描述的智能语音交互***的架构,先就其中硬件装置部分进行描述,参阅图5所示,是本申请电子装置20一可选的应用环境示意图。
本实施例中，所述电子装置20可通过有线或无线方式与用户端10以及服务器30进行通信。所述电子装置20通过接口23获取所述用户端10输入的语音信息，根据获取到的语音信息获取服务器30的回应语音，并将所述回应语音通过接口23返回至所述用户端10进行语音播放，从而实现智能对话***的语音匹配。所述用户端10包括眼镜、头盔以及手柄等设备。所述电子装置20还可以是嵌入在用户端10或服务器30中的装置。
参阅图6所示，是本申请电子装置20一可选的硬件架构示意图。电子装置20包括但不仅限于可通过***总线相互通信连接的存储器21、处理器22以及接口23。图6仅示出了具有组件21-23的电子装置20，但是应理解的是，并不要求实施所有示出的组件，可以替代地实施更多或者更少的组件。
所述存储器21至少包括一种非易失性计算机可读存储介质，所述可读存储介质包括闪存、硬盘、多媒体卡、卡型存储器（例如，SD或DX存储器等）、随机访问存储器（RAM）、静态随机访问存储器（SRAM）、只读存储器（ROM）、电可擦除可编程只读存储器（EEPROM）、可编程只读存储器（PROM）、磁性存储器、磁盘、光盘等。在一些实施例中，所述存储器21可以是所述电子装置20的内部存储单元，例如该电子装置20的硬盘或内存。在另一些实施例中，所述存储器也可以是所述电子装置20的外部存储设备，例如该电子装置20上配备的插接式硬盘，智能存储卡（Smart Media Card，SMC），安全数字（Secure Digital，SD）卡，闪存卡（Flash Card）等。当然，所述存储器21还可以既包括所述电子装置20的内部存储单元也包括其外部存储设备。本实施例中，所述存储器21通常用于存储安装于所述电子装置20的操作***和各类应用软件，例如智能对话***24的程序代码等。此外，所述存储器21还可以用于暂时地存储已经输出或者将要输出的各类数据。
所述处理器22在一些实施例中可以是中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器、或其他数据处理芯片。该处理器22通常用于控制所述电子装置20的总体操作。本实施例中,所述处理器22用于运行所述存储器21中存储的程序代码或者处理数据,例如运行所述智能对话***24等。
所述接口23可包括无线接口或有线接口,该接口23通常用于在所述电子装置20与其他电子设备之间建立通信连接。
至此,己经详细介绍了本申请相关设备的硬件结构和功能。下面,将基于上述介绍提出本申请的各个实施例。
第一实施例
首先,本申请实施例提出一种图5所示的电子装置20,并参阅图7所示, 是本申请第一实施例之电子装置20的程序模块示意图。
本实施例中，所述电子装置20包括一系列的存储于存储器21上的计算机可读指令，当该计算机可读指令被处理器22执行时，可以实现本申请各实施例的智能对话***的语音匹配操作。
在一些实施例中，基于该计算机可读指令各部分所实现的特定的操作，电子装置20可以被划分为一个或多个模块。例如，在图7中，所述电子装置20可以被分割成转换模块201、检索模块202、相似度计算模块203、生成模块204、匹配模块205。其中：
转换模块201，适于将获取的用户端输入的语音信息转换为对应的文本信息；
具体地,转换模块201,适于将获取的用户端输入的语音信息通过基于HMM-GMM(隐马尔可夫模型-高斯混合模型)的ASR(自动语音识别技术,Automatic Speech Recognition)进行识别,将所述语音信息转译为对应的预文本信息;将所述预文本信息通过纠错算法进行纠错,获取到纠错后的所述文本信息。
在一实施例中,HMM-GMM的ASR为本领域技术人员公知,此处不再详细赘述,同时,在语音识别技术中,除了HMM-GMM的ASR识别,还有其它的技术方案,此处不再赘述。
检索模块202,适于通过检索确定与所述文本信息相对应的N个扩展问;
在一个实施例中,检索模块202通过elasticsearch对预置数据库中的与业务场景对应的标注意图库进行检索;通过elasticsearch对未检索到的与所述文本信息对应的所述原扩展问进行倒排索引;确定与所述文本信息相对应的N个扩展问。
若检索模块202通过elasticsearch对预置数据库中的与业务场景对应的标注意图库进行检索,检索到与所述文本信息对应的原扩展问;生成与所述文本信息相对应的原扩展问识别意图。
相似度计算模块203,适于通过采用预先训练好的POMDP模型对N个所述扩展问进行相似度计算,并根据计算的所述相似度对N个所述扩展问进行相似度递减排序;
在一个实施例中,相似度计算模块203通过采用预先训练好的POMDP模型对N个所述扩展问进行相似度计算;将N个所述扩展问的相似度与预置阈值进行一一比对,若N个所述相似度均大于等于预置阈值,则将N个所述扩展问进行相似度递减排序。
若N个所述相似度均小于预置阈值,则相似度计算模块203通过预先训练的LSTM+CRF模型对N个所述扩展问进行分类预测;根据分类预测结果对N个所述扩展问进行分类,使得扩展问识别意图根据分类结果生成。
生成模块204,适于根据相似度递减排序后的N个所述扩展问,生成与所述文本信息相对应的扩展问识别意图;
匹配模块205，适于根据所述扩展问识别意图与预置回应话术进行匹配，并返回与所述文本信息匹配的回应话术，使得用户端进行回应语音播放。
具体地,匹配模块205根据所述扩展问识别意图与预置回应话术进行匹配获取与所述文本信息匹配的回应话术;将回应话术进行语音合成为与语音信息对应的回应语音,并返回所述回应语音,使得用户端进行回应语音播放。
本申请实施例所提出的电子装置20,能够通过采用预先训练好的POMDP模型对N个所述扩展问进行相似度计算,并根据计算的所述相似度对N个所述扩展问进行相似度递减排序,确定与所述文本信息相对应的扩展问识别意图,并与预置回应话术进行匹配,使得在智能对话***中与用户的直接语音交互更加精准,并有效提高用户的交互性以及体验性。
此外,本申请还提出一种智能对话***的语音方法。
参阅图8所示,是本申请智能对话***的语音方法之第一实施例的流程示意图。所述智能对话***的语音方法应用于电子装置20中。在本实施例中,根据不同的需求,图8所示的流程图中的步骤的执行顺序可以改变,某些步骤 可以省略。
步骤S800,将获取的用户语音输入的语音信息转换为对应的文本信息;
具体地,当用户需要通过用户端进行智能语音对话时,用户在用户端语音输入如“最好吃的餐厅”,用户语音输入可以是不同国家语言,亦或者是不同地区方言。则将“最好吃的餐厅”转换为文本信息“最好吃的餐厅”。
步骤S801,通过检索确定与所述文本信息相对应的N个扩展问;
具体地,通过检索确定与所述文本信息“最好吃的餐厅”相对应的N个扩展问,例如当前用户端附近最好吃的餐厅,当前所处城市最好吃的餐厅,评价最好的餐厅等N个扩展问,此处具体不做限定。
步骤S802,通过采用预先训练好的POMDP模型对N个所述扩展问进行相似度计算,并根据计算的所述相似度对N个所述扩展问进行相似度递减排序;
在其中一个实施例中，通过采用预先训练好的POMDP模型对N个所述扩展问（当前用户端附近最好吃的餐厅，当前所处城市最好吃的餐厅，评价最好的餐厅等）进行相似度计算，例如当前用户端附近最好吃的餐厅相似度为90%，当前所处城市最好吃的餐厅相似度为70%，评价最好的餐厅相似度为80%，并根据计算的所述相似度对N个所述扩展问进行相似度递减排序，即当前用户端附近最好吃的餐厅--评价最好的餐厅--当前所处城市最好吃的餐厅的相似度递减排序。
步骤S803,根据相似度递减排序后的N个所述扩展问,生成与所述文本信息相对应的扩展问识别意图;
在其中一个实施例中,根据前述的当前用户端附近最好吃的餐厅--评价最好的餐厅--当前所处城市最好吃的餐厅的相似度递减排序,生成与所述文本信息相对应的扩展问识别意图。
步骤S804,根据所述扩展问识别意图与预置回应话术进行匹配,并返回与所述文本信息匹配的回应话术,使得用户端进行回应语音播放。
具体地,根据前述的当前用户端附近最好吃的餐厅--评价最好的餐厅--当前所处城市最好吃的餐厅的相似度递减排序相对应的扩展问识别意图与预置回应话术进行匹配,并返回与所述文本信息匹配的回应话术,使得用户端进行回应语音播放。
本申请实施例所提出的智能对话***的语音匹配方法、电子装置、计算机设备,能够通过采用预先训练好的POMDP模型对N个所述扩展问进行相似度计算,并根据计算的所述相似度对N个所述扩展问进行相似度递减排序,确定与所述文本信息相对应的扩展问识别意图,并与预置回应话术进行匹配,使得在智能对话***中与用户的直接语音交互更加精准,并有效提高用户的交互性以及体验性。
首先在过去几十年中,语音识别技术取得了稳步进展,但在许多现实运营环境(如汽车等公共场所)中,将会话语音转换为单词的过程中仍有15%到30%的单词错误率。因此,解释和响应口头命令的***必须实施对话策略,以解决输入的不可靠性并提供错误检查和恢复机制。结果是,传统的基于流程图的确定性***构建起来很昂贵并且在操作中通常很脆弱。
在过去几年中，出现了一种新的对话管理方法，这种方法是基于部分可观察的马尔科夫决策过程（POMDPs）的数学框架。该方法假设对话演变为马尔科夫过程，即从某个初始状态s_0开始，每个后续状态由转移概率p(s_t|s_{t-1}, a_{t-1})建模。状态s_t不能被直接观察到，这反映了用户话语解释的不确定性；相反，在每一回合，***将SLU的输出视为对用户输入的噪声观察o_t，其概率为p(o_t|s_t)（图2）。过渡和观察概率函数由适当的随机模型表示，这里称为对话模型M。
在每个回合采取哪个行动的决策由编码策略P的第二个随机模型确定。随着对话的进行，在每一个步骤都会分配一个奖励，该奖励被设计用于反映对话***的期望特征。然后可以通过与用户的在线交互或在离线相似域内收集的对话语料库来最大化这些奖励的预期累积总和，从而优化对话模型M和策略模型P。
这种基于POMDP的对话模型结合了两个关键思想：置信状态跟踪和强化学习。这些想法是可分离的，并且本身就有益处。然而，将它们结合起来可以形成一个完整且有根据的数学框架，为进一步的协同增益提供机会。与传统方法相比，这种方法的潜在优势可归纳如下：
1)置信状态提供了对不确定性的明确表征,使***对语音识别错误更加灵敏健壮。在称为置信监视的过程中,通过贝叶斯推断更新每个用户输入之后的置信状态的后验概率。置信状态的设计允许通过模型先验来捕获用户行为,并且,推理过程能够利用识别假设的完整分布,例如混淆网络和N-最佳列表。因此,证据被整合在每一回合中,使得单个错误的影响被显著降低,并且与传统***相比,用户的持久性会得到奖励。如果用户经常足够地重复某件事情,只要正确的假设在N-最佳列表中重复出现,***对他们所说内容的置信就会随时间增加。
2)通过维持在所有状态下的置信分布,***可以有效地、并行地追踪所有可能的对话路径,选择其下一个行动不是基于最可能的状态而是基于所有状态的概率分布。当用户提示问题困难时,当前最可能状态的概率会降低,并且焦点简单地切换到另一状态。因此,没有必要进行反向跟踪或对特定的对话纠错。这使得强大的对话策略中嵌入一个简单的从置信到行动的同质映射。
3)状态的明确表示和策略衍生行动使得对话设计准则可以被合并,通过将奖励与状态-行动对联合。这些奖励的总和构成了对对话性能的客观衡量标准,并使强化学习能够被用于最大化性能,其中包括离线对话语料库性能和在线与真实用户的互动性能。从而引向最优决策策略,避免了昂贵的人工手动调整和细化程序的成本,并且实现了更复杂的可实施的计划,而不是可行的人工设计。
基于前述对POMDP对话模型两个关键思想（置信状态跟踪和强化学习）的描述，下面将对通过采用预先训练好的POMDP模型对N个所述扩展问进行相似度计算、并根据计算的所述相似度对N个所述扩展问进行相似度递减排序的过程进行详细的描述。参阅图9所示，是本申请智能对话***的语音方法之第二实施例的流程示意图。所述智能对话***的语音方法应用于电子装置20中。在本实施例中，根据不同的需求，图9所示的流程图中的步骤的执行顺序可以改变，某些步骤可以省略。
步骤S900,将获取的用户语音输入的语音信息转换为对应的文本信息;
具体地,当用户需要通过用户端进行智能语音对话时,用户在用户端语音输入如“最好吃的餐厅”,用户语音输入可以是不同国家语言,亦或者是不同地区方言。则将“最好吃的餐厅”转换为文本信息“最好吃的餐厅”。
步骤S901,通过elasticsearch对未检索到的与所述文本信息对应的所述原扩展问进行倒排索引;
在其中一个实施例中,通过elasticsearch对未检索确定到的与所述文本信息对应的所述原扩展问进行倒排索引,如文本信息“最好吃的餐厅”的原扩展问未检索到,则通过elasticsearch对“最好吃的餐厅”在数据库中进行倒排索引。
步骤S902,确定与所述文本信息相对应的N个所述扩展问;
在其中一个实施例中,确定与所述文本信息“最好吃的餐厅”相对应的N个扩展问,例如当前用户端附近最好吃的餐厅,当前所处城市最好吃的餐厅,评价最好的餐厅等N个扩展问,此处具体不做限定。
步骤S903,通过采用预先训练好的POMDP模型对N个所述扩展问进行相似度计算;
在一具体实施方式中,采用预先训练好的POMDP模型对N个所述扩展问进行相似度的计算,可以是通过POMDPs的Bellman最优性方程进行近似值的计算,通过策略搜索的方式,先计算策略对应的值函数,通过找到回报值最高的策略来获得最优策略,该最优策略为相似度最高,也就是本实施例所描述的预先训练好的POMDP模型对N个所述扩展问的相似度计算。
在其中一个实施例中,通过采用预先训练好的POMDP模型对N个所述扩展问为当前用户端附近最好吃的餐厅,当前所处城市最好吃的餐厅,评价最好的餐厅等N个扩展问进行相似度计算,例如当前用户端附近最好吃的餐 厅相似度为90%,当前所处城市最好吃的餐厅相似度为70%,评价最好的餐厅相似度为80%。
具体地,采用预先训练好的POMDP模型建立过程如下:
部分可观察的马尔可夫决策过程被定义为元组(S, A, T, R, O, Z, γ, b_0)，其中S是一组状态，s∈S；A是一组动作，a∈A；T定义转移概率P(s_t|s_{t-1}, a_{t-1})；R定义预期的（即时、实值）奖励r(s_t, a_t)∈R；O是一组观察值，o∈O；Z定义观察概率P(o_t|s_t, a_{t-1})；γ是几何衰减因子，0≤γ≤1；b_0是初始置信状态，定义如下。POMDP的操作如下：在每个时间阶段，机器处于未被观察到的状态s_t。由于s_t未确切知道，机器维持一个关于可能状态的分布，称为置信状态b_t，其中b_t(s_t)表示处于特定状态s_t的概率。基于b_t，机器选择动作a_t，接收奖励r_t，并转换到（未观察到的）状态s_{t+1}，其中s_{t+1}仅取决于s_t和a_t。然后机器接收观察结果o_{t+1}，这取决于s_{t+1}和a_t。该过程以影响图的形式表示为图3。
给定现有的置信状态b_t、最后的***动作a_t以及新的观察结果o_{t+1}，更新后的置信状态b_{t+1}(s_{t+1})由下式给出：
b_{t+1}(s_{t+1}) = η · P(o_{t+1}|s_{t+1}, a_t) · Σ_{s_t} P(s_{t+1}|s_t, a_t) · b_t(s_t)    (1)
其中η=P(o_{t+1}|b_t, a_t)是归一化常数，b_0是在采取第一个***动作之前的初始置信状态分布。***动作由策略π确定，策略π可以以各种方式表示。最常见的是从置信状态到动作的确定性映射π(b)∈A，或者动作上的随机分布π(a|b)∈[0,1]，其中π(a|b)是在置信状态b时执行动作a的概率，且Σ_a π(a|b) = 1。
为了便捷,两种类型的策略将使用相同的符号π,在符号中动作的发生决定策略是确定性的还是随机的。但请注意,其他定义是可能的,例如有限状态控制器,或从有限长度的观察序列到动作的映射(参见预测状态表示)
从置信状态b_t出发并在其后遵循策略π，期望的奖励折扣总和V^π(b_t)可以递归地表示为：对于确定性策略，
V^π(b_t) = r(b_t, π(b_t)) + γ Σ_{o_{t+1}} P(o_{t+1}|b_t, π(b_t)) V^π(b_{t+1})    (2)
对于随机策略，
V^π(b_t) = Σ_a π(a|b_t) [ r(b_t, a) + γ Σ_{o_{t+1}} P(o_{t+1}|b_t, a) V^π(b_{t+1}) ]    (3)
其中r(b, a) = Σ_s b(s) r(s, a)，b_{t+1}是由式(1)从b_t、a和o_{t+1}得到的更新置信状态。
相关量是Q函数Q^π(b, a)，它给出在给定置信状态b下执行特定动作a、其后遵循策略π时期望的奖励折扣总和。显然，对于确定性策略，V^π(b) = Q^π(b, π(b))；对于随机策略，
V^π(b) = Σ_a π(a|b) Q^π(b, a)    (4)
最优策略π*是使V^π最大化以产生V*的策略：
V*(b) = max_a [ r(b, a) + γ Σ_o P(o|b, a) V*(b') ]    (5)
其中b'是在置信状态b下执行动作a并观察到o之后由式(1)得到的更新置信状态。
这是POMDPs的Bellman最优性方程。在POMDP中找到满足(5)的策略π通常被称为“求解”或“优化”POMDP。对于简单的任务，本文已经开发了精确的解决方法。但是，标准的POMDP方法无法扩展到代表现实世界对话***所需的复杂性。即使在中等大小的***中，状态、动作和观察的数量也可以很容易地超过10^10。即使是罗列P(s_{t+1}|s_t, a_t)也是难以处理的，因此，直接计算(1)并将直接求解方法应用于(5)是非常困难的。相反，近似值已经被开发出来，近似值利用口语对话任务的特定域属性，以便为模型和策略提供紧凑的表示，并允许使用易处理的算法来执行置信监控和策略优化。这些将在以下部分中介绍。
置信状态表示和监控。关于对话模型M的可能表示方法，此处不再展开。在实际的面向任务的SDS中，状态必须编码三种不同类型的信息：用户的目标g_t、最近用户话语的意图u_t和对话历史h_t。目标包括为完成任务而必须从用户收集的信息，最近的用户话语表示实际表达内容与识别内容的对比，并且历史跟踪与先前回合相关的信息。这表明状态应该分为三个部分：
s_t = (g_t, u_t, h_t)    (6)
由此产生的影响图如图4所示,其中引入了一些合理的独立性假设。以这种方式对状态进行分解是有帮助的,因为它减少了状态转移矩阵的维度,并且减少了条件依赖性的数量。
将(6)中的因子分解代入置信更新方程(1)，并根据图4中所示的独立性假设进行简化，给出了统计SDS的基本置信更新方程：
b_{t+1}(g_{t+1}, u_{t+1}, h_{t+1}) = η · P(o_{t+1}|u_{t+1}) · P(u_{t+1}|g_{t+1}, a_t) · Σ_{g_t} P(g_{t+1}|g_t, a_t) · Σ_{h_t} P(h_{t+1}|g_{t+1}, u_{t+1}, h_t, a_t) · b_t(g_t, h_t)    (7)
(7)中反映了决定置信状态的各个因素以及因此在实际***中代表这些因素所需的基础模型。
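式(1)的置信状态更新可以用如下离散情形的最小示例来说明；其中的状态数、转移矩阵T与观察矩阵Z均为为演示而假设的玩具参数，并非实际***的模型：

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """按式(1)更新置信状态：
    b'(s') = η · P(o|s', a) · Σ_s P(s'|s, a) · b(s)
    T[a][s, s'] 为转移概率，Z[a][s', o] 为观察概率（玩具参数）。"""
    predicted = b @ T[a]                       # 预测步：Σ_s P(s'|s,a) b(s)
    unnormalized = Z[a][:, o] * predicted      # 乘以观察似然 P(o|s',a)
    return unnormalized / unnormalized.sum()   # η 为归一化常数

# 两状态、单动作的玩具模型
T = [np.array([[0.9, 0.1],
               [0.1, 0.9]])]
Z = [np.array([[0.8, 0.2],
               [0.3, 0.7]])]
b0 = np.array([0.5, 0.5])
b1 = belief_update(b0, a=0, o=0, T=T, Z=Z)
```

对上述玩具参数，b1约为[0.727, 0.273]：观察到o=0后，置信向更可能产生该观察的状态0偏移。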
需要说明的是，前述建立的POMDP模型的目的是确定决定置信状态的各个因素，以及因此在实际***中代表这些因素所需的基础模型，例如对N个扩展问通过贝叶斯推断更新每个用户输入之后的置信状态的后验概率。前述的用户目标（如最好吃的餐厅）包含完成任务所需的所有信息；用户真实意图是指用户实际想表达的意图而非***识别出的意图；对话历史则跟踪之前的对话流并进行对应的训练，以确定表示“最好吃的餐厅”这些因素所需的基础模型。
步骤S904,将N个所述扩展问的相似度与预置阈值进行一一比对,将N个所述扩展问中相似度大于预设阈值的扩展问进行相似度递减排序;
在其中一个实施例中,将N个所述扩展问的相似度与预置阈值进行一一比对,若N个所述相似度均大于等于预置阈值,则根据计算的所述相似度对N个所述扩展问进行相似度递减排序为当前用户端附近最好吃的餐厅--评价最好的餐厅--当前所处城市最好吃的餐厅的相似度递减排序。
步骤S905,根据相似度递减排序后的N个所述扩展问,生成与所述文本信息相对应的扩展问识别意图;
在其中一个实施例中，根据前述的当前用户端附近最好吃的餐厅--评价最好的餐厅--当前所处城市最好吃的餐厅的相似度递减排序，生成与所述文本信息相对应的扩展问识别意图。扩展问识别意图是根据相似度递减排序确定的，例如前述的当前用户端附近最好吃的餐厅--评价最好的餐厅--当前所处城市最好吃的餐厅，则生成的扩展问识别意图为当前用户端附近最好吃的餐厅。
步骤S906,根据所述扩展问识别意图与预置回应话术进行匹配获取与所述文本信息匹配的回应话术文本;
具体地,根据前述的当前用户端附近最好吃的餐厅--评价最好的餐厅--当前所处城市最好吃的餐厅的相似度递减排序相对应的扩展问识别意图与预置回应话术进行匹配,获取与所述文本信息匹配的回应话术为当前用户端附近最好吃的餐厅--评价最好的餐厅--当前所处城市最好吃的餐厅,例如生成的扩展问识别意图为当前用户端附近最好吃的餐厅,则与预置回应话术进行匹配获取与所述文本信息匹配的回应话术文本,根据位置匹配、评价匹配获取与当前用户端附近最好吃的餐厅的回应话术文本,例如XX餐厅等。
步骤S907,将所述回应话术进行语音合成为与语音信息对应的回应语音;
具体地,将回应话术为当前用户端附近最好吃的餐厅--评价最好的餐厅--当前所处城市最好吃的餐厅为与语音信息对应的回应语音,例如当前用户端附近最好吃的餐厅匹配的结果为通过语音信息进行对应的回应语音,如语音为“XX餐厅”等。
步骤S908,返回与所述回应语音,使得用户端进行回应语音播放。
本申请实施例所提出的智能对话***的语音方法,能够通过采用预先训练好的POMDP模型对N个所述扩展问进行相似度计算,并根据计算的所述相似度对N个所述扩展问进行相似度递减排序,确定与所述文本信息相对应的扩展问识别意图,并与预置回应话术进行匹配,使得在智能对话***中与用户的直接语音交互更加精准,并有效提高用户的交互性以及体验性。
进一步，解决了由于现实世界的SDS的状态-行动空间非常大、其有效的表示和操作需要复杂的算法和软件、实时贝叶斯推理具有挑战性、POMDP的精确策略学习难以处理的问题：通过在精确策略学习的基础上加入相似度技术，有效提升了语音匹配的精度水平。
参阅图10所示，是本申请智能对话***的语音方法之第三实施例的流程示意图。所述智能对话***的语音方法应用于电子装置20中。在本实施例中，根据不同的需求，图10所示的流程图中的步骤的执行顺序可以改变，某些步骤可以省略。
步骤S1000,将获取的用户语音输入的语音信息通过基于HMM-GMM的ASR进行识别,将所述语音信息转译为对应的预文本信息;
具体地，当用户需要通过用户端进行智能语音对话时，用户在用户端语音输入如“最好吃的餐厅”，用户语音输入可以是不同国家语言，亦或者是不同地区方言。由于不同国家语言，亦或者是不同地区方言，转译为对应的预文本信息有可能会是“最好差的餐厅”等。进一步可以是，通过HMM-GMM进行音频信号的处理，例如第一步，把帧识别成状态（难点）；第二步，把状态组合成音素；第三步，把音素组合成单词。第一步可以当作是GMM做的，后面两步都是HMM做的。对于本领域技术人员而言，HMM-GMM进行语音识别为本领域公知的技术，此处不再具体赘述。
步骤S1001,将所述预文本信息通过纠错算法进行纠错,获取到纠错后的所述文本信息;
具体地，当用户需要通过用户端进行智能语音对话时，用户在用户端语音输入如“最好吃的餐厅”，转译后为“最好差的餐厅”，则需要纠错，将文本信息改为“最好吃的餐厅”。需要说明的是，前述的纠错算法可以是LSTM构造的语言模型，通过对HMM-GMM进行语音识别后的文本进行进一步纠错。语音识别的纠错算法为本领域公知的技术，此处不再赘述。
步骤S1002,通过elasticsearch对预置数据库中的与业务场景对应的标注意图库进行检索;
步骤S1003,检索到与所述文本信息对应的原扩展问;
具体地，使用elasticsearch检索已经入库的业务场景的标注意图库，查询其中是否存在已标注的扩展问；若检索到与所述文本信息对应的原扩展问“最好吃的餐厅”，则通过业务场景的标注意图库返回标注的对应意图，并直接执行步骤S1011。
步骤S1004,通过elasticsearch对未检索到的与所述文本信息对应的所述原扩展问进行倒排索引;
在其中一个实施例中,通过elasticsearch对未检索确定到的与所述文本信息对应的所述原扩展问进行倒排索引,如文本信息“最好吃的餐厅”的原扩展问未检索到,则通过elasticsearch对“最好吃的餐厅”在数据库中进行倒排索引。
步骤S1005,确定与所述文本信息相对应的N个所述扩展问;
在其中一个实施例中,确定与所述文本信息“最好吃的餐厅”相对应的N个扩展问,例如当前用户端附近最好吃的餐厅,当前所处城市最好吃的餐厅,评价最好的餐厅等N个扩展问,此处具体不做限定。
步骤S1006,通过采用预先训练好的POMDP模型对N个所述扩展问进行相似度计算;
在其中一个实施例中,通过采用预先训练好的POMDP模型对N个所述扩展问为当前用户端附近最好吃的餐厅,当前所处城市最好吃的餐厅,评价最好的餐厅等N个扩展问进行相似度计算,例如当前用户端附近最好吃的餐厅相似度为90%,当前所处城市最好吃的餐厅相似度为70%,评价最好的餐厅相似度为80%。
步骤S1007,将N个所述扩展问的相似度与预置阈值进行一一比对,若N个所述相似度均大于等于预置阈值,则将N个所述扩展问进行相似度递减排序;
在其中一个实施例中,将N个所述扩展问的相似度与预置阈值进行一一比对,若N个所述相似度均大于等于预置阈值,则根据计算的所述相似度对N个所述扩展问进行相似度递减排序为当前用户端附近最好吃的餐厅--评价最好的餐厅--当前所处城市最好吃的餐厅的相似度递减排序。
步骤S1008,根据相似度递减排序后的N个所述扩展问,生成与所述文本信息相对应的扩展问识别意图;
在其中一个实施例中,根据前述的当前用户端附近最好吃的餐厅--评价最好的餐厅--当前所处城市最好吃的餐厅的相似度递减排序,生成与所述文本信息相对应的扩展问识别意图。
步骤S1009,若N个所述相似度均小于预置阈值,则通过预先训练的LSTM+CRF模型对N个所述扩展问进行分类预测;
具体地,当步骤S1007将N个所述扩展问的相似度与预置阈值进行一一比对,若N个所述相似度均小于预置阈值,则通过预先训练的LSTM+CRF模型对N个所述扩展问进行分类预测。LSTM+CRF模型为本领域技术人员公知的,此处不再赘述。
在一具体应用场景中，假设一段语音序列，X即表示这段语音中的一句话所含的信息，则第二句话不仅受自身所含信息X1的影响，还取决于第一句话所带的隐藏状态h0的影响。正因为此特性，RNN能够记忆序列本身信息，但因为RNN本身机制的设计，易导致严重的梯度爆炸和梯度消失问题（信息爆炸和后续信息丢失），从而记忆不了太长时间段信息，而且对内存和计算时间要求也高。鉴于此，LSTM提出了三扇“门”，即遗忘门、输入门、输出门，来解决RNN存在的问题。“遗忘门”——忘记部分过去的信息，“输入门”——记住部分现在的信息，然后将过去的记忆与现在的记忆合并后通过“输出门”——决定最终输出的部分。比如识别一段语音，X为其中一句话，我们在识别这句话时，会利用上一句话的信息帮助识别。假设上一句话的信息包含主题的性别，但此时这句话的信息中出现了新的性别，这时候“遗忘门”就起作用了，它会删去上句话中旧的主题性别，同时“输入门”会更新新的主题性别。这样，当前信息状态即可得到一句新的输入。最终我们通过“输出门”决定输出哪部分信息，考虑到主题后可能出现的动词，它可能会输出主题的单复数信息，以便知道如何与动词结合在一起。通过对前期信息有选择的记忆和遗忘，LSTM实现了对相关信息的长期记忆，从而提取了时间特征。则通过LSTM+CRF模型对N个所述扩展问进行分类预测输出对应的信息。
步骤S1010，根据分类预测结果对N个所述扩展问进行分类，使得扩展问识别意图根据分类结果生成。
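LSTM+CRF中CRF一侧的维特比解码可以最小化地示意如下；发射分数与转移分数矩阵均为假设的演示数值，实际***中它们由训练好的LSTM+CRF给出：

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """线性链CRF的维特比解码：emissions为(T, K)的逐时刻标签发射分数，
    transitions[i, j]为标签i转移到标签j的分数，返回得分最高的标签路径。"""
    T, K = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # total[i, j]：t-1时刻取标签i、t时刻取标签j的累计分数
        total = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```

转移分数会改变纯发射分数下的最优路径：当从标签0到标签0的转移分数足够大时，即使第二个时刻的发射分数偏向标签1，解码结果仍会停留在标签0。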
具体地,本实施例的根据分类预测结果对N个所述扩展问进行分类,使得扩展问识别意图根据分类结果生成,则后续步骤对应的匹配获取与所述文本信息匹配的回应话术步骤是根据本步骤实现。
步骤S1011,根据所述扩展问识别意图与预置回应话术进行匹配获取与所述文本信息匹配的回应话术;
具体地,根据前述的当前用户端附近最好吃的餐厅--评价最好的餐厅--当前所处城市最好吃的餐厅的相似度递减排序相对应的扩展问识别意图与预置回应话术进行匹配,获取与所述文本信息匹配的回应话术为当前用户端附近最好吃的餐厅--评价最好的餐厅--当前所处城市最好吃的餐厅。
在一个实施例中,若为步骤S1010,则将分类预测结果与预置回应话术进行匹配获取与所述文本信息匹配的回应话术。
步骤S1012,将所述回应话术进行语音合成为与语音信息对应的回应语音,并返回所述回应语音,使得用户端进行回应语音播放。
具体地,将回应话术为当前用户端附近最好吃的餐厅--评价最好的餐厅--当前所处城市最好吃的餐厅为与语音信息对应的回应语音,并返回所述回应语音,使得用户端进行回应语音播放。
本申请实施例所提出的智能对话***的语音方法,能够通过采用预先训练好的POMDP模型对N个所述扩展问进行相似度计算,并根据计算的所述相似度对N个所述扩展问进行相似度递减排序,确定与所述文本信息相对应的扩展问识别意图,并与预置回应话术进行匹配,使得在智能对话***中与用户的直接语音交互更加精准,并有效提高用户的交互性以及体验性。

Claims (20)

  1. 一种智能对话***的语音方法,应用于电子装置中,所述方法包括步骤:
    将获取的用户端输入的语音信息转换为对应的文本信息;
    通过检索确定与所述文本信息相对应的N个扩展问;
    通过采用预先训练好的POMDP(部分可观察马尔可夫决策过程,Partially Observable Markov Decision Process)模型对N个所述扩展问进行相似度计算,并根据计算的所述相似度对N个所述扩展问进行相似度递减排序;
    根据相似度递减排序后的N个所述扩展问,生成与所述文本信息相对应的扩展问识别意图;
    根据所述扩展问识别意图与预置回应话术进行匹配，并返回与所述文本信息匹配的回应话术，使得用户端进行回应语音播放。
  2. 如权利要求1所述的智能对话***的语音方法,通过采用预先训练好的POMDP模型对N个所述扩展问进行相似度计算,并根据计算的所述相似度对N个所述扩展问进行相似度递减排序的步骤,包括:
    通过采用预先训练好的POMDP模型对N个所述扩展问进行相似度计算;
    将N个所述扩展问的相似度与预置阈值进行一一比对,若N个所述相似度均大于等于预置阈值,则将N个所述扩展问进行相似度递减排序。
  3. 如权利要求2所述的智能对话***的语音方法,将N个所述扩展问的相似度与预置阈值进行一一比对的步骤,还包括:
    若N个所述相似度均小于预置阈值,则通过预先训练的LSTM+CRF模型对N个所述扩展问进行分类预测;
    根据分类预测结果对N个所述扩展问进行分类,使得扩展问识别意图根据分类结果生成。
  4. 如权利要求1所述的智能对话***的语音方法,通过检索确定与所述文本信息相对应的N个扩展问的步骤,包括:
    通过elasticsearch对预置数据库中的与业务场景对应的问题话术的标注意图库进行检索;
    通过elasticsearch对未检索到的与所述文本信息对应的所述原扩展问进行倒排索引;
    根据倒排索引结果确定与所述文本信息相对应的相似度排名为前N个的N个扩展问。
  5. 如权利要求4所述的智能对话***的语音方法,通过elasticsearch对预置数据库中的与业务场景对应的标注意图库进行检索的步骤,还包括:
    若通过elasticsearch对预置数据库中的与业务场景对应的标注意图库进行检索,检索到与所述文本信息对应的原扩展问;
    生成与所述文本信息相对应的原扩展问识别意图。
  6. 如权利要求5所述的智能对话***的语音方法，将获取的用户语音输入的语音信息转换为对应的文本信息的步骤，包括：
    获取的用户语音输入的语音信息通过基于HMM-GMM的ASR进行识别,将所述语音信息转译为对应的预文本信息;
    将所述预文本信息通过纠错算法进行纠错,获取到纠错后的所述文本信息。
  7. 如权利要求1所述的智能对话***的语音方法，根据所述扩展问识别意图与预置回应话术进行匹配，并返回与所述文本信息匹配的回应话术，使得用户端进行回应语音播放，包括：
    根据所述扩展问识别意图与预置回应话术进行匹配获取与所述文本信息匹配的回应话术;
    将所述回应话术进行语音合成为与语音信息对应的回应语音,并返回所述回应语音,使得用户端进行回应语音播放。
  8. 一种电子装置,其包括:
    转换模块,适于将获取的用户端输入的语音信息转换为对应的文本信息;
    检索模块,适于通过检索确定与所述文本信息相对应的N个扩展问;
    相似度计算模块,适于通过采用预先训练好的POMDP模型对N个所述扩展问进行相似度计算,并根据计算的所述相似度对N个所述扩展问进行相似度递减排序;
    生成模块,适于根据相似度递减排序后的N个所述扩展问,生成与所述文本信息相对应的扩展问识别意图;
    匹配模块，适于根据所述扩展问识别意图与预置回应话术进行匹配，并返回与所述文本信息匹配的回应话术，使得用户端进行回应语音播放。
  9. 一种计算机设备，包括存储器、处理器以及存储在存储器上并可在处理器上运行的计算机可读指令，所述处理器执行所述计算机可读指令时实现智能对话***的语音方法，所述方法的步骤包括：
    将获取的用户端输入的语音信息转换为对应的文本信息;
    通过检索确定与所述文本信息相对应的N个扩展问;
    通过采用预先训练好的POMDP(部分可观察马尔可夫决策过程,Partially Observable Markov Decision Process)模型对N个所述扩展问进行相似度计算,并根据计算的所述相似度对N个所述扩展问进行相似度递减排序;
    根据相似度递减排序后的N个所述扩展问,生成与所述文本信息相对应的扩展问识别意图;
    根据所述扩展问识别意图与预置回应话术进行匹配，并返回与所述文本信息匹配的回应话术，使得用户端进行回应语音播放。
  10. 如权利要求9所述的设备,通过采用预先训练好的POMDP模型对N个所述扩展问进行相似度计算,并根据计算的所述相似度对N个所述扩展问进行相似度递减排序的步骤,包括:
    通过采用预先训练好的POMDP模型对N个所述扩展问进行相似度计算;
    将N个所述扩展问的相似度与预置阈值进行一一比对,若N个所述相似度均大于等于预置阈值,则将N个所述扩展问进行相似度递减排序。
  11. 如权利要求10所述的设备,将N个所述扩展问的相似度与预置阈值进行一一比对的步骤,还包括:
    若N个所述相似度均小于预置阈值,则通过预先训练的LSTM+CRF模型对N个所述扩展问进行分类预测;
    根据分类预测结果对N个所述扩展问进行分类,使得扩展问识别意图根据分类结果生成。
  12. 如权利要求9所述的设备,通过检索确定与所述文本信息相对应的N个扩展问的步骤,包括:
    通过elasticsearch对预置数据库中的与业务场景对应的问题话术的标注意图库进行检索;
    通过elasticsearch对未检索到的与所述文本信息对应的所述原扩展问进行倒排索引;
    根据倒排索引结果确定与所述文本信息相对应的相似度排名为前N个的N个扩展问。
  13. 如权利要求12所述的设备,通过elasticsearch对预置数据库中的与业务场景对应的标注意图库进行检索的步骤,还包括:
    若通过elasticsearch对预置数据库中的与业务场景对应的标注意图库进行检索,检索到与所述文本信息对应的原扩展问;
    生成与所述文本信息相对应的原扩展问识别意图。
  14. 如权利要求13中所述的设备，将获取的用户语音输入的语音信息转换为对应的文本信息的步骤，包括：
    获取的用户语音输入的语音信息通过基于HMM-GMM的ASR进行识别,将所述语音信息转译为对应的预文本信息;
    将所述预文本信息通过纠错算法进行纠错,获取到纠错后的所述文本信息。
  15. 一种非易失性计算机可读存储介质，其上存储有计算机可读指令，所述计算机可读指令被处理器执行时实现所述智能对话***的语音方法，所述方法的步骤包括：
    将获取的用户端输入的语音信息转换为对应的文本信息;
    通过检索确定与所述文本信息相对应的N个扩展问;
    通过采用预先训练好的POMDP(部分可观察马尔可夫决策过程,Partially Observable Markov Decision Process)模型对N个所述扩展问进行相似度计算,并根据计算的所述相似度对N个所述扩展问进行相似度递减排序;
    根据相似度递减排序后的N个所述扩展问,生成与所述文本信息相对应的扩展问识别意图;
    根据所述扩展问识别意图与预置回应话术进行匹配，并返回与所述文本信息匹配的回应话术，使得用户端进行回应语音播放。
  16. 如权利要求15所述的存储介质,通过采用预先训练好的POMDP模型对N个所述扩展问进行相似度计算,并根据计算的所述相似度对N个所述扩展问进行相似度递减排序的步骤,包括:
    通过采用预先训练好的POMDP模型对N个所述扩展问进行相似度计算;
    将N个所述扩展问的相似度与预置阈值进行一一比对,若N个所述相似度均大于等于预置阈值,则将N个所述扩展问进行相似度递减排序。
  17. 如权利要求16所述的存储介质,将N个所述扩展问的相似度与预置阈值进行一一比对的步骤,还包括:
    若N个所述相似度均小于预置阈值,则通过预先训练的LSTM+CRF模型对N个所述扩展问进行分类预测;
    根据分类预测结果对N个所述扩展问进行分类,使得扩展问识别意图根据分类结果生成。
  18. 如权利要求15所述的存储介质,通过检索确定与所述文本信息相对应的N个扩展问的步骤,包括:
    通过elasticsearch对预置数据库中的与业务场景对应的问题话术的标注意图库进行检索；
    通过elasticsearch对未检索到的与所述文本信息对应的所述原扩展问进行倒排索引;
    根据倒排索引结果确定与所述文本信息相对应的相似度排名为前N个的N个扩展问。
  19. 如权利要求18所述的存储介质,通过elasticsearch对预置数据库中的与业务场景对应的标注意图库进行检索的步骤,还包括:
    若通过elasticsearch对预置数据库中的与业务场景对应的标注意图库进行检索,检索到与所述文本信息对应的原扩展问;
    生成与所述文本信息相对应的原扩展问识别意图。
  20. 如权利要求19中所述的存储介质，将获取的用户语音输入的语音信息转换为对应的文本信息的步骤，包括：
    获取的用户语音输入的语音信息通过基于HMM-GMM的ASR进行识别,将所述语音信息转译为对应的预文本信息;
    将所述预文本信息通过纠错算法进行纠错,获取到纠错后的所述文本信息。
PCT/CN2019/102841 2019-07-03 2019-08-27 智能对话***的语音匹配方法、电子装置、计算机设备 WO2021000403A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910593107.4A CN110534104B (zh) 2019-07-03 2019-07-03 智能对话***的语音匹配方法、电子装置、计算机设备
CN201910593107.4 2019-07-03

Publications (1)

Publication Number Publication Date
WO2021000403A1 true WO2021000403A1 (zh) 2021-01-07

Family

ID=68659843

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/102841 WO2021000403A1 (zh) 2019-07-03 2019-08-27 Voice matching method for intelligent dialogue system, electronic apparatus, and computer device

Country Status (2)

Country Link
CN (1) CN110534104B (zh)
WO (1) WO2021000403A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113189986A (zh) * 2021-04-16 2021-07-30 中国人民解放军国防科技大学 Two-stage adaptive behavior planning method and system for an autonomous robot

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110895940A (zh) * 2019-12-17 2020-03-20 集奥聚合(北京)人工智能科技有限公司 Intelligent voice interaction method and apparatus
CN111291154B (zh) * 2020-01-17 2022-08-23 厦门快商通科技股份有限公司 Dialect sample data extraction method, apparatus, device, and storage medium
CN111402872B (zh) * 2020-02-11 2023-12-19 升智信息科技(南京)有限公司 Voice data processing method and apparatus for an intelligent voice dialogue system
CN112699213A (zh) * 2020-12-23 2021-04-23 平安普惠企业管理有限公司 Voice intent recognition method and apparatus, computer device, and storage medium
CN114295732B (zh) * 2022-03-09 2022-12-09 深圳市信润富联数字科技有限公司 Method, system, device, and medium for monitoring concrete segregation in a mixer truck

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090287678A1 (en) * 2008-05-14 2009-11-19 International Business Machines Corporation System and method for providing answers to questions
CN107315766A (zh) * 2017-05-16 2017-11-03 广东电网有限责任公司江门供电局 Voice question-answering method and apparatus combining machine intelligence and human question answering
CN107609101A (zh) * 2017-09-11 2018-01-19 远光软件股份有限公司 Intelligent interaction method, device, and storage medium
CN107992543A (zh) * 2017-11-27 2018-05-04 上海智臻智能网络科技股份有限公司 Question-answer interaction method and apparatus, computer device, and computer-readable storage medium
CN109271498A (zh) * 2018-09-14 2019-01-25 南京七奇智能科技有限公司 Natural language interaction method and system for virtual robots
CN109859747A (zh) * 2018-12-29 2019-06-07 北京百度网讯科技有限公司 Voice interaction method, device, and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7392185B2 (en) * 1999-11-12 2008-06-24 Phoenix Solutions, Inc. Speech based learning/training system using semantic decoding
CN101901249A (zh) * 2009-05-26 2010-12-01 复旦大学 Text-based query expansion and ranking method for image retrieval
CN105677783A (zh) * 2015-12-31 2016-06-15 上海智臻智能网络科技股份有限公司 Information processing method and apparatus for an intelligent question-answering system
CN106503175B (zh) * 2016-11-01 2019-03-29 上海智臻智能网络科技股份有限公司 Similar-text query and question expansion method, apparatus, and robot
KR101959292B1 (ko) * 2017-12-08 2019-03-18 주식회사 머니브레인 Method, computer device, and computer-readable recording medium for improving speech recognition performance based on context


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113189986A (zh) * 2021-04-16 2021-07-30 中国人民解放军国防科技大学 Two-stage adaptive behavior planning method and system for an autonomous robot
CN113189986B (zh) * 2021-04-16 2023-03-14 中国人民解放军国防科技大学 Two-stage adaptive behavior planning method and system for an autonomous robot

Also Published As

Publication number Publication date
CN110534104A (zh) 2019-12-03
CN110534104B (zh) 2024-06-04

Similar Documents

Publication Publication Date Title
WO2021000403A1 (zh) Voice matching method for intelligent dialogue system, electronic apparatus, and computer device
US10319381B2 (en) Iteratively updating parameters for dialog states
US11270074B2 (en) Information processing apparatus, information processing system, and information processing method, and program
Serban et al. A deep reinforcement learning chatbot
US10878808B1 (en) Speech processing dialog management
US10446148B2 (en) Dialogue system, a dialogue method and a method of adapting a dialogue system
US10635698B2 (en) Dialogue system, a dialogue method and a method of adapting a dialogue system
US10832667B2 (en) Spoken dialogue system, a spoken dialogue method and a method of adapting a spoken dialogue system
CN111402895B (zh) 语音处理、语音评测方法、装置、计算机设备和存储介质
Perez et al. Dialog state tracking, a machine reading approach using memory network
US20240153489A1 (en) Data driven dialog management
CN114596844B (zh) 声学模型的训练方法、语音识别方法及相关设备
US11132994B1 (en) Multi-domain dialog state tracking
US11514916B2 (en) Server that supports speech recognition of device, and operation method of the server
US11563852B1 (en) System and method for identifying complaints in interactive communications and providing feedback in real-time
JP4634156B2 (ja) 音声対話方法および音声対話装置
CN112767921A (zh) 一种基于缓存语言模型的语音识别自适应方法和***
CN114386426B (zh) 一种基于多元语义融合的金牌话术推荐方法及装置
KR102386898B1 (ko) 인텐츠 기반의 질문/답변 서비스 제공장치 및 방법
Hurtado et al. Spoken dialog systems based on online generated stochastic finite-state transducers
Griol et al. Adaptive dialogue management using intent clustering and fuzzy rules
US20240179243A1 (en) System and method for providing personalized customer experience in interactive communications
US20230252994A1 (en) Domain and User Intent Specific Disambiguation of Transcribed Speech
JP7395976B2 (ja) 情報提示装置、情報提示方法
US20240111960A1 (en) Assessing and improving the deployment of large language models in specific domains

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19936256

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19936256

Country of ref document: EP

Kind code of ref document: A1