CN112287675A - Intelligent customer service intention understanding method based on text and voice information fusion - Google Patents
- Publication number: CN112287675A
- Application number: CN202011589715.7A
- Authority: CN (China)
- Prior art keywords: text, customer service, voice, intelligent customer, coding
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/279 — Handling natural language data; natural language analysis; recognition of textual entities
- G06F40/216 — Handling natural language data; parsing using statistical methods
- G06F40/35 — Handling natural language data; semantic analysis; discourse or dialogue representation
- G06N3/045 — Neural networks; architecture; combinations of networks
- G06N3/049 — Neural networks; temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/084 — Neural networks; learning methods; backpropagation, e.g. using gradient descent
- G06Q30/01 — Commerce; customer relationship services
- G10L15/16 — Speech recognition; speech classification or search using artificial neural networks
- G10L15/22 — Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223 — Speech recognition; execution procedure of a spoken command
Abstract
The invention provides an intelligent customer service intention understanding method based on the fusion of text and voice information, applicable to intelligent customer service products in vertical industries such as finance, education, and medical care. In an intelligent customer service application scenario, the processing is divided into six parts: user input, text encoding, voice encoding, feature fusion, intention understanding, and execution feedback. On top of intention understanding over text with a bidirectional long short-term memory deep neural network (BiLSTM), voice features are introduced, and the intention understanding effect is improved through multimodal information fusion. Moreover, by using text and voice information together, the cascading impact of speech recognition errors can be avoided to the greatest extent.
Description
Technical Field
The invention relates to intelligent customer service products for vertical industries such as finance, education, and medical care, and chiefly optimizes the intention understanding algorithm in such products by means of natural language processing and speech processing methods.
Background
Intent understanding means accurately grasping a user's intent at the semantic level from user preferences, spatiotemporal characteristics, context, interaction history, and multimodal content such as text, gestures, images, and video. In recent years, the internet has generated a large volume of expressions and comments about people, events, products, and the like, conveying various speaking intentions such as asking for advice, requesting help, or voicing dissatisfaction and complaints. The real world is multimodal and interactive, so the information a user queries about is generally multimodal as well. Therefore, beyond the most common plain text, multimodal data such as pictures, video, and audio can assist in understanding user intent and thereby improve the accuracy of information services. Intent understanding is one of the four dimensions for measuring the intelligence of an intelligent customer service product (intent understanding, service provision, smooth interaction, and personality traits); accurate intent understanding can greatly raise the problem-resolution and task-completion rates of intelligent customer service and effectively improve user satisfaction.
The source or form of information is called a modality: senses such as touch, hearing, vision, and smell; information media such as voice, video, and text; or sensors such as radar, infrared, and accelerometers. A multimodal fusion task generally needs to fuse features of two or more modalities; feature fusion takes the feature vectors of two modalities as input and outputs a fused vector.
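The input/output contract just described, two modality feature vectors in, one fused vector out, can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the vector values and dimensions are made up, and plain concatenation stands in for the attention-weighted fusion detailed in step 4.

```python
def fuse(text_vec, audio_vec):
    """Concatenate per-modality feature vectors into one fused vector."""
    return list(text_vec) + list(audio_vec)

# Hypothetical encoder outputs: 4-dim text features, 3-dim audio features.
text_features = [0.2, -0.1, 0.7, 0.05]
audio_features = [0.9, 0.3, -0.4]

fused = fuse(text_features, audio_features)
print(len(fused))  # 7: the fused dimensionality is the sum of both modalities
```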
The traditional approach takes only text as the input of the intelligent customer service system; if the user input is speech, it is simply converted to text by speech recognition. As a result, important characteristics of the user's voice such as tone, speaking rate, and stress cannot be analyzed effectively.
Disclosure of Invention
To address these problems, the invention aims to fully extract modal features such as voice and text with multimodal fusion technology, on top of intention recognition over text using a bidirectional long short-term memory deep neural network (BiLSTM), and ultimately to improve the intention recognition effect in scenarios such as intelligent customer service through the fusion of multimodal information including text and voice.
To this end, the invention provides an intelligent customer service intention understanding method based on the fusion of text and voice information. It introduces voice features on top of intention understanding over text with a bidirectional long short-term memory deep neural network (BiLSTM), and improves the intention understanding effect through multimodal information fusion. In an intelligent customer service application scenario, the proposed processing comprises six steps: user input, text encoding, voice encoding, feature fusion, intention understanding, and execution feedback.
Step 1: and (3) user input:
(1) The user accesses the intelligent customer service system through channels such as web pages, WeChat, mini programs, and official accounts, and initiates a Q&A or conversation by voice call or text. If the user input is speech, it is converted to text by speech recognition for further processing and analysis.
Step 2: text encoding:
The text is encoded with a BiLSTM neural network, which can encode the input text in the forward and reverse directions simultaneously, ensuring that the context of every word is captured;
(1) Scan the text forward with an LSTM deep neural network to obtain the forward feature vector;
(2) Scan the text backward with an LSTM deep neural network to obtain the backward feature vector;
$\overrightarrow{h_t} = \mathrm{LSTM}(x_t, \overrightarrow{h_{t-1}})$, $\overleftarrow{h_t} = \mathrm{LSTM}(x_t, \overleftarrow{h_{t+1}})$, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$

where $\overrightarrow{h_t}$ is the vector obtained by forward-encoding the text at time $t$, $\overleftarrow{h_t}$ is the vector obtained by reverse-encoding the text at time $t$, $x_t$ is the $t$-th word from left to right in the text, $\overrightarrow{h_{t-1}}$ is the forward hidden state at time $t-1$, $\overleftarrow{h_{t+1}}$ is the reverse hidden state at time $t+1$, $[\cdot\,;\cdot]$ denotes the concatenation of the two vectors, and $h_t$ is the bidirectional code of the text at time $t$.
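The forward scan, backward scan, and per-step concatenation above can be sketched as follows. This is an assumption-laden toy, not the patent's network: a single tanh update with fixed weights stands in for a full LSTM cell, and states are scalars rather than vectors, but the control flow (scan left to right, scan right to left, then pair the states at each time step) mirrors the bidirectional encoding described.

```python
import math

def cell(x, h, w_x=0.5, w_h=0.3):
    # Simplified recurrent update standing in for a full LSTM cell.
    return math.tanh(w_x * x + w_h * h)

def bi_encode(xs):
    """Scan the sequence forward and backward, then concatenate
    (here: pair) the two hidden states at each time step."""
    fwd, h = [], 0.0
    for x in xs:               # forward pass: left to right
        h = cell(x, h)
        fwd.append(h)
    bwd, h = [], 0.0
    for x in reversed(xs):     # backward pass: right to left
        h = cell(x, h)
        bwd.append(h)
    bwd.reverse()
    # h_t = [forward_t ; backward_t] — one bidirectional code per step
    return list(zip(fwd, bwd))

codes = bi_encode([0.1, 0.4, -0.2])
print(len(codes))  # 3: one (forward, backward) pair per input step
```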
And step 3: and (3) voice coding:
The voice audio is also encoded with a BiLSTM neural network; the advantage is that the input speech can be encoded in the forward and reverse directions simultaneously, ensuring that the context of each audio segment is accurately captured, as follows;
(1) Scan the voice audio forward with an LSTM deep neural network to obtain the forward feature vector;
(2) Scan the voice audio backward with an LSTM deep neural network to obtain the backward feature vector;
$\overrightarrow{h_t} = \mathrm{LSTM}(a_t, \overrightarrow{h_{t-1}})$, $\overleftarrow{h_t} = \mathrm{LSTM}(a_t, \overleftarrow{h_{t+1}})$, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$

where $\overrightarrow{h_t}$ is the vector obtained by forward-encoding the audio at time $t$, $\overleftarrow{h_t}$ is the vector obtained by reverse-encoding the audio at time $t$, $a_t$ is the $t$-th segment from left to right in the audio, $\overrightarrow{h_{t-1}}$ is the forward hidden state at time $t-1$, $\overleftarrow{h_{t+1}}$ is the reverse hidden state at time $t+1$, $[\cdot\,;\cdot]$ denotes the concatenation of the two vectors, and $h_t$ is the bidirectional code of the audio at time $t$.
And 4, step 4: feature fusion:
The two independent feature vectors obtained in step 2 and step 3 are weighted and fused by function computation:
$s_0 = h_T,\qquad s_i = f(s_{i-1}, y_{i-1}, c_i),\qquad c_i = \sum_j \alpha_{ij} h_j,\qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})},\qquad e_{ij} = a(s_{i-1}, h_j)$

where $s_0$ is the state of the decoder at the initial time, $s_{i-1}$ is the hidden state of the decoder at the previous moment, $y_{i-1}$ is the word decoded at the previous moment, $c_i$ is the attention vector, $\alpha_{ij}$ is the attention weight, $h_j$ is the encoding of the $j$-th word in the source sentence, $h_k$ that of the $k$-th word, and $h_T$ is the hidden state of the encoder at time $T$.
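The softmax normalisation and weighted sum used in the fusion step can be sketched as below. Scalar encoder states stand in for vectors, and the alignment scores are supplied directly rather than computed from decoder and encoder states, so this only illustrates how weights are normalised and applied.

```python
import math

def attention_weights(scores):
    """Softmax-normalise alignment scores e_ij into weights alpha_ij."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, states):
    """Attention vector c_i: weighted sum of encoder states (scalars here)."""
    return sum(w * h for w, h in zip(weights, states))

weights = attention_weights([2.0, 1.0, 0.1])  # made-up alignment scores
print(round(sum(weights), 6))  # 1.0: the weights form a distribution
```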
And 5: it is intended to understand that:
The fused feature vector is fed into a softmax function, and the user intention is identified in the intelligent customer service system;
$P(y_i) = \frac{\exp\big(g(s_i, y_i)\big)}{\sum_{y_k \in V} \exp\big(g(s_i, y_k)\big)}$

where $s_i$ is the hidden state of the decoder at time $i$, $y_i$ is the word decoded at time $i$, $y_k$ is the $k$-th word in the vocabulary $V$, $g(s_i, y_k)$ is the confidence score of hidden state $s_i$ for word $y_k$, $\exp$ is the exponential function with the natural constant $e$ as base, and $P(y_i)$ is the probability of the currently generated target word $y_i$.
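Selecting an intent from the fused representation reduces to a softmax over class scores followed by an argmax. A toy sketch with hypothetical intent labels and made-up logits (the patent does not specify the label set or score values):

```python
import math

INTENTS = ["consult", "complaint", "request_help"]  # hypothetical labels

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_intent(logits):
    """Return the highest-probability intent and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return INTENTS[best], probs[best]

label, prob = predict_intent([1.2, 0.3, 2.5])  # made-up fused-model scores
print(label)  # request_help
```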
Step 6: and (3) performing feedback:
After the intelligent customer service system correctly understands the user's question intention, it matches the intention against a knowledge base maintained in the background and recommends relevant solutions to the user.
Compared with the prior art, the main advantages of the invention are:
(1) By adopting a text-and-voice multimodal encoding technique, the invention makes full use of both text and voice features, improving the intention understanding effect in intelligent customer service;
(2) In scenarios such as intelligent customer service, the invention ensures that the complementary information of voice and text is fully combined without introducing information from any other modality;
(3) Intelligent customer service products rely mainly on voice interaction; by using text and voice information jointly, the invention can avoid the cascading effect of speech recognition errors to the greatest extent.
Drawings
FIG. 1 is a flow chart of an intelligent customer service intent understanding method of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and specific embodiments.
As shown in Fig. 1, this embodiment provides an intelligent customer service intention understanding method based on the fusion of text and voice information. It introduces voice features on top of intention understanding over text with a bidirectional long short-term memory deep neural network (BiLSTM), improving the intention understanding effect through multimodal information fusion. In an intelligent customer service application scenario, the method mainly comprises six parts: user input, text encoding, voice encoding, feature fusion, intention understanding, and execution feedback.
Step 1: and (3) user input:
The user accesses the intelligent customer service system through channels such as web pages, WeChat, mini programs, and official accounts, and initiates a Q&A or conversation by voice call or text.
The traditional method cannot analyze important characteristics of the user's voice such as tone, speaking rate, and stress, because only text serves as the system's input. The greatest advantage of this method is that, in the intelligent customer service application scenario, a deep neural network is fully exploited to extract features from both kinds of user input, voice and text, effectively improving the intention recognition effect.
Step 2: text encoding:
Text encoding is a common strategy in traditional intelligent customer service systems: text features are extracted for intention analysis and understanding.
The text is encoded with a BiLSTM neural network; the advantage is that the input text can be encoded in the forward and reverse directions simultaneously, ensuring that the context of every word is captured;
The text is scanned forward and backward with an LSTM deep neural network to obtain the forward and backward feature vectors;
$\overrightarrow{h_t} = \mathrm{LSTM}(x_t, \overrightarrow{h_{t-1}})$, $\overleftarrow{h_t} = \mathrm{LSTM}(x_t, \overleftarrow{h_{t+1}})$, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$

where $\overrightarrow{h_t}$ is the vector obtained by forward-encoding the text at time $t$, $\overleftarrow{h_t}$ is the vector obtained by reverse-encoding the text at time $t$, $x_t$ is the $t$-th word from left to right in the text, $\overrightarrow{h_{t-1}}$ is the forward hidden state at time $t-1$, $\overleftarrow{h_{t+1}}$ is the reverse hidden state at time $t+1$, $[\cdot\,;\cdot]$ denotes the concatenation of the two vectors, and $h_t$ is the bidirectional code of the text at time $t$.
And step 3: and (3) voice coding:
Voice encoding is an optimization strategy of this proposal: voice features are extracted for intention analysis and understanding.
The speech is encoded with a BiLSTM neural network; the advantage is that the input speech can be encoded in the forward and reverse directions simultaneously, and the context of each audio segment can be accurately captured;
$\overrightarrow{h_t} = \mathrm{LSTM}(a_t, \overrightarrow{h_{t-1}})$, $\overleftarrow{h_t} = \mathrm{LSTM}(a_t, \overleftarrow{h_{t+1}})$, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$

where $\overrightarrow{h_t}$ is the vector obtained by forward-encoding the audio at time $t$, $\overleftarrow{h_t}$ is the vector obtained by reverse-encoding the audio at time $t$, $a_t$ is the $t$-th segment from left to right in the audio, $\overrightarrow{h_{t-1}}$ is the forward hidden state at time $t-1$, $\overleftarrow{h_{t+1}}$ is the reverse hidden state at time $t+1$, $[\cdot\,;\cdot]$ denotes the concatenation of the two vectors, and $h_t$ is the bidirectional code of the audio at time $t$.
And 4, step 4: feature fusion:
Feature fusion is the core strategy of this proposal: text and voice features are extracted for intention analysis and understanding, so that both the semantics of the text the user expresses and important vocal characteristics such as tone, speaking rate, and stress are fully utilized.
Decoding is performed with the feature vectors obtained in steps 2 and 3:
$s_0 = h_T,\qquad s_i = f(s_{i-1}, y_{i-1}, c_i),\qquad c_i = \sum_j \alpha_{ij} h_j,\qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})},\qquad e_{ij} = a(s_{i-1}, h_j)$

where $s_0$ is the state of the decoder at the initial time, $s_{i-1}$ is the hidden state of the decoder at the previous moment, $y_{i-1}$ is the word decoded at the previous moment, $c_i$ is the attention vector, $\alpha_{ij}$ is the attention weight, $h_j$ is the encoding of the $j$-th word in the source sentence, $h_k$ that of the $k$-th word, and $h_T$ is the hidden state of the encoder at time $T$.
And 5: it is intended to understand that:
Through steps 1-4, the fused feature vector of text and voice is obtained. Feeding this vector into the softmax function lets the intelligent customer service system identify the user's intention, accurately grasp what the user wants, provide high-quality service, and create a better user experience.
$P(y_i) = \frac{\exp\big(g(s_i, y_i)\big)}{\sum_{y_k \in V} \exp\big(g(s_i, y_k)\big)}$

where $s_i$ is the hidden state of the decoder at time $i$, $y_i$ is the word decoded at time $i$, $y_k$ is the $k$-th word in the vocabulary $V$, $g(s_i, y_k)$ is the confidence score of hidden state $s_i$ for word $y_k$, $\exp$ is the exponential function with the natural constant $e$ as base, and $P(y_i)$ is the probability of the currently generated target word $y_i$.
Step 6: and (3) performing feedback:
After the intelligent customer service system correctly understands the user's question intention, it matches the intention against a knowledge base maintained in the background and recommends relevant solutions to the user.
By fusing text and voice information in an intelligent customer service product as described, the user's question intention is inferred jointly from both kinds of features, effectively improving the system's performance.
The above only describes the technical idea of the invention and should not be taken to limit its scope; any modification made on the basis of the technical solutions according to this technical idea falls within the scope of the invention. Techniques not covered by the invention can be realized with the prior art.
Claims (5)
1. An intelligent customer service intention understanding method based on the fusion of text and voice information, characterized by comprising the following steps:
Step 1: user input: a user accesses the intelligent customer service system through a web page, WeChat, a mini program, or an official account channel, and initiates a Q&A or conversation by voice call or text;
Step 2: text encoding: encoding the text with a BiLSTM neural network, encoding the input text in the forward and reverse directions simultaneously, and accurately capturing the context of every word to obtain a feature vector;
Step 3: voice encoding: encoding the voice audio with a BiLSTM neural network, encoding the input speech in the forward and reverse directions simultaneously, and accurately capturing the context of each audio segment to obtain a feature vector;
Step 4: feature fusion: weighting and fusing the two independent feature vectors obtained in step 2 and step 3 by function computation;
Step 5: intention understanding: feeding the fused feature vector into a softmax function, and identifying the user intention in the intelligent customer service system;
Step 6: execution feedback: after the intelligent customer service system correctly understands the user's question intention, matching the intention against a knowledge base maintained in the background and recommending solutions to the user.
2. The intelligent customer service intention understanding method of claim 1, wherein step 2 specifically comprises:
Step 2.1: scanning the text forward with an LSTM deep neural network to obtain the forward feature vector;
Step 2.2: scanning the text backward with an LSTM deep neural network to obtain the backward feature vector;
$\overrightarrow{h_t} = \mathrm{LSTM}(x_t, \overrightarrow{h_{t-1}})$, $\overleftarrow{h_t} = \mathrm{LSTM}(x_t, \overleftarrow{h_{t+1}})$, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$

where $\overrightarrow{h_t}$ is the vector obtained by forward-encoding the text at time $t$, $\overleftarrow{h_t}$ is the vector obtained by reverse-encoding the text at time $t$, $x_t$ is the $t$-th word from left to right in the text, $\overrightarrow{h_{t-1}}$ is the forward hidden state at time $t-1$, $\overleftarrow{h_{t+1}}$ is the reverse hidden state at time $t+1$, $[\cdot\,;\cdot]$ denotes the concatenation of the two vectors, and $h_t$ is the bidirectional code of the text at time $t$.
3. The intelligent customer service intention understanding method of claim 2, wherein step 3 specifically comprises:
Step 3.1: scanning the voice audio forward with an LSTM deep neural network to obtain the forward feature vector;
Step 3.2: scanning the voice audio backward with an LSTM deep neural network to obtain the backward feature vector;
$\overrightarrow{h_t} = \mathrm{LSTM}(a_t, \overrightarrow{h_{t-1}})$, $\overleftarrow{h_t} = \mathrm{LSTM}(a_t, \overleftarrow{h_{t+1}})$, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$

where $\overrightarrow{h_t}$ is the vector obtained by forward-encoding the audio at time $t$, $\overleftarrow{h_t}$ is the vector obtained by reverse-encoding the audio at time $t$, $a_t$ is the $t$-th segment from left to right in the audio, $\overrightarrow{h_{t-1}}$ is the forward hidden state at time $t-1$, $\overleftarrow{h_{t+1}}$ is the reverse hidden state at time $t+1$, $[\cdot\,;\cdot]$ denotes the concatenation of the two vectors, and $h_t$ is the bidirectional code of the audio at time $t$.
4. The intelligent customer service intention understanding method of claim 3, wherein the decoding in step 4 proceeds as follows:
$s_0 = h_T,\qquad s_i = f(s_{i-1}, y_{i-1}, c_i),\qquad c_i = \sum_j \alpha_{ij} h_j,\qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})},\qquad e_{ij} = a(s_{i-1}, h_j)$

where $s_0$ is the state of the decoder at the initial time, $s_{i-1}$ is the hidden state of the decoder at the previous moment, $y_{i-1}$ is the word decoded at the previous moment, $c_i$ is the attention vector, $\alpha_{ij}$ is the attention weight, $h_j$ is the encoding of the $j$-th word in the source sentence, $h_k$ that of the $k$-th word, and $h_T$ is the hidden state of the encoder at time $T$.
5. The intelligent customer service intention understanding method of claim 1, wherein the softmax function in step 5 identifies the fused feature vector as follows:
$P(y_i) = \frac{\exp\big(g(s_i, y_i)\big)}{\sum_{y_k \in V} \exp\big(g(s_i, y_k)\big)}$

where $s_i$ is the hidden state of the decoder at time $i$, $y_i$ is the word decoded at time $i$, $y_k$ is the $k$-th word in the vocabulary $V$, $g(s_i, y_k)$ is the confidence score of hidden state $s_i$ for word $y_k$, $\exp$ is the exponential function with the natural constant $e$ as base, and $P(y_i)$ is the probability of the currently generated target word $y_i$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011589715.7A CN112287675B (en) | 2020-12-29 | 2020-12-29 | Intelligent customer service intention understanding method based on text and voice information fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011589715.7A CN112287675B (en) | 2020-12-29 | 2020-12-29 | Intelligent customer service intention understanding method based on text and voice information fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112287675A true CN112287675A (en) | 2021-01-29 |
CN112287675B CN112287675B (en) | 2021-04-30 |
Family
ID=74426212
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011589715.7A Active CN112287675B (en) | 2020-12-29 | 2020-12-29 | Intelligent customer service intention understanding method based on text and voice information fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112287675B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113053366A (en) * | 2021-03-12 | 2021-06-29 | 中国电子科技集团公司第二十八研究所 | Controlled voice repeat consistency checking method based on multi-mode fusion |
CN114373448A (en) * | 2022-03-22 | 2022-04-19 | 北京沃丰时代数据科技有限公司 | Topic detection method and device, electronic equipment and storage medium |
CN115187345A (en) * | 2022-09-13 | 2022-10-14 | 深圳装速配科技有限公司 | Intelligent household building material recommendation method, device, equipment and storage medium |
CN115760022A (en) * | 2023-01-10 | 2023-03-07 | 广州佰锐网络科技有限公司 | Intelligent financial business handling method, system and medium |
WO2023131207A1 (en) * | 2022-01-07 | 2023-07-13 | Huawei Technologies Co., Ltd. | Methods and systems for streamable multimodal language understanding |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI823195B (en) * | 2021-11-25 | 2023-11-21 | 荷蘭商荷蘭移動驅動器公司 | Intelligent recommendation method and system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110188361A (en) * | 2019-06-10 | 2019-08-30 | 北京智合大方科技有限公司 | Speech intention recognition methods and device in conjunction with text, voice and emotional characteristics |
CN111145786A (en) * | 2019-12-17 | 2020-05-12 | 深圳追一科技有限公司 | Speech emotion recognition method and device, server and computer readable storage medium |
- 2020-12-29: CN application CN202011589715.7A — patent CN112287675B/en, active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110188361A (en) * | 2019-06-10 | 2019-08-30 | 北京智合大方科技有限公司 | Speech intention recognition methods and device in conjunction with text, voice and emotional characteristics |
CN111145786A (en) * | 2019-12-17 | 2020-05-12 | 深圳追一科技有限公司 | Speech emotion recognition method and device, server and computer readable storage medium |
Non-Patent Citations (3)
Title |
---|
Ji Xuewu et al., "Driving intention recognition and vehicle trajectory prediction based on LSTM networks", China Journal of Highway and Transport *
Ning Yishuang, "Research on user intention understanding and feedback generation in intelligent voice interaction", China Doctoral Dissertations Full-text Database, Information Science and Technology *
Zheng Binbin et al., "A speech intention understanding method based on multimodal information fusion", Sciencepaper Online *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113053366A (en) * | 2021-03-12 | 2021-06-29 | 中国电子科技集团公司第二十八研究所 | Controlled voice repeat consistency checking method based on multi-mode fusion |
CN113053366B (en) * | 2021-03-12 | 2023-11-21 | 中国电子科技集团公司第二十八研究所 | Multi-mode fusion-based control voice duplicate consistency verification method |
WO2023131207A1 (en) * | 2022-01-07 | 2023-07-13 | Huawei Technologies Co., Ltd. | Methods and systems for streamable multimodal language understanding |
CN114373448A (en) * | 2022-03-22 | 2022-04-19 | 北京沃丰时代数据科技有限公司 | Topic detection method and device, electronic equipment and storage medium |
CN115187345A (en) * | 2022-09-13 | 2022-10-14 | 深圳装速配科技有限公司 | Intelligent household building material recommendation method, device, equipment and storage medium |
CN115760022A (en) * | 2023-01-10 | 2023-03-07 | 广州佰锐网络科技有限公司 | Intelligent financial business handling method, system and medium |
Also Published As
Publication number | Publication date |
---|---|
CN112287675B (en) | 2021-04-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112287675B (en) | Intelligent customer service intention understanding method based on text and voice information fusion | |
Ma et al. | Visual speech recognition for multiple languages in the wild | |
Cheng et al. | Fully convolutional networks for continuous sign language recognition | |
CN110751208B (en) | Criminal emotion recognition method for multi-mode feature fusion based on self-weight differential encoder | |
WO2021072875A1 (en) | Intelligent dialogue generation method, device, computer apparatus and computer storage medium | |
US20200082928A1 (en) | Assisting psychological cure in automated chatting | |
CN114401438B (en) | Video generation method and device for virtual digital person, storage medium and terminal | |
CN112181127A (en) | Method and device for man-machine interaction | |
CN112101045B (en) | Multi-mode semantic integrity recognition method and device and electronic equipment | |
US12008336B2 (en) | Multimodal translation method, apparatus, electronic device and computer-readable storage medium | |
De Coster et al. | Machine translation from signed to spoken languages: State of the art and challenges | |
CN115577161A (en) | Multi-mode emotion analysis model fusing emotion resources | |
WO2023226239A1 (en) | Object emotion analysis method and apparatus and electronic device | |
CN113392265A (en) | Multimedia processing method, device and equipment | |
CN115563290A (en) | Intelligent emotion recognition method based on context modeling | |
CN115599894A (en) | Emotion recognition method and device, electronic equipment and storage medium | |
CN116561265A (en) | Personalized dialogue generation method, model training method and device | |
Chakraborty et al. | Analyzing emotion in spontaneous speech | |
CN115827854A (en) | Voice abstract generation model training method, voice abstract generation method and device | |
CN111354362A (en) | Method and device for assisting hearing-impaired communication | |
CN114463688A (en) | Cross-modal context coding dialogue emotion recognition method and system | |
Xue et al. | Lcsnet: End-to-end lipreading with channel-aware feature selection | |
Pu et al. | Review on research progress of machine lip reading | |
CN114281948A (en) | Summary determination method and related equipment thereof | |
WO2021169825A1 (en) | Speech synthesis method and apparatus, device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||