CN112364636A

CN112364636A - User intention identification system based on dual target coding

Info

Publication number: CN112364636A
Application number: CN202011383699.6A
Authority: CN
Inventors: 章建森; 傅剑文; 陈心童; 韩弘炀
Original assignee: Tianyi Electronic Commerce Co Ltd
Current assignee: Tianyi Electronic Commerce Co Ltd
Priority date: 2020-12-01
Filing date: 2020-12-01
Publication date: 2021-02-12

Abstract

The invention discloses a user intention identification system based on dual target coding, which comprises the following steps: first, generating training samples, wherein each sample comprises a user's intention classification label and the text information corresponding to that label; second, preprocessing the text by performing word segmentation and stop-word removal on the text content of all samples to construct a corpus; and third, generating pre-trained basic word vectors U = (w1, w2, w3, …, wn) using the skip-gram algorithm. The invention provides a method for generating dual target codes in an intention recognition scenario, covering both the word dimension and the user dimension. Compared with a traditional user intention recognition system built on word vectors alone, the dual target coding method concatenates target codes onto the original vectors at both the text side and the user side, so that it takes into account both the semantic information between texts and the prior distribution of the intention recognition targets. The method can effectively improve recognition accuracy in user intention recognition scenarios.

Description

User intention identification system based on dual target coding

Technical Field

The invention relates to the field of data analysis, in particular to a user intention identification system based on dual target coding.

Background

In the field of natural language processing, intent recognition is an important branch. Intent recognition assigns a text to its corresponding category by means of classification. In an intelligent customer service system, intent recognition typically predicts the user's intention from the user's voice (the speech recognition transcript) and text.

Text was originally represented in one-hot form, a representation that both discards the relations between word meanings and causes a dimension explosion. The Word2Vec model proposed by Google addresses both problems well: each word is represented by a k-dimensional vector (k is usually 64 or 128), which avoids the one-hot curse of dimensionality, and words with similar semantics are mapped to nearby points in the vector space, so that semantic information between texts is retained. Pre-trained word vectors perform well on many NLP tasks, such as text classification, sentiment analysis, and intelligent question answering.

However, in the user intention recognition scenario, although word vectors represent semantic relations well, each word by itself carries no prior information about the classification target, so recognition accuracy is limited. To address the insufficient information in the original word vectors and the resulting low accuracy in intention recognition, this patent proposes a method based on dual target coding: the data dimensionality is increased at both the text side and the user side by adding the prior probability distribution of each word and each user over the categories. The method significantly improves recognition accuracy in the user intention recognition scenario.

Disclosure of Invention

The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a user intention identification system based on dual target coding.

In order to solve the technical problems, the invention provides the following technical scheme:

the invention relates to a user intention identification system based on dual target coding, which comprises the following steps:

firstly, generating training samples, wherein the samples are online chat records between customer service personnel and users, the intention labels are entered manually by the customer service personnel, and each sample comprises an intention classification label of the user and the text information corresponding to that label;

secondly, text preprocessing, namely performing word segmentation and stop-word removal on the text content of all samples to construct a corpus, and generating pre-trained basic word vectors U = (w1, w2, w3, …, wn) using the skip-gram algorithm;
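
The preprocessing step can be illustrated with a minimal sketch. It assumes whitespace tokenization as a stand-in for a real word segmenter (Chinese text would need a dedicated segmentation tool), a tiny hypothetical stop-word list, and a hypothetical window size; it only produces the (center, context) pairs that a skip-gram model is trained on, not the trained vectors themselves. The names `preprocess`, `skipgram_pairs`, and `STOP_WORDS` are illustrative and do not come from the patent.

```python
from typing import Iterable

# Hypothetical stop-word list; a real system would load a complete one.
STOP_WORDS = {"the", "a", "is", "to", "my"}


def preprocess(text: str, stop_words: Iterable[str] = STOP_WORDS) -> list[str]:
    """Lower-case, split on whitespace, and drop stop words."""
    stop = set(stop_words)
    return [tok for tok in text.lower().split() if tok not in stop]


def skipgram_pairs(tokens: list[str], window: int = 2) -> list[tuple[str, str]]:
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Build a toy corpus from two made-up chat messages.
corpus = [
    preprocess("How to reset my password"),
    preprocess("The refund is delayed"),
]
training_pairs = [pair for sent in corpus for pair in skipgram_pairs(sent, window=1)]
```

In practice the pairs would be fed to a skip-gram trainer to obtain the word vectors U; that training step is outside this sketch.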

generating dual target codes: using the dual target coding method, a K-dimensional code can be generated on the text side and on the user side respectively, where K is the number of target categories. To obtain robust dual target codes and to prevent data leakage when calculating the prior probabilities, the samples are split by user for 5-fold crossing: the dual target codes of the users in each fold are obtained from the target prior probability of each word calculated on the users of the remaining four folds. The specific steps are as follows:

S1: randomly divide all users into 5 sub-user groups I = (I₁, I₂, I₃, I₄, I₅);
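
Step S1 can be sketched as follows. This is one possible way to do the random split, assuming users are identified by hashable ids; the function name `split_users` and the fixed seed are illustrative choices, not part of the patent.

```python
import random


def split_users(user_ids, n_groups=5, seed=0):
    """Shuffle the users and deal them round-robin into n_groups disjoint groups."""
    users = sorted(set(user_ids))   # deterministic starting order
    rng = random.Random(seed)       # fixed seed only for reproducibility
    rng.shuffle(users)
    return [users[g::n_groups] for g in range(n_groups)]


# Twelve made-up user ids split into the five groups I1..I5.
groups = split_users([f"user{i}" for i in range(12)])
```

Splitting by user rather than by sample keeps all messages of one user inside the same group, which is what makes the later out-of-fold encoding leakage-free.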

S2: for each sub-user group I_m, the target-based prior probability of each word is calculated on the user set I − I_m, i.e. all users outside that sub-user group. That is, the probability that each word belongs to each category under a given user set is calculated as follows:

$$P_{w,i} = \frac{u_{w,i}}{N}$$

where i denotes the category, u_{w,i} is the number of users under the i-th category whose text contains the word w, and N is the total number of users in the user set whose text contains the word w;
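
The word-level prior can be sketched as below. The sketch assumes each user carries a single integer intention label and a token list, which is a simplification not stated in the patent; `word_target_prior` and the toy data are hypothetical.

```python
from collections import defaultdict


def word_target_prior(users, held_out, n_classes):
    """Estimate P_w for every word from users outside the held-out group.

    users:    {user_id: (label, tokens)} with label in range(n_classes)
    held_out: set of user ids forming the group I_m being encoded
    returns:  {word: [P_w,0, ..., P_w,K-1]}
    """
    counts = defaultdict(lambda: [0] * n_classes)   # counts[w][i] = u_{w,i}
    for uid, (label, tokens) in users.items():
        if uid in held_out:
            continue                   # never use the fold being encoded
        for w in set(tokens):          # count each user once per word
            counts[w][label] += 1
    # N is the row sum: all users (outside I_m) whose text contains w.
    return {w: [c / sum(row) for c in row] for w, row in counts.items()}


users = {
    "u1": (0, ["refund", "order"]),
    "u2": (0, ["refund"]),
    "u3": (1, ["refund", "password"]),
    "u4": (1, ["password"]),
}
prior = word_target_prior(users, held_out={"u4"}, n_classes=2)
```

With `u4` held out, "refund" is seen in three users, two of them in category 0, so its code is (2/3, 1/3).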

For a user ∈ I_m whose text sequence is Seq(w), the prior probability P_w of each word in Seq can be obtained from the formula above, and from these the user-based target prior probability can be calculated, i.e. the user's prior probability for each category under the current Seq corpus. The calculation formula is as follows:

where Weight is the weight of the word w, w ranges over all words in the user's Seq, and n is the sequence length. P_w and P_u are the dual target codes obtained by this calculation;
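
A sketch of the user-level code follows. Since the formula itself is not reproduced in the text, the sketch assumes one plausible reading of the symbol definitions: P_u is the sum of Weight_w · P_w over the words of Seq, divided by the sequence length n. The default weight of 1.0 and the uniform fallback for words unseen in the other folds are additional assumptions, and `user_target_prior` is a hypothetical name.

```python
def user_target_prior(tokens, word_prior, n_classes, weights=None):
    """Assumed form: P_u[i] = (1 / n) * sum over w in Seq of Weight_w * P_w[i]."""
    weights = weights or {}
    uniform = [1.0 / n_classes] * n_classes   # assumed fallback for unseen words
    n = len(tokens)
    if n == 0:
        return uniform
    p_u = [0.0] * n_classes
    for w in tokens:
        p_w = word_prior.get(w, uniform)
        weight = weights.get(w, 1.0)          # assumed default weight
        for i in range(n_classes):
            p_u[i] += weight * p_w[i] / n
    return p_u


word_prior = {"refund": [0.75, 0.25], "password": [0.0, 1.0]}
p_u = user_target_prior(["refund", "password"], word_prior, n_classes=2)
p_u_unseen = user_target_prior(["refund", "hello"], word_prior, n_classes=2)
```

With unit weights this reduces to the plain average of the word codes; other weighting schemes (for example a TF-IDF style weight) would change the result and may need renormalization to stay a probability distribution.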

S3: concatenate the target code with the word vector to form a new vector V; the new vector formed for the same word differs across sub-user groups:

that is, the vector of word w under sub-user group I_i is the concatenation of the word vector of w and the target code of w under that user group;
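
The concatenation in S3 can be sketched with plain lists. The vector values and the helper name `splice` are made up for illustration; a real word vector would have 64 or 128 dimensions rather than 3.

```python
def splice(word_vector, target_code):
    """Concatenate a d-dimensional word vector with a K-dimensional target code."""
    return list(word_vector) + list(target_code)


word_vector = [0.1, 0.2, 0.3]        # pre-trained vector of word w (toy size)
code_in_group_1 = [0.75, 0.25]       # P_w computed for sub-user group I1
code_in_group_2 = [0.5, 0.5]         # P_w computed for sub-user group I2

v1 = splice(word_vector, code_in_group_1)
v2 = splice(word_vector, code_in_group_2)
```

The two results share the first three components and differ in the last two, which is the sense in which the same word gets a different vector in different sub-user groups.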

and S4: train the model. With the dual target codes obtained by the above method, a deep network model is constructed: at the Embedding layer, the word-dimension concatenated vectors are used as the input of a Transformer, and the output after max pooling is concatenated with the user-dimension target code to serve as the input of the fully connected layers.
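
The data flow between the Transformer output and the fully connected layers can be sketched without a deep learning framework. The sketch treats the Transformer encoder as a black box and starts from its per-token outputs; the numbers and the names `max_pool` and `classifier_input` are hypothetical, and the actual network would be built and trained in a framework of choice.

```python
def max_pool(sequence):
    """Element-wise maximum over the time axis of equal-length vectors."""
    return [max(column) for column in zip(*sequence)]


def classifier_input(encoded_sequence, user_code):
    """Pool the encoder outputs and append the user-dimension target code P_u."""
    return max_pool(encoded_sequence) + list(user_code)


# Made-up encoder outputs for a two-token sequence (hidden size 3).
encoded = [
    [0.1, 0.9, -0.3],
    [0.4, 0.2, 0.0],
]
features = classifier_input(encoded, user_code=[0.375, 0.625])
```

The resulting feature vector has hidden size plus K components and is what the fully connected layers would receive.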

Compared with the prior art, the invention has the following beneficial effects:

the invention provides a method for generating dual target codes in an intention recognition scenario, covering both the word dimension and the user dimension. Compared with a traditional user intention recognition system built on word vectors alone, the dual target coding method concatenates target codes onto the original vectors at both the text side and the user side, so that it takes into account both the semantic information between texts and the prior distribution of the intention recognition targets. The method can effectively improve recognition accuracy in user intention recognition scenarios.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is a flow diagram of a user intent recognition system of the present invention;

FIG. 2 is a schematic diagram of 5-fold target encoding according to the present invention;

FIG. 3 is a flow chart of the dual target code generation of the present invention.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.

Example 1

As shown in FIGS. 1 to 3, the present invention provides a user intention identification system based on dual target coding, comprising the following steps:

firstly, generating training samples (the samples are online chat records between customer service personnel and users, and the intention labels are entered manually by the customer service personnel), wherein each sample comprises an intention classification label of the user and the text information corresponding to that label;

generating dual target codes: using the dual target coding method, a K-dimensional code (K is the number of target categories) can be generated on the text side and on the user side respectively. To obtain robust dual target codes and to prevent data leakage when calculating the prior probabilities, the samples are split by user for 5-fold crossing: the dual target codes of the users in each fold are obtained from the target prior probability of each word calculated on the users of the remaining four folds. The specific steps are as follows:

S1: randomly divide all users into 5 sub-user groups I = (I₁, I₂, I₃, I₄, I₅);

S2: for each sub-user group I_m:

for a user ∈ I_m whose text sequence is Seq(w), the prior probability P_w of each word in Seq can be obtained from the formula above, and from these the user-based target prior probability can be calculated, i.e. the user's prior probability for each category under the current Seq corpus. The calculation formula is as follows:

Examples are as follows:

1. extract the corpus and pre-train to obtain 128-dimensional word vectors;

2. calculate the dual target codes by K-fold crossing, where K takes the empirical value 5;

3. the numbers of neurons in the MLP layers are 1024, 512, and 30;

4. the output layer is activated using softmax.
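
The example settings can be collected in a small configuration, together with a reference softmax for the output layer. The dictionary keys and the name `softmax` are illustrative; reading the final layer size of 30 as the number of intention categories is an assumption, since the text does not say so explicitly.

```python
import math

# Hyperparameters taken from the example list above; key names are hypothetical.
CONFIG = {
    "word_vector_dim": 128,
    "n_folds": 5,
    "mlp_layer_sizes": (1024, 512, 30),   # last size assumed to be the class count
}


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three categories.
probs = softmax([2.0, 1.0, 0.1])
```

The output is a probability distribution over the categories, and the predicted intention is the index of the largest entry.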

Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A user intention identification system based on dual target coding, comprising the following steps:

S1: randomly divide all users into 5 sub-user groups I = (I₁, I₂, I₃, I₄, I₅);

S2, for each sub-user group