CN112364636A - User intention identification system based on dual target coding - Google Patents

Publication number: CN112364636A
Authority: CN (China)
Prior art keywords: user, word, target, text, intention
Legal status: Pending
Application number: CN202011383699.6A
Other languages: Chinese (zh)
Inventors: 章建森, 傅剑文, 陈心童, 韩弘炀
Current Assignee: Tianyi Electronic Commerce Co Ltd
Original Assignee: Tianyi Electronic Commerce Co Ltd
Priority date: 2020-12-01
Filing date: 2020-12-01
Publication date: 2021-02-12

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a user intention identification system based on dual target coding, which comprises the following steps: first, generating training samples, each comprising an intention classification label of a user and the text information corresponding to that label; second, preprocessing the text by performing word segmentation and stop-word removal on the text content of all samples to construct a corpus; and third, generating pre-trained base word vectors U(w1, w2, w3, …, wn) using the skip-gram algorithm. The invention provides a method for generating dual target codes in the intention recognition scenario, covering both the word dimension and the user dimension. Compared with a traditional user intention recognition system built only on word vectors, the dual target coding method concatenates target codes with the original vectors on both the text side and the user side, and thus takes into account both the semantic information between texts and the prior distribution of the intention recognition targets. The method can effectively improve recognition accuracy in the user intention recognition scenario.

Description

User intention identification system based on dual target coding
Technical Field
The invention relates to the field of data analysis, in particular to a user intention identification system based on dual target coding.
Background
In the field of natural language processing, intention recognition is an important branch. Intention recognition assigns a text to its corresponding category by means of classification. In an intelligent customer service system, intention recognition typically predicts the user's intention from the user's speech (as transcribed by speech recognition) and text.
Text was originally represented in one-hot form, a representation that both discards the relationships between word meanings and causes the dimensionality to explode. The Word2Vec model proposed by Google addresses both problems: each word is represented by a k-dimensional vector (k is typically 64 or 128), which avoids the one-hot curse of dimensionality, and words with similar meanings are mapped to nearby points in the vector space, so semantic information between texts is retained. Pre-trained word vectors work well for many NLP tasks, such as text classification, sentiment analysis, and intelligent question answering.
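As a small illustration of the contrast described above, the following sketch compares cosine similarities under the two representations. The toy vocabulary and the 3-dimensional dense vectors are made-up illustrative values, not trained embeddings and not taken from the patent.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

vocab = ["refund", "reimburse", "password"]  # hypothetical toy vocabulary

# One-hot: the dimension equals the vocabulary size and distinct words are orthogonal.
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Dense vectors: hand-picked values for illustration only.
dense = {
    "refund":    [0.9, 0.1, 0.0],
    "reimburse": [0.8, 0.2, 0.1],
    "password":  [0.0, 0.1, 0.9],
}

print(cosine(one_hot["refund"], one_hot["reimburse"]))  # 0.0
print(cosine(dense["refund"], dense["reimburse"])
      > cosine(dense["refund"], dense["password"]))     # True
```

With one-hot vectors every pair of different words is equally dissimilar, while the dense vectors can place related words closer together than unrelated ones.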
However, in the user intention recognition scenario, although word vectors represent semantic relationships well, each word by itself carries no prior information about the classification target, so recognition accuracy is limited. To address the insufficient information in the original word vectors and the resulting low accuracy of intention recognition, this patent proposes a method based on dual target coding: the data dimensionality is increased on both the text side and the user side by adding the prior probability distribution of each word and of each user over the categories, which markedly improves recognition accuracy in the user intention recognition scenario.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a user intention identification system based on dual target coding.
In order to solve the technical problems, the invention provides the following technical scheme:
The invention relates to a user intention identification system based on dual target coding, which comprises the following steps:
first, generating training samples, wherein the samples are online chat records between customer service personnel and users, the intention labels are entered manually by the customer service personnel, and each sample comprises an intention classification label of the user and the text information corresponding to that label;
second, text preprocessing, namely performing word segmentation and stop-word removal on the text content of all samples to construct a corpus, and generating pre-trained base word vectors U(w1, w2, w3, …, wn) using the skip-gram algorithm;
third, generating dual target codes, wherein the dual target coding method generates a K-dimensional code on the text side and on the user side respectively, K being the number of target categories; to obtain robust dual target codes and to prevent data leakage when calculating the prior probabilities, the samples are split by user into folds for cross-computation (5 folds are used), and the dual target codes of the users in each fold are obtained from the target prior probability of each word calculated on the users of the remaining four folds; the specific steps are as follows:
S1, randomly dividing all users into 5 sub-user groups I = (I_1, I_2, I_3, I_4, I_5);
S2, for each sub-user group I_m ∈ I, using the set of users I - I_m outside that sub-user group to calculate the target-based prior probability of each word, namely the probability that each word belongs to each category within that user set, according to the formula:
$$P_{w,i} = \frac{u_{w,i}}{N}$$
where i denotes the category, u_{w,i} denotes the number of users in the i-th category whose text contains word w, and N is the total number of users in the user set whose text contains word w;
for a user u ∈ I_m whose text sequence is Seq(w), the prior probability P_w of each word in Seq is obtained from the above formula, from which the user-based target prior probability, namely the prior probability of the user for each category under the current Seq corpus, is calculated as:
$$P_u = \frac{1}{n} \sum_{w \in Seq} Weight_w \cdot P_w$$
where Weight_w is the weight of word w, w ranges over all words in the user's Seq, and n is the sequence length; P_w and P_u are the resulting dual target codes;
S3, concatenating the target code with the word vector to form a new vector V, wherein the same word yields different new vectors under different sub-user groups:
$$V_{w,I_i} = [\,U_w \,;\, P_{w,I_i}\,]$$
that is, the vector of word w under sub-user group I_i is the concatenation of the word vector of w and the target code of w under that user group;
S4, training the model: with the dual target codes obtained as above, a deep network model is constructed in which the word-dimension concatenated vectors serve as the Transformer input at the Embedding layer, and the output after max pooling is concatenated with the user-dimension target code to serve as the input of the fully connected layer.
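Steps S1 and S2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the data layout (a dict mapping a user id to a word list and an integer label), the function names, and the fixed random seed are all hypothetical.

```python
import random
from collections import defaultdict

def split_users(user_ids, n_folds=5, seed=0):
    """S1: randomly partition the users into n_folds sub-user groups."""
    ids = sorted(user_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]

def word_priors(samples, users, n_classes):
    """P_{w,i} = u_{w,i} / N, counted over the given users only.

    samples maps user_id -> (list_of_words, label). u_{w,i} is the number of
    users in class i whose text contains w; N is the number of users whose
    text contains w.
    """
    counts = defaultdict(lambda: [0] * n_classes)
    for u in users:
        words, label = samples[u]
        for w in set(words):            # count each user at most once per word
            counts[w][label] += 1
    return {w: [c / sum(cs) for c in cs] for w, cs in counts.items()}

def out_of_fold_word_codes(samples, n_classes, n_folds=5, seed=0):
    """S2: for each group I_m, compute word priors from the users in I - I_m."""
    folds = split_users(samples.keys(), n_folds, seed)
    codes = []
    for m, held_out in enumerate(folds):
        rest = [u for k, fold in enumerate(folds) if k != m for u in fold]
        codes.append((held_out, word_priors(samples, rest, n_classes)))
    return codes

toy = {"u1": (["refund", "order"], 0), "u2": (["login"], 1), "u3": (["login"], 1)}
print(word_priors(toy, ["u1", "u2", "u3"], 2)["login"])  # [0.0, 1.0]
```

Because the priors for the users of a fold never include those users' own labels, a word's code cannot leak the label of the sample it is attached to.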
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a method for generating dual target codes in the intention recognition scenario, covering both the word dimension and the user dimension. Compared with a traditional user intention recognition system built only on word vectors, the dual target coding method concatenates target codes with the original vectors on both the text side and the user side, and thus takes into account both the semantic information between texts and the prior distribution of the intention recognition targets. The method can effectively improve recognition accuracy in the user intention recognition scenario.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow diagram of a user intent recognition system of the present invention;
FIG. 2 is a schematic diagram of 5-fold target encoding according to the present invention;
FIG. 3 is a flow chart of the dual target code generation of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Example 1
As shown in FIGS. 1 to 3, the present invention provides a user intention identification system based on dual target coding, comprising the following steps:
first, generating training samples (the samples are online chat records between customer service personnel and users, and the intention labels are entered manually by the customer service personnel), wherein each sample comprises an intention classification label of the user and the text information corresponding to that label;
second, text preprocessing, namely performing word segmentation and stop-word removal on the text content of all samples to construct a corpus, and generating pre-trained base word vectors U(w1, w2, w3, …, wn) using the skip-gram algorithm;
third, generating dual target codes, wherein the dual target coding method generates a K-dimensional code (K is the number of target categories) on the text side and on the user side respectively; to obtain robust dual target codes and to prevent data leakage when calculating the prior probabilities, the samples are split by user into folds for cross-computation (5 folds are used), and the dual target codes of the users in each fold are obtained from the target prior probability of each word calculated on the users of the remaining four folds; the specific steps are as follows:
S1, randomly dividing all users into 5 sub-user groups I = (I_1, I_2, I_3, I_4, I_5);
S2, for each sub-user group I_m ∈ I, using the set of users I - I_m outside that sub-user group to calculate the target-based prior probability of each word, namely the probability that each word belongs to each category within that user set, according to the formula:
$$P_{w,i} = \frac{u_{w,i}}{N}$$
where i denotes the category, u_{w,i} denotes the number of users in the i-th category whose text contains word w, and N is the total number of users in the user set whose text contains word w;
for a user u ∈ I_m whose text sequence is Seq(w), the prior probability P_w of each word in Seq is obtained from the above formula, from which the user-based target prior probability, namely the prior probability of the user for each category under the current Seq corpus, is calculated as:
$$P_u = \frac{1}{n} \sum_{w \in Seq} Weight_w \cdot P_w$$
where Weight_w is the weight of word w, w ranges over all words in the user's Seq, and n is the sequence length; P_w and P_u are the resulting dual target codes;
S3, concatenating the target code with the word vector to form a new vector V, wherein the same word yields different new vectors under different sub-user groups:
$$V_{w,I_i} = [\,U_w \,;\, P_{w,I_i}\,]$$
that is, the vector of word w under sub-user group I_i is the concatenation of the word vector of w and the target code of w under that user group;
S4, training the model: with the dual target codes obtained as above, a deep network model is constructed in which the word-dimension concatenated vectors serve as the Transformer input at the Embedding layer, and the output after max pooling is concatenated with the user-dimension target code to serve as the input of the fully connected layer.
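The user-level code and the concatenation of step S3 might be sketched as follows. The sketch assumes that the user-level code is the weighted average of the word-level codes over the user's sequence, P_u = (1/n) Σ Weight_w · P_w, which is one reading of the definitions of Weight, n, and P_w given in S2. The default weight of 1 and the all-zero code for words with no prior are further assumptions, since the text does not specify either; the function names are hypothetical.

```python
def user_code(seq, priors, n_classes, weights=None):
    """User-level target code: weighted average of word-level priors over seq."""
    weights = weights or {}
    total = [0.0] * n_classes
    for w in seq:
        p_w = priors.get(w, [0.0] * n_classes)  # assumption: unseen word -> zeros
        weight = weights.get(w, 1.0)            # assumption: default weight is 1
        for i in range(n_classes):
            total[i] += weight * p_w[i]
    n = len(seq)
    return [t / n for t in total] if n else total

def splice(word_vector, word_code):
    """S3: V = [word vector ; word-level target code]."""
    return list(word_vector) + list(word_code)

priors = {"refund": [0.75, 0.25], "login": [0.0, 1.0]}  # toy word-level codes
print(user_code(["refund", "login"], priors, 2))         # [0.375, 0.625]
print(len(splice([0.0] * 128, priors["refund"])))        # 130
```

A 128-dimensional word vector concatenated with a K-dimensional code gives a (128 + K)-dimensional input per word; with the two toy categories here that is 130.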
Examples are as follows:
1. extract the corpus and pre-train to obtain 128-dimensional word vectors;
2. compute the dual target codes by K-fold crossing, with the number of folds K set to the empirical value 5;
3. the MLP layers have 1024, 512, and 30 neurons;
4. the output layer uses softmax activation.
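The data flow after the Transformer in this configuration (max pooling, concatenation with the user-level code, an MLP of 1024, 512, and 30 neurons, softmax) can be traced with a dependency-free sketch. Several things here are assumptions rather than statements from the patent: the Transformer output is replaced by random stand-in values, the weights are random and untrained, the hidden layers use ReLU, the model width is taken as 128 + 30 = 158, and the number of categories is taken to be 30 to match the last layer.

```python
import math
import random

def max_pool(seq_out):
    """Element-wise maximum over the sequence positions."""
    return [max(col) for col in zip(*seq_out)]

def dense(x, w, b, relu=True):
    """Fully connected layer; w holds one row of weights per output neuron."""
    out = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
    return [max(0.0, o) for o in out] if relu else out

def softmax(z):
    m = max(z)                      # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def init(n_in, n_out, rng):
    """Random, untrained weights and zero biases (illustration only)."""
    scale = 1.0 / math.sqrt(n_in)
    w = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def head(seq_out, user_code, layers):
    """Max pooling -> concatenation with user-level code -> MLP -> softmax."""
    x = max_pool(seq_out) + list(user_code)
    for k, (w, b) in enumerate(layers):
        x = dense(x, w, b, relu=(k < len(layers) - 1))
    return softmax(x)

rng = random.Random(0)
n_classes, d_model, seq_len = 30, 158, 6   # assumed sizes, see the note above
sizes = [d_model + n_classes, 1024, 512, n_classes]
layers = [init(a, b, rng) for a, b in zip(sizes, sizes[1:])]
# Stand-in for the Transformer output: one d_model vector per position.
seq_out = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(seq_len)]
user_code = [1.0 / n_classes] * n_classes  # toy user-level target code
probs = head(seq_out, user_code, layers)
print(len(probs))                           # 30
```

In practice the layers would be built and trained in a deep learning framework; the sketch only makes the tensor shapes at each stage explicit.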
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a method for generating dual target codes in the intention recognition scenario, covering both the word dimension and the user dimension. Compared with a traditional user intention recognition system built only on word vectors, the dual target coding method concatenates target codes with the original vectors on both the text side and the user side, and thus takes into account both the semantic information between texts and the prior distribution of the intention recognition targets. The method can effectively improve recognition accuracy in the user intention recognition scenario.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (1)

1. A user intention identification system based on dual target coding, comprising the following steps:
first, generating training samples, wherein the samples are online chat records between customer service personnel and users, the intention labels are entered manually by the customer service personnel, and each sample comprises an intention classification label of the user and the text information corresponding to that label;
second, text preprocessing, namely performing word segmentation and stop-word removal on the text content of all samples to construct a corpus, and generating pre-trained base word vectors U(w1, w2, w3, …, wn) using the skip-gram algorithm;
third, generating dual target codes, wherein the dual target coding method generates a K-dimensional code on the text side and on the user side respectively, K being the number of target categories; to obtain robust dual target codes and to prevent data leakage when calculating the prior probabilities, the samples are split by user into folds for cross-computation (5 folds are used), and the dual target codes of the users in each fold are obtained from the target prior probability of each word calculated on the users of the remaining four folds; the specific steps are as follows:
S1, randomly dividing all users into 5 sub-user groups I = (I_1, I_2, I_3, I_4, I_5);
S2, for each sub-user group I_m ∈ I, using the set of users I - I_m outside that sub-user group to calculate the target-based prior probability of each word, namely the probability that each word belongs to each category within that user set, according to the formula:
$$P_{w,i} = \frac{u_{w,i}}{N}$$
where i denotes the category, u_{w,i} denotes the number of users in the i-th category whose text contains word w, and N is the total number of users in the user set whose text contains word w;
for a user u ∈ I_m whose text sequence is Seq(w), the prior probability P_w of each word in Seq is obtained from the above formula, from which the user-based target prior probability, namely the prior probability of the user for each category under the current Seq corpus, is calculated as:
$$P_u = \frac{1}{n} \sum_{w \in Seq} Weight_w \cdot P_w$$
where Weight_w is the weight of word w, w ranges over all words in the user's Seq, and n is the sequence length; P_w and P_u are the resulting dual target codes;
S3, concatenating the target code with the word vector to form a new vector V, wherein the same word yields different new vectors under different sub-user groups:
$$V_{w,I_i} = [\,U_w \,;\, P_{w,I_i}\,]$$
that is, the vector of word w under sub-user group I_i is the concatenation of the word vector of w and the target code of w under that user group;
S4, training the model: with the dual target codes obtained as above, a deep network model is constructed in which the word-dimension concatenated vectors serve as the Transformer input at the Embedding layer, and the output after max pooling is concatenated with the user-dimension target code to serve as the input of the fully connected layer.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011383699.6A | 2020-12-01 | 2020-12-01 | User intention identification system based on dual target coding

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202011383699.6A | 2020-12-01 | 2020-12-01 | User intention identification system based on dual target coding

Publications (1)

Publication Number | Publication Date
CN112364636A (en) | 2021-02-12

Family

ID=74536949

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN202011383699.6A | Pending | CN112364636A (en) | 2020-12-01 | 2020-12-01 | User intention identification system based on dual target coding

Country Status (1)

Country | Link
CN (1) | CN112364636A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113254617A (en) * | 2021-06-11 | 2021-08-13 | 成都晓多科技有限公司 | Message intention identification method and system based on pre-training language model and encoder
CN113643703A (en) * | 2021-08-06 | 2021-11-12 | 西北工业大学 | Password understanding method of voice-driven virtual human
CN113643703B (en) * | 2021-08-06 | 2024-02-27 | 西北工业大学 | Password understanding method for voice-driven virtual person

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN109062893B (en) Commodity name identification method based on full-text attention mechanism
CN113158665B (en) Method for improving dialog text generation based on text abstract generation and bidirectional corpus generation
CN111401061A (en) Method for identifying news opinion involved in case based on BERT and BiLSTM-Attention
CN111783462A (en) Chinese named entity recognition model and method based on dual neural network fusion
CN109977416A (en) A kind of multi-level natural language anti-spam text method and system
CN111680541A (en) Multi-modal emotion analysis method based on multi-dimensional attention fusion network
CN112182191B (en) Structured memory map network model for multi-round-mouth linguistic understanding
CN109255027B (en) E-commerce comment sentiment analysis noise reduction method and device
CN113626589B (en) Multi-label text classification method based on mixed attention mechanism
CN110263325A (en) Chinese automatic word-cut
CN110781290A (en) Extraction method of structured text abstract of long chapter
CN112364636A (en) User intention identification system based on dual target coding
CN114461804B (en) Text classification method, classifier and system based on key information and dynamic routing
CN112749274A (en) Chinese text classification method based on attention mechanism and interference word deletion
CN113449084A (en) Relationship extraction method based on graph convolution
CN110472245A (en) A kind of multiple labeling emotional intensity prediction technique based on stratification convolutional neural networks
CN114004220A (en) Text emotion reason identification method based on CPC-ANN
CN117807232A (en) Commodity classification method, commodity classification model construction method and device
CN116737922A (en) Tourist online comment fine granularity emotion analysis method and system
CN117558270B (en) Voice recognition method and device and keyword detection model training method and device
Huang A CNN model for SMS spam detection
CN114742016A (en) Chapter-level event extraction method and device based on multi-granularity entity differential composition
CN113254575B (en) Machine reading understanding method and system based on multi-step evidence reasoning
CN116522905B (en) Text error correction method, apparatus, device, readable storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210212