CN112906367A

CN112906367A - Information extraction structure, labeling method and identification method of consumer text

Info

Publication number: CN112906367A
Application number: CN202110172747.5A
Authority: CN
Inventors: 杨骏; 李�杰
Original assignee: Shanghai Hongyuan Information Technology Co ltd
Current assignee: Shanghai Hongyuan Information Technology Co ltd
Priority date: 2021-02-08
Filing date: 2021-02-08
Publication date: 2021-06-04

Abstract

The invention discloses an information extraction structure, a labeling method and an identification method for consumer text. The information extraction structure comprises six dimensions: demand, scene, scheme, driving factor, hindering factor and question neutral factor. The structure is represented by a plurality of two-dimensional arrays and BIO labels so that it can be learned by a model. By constructing an identification model, the elements in a text to be detected can be identified and classified by dimension, and the corresponding relations between the elements are established according to their dimensions.

Description

Information extraction structure, labeling method and identification method of consumer text

Technical Field

The invention relates to the technical field of natural language processing, in particular to an information extraction structure, a labeling method and an identification method of consumer texts.

Background

In natural language processing of consumer text, common information extraction technologies include named entity recognition, aspect extraction and text sentiment analysis. Specifically, named entity recognition takes a text as input and outputs the named entities mentioned in it, where named entities generally refer to names of people, places, brands, and the like. Aspect extraction takes a text as input and outputs the aspects mentioned in it, where aspects generally refer to attributes of a product, such as price, efficacy and appearance. Text sentiment analysis includes document-level, entity-level, aspect-level and entity-aspect-level sentiment analysis.

These analysis methods are isolated from one another: none of them can automatically extract entities and aspects and, at the same time, perform sentiment analysis on the corresponding entities and aspects. If the methods are simply chained in series, errors propagate: a wrong prediction in an upstream task (such as named entity recognition or aspect extraction) can cause a large deviation in the result of the downstream task (sentiment analysis).

In addition, document-level, entity-level and aspect-level sentiment analysis ignore the fact that a document may express different attitudes toward different aspects of different entities, so they cannot reflect the author's attitudes one by one. Entity-aspect-level sentiment analysis reflects these attitudes correctly, but it depends on the entities and aspects output by other models, which limits its application in real scenarios.

Furthermore, the structured semantic definitions of the prior art cannot cover the main information. For example, social media contains a large number of expressions such as: "Babies easily get indigestion in summer, and they get well quickly after taking synbiotics." Named entity recognition can recognize the brand name "synbiotics", aspect extraction can recognize "digestion", and entity-aspect-level sentiment analysis can output (synbiotics, digestion, positive). However, these technologies miss that the indigestion occurs in summer, that the object is a baby, that indigestion is the demand, that synbiotics is the solution, and that getting well quickly is the driving factor for choosing synbiotics. Such information, which existing methods cannot identify, including scenes, objects, demands, solutions, driving factors and question neutral factors, is very helpful for brand product research and development and for marketing.

Disclosure of Invention

The invention aims to provide an information extraction structure, a labeling method and an identification method for consumer text, which are used for identifying the structured information and the corresponding relations in consumer text.

In order to achieve the above object, an aspect of the present invention provides an information extraction structure of a consumer text, comprising:

a demand to express a consumer's demand;

a scene to express a scene where the demand occurs;

a scheme to express a solution to the demand;

a driving factor to express a reason for selecting the solution;

a hindering factor to express a reason that hinders selection of the solution;

a question neutral factor to express questioning or neutral elements in purchasing decisions.

In another aspect, the present invention further provides a labeling method for the information structure of a consumer text, which includes the following steps:

acquiring a text to be identified;

extracting information from the text to be identified, and establishing n two-dimensional arrays according to the extracted information, wherein each two-dimensional array comprises elements and their dimensions, the association between elements is established through the dimensions, and the dimensions comprise: demands, scenes, schemes, driving factors, hindering factors and question neutral factors;

and labeling the elements in the two-dimensional arrays with a BIO structure to obtain a BIO labeling result, wherein each labeled element comprises a BIO label and a dimension.

In another aspect, the present invention further provides an identification method, including:

acquiring a marked text to be detected;

classifying elements in the text to be detected according to BIO labels, and assigning the classified elements to the corresponding information extraction dimensions, wherein the dimensions comprise: demands, scenes, schemes, driving factors, hindering factors and question neutral factors;

and outputting the elements of the text to be detected after dimension classification according to the classification result.

Further, in the classification process, the method further comprises:

inputting the text into a BERT coding model, and converting the text into an encoded feature sequence, wherein the feature sequence consists of vector representations that incorporate contextual semantics.

Further, in the classification process, the method further comprises:

inputting the BERT-encoded feature sequence into an LSTM model, and outputting a feature sequence with dimension expression;

and inputting the feature sequence with dimension expression into a Dropout layer and a fully connected layer for generalization processing and distributed feature mapping.

Further, in the classification process, the method further comprises:

inputting the output of the Dropout layer and the fully connected layer into a conditional random field, and identifying the sequential relationship among the BIO labels;

correcting the recognition result of the conditional random field by word segmentation, so as to complete the classification of information extraction dimensions;

and formatting and outputting the information extraction result of the consumer text according to the BIO labels and the classification result of the information extraction dimensions.

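In another aspect, the present invention further provides an identification method, including:

acquiring a marked text to be detected;

classifying elements in the text to be detected according to BIO labels, and assigning the classified elements to the corresponding information extraction dimensions, wherein the dimensions comprise: demands, scenes, schemes, driving factors, hindering factors and question neutral factors;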

outputting the elements of the text to be detected after dimension classification according to the classification result;

identifying the corresponding relations between the dimension classifications, wherein the corresponding relations comprise demand-scene, demand-scheme, scheme-driving factor, scheme-hindering factor and scheme-question neutral factor, and outputting the dimension classification relations between the elements according to the corresponding relations.

Further, in the classification process, the method further comprises:

correcting the recognition result of the conditional random field by word segmentation, so as to complete the classification of information extraction dimensions.

Further, in the process of identifying the corresponding relationship, the method further includes:

inputting the candidate corresponding relations of the dimension classifications into a BERT identification model, and outputting, by the BERT identification model, the corresponding relations that hold according to its tuned parameters.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

Fig. 1 is a flowchart of a method for labeling a text information structure of a consumer according to an embodiment of the present invention.

Fig. 2 is a flow chart of an identification method according to one embodiment of the present invention.

Fig. 3 is a flow chart of an identification method according to another embodiment of the present invention.

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.

The invention first defines an information extraction structure for consumer text, which comprises the following six dimensions: demands, scenes, schemes, driving factors, hindering factors and question neutral factors. These dimensions are used to classify the elements in the consumer text.

The demand is the consumer's need expressed in the text, including symptoms, index results, appeals, mood, life, questions, etc.

For example: "Today I am [in a bad mood]", or "Today I [had diarrhea many times]".

The scene is the context in which the demand occurs, and comprises people, time, space, accompanying events (including background information related to the demand, such as examinations) and the cause of the demand.

For example: "[My son] is always ill", or "[Because of my constitution] I am always catching colds".

The scheme is a solution for a certain demand, and comprises categories, brands, products and practices.

For example: "can only take [medicine] and [wipe the body to cool down]", or "want to drink [milk tea]".

The driving factor is the reason for choosing a solution, that is, the reason that directly drives the consumer to choose the product or adopt the method. It includes appeals, product characteristics (ingredients, materials, colors, etc.), place of origin, packaging, quality and safety, brand image, cost performance and emotional experience.

For example: "Discotheque children's calcium [tastes good] and is [simple and convenient to chew]".

The hindering factor is the reason that hinders choosing a solution, including side effects, unmet appeals, packaging, quality and safety, brand image, cost performance and emotional experience.

For example: "Junlebao milk powder now [easily causes internal heat]".

The question neutral factor is a point that the consumer questions, or a neutral opinion, in the decision to purchase a product.

For example: "the [taste] is not good", or "produced in [Germany]".

The embodiment of the invention provides a labeling method for the information structure of a consumer text. The purpose of labeling data is to enable a natural language processing model to learn the way humans think and the results of human cognition. This is done by recording the text together with the demands, scenes, schemes, driving factors, hindering factors and question neutral factors it contains, and by recording the corresponding relations between the dimensions in the text.

Fig. 1 is a flowchart of a method for labeling a text information structure of a consumer according to an embodiment of the present invention. As shown in fig. 1, the method for labeling the text information structure of the consumer of the present invention comprises the following steps:

S101, acquiring a text to be recognized.

The text to be recognized can come from consumer text or from marketing corpora issued by supply chain manufacturers.

For example, the text may come from a C-side data source such as consumer reviews, consumer complaints and consumer messages, or from a B-side data source such as product design instructions and marketing content.

The text to be recognized can be text data that the system obtains by converting voice data collected from a user through a voice acquisition device, or text data input by the user directly through an input device.

S102, information extraction.

The purpose of information extraction is to associate the extracted elements with their corresponding dimensions, wherein the dimensions comprise demands, scenes, schemes, driving factors, hindering factors and question neutral factors. Therefore, when a sentence contains multiple elements and dimensions, the relations between elements, between dimensions, and between elements and dimensions all need to be considered.

In one embodiment, information is extracted from the text to be recognized, and n two-dimensional arrays are established according to the extracted information, wherein each two-dimensional array comprises elements and their dimensions, the association between elements is established through the dimensions, and the dimensions comprise: demands, scenes, schemes, driving factors, hindering factors and question neutral factors. The details are shown in the following table:
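As a minimal sketch of how such two-dimensional arrays might be represented, the following Python snippet stores each group of associated elements as a list of [element, dimension] rows. The dimension identifiers, the function name build_arrays and the sample elements (taken from the synbiotics sentence in the Background) are illustrative assumptions, not the format used by the invention.

```python
from typing import List, Tuple

# Hypothetical identifiers for the six dimensions described in the text.
DIMENSIONS = ("demand", "scene", "scheme", "driving_factor",
              "hindering_factor", "question_neutral_factor")

# One annotated element: (element text, dimension).
Element = Tuple[str, str]


def build_arrays(groups: List[List[Element]]) -> List[List[List[str]]]:
    """Turn groups of associated (element, dimension) pairs into 2-D arrays.

    Each group becomes one two-dimensional array whose rows are
    [element, dimension]; elements in the same array are associated.
    """
    arrays = []
    for group in groups:
        for _, dim in group:
            if dim not in DIMENSIONS:
                raise ValueError(f"unknown dimension: {dim}")
        arrays.append([[text, dim] for text, dim in group])
    return arrays


# Assumed annotation of the example sentence from the Background section.
example_groups = [[
    ("baby", "scene"),
    ("summer", "scene"),
    ("indigestion", "demand"),
    ("synbiotics", "scheme"),
    ("get well quickly", "driving_factor"),
]]
arrays = build_arrays(example_groups)
```

Here n equals the number of groups, and the association between elements is carried by membership in the same array together with the dimension of each row.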

S103, labeling and post-processing.

The elements in the two-dimensional arrays are labeled with a BIO structure to obtain a BIO labeling result, wherein each labeled element comprises a BIO label and a dimension. BIO labeling is a commonly used scheme in sequence labeling tasks: B (begin) labels the first character of an entity, I (inside) labels the remaining characters of the entity, and O (outside) labels characters that do not belong to any entity.
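The BIO labeling described above can be sketched as follows, assuming that each element is given as a character span with its dimension. The label format "B-scheme" / "I-scheme", the function name bio_labels and the sample sentence are assumptions made for illustration only.

```python
from typing import List, Tuple


def bio_labels(text: str, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Label every character of `text` with O, B-<dim> or I-<dim>.

    `spans` holds (start, end, dimension) with `end` exclusive; the spans
    are assumed not to overlap.
    """
    labels = ["O"] * len(text)
    for start, end, dim in spans:
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span out of range: {(start, end)}")
        labels[start] = f"B-{dim}"          # first character of the element
        for i in range(start + 1, end):
            labels[i] = f"I-{dim}"          # remaining characters
    return labels


# "想喝奶茶" ("want to drink milk tea"): the element "奶茶" is a scheme.
text = "想喝奶茶"
labels = bio_labels(text, [(2, 4, "scheme")])
```

Each character thus carries both the structural label (B, I or O) and the dimension of the element it belongs to.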

The elements after labeling are shown in the following table:

FIG. 2 is a flow diagram of an identification method according to one embodiment of the invention. As shown in fig. 2, the identification method according to the embodiment of the present invention includes the following steps:

S201, obtaining the marked text to be detected.

Specifically, in step S201, the text to be detected is denoted as T = {w1, w2, w3, …, wn}, where wi is the i-th character in the text.

S202, classifying elements in the text to be detected according to BIO labels, and assigning the classified elements to the corresponding information extraction dimensions, wherein the dimensions comprise: demands, scenes, schemes, driving factors, hindering factors and question neutral factors.

In one embodiment, the text to be detected is first input into the BERT coding model, which outputs vector representations of the characters that incorporate contextual semantics; each character is represented as a 768-dimensional vector. The output of the BERT coding layer is V = {v1, v2, v3, …, vn}, where vi is the encoded vector representation of the i-th character in the text.

Then, the BERT-encoded feature sequence is input into an LSTM model, which outputs a feature sequence with dimension expression, denoted as H = {h1, h2, h3, …, hn}.

It will be appreciated that the LSTM model is designed to handle both long-term and short-term dependencies in context information. It can encode global context information, which helps in understanding the semantics of the whole sentence, and it can also encode local information. The LSTM model identifies demands, scenes, schemes, driving factors, hindering factors and question neutral factors with high efficiency.

In one embodiment, the feature sequence with dimension expression is input into a Dropout layer and a fully connected layer for generalization processing and distributed feature mapping.

Preferably, the dropout rate of the Dropout layer is 0.5, that is, 50% of the nodes in the layer are randomly selected and their values are set to 0. In this way the model can still make good predictions when only part of the information is retained.
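The effect of a Dropout layer with a rate of 0.5 can be sketched in plain Python as follows. Two points are assumptions not stated in the text: each node is dropped independently with probability 0.5 (so the share of zeroed nodes is 50% only on average), and the retained values are rescaled by 1/(1 - rate), which is a common convention. The function name dropout is hypothetical.

```python
import random
from typing import List, Optional


def dropout(values: List[float], rate: float = 0.5, training: bool = True,
            rng: Optional[random.Random] = None) -> List[float]:
    """Zero each value with probability `rate` during training.

    Surviving values are scaled by 1 / (1 - rate) so that the expected
    sum is unchanged (assumed convention). Outside training the input is
    returned unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(values)
    rng = rng or random.Random()
    scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * scale for v in values]


hidden = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(hidden, rate=0.5, rng=random.Random(0))
```

At prediction time the layer is switched off (training=False), so all nodes contribute to the output.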

In one embodiment, the output of the Dropout layer and the fully connected layer is input into a conditional random field to identify the sequential relationship among the BIO labels.

It will be appreciated that a conditional random field can be viewed as a generalization of the maximum entropy Markov model for labeling problems. Its main value here is to learn the sequential constraints among the labels in the BIO structure; for example, an I label can only be preceded by a B or an I label.
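To illustrate why the label order matters, the following sketch decodes the best label sequence under the rule that an I label may only follow a B or I label. It is a simplification and not the conditional random field itself: a trained conditional random field learns transition scores from data, whereas here the constraint is hard-coded, the rule is additionally assumed to require the same dimension, and the per-character scores are invented for the example. The names allowed and viterbi are hypothetical.

```python
import math
from typing import Dict, List


def allowed(prev: str, curr: str) -> bool:
    """BIO rule: I-X may only follow B-X or I-X, and may not start a text."""
    if not curr.startswith("I-"):
        return True
    dim = curr[2:]
    return prev in (f"B-{dim}", f"I-{dim}")


def viterbi(emissions: List[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring label path that respects the BIO rule.

    `emissions[t][label]` is the score of `label` at position t.
    """
    labels = list(emissions[0])
    # "START" is a pseudo-label, so a path cannot begin with an I label.
    score = {l: emissions[0][l] if allowed("START", l) else -math.inf
             for l in labels}
    back: List[Dict[str, str]] = []
    for em in emissions[1:]:
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for curr in labels:
            def via(prev: str) -> float:
                return score[prev] if allowed(prev, curr) else -math.inf
            best_prev = max(labels, key=via)
            new_score[curr] = via(best_prev) + em[curr]
            pointers[curr] = best_prev
        score = new_score
        back.append(pointers)
    path = [max(labels, key=lambda l: score[l])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Invented scores: taken position by position, the best labels would be
# I-scheme, I-scheme, which violates the BIO rule.
emissions = [
    {"O": 0.1, "B-scheme": 0.2, "I-scheme": 0.9},
    {"O": 0.1, "B-scheme": 0.1, "I-scheme": 0.8},
]
best_path = viterbi(emissions)
```

With the constraint applied, the decoded path starts with a B label even though the I label has the higher score at the first position.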

In one embodiment, the recognition result of the conditional random field is corrected by word segmentation to complete the classification of information extraction dimensions.

Specifically, this step addresses the problem that the boundaries of certain words are recognized inaccurately. The word segmentation correction step is calculated as follows:

Input: the predicted output of the conditional random field {pi}, where i = 1, …, n; and the word segmentation result of the original sentence.

Calculation steps:

for i from 1 to n:

    if the prediction result at the i-th position is not label O, i.e. pi ≠ O:

        for the word that contains the i-th character after word segmentation, for all characters in the word:

            if the category label of the prediction result is O:

                set the category labels of all characters of the word to the category label of pi;

                set the structural label of the first character of the word to B and those of the remaining characters to I.

For example, for the brand name "help fit", the model may recognize only part of the name as a scheme and miss the character "help". The output of the conditional random field is corrected by word segmentation as follows: as long as one character of a segmented word is recognized as one of demand, scene, scheme, driving factor, hindering factor or question neutral factor, all characters of the word are assigned to that category. If several characters in one word are recognized as different categories, no correction is made; this situation is very rare, occurring on average less than once per 1000 words, and is usually caused by errors in the word segmentation itself.
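The correction rule can be sketched in Python as follows, assuming that the labels have the form "B-<dimension>", "I-<dimension>" or "O" and that the word segmentation result is already available as a list of words covering the sentence. The function name correct_with_segmentation and the sample sentence are illustrative assumptions; any word segmenter could supply the words.

```python
from typing import List


def correct_with_segmentation(labels: List[str], words: List[str]) -> List[str]:
    """Extend a partly recognized word to a fully recognized one.

    A word is corrected only if it contains an O-labeled character and
    all of its non-O characters share a single category. Words whose
    characters fall into several categories are left unchanged.
    """
    if sum(len(w) for w in words) != len(labels):
        raise ValueError("segmentation must cover every character exactly once")
    corrected = list(labels)
    start = 0
    for word in words:
        end = start + len(word)
        span = labels[start:end]
        categories = {lab[2:] for lab in span if lab != "O"}
        if len(categories) == 1 and "O" in span:
            cat = next(iter(categories))
            corrected[start] = f"B-{cat}"       # first character of the word
            for i in range(start + 1, end):
                corrected[i] = f"I-{cat}"       # remaining characters
        start = end
    return corrected


# Only "茶" of the word "奶茶" (milk tea) was recognized as a scheme.
predicted = ["O", "O", "O", "B-scheme"]
fixed = correct_with_segmentation(predicted, ["想", "喝", "奶茶"])
```

After correction the whole word "奶茶" is labeled as one scheme element, which is the boundary repair the step is meant to achieve.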

S203, outputting the elements of the text to be detected after dimension classification according to the classification result.

In one embodiment, after classification prediction is performed according to the BIO structure, the prediction result is converted into structured columns through a post-processing step: for each run of characters that begins with a B label and continues with I labels, the corresponding characters are extracted and output to the corresponding category. The output form is shown in the following table:
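A minimal sketch of this post-processing step is given below, assuming labels of the form "B-<dimension>", "I-<dimension>" and "O". The function name extract_elements and the dictionary output (one entry per dimension, standing in for the structured columns) are assumptions for illustration.

```python
from typing import Dict, List, Optional


def extract_elements(text: str, labels: List[str]) -> Dict[str, List[str]]:
    """Group characters into elements and collect them by dimension."""
    result: Dict[str, List[str]] = {}
    current_dim: Optional[str] = None
    current_chars: List[str] = []

    def flush() -> None:
        # Close the element that is currently open, if any.
        nonlocal current_dim, current_chars
        if current_dim is not None:
            result.setdefault(current_dim, []).append("".join(current_chars))
        current_dim, current_chars = None, []

    for ch, lab in zip(text, labels):
        if lab.startswith("B-"):
            flush()
            current_dim, current_chars = lab[2:], [ch]
        elif lab.startswith("I-") and current_dim == lab[2:]:
            current_chars.append(ch)
        else:
            # "O", or an I label that does not continue the open element.
            flush()
    flush()
    return result


columns = extract_elements("想喝奶茶", ["O", "O", "B-scheme", "I-scheme"])
```

Each key of the resulting dictionary corresponds to one output column, and each value lists the elements recognized for that dimension.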

Fig. 3 is a flow chart of an identification method according to another embodiment of the present invention. As shown in fig. 3, the identification method of the present embodiment includes the following steps:

S301, obtaining the marked text to be detected.

S302, classifying elements in the text to be detected according to BIO labels, and assigning the classified elements to the corresponding information extraction dimensions, wherein the dimensions comprise: demands, scenes, schemes, driving factors, hindering factors and question neutral factors.

The classification, including the word segmentation correction of the conditional random field output, is carried out in the same manner as in step S202.

S303, outputting the elements of the text to be detected after dimension classification according to the classification result.

S304, identifying the corresponding relations between the dimension classifications. The corresponding relations comprise demand-scene, demand-scheme, scheme-driving factor, scheme-hindering factor and scheme-question neutral factor, and the dimension classification relations between the elements are output according to the corresponding relations.

For the recognition result of each text, if it contains both the antecedent and the consequent of one or more relations, the antecedents and consequents are combined to form two-element relation pairs of the form "relation antecedent - relation consequent". For example:
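As a minimal sketch, the candidate relation pairs can be generated by combining every recognized antecedent with every recognized consequent of each relation type. The dimension identifiers, the function name candidate_pairs and the sample elements (taken from the synbiotics sentence in the Background) are illustrative assumptions.

```python
from itertools import product
from typing import Dict, List, Tuple

# (antecedent dimension, consequent dimension) for the five relation types.
RELATIONS = [
    ("demand", "scene"),
    ("demand", "scheme"),
    ("scheme", "driving_factor"),
    ("scheme", "hindering_factor"),
    ("scheme", "question_neutral_factor"),
]


def candidate_pairs(elements: Dict[str, List[str]]) -> List[Tuple[str, str, str, str]]:
    """List every antecedent-consequent combination present in the text.

    Each pair is (antecedent dimension, antecedent element,
    consequent dimension, consequent element).
    """
    pairs = []
    for head_dim, tail_dim in RELATIONS:
        heads = elements.get(head_dim, [])
        tails = elements.get(tail_dim, [])
        for head, tail in product(heads, tails):
            pairs.append((head_dim, head, tail_dim, tail))
    return pairs


recognized = {
    "demand": ["indigestion"],
    "scene": ["baby", "summer"],
    "scheme": ["synbiotics"],
    "driving_factor": ["get well quickly"],
}
pairs = candidate_pairs(recognized)
```

A relation type yields no pairs when either its antecedent or its consequent is absent from the recognition result, which matches the condition that both must be contained at the same time.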

The original text and each candidate relation pair are input into a BERT model, which outputs whether the corresponding relation holds. This BERT model is different from the BERT coding model: its internal parameters are continuously adjusted and optimized during learning. The design is equivalent to fine-tuning a pre-trained language model on the task so that it learns task-specific parameters.

Finally, the model outputs the elements and dimensions that have a corresponding relation.

In another aspect, the present invention further provides a computer device, which includes a memory and a processor, wherein the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to execute the steps of the above method.

In another aspect, the present invention also provides a computer-readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the steps of the above method.

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 4, an electronic device of one embodiment of the invention includes one or more processors 1000, one or more input devices 2000, one or more output devices 3000, and a memory 4000.

In one embodiment of the invention, the processor 1000, the input device 2000, the output device 3000, and the memory 4000 may be connected by a bus or other means. The input device 2000 and the output device 3000 may be standard wired or wireless communication interfaces.

The processor 1000 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor.

Memory 4000 may be a high speed RAM memory or a non-volatile memory such as a disk memory. The memory 4000 is used to store a set of computer programs, and the input device 2000, the output device 3000, and the processor 1000 may call the program codes stored in the memory 4000.

The memory 4000 stores a computer program comprising program instructions that, when executed by the processor, cause the processor to perform the steps of the labeling method and the identification method described in the above embodiments.

An embodiment of the present invention also provides a computer-readable storage medium. The computer-readable storage medium may be a high speed RAM memory or a non-volatile memory such as a disk memory. The computer-readable storage medium may be connected to an external computing device or a network so that the set of computer programs stored in it can be read. The computer program stored in the computer-readable storage medium comprises program instructions which, when executed by a processor, cause the processor to perform the steps of the methods described in the above embodiments.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. Information extraction structure of consumer text, characterized in that the information extraction structure comprises the following dimensions:

a demand to express a consumer's demand;

a scene to express a scene where the demand occurs;

a scheme to express a solution to the demand;

a driving factor to express a reason for selecting the solution;

a hindering factor to express a reason that hinders selection of the solution;

a question neutral factor to express questioning or neutral elements in purchasing decisions.

2. The labeling method of the text information structure of the consumer is characterized by comprising the following steps:

acquiring a text to be identified;

3. An identification method, comprising:

acquiring a marked text to be detected;

4. An identification method as claimed in claim 3, characterized in that in the classification process, it further comprises:

5. An identification method as claimed in claim 4, characterized in that in the classification process, it further comprises:

6. An identification method as claimed in claim 4, characterized in that in the classification process, it further comprises:

7. An identification method, comprising:

acquiring a marked text to be detected;

8. An identification method as claimed in claim 7, characterized in that in the classification process, it further comprises:

9. An identification method as claimed in claim 8, characterized in that in the classification process, it further comprises:

10. An identification method as claimed in claim 9, characterized in that in the classification process, it further comprises:

11. An identification method as claimed in claim 10, characterized in that in the process of identifying the correspondence, it further comprises: