CN112906367A - Information extraction structure, labeling method and identification method of consumer text - Google Patents

Info

Publication number
CN112906367A
Authority
CN
China
Prior art keywords
text
elements
dimension
classification
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110172747.5A
Other languages
Chinese (zh)
Inventor
杨骏
李�杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Hongyuan Information Technology Co ltd
Original Assignee
Shanghai Hongyuan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Hongyuan Information Technology Co ltd filed Critical Shanghai Hongyuan Information Technology Co ltd
Priority to CN202110172747.5A priority Critical patent/CN112906367A/en
Publication of CN112906367A publication Critical patent/CN112906367A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an information extraction structure, a labeling method and an identification method for consumer text. The information extraction structure comprises six dimensions: demand, scene, scheme, driving factor, hindering factor and question neutral factor. The structure is labeled with a plurality of two-dimensional arrays and BIO tags so that it can be learned by a model. By constructing an identification model, the elements of a text to be detected can be identified and classified by dimension, and the correspondence between the elements is established according to their dimensions.

Description

Information extraction structure, labeling method and identification method of consumer text
Technical Field
The invention relates to the technical field of natural language processing, in particular to an information extraction structure, a labeling method and an identification method of consumer texts.
Background
In natural language processing of consumer text, common information extraction technologies include named entity recognition, aspect extraction and text sentiment analysis. Specifically, named entity recognition takes a text as input and outputs the named entities mentioned in it, where named entities generally refer to names of people, places, brands, and the like. Aspect extraction takes a text as input and outputs the aspects mentioned in it, which generally refer to attributes of a product such as price, efficacy and appearance. Text sentiment analysis includes document-level, entity-level, aspect-level and entity-aspect-level sentiment analysis.
These analysis methods are isolated from one another: none of them can both extract entities and aspects automatically and perform sentiment analysis on the corresponding entities and aspects. Because they are isolated, rigidly chaining them in series produces error propagation; that is, a wrong prediction in an upstream task (such as named entity recognition or aspect extraction) can cause a large deviation in the result of the downstream task (sentiment analysis).
In addition, document-level, entity-level and aspect-level sentiment analysis ignore the fact that a document may express different attitudes toward different aspects of different entities, so they cannot reflect the author's attitudes one by one. Entity-aspect-level sentiment analysis does reflect them correctly, but its entities and aspects depend on the output of other models, which limits its application in real scenarios.
Furthermore, the structured semantic definitions of the prior art cannot cover the main information. For example, social media contains a large number of expressions such as: "Babies are prone to indigestion in summer, and they get well quickly after taking the synbiotics." Named entity recognition can recognize the brand name "synbiotics", aspect extraction can recognize "digestion", and entity-aspect-level sentiment analysis can output (synbiotics, digestion, positive). However, these technologies miss that the scene in which the indigestion occurs is summer, the object is a baby, the indigestion is a demand, the solution is the synbiotics, and getting well quickly is the driving factor for choosing the synbiotics. Such information, which existing methods cannot identify and which includes scenes, objects, demands, solutions, driving factors and question neutral factors, is very helpful for brand product development and marketing.
Disclosure of Invention
The invention aims to provide an information extraction structure, a labeling method and an identification method for consumer text, which are used for identifying the structured information and the correspondences in a consumer text.
In order to achieve the above object, an aspect of the present invention provides an information extraction structure of a consumer text, comprising:
a demand to express a consumer's demand;
a scene to express a scene where the demand occurs;
a scheme to express a solution to the demand;
a driving factor to express a reason for selecting the solution;
a hindering factor to express a reason that hinders selection of the solution;
a question neutral factor to express a point of doubt or a neutral opinion in a purchasing decision.
In another aspect, the present invention further provides a labeling method for the information structure of consumer text, which includes the following steps:
acquiring a text to be identified;
extracting information from the text to be identified, and establishing n two-dimensional arrays according to the extracted information, wherein each two-dimensional array comprises elements and their dimensions, the association between elements is established through the dimensions, and the dimensions comprise: demand, scene, scheme, driving factor, hindering factor and question neutral factor;
and labeling the elements in the two-dimensional arrays with a BIO structure to obtain a BIO labeling result, wherein each labeled element comprises a BIO tag and a dimension.
In another aspect, the present invention further provides an identification method, including:
acquiring a marked text to be detected;
classifying elements in the text to be detected according to BIO labels, and assigning the classified elements to the corresponding information extraction dimensions, wherein the dimensions comprise: demand, scene, scheme, driving factor, hindering factor and question neutral factor;
and outputting the elements of the text to be detected after dimension classification according to the classification result.
Further, in the classification process, the method further comprises:
inputting the text into a BERT coding model, and converting the text into a coded feature sequence, wherein the feature sequence consists of vector representations that incorporate context semantics.
Further, in the classification process, the method further comprises:
inputting the BERT-coded feature sequence into an LSTM model, and outputting a feature sequence with dimension expression;
and inputting the feature sequence with dimension expression into a Dropout layer and a fully connected layer for generalization processing and distributed feature mapping.
Further, in the classification process, the method further comprises:
inputting the output of the Dropout layer and the fully connected layer into a conditional random field, and identifying the sequential relationship among the BIO labels;
correcting the recognition result of the conditional random field by word segmentation to complete the classification of information extraction dimensions;
and formatting and outputting the information extraction result of the consumer text according to the BIO labels and the classification result of the information extraction dimensions.
In another aspect, the present invention further provides an identification method, including:
acquiring a marked text to be detected;
classifying elements in the text to be detected according to BIO labels, and assigning the classified elements to the corresponding information extraction dimensions, wherein the dimensions comprise: demand, scene, scheme, driving factor, hindering factor and question neutral factor;
outputting the dimension-classified elements of the text to be detected according to the classification result;
identifying correspondences between the classified dimensions, wherein the correspondences comprise demand-scene, demand-scheme, scheme-driving factor, scheme-hindering factor and scheme-question neutral factor, and outputting the dimension relations between the elements according to the correspondences.
Further, in the classification process, the method further comprises:
inputting the text into a BERT coding model, and converting the text into a coded feature sequence, wherein the feature sequence consists of vector representations that incorporate context semantics.
Further, in the classification process, the method further comprises:
inputting the BERT-coded feature sequence into an LSTM model, and outputting a feature sequence with dimension expression;
and inputting the feature sequence with dimension expression into a Dropout layer and a fully connected layer for generalization processing and distributed feature mapping.
Further, in the classification process, the method further comprises:
inputting the output of the Dropout layer and the fully connected layer into a conditional random field, and identifying the sequential relationship among the BIO labels;
and correcting the recognition result of the conditional random field by word segmentation to complete the classification of information extraction dimensions.
Further, in the process of identifying the correspondences, the method further includes:
inputting the candidate correspondences between the classified dimensions into a BERT identification model, which outputs the correspondences of the identified objects according to its tuned parameters.
The invention discloses an information extraction structure, a labeling method and an identification method for consumer text. The information extraction structure comprises six dimensions: demand, scene, scheme, driving factor, hindering factor and question neutral factor. The structure is labeled with a plurality of two-dimensional arrays and BIO tags so that it can be learned by a model. By constructing an identification model, the elements of a text to be detected can be identified and classified by dimension, and the correspondence between the elements is established according to their dimensions.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart of a labeling method for the information structure of consumer text according to an embodiment of the present invention.
Fig. 2 is a flowchart of an identification method according to one embodiment of the invention.
Fig. 3 is a flow chart of an identification method according to another embodiment of the present invention.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
The invention first defines an information extraction structure for consumer text, which comprises the following six dimensions: demand, scene, scheme, driving factor, hindering factor and question neutral factor. These dimensions are used to define the elements in a consumer text.
The demand is the demand of the consumer in the text, including symptoms, indicator results, appeals, moods, life, questions, etc.
For example: today I am [in a bad mood], or today I [had diarrhea many times].
The scene is the scene in which the demand occurs, and comprises people, time, space, accompanying events (including background information on how the demand arose, and examinations) and the cause of the demand.
For example: [my son] is always ill, or [because of my constitution] I keep catching colds.
The scheme is a solution for a certain demand, and comprises categories, brands, products and practices.
For example: can only [take medicine] and [wipe the body to cool down]; want to drink [milk tea].
The driving factor is the reason for choosing a solution, that is, the reason that directly drives the consumer to choose the product or adopt the practice. It includes appeals, product characteristics (ingredients, materials, colors, etc.), place of origin, packaging, quality and safety, brand image, cost performance and emotional experience.
For example: calcium for discotheque children [tastes good] and is [simple and convenient to chew].
The hindering factor is the reason that hinders choosing a solution, including side effects, unmet appeals, packaging, quality and safety, brand image, cost performance and emotional experience.
For example: Junlebao milk powder now [easily causes internal heat].
The question neutral factor is a point of doubt or a neutral opinion in a consumer's decision to purchase a product.
For example: whether the [taste] is good, or whether it is produced in [Germany].
The embodiment of the invention provides a labeling method for the information structure of consumer text. The purpose of labeling data is to enable a natural language processing model to learn the way humans think and the results of human cognition. Labeling records the text, the demands, scenes, schemes, driving factors, hindering factors and question neutral factors in the text, and the correspondences between these dimensions in the text.
Fig. 1 is a flowchart of a labeling method for the information structure of consumer text according to an embodiment of the present invention. As shown in Fig. 1, the labeling method comprises the following steps:
S101, acquiring a text to be recognized.
Wherein the text to be identified can be from consumer text or marketing corpora issued by supply chain manufacturers.
For example, the consumer text may come from a C-side data source such as consumer evaluation, consumer complaints, consumer messages, etc., or from a B-side data source such as product design instructions, marketing content, etc.
The text to be recognized may be text data that the system obtains by converting voice data collected from a user through a voice acquisition device, or text data input directly by the user through an input device.
S102, information extraction.
The purpose of information extraction is to associate the extracted elements with their corresponding dimensions, wherein the dimensions comprise demand, scene, scheme, driving factor, hindering factor and question neutral factor. Therefore, when a sentence contains multiple elements and dimensions, the relationships between element and element, dimension and dimension, and element and dimension all need to be considered.
In one embodiment, the invention extracts information from a text to be recognized, and establishes n two-dimensional arrays according to the extracted information, wherein each two-dimensional array comprises elements and their dimensions, the association between elements is established through the dimensions, and the dimensions comprise: demand, scene, scheme, driving factor, hindering factor and question neutral factor. The details are shown in the following table:
[Table image BDA0002939358140000081: extracted elements and their dimensions; not reproduced in the text]
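Since the table itself is not available in the text, the array structure described in step S102 can be sketched as follows. This is only an illustrative sketch: the function names `make_arrays` and `group_by_dimension`, the English dimension strings, and the annotation of the Background example are assumptions, not the patent's own data.

```python
# The six dimensions named in the text.
DIMENSIONS = (
    "demand", "scene", "scheme",
    "driving factor", "hindering factor", "question neutral factor",
)


def make_arrays(annotations):
    """Turn (element, dimension) annotations into two-item arrays [element, dimension]."""
    arrays = []
    for element, dimension in annotations:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension!r}")
        arrays.append([element, dimension])
    return arrays


def group_by_dimension(arrays):
    """Associate elements through their dimension: dimension -> list of elements."""
    groups = {}
    for element, dimension in arrays:
        groups.setdefault(dimension, []).append(element)
    return groups


# Hypothetical annotation of the example sentence from the Background section.
arrays = make_arrays([
    ("summer", "scene"),
    ("baby", "scene"),
    ("indigestion", "demand"),
    ("synbiotics", "scheme"),
    ("gets well quickly", "driving factor"),
])
groups = group_by_dimension(arrays)
```

Grouping by dimension is one simple way to read "the association of the elements is established through the dimensions"; other representations would serve equally well.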
S103, labeling and post-processing.
The elements in the two-dimensional arrays are labeled with a BIO structure to obtain a BIO labeling result, wherein each labeled element comprises a BIO tag and a dimension. BIO labeling is a commonly used scheme in sequence labeling tasks: B (begin) marks the first character of an entity, I (inside) marks the remaining characters of the entity, and O (outside) marks characters that do not belong to any entity.
The elements after labeling are shown in the following table:
[Table image BDA0002939358140000082: elements with their BIO labels and dimensions; not reproduced in the text]
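The BIO labeling of step S103 can be illustrated with a minimal sketch. The tag format `B-<dimension>` / `I-<dimension>`, the function name `bio_label`, the span representation, and the Chinese rendering of the example ("summer", "baby", "indigestion") are all assumptions made for illustration.

```python
def bio_label(text, spans):
    """Label each character of `text` with O, B-<dimension> or I-<dimension>.

    spans -- list of (start, end, dimension) with character offsets, end exclusive
    """
    tags = ["O"] * len(text)
    for start, end, dimension in spans:
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span out of range: {(start, end)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end)}")
        tags[start] = f"B-{dimension}"          # first character of the element
        for i in range(start + 1, end):
            tags[i] = f"I-{dimension}"          # remaining characters of the element
    return tags


# Hypothetical example: 夏天 (summer), 宝宝 (baby), 消化不良 (indigestion).
text = "夏天宝宝消化不良"
tags = bio_label(text, [(0, 2, "scene"), (2, 4, "scene"), (4, 8, "demand")])
```

Each tag carries both parts required by the text: the structural label (B, I or O) and the dimension.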
FIG. 2 is a flow diagram of an identification method according to one embodiment of the invention. As shown in fig. 2, the identification method according to the embodiment of the present invention includes the following steps:
S201, acquiring the labeled text to be detected.
Specifically, in step S201, the text to be detected is denoted as T = {w1, w2, w3, …, wn}, where wi is the i-th character in the text.
S202, classifying elements in the text to be detected according to BIO labels, and assigning the classified elements to the corresponding information extraction dimensions, wherein the dimensions comprise: demand, scene, scheme, driving factor, hindering factor and question neutral factor.
In one embodiment, the text to be detected is first input into the BERT coding model, whose output is a vector representation of each character that incorporates context semantics; each character is represented as a 768-dimensional vector. The output of the BERT coding layer is V = {v1, v2, v3, …, vn}, where vi is the vector representation of the i-th character after coding.
Then, the BERT-coded feature sequence is input into an LSTM model, which outputs a feature sequence with dimension expression, denoted as H = {h1, h2, h3, …, hn}.
It will be appreciated that the LSTM model aims to address the long-term and short-term dependence of context information. It can encode global context information, which helps in understanding the semantics of the whole sentence, and it can also encode local information. The LSTM model identifies demands, scenes, schemes, driving factors, hindering factors and question neutral factors with high efficiency.
In one embodiment, the feature sequence with dimension expression is input into a Dropout layer and a fully connected layer for generalization processing and distributed feature mapping.
Preferably, the dropout rate of the Dropout layer is 0.5; that is, 50% of the nodes in the layer are randomly selected and their values are set to 0, so that good predictions can still be made when only part of the information is retained.
In one embodiment, the invention inputs the output of the Dropout layer and the fully connected layer into a conditional random field to identify the sequential relationships among the BIO labels.
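The chain of layers described so far can be summarized in formulas. The symbols T, V and H follow the text; the weight matrix W, bias b, per-character scores z_i, transition matrix A, sequence score s and prediction ŷ are assumed notation, and the scoring function is the standard linear-chain conditional random field form, which the text does not spell out.

```latex
\begin{aligned}
T &= \{w_1, w_2, \dots, w_n\} \\
V &= \mathrm{BERT}(T) = \{v_1, v_2, \dots, v_n\}, \qquad v_i \in \mathbb{R}^{768} \\
H &= \mathrm{LSTM}(V) = \{h_1, h_2, \dots, h_n\} \\
z_i &= W \,\mathrm{Dropout}(h_i) + b \\
s(T, y) &= \sum_{i=1}^{n} z_{i,\,y_i} + \sum_{i=2}^{n} A_{y_{i-1},\,y_i} \\
\hat{y} &= \arg\max_{y} \; s(T, y)
\end{aligned}
```

Here y = (y_1, …, y_n) is a candidate sequence of BIO tags, and the transition term A is where the sequential relationship between tags is learned.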
It will be appreciated that a conditional random field can be viewed as a generalization of the maximum entropy Markov model to the labeling problem. Its main value here is to learn the sequential relationships of the tags in the BIO labeling structure; for example, an I tag can only be preceded by a B tag or an I tag.
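The ordering constraint mentioned above can be written down as an explicit rule, as in the sketch below. A trained conditional random field learns such constraints as transition scores rather than applying a hard rule; the function names, the `B-<dimension>` tag format, and the additional requirement that an I tag share the dimension of its predecessor are assumptions.

```python
def is_allowed(prev_tag, tag):
    """Return True if `tag` may follow `prev_tag` in a BIO sequence.

    An I tag may only follow a B or I tag (here: of the same dimension);
    B and O tags may follow anything.
    """
    if not tag.startswith("I-"):
        return True
    dimension = tag[2:]
    return prev_tag in (f"B-{dimension}", f"I-{dimension}")


def is_valid_sequence(tags):
    """Check a whole tag sequence; the position before the first tag counts as O."""
    prev_tag = "O"
    for tag in tags:
        if not is_allowed(prev_tag, tag):
            return False
        prev_tag = tag
    return True
```

For instance, `["B-scheme", "I-scheme", "O"]` passes the check, while `["O", "I-scheme"]` does not, because the I tag has no B or I tag before it.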
In one embodiment, the invention completes the classification of information extraction dimensions by using word segmentation to correct the recognition result of the conditional random field.
Specifically, this solves the problem that the boundaries of certain words are recognized inaccurately. The word segmentation correction step is computed as follows:
Input: the predicted output of the conditional random field {pi}, where i = 1, …, n; and the word segmentation result of the original sentence.
Calculation steps:
for i from 1 to n:
    if the prediction at position i is not the tag O, i.e. pi != O:
        for each character of the segmented word that contains the i-th character:
            if the category label predicted for that character is O:
                set the category label of that character to the category label of pi, so that all characters of the word carry the same category label;
        set the structural label of the first character of the word to B and that of the remaining characters to I.
For example, for a brand name such as "help fit", the conditional random field may recognize only part of the name as a scheme and miss the character "help". The output of the conditional random field is therefore corrected with word segmentation: as long as one character of a segmented word is recognized as a demand, scene, scheme, driving factor, hindering factor or question neutral factor, all characters of that word are assigned to that category. If the characters of one word are recognized as several different categories, no correction is made; the probability of this case is very small (on average it occurs less than once per 1000 words), and it is usually caused by errors in the word segmentation itself.
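The correction step can be sketched in Python as follows. This is an illustrative reading of the steps above, not the patent's implementation: the function name, the `B-<dimension>` tag format and the example sentence 想喝奶茶 ("want to drink milk tea") are assumptions, and the non-initial characters of a corrected word are tagged I here so that the whole word stays inside one element.

```python
def correct_with_segmentation(tags, words):
    """Extend conditional random field predictions to whole segmented words.

    tags  -- one tag per character: "O", "B-<dimension>" or "I-<dimension>"
    words -- segmented words whose concatenation is the original sentence
    """
    if sum(len(word) for word in words) != len(tags):
        raise ValueError("segmentation does not cover the tagged characters")

    corrected = list(tags)
    start = 0
    for word in words:
        end = start + len(word)
        word_tags = tags[start:end]
        categories = {tag[2:] for tag in word_tags if tag != "O"}
        # Correct only words that mix exactly one category with untagged
        # characters; words split across several categories are left untouched.
        if len(categories) == 1 and "O" in word_tags:
            category = next(iter(categories))
            corrected[start] = f"B-{category}"
            for i in range(start + 1, end):
                corrected[i] = f"I-{category}"
        start = end
    return corrected


# The model tagged only the second character of 奶茶 (milk tea) as a scheme.
tags = ["O", "O", "O", "B-scheme"]
words = ["想", "喝", "奶茶"]
fixed = correct_with_segmentation(tags, words)
```

With this input, `fixed` is `["O", "O", "B-scheme", "I-scheme"]`: the missed first character of the word is pulled into the scheme element.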
S203, outputting the dimension-classified elements of the text to be detected according to the classification result.
In one embodiment, after classification prediction is performed according to the BIO structure, the prediction result is converted into 5 structured columns through a post-processing step; that is, the characters of each span that begins with a B tag and continues with I tags are extracted and output to the column of the corresponding category. The output form is shown in the following table:
[Table image BDA0002939358140000111: structured output of the elements by category; not reproduced in the text]
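The post-processing step can be sketched as a small decoder that turns a tagged sentence into per-category columns. The function name `extract_elements`, the dictionary output and the example sentence are assumptions; an I tag that does not continue an open element is treated here as outside any element.

```python
def extract_elements(text, tags):
    """Collect B/I-tagged character runs into columns: dimension -> list of elements."""
    if len(text) != len(tags):
        raise ValueError("text and tags must have the same length")

    columns = {}
    dimension, chars = None, []
    # A trailing ("", "O") sentinel flushes the last open element.
    for char, tag in list(zip(text, tags)) + [("", "O")]:
        if tag.startswith("I-") and tag[2:] == dimension:
            chars.append(char)                   # continue the open element
            continue
        if dimension is not None:
            columns.setdefault(dimension, []).append("".join(chars))
        if tag.startswith("B-"):
            dimension, chars = tag[2:], [char]   # open a new element
        else:
            dimension, chars = None, []          # O tag or stray I tag
    return columns


# Hypothetical example: 夏天 (summer), 宝宝 (baby), 消化不良 (indigestion).
text = "夏天宝宝消化不良"
tags = ["B-scene", "I-scene", "B-scene", "I-scene",
        "B-demand", "I-demand", "I-demand", "I-demand"]
columns = extract_elements(text, tags)
```

Here `columns` maps "scene" to the two elements 夏天 and 宝宝, and "demand" to 消化不良.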
Fig. 3 is a flow chart of an identification method according to another embodiment of the present invention. As shown in Fig. 3, the identification method of the present embodiment includes the following steps:
S301, acquiring the labeled text to be detected.
Specifically, in step S301, the text to be detected is denoted as T = {w1, w2, w3, …, wn}, where wi is the i-th character in the text.
S302, classifying elements in the text to be detected according to BIO labels, and assigning the classified elements to the corresponding information extraction dimensions, wherein the dimensions comprise: demand, scene, scheme, driving factor, hindering factor and question neutral factor.
In one embodiment, the text to be detected is first input into the BERT coding model, whose output is a vector representation of each character that incorporates context semantics; each character is represented as a 768-dimensional vector. The output of the BERT coding layer is V = {v1, v2, v3, …, vn}, where vi is the vector representation of the i-th character after coding.
Then, the BERT-coded feature sequence is input into an LSTM model, which outputs a feature sequence with dimension expression, denoted as H = {h1, h2, h3, …, hn}.
In one embodiment, the feature sequence with dimension expression is input into a Dropout layer and a fully connected layer for generalization processing and distributed feature mapping.
Preferably, the dropout rate of the Dropout layer is 0.5; that is, 50% of the nodes in the layer are randomly selected and their values are set to 0, so that good predictions can still be made when only part of the information is retained.
In one embodiment, the invention inputs the output of the Dropout layer and the fully connected layer into a conditional random field to identify the sequential relationships among the BIO labels.
In one embodiment, the invention completes the classification of information extraction dimensions by using word segmentation to correct the recognition result of the conditional random field.
Specifically, this solves the problem that the boundaries of certain words are recognized inaccurately. The word segmentation correction step is computed as follows:
Input: the predicted output of the conditional random field {pi}, where i = 1, …, n; and the word segmentation result of the original sentence.
Calculation steps:
for i from 1 to n:
    if the prediction at position i is not the tag O, i.e. pi != O:
        for each character of the segmented word that contains the i-th character:
            if the category label predicted for that character is O:
                set the category label of that character to the category label of pi, so that all characters of the word carry the same category label;
        set the structural label of the first character of the word to B and that of the remaining characters to I.
S303, outputting the dimension-classified elements of the text to be detected according to the classification result.
In one embodiment, after classification prediction is performed according to the BIO structure, the prediction result is converted into 5 structured columns through a post-processing step; that is, the characters of each span that begins with a B tag and continues with I tags are extracted and output to the column of the corresponding category. The output form is shown in the following table:
[Table image BDA0002939358140000121: structured output of the elements by category; not reproduced in the text]
S304, identifying the correspondences between the classified dimensions. The correspondences comprise demand-scene, demand-scheme, scheme-driving factor, scheme-hindering factor and scheme-question neutral factor, and the dimension relations between the elements are output according to the correspondences.
For the recognition result of each text, if it contains both the preceding item and the following item of one or more relations, the preceding items and following items are combined exhaustively to form two-element relation pairs of the form "relation preceding item - relation following item". For example:
[Table image BDA0002939358140000131: example relation pairs; not reproduced in the text]
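The construction of candidate relation pairs in step S304 can be sketched as follows. The five relation types are taken from the text; the function name `candidate_pairs`, the tuple layout and the example columns (based on the Background example) are assumptions.

```python
from itertools import product

# The five correspondences named in the text: (preceding item, following item).
RELATIONS = (
    ("demand", "scene"),
    ("demand", "scheme"),
    ("scheme", "driving factor"),
    ("scheme", "hindering factor"),
    ("scheme", "question neutral factor"),
)


def candidate_pairs(columns):
    """Combine every preceding item with every following item of each relation.

    columns -- mapping dimension -> list of recognized elements
    Returns tuples (preceding dimension, preceding element,
                    following dimension, following element).
    """
    pairs = []
    for head_dim, tail_dim in RELATIONS:
        heads = columns.get(head_dim, [])
        tails = columns.get(tail_dim, [])
        for head, tail in product(heads, tails):
            pairs.append((head_dim, head, tail_dim, tail))
    return pairs


# Hypothetical recognition result for the Background example.
columns = {
    "demand": ["indigestion"],
    "scene": ["summer", "baby"],
    "scheme": ["synbiotics"],
    "driving factor": ["gets well quickly"],
}
pairs = candidate_pairs(columns)
```

A relation contributes pairs only when both its preceding and following dimensions are present, so this example yields four candidates and none for the hindering or question neutral factors.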
The original text and each relation pair are then input into a BERT model, which outputs whether the correspondence exists. Unlike the BERT coding model, the internal parameters of this BERT model are continuously adjusted and optimized during learning. This design is equivalent to fine-tuning a pre-trained language model on the task so that it learns task-specific parameters.
And finally, outputting the elements and the dimensions with the corresponding relation by the model.
In another aspect, the present invention further provides a computer device, which includes a memory and a processor, wherein the memory stores a computer program, and the computer program, when executed by the processor, causes the processor to execute the steps of the above method.
In another aspect, the present invention also provides a computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the steps of the above method.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 4, an electronic device of one embodiment of the invention includes one or more input devices 2000, one or more output devices 3000, one or more processors 1000, and memory 4000.
In one embodiment of the invention, the processor 1000, the input device 2000, the output device 3000, and the memory 4000 may be connected by a bus or other means. The input device 2000, the output device 3000 may be a standard wired or wireless communication interface.
The processor 1000 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor.
Memory 4000 may be a high speed RAM memory or a non-volatile memory such as a disk memory. The memory 4000 is used to store a set of computer programs, and the input device 2000, the output device 3000, and the processor 1000 may call the program codes stored in the memory 4000.
The memory 4000 stores a computer program comprising program instructions that, when executed by the processor, cause the processor to perform the steps of the method as described in the above embodiments.
An embodiment of the present invention also provides a computer-readable storage medium. The computer-readable storage medium may be a high-speed RAM memory or a non-volatile memory such as a disk memory. An external computing device may connect to the computer-readable storage medium, directly or over a network, to read the set of computer programs stored in it. The computer program stored in the computer-readable storage medium comprises program instructions which, when executed by a processor, cause the processor to perform the steps of the method as described in the above embodiments.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (11)

1. An information extraction structure for consumer text, characterized in that the information extraction structure comprises the following dimensions:
a demand, expressing a consumer's demand;
a scene, expressing the scene in which the demand occurs;
a solution, expressing a solution to the demand;
a driving factor, expressing a reason for selecting the solution;
a hindering factor, expressing a reason that hinders selection of the solution;
a question-neutral factor, expressing elements the consumer asks about in a purchasing decision.
2. A labeling method for the information structure of consumer text, characterized by comprising the following steps:
acquiring a text to be identified;
extracting information from the text to be identified, and establishing n two-dimensional arrays according to the extracted information, wherein each two-dimensional array comprises an element and its dimension, the association between elements is established through the dimensions, and the dimensions comprise: demand, scene, solution, driving factor, hindering factor, and question-neutral factor;
and labeling the elements in the two-dimensional arrays using a BIO scheme to obtain a BIO labeling result, wherein each labeled element comprises a BIO label and a dimension.
3. An identification method, comprising:
acquiring a labeled text to be detected;
classifying elements in the text to be detected according to their BIO labels, and assigning the classified elements to the corresponding information extraction dimensions, wherein the dimensions comprise: demand, scene, solution, driving factor, hindering factor, and question-neutral factor;
and outputting the dimension-classified elements of the text to be detected according to the classification result.
4. The identification method as claimed in claim 3, characterized in that the classification process further comprises:
inputting the text into a BERT encoding model, and converting the text into an encoded feature sequence, wherein the feature sequence consists of vector representations that incorporate contextual semantics.
5. The identification method as claimed in claim 4, characterized in that the classification process further comprises:
inputting the BERT-encoded feature sequence into an LSTM model, and outputting a feature sequence with dimension representations;
and inputting the feature sequence with dimension representations into a Dropout layer and a fully connected layer for generalization processing and distribution feature mapping.
6. The identification method as claimed in claim 4, characterized in that the classification process further comprises:
inputting the output of the Dropout layer and the fully connected layer into a conditional random field, and identifying the sequential relationships among the BIO labels;
correcting the recognition result of the conditional random field using word segmentation, thereby classifying the information extraction dimensions;
and formatting and outputting the information extraction result of the consumer text according to the BIO labels and the classification result of the information extraction dimensions.
7. An identification method, comprising:
acquiring a labeled text to be detected;
classifying elements in the text to be detected according to their BIO labels, and assigning the classified elements to the corresponding information extraction dimensions, wherein the dimensions comprise: demand, scene, solution, driving factor, hindering factor, and question-neutral factor;
outputting the dimension-classified elements of the text to be detected according to the classification result;
identifying correspondences between the dimension classifications, wherein the correspondences comprise demand-scene, demand-solution, solution-driving factor, solution-hindering factor, and solution-question-neutral factor, and outputting the dimension classification relations between the elements according to the correspondences.
8. The identification method as claimed in claim 7, characterized in that the classification process further comprises:
inputting the text into a BERT encoding model, and converting the text into an encoded feature sequence, wherein the feature sequence consists of vector representations that incorporate contextual semantics.
9. The identification method as claimed in claim 8, characterized in that the classification process further comprises:
inputting the BERT-encoded feature sequence into an LSTM model, and outputting a feature sequence with dimension representations;
and inputting the feature sequence with dimension representations into a Dropout layer and a fully connected layer for generalization processing and distribution feature mapping.
10. The identification method as claimed in claim 9, characterized in that the classification process further comprises:
inputting the output of the Dropout layer and the fully connected layer into a conditional random field, and identifying the sequential relationships among the BIO labels;
and correcting the recognition result of the conditional random field using word segmentation, thereby completing the classification of the information extraction dimensions.
11. The identification method as claimed in claim 10, characterized in that the process of identifying the correspondences further comprises:
inputting the correspondences of the dimension classifications to be identified into a BERT recognition model, the BERT recognition model outputting the correspondences of the identified objects according to its parameter tuning result.
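The BIO labeling of claim 2 and the grouping of labeled elements by dimension in claims 3 and 7 can be illustrated with a short sketch. It is a simplified illustration under stated assumptions: labels are assigned per character, each element is located by its first occurrence in the text, and the function names `bio_label` and `decode_bio` as well as the sample text are hypothetical. The claims do not prescribe these details.

```python
def bio_label(text, elements):
    """Assign one BIO tag per character.

    elements: list of (element_text, dimension) tuples.
    The first character of an element gets 'B-<dimension>', the remaining
    characters get 'I-<dimension>', and everything else gets 'O'.
    """
    tags = ["O"] * len(text)
    for elem, dim in elements:
        start = text.find(elem) if elem else -1
        if start == -1:
            continue  # element not found in the text
        end = start + len(elem)
        if any(tag != "O" for tag in tags[start:end]):
            continue  # skip elements that overlap an already labeled span
        tags[start] = f"B-{dim}"
        for i in range(start + 1, end):
            tags[i] = f"I-{dim}"
    return tags


def decode_bio(text, tags):
    """Recover (element_text, dimension) tuples from per-character BIO tags."""
    elements = []
    start, dim = None, None
    # A trailing 'O' sentinel closes a span that runs to the end of the text.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues_span = (
            start is not None and tag.startswith("I-") and tag[2:] == dim
        )
        if continues_span:
            continue
        if start is not None:
            elements.append((text[start:i], dim))
            start, dim = None, None
        if tag.startswith("B-"):
            start, dim = i, tag[2:]
    return elements


# Hypothetical example.
sample_text = "dry skin in winter"
sample_elements = [("dry skin", "demand"), ("winter", "scene")]
sample_tags = bio_label(sample_text, sample_elements)
print(list(zip(sample_text, sample_tags)))
print(decode_bio(sample_text, sample_tags))
```

The decoder ignores an 'I-' tag that does not continue a span of the same dimension, which is one simple way of respecting the ordering of BIO labels; the conditional random field in the claims learns such ordering constraints rather than hard-coding them.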
CN202110172747.5A 2021-02-08 2021-02-08 Information extraction structure, labeling method and identification method of consumer text Pending CN112906367A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110172747.5A CN112906367A (en) 2021-02-08 2021-02-08 Information extraction structure, labeling method and identification method of consumer text

Publications (1)

Publication Number Publication Date
CN112906367A true CN112906367A (en) 2021-06-04

Family

ID=76123995

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110172747.5A Pending CN112906367A (en) 2021-02-08 2021-02-08 Information extraction structure, labeling method and identification method of consumer text

Country Status (1)

Country Link
CN (1) CN112906367A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114049528A (en) * 2022-01-12 2022-02-15 上海蜜度信息技术有限公司 Method and equipment for identifying brand name

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649223A (en) * 2016-12-23 2017-05-10 北京文因互联科技有限公司 Financial report automatic generation method based on natural language processing
CN108875051A (en) * 2018-06-28 2018-11-23 中译语通科技股份有限公司 Knowledge mapping method for auto constructing and system towards magnanimity non-structured text
CN109472026A (en) * 2018-10-31 2019-03-15 北京国信云服科技有限公司 Accurate emotion information extracting methods a kind of while for multiple name entities
CN109493166A (en) * 2018-10-23 2019-03-19 深圳智能思创科技有限公司 A kind of construction method for e-commerce shopping guide's scene Task conversational system
CN110298042A (en) * 2019-06-26 2019-10-01 四川长虹电器股份有限公司 Based on Bilstm-crf and knowledge mapping video display entity recognition method
CN110598213A (en) * 2019-09-06 2019-12-20 腾讯科技(深圳)有限公司 Keyword extraction method, device, equipment and storage medium
CN111178047A (en) * 2019-12-24 2020-05-19 浙江大学 Ancient medical record prescription extraction method based on hierarchical sequence labeling
CN111552819A (en) * 2020-04-28 2020-08-18 腾讯科技(深圳)有限公司 Entity extraction method and device and readable storage medium
CN111553162A (en) * 2020-04-28 2020-08-18 腾讯科技(深圳)有限公司 Intention identification method and related device
CN111783466A (en) * 2020-07-15 2020-10-16 电子科技大学 Named entity identification method for Chinese medical records
CN112036185A (en) * 2020-11-04 2020-12-04 长沙树根互联技术有限公司 Method and device for constructing named entity recognition model based on industrial enterprise
CN112257417A (en) * 2020-10-29 2021-01-22 重庆紫光华山智安科技有限公司 Multi-task named entity recognition training method, medium and terminal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王蕾等: "再制造服务需求动态获取方法及应用", 《计算机集成制造系统》, vol. 24, no. 03, pages 781 - 792 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination