CN110059176B

CN110059176B - Rule-based general text information extraction and information generation method

Info

Publication number: CN110059176B
Application number: CN201910153119.5A
Authority: CN
Inventors: 骆斌; 卢坚; 伏晓
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2019-02-28
Filing date: 2019-02-28
Publication date: 2021-07-13
Anticipated expiration: 2039-02-28
Also published as: CN110059176A

Abstract

The invention provides a rule-based general method for text information extraction and information generation, comprising the following steps: initializing an information dictionary context, rule word packets, a rule engine and a template engine; labeling the text with information labels; defining information extraction algorithms and writing rule script code; generating a rule dependency directed graph; executing the text extraction rules and fine-tuning them according to the extraction accuracy; defining an information generation meta-template; and selecting custom template rules and generating text. The invention modularizes extraction rules and makes them easier to share, allows the structure of complex text information to be analyzed and mined, and improves the efficiency of extracting information and of generating text from external information. It is particularly suited to fields, such as legal documents, in which information must be extracted from and generated into large volumes of text. The method improves the efficiency and accuracy of text extraction, reduces its complexity, and improves the efficiency of information text generation.

Description

Rule-based general text information extraction and information generation method

Technical Field

The invention relates to the field of software engineering and rule engines, and in particular to a rule-based general method for text information extraction and information generation.

Background

As enterprises have become more digitized, traditional information entry has found better solutions. However, more and more enterprises face the following problem: much of their information originates in semi-structured text, for which good information extraction tools are still lacking. At present such information is entered either manually or through complex extraction logic. Manual entry consumes a great deal of labor and resources and gives poor results. Complex extraction logic achieves higher accuracy but is costly to maintain, and its extraction rules are hard to reuse; the overall process also has a long cycle, which hinders rapid software delivery.

Disclosure of Invention

To solve the above problems, the invention provides a rule-based general method for text information extraction and text generation. The method supports information labeling of texts, lets users write extraction rules efficiently, maximizes the reuse and sharing of extraction rules, can be integrated with third-party data sources, and generates well-formatted text, which in turn facilitates the extraction process and forms a virtuous cycle.

In order to achieve the purpose, the invention provides the following technical scheme:

A rule-based general method for text information extraction and text generation comprises the following steps:

Step one: initializing the information dictionary context, rule word packets, rule engine and template engine

Initializing an information dictionary as a context of information extraction, and performing dynamic and extensive information extraction on an information text; loading the engine types defined in the configuration information, and carrying out loading work of a rule grammar parser, a grammar dependency analyzer and a rule executor; initializing a data access engine attached to the rule engine and supporting a third-party data source; loading the pre-compiled template instruction and the compiled information generation template by loading the template engine configuration information so as to complete the loading work of the whole template engine;

step two: information labeling is carried out on text information

Text information extraction is modeled and analyzed, and the extraction model is divided into single-value information extraction and multi-value information extraction. Single-value information extraction extracts the content of a single region from a piece of text; multi-value information extraction extracts the content of several specified regions from a piece of text. The text information labeling model comprises the range of the text information label, the features of the labeled information, and the information label identifier; for each information label, the text to be extracted can be located in a piece of information text;

step three: defining information extraction algorithms and writing rule script code

The extraction rules are analyzed and modeled; the extraction rule model comprises scalar rules, shared rules, no-dependency calculation rules, dependent calculation rules, and variable context rules. When a user extracts information: if the item currently being extracted does not depend on other rules and has no obvious dependence on the text context, it can be extracted with a scalar rule; if the item is extracted in the same way as items in other, similarly structured texts, the extraction rule can be shared by direct reference or by copying; if the item does not depend on other rules in the current rule context, it can be extracted with a no-dependency calculation rule; if the item depends on other rules in the current rule context, it can be computed with a dependent calculation rule; and if the item has deep structural dependence and its intermediate-state information does not need to be extracted explicitly, it can be extracted with a variable context rule, which does not affect the current rule context;

step four: generating rule dependent directed graphs

The extraction rules written by the user are parsed, the dependency items and export items of each rule are derived, and a rule dependency directed graph is generated;

step five: executing text extraction rules and fine-tuning according to extraction accuracy

The text extraction rules are executed in the rule engine to produce well-structured extracted text; the extracted content is compared with the text labels made at the outset, and the extraction accuracy is computed.

Step six: defining the information generation meta-template

A user can define an information generation meta-template according to the requirements of the scenario; the meta-template comprises a basic information text format and several rule filling areas; to provide a general way of generating information, a custom information rule extension mechanism is provided through which the user can import information from a third-party data source in a form that conforms to the rule format;

step seven: custom template rule selection and text generation

For the same information generation meta-template, a user can select different information rules from a plurality of rule filling areas to generate texts suitable for different sub-scenes; the user can select a format for information text generation.

Further, the first step comprises the following steps:

step 1-1: an initial state;

step 1-2: defining a data structure table for storing an information dictionary, wherein the data structure of the information dictionary is a hierarchical Hash table structure and can support a multi-level information structure;

step 1-3: loading information of the information dictionary according to the hierarchical structure, loading the root node first, and then sequentially loading along the hierarchy until the leaf nodes are completely loaded;

step 1-4: for each information sub item of the information dictionary, acquiring a corresponding information result, accessing a leaf node at first, checking whether the leaf node exists, and if the current leaf node exists, directly returning the information item result; otherwise, searching upwards along the hierarchical structure until a certain information hierarchy contains the information subitem, and then returning an information item result;

step 1-5: loading word packets into the system from a database, wherein a word packet comprises one or more word groups and several optional conditional selection statements, an information extraction rule can include one or more word packets, and word packets can be obtained by writing a rule script;

step 1-6: loading the condition judgment functions associated with the word packets for use at run time; several condition judgment functions are predefined for word packets, which can determine whether one or more given words are present in a word packet and whether a given sentence contains words from the word packet, and the user can also add custom condition functions for particular word packets;

step 1-7: initializing the rule engine by loading the rule engine configuration information: selecting a rule engine grammar set, loading the grammar parser, loading the optional grammar context dependency analyzer for that parser, and finally loading the rule executor, which completes the loading of the whole rule engine;

step 1-8: initializing the template engine by loading the template engine configuration information: selecting a template engine type, loading the template engine instruction set, and loading the system-defined information generation templates, which completes the loading of the whole template engine;

step 1-9: importing the information dictionary context, the rule word packets and the rule engine into the rule system, and finally integrating the template engine with the rule system to complete the initialization.
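
By way of illustration only, the hierarchical hash table of step 1-2 and the upward lookup of step 1-4 might be sketched in Python as follows. The class name `DictLevel`, its fields, and the sample legal-domain entries are assumptions introduced for this sketch; the method itself does not prescribe a language or a data layout beyond a multi-level hash table.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DictLevel:
    """One level of the information dictionary: a hash table plus a parent link."""
    name: str
    items: dict[str, str] = field(default_factory=dict)
    parent: Optional["DictLevel"] = None

    def lookup(self, key: str) -> Optional[str]:
        """Check this level first, then walk upwards until some level holds the key."""
        level: Optional[DictLevel] = self
        while level is not None:
            if key in level.items:
                return level.items[key]
            level = level.parent
        return None


# Root is created first, then deeper levels (step 1-3). All entries are made-up samples.
root = DictLevel("legal", {"court": "People's Court", "currency": "RMB"})
civil = DictLevel("civil", {"party_roles": "plaintiff, defendant"}, parent=root)
loan = DictLevel("loan_dispute", {"currency": "CNY"}, parent=civil)

print(loan.lookup("currency"))  # found at the leaf level
print(loan.lookup("court"))     # not at the leaf, resolved at the root
```

In this sketch a leaf entry shadows an entry of the same name higher up, which is one plausible reading of "accessing a leaf node at first".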

Further, the second step includes the following steps:

step 2-1: an initial state;

step 2-2: a user selects a text needing information extraction or introduces the text to be extracted into a system;

step 2-3: determining an extraction area by a user through self-defined division of the text;

step 2-4: the user specifies the type of the extracted information, single-value or multi-value;

step 2-5: if the user selects the single-value type, text labeling is carried out on the specified text;

step 2-6: if the user selects the multi-value type, the user needs to determine the number of the text extraction areas, and then the specified text is selected and labeled;

step 2-7: the user names the information label, and then the system gives a unique information label identifier;

step 2-8: and finishing labeling the text information.
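
The labeling model of step two (label range, single-value or multi-value type, user-chosen name and system-assigned identifier) might be represented as in the following sketch. The names `Annotation`, `span_of` and the counter-based identifier are assumptions made for illustration; the sample sentence is invented.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Tuple

_next_id = count(1)  # hypothetical source of unique label identifiers (step 2-7)


def span_of(document: str, phrase: str) -> Tuple[int, int]:
    """Return the (start, end) character range of the first occurrence of phrase."""
    start = document.index(phrase)
    return start, start + len(phrase)


@dataclass
class Annotation:
    name: str                     # label name chosen by the user
    kind: str                     # "single" or "multi" (step 2-4)
    spans: List[Tuple[int, int]]  # labeled regions of the text
    label_id: int = field(default_factory=lambda: next(_next_id))

    def __post_init__(self) -> None:
        if self.kind not in ("single", "multi"):
            raise ValueError("kind must be 'single' or 'multi'")
        if self.kind == "single" and len(self.spans) != 1:
            raise ValueError("a single-value label marks exactly one region")
        if not self.spans:
            raise ValueError("a label needs at least one region")

    def texts(self, document: str) -> List[str]:
        """Return the labeled text for every region of this label."""
        return [document[start:end] for start, end in self.spans]


doc = "Plaintiff: Zhang San. Defendants: Li Si, Wang Wu."
plaintiff = Annotation("plaintiff", "single", [span_of(doc, "Zhang San")])
defendants = Annotation(
    "defendants", "multi", [span_of(doc, "Li Si"), span_of(doc, "Wang Wu")]
)
```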

Further, the third step includes the following steps:

step 3-1: an initial state;

step 3-2: the user selects an information label from step two and writes the specific information extraction rule for it;

step 3-3: during the actual extraction process, a user uses several types of extraction algorithms predefined in the rule engine, and if the extraction result of the algorithm is satisfactory, specific rule writing is not needed;

step 3-4: otherwise, the user needs to write the custom rule: the user needs to summarize the characteristics from the text to be extracted;

step 3-5: performing lexical analysis on the extraction rule written by the user through the rule grammar parser, and checking whether the variables the user defined when writing the rule conform to the specification and whether the rules it depends on exist in the rule context;

step 3-6: according to the lexical analysis sequence generated in the step 3-5, further performing syntax analysis through a rule syntax analyzer, analyzing functions and program structures defined in the rules written by the user, and performing error reminding on the repeatedly defined functions and the incorrect program structures;

step 3-7: exporting, through a predefined function, the text information extracted by the rule script written by the user, wherein the export items are available to other rules in the current rule context so as to facilitate the extraction of structured text information;

step 3-8: the user can do the following for the written rule in the rule context list: checking the extraction content, applying the extraction rule and analyzing the dependence of the extraction rule;

step 3-9: finishing defining the information extraction algorithm and writing the rule script code.

Further, in step 3-4, the step in which the user summarizes features from the text to be extracted specifically includes the following: the user can generalize features from context-free patterns such as keywords, specified phrases or regular expressions, and can also generalize features from context-dependent patterns that carry specific semantics.
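
A minimal sketch of context-free feature induction with keywords and regular expressions is given below. The function names, the punctuation set used to end a phrase, and the "yuan" amount pattern are assumptions chosen for this example and are not part of the rule engine's predefined algorithms.

```python
import re
from typing import List, Optional


def extract_after_keyword(text: str, keyword: str) -> Optional[str]:
    """Return the phrase that follows keyword, up to the next punctuation mark."""
    pattern = re.escape(keyword) + r"\s*[:：]?\s*([^,.;，。；\n]+)"
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None


def extract_amounts(text: str) -> List[str]:
    """Return every monetary amount written as digits followed by 'yuan'."""
    return re.findall(r"\d+(?:,\d{3})*(?:\.\d+)?\s*yuan", text)


sample = (
    "Plaintiff: Zhang San. "
    "The defendant borrowed 50,000 yuan and repaid 12,500.50 yuan."
)
print(extract_after_keyword(sample, "Plaintiff"))  # single-value extraction
print(extract_amounts(sample))                     # multi-value extraction
```

Context-dependent features that carry specific semantics would need more than a single pattern, for example a word packet check on the surrounding sentence before the pattern is applied.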

Further, the fourth step includes the following steps:

step 4-1: an initial state;

step 4-2: the user can selectively perform rule-dependent analysis, if the user selects to perform dependent analysis, the step 4-3 is performed, otherwise, the step 4-7 is performed;

step 4-3: after a user selects rule dependence analysis, the system constructs an abstract syntax tree for the rule through a rule engine;

step 4-4: the dependency analyzer analyzes the dependency variables, local variables and rule export items in the abstract syntax tree, and completes the rule dependency analysis by performing a depth-first search on the abstract syntax tree;

step 4-5: the dependency analyzer displays the generated dependency directed graph to the user, and the user selects a rule dependency item or rule export item of interest to view the in-degree and out-degree relations of that item and understand the dependency context of the current rule;

step 4-6: the user can enter the rule adjustment stage directly by selecting a rule item in the directed graph;

step 4-7: finishing the generation of the rule dependency directed graph.
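
The analysis of steps 4-3 and 4-4 can be illustrated with Python's standard `ast` module. This sketch assumes, purely for illustration, that rule scripts use Python syntax and that the predefined export function is called `export(name, value)`; the actual rule grammar is not specified here, and `PREDEFINED`, `RuleAnalyzer` and the two sample rules are hypothetical.

```python
import ast
import builtins
from typing import Dict, Set

# Names assumed to be supplied by the engine rather than by other rules.
PREDEFINED = {"export", "text"} | set(dir(builtins))


class RuleAnalyzer(ast.NodeVisitor):
    """Depth-first walk collecting local variables, read variables and export items."""

    def __init__(self) -> None:
        self.assigned: Set[str] = set()
        self.read: Set[str] = set()
        self.exports: Set[str] = set()

    def visit_Name(self, node: ast.Name) -> None:
        if isinstance(node.ctx, ast.Store):
            self.assigned.add(node.id)
        else:
            self.read.add(node.id)

    def visit_Call(self, node: ast.Call) -> None:
        if isinstance(node.func, ast.Name) and node.func.id == "export" and node.args:
            first = node.args[0]
            if isinstance(first, ast.Constant) and isinstance(first.value, str):
                self.exports.add(first.value)
        self.generic_visit(node)  # keep descending into the call's children


def analyze_rule(source: str) -> Dict[str, Set[str]]:
    analyzer = RuleAnalyzer()
    analyzer.visit(ast.parse(source))
    depends = analyzer.read - analyzer.assigned - PREDEFINED
    return {"depends": depends, "locals": analyzer.assigned, "exports": analyzer.exports}


def build_graph(rules: Dict[str, str]) -> Dict[str, Set[str]]:
    """Edge a -> b means rule a reads an item exported by rule b."""
    info = {name: analyze_rule(src) for name, src in rules.items()}
    provider = {item: name for name, i in info.items() for item in i["exports"]}
    return {
        name: {provider[d] for d in i["depends"] if d in provider}
        for name, i in info.items()
    }


RULES = {
    "principal": 'amount = text.split()[0]\nexport("principal", amount)',
    "interest": 'value = int(principal) * rate\nexport("interest", value)',
}
print(build_graph(RULES))
```

Here `rate` is read by the `interest` rule but exported by no rule, so it produces no edge; in the terms of the method it would come from the information dictionary context.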

Further, in the fifth step, if the text extraction accuracy does not reach the target, the extraction rules with insufficient accuracy are adjusted repeatedly until the extraction accuracy reaches a specified threshold.
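
The comparison between labels and extraction results is not given as a formula. One simple possibility, assumed here only for illustration, is the fraction of labeled values that appear among the values exported for the same label; the function names and the 0.9 threshold are likewise hypothetical.

```python
from typing import Dict, List, Tuple


def extraction_accuracy(
    labelled: Dict[str, List[str]], extracted: Dict[str, List[str]]
) -> Tuple[float, List[Tuple[str, str]]]:
    """Return (hit ratio, missed (label, value) pairs) for one document."""
    total = 0
    hits = 0
    missed: List[Tuple[str, str]] = []
    for label, expected in labelled.items():
        got = set(extracted.get(label, []))
        for value in expected:
            total += 1
            if value in got:
                hits += 1
            else:
                missed.append((label, value))
    accuracy = hits / total if total else 1.0
    return accuracy, missed


def needs_tuning(accuracy: float, threshold: float = 0.9) -> bool:
    """True when the rules should be adjusted and run again."""
    return accuracy < threshold


labels = {"plaintiff": ["Zhang San"], "defendants": ["Li Si", "Wang Wu"]}
results = {"plaintiff": ["Zhang San"], "defendants": ["Li Si"]}
accuracy, missed = extraction_accuracy(labels, results)
print(accuracy, missed, needs_tuning(accuracy))
```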

Further, the fifth step includes the following steps:

step 5-1: an initial state;

step 5-2: the user either executes a single rule, or executes all rules once all extraction rules have been written;

step 5-3: when the user executes a rule, the system takes the rule content produced in step three (defining the information extraction algorithm and writing the rule script code), passes the rule content that is free of syntax errors to the rule executor, and runs it in the execution mode that corresponds to the configured execution engine;

step 5-4: the rule executor first places the information dictionary and word packets that the rules depend on into the rule execution context, and then executes the pending rules in sequence; for a rule being executed, if the set of rules it depends on contains rules that have not yet been executed, those rules are executed first, and once all of its dependencies have been executed the executor backtracks to the rule whose execution was pending and completes it;

step 5-5: once a rule has been executed, the system compares the text information label corresponding to the rule with the rule's export item, computes the extraction accuracy, and reports the document labels that were not hit;

step 5-6: if the extraction accuracy meets the requirement, continuing to step 5-7; otherwise, adjusting the rule content and returning to step 5-2;

step 5-7: finishing executing the text extraction rules and fine-tuning according to the extraction accuracy.
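
The dependency-first execution order of step 5-4 can be sketched as a recursive traversal with memoized results. The function `execute_rules`, the representation of rules as Python callables, and the circular-dependency check are assumptions added for this example.

```python
from typing import Callable, Dict, List, Set, Tuple

Rule = Callable[[Dict[str, object]], object]


def execute_rules(
    rules: Dict[str, Rule],
    depends_on: Dict[str, List[str]],
    context: Dict[str, object],
) -> Tuple[Dict[str, object], List[str]]:
    """Run every rule once, executing unexecuted dependencies before the rule itself."""
    results: Dict[str, object] = {}
    in_progress: Set[str] = set()
    order: List[str] = []

    def run(name: str) -> None:
        if name in results:
            return  # already executed
        if name in in_progress:
            raise ValueError(f"circular dependency at rule {name!r}")
        in_progress.add(name)
        for dep in depends_on.get(name, []):
            run(dep)  # execute dependencies first, then backtrack to this rule
        in_progress.discard(name)
        results[name] = rules[name]({**context, **results})
        order.append(name)

    for name in rules:
        run(name)
    return results, order


sample_rules: Dict[str, Rule] = {
    "interest": lambda ctx: ctx["principal"] * ctx["rate_percent"] // 100,
    "principal": lambda ctx: 50000,
}
sample_deps = {"interest": ["principal"]}
sample_context = {"rate_percent": 5}  # stands in for the information dictionary

outputs, run_order = execute_rules(sample_rules, sample_deps, sample_context)
print(run_order, outputs)
```

Although `interest` is listed first, `principal` runs before it because the executor resolves the dependency and then returns to the pending rule.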

Further, the sixth step includes the following steps:

step 6-1: an initial state;

step 6-2: a user creates an information generation meta-template with a name;

step 6-3: the user adds basic text information blocks, fixed dependency rule items and placeholders to the meta-template; a basic text information block is arbitrary text; a fixed dependency rule item is a rule item for a particular type of text extraction; a placeholder can be replaced, when text is later generated from the template, by text information or by rule items from the rule writing context;

step 6-4: the user stores the information generation meta-template in a database;

step 6-5: and finishing defining the information generation meta-template.

Further, the seventh step includes the steps of:

step 7-1: an initial state;

step 7-2: a user selects an existing information generation meta template to generate a text;

step 7-3: the user chooses to generate a temporary text or to create a new custom template;

step 7-4: the user replaces the placeholders in the information generation meta-template, wherein a placeholder may be replaced by ordinary text, by an information item in the rule writing context, or by a rule item in the rule context;

step 7-5: after filling the placeholders in the template, the user selects a format for the generated text and then downloads the generated text;

step 7-6: finishing the selection of custom template rules and the generation of the text.
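
Placeholder replacement in a meta-template (steps 6-3 and 7-4) can be illustrated with Python's standard `string.Template`. The template text, the `${...}` placeholder syntax and the helper names below are assumptions for this sketch; the template engine's own instruction set is not described here.

```python
from string import Template
from typing import Dict, List

# A basic text information block with four placeholders (invented sample).
META = Template(
    "Plaintiff ${plaintiff} requests repayment of ${principal} yuan "
    "from ${defendant}. ${closing}"
)


def placeholders(tpl: Template) -> List[str]:
    """List placeholder names in order of first appearance."""
    names: List[str] = []
    for match in tpl.pattern.finditer(tpl.template):
        name = match.group("named") or match.group("braced")
        if name and name not in names:
            names.append(name)
    return names


def generate(tpl: Template, rule_items: Dict[str, str], manual_text: Dict[str, str]) -> str:
    """Fill placeholders from rule export items first, then from user-typed text."""
    values = {**rule_items, **manual_text}
    missing = [name for name in placeholders(tpl) if name not in values]
    if missing:
        raise KeyError(f"unfilled placeholders: {missing}")
    return tpl.substitute(values)


rule_items = {"plaintiff": "Zhang San", "principal": "50,000", "defendant": "Li Si"}
manual_text = {"closing": "The claim is supported."}
print(generate(META, rule_items, manual_text))
```

Selecting a different set of rule items for the same meta-template yields text for a different sub-scenario, which is the reuse that step seven describes.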

Compared with the prior art, the invention has the following advantages and beneficial effects:

(1) The invention extends traditional text information extraction so that the whole extraction process is more effective and users can more easily discover information in text. Because extraction is rule-based, expressing the extracted content as rules makes it easier to reuse and share, and the supporting information dictionary and word packets allow the extraction process to be extended dynamically without repeated changes to the extraction logic. Once the extraction logic has been refined into rules, the rules are parsed and the dependencies among them are displayed visually, so that the user can see clearly how the extracted information in the current text depends on other information; this provides a basis for optimizing the extraction logic and making the extraction process more efficient. The information generation meta-template makes it easier to combine extraction rules with an enterprise's third-party data sources, so that text information generation becomes simple and efficient.

(2) Compared with traditional information extraction, the invention modularizes extraction rules and makes them easier to share; rule dependency analysis allows the structure of complex text information to be analyzed and mined; and custom information generation templates improve the efficiency of extracting information and of generating text from external information. The invention is particularly suited to fields, such as legal documents, in which information must be extracted from and generated into large volumes of text. In practice, the method has been found to improve the efficiency and accuracy of text extraction, reduce its complexity, and improve the efficiency of information text generation.

Drawings

Fig. 1 is a flowchart of a method for extracting and generating general text information based on rules according to the present invention.

Fig. 2 is a schematic structural diagram of the present invention.

Fig. 3 is a schematic diagram of the present invention in operation.

FIG. 4 is a flow chart of a rule dependent analysis algorithm of the present invention.

FIG. 5 is a schematic illustration of the present invention.

FIG. 6 is a diagram of the actual extraction rule writing of the present invention.

FIG. 7 is a graph showing the results of the dependency analysis of the present invention.

FIG. 8 is a diagram illustrating meta-template editing and text generation according to the present invention.

Fig. 9 is a diagram of the message text generation result of the present invention.

Detailed Description

The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention.

FIG. 1 is a flow chart of the present invention, which generally includes an initialization stage, an information labeling stage, an extraction rule writing stage, and a template editing and text generation stage.

In the initialization stage, the pre-configured information dictionary and word packets are loaded into the system from storage such as a database; the rule parser, dependency analyzer and rule executor of the rule engine are connected to complete the initialization of the rule engine; and the meta-templates and predefined template instructions are connected to the template engine to complete its initialization.

In the information labeling stage, once initialization is complete, the user selects a text from which information is to be extracted and labels it for extraction. At this stage the text can be labeled in a structure that matches the rules to be written later, which helps to improve the subsequent extraction accuracy.

In the extraction rule writing stage, the extraction rules must be analyzed and modeled before text information is actually extracted. The user writes the extraction rule script according to the features of the information to be extracted. By their characteristics, extraction rules fall into the following categories:

a. Scalar rule: if the item currently being extracted does not depend on other rules and does not depend on the text context, the rule is a scalar rule.

b. Shared rule: if the item currently being extracted is structured like items in other texts, the extraction rule can be shared by direct reference or by copying; such a rule is called a shared rule.

c. No-dependency calculation rule: if the item currently being extracted does not depend on other rules in the current rule context, the rule is a no-dependency calculation rule.

d. Dependent calculation rule: if the item currently being extracted depends on other rules in the current rule context, the rule is a dependent calculation rule.

e. Variable context rule: if the item currently being extracted has deep structural dependence and its intermediate-state information does not need to be extracted explicitly, the information can be extracted without affecting the current rule context; such a rule is called a variable context rule.

In the template editing and text generation stage, the user can derive a new template from a meta-template by adding template information or selecting extraction rules; third-party data can also be brought into text generation through a third-party data source adapter.

More specifically, as shown in fig. 1, the method for extracting and generating general text information based on rules provided by the present invention includes the following steps:

Step one: initializing the information dictionary context, rule word packets, rule engine and template engine

On the one hand, information texts in each field contain terms that occur frequently in that field, and these form an information dictionary; initializing the information dictionary as the context of information extraction allows dynamic and extensive information extraction from the text. On the other hand, it is inherent in information texts that different authors use different words of similar meaning for the same concept; initializing the rule word packets allows the accuracy of information extraction to be improved step by step. For the rule engine, the rule parser, the grammar dependency analyzer and the rule executor are loaded according to the engine types defined in the configuration information; in addition, the data access engine attached to the rule engine, which supports third-party data sources, must also be initialized. For information generation, the predefined, precompiled template instructions and the written information generation templates are loaded according to the template engine configuration information, which completes the loading of the whole template engine.

The method comprises the following substeps:

step 1-1: an initial state;

step 1-3: loading the information of the information dictionary according to the hierarchical structure, loading the root node first and then loading level by level until all leaf nodes have been loaded;

step 1-5: loading word packets into the system from a database, wherein a word packet comprises one or more word groups and several optional conditional selection statements, an information extraction rule can include one or more word packets, and word packets can be obtained by writing a rule script;

step 1-6: loading the condition judgment functions associated with the word packets for use at run time. Several condition judgment functions are predefined for word packets, which can determine whether one or more given words are present in a word packet and whether a given sentence contains words from the word packet; the user can also add custom condition functions for particular word packets;

step 1-7: initializing the rule engine by loading the rule engine configuration information: selecting a rule engine grammar set, loading the grammar parser, loading the optional grammar context dependency analyzer for that parser, and finally loading the rule executor, which completes the loading of the whole rule engine;

step 1-8: initializing the template engine by loading the template engine configuration information: selecting a template engine type, loading the template engine instruction set, and loading the system-defined information generation templates, which completes the loading of the whole template engine;
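
As an illustration of steps 1-5 and 1-6, a word packet with its two predefined condition functions and a user extension point might look like the sketch below. The class `WordPacket`, the method names and the `appears_twice` condition are hypothetical, and plain substring matching is used only to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet


@dataclass
class WordPacket:
    """A named group of near-synonymous words plus its condition functions."""
    name: str
    words: FrozenSet[str]
    conditions: Dict[str, Callable[["WordPacket", str], bool]] = field(default_factory=dict)

    def has_word(self, word: str) -> bool:
        """Predefined condition: is the word a member of this packet?"""
        return word in self.words

    def matches_sentence(self, sentence: str) -> bool:
        """Predefined condition: does the sentence contain any word of the packet?"""
        return any(word in sentence for word in self.words)

    def register(self, name: str, fn: Callable[["WordPacket", str], bool]) -> None:
        """User extension: attach a custom condition function to this packet."""
        self.conditions[name] = fn

    def check(self, name: str, sentence: str) -> bool:
        return self.conditions[name](self, sentence)


borrow = WordPacket("borrow", frozenset({"borrowed", "lent", "loaned"}))
borrow.register(
    "appears_twice",
    lambda packet, s: sum(s.count(w) for w in packet.words) >= 2,
)

print(borrow.matches_sentence("The defendant borrowed 50,000 yuan"))
print(borrow.check("appears_twice", "He borrowed money and later lent it"))
```

Substring matching would also fire on a word such as "excellent"; a real implementation would more likely compare tokens.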

Step two: information labeling is carried out on text information

The step is to perform modeling analysis on the text information extraction, and the text information extraction model is divided into single-value information extraction and multi-value information extraction by taking the purpose of the text information extraction as guidance. The single value information extraction is to extract the text of the content of a single area from a section of text; and the multi-value information extraction means extracting information specifying a plurality of areas from a piece of text. The information labeling model comprises the following contents: the range of the text information mark, the mark information characteristic and the information mark identifier, and for each information mark, the expected extracted text can be found from a piece of information text.

The method comprises the following substeps:

step 2-1: an initial state;

step 2-5: if the user selects the single-value type, text labeling can be carried out on the specified text;

step 2-8: and finishing labeling the text information.

Step three: defining information extraction algorithms and writing rule script code

When text information is actually extracted, the extraction rules must be analyzed and modeled. To support general text information extraction, the extraction rule model comprises scalar rules, shared rules, no-dependency calculation rules, dependent calculation rules, and variable context rules. These rule types are explained in detail in the description of the extraction rule writing stage above.

The method comprises the following substeps:

step 3-1: an initial state;

step 3-3: during the actual extraction process, a user can use several types of extraction algorithms predefined in the rule engine, and if the extraction result of the algorithm is satisfactory, the user does not need to compile specific rules;

step 3-4: otherwise, the user needs to write a custom rule. The user must generalize features from the text to be extracted, either from context-free patterns such as keywords, specified phrases or regular expressions, or from context-dependent patterns that carry specific semantics;

step 3-5: for a general rule script, the extraction rule written by the user is first lexically analyzed by the rule grammar parser, mainly to check whether the variables the user defined when writing the rule conform to the specification and whether the rules it depends on exist in the rule context;

step 3-6: according to the lexical analysis sequence generated in the step 3-5, further performing syntax analysis through a rule syntax analyzer, mainly analyzing functions and program structures defined in rules written by a user at this stage, and performing error reminding on repeatedly defined functions and incorrect program structures;

step 3-7: the rule script written by the user can export the extracted text information through a predefined function, and the export items can be used by other rules in the current rule context, which facilitates the extraction of structured text information;

step 3-8: the user may do the following for the written rule in the rule context list: checking the extraction content, applying the extraction rule and analyzing the dependence of the extraction rule;

Step four: generating rule dependent directed graphs

By parsing the extraction rules written by the user and deriving each rule's dependency items and export items, a rule dependency directed graph can be generated; the graph helps the user optimize the current extraction rules and understand the structure of the current text.

The method comprises the following substeps:

step 4-1: an initial state;

step 4-3: after the user selects rule dependency analysis, the system constructs an abstract syntax tree for the rule through the rule engine;

step 4-4: the dependency analyzer analyzes the dependency variables, local variables and rule export items in the abstract syntax tree, and completes the rule dependency analysis by performing a depth-first search on the abstract syntax tree;

step 4-5: the dependency analyzer displays the generated dependency directed graph to the user, and the user can select a rule dependency item or rule export item of interest to view the in-degree and out-degree relations of that item and to understand the dependency context of the current rule;

step 4-6: the user can also enter the rule adjustment stage directly by selecting a rule item in the directed graph.
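
As a rough sketch of the graph described in this step, the following Python code builds a directed graph from each rule's dependency items and export items and computes in-degrees and out-degrees. The input layout, the rule names and the edge direction (from the rule that exports an item to the rule that depends on it) are assumptions made for illustration, not details given in the source.

```python
from collections import defaultdict


def build_dependency_graph(rules):
    """rules maps a rule name to (dependency items, export items)."""
    # Record which rule exports each item.
    producer = {}
    for name, (_deps, exports) in rules.items():
        for item in exports:
            producer[item] = name

    # Draw an edge from the exporting rule to every rule that uses the item.
    edges = defaultdict(set)
    for name, (deps, _exports) in rules.items():
        for item in deps:
            source = producer.get(item)
            if source is not None and source != name:
                edges[source].add(name)
    return dict(edges)


def degrees(rules, edges):
    """Return (in_degree, out_degree) dictionaries keyed by rule name."""
    in_degree = {name: 0 for name in rules}
    out_degree = {name: len(edges.get(name, ())) for name in rules}
    for targets in edges.values():
        for target in targets:
            in_degree[target] += 1
    return in_degree, out_degree


# Hypothetical rules for a legal document.
rules = {
    "parties": (set(), {"plaintiff", "defendant"}),
    "claim": ({"plaintiff"}, {"claim_text"}),
    "verdict": ({"claim_text", "defendant"}, {"verdict_text"}),
}
edges = build_dependency_graph(rules)
in_degree, out_degree = degrees(rules, edges)
print(edges, in_degree, out_degree)
```

A rule with a high out-degree is one that many other rules rely on, so adjusting it affects more of the extraction.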

Step five: executing the text extraction rules and fine-tuning according to the extraction accuracy

By putting the text extraction rules into the rule engine for execution, well-structured extracted text can be generated. The extracted information can then be compared with the text annotation information created at the beginning to produce an extraction accuracy. If the extraction accuracy does not reach the target, the insufficiently accurate extraction rules can be adjusted repeatedly until the extraction accuracy reaches the specified threshold.

The method comprises the following substeps:

step 5-1: an initial state;

step 5-2: the user can execute a single rule, or execute all rules after all the extraction rules have been written;

step 5-3: when the user executes rules, the system, according to the information extraction algorithm defined and the rule script code written in step three, puts the rule content that has no syntax errors into the rule executor for execution; the specific execution mode differs depending on the configured execution engine;

step 5-4: the rule executor first puts the information dictionary and word packets on which the rules depend into the rule execution context, and then executes the pending rules in sequence; for a given rule, if the set of rules it depends on contains unexecuted rules, those rules are executed first, and once all dependencies of the current rule have been executed, the executor backtracks to the current rule and completes its execution;

step 5-6: if the extraction accuracy meets the requirement, go to step 6; otherwise, adjust the rule content and continue from step 1;

step 5-7: the execution of the text extraction rules and the fine-tuning according to the extraction accuracy are finished.
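
The dependency-first execution of step 5-4 and the accuracy comparison can be sketched in Python as follows. This is a minimal sketch under stated assumptions: rules are modeled as plain callables, `depends_on` is a hypothetical mapping from a rule to the rules it depends on, and accuracy is taken to be the fraction of annotated items that the extraction reproduces exactly, which the source does not specify.

```python
def execute_rules(rules, depends_on, context):
    """Run every rule once, executing unexecuted dependencies first."""
    done = set()
    in_progress = set()
    order = []

    def run(name):
        if name in done:
            return
        if name in in_progress:
            raise ValueError(f"circular rule dependency at {name!r}")
        in_progress.add(name)
        # Execute the dependencies first, then backtrack to this rule.
        for dependency in depends_on.get(name, ()):
            run(dependency)
        rules[name](context)
        in_progress.discard(name)
        done.add(name)
        order.append(name)

    for name in rules:
        run(name)
    return order


def accuracy(extracted, annotated):
    """Fraction of annotated items reproduced exactly (assumed metric)."""
    if not annotated:
        return 0.0
    hits = sum(1 for key, value in annotated.items() if extracted.get(key) == value)
    return hits / len(annotated)


# Hypothetical rules; "summary" is listed first but depends on the other two.
rules = {
    "summary": lambda ctx: ctx.update(summary=ctx["court"] + ", " + ctx["judge"]),
    "court": lambda ctx: ctx.update(court="District Court"),
    "judge": lambda ctx: ctx.update(judge="Judge A"),
}
depends_on = {"summary": ["court", "judge"]}
context = {}  # would initially hold the information dictionary and word packets
order = execute_rules(rules, depends_on, context)
score = accuracy(context, {"court": "District Court", "judge": "Judge B"})
print(order, score)
```

In a tuning loop, a score below the chosen threshold would send the user back to adjust the rules whose items did not match.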

Step six: defining the information generation meta-template

The user can define an information generation meta-template according to the requirements of the scenario. The meta-template mainly consists of a basic information text format and a number of rule filling areas. To provide a general way of generating information, the method supports extended custom information rules, through which a user can import information from a third-party data source in a form that conforms to the rule format.

The method comprises the following substeps:

step 6-1: an initial state;

step 6-2: a user can create an information generation meta-template with a name;

step 6-3: a user can add basic text information blocks, fixed dependency rule items and placeholders to the meta-template;

a basic text information block can be any text information;

a fixed dependency rule item can be a rule item for some type of text extraction;

a placeholder can be replaced, when text is later generated from the template, with text information or with an information item or rule item from the rule writing context;

step 6-5: the definition of the information generation meta-template is finished.
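
One possible in-memory representation of such a meta-template is sketched below in Python. The class names `TextBlock`, `RuleItem`, `Placeholder` and `MetaTemplate`, as well as the example template, are hypothetical; the source only names the three kinds of content a meta-template can hold.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class TextBlock:
    text: str  # a basic text information block: any text


@dataclass
class RuleItem:
    rule_name: str  # a fixed dependency on a text extraction rule item


@dataclass
class Placeholder:
    name: str  # filled in later, when text is generated


Block = Union[TextBlock, RuleItem, Placeholder]


@dataclass
class MetaTemplate:
    name: str
    blocks: List[Block] = field(default_factory=list)

    def add(self, block: Block) -> "MetaTemplate":
        self.blocks.append(block)
        return self

    def placeholders(self) -> List[str]:
        # Names that still have to be replaced at generation time.
        return [b.name for b in self.blocks if isinstance(b, Placeholder)]


template = (
    MetaTemplate("hearing_notice")
    .add(TextBlock("Case "))
    .add(RuleItem("case_number"))
    .add(TextBlock(" will be heard at "))
    .add(Placeholder("venue"))
)
print(template.placeholders())
```

Keeping fixed rule items and placeholders as separate block types mirrors the distinction drawn in step 6-3: the former are bound when the template is defined, the latter only when text is generated.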

Step seven: custom template rule selection and text generation

For the same information generation meta-template, a user can select different information rules in the rule filling areas to generate texts suited to different sub-scenarios. The user can also select the final output format of the generated information text.

The method comprises the following substeps:

step 7-1: an initial state;

step 7-2: the user can select the existing information generation meta template to generate the text;

step 7-3: the user can choose to generate a temporary text or to generate a new custom template;

step 7-4: the user first needs to replace the placeholders in the information generation meta-template; a placeholder can be replaced with plain text information, an information item in the rule writing context, or a rule item in the rule context;

step 7-5: after filling the placeholders in the template, the user can select an output format for text generation, including TXT, DOC, DOCX, PDF and the like, and then download the generated text.
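
The placeholder replacement of step 7-4 can be sketched in Python as follows. The `{{name}}` placeholder syntax, the `fill_template` function and the binding format are assumptions made for illustration; the source does not define a concrete syntax. Conversion to DOC, DOCX or PDF is not shown.

```python
import re

# Hypothetical placeholder syntax: {{name}}
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def fill_template(template, bindings, info_items, rule_items):
    """Replace each placeholder according to its binding.

    bindings maps a placeholder name to (kind, value), where kind is
    "text" (literal text), "info" (key of an information item) or
    "rule" (key of a rule item).
    """

    def resolve(match):
        name = match.group(1)
        kind, value = bindings[name]
        if kind == "text":
            return value
        if kind == "info":
            return info_items[value]
        if kind == "rule":
            return rule_items[value]
        raise ValueError(f"unknown binding kind {kind!r} for {name!r}")

    return PLACEHOLDER.sub(resolve, template)


template = "Notice to {{recipient}}: case {{case_no}} will be heard at {{venue}}."
bindings = {
    "recipient": ("text", "the defendant"),
    "case_no": ("rule", "case_number"),
    "venue": ("info", "court_name"),
}
generated = fill_template(
    template,
    bindings,
    info_items={"court_name": "the district court"},
    rule_items={"case_number": "1234"},
)
print(generated)
```

Because the bindings are data rather than code, the same template can be filled with different rule items for different sub-scenarios.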

FIG. 2 is a schematic structural diagram of the present invention, and FIG. 3 is a schematic operational diagram of the present invention. The core structures of the invention are dynamically configured information, an extensible rule engine, and an efficient template engine. If the information to be extracted from a certain region of the text changes, the user can handle the change by adding to the information dictionary context and extending the word packets. If the extraction rules and the extracted content for a certain region of the text change, the user can clarify the current extraction dependencies by means of the dependency directed graph, and then select the rules that need to be modified, adjust them, and reapply them. In enterprises, text generation requirements often change frequently; with the meta-template design of the invention, information items can be replaced and text areas redesigned without writing any code, and the template generation task can be completed through online configuration of information items and rules.

FIG. 4 is a flow chart of the algorithm of the rule dependency analyzer of the present invention. After the user writes a rule, the rule passes through the rule syntax analyzer, which generates an abstract syntax tree containing expression information. By traversing the abstract syntax tree, the dependency items and export items of the rule can be obtained. The algorithm needs to identify specific expressions in the syntax tree, which are classified as follows:

a. Dependency-related expressions: attribute access expressions, variable expressions and array expressions.

b. Export-item-related expressions: method call expressions.

c. Local variable expressions: declaration expressions and assignment expressions.

The algorithm steps are as follows.

Step 1: generate an abstract syntax tree from the rule using the syntax parser;

step 2: traverse the abstract syntax tree;

step 3: has the expression traversal finished? If so, go to step 8; otherwise, go to step 4;

step 4: is the current expression a local variable definition expression? If so, add it to the local variable set and go to step 3; otherwise, go to step 5;

step 5: is the current expression an attribute expression whose attribute is not in the local variable set? If so, add it to the dependency set and go to step 3; otherwise, go to step 6;

step 6: is the current expression a method call expression whose method is an export function? If so, add it to the definition set and go to step 3; otherwise, go to step 7;

step 7: recursively traverse the other expressions and go to step 3;

step 8: output the dependency items and the definition items;

step 9: end.
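
The algorithm steps above can be sketched with Python's standard `ast` module standing in for the rule syntax parser. This is only an illustration under several assumptions: the rule script is taken to use Python syntax, the export function is assumed to be called `export` with the item name as its first string argument, and dependencies are tracked at the level of the root variable name rather than the full attribute path.

```python
import ast

EXPORT_FUNCTION = "export"  # hypothetical name of the predefined export function


def analyze_rule(source):
    """Return (dependencies, exports, local_vars) for a rule script."""
    local_vars, dependencies, exports = set(), set(), set()

    def visit(node):
        # Step 4: a local variable definition adds to the local variable set.
        if isinstance(node, ast.Assign):
            visit(node.value)  # the right-hand side may contain dependencies
            for target in node.targets:
                if isinstance(target, ast.Name):
                    local_vars.add(target.id)
                else:
                    visit(target)
            return
        # Step 5: a name that is not local is a dependency.
        if isinstance(node, ast.Name):
            if node.id not in local_vars:
                dependencies.add(node.id)
            return
        # Step 6: a call to the export function adds to the definition set.
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == EXPORT_FUNCTION
        ):
            first = node.args[0] if node.args else None
            if isinstance(first, ast.Constant) and isinstance(first.value, str):
                exports.add(first.value)
            for argument in node.args[1:]:
                visit(argument)
            return
        # Step 7: recurse into every other kind of expression.
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))  # steps 1 and 2
    return dependencies, exports, local_vars  # step 8


rule_source = """
amount = judgment.text[0]
total = amount + fee.value
export("total_due", total)
"""
dependencies, exports, local_vars = analyze_rule(rule_source)
print(sorted(dependencies), sorted(exports), sorted(local_vars))
```

Statements are visited in source order, so a name counts as local only after its assignment has been seen; `amount` and `total` are therefore excluded from the dependency set, while `judgment` and `fee` are included.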

FIG. 5 is a schematic diagram of text annotation according to the present invention. After importing text information, the user can annotate it. The user selects a text region and, once the selection is made, assigns the region a name for use when writing extraction rules later. After the text has been annotated, text information can be located through the annotated text segments.
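
A named annotation of this kind can be modeled, as a minimal sketch, by a label plus character offsets into the imported text. The `Annotation` class, its fields and the sample text are hypothetical; the source does not state how annotated regions are stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    name: str   # name the user assigns to the selected region
    start: int  # offset of the first character of the region
    end: int    # offset one past the last character of the region


def annotated_text(text, annotation):
    """Locate the text segment covered by an annotation."""
    return text[annotation.start:annotation.end]


text = "Plaintiff: Zhang San. Defendant: Li Si."
start = text.index("Zhang San")
plaintiff = Annotation("plaintiff", start, start + len("Zhang San"))
print(annotated_text(text, plaintiff))
```

The stored segments can later serve as the reference values against which extraction results are compared.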

FIG. 6 is a schematic diagram of writing an actual extraction rule according to the present invention, and FIG. 7 is a schematic diagram of a dependency analysis result according to the present invention. After annotating the text, the user can write specific extraction rules to extract information from the text. The user can extract text information in detail by configuring algorithm rules and by writing keywords, regular expressions or scripts. Meanwhile, by viewing the dependency directed graph of the text, the user can analyze the dependency items and export items of the current rule in order to optimize and adjust it.

FIG. 8 is a schematic diagram of meta-template editing and text generation according to the present invention, and FIG. 9 is a diagram of an information text generation result according to the present invention. After finishing the extraction rules for the text information, the user can enter the text generation template editing module and generate the required information text by defining the template content for text generation and the extraction rules on which that content depends.

In conclusion, the invention effectively extends traditional text information extraction methods, making the whole extraction process more effective and allowing the user to conveniently discover more information content in text. Rule-based extraction scripts can be effectively reused, which also makes them easier to maintain. By analyzing the syntax of rule scripts, the invention helps the user understand the dependency relationships among the current texts and lays a good foundation for subsequent information extraction and extraction optimization. In addition, to make better use of the extracted information, the invention introduces the concept of a meta-template, with which a user can generate text in an online, visual manner, greatly reducing the complexity of text generation and improving its efficiency.

The technical means disclosed in the invention scheme are not limited to the technical means disclosed in the above embodiments, but also include the technical scheme formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.

Claims

1. A general text information extraction and text generation method based on rules is characterized by comprising the following steps:

step two: information labeling is carried out on text information

step four: generating a rule dependency directed graph

The text extraction rule is put into a rule engine to be executed, an extraction text with a good structure can be generated, the content of the extraction information is compared with the content of the text marking information at the beginning, and the accuracy of the extraction information is generated;

step six: defining the information generation meta-template

step seven: custom template rule selection and text generation

2. The method for rule-based extraction of general text information and text generation according to claim 1, wherein the step one comprises the steps of:

step 1-1: an initial state;

3. The method for extracting and generating general text information based on rules according to claim 1, wherein the second step comprises the following steps:

step 2-1: an initial state;

step 2-8: and finishing labeling the text information.

4. The method for extracting and generating general text information based on rules according to claim 1, wherein the third step comprises the following steps:

step 3-1: an initial state;

5. The method for extracting and generating general text information based on rules according to claim 4, wherein in the step 3-4, the step that the user needs to summarize features from the text to be extracted specifically comprises the following steps: the user can perform feature induction from the context-free keywords, specified phrases or regular expressions and other modes, and also can perform feature induction from the context-free modes containing specific semantics.

6. The method for extracting and generating general text information based on rules according to claim 1, wherein the step four comprises the steps of:

step 4-1: an initial state;

7. The method of claim 1, wherein in step five, if the accuracy of text extraction does not reach the target, the insufficiently accurate extraction rules continue to be adjusted until the extraction accuracy reaches a specified threshold.

8. The method for rule-based extraction of general text information and text generation according to claim 7, wherein the step five comprises the steps of:

step 5-1: an initial state;

9. The method for rule-based general text-information extraction and text generation according to claim 1, wherein the sixth step comprises the steps of:

step 6-1: an initial state;

step 6-2: a user creates an information generation meta-template with a name;

step 6-5: and finishing defining the information generation meta-template.

10. The method for rule-based general text-information extraction and text generation according to claim 1, wherein the seventh step comprises the steps of:

step 7-1: an initial state;