CN112463988A - Method for extracting Chinese classical garden information - Google Patents

Method for extracting Chinese classical garden information

Info

Publication number
CN112463988A
CN112463988A
Authority
CN
China
Prior art keywords
state
entity
conversion
sigma
relation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011450290.1A
Other languages
Chinese (zh)
Inventor
刘耀忠
黄亦工
王亚弟
常少辉
吕洁
孙萌
费晓飞
谢帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Bayi Space Information Engineering Co ltd
Beijing Preparatory Office Of Museum Of Chinese Gardens And Landscape Architecture
Original Assignee
Beijing Bayi Space Information Engineering Co ltd
Beijing Preparatory Office Of Museum Of Chinese Gardens And Landscape Architecture
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Bayi Space Information Engineering Co ltd, Beijing Preparatory Office Of Museum Of Chinese Gardens And Landscape Architecture filed Critical Beijing Bayi Space Information Engineering Co ltd
Priority to CN202011450290.1A priority Critical patent/CN112463988A/en
Publication of CN112463988A publication Critical patent/CN112463988A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/335 - Filtering based on additional data, e.g. user or group profiles
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method for extracting Chinese classical garden information, comprising the following steps: 1. computing a word-vector embedding sequence from the input; 2. applying Bi-LSTM (bidirectional long short-term memory) encoding to the sequence; 3. executing a state transition; when the final state is reached, the entity and relation information has been extracted and the process ends, otherwise the next step is chosen by probability calculation; 4. selecting an entity-extraction state-transition action or a relation-extraction state-transition action; 5. returning to step 3 after the action is executed, finally obtaining the extracted entities and relations. The technical scheme of the invention mainly has the following advantages: 1. an information extraction algorithm for knowledge in the field of Chinese classical gardens is proposed for the first time; 2. the utilization rate and execution efficiency of the information are improved; 3. the method can be widely applied to classical gardens nationwide.

Description

Method for extracting Chinese classical garden information
Technical Field
The invention relates to the technical field of processing natural language data, information retrieval and database structures thereof, in particular to a method for extracting Chinese classical garden information.
Background
Chinese classical gardens are world-renowned for their exquisite gardening craft and profound cultural connotations, and are an important component of traditional Chinese culture. An effective means of protecting and inheriting them is to apply modern information technology to realize digitization. One important basis for digitization is storing the relevant information in a computer. Computers already hold large amounts of data such as garden historical archives, videos, pictures and text materials; the greatest challenge is how to organize this massive unstructured data so as to support efficient information retrieval. At present, the data storage technology best able to support efficient information retrieval is the knowledge graph. A Knowledge Graph (KG) stores knowledge using graph structures, describing the various entities or concepts existing in the real world and the relations between them to form a huge semantic network, in which nodes represent entities or concepts and edges are formed by attributes or relations. A knowledge graph is usually represented by triples, whose basic forms mainly include (entity1, relation, entity2) and (entity, attribute, value).
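As a concrete illustration of the triple forms just described, a minimal Python sketch follows; the garden names and the relation and attribute labels are invented for illustration and are not taken from the patent:

```python
# Minimal sketch of the two basic triple forms: (entity1, relation, entity2)
# and (entity, attribute, value). Names and labels are illustrative only.
triples = [
    ("拙政园", "located_in", "苏州"),        # entity - relation - entity
    ("拙政园", "garden_type", "私家园林"),   # entity - attribute - value
]

# A knowledge graph can be viewed as an adjacency structure over such triples:
# each node maps to its outgoing (edge label, neighbour/value) pairs.
graph = {}
for head, rel, tail in triples:
    graph.setdefault(head, []).append((rel, tail))

print(graph["拙政园"])  # → [('located_in', '苏州'), ('garden_type', '私家园林')]
```

Storing both relation and attribute edges in the same adjacency structure reflects how the edges of the semantic network are formed by either attributes or relations.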
Well-known existing knowledge graphs include those established by Google, Microsoft, Baidu, Sogou and other companies, as well as the Open Knowledge Graph for Chinese (OpenKG); they store entities on the order of hundreds of millions. The CASIA-KB project of the Chinese Academy of Sciences extracted Baidu Baike and Hudong Baike to construct a Chinese knowledge graph of Chinese tourist attractions, applicable to geography, daily life, entertainment and so on. The Clinga project of Nanjing University used Chinese Wikipedia as a data source, manually constructed a new geographic ontology, classified various natural-geographic and human-geographic entities and automatically linked them with existing knowledge bases; the resulting Chinese geographic knowledge graph contains more than 500,000 Chinese geographic entities and is publicly accessible.
However, a search shows that existing knowledge graphs do not contain systematic knowledge of Chinese classical gardens. A Chinese classical garden knowledge graph must therefore be constructed from scratch.
The core of knowledge graph construction is information extraction. Many tools exist to extract information from structured, semi-structured, and unstructured data to obtain knowledge.
D2RQ is a tool for exposing a relational database as a virtual RDF database; it comprises three components: D2R Server, D2RQ Engine and D2RQ Mapping Language. However, it is difficult to combine with knowledge-modeling results for mapping, difficult to fuse with other types of knowledge, and difficult to scale to large or incremental mappings.
Lixto and WIE can generate web-page wrappers to obtain knowledge from web-page data, but they were mainly developed for early static pages and need to be extended to support dynamic pages.
DeepDive and Snorkel provide extraction frameworks for specific relations based on distant supervision: an existing knowledge base and rule definitions are used to generate training corpora automatically, model training is completed automatically, machine learning is used to reduce noise and uncertainty, and user-supplied rules influence the learning process to improve result quality. DeepKE, a relation extraction tool developed at Zhejiang University, uses a variety of deep learning algorithms such as convolutional neural networks, recurrent neural networks, attention networks, graph convolutional networks, capsule networks and pre-trained language models. However, DeepDive, Snorkel and DeepKE only perform relation extraction and provide no extraction of concepts, entities, events and the like.
Existing knowledge element (entity and relation) extraction technologies and methods are usually applied to data sets of limited domains and subjects. Although good results have been obtained, the many restrictions make these methods insufficiently extensible, so they cannot well meet the requirements of extracting Chinese classical garden information.
The primary task of knowledge extraction is named entity recognition. The prior art generally recognizes named entities of three major categories (entity, time and number) and seven minor categories (person name, organization name, place name, time, date, currency and percentage) in the text to be processed. Most research focuses on recognizing names of people, places, organizations, proper nouns, etc.
Therefore, existing knowledge extraction technology cannot meet the requirement of building a Chinese classical garden knowledge graph.
In summary, given that no Chinese classical garden knowledge graph exists and that current information extraction technology does not meet the requirement of automatically constructing one, the invention aims to provide a Chinese classical garden information extraction algorithm and lay a solid foundation for constructing the Chinese classical garden knowledge graph.
The knowledge system of the classical garden knowledge graph mainly comprises core content in three aspects: classification of concepts, description of concept attributes, and definition of the interrelations among concepts. Its basic form comprises five levels: vocabulary, concepts, classification relations, non-classification relations and axioms. Based on a combination of automatic and manual construction, knowledge-learning methods and technologies for unstructured, structured and semi-structured data are studied; entity recognition, taxonomy construction and concept attribute and relation extraction in the classical garden field are achieved with natural language processing tools, and the knowledge system of the classical garden knowledge graph is thereby constructed.
The entity is the basic unit of the knowledge graph and an important language unit carrying information in text. Entity recognition and analysis are key technologies supporting knowledge graph construction and application. The application of machine-learning-based entity recognition to classical garden knowledge graph construction is studied, focusing on neural network methods: based on deep learning, a neural network automatically captures effective features from the text and then completes named entity recognition. The main steps are as follows. Model design and construction: character symbols are represented as distributed features, with the characters and words in the text represented by a bidirectional LSTM. Model training: network parameters are optimized with labeled data, using training methods such as stochastic gradient descent to train the whole network. Model classification: the trained model classifies new samples to complete entity recognition. The bidirectional LSTM produces a feature representation of the input text, which is fed into a CRF; each word in the sentence is classified, the sentence is scored as a whole, and the final classification result completes entity recognition.
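The BiLSTM-CRF pipeline above ends with a per-word classification result. A minimal sketch of how such per-token labels could be turned into recognized entities, assuming a standard BIO labeling scheme; the tag names BUILDING and GARDEN and the example sentence are hypothetical, not defined by the patent:

```python
def bio_to_entities(tokens, labels):
    """Collect (entity_text, type) spans from per-token BIO labels."""
    entities, cur, cur_type = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if cur:                               # close any open span
                entities.append(("".join(cur), cur_type))
            cur, cur_type = [tok], lab[2:]
        elif lab.startswith("I-") and cur:
            cur.append(tok)                       # extend the open span
        else:
            if cur:
                entities.append(("".join(cur), cur_type))
            cur, cur_type = [], None
    if cur:
        entities.append(("".join(cur), cur_type))
    return entities

# Hypothetical labelling of a garden sentence (tags are assumed examples).
tokens = ["留", "听", "阁", "在", "拙", "政", "园"]
labels = ["B-BUILDING", "I-BUILDING", "I-BUILDING", "O",
          "B-GARDEN", "I-GARDEN", "I-GARDEN"]
print(bio_to_entities(tokens, labels))
# → [('留听阁', 'BUILDING'), ('拙政园', 'GARDEN')]
```

In the full method this decoding step would consume the CRF's output labels rather than hand-written ones.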
Classical garden entity linking associates entities across multiple data sources through links, so as to better represent the semantic associations among entities in different sources and achieve multi-source data fusion for semantic understanding and semantic analysis in classical garden artificial intelligence. In the various classical garden text sources, expressions of the entities of the four major elements (mountains and water, buildings, plants, terrain) are diverse; irregular expressions such as entity abbreviations and unclear contextual references bring great difficulty to entity linking. According to the correlation calculation adopted, entity linking methods fall into two main categories. Entity-based methods compute mainly over the entity's surface characters, for example the Jaro-Winkler string edit distance and the Smith-Waterman algorithm. Methods based on entity background information generally include cosine similarity, Jaccard coefficients, topic models, word vectors, SimRank and graph structures. For the concrete reality of classical garden knowledge graph construction, this work focuses on entity linking over multiple knowledge bases: multi-source entity linking fuses the same entity from different data sources, solving the low coverage of a single-source knowledge graph and fundamentally promoting garden data fusion.
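Of the measures listed above, the Jaccard coefficient is the simplest to sketch. A hedged example over character sets follows; the mention and candidate names are illustrative, and a real linker would combine several such measures with background information:

```python
def jaccard(a: str, b: str) -> float:
    """Character-set Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

# Linking an abbreviated mention to candidate knowledge-base entries.
mention = "拙政园"
candidates = ["苏州拙政园", "颐和园"]
best = max(candidates, key=lambda c: jaccard(mention, c))
print(best)  # → 苏州拙政园
```

The abbreviation shares all three of its characters with the first candidate (Jaccard 3/5 = 0.6) but only one with the second (1/5 = 0.2), so the fuller name is selected.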
How to identify relations between entities in structured or unstructured text is one of the core tasks of knowledge graph construction, and relation extraction is an important supporting technology for text content understanding. Entity relation extraction is the key link in constructing the classical garden knowledge graph. Entity relation mining methods can be classified as pattern-matching-based, semantic-dictionary-based, feature-based and machine-learning-based. Pattern matching builds on the entity recognition result: taking sentences as units, patterns are formulated according to indicator words, and the relation between the corresponding entities is then determined by pattern matching. Dictionary-based methods determine entity relations from semantic dictionary resources according to the associations between entities. Feature-based methods use features such as entity type, part of speech, positions between words, and the words and parts of speech before and after the entity; through continuous iteration and aggregation, entity groups (usually two non-homogeneous entities) with the same features are regarded as the same type, and relation mining is then performed. Machine-learning methods are most common in current entity relation mining; their idea is to convert relation mining into a classification problem.
Knowledge graph construction starts from the most original data (structured, semi-structured and unstructured), extracts knowledge facts from the original database and third-party databases by a series of automatic or semi-automatic technical means, and stores them in the data layer and schema layer of the knowledge base. The process comprises four stages, repeated in each update iteration: information extraction, knowledge representation, knowledge fusion and knowledge reasoning. There are two main construction modes: top-down and bottom-up. Top-down means first defining the ontology and data schema of the knowledge graph and then adding entities to the knowledge base. Bottom-up means extracting entities from open linked data, selecting those with higher confidence for addition to the knowledge base, and then constructing the top-level ontology schema.
Disclosure of Invention
The invention aims to solve the problems in the prior art and provides a method for extracting Chinese classical garden information.
In order to achieve this purpose, the invention adopts the following technical scheme: a method for extracting Chinese classical garden information comprising the following steps: 1. computing a word-vector embedding sequence from the input; 2. applying Bi-LSTM (bidirectional long short-term memory) encoding to the sequence; 3. executing a state transition; when the final state is reached, the entity and relation information has been extracted and the process ends, otherwise the next step is chosen by probability calculation; 4. selecting an entity-extraction state-transition action or a relation-extraction state-transition action; 5. returning to step 3 after the action is executed, finally obtaining the extracted entities and relations.
The specific method of the input calculation described in step 1 of this patent is as follows:
Word vector embedding:
For each input token, the vector embedding is calculated by the following formula:
x_i = V [w_i ; w̃_i]
where w_i is the learned word vector, w̃_i is a fixed word vector, and V is the matrix applied to the concatenation of the two vectors.
The vector embedding sequence is obtained by calculation:
x = (x_1, x_2, …, x_i, …, x_n).
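A minimal numeric sketch of the embedding formula, under the assumption that x_i = V[w_i ; w̃_i], i.e. V linearly maps the concatenation of the learned and fixed vectors; the dimensions and all values are arbitrary illustrations:

```python
# Sketch of x_i = V [w_i ; w~_i]: a learned vector and a fixed pretrained
# vector are concatenated and linearly mapped. Toy 2-dim vectors, 2x4 matrix.
def matvec(V, v):
    """Plain matrix-vector product over nested lists."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in V]

w_i       = [0.1, 0.2]        # learned word vector (trainable in practice)
w_i_fixed = [0.3, 0.4]        # fixed pretrained word vector
V = [[1, 0, 0, 0],            # V maps the 4-dim concatenation to 2 dims
     [0, 0, 0, 1]]

x_i = matvec(V, w_i + w_i_fixed)   # list "+" is concatenation here
print(x_i)  # → [0.1, 0.4]
```

Applying this per token yields the sequence x = (x_1, …, x_n) that is fed to the Bi-LSTM encoder in step 2.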
the specific method of Bi-LSTM encoding described in step 2 of this patent is as follows:
performing Bi-LSTM encoding on the sequence x obtained in the step (1), namely bidirectional long-short term memory encoding, firstly according to the sequence x1To xnIn order of forward LSTM encoding
Figure BDA0002826558410000071
Then according to the following from xnTo x1Sequentially backward LSTM encoding
Figure BDA0002826558410000072
Each LSTM encoding includes the following six steps:
1) State training:
The current input x_t and the hidden state h_{t-1} passed from the previous step are concatenated and trained to yield four states. Three of them are gating states z_f, z_i, z_o: the concatenated vector is multiplied by a weight matrix and passed through a sigmoid activation to a value between 0 and 1, serving as a gate. The fourth state z is obtained by passing the result through a tanh activation to a value between -1 and 1; it is a candidate state rather than a gating signal.
2) Forgetting:
The state z_f acts as the forget gate, controlling by elementwise multiplication which parts of the previous long-term memory c_{t-1} are kept (the important) and which are dropped (the unimportant), calculated as:
z_f ⊙ c_{t-1}
3) Selective memory:
The state z_i acts as the gating signal, controlling by elementwise multiplication which parts of the candidate state z, and hence of the input x_t, are memorized, the important fully and the unimportant only slightly, calculated as:
z_i ⊙ z
4) Calculating long-term memory:
The results of the previous two steps are added to obtain the long-term memory c_t passed to the next step, calculated as:
c_t = z_f ⊙ c_{t-1} + z_i ⊙ z
5) Calculating short-term memory:
The state z_o gates the value of c_t scaled by a tanh activation, yielding the short-term memory h_t, calculated as:
h_t = z_o ⊙ tanh(c_t)
6) Output:
Finally h_t is transformed into the output y_t, calculated as:
y_t = σ(W' h_t)
The forward LSTM encoding results are recorded as →h_t, and the backward LSTM encoding results as ←h_t. The two results are concatenated as
h_t = [→h_t ; ←h_t]
which is the Bi-LSTM encoding result.
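The six steps above can be sketched as a single LSTM step in Python. This is a toy one-dimensional version with an assumed weight layout (one gate per row of W over the pair [x_t, h_{t-1}]); the output step y_t = σ(W'h_t) and the bidirectional concatenation are omitted:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One scalar LSTM step following the six steps in the text.
    W has four rows of weights over the spliced pair [x_t, h_prev]:
    rows 0..2 feed the gates z_f, z_i, z_o; row 3 feeds the candidate z."""
    pre = [row[0] * x_t + row[1] * h_prev + bi for row, bi in zip(W, b)]
    z_f = sigmoid(pre[0])            # step 1: forget gate in (0, 1)
    z_i = sigmoid(pre[1])            #         input (select-memory) gate
    z_o = sigmoid(pre[2])            #         output gate
    z   = math.tanh(pre[3])          #         candidate state in (-1, 1)
    c_t = z_f * c_prev + z_i * z     # steps 2-4: forget + select -> long-term memory
    h_t = z_o * math.tanh(c_t)       # step 5: short-term memory
    return h_t, c_t

h_t, c_t = lstm_step(1.0, 0.0, 0.0, [[0.5, 0.5]] * 4, [0.0] * 4)
print(round(h_t, 4), round(c_t, 4))
```

Running the step forward over x_1..x_n and backward over x_n..x_1 and concatenating the two hidden sequences gives the Bi-LSTM result described above.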
The specific method of the state transition described in step 3 of this patent is as follows:
A six-tuple (σ, δ, e, β, E, R) representing the state at each moment is defined, where σ is a stack storing generated entities, δ is a stack storing entities that are pushed again after being temporarily popped from σ, e stores the partial entity block being processed, β is a buffer containing the unprocessed words, E stores the set of generated entities, and R stores the set of generated relations.
The information extraction task can then be expressed as the state-transition process from the initial state
([], [], [], w, ∅, ∅)
to the final state
(σ, δ, [], [], E, R)
where [] denotes an empty stack and ∅ denotes an empty set.
For the state at time t:
m_t = max{0, W[s_t; b_t; p_t; e_t; a_t] + d}
The probability of each candidate action is calculated from m_t by a softmax:
p(z_t | m_t) = exp(g_{z_t}ᵀ m_t + q_{z_t}) / Σ_{z'∈A(t)} exp(g_{z'}ᵀ m_t + q_{z'})
which predicts the state-transition action to be selected at time t. According to the prediction result, an entity-extraction or a relation-extraction transition of step 4 is executed, after which the process returns to step 3, until the final state is reached.
Given an input w, the probability of any reasonable sequence of state-transition actions z can be expressed as:
p(z | w) = ∏_t p(z_t | m_t)
Therefore:
z* = argmax_z ∏_t p(z_t | m_t)
When e and β in the state six-tuple are empty, the final state is reached and the state transition ends; at this moment the extracted entities and relations are in the sets E and R respectively, and can be output as the result of the algorithm.
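The transition loop of step 3 can be sketched as follows. The probability-based action choice is replaced by a caller-supplied stub (the trained scorer over m_t is not reproduced here), and the trivial stub shown simply treats every buffered word as an entity, purely to exercise the loop:

```python
from collections import namedtuple

# The six-tuple (sigma, delta, e, beta, E, R) from the text.
State = namedtuple("State", "sigma delta e beta E R")

def is_final(s: State) -> bool:
    # Final state: the partial-entity block e and the buffer beta are empty.
    return not s.e and not s.beta

def run(words, choose_action):
    state = State([], [], [], list(words), set(), set())   # initial state
    while not is_final(state):
        state = choose_action(state)   # one entity- or relation-extraction action
    return state.E, state.R

# Trivial stub standing in for the learned action predictor (illustration only).
def shift_all(s: State) -> State:
    j, *rest = s.beta
    return State(s.sigma + [j], s.delta, s.e, rest, s.E | {j}, s.R)

E, R = run(["亭", "山"], shift_all)
print(sorted(E), sorted(R))  # → ['亭', '山'] []
```

In the actual method, `choose_action` would score the actions of step 4 with the softmax over m_t and execute the most probable legal one.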
The state-transition actions of entity extraction in step 4 of this patent include the following three types:
1) Delete
Transition condition: j ∉ E and e = []
State before transition: ([σ|i], δ, e, [j|β], E, R)
State after transition: ([σ|i], δ, e, β, E, R)
When this transition is selected and executed: the currently processed word j is not in the entity set E and the partial entity block e is an empty stack, indicating that j is not target information to be extracted, so j is deleted from the buffer β.
2) Transfer
Transition condition: j ∉ E
State before transition: ([σ|i], δ, e, [j|β], E, R)
State after transition: ([σ|i], δ, [j|e], β, E, R)
When this transition is selected and executed: the currently processed word j is not in the entity set E but is selected for further processing, so j is transferred from the buffer β to the partial entity block e.
3) Entity recognition
Transition condition: j ∉ E and e ≠ []
State before transition: ([σ|i], δ, [j|e], β, E, R)
State after transition: ([σ|i], δ, [], [j|β], E ∪ {j}, R)
When this transition is selected and executed: the currently processed word j is not in the entity set E and the partial entity block e is not empty, so j is marked and moved back to the buffer β, and the new entity j is merged into the entity set E.
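The three entity actions can be sketched as functions on a dict-based state. The transition conditions are simplified to an assert or omitted, and the entity-recognition step follows one plausible reading of "marking j and moving it back to the buffer": the characters collected in e are joined into a single entity unit that re-enters β. The garden name is illustrative:

```python
def delete(s):
    """DELETE: j not in E and e empty -> drop j from the buffer."""
    assert s["beta"][0] not in s["E"] and not s["e"]
    s["beta"] = s["beta"][1:]
    return s

def transfer(s):
    """TRANSFER: move buffer-front j into the partial-entity block e."""
    j = s["beta"][0]
    s["e"] = [j] + s["e"]
    s["beta"] = s["beta"][1:]
    return s

def recognize(s):
    """ENTITY RECOGNITION: close the non-empty block e as one entity j,
    push j back onto the buffer, and merge j into the entity set E."""
    j = "".join(reversed(s["e"]))       # characters collected in e, in order
    s["e"] = []
    s["beta"] = [j] + s["beta"]
    s["E"] = s["E"] | {j}
    return s

s = {"sigma": [], "delta": [], "e": [], "beta": ["网", "师", "园"],
     "E": set(), "R": set()}
s = transfer(s); s = transfer(s); s = transfer(s); s = recognize(s)
print(s["beta"], sorted(s["E"]))  # → ['网师园'] ['网师园']
```

Three transfers collect the characters of one garden name into e, and one recognition emits it as an entity.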
The state-transition actions of relation extraction in step 4 of this patent include the following seven types:
1) Extract a left relation and pop the endpoint entity
Transition condition: a left relation r from j to i exists, and i takes part in no further relations
State before transition: ([σ|i], δ, e, [j|β], E, R)
State after transition: (σ, δ, e, [j|β], E, R ∪ {(j, r, i)})
When this transition is selected and executed: a left relation has been found; the relation is merged into the relation set R, and the relation endpoint entity i is popped from the generated-entity stack σ.
2) Extract a right relation and transfer the endpoint entity
Transition condition: a right relation r from i to j exists
State before transition: ([σ|i], δ, e, [j|β], E, R)
State after transition: ([σ|i|j], δ, e, β, E, R ∪ {(i, r, j)})
When this transition is selected and executed: a right relation has been found; the relation is merged into the relation set R, and the relation endpoint entity j is transferred to the generated-entity stack σ.
3) Extract no relation and transfer the entity
Transition condition: no relation between i and j is extracted
State before transition: ([σ|i], δ, e, [j|β], E, R)
State after transition: ([σ|i|δ|j], [], e, β, E, R)
When this transition is selected and executed: no relation is extracted; the contents of δ are pushed back onto σ, and the entity j is transferred to the generated-entity stack σ.
4) Extract no relation and pop the entity
Transition condition: no relation between i and j is extracted
State before transition: ([σ|i], δ, e, [j|β], E, R)
State after transition: (σ, δ, e, [j|β], E, R)
When this transition is selected and executed: no relation is extracted, and the entity i is popped from the generated-entity stack σ.
5) Extract a left relation and put the endpoint entity into the temporary stack
Transition condition: a left relation r from j to i exists, and i may take part in further relations
State before transition: ([σ|i], δ, e, [j|β], E, R)
State after transition: (σ, [i|δ], e, [j|β], E, R ∪ {(j, r, i)})
When this transition is selected and executed: a left relation has been found; the relation is merged into the relation set R, and the relation endpoint entity i is popped from the generated-entity stack σ and then pushed onto the temporary stack δ.
6) Extract a right relation and put the start entity into the temporary stack
Transition condition: a right relation r from i to j exists, and i may take part in further relations
State before transition: ([σ|i], δ, e, [j|β], E, R)
State after transition: (σ, [i|δ], e, [j|β], E, R ∪ {(i, r, j)})
When this transition is selected and executed: a right relation has been found; the relation is merged into the relation set R, and the relation start entity i is popped from the generated-entity stack σ and then pushed onto the temporary stack δ.
7) Extract no relation and put the entity into the temporary stack
Transition condition: none
State before transition: ([σ|i], δ, e, [j|β], E, R)
State after transition: (σ, [i|δ], e, [j|β], E, R)
When this transition is selected and executed: the entity i is directly popped from the generated-entity stack σ and then pushed onto the temporary stack δ.
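Two of the seven relation actions (types 1 and 2) can be sketched in Python on the same dict-based state, with relations recorded as (head, relation, tail) triples. The relation label "located_in" and the example entities are illustrative, not labels defined by the patent:

```python
def left_reduce(s, rel):
    """Type 1: left relation j -> i; record the triple, pop endpoint i off sigma."""
    i, j = s["sigma"][-1], s["beta"][0]
    s["R"] = s["R"] | {(j, rel, i)}
    s["sigma"] = s["sigma"][:-1]
    return s

def right_shift(s, rel):
    """Type 2: right relation i -> j; record the triple, move endpoint j onto sigma."""
    i, j = s["sigma"][-1], s["beta"][0]
    s["R"] = s["R"] | {(i, rel, j)}
    s["sigma"] = s["sigma"] + [j]
    s["beta"] = s["beta"][1:]
    return s

# One right relation between two already-recognized entities (names illustrative).
s = {"sigma": ["拙政园"], "delta": [], "e": [], "beta": ["苏州"],
     "E": {"拙政园", "苏州"}, "R": set()}
s = right_shift(s, "located_in")
print(s["sigma"], sorted(s["R"]))  # → ['拙政园', '苏州'] [('拙政园', 'located_in', '苏州')]
```

The remaining five actions differ only in whether the popped entity goes to the temporary stack δ and whether a triple is recorded, following the transition tables above.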
The technical scheme of the invention mainly has the following technical advantages.
1. The information extraction algorithm aiming at the knowledge in the field of Chinese classical gardens is put forward for the first time
The method for constructing a knowledge graph differs with the application field and requirements. Existing methods target general knowledge or other specific fields, not the field of Chinese classical gardens; they cannot meet the application requirements and cannot be applied directly. A search found no existing information extraction algorithm for Chinese classical garden domain knowledge.
The technical scheme of the invention is built closely around Chinese classical garden knowledge: a domain knowledge ontology model is established, entities and relation types are defined according to the Chinese classical garden concept model, and an automatic extraction algorithm for classical garden knowledge information is formed.
The invention finally designs five algorithm steps, proposes for the first time an information extraction algorithm for Chinese classical garden domain knowledge, and verifies through an application example that the method is effective and feasible.
2. Improving the utilization rate and the execution efficiency of information
In the mainstream three-layer models (word vector + BiLSTM + CRF and BERT + BiLSTM + CRF), the output-layer CRF (conditional random field) handles a sequence-labeling problem; it is difficult to extract entity and relation information simultaneously, and a large amount of useful information directly relating entities and relations is discarded.
The algorithm of the invention replaces the CRF output layer with a state-transition layer, turning the sequence-labeling problem into the problem of generating a directed graph through state transitions. The association information between entities and relations is fully used during processing, so the algorithm not only extracts both entities and relations but also improves the utilization rate and execution efficiency of the information.
3. Widely applicable to classical gardens nationwide
The Chinese classical garden information extraction algorithm is scientifically designed, rigorously structured, and standard in format, and has been applied and verified during the construction of the Chinese classical garden knowledge graph. The invention can therefore be widely applied to classical gardens nationwide.
Detailed Description
A method for extracting Chinese classical garden information comprises the following steps: 1. calculate the word vector embedding sequence from the input; 2. apply Bi-LSTM encoding, i.e. bidirectional long short-term memory encoding, to the sequence; 3. execute state transitions; if the final state is reached, the entity and relation information has been extracted and the process ends, otherwise proceed to the next step according to the probability calculation; 4. select an entity extraction state transition action or a relation extraction state transition action; 5. after execution, return to step 3; the extracted entities and relations are finally obtained.
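The five-step control flow above can be sketched as a small transition loop. The following Python sketch is a drastically simplified stand-in: real actions are chosen by the learned model described in steps 3 and 4, whereas here a hand-written rule based on an invented `entity_words` set plays that role, and relation actions are omitted entirely.

```python
def extract(tokens, entity_words):
    """Toy transition loop over the six-tuple state; relations omitted."""
    sigma, delta, e = [], [], []   # entity stack, temporary stack, partial block
    beta = list(tokens)            # buffer of unprocessed words
    E, R = set(), set()            # extracted entity set / relation set
    while e or beta:               # final state: e and beta both empty
        if beta and beta[0] in entity_words:
            e.append(beta.pop(0))  # "transfer": move word into block e
        elif e:
            entity = " ".join(e)   # "entity identification": close block e
            E.add(entity)
            sigma.append(entity)
            e.clear()
        else:
            beta.pop(0)            # "delete": not target information
    return E, R

ents, rels = extract("the Summer Palace is a garden".split(),
                     {"Summer", "Palace", "garden"})
```

Running this on the toy sentence yields the entities "Summer Palace" and "garden"; in the patented method the branch taken at each step is instead the action with the highest predicted probability.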
The specific method of input calculation described in step 1 of this patent is as follows:
word vector embedding:
for each input token, the vector embedding is calculated as:
x_i = V[w_i ; w̄_i]
where w_i is the learned word vector, w̄_i is a fixed (pretrained) word vector, and V is the matrix applied to the concatenation of the two vectors;
the calculation yields the vector embedding sequence:
x = (x_1, x_2, ......, x_i, ...... x_n).
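A minimal numpy sketch of this embedding step, assuming x_i is formed by projecting the concatenation of a learned vector and a fixed pretrained vector; all dimensions, the toy vocabulary, and the random initialization are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["拙政园", "位于", "苏州"]                      # toy token sequence
learned = {t: rng.normal(size=50) for t in vocab}     # trainable embeddings w_i
fixed = {t: rng.normal(size=100) for t in vocab}      # frozen pretrained w̄_i
V = rng.normal(size=(64, 150))                        # projection of [w_i ; w̄_i]

def embed(token):
    concat = np.concatenate([learned[token], fixed[token]])  # [w_i ; w̄_i]
    return V @ concat                                        # x_i = V[w_i ; w̄_i]

x = np.stack([embed(t) for t in vocab])  # embedding sequence x = (x_1, ..., x_n)
```

In practice the learned table would be updated during training while the pretrained table stays fixed, which is the usual motivation for keeping the two vectors separate before concatenation.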
the specific method of Bi-LSTM encoding described in step 2 of this patent is as follows:
performing Bi-LSTM encoding, i.e. bidirectional long short-term memory encoding, on the sequence x obtained in step 1: first encode forward with an LSTM in the order x_1 to x_n, recording the results →h_t; then encode backward with an LSTM in the order x_n to x_1, recording the results ←h_t.
Each LSTM encoding includes the following six steps:
1) State training:
The current input x_t and the hidden state h_{t-1} passed from the previous step are concatenated and trained to yield four states. Three of them are gating states z_f, z_i, z_o: the concatenated vector is multiplied by a weight matrix and then mapped by a sigmoid activation function to a value between 0 and 1 that serves as a gating state. The remaining state z is not a gating signal; it is obtained by mapping the result through a tanh activation function to a value between -1 and 1;
2) Forgetting:
The state z_f serves as the forget gate and is applied by element-wise (Hadamard) multiplication to the long-term memory c_{t-1} of the previous step, controlling which parts are kept (important) and which are forgotten (unimportant), calculated as:
z_f ⊙ c_{t-1}
3) Selective memory:
The state z_i serves as the gating signal and is applied by element-wise multiplication to the state z, selectively memorizing the input x_t; important content is recorded strongly and unimportant content weakly, calculated as:
z_i ⊙ z
4) Calculating long-term memory:
The results of the previous two steps are added element-wise to obtain the long-term memory c_t passed to the next step, calculated as:
c_t = z_f ⊙ c_{t-1} + z_i ⊙ z
5) Calculating short-term memory:
The state z_o gates the result of applying the tanh activation function to the c_t just obtained, yielding the short-term memory h_t, calculated as:
h_t = z_o ⊙ tanh(c_t)
6) Output:
Finally the output y_t is obtained from h_t, calculated as:
y_t = σ(W' h_t)
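The six sub-steps above can be written as one function. This numpy sketch packs the four states into a single weight matrix, as is conventional; the matrix layout, the dimensions, and the random initialization are all invented, and the final output projection y_t = σ(W'h_t) is left as a comment:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the six sub-steps in the text.
    W has shape (4*hidden, input+hidden); b has shape (4*hidden,)."""
    hid = h_prev.size
    s = W @ np.concatenate([x_t, h_prev]) + b   # 1) joint state training
    z_f = sigmoid(s[:hid])                      #    forget gate, in (0, 1)
    z_i = sigmoid(s[hid:2 * hid])               #    input gate
    z_o = sigmoid(s[2 * hid:3 * hid])           #    output gate
    z = np.tanh(s[3 * hid:])                    #    candidate state, in (-1, 1)
    c_t = z_f * c_prev + z_i * z                # 2)-4) long-term memory
    h_t = z_o * np.tanh(c_t)                    # 5) short-term memory
    return h_t, c_t                             # 6) y_t = sigmoid(W' @ h_t) omitted

rng = np.random.default_rng(0)
hid, inp = 8, 4
W = rng.normal(size=(4 * hid, inp + hid))
h_t, c_t = lstm_step(rng.normal(size=inp),
                     np.zeros(hid), np.zeros(hid), W, np.zeros(4 * hid))
```

Because z_o lies in (0, 1) and tanh(c_t) lies in (-1, 1), every component of the short-term memory h_t is bounded in magnitude by 1.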
The forward LSTM encoding results are recorded as →h_t, and the backward LSTM encoding results are recorded as ←h_t. The two results are concatenated as
h_t = [→h_t ; ←h_t]
which is the Bi-LSTM encoding result.
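The bidirectional pass itself reduces to running a recurrent step in both directions and concatenating per position. The sketch below uses a trivial stand-in for the full LSTM step so that only the direction and concatenation logic is shown; all sizes are invented:

```python
import numpy as np

def stub_step(x_t, h_prev):
    return np.tanh(x_t + h_prev)        # stand-in for a full LSTM step

def bi_encode(xs, hidden=4):
    h = np.zeros(hidden)
    fwd = []
    for x_t in xs:                      # forward pass: x_1 -> x_n
        h = stub_step(x_t, h)
        fwd.append(h)
    h = np.zeros(hidden)
    bwd = [None] * len(xs)
    for i in range(len(xs) - 1, -1, -1):  # backward pass: x_n -> x_1
        h = stub_step(xs[i], h)
        bwd[i] = h
    # per-position concatenation h_t = [->h_t ; <-h_t]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

xs = [np.ones(4) * i for i in range(3)]
H = bi_encode(xs)
```

Each output vector is twice the hidden size, since each position carries both the forward context (everything before it) and the backward context (everything after it).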
The specific method for state transition described in step 3 of this patent is as follows:
defining a six-tuple (σ, δ, e, β, E, R) representing the state at each moment, wherein σ is a stack storing generated entities, δ is a stack storing entities that are temporarily popped from σ and later pushed back, e stores the partial entity block being processed, β is a buffer containing the unprocessed words, E stores the set of generated entities, and R stores the set of generated relations;
the information extraction task can then be expressed as the state transition process from the initial state ([], [], [], w, ∅, ∅) to the final state (σ, δ, [], [], E, R), where [] denotes an empty stack and ∅ denotes an empty set;
for the state at time t:
m_t = max{0, W[s_t; b_t; p_t; e_t; a_t] + d}
the probability of each candidate action is calculated by:
p(z_t | m_t) = exp(g_{z_t}ᵀ m_t + q_{z_t}) / Σ_{z′} exp(g_{z′}ᵀ m_t + q_{z′})
to predict the state transition action selected at time t; according to the prediction result, go to step 4 or step 5, execute one state transition, and return to step 3, until the final state is reached;
given an input w, the probability of any valid state transition action sequence z can be expressed as:
p(z | w) = ∏_t p(z_t | m_t)
and therefore the decoded action sequence is:
z* = argmax_z ∏_t p(z_t | m_t)
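The action-scoring step can be illustrated numerically. In this sketch the feature vector standing in for [s_t; b_t; p_t; e_t; a_t], the parameters W, d, and the per-action vectors g_z and scalars q_z are all randomly invented; the point is the ReLU followed by a softmax over candidate actions:

```python
import numpy as np

rng = np.random.default_rng(1)
feat = rng.normal(size=20)                  # stand-in for [s_t; b_t; p_t; e_t; a_t]
W = rng.normal(size=(16, 20))
d = rng.normal(size=16)
m_t = np.maximum(0, W @ feat + d)           # m_t = max{0, W[...] + d}

actions = ["DELETE", "TRANSFER", "ENTITY"]  # candidate transition actions
g = {a: rng.normal(size=16) for a in actions}
q = {a: rng.normal() for a in actions}
scores = np.array([g[a] @ m_t + q[a] for a in actions])
probs = np.exp(scores - scores.max())       # numerically stable softmax
probs /= probs.sum()                        # p(z_t | m_t) over candidate actions
best = actions[int(np.argmax(probs))]       # action executed at time t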
and when E and beta in the state six-tuple are empty stacks, the final state is reached, the state conversion is finished, and at the moment, the extracted entities and the extracted relations are respectively in the sets E and R, and the extracted entities and the extracted relations can be used as algorithm execution results to be output.
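The six-tuple state and its final-state test translate directly into a small data structure. This is purely illustrative; field names follow the text (σ, δ, e, β, E, R):

```python
from dataclasses import dataclass, field

@dataclass
class State:
    sigma: list = field(default_factory=list)  # stack of generated entities
    delta: list = field(default_factory=list)  # temporary stack
    e: list = field(default_factory=list)      # partial entity block in progress
    beta: list = field(default_factory=list)   # buffer of unprocessed words
    E: set = field(default_factory=set)        # extracted entity set
    R: set = field(default_factory=set)        # extracted relation set

    def is_final(self):
        # final state: e and beta are both empty; E and R hold the results
        return not self.e and not self.beta

s = State(beta=list("颐和园在北京"))
```

A freshly constructed empty state is trivially final, while any state with unprocessed buffer content is not.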
The state transition actions for entity extraction in step 4 of this patent include the following three:
1) deleting
Transition condition: j ∉ E and e = []
The state before transition: ([σ|i], δ, e, [j|β], E, R)
The state after transition: ([σ|i], δ, e, β, E, R)
When this transition is selected and executed, the currently processed word j is not in the entity set E and the partial entity block e is an empty stack, indicating that j is not target information to be extracted; the word j is deleted from the buffer β;
2) Transfer
Transition condition: j ∉ E
The state before transition: ([σ|i], δ, e, [j|β], E, R)
The state after transition: ([σ|i], δ, [j|e], β, E, R)
When this transition is selected and executed, the currently processed word j is not in the entity set E but is selected for further processing; j is transferred from the buffer β to the partial entity block e;
3) entity identification
Transition condition: j ∉ E and e ≠ []
The state before transition: ([σ|i], δ, [j|e], β, E, R)
The state after transition: ([σ|i], δ, [], [j|β], E ∪ {j}, R)
When this transition is selected and executed, the currently processed word j is not in the entity set E and the partial entity block e is not an empty stack; j is marked as an entity and moved back to the buffer β, and the new entity j is merged into the entity set E.
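The three entity actions can be sketched as operations on a dict-based state. The conditions are the ones stated above; the demo sentence and the asserts are invented, and the real system would choose among these actions by the learned probabilities:

```python
def new_state(words):
    return {"sigma": [], "delta": [], "e": [], "beta": list(words),
            "E": set(), "R": set()}

def delete(s):                 # 1) j is not an entity word: drop it from beta
    assert s["beta"][0] not in s["E"] and not s["e"]
    s["beta"].pop(0)

def transfer(s):               # 2) push j onto the front of the partial block e
    assert s["beta"][0] not in s["E"]
    s["e"].insert(0, s["beta"].pop(0))

def entity_gen(s):             # 3) close block e as a new entity j
    assert s["e"]
    j = " ".join(reversed(s["e"]))   # restore original word order
    s["e"].clear()
    s["E"].add(j)
    s["beta"].insert(0, j)           # the marked entity returns to the buffer

s = new_state(["the", "Summer", "Palace"])
delete(s)                      # "the" is not target information
transfer(s); transfer(s)       # block e now holds "Summer", "Palace"
entity_gen(s)                  # entity "Summer Palace" joins E and beta
```

After entity identification the recognized entity sits back at the head of the buffer, which is what allows the relation actions of the next section to operate on it.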
The state transition actions for relation extraction in step 4 of this patent include the following seven:
1) extracting left-hand relation and popping the end-point entity
Judging conversion conditions:
Figure BDA0002826558410000182
Figure BDA0002826558410000183
The state before transition: ([σ|i], δ, e, [j|β], E, R)
The state after transition: (σ, δ, e, [j|β], E, R ∪ {(j, r, i)})
When this transition is selected and executed, a left-hand relation has been found; the relation (j, r, i) is merged into the relation set R, and the relation end-point entity i is popped from the generated entity stack σ;
2) extracting right-hand relationships and transferring end-point entities
Judging conversion conditions:
Figure BDA0002826558410000185
Figure BDA0002826558410000186
The state before transition: ([σ|i], δ, e, [j|β], E, R)
The state after transition: ([σ|i|j], δ, e, β, E, R ∪ {(i, r, j)})
When this transition is selected and executed, a right-hand relation has been found; the relation (i, r, j) is merged into the relation set R, and the relation end-point entity j is transferred onto the generated entity stack σ;
3) No relation extracted; transfer the entity
Judging conversion conditions:
Figure BDA0002826558410000191
The state before transition: ([σ|i], δ, e, [j|β], E, R)
The state after transition: ([σ|i|δ|j], [], e, β, E, R)
When this transition is selected and executed, no relation is extracted; the entities in the temporary stack δ are pushed back onto σ, and the entity j is transferred onto the generated entity stack σ;
4) No relation extracted; pop the entity from the stack
Judging conversion conditions:
Figure BDA0002826558410000192
The state before transition: ([σ|i], δ, e, [j|β], E, R)
The state after transition: (σ, δ, e, [j|β], E, R)
When this transition is selected and executed, no relation is extracted; the entity i is popped from the generated entity stack σ;
5) extracting left-hand relation and putting end-point entity into temporary stack
Judging conversion conditions:
Figure BDA0002826558410000193
The state before transition: ([σ|i], δ, e, [j|β], E, R)
The state after transition: (σ, [i|δ], e, [j|β], E, R ∪ {(j, r, i)})
When this transition is selected and executed, a left-hand relation has been found; the relation (j, r, i) is merged into the relation set R, and the relation end-point entity i is popped from the generated entity stack σ and then pushed onto the temporary stack δ;
6) Extracting a right-hand relation and putting the start-point entity into the temporary stack
Judging conversion conditions:
Figure BDA0002826558410000201
The state before transition: ([σ|i], δ, e, [j|β], E, R)
The state after transition: (σ, [i|δ], e, [j|β], E, R ∪ {(i, r, j)})
When this transition is selected and executed, a right-hand relation has been found; the relation (i, r, j) is merged into the relation set R, and the relation start-point entity i is popped from the generated entity stack σ and then pushed onto the temporary stack δ;
7) No relation extracted; put the entity into the temporary stack
Transition condition: none
The state before transition: ([σ|i], δ, e, [j|β], E, R)
The state after transition: (σ, [i|δ], e, [j|β], E, R)
When this transition is selected and executed, the entity i is directly popped from the generated entity stack σ and then pushed onto the temporary stack δ.
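Two of the seven relation actions, the left-hand relation (pop the end-point i from σ) and the right-hand relation (push the end-point j onto σ), can be sketched as follows. The relation label, the demo entities, and the hand-picked action sequence are invented; the real method predicts both the action and the label:

```python
def new_state(entities):
    # demo starts after entity extraction: the buffer holds recognized entities
    return {"sigma": [], "delta": [], "e": [], "beta": list(entities),
            "E": set(entities), "R": set()}

def shift(s):                        # no relation: move j onto sigma
    s["sigma"].append(s["beta"].pop(0))

def left_relation(s, label):         # relation (j, r, i): pop end point i
    i, j = s["sigma"][-1], s["beta"][0]
    s["R"].add((j, label, i))
    s["sigma"].pop()

def right_relation(s, label):        # relation (i, r, j): push end point j
    i, j = s["sigma"][-1], s["beta"][0]
    s["R"].add((i, label, j))
    s["sigma"].append(s["beta"].pop(0))

s = new_state(["拙政园", "苏州"])
shift(s)                             # sigma: [拙政园]
right_relation(s, "位于")            # extracts (拙政园, 位于, 苏州)
```

The temporary stack δ (actions 5 through 7) exists so that an entity can participate in further relations with later buffer items before being discarded; it is omitted here for brevity.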
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention in any other form. Any person skilled in the art may, without departing from the technical spirit of the present invention, modify or change the above embodiment into an equivalent embodiment; any simple modification or equivalent change made to the above embodiment according to the technical spirit of the present invention still falls within the protection scope of the present invention.

Claims (6)

1. A method for extracting Chinese classical garden information comprises the following steps: 1) calculate the word vector embedding sequence from the input; 2) apply Bi-LSTM encoding, i.e. bidirectional long short-term memory encoding, to the sequence; 3) execute state transitions; if the final state is reached, the entity and relation information has been extracted and the process ends, otherwise proceed to the next step according to the probability calculation; 4) select an entity extraction state transition action or a relation extraction state transition action; 5) after execution, return to step 3); the extracted entities and relations are finally obtained.
2. The method for extracting classical garden information in China according to claim 1, wherein the specific method of input calculation in step 1) of the present patent is as follows:
word vector embedding:
for each input token, the vector embedding is calculated as:
x_i = V[w_i ; w̄_i]
where w_i is the learned word vector, w̄_i is a fixed (pretrained) word vector, and V is the matrix applied to the concatenation of the two vectors;
the calculation yields the vector embedding sequence:
x = (x_1, x_2, ......, x_i, ...... x_n).
3. the method for extracting classical garden information in China according to claim 1, wherein the specific method of Bi-LSTM encoding in step 2) of this patent is as follows:
performing Bi-LSTM encoding, i.e. bidirectional long short-term memory encoding, on the sequence x obtained in step 1): first encode forward with an LSTM in the order x_1 to x_n, recording the results →h_t; then encode backward with an LSTM in the order x_n to x_1, recording the results ←h_t.
Each LSTM encoding includes the following six steps:
1) State training:
The current input x_t and the hidden state h_{t-1} passed from the previous step are concatenated and trained to yield four states. Three of them are gating states z_f, z_i, z_o: the concatenated vector is multiplied by a weight matrix and then mapped by a sigmoid activation function to a value between 0 and 1 that serves as a gating state. The remaining state z is not a gating signal; it is obtained by mapping the result through a tanh activation function to a value between -1 and 1;
2) Forgetting:
The state z_f serves as the forget gate and is applied by element-wise (Hadamard) multiplication to the long-term memory c_{t-1} of the previous step, controlling which parts are kept (important) and which are forgotten (unimportant), calculated as:
z_f ⊙ c_{t-1}
3) Selective memory:
The state z_i serves as the gating signal and is applied by element-wise multiplication to the state z, selectively memorizing the input x_t; important content is recorded strongly and unimportant content weakly, calculated as:
z_i ⊙ z
4) Calculating long-term memory:
The results of the previous two steps are added element-wise to obtain the long-term memory c_t passed to the next step, calculated as:
c_t = z_f ⊙ c_{t-1} + z_i ⊙ z
5) Calculating short-term memory:
The state z_o gates the result of applying the tanh activation function to the c_t just obtained, yielding the short-term memory h_t, calculated as:
h_t = z_o ⊙ tanh(c_t)
6) Output:
Finally the output y_t is obtained from h_t, calculated as:
y_t = σ(W' h_t)
The forward LSTM encoding results are recorded as →h_t, and the backward LSTM encoding results are recorded as ←h_t. The two results are concatenated as
h_t = [→h_t ; ←h_t]
which is the Bi-LSTM encoding result.
4. The method for extracting classical garden information in China according to claim 1, wherein the specific method of state transition in step 3) of the present patent is as follows:
defining a six-tuple (σ, δ, e, β, E, R) representing the state at each moment, wherein σ is a stack storing generated entities, δ is a stack storing entities that are temporarily popped from σ and later pushed back, e stores the partial entity block being processed, β is a buffer containing the unprocessed words, E stores the set of generated entities, and R stores the set of generated relations;
the information extraction task can then be expressed as the state transition process from the initial state ([], [], [], w, ∅, ∅) to the final state (σ, δ, [], [], E, R), where [] denotes an empty stack and ∅ denotes an empty set;
for the state at time t:
m_t = max{0, W[s_t; b_t; p_t; e_t; a_t] + d}
the probability of each candidate action is calculated by:
p(z_t | m_t) = exp(g_{z_t}ᵀ m_t + q_{z_t}) / Σ_{z′} exp(g_{z′}ᵀ m_t + q_{z′})
to predict the state transition action selected at time t; according to the prediction result, go to step 4) or step 5), execute one state transition, and return to step 3), until the final state is reached;
given an input w, the probability of any valid state transition action sequence z can be expressed as:
p(z | w) = ∏_t p(z_t | m_t)
and therefore the decoded action sequence is:
z* = argmax_z ∏_t p(z_t | m_t)
when e and β in the state six-tuple are both empty, the final state is reached and the state transition ends; at this moment the extracted entities and relations are stored in the sets E and R respectively, and can be output as the execution result of the algorithm.
5. The method for extracting Chinese classical garden information according to claim 1, wherein the state transition actions for entity extraction in step 4) include the following three:
1) deleting
Transition condition: j ∉ E and e = []
The state before transition: ([σ|i], δ, e, [j|β], E, R)
The state after transition: ([σ|i], δ, e, β, E, R)
When this transition is selected and executed, the currently processed word j is not in the entity set E and the partial entity block e is an empty stack, indicating that j is not target information to be extracted; the word j is deleted from the buffer β;
2) Transfer
Transition condition: j ∉ E
The state before transition: ([σ|i], δ, e, [j|β], E, R)
The state after transition: ([σ|i], δ, [j|e], β, E, R)
When this transition is selected and executed, the currently processed word j is not in the entity set E but is selected for further processing; j is transferred from the buffer β to the partial entity block e;
3) entity identification
Transition condition: j ∉ E and e ≠ []
The state before transition: ([σ|i], δ, [j|e], β, E, R)
The state after transition: ([σ|i], δ, [], [j|β], E ∪ {j}, R)
When this transition is selected and executed, the currently processed word j is not in the entity set E and the partial entity block e is not an empty stack; j is marked as an entity and moved back to the buffer β, and the new entity j is merged into the entity set E.
6. The method for extracting Chinese classical garden information according to claim 1, wherein the state transition actions for relation extraction in step 4) include the following seven:
1) extracting left-hand relation and popping the end-point entity
Judging conversion conditions:
Figure FDA0002826558400000052
Figure FDA0002826558400000053
The state before transition: ([σ|i], δ, e, [j|β], E, R)
The state after transition: (σ, δ, e, [j|β], E, R ∪ {(j, r, i)})
When this transition is selected and executed, a left-hand relation has been found; the relation (j, r, i) is merged into the relation set R, and the relation end-point entity i is popped from the generated entity stack σ;
2) extracting right-hand relationships and transferring end-point entities
Judging conversion conditions:
Figure FDA0002826558400000055
Figure FDA0002826558400000056
The state before transition: ([σ|i], δ, e, [j|β], E, R)
The state after transition: ([σ|i|j], δ, e, β, E, R ∪ {(i, r, j)})
When this transition is selected and executed, a right-hand relation has been found; the relation (i, r, j) is merged into the relation set R, and the relation end-point entity j is transferred onto the generated entity stack σ;
3) No relation extracted; transfer the entity
Judging conversion conditions:
Figure FDA0002826558400000062
The state before transition: ([σ|i], δ, e, [j|β], E, R)
The state after transition: ([σ|i|δ|j], [], e, β, E, R)
When this transition is selected and executed, no relation is extracted; the entities in the temporary stack δ are pushed back onto σ, and the entity j is transferred onto the generated entity stack σ;
4) No relation extracted; pop the entity from the stack
Judging conversion conditions:
Figure FDA0002826558400000063
The state before transition: ([σ|i], δ, e, [j|β], E, R)
The state after transition: (σ, δ, e, [j|β], E, R)
When this transition is selected and executed, no relation is extracted; the entity i is popped from the generated entity stack σ;
5) extracting left-hand relation and putting end-point entity into temporary stack
Judging conversion conditions:
Figure FDA0002826558400000065
The state before transition: ([σ|i], δ, e, [j|β], E, R)
The state after transition: (σ, [i|δ], e, [j|β], E, R ∪ {(j, r, i)})
When this transition is selected and executed, a left-hand relation has been found; the relation (j, r, i) is merged into the relation set R, and the relation end-point entity i is popped from the generated entity stack σ and then pushed onto the temporary stack δ;
6) Extracting a right-hand relation and putting the start-point entity into the temporary stack
Judging conversion conditions:
Figure FDA0002826558400000071
The state before transition: ([σ|i], δ, e, [j|β], E, R)
The state after transition: (σ, [i|δ], e, [j|β], E, R ∪ {(i, r, j)})
When this transition is selected and executed, a right-hand relation has been found; the relation (i, r, j) is merged into the relation set R, and the relation start-point entity i is popped from the generated entity stack σ and then pushed onto the temporary stack δ;
7) No relation extracted; put the entity into the temporary stack
Transition condition: none
The state before transition: ([σ|i], δ, e, [j|β], E, R)
The state after transition: (σ, [i|δ], e, [j|β], E, R)
When this transition is selected and executed, the entity i is directly popped from the generated entity stack σ and then pushed onto the temporary stack δ.
CN202011450290.1A 2020-12-09 2020-12-09 Method for extracting Chinese classical garden information Pending CN112463988A (en)
