WO2022022045A1 - Knowledge graph-based text comparison method, apparatus, device and storage medium - Google Patents

Knowledge graph-based text comparison method, apparatus, device and storage medium

Info

Publication number
WO2022022045A1
WO2022022045A1, PCT/CN2021/096862, CN2021096862W
Authority
WO
WIPO (PCT)
Prior art keywords
target
text
relationship
entities
entity
Prior art date
Application number
PCT/CN2021/096862
Other languages
English (en)
French (fr)
Inventor
朱昱锦 (Zhu Yujin)
徐国强 (Xu Guoqiang)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Publication of WO2022022045A1

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • the present application relates to the field of big data technologies, and in particular, to a text comparison method, apparatus, device and storage medium based on knowledge graphs.
  • Text content comparison technology is widely used in both vertical and general fields. For example, in insurance, banking, investment and other financial text processing scenarios involving incoming document review or risk monitoring, it is necessary to compare multiple documents and check whether there are any contradictions in the information provided by different documents to achieve the purpose of review.
  • the existing text comparison technology uses automatic abstract generation technology to split the text, then generates an abstract for each segment of the split, and finally compares the abstracts of the two articles to determine whether their main content expresses a consistent meaning, thereby judging whether the two articles belong to the same text.
  • This method will perform semantic extraction on the text content, which is conducive to refining the text and improving the efficiency of text element comparison.
  • the inventor realized that in the process of text refining, some useful text information is inevitably lost, which biases the comparison results. There is therefore an urgent need for a method that can improve the accuracy of text comparison.
  • the purpose of the embodiments of the present application is to propose a text comparison method based on knowledge graph, so as to improve the accuracy of text comparison.
  • the embodiments of the present application provide a text comparison method based on knowledge graph, including:
  • extracting the relationship between any two adjacent target entities, judging whether an association exists between any two target entities, taking any two target entities that have an association as associated entities, and taking the association corresponding to the associated entities as the target relationship;
  • if the coverage ratio exceeds a preset threshold, determining that the texts to be compared are of the same type.
  • a technical solution adopted in this application is to provide a text comparison device based on knowledge graph, including:
  • a training text acquisition module used for collecting training corpus in a preset field, and performing text preprocessing on the training corpus to obtain training text;
  • a target entity acquisition module used for part-of-speech tagging on the training text, and extracting entities in the training text according to the method of dependency syntax analysis, as target entities;
  • the target relationship acquisition module is used to extract the relationship between any two adjacent target entities through the trained relation extraction model combined with the training text, judge whether an association exists between any two target entities, take any two target entities that have an association as associated entities, and take the association corresponding to the associated entities as the target relationship;
  • an initial graph construction module used to construct and generate an initial graph with the target entity as a node and the target relationship as an edge
  • the target graph construction module is used to mark the target entity and target relationship of the initial graph, take the marked target entity and target relationship as core information, and cluster the nodes of the initial graph according to the core information to obtain the target graph;
  • the core information comparison module is used to obtain the text to be compared, input the text to be compared into the target graph, and count the coverage ratio of the entities and relationships extracted from each text to be compared against the core information in the target graph;
  • the same text judgment module is used to determine that the text to be compared is the same type of text if the coverage ratio exceeds a preset threshold.
  • an embodiment of the present application further provides a computer device, including at least one processor; and,
  • the memory stores computer-readable instructions, and the processor implements the following steps when executing the computer-readable instructions:
  • extracting the relationship between any two adjacent target entities, judging whether an association exists between any two target entities, taking any two target entities that have an association as associated entities, and taking the association corresponding to the associated entities as the target relationship;
  • if the coverage ratio exceeds a preset threshold, determining that the texts to be compared are of the same type.
  • an embodiment of the present application further provides a computer-readable storage medium, where computer-readable instructions are stored on the computer-readable storage medium, and when the computer-readable instructions are executed by a processor, the processor performs the following steps:
  • extracting the relationship between any two adjacent target entities, judging whether an association exists between any two target entities, taking any two target entities that have an association as associated entities, and taking the association corresponding to the associated entities as the target relationship;
  • if the coverage ratio exceeds a preset threshold, determining that the texts to be compared are of the same type.
  • the knowledge graph-based text comparison method in the above scheme extracts entities and the relationships between entities from the text to construct a graph, and then compares the graphs to identify text similarity. This refines the comparison object, avoids interference items in the original text, and is unaffected by the text format, which improves the accuracy of text comparison.
  • FIG. 1 is a schematic diagram of an application environment of a text comparison method based on a knowledge graph provided by an embodiment of the present application
  • Fig. 2 is a realization flow chart of the text comparison method based on knowledge graph provided according to the embodiment of the present application;
  • FIG. 3 is a flow chart of an implementation of step S2 in the text comparison method based on knowledge graph provided by the embodiment of the present application;
  • Fig. 4 is a realization flow chart after step S24 in the text comparison method based on knowledge graph provided by the embodiment of the present application;
  • Fig. 5 is a flow chart of an implementation of step S3 in the text comparison method based on knowledge graph provided by the embodiment of the present application;
  • Fig. 6 is a flow chart of an implementation of step S5 in the text comparison method based on knowledge graph provided by the embodiment of the present application;
  • FIG. 7 is a schematic diagram of a text comparison device based on a knowledge graph provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a computer device provided by an embodiment of the present application.
  • the system architecture 100 may include terminal devices 101 , 102 , and 103 , a network 104 and a server 105 .
  • the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications may be installed on the terminal devices 101 , 102 and 103 , such as web browser applications, search applications, instant communication tools, and the like.
  • the terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like.
  • the server 105 may be a server that provides various services, such as a background server that provides support for the pages displayed on the terminal devices 101 , 102 , and 103 .
  • a knowledge graph-based text comparison method provided by the embodiments of the present application is generally executed by a server, and accordingly, a knowledge graph-based text comparison apparatus is generally set in the server.
  • terminal devices, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks and servers according to implementation needs.
  • FIG. 2 shows a specific implementation of a text comparison method based on knowledge graph.
  • the method of the present application is not limited to the flow sequence shown in FIG. 2, and the method includes the following steps:
  • S1 Collect training corpus in a preset field, and perform text preprocessing on the training corpus to obtain training text.
  • the text preprocessing includes data cleaning of the text, etc., so as to keep the text data consistent.
  • the training corpus in the preset field is selected according to the actual need to compare the text, which is not limited here.
  • the training corpus refers to the Chinese sentence pairs or question banks used for training;
  • the training corpus in the preset field refers to the Chinese sentence pairs or question banks in the required field, which are used as the training corpus of the preset field. For example, if the texts of a certain engineering project need to be compared, the training corpus in the preset field is the text of that engineering project.
  • S2 Tag the training text by part of speech, and extract the entities in the training text according to the method of dependency syntax analysis as the target entity.
  • the nouns and pronouns in the training text are obtained by tagging the training text, and the entities in the training text are extracted according to the method of dependency syntax analysis and used as the target entity.
  • the nouns and pronouns in the training text are extracted with the pyltp and HanLP open-source libraries, through part-of-speech tagging and dependency syntax analysis.
  • pyltp and HanLP are basic natural language processing libraries released by Harbin Institute of Technology and hankcs respectively, and are used here for part-of-speech tagging and entity extraction.
  • POS: part-of-speech tagging
  • DP: dependency parsing
  • n: general noun
  • ni: organization word
  • nl: place word
  • ns: location word
  • nt: time word
  • p: pronoun
  • For example, part-of-speech tagging of "I eat apples" yields (I, p), (apples, n).
  • Dependency syntax analysis uses the subject-predicate-object (SBV) relationship to mark the corresponding words in the sentences of the training text. For example, "I eat apples" is marked as (I, Subject), (eat, Predicate), (apples, Object); the extracted nouns are matched to the subject and object components, and nouns in the sentence that fill neither component are deleted.
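As a minimal illustration of this filtering step, the sketch below assumes the parser's output is available as simple (word, POS tag, syntactic role) triples; the tuples and role names are hypothetical stand-ins for real pyltp/HanLP output, not the libraries' actual API.

```python
# Minimal sketch: keep only nouns/pronouns that fill the subject or
# object slot of a dependency parse. The triples below stand in for
# the output of a real dependency parser; names are illustrative.

def extract_entities(parsed_tokens):
    """Return nouns/pronouns acting as subject or object."""
    kept_roles = {"Subject", "Object"}
    noun_tags = {"n", "ni", "nl", "ns", "nt", "p"}  # tag set listed in the text
    return [w for w, pos, role in parsed_tokens
            if pos in noun_tags and role in kept_roles]

# "I eat apples" -> (I, Subject), (eat, Predicate), (apples, Object)
tokens = [("I", "p", "Subject"), ("eat", "v", "Predicate"), ("apples", "n", "Object")]
entities = extract_entities(tokens)  # the verb is dropped
```

The predicate is discarded because only subject and object nouns become target entities.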
  • dependency syntax analysis was first proposed by the French linguist L.Tesniere. It analyzes the sentence into a dependency syntax tree, and describes the dependency relationship between each word, that is, it points out the syntactic collocation between words, which is related to semantics.
  • entities in the training text are extracted by means of dependency syntax analysis.
  • S3 Extract the relationship between any two adjacent target entities through the trained relation extraction model combined with the training text, judge whether an association exists between any two target entities, take any two target entities that have an association as associated entities, and take the association corresponding to the associated entities as the target relationship.
  • the relationship between each two target entities is either the existence or the absence of an association between them.
  • the target relationship is an association relationship between two entities
  • the association relationship refers to the state of interaction and mutual influence between the two entities in the text.
  • the relation extraction model includes four parts: Embedding, Encoding, Selector and Classifier. (1) Embedding performs word embedding and position embedding on the input training text to generate a vector, which serves as the input of the entire model; (2) the Encoding layer is composed of a Piecewise-CNN (PCNN): the input context is divided into three segments by the current two target entities, and the PCNN extracts a feature vector from each of the three segments and splices them together; (3) the Selector is the attention layer, which assigns different weights to the feature vectors for the subsequent training of the relation extraction model.
  • PCNN: Piecewise-CNN
  • Classifier is an ordinary multi-classification layer, which outputs the probability that the two input target entities are related to each other.
  • the model is trained with binary-labeled data (with/without relationship), and the relationship between each two target entities is output.
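The piecewise pooling performed by the Encoding layer can be sketched as follows. This is an illustrative reconstruction of generic PCNN pooling, not the patent's exact implementation: the convolution output is split into three segments at the two entity positions, each segment is max-pooled per channel, and the results are spliced.

```python
# Sketch of PCNN piecewise max-pooling: split the per-position feature
# rows into three segments around the two entity positions, max-pool
# each segment channel-wise, then concatenate. Assumes neither entity
# sits at the very end of the sequence (all segments non-empty).

def piecewise_max_pool(conv_out, e1_pos, e2_pos):
    """conv_out: list of per-position channel rows; e1_pos < e2_pos."""
    segments = [conv_out[:e1_pos + 1],
                conv_out[e1_pos + 1:e2_pos + 1],
                conv_out[e2_pos + 1:]]
    pooled = []
    for seg in segments:
        pooled.extend(max(col) for col in zip(*seg))  # channel-wise max
    return pooled  # length = 3 * number of channels

conv_out = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11]]  # 6 positions, 2 channels
vec = piecewise_max_pool(conv_out, e1_pos=1, e2_pos=3)
```

The spliced vector is what the Selector layer then weights.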
  • S4 Use the target entity as a node and the target relationship as an edge to construct and generate an initial graph.
  • an initial graph is generated from the entities in the training text and the relationships between them, so as to facilitate the subsequent comparison of the texts to be compared through graph comparison, improving the accuracy and efficiency of text comparison.
  • S5 Label the target entity and target relationship of the initial graph, take the labeled target entity and target relationship as the core information, and cluster the nodes of the initial graph according to the core information to obtain the target graph.
  • by labeling the target entities and target relationships in the initial graph and clustering the nodes of the initial graph according to the labeled target entities and target relationships, redundant target entities and target relationships in the initial graph are reduced, and the target graph is finally obtained.
  • the labeling method adopted is the consistent labeling method.
  • the consistent labeling method is a method of labeling the entities of the graph and the relationships between entities according to unified rules or methods.
  • Consistent labeling methods include but are not limited to: labeling methods based on historical data and experience, randomly selecting labeling methods, etc.
  • the annotation is performed according to the annotation method based on historical data and experience; through this method, the best entities and inter-entity relationships are selected for annotation based on previous data and experience, which helps improve the accuracy of the graph with respect to entities and their relationships.
  • S6 Obtain the text to be compared, input the text to be compared into the target graph, and count the coverage ratio of the entities and relationships extracted from each text to be compared to the core information in the target graph.
  • the texts to be compared are input into the target graph in turn, and the coverage ratio of the entities and relationships extracted from each text to be compared against the core information in the target graph is counted; the subsequent steps then determine whether the text to be compared and the training text are the same type of text.
  • the coverage ratio is the ratio of the overlap between the entities and relationships extracted from the text to be compared and the nodes and edges of the core information.
  • the preset threshold is set according to the actual situation, and is not limited here.
  • a preferable preset threshold is 75%, and under this threshold, it can be clearly seen that there is little difference between the contents of the compared texts.
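The coverage check of steps S6 and S7 can be sketched as follows; the function, variable names, and sample data are illustrative assumptions, not the patent's implementation.

```python
# Sketch: count how many core entities and relations of the target graph
# are hit by the (entity A, relation, entity B) triples extracted from a
# text to be compared, and compare against the 75% threshold.

def coverage_ratio(extracted_triples, core_entities, core_relations):
    """Fraction of core graph items covered by the extracted triples."""
    seen_entities = {e for a, _, b in extracted_triples for e in (a, b)}
    seen_relations = {r for _, r, _ in extracted_triples}
    core = core_entities | core_relations
    hit = (seen_entities & core_entities) | (seen_relations & core_relations)
    return len(hit) / len(core) if core else 0.0

core_e = {"PartyA", "PartyB", "Contract"}   # hypothetical core entities
core_r = {"signs", "pays"}                  # hypothetical core relations
triples = [("PartyA", "signs", "Contract"), ("PartyA", "pays", "PartyB")]
ratio = coverage_ratio(triples, core_e, core_r)
same_type = ratio > 0.75  # preferred threshold from the text
```

Here every core item is covered, so the text would be judged the same type.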
  • If the coverage ratio exceeds the preset threshold, the two or more texts are of the same type.
  • In this way, the comparison object is refined, interference items in the original text are avoided, and the result is unaffected by the text format, which improves the accuracy of text comparison.
  • FIG. 3 shows a specific implementation of step S2.
  • In step S2, part-of-speech tagging is performed on the training text, and entities in the training text are extracted according to the method of dependency syntax analysis as the target entities.
  • the specific implementation process is described in detail as follows:
  • S21 Obtain the text delimiters contained in the training text, which are used to segment the text in subsequent steps.
  • the text delimiters include format delimiters and punctuation delimiters.
  • the format delimiter refers to a delimiter determined by the text encoding type or the text structure. Through the format delimiter, the training text can be divided according to the encoding type or the structure of the text, obtaining short sentences of the same encoding type or structured text, which benefits the subsequent acquisition of the target entity.
  • the punctuation delimiter refers to a delimiter that divides the text according to punctuation characters. Through the punctuation delimiter, the training text can be divided quickly, improving the efficiency of obtaining short text sentences.
  • S22 Perform text segmentation on the training text through the text separator to obtain short text sentences.
  • the text segments are spliced into short text sentences according to the preset length; in subsequent steps, part-of-speech tagging and entity extraction are performed on these spliced sentences, improving the efficiency of both.
  • the preset length is set according to the actual length, which is not limited here.
  • a preferred preset length is 300 words, and a spliced sentence is formed from 1-5 segments after sentence segmentation.
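Steps S21-S22 can be sketched as below; the punctuation set in the regex and the toy maximum length are illustrative assumptions (the text's preferred length is 300).

```python
import re

# Sketch: split the training text on punctuation delimiters, then splice
# adjacent fragments into short sentences no longer than a preset length.
# The regex covers a few Chinese and Western sentence-ending marks.

def split_and_splice(text, max_len=300):
    fragments = [f for f in re.split(r"[。！？；.!?;]", text) if f]
    sentences, buf = [], ""
    for frag in fragments:
        if buf and len(buf) + len(frag) > max_len:
            sentences.append(buf)  # current splice is full
            buf = frag
        else:
            buf += frag
    if buf:
        sentences.append(buf)
    return sentences

chunks = split_and_splice("abc.def.ghijklmnop.qr", max_len=10)
```

Each returned chunk is one "short text sentence" for the later tagging steps.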
  • S23 Mark the nouns and pronouns in the short text sentence by means of part-of-speech tagging to obtain the marked nouns and pronouns.
  • the consistency rule is to use the subject-verb-object (SBV) relationship and mark the corresponding words. For example, "I eat apples" is marked as (I, Subject), (eat, Predicate), (apples, Object); the extracted nouns are matched to the subject and object components, and nouns in the sentence that fill neither component are deleted.
  • SBV: subject-verb-object
  • the regular-expression matching method is used to obtain the text delimiters contained in the training text, and the training text is segmented by these delimiters to obtain short text sentences.
  • Part-of-speech tagging and entity extraction provide a basis for subsequent graph construction, which is beneficial to improve the accuracy of text comparison.
  • FIG. 4 shows a specific implementation after step S24, including:
  • S25 Determine whether two or more initial entities form a compound word by counting the degree of cohesion of the initial entities in the short text sentence, and obtain a judgment result.
  • the cohesion degree of the initial entities in short text sentences is counted.
  • tf-idf is a statistical method to evaluate the importance of a word to a document set or one of the documents in a corpus.
  • the importance of a word increases proportionally with the number of times it appears in the document, but decreases inversely with its frequency in the corpus.
  • Co-word analysis utilizes the co-occurrence of words and noun phrases in a collection to determine the relationship between topics in the discipline represented by the collection. It is generally believed that the more times a lexical pair occurs in the same document, the closer the relationship between the two topics is.
  • a co-word network composed of the associations of these word pairs can be formed, and the distance between the nodes in the network can reflect the subject content. of intimacy.
  • tf-idf and co-occurrence analysis are used to count the degree of cohesion of the initial entities, and then it is judged whether two or more initial entities constitute a compound word.
  • the cohesion degree refers to the possibility that multiple words form the current phrase slice (ie compound word).
  • Based on the cohesion degree of the initial entities, it is judged whether two or more initial entities constitute a compound word. For example, if a text phrase is ABC, the frequency of ABC is divided by the frequencies of A, B, C, AB, BC, and AC respectively, and the smallest of these ratios is taken as the value for judging whether ABC is a compound word.
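The cohesion computation just described can be sketched directly; the frequency counts and the decision threshold below are illustrative, not values from the patent.

```python
# Sketch of the cohesion degree from step S25: divide the frequency of
# the candidate compound "ABC" by the frequency of each sub-span
# (A, B, C, AB, BC, AC) and take the smallest ratio.

def cohesion(freq, phrase, parts):
    """min over sub-spans of freq(phrase) / freq(part)."""
    return min(freq[phrase] / freq[p] for p in parts)

# Hypothetical corpus counts for a phrase ABC and its sub-spans.
freq = {"ABC": 40, "A": 100, "B": 80, "C": 50, "AB": 60, "BC": 45, "AC": 40}
degree = cohesion(freq, "ABC", ["A", "B", "C", "AB", "BC", "AC"])
is_compound = degree > 0.3  # hypothetical threshold, not from the text
```

A high minimum ratio means the sub-spans rarely occur outside the compound, so ABC is merged into one target entity.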
  • the determination of the target entity is further realized by judging whether two or more initial entities constitute a compound word.
  • the combined entity is an entity obtained by combining two or more initial entities to form a compound word.
  • the initial entities include entities that can form a compound word and entities that do not form a compound word; the initial entities that can form a compound word are combined as the target entity, and the entities that do not form a compound word are also used as the target entity alone.
  • FIG. 5 shows a specific implementation before step S3, including:
  • S31 Obtain sample text, and perform word embedding and position embedding on the sample text to generate an embedding vector.
  • the embedded vector is generated and used as the input of the relation extraction model for subsequent numerical operations.
  • the sample text is used to train the relationship extraction model, and the trained relationship extraction model is obtained, which is convenient for subsequent entity relationship extraction.
  • word embedding is a general term for language model and representation learning technology in natural language processing (NLP).
  • NLP natural language processing
  • word embedding refers to the embedding of a high-dimensional space of the number of all words into a continuous vector space of much lower dimension, and each word or phrase is mapped as a vector on the real number domain.
  • Position embedding, in contrast to word embedding, embeds the different positions of the sample text.
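In PCNN-style relation extraction, position embedding commonly encodes each token's distance to the two target entities; the sketch below shows that offset computation as an assumption about what "embedding different positions" means here, with illustrative names.

```python
# Sketch: for every token, compute its relative offsets to the two
# entities; each offset would then be looked up in a learned embedding
# table and concatenated with the word embedding.

def relative_positions(seq_len, e1_pos, e2_pos):
    """For each token index, its offsets to entity 1 and entity 2."""
    return [(i - e1_pos, i - e2_pos) for i in range(seq_len)]

# "I eat apples": entities at positions 0 ("I") and 2 ("apples")
offsets = relative_positions(3, e1_pos=0, e2_pos=2)
```

The offsets, not the raw indices, are embedded, so the model sees each token's position relative to the entity pair.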
  • S32 Divide the context of the sample text into three texts, and obtain the embedding vectors of the three texts as feature vectors.
  • the entities in the sample text are obtained first; when the sample text is input, its context is divided into three segments by the two entities, and the embedding vectors of the three segments of text are obtained as feature vectors.
  • the feature vector is the hidden-layer state vector output by the hidden layer of the neural network, and is used as an intermediate result of the relation extraction model for the numerical operations of subsequent steps.
  • S33 Splicing feature vectors of the same type to obtain a target vector.
  • feature vectors of the same type are spliced to form a feature vector set, that is, a target vector.
  • different types of feature vector sets have different weights.
  • Selector is selected as the attention layer of relation extraction.
  • the reason for choosing Selector is that the training data used by the relation extraction model is often derived from remote supervision technology, which leads to large data noise.
  • a common method is to put multiple samples that remote supervision marks as the same type into one bag, train the entire bag in the current training batch at the same time, and then select the correct samples in each bag by comparison.
  • Selector can assign different weights to different samples in the same bag, which is essentially a weighting, so Selector is selected.
  • the weight is obtained by calculating the difference between the probability that the current sample is predicted to be true and the probability that it is correct.
  • an embedding vector is generated, and the context of the sample text is divided into three pieces of text, and the embedding vectors of the three pieces of text are obtained.
  • the feature vectors of the same type are spliced to obtain the target vector; finally the weight of the target vector is obtained, and the relation extraction model is trained according to the target vector and its weight, obtaining the trained relation extraction model. This model outputs the relationships between entities in the training text for building the graph, which helps improve the accuracy of text comparison.
  • After step S4, the knowledge graph-based text comparison method further includes:
  • the target entities and target relations extracted from the training text are disambiguated and deduplicated, because the same entity may be expressed in different ways in different texts, or the entities connected by the same relationship may be expressed differently, resulting in entity/relation redundancy.
  • Disambiguation and deduplication are completed with the Python open-source library dedupe. All extracted entities and relationships are substituted into the tool in the form of triples (entity A, relationship, entity B), and dedupe merges entities or relationships with the same meaning through clustering operations.
  • the clustering operation is to select the corresponding target entity and target relationship by aggregating duplicate items, select the optimal threshold through the calculation of the similarity value, and finally obtain the target entity and target relationship with the same meaning.
  • Dedupe is a python open source library for knowledge fusion.
  • the processing flow includes several main steps: entity/relationship description similarity calculation (record similarity), smart comparisons, aggregating duplicates (grouping duplicates), and choosing a good threshold.
  • similarity calculation and smart comparison use active learning combined with rule matching, and duplicate aggregation uses hierarchical clustering with centroid linkage; these three modules are then placed in an active-learning framework, and with a small number of annotations, dedupe determines the optimal threshold.
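A highly simplified stand-in for this deduplication idea is shown below. Real dedupe uses active learning and centroid-linkage clustering; this sketch only illustrates "merge mentions whose string similarity exceeds a threshold", using the standard library's `difflib` instead of dedupe, with hypothetical names and threshold.

```python
from difflib import SequenceMatcher

# Sketch: map each entity mention to a canonical name, merging mentions
# whose (case-insensitive) string similarity reaches the threshold.

def merge_entities(names, threshold=0.8):
    canonical, mapping = [], {}
    for name in names:
        for c in canonical:
            if SequenceMatcher(None, name.lower(), c.lower()).ratio() >= threshold:
                mapping[name] = c  # merge into existing canonical form
                break
        else:
            canonical.append(name)  # first of its kind
            mapping[name] = name
    return mapping

mapping = merge_entities(["Ping An Tech", "ping an tech.", "Bank of China"])
```

Near-duplicate spellings collapse to one node, which is exactly what removes the entity/relation redundancy described above.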
  • FIG. 6 shows a specific implementation of step S5.
  • In step S5, the target entity and target relationship of the initial graph are marked, the marked target entity and target relationship are used as core information, and the nodes of the initial graph are clustered according to the core information to obtain the target graph. The specific implementation process is described in detail as follows:
  • S51 Acquire the text information of the marked target entity and the unmarked target entity in the training text, and obtain the marked text information and the unmarked text information.
  • the labeled text information and the unlabeled text information are obtained by annotating the target entity and the target relationship of the initial graph, and obtaining the text information of the labeled target entity and the unlabeled target entity in the training text.
  • S52 Substitute the labeled text information and the unlabeled text information into the BERT model to obtain the vector, and obtain the labeled vector and the unlabeled vector.
  • the labeled vector is obtained by substituting the labeled text information into the BERT model for vector acquisition
  • the unlabeled vector is obtained by substituting the unlabeled text information into the BERT model for vector acquisition.
  • the calculation of the similarity value includes but is not limited to: Minkowski Distance, Manhattan Distance, Euclidean Distance, Cosine Similarity, Hamming Distance, etc.
  • the preset threshold is set according to the actual situation, which is not limited here.
  • the marked text information and the unmarked text information are obtained, and the marked text information and the unmarked text information are substituted into the BERT
  • the vector acquisition is performed in the model to obtain the labeled vector and the unlabeled vector, and then the similarity value between each unlabeled vector and the labeled vector is calculated.
  • the marked target entity and target relationship are deleted to obtain the target map, which realizes the construction of the target map, which is conducive to comparing the texts to be compared and improving the accuracy of text comparison.
  • the above text to be compared can also be stored in a node of a blockchain.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.
  • the present application provides an embodiment of a text comparison apparatus based on knowledge graph, and the apparatus embodiment corresponds to the method embodiment shown in FIG. 2 .
  • the device can be specifically applied to various electronic devices.
  • the knowledge graph-based text comparison device of this embodiment includes: a training text acquisition module 71, a target entity acquisition module 72, a target relationship acquisition module 73, an initial graph construction module 74, a target graph construction module 75, a core information comparison module 76, and a same-type text judgment module 77, wherein:
  • the training text acquisition module 71 is used for collecting training corpus in a preset field, and performing text preprocessing on the training corpus to obtain training text;
  • the target entity acquisition module 72 is used to perform part-of-speech tagging on the training text, and extract the entities in the training text according to the method of dependency syntax analysis as the target entity;
  • the target relationship acquisition module 73 is used to extract, through the trained relationship extraction model combined with the training text, the relationship between any two adjacent target entities, judge the association between any two target entities, take any two associated target entities as associated entities, and take the association corresponding to the associated entities as the target relationship;
  • the initial graph construction module 74 is used to construct an initial graph with the target entities as nodes and the target relationships as edges;
  • the target map building module 75 is used to mark the target entity and target relationship of the initial map, take the marked target entity and target relationship as core information, and cluster the nodes of the initial map according to the core information to obtain the target map;
  • the core information comparison module 76 is used to obtain the text to be compared, input the text to be compared into the target map, and count the coverage rate of the core information in the target map of the entities and relationships extracted from each text to be compared;
  • the same text judgment module 77 is configured to determine that the texts to be compared are of the same type if the coverage ratio exceeds a preset threshold.
  • the target entity acquisition module 72 includes:
  • the text separator obtaining unit is used to obtain the text separator contained in the training text by means of regular matching;
  • the text short sentence acquisition unit is used to perform text segmentation on the training text through the text separator to obtain text short sentences;
  • the part-of-speech tagging unit is used to tag nouns and pronouns in short text sentences by means of part-of-speech tagging to obtain the tagged nouns and pronouns;
  • the initial entity determination unit is used to correspond the marked nouns and pronouns to the consistency rules according to the method of dependency syntax analysis, and use the marked nouns that conform to the consistency rules as the initial entities.
  • the target entity acquisition module 72 further includes:
  • the cohesion degree statistical unit is used to judge whether two or more initial entities form a compound word by counting the cohesion degree of the initial entities in the short sentences of the text, and obtain the judgment result;
  • the compound word judgment unit is used for combining the initial entities forming the compound word to obtain a combined entity if the result of the determination is that a compound word is formed, and the combined entity is used as a target entity.
  • the above-mentioned knowledge graph-based text comparison device further includes:
  • the sample text acquisition module is used to obtain sample text, perform word embedding and position embedding on the sample text, and generate an embedding vector:
  • the feature vector acquisition module is used to divide the context of the sample text into three texts, and obtain the embedded vectors of the three texts as feature vectors;
  • the target vector acquisition module is used to splicing the feature vectors of the same type to obtain the target vector;
  • the target extraction model training module is used to obtain the weight of the target vector, and train the relation extraction model according to the weight of the target vector and the target vector, so as to obtain a trained relation extraction model.
  • the above-mentioned knowledge graph-based text comparison device further includes:
  • the clustering operation module is used to perform clustering operations on target entities and target relations respectively, and respectively combine target entities with the same meaning and target relations with the same meaning.
  • the target map building module 75 includes:
  • the text information acquisition unit is used to acquire the text information of the marked target entity and the unmarked target entity in the training text, and obtain the marked text information and the unmarked text information;
  • the vector acquisition unit is used to substitute the marked text information and unmarked text information into the BERT model for vector acquisition, and obtain the marked vector and the unmarked vector;
  • the similarity value statistical unit is used to count the similarity value between each unlabeled vector and the labeled vector
  • the similarity value judgment unit is configured to delete the unlabeled target entity and target relationship in the initial map corresponding to the unlabeled vector if the similarity value exceeds a preset threshold, to obtain the target map.
  • the above target data can also be stored in a node of a blockchain.
  • FIG. 8 is a block diagram of a basic structure of a computer device according to this embodiment.
  • the computer device 8 includes a memory 81, a processor 82, and a network interface 83 connected to each other through a system bus. It should be pointed out that the figure only shows the computer device 8 with the three components, memory 81, processor 82, and network interface 83, but it should be understood that implementing all of the shown components is not required; more or fewer components may be implemented instead.
  • the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, and the like.
  • the computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or other computing device.
  • computer devices can interact with users through a keyboard, mouse, remote control, touchpad, or voice-activated device.
  • the memory 81 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory ( SRAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), magnetic memory, magnetic disk, optical disk, etc.
  • the memory 81 may be an internal storage unit of the computer device 8 , such as a hard disk or memory of the computer device 8 .
  • the memory 81 may also be an external storage device of the computer device 8, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
  • the memory 81 may also include both the internal storage unit of the computer device 8 and its external storage device.
  • the memory 81 is generally used to store the operating system and various application software installed on the computer device 8 , such as computer-readable instructions for a text comparison method based on a knowledge graph, and the like.
  • the memory 81 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 82 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments.
  • the processor 82 is typically used to control the overall operation of the computer device 8 .
  • the processor 82 is configured to execute computer-readable instructions stored in the memory 81 or process data, for example, computer-readable instructions for executing a text comparison method based on a knowledge graph.
  • the network interface 83 may comprise a wireless network interface or a wired network interface, and the network interface 83 is typically used to establish a communication connection between the computer device 8 and other electronic devices.
  • the present application also provides another embodiment: a computer-readable storage medium storing computer-readable instructions, where the computer-readable instructions can be executed by at least one processor to cause the at least one processor to execute the steps of the knowledge graph-based text comparison method as described above.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, or of course by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application, in essence or the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, or CD-ROM) and includes several instructions to cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the methods of the various embodiments of the present application.


Abstract

A knowledge-graph-based text comparison method, relating to big data technology. The method includes: acquiring training text, and identifying target entities and target relationships in the training text; generating a graph with the target entities as nodes and the target relationships as edges, and taking the graph as an initial graph; annotating the target entities and target relationships of the initial graph, and clustering the nodes of the initial graph according to the annotated target entities and target relationships to obtain a target graph; acquiring texts to be compared, inputting the texts to be compared into the target graph, and counting, for each text to be compared, the coverage of the core information in the target graph by the entities and relationships extracted from that text; and if the coverage exceeds a preset threshold, judging the texts to be compared to be of the same type. Blockchain technology is also involved: the texts to be compared are stored in a blockchain. Through comparison in graph form, the method improves the accuracy and efficiency of text comparison.

Description

Knowledge-graph-based text comparison method, apparatus, device, and storage medium
This application claims priority to Chinese patent application No. 202010734571.3, filed with the Chinese Patent Office on July 27, 2020 and entitled "Knowledge-graph-based text comparison method, apparatus, device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of big data technology, and in particular to a knowledge-graph-based text comparison method, apparatus, device, and storage medium.
Background
Text content comparison techniques are widely used in both vertical and general domains. For example, in financial text processing scenarios involving application review or risk monitoring, such as insurance, banking, and investment, review is performed by comparing multiple documents and checking whether the information provided by different documents is contradictory.
Existing text comparison techniques use automatic summarization: a text is split into segments, a summary is generated for each segment, and the summaries of two articles are finally compared to judge whether their main contents express the same meaning, and hence whether the two articles are of the same type. This approach performs semantic extraction on the text content, which helps condense the text and improves the efficiency of comparing text elements. However, the inventors realized that during condensation some useful textual information is inevitably lost, biasing the comparison result. A method that can improve the accuracy of text comparison is therefore urgently needed.
Summary
An object of the embodiments of this application is to provide a knowledge-graph-based text comparison method to improve the accuracy of text comparison.
To solve the above technical problem, an embodiment of this application provides a knowledge-graph-based text comparison method, including:
collecting training corpus of a preset field, and performing text preprocessing on the training corpus to obtain training text;
performing part-of-speech tagging on the training text, and extracting entities from the training text by means of dependency syntactic parsing as target entities;
extracting, through a trained relationship extraction model combined with the training text, the relationship between any two adjacent target entities, judging the association between any two target entities, taking any two target entities between which an association exists as associated entities, and taking the association corresponding to the associated entities as a target relationship;
constructing an initial graph with the target entities as nodes and the target relationships as edges;
annotating the target entities and target relationships of the initial graph, taking the annotated target entities and target relationships as core information, and clustering the nodes of the initial graph according to the core information to obtain a target graph;
acquiring texts to be compared, inputting the texts to be compared into the target graph, and counting, for each text to be compared, the coverage of the core information in the target graph by the entities and relationships extracted from that text; and
if the coverage exceeds a preset threshold, determining that the texts to be compared are of the same type.
To solve the above technical problem, a technical solution adopted by this application is to provide a knowledge-graph-based text comparison apparatus, including:
a training text acquisition module, configured to collect training corpus of a preset field and perform text preprocessing on the training corpus to obtain training text;
a target entity acquisition module, configured to perform part-of-speech tagging on the training text and extract entities from the training text by means of dependency syntactic parsing as target entities;
a target relationship acquisition module, configured to extract, through a trained relationship extraction model combined with the training text, the relationship between any two adjacent target entities, judge the association between any two target entities, take any two associated target entities as associated entities, and take the association corresponding to the associated entities as a target relationship;
an initial graph construction module, configured to construct an initial graph with the target entities as nodes and the target relationships as edges;
a target graph construction module, configured to annotate the target entities and target relationships of the initial graph, take the annotated target entities and target relationships as core information, and cluster the nodes of the initial graph according to the core information to obtain a target graph;
a core information comparison module, configured to acquire texts to be compared, input the texts to be compared into the target graph, and count, for each text to be compared, the coverage of the core information in the target graph by the entities and relationships extracted from that text; and
a same-type text judgment module, configured to determine that the texts to be compared are of the same type if the coverage exceeds a preset threshold.
To solve the above problem, an embodiment of this application further provides a computer device, including at least one processor; and
a memory communicatively connected to the at least one processor, wherein
the memory stores computer-readable instructions, and the processor implements the following steps when executing the computer-readable instructions:
collecting training corpus of a preset field, and performing text preprocessing on the training corpus to obtain training text;
performing part-of-speech tagging on the training text, and extracting entities from the training text by means of dependency syntactic parsing as target entities;
extracting, through a trained relationship extraction model combined with the training text, the relationship between any two adjacent target entities, judging the association between any two target entities, taking any two associated target entities as associated entities, and taking the association corresponding to the associated entities as a target relationship;
constructing an initial graph with the target entities as nodes and the target relationships as edges;
annotating the target entities and target relationships of the initial graph, taking the annotated target entities and target relationships as core information, and clustering the nodes of the initial graph according to the core information to obtain a target graph;
acquiring texts to be compared, inputting the texts to be compared into the target graph, and counting, for each text to be compared, the coverage of the core information in the target graph by the entities and relationships extracted from that text; and
if the coverage exceeds a preset threshold, determining that the texts to be compared are of the same type.
To solve the above problem, an embodiment of this application further provides a computer-readable storage medium storing computer-readable instructions which, when executed by a processor, cause the processor to execute the following steps:
collecting training corpus of a preset field, and performing text preprocessing on the training corpus to obtain training text;
performing part-of-speech tagging on the training text, and extracting entities from the training text by means of dependency syntactic parsing as target entities;
extracting, through a trained relationship extraction model combined with the training text, the relationship between any two adjacent target entities, judging the association between any two target entities, taking any two associated target entities as associated entities, and taking the association corresponding to the associated entities as a target relationship;
constructing an initial graph with the target entities as nodes and the target relationships as edges;
annotating the target entities and target relationships of the initial graph, taking the annotated target entities and target relationships as core information, and clustering the nodes of the initial graph according to the core information to obtain a target graph;
acquiring texts to be compared, inputting the texts to be compared into the target graph, and counting, for each text to be compared, the coverage of the core information in the target graph by the entities and relationships extracted from that text; and
if the coverage exceeds a preset threshold, determining that the texts to be compared are of the same type.
In the above solutions, the knowledge-graph-based text comparison method extracts entities and inter-entity relationships from texts, builds them into graphs, and then compares the graphs to measure text similarity. This condenses the objects being compared, avoids distractors in the original texts, is unaffected by text format, and improves the accuracy of text comparison.
Brief Description of the Drawings
To explain the solutions in this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of the application environment of the knowledge-graph-based text comparison method provided by an embodiment of this application;
FIG. 2 is a flowchart of an implementation of the knowledge-graph-based text comparison method provided by an embodiment of this application;
FIG. 3 is a flowchart of an implementation of step S2 in the knowledge-graph-based text comparison method provided by an embodiment of this application;
FIG. 4 is a flowchart of an implementation after step S24 in the knowledge-graph-based text comparison method provided by an embodiment of this application;
FIG. 5 is a flowchart of an implementation of step S3 in the knowledge-graph-based text comparison method provided by an embodiment of this application;
FIG. 6 is a flowchart of an implementation of step S5 in the knowledge-graph-based text comparison method provided by an embodiment of this application;
FIG. 7 is a schematic diagram of the knowledge-graph-based text comparison apparatus provided by an embodiment of this application;
FIG. 8 is a schematic diagram of the computer device provided by an embodiment of this application.
Detailed Description
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used in the specification are only for describing specific embodiments and are not intended to limit this application. The terms "comprising" and "having" and any variations thereof in the specification, claims, and the above drawing descriptions are intended to cover non-exclusive inclusion. The terms "first", "second", and the like in the specification, claims, and drawings are used to distinguish different objects, not to describe a specific order.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of this application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor are they separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
To enable those skilled in the art to better understand the solutions of this application, the technical solutions in the embodiments of this application are described below clearly and completely with reference to the drawings.
This application is described in detail below with reference to the drawings and embodiments.
Referring to FIG. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is the medium providing communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types, such as wired or wireless communication links or fiber-optic cables.
Users may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages. Various communication client applications, such as web browser applications, search applications, and instant messaging tools, may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices with a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, for example a backend server that supports the pages displayed on the terminal devices 101, 102, 103.
It should be noted that the knowledge-graph-based text comparison method provided by the embodiments of this application is generally executed by the server; accordingly, the knowledge-graph-based text comparison apparatus is generally provided in the server.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative; any number of terminal devices, networks, and servers may be provided as required.
Referring to FIG. 2, FIG. 2 shows a specific implementation of the knowledge-graph-based text comparison method.
It should be noted that, provided the results are substantially the same, the method of this application is not limited to the sequence of the flow shown in FIG. 2. The method includes the following steps:
S1: Collect training corpus of a preset field, and perform text preprocessing on the training corpus to obtain training text.
Specifically, text preprocessing includes data cleaning and the like, so that the text data remain consistent.
The training corpus of the preset field is selected according to the texts actually to be compared, which is not limited here. Training corpus refers to the Chinese sentence pairs or question banks used for training; the training corpus of a preset field refers to the Chinese sentence pairs or question banks of the required field, which are taken as the training corpus of the preset field. For example, if the texts of a certain engineering project need to be compared, the training corpus of the preset field is the texts of that engineering project.
S2: Perform part-of-speech tagging on the training text, and extract entities from the training text by means of dependency syntactic parsing as target entities.
Specifically, the nouns and pronouns in the training text are obtained by part-of-speech tagging, and the entities in the training text are extracted by dependency syntactic parsing and taken as target entities.
Further, by tagging the parts of speech of the training text and performing dependency syntactic parsing, the pyltp and hanlp open-source libraries are used to extract the nouns and pronouns in the training text.
Pyltp and Hanlp are basic natural language processing libraries released by Harbin Institute of Technology and Hankcs respectively, used for part-of-speech tagging and entity extraction. Implementation steps: 1. Split the articles of the training text into fragments, breaking sentences at sentence-ending punctuation marks and setting the fragment length. 2. Call the part-of-speech tagging (POS) and dependency parsing (DP) modules of the Pyltp and Hanlp libraries to perform part-of-speech tagging and entity extraction on the training text; the tagging and extraction results are returned as JSON.
It should be noted that, for part-of-speech tagging, only words whose tags contain n, i.e., the various classes of nouns (e.g., n = general noun, ni = organization, nl = locale, ns = geographic name, nt = temporal noun), as well as p, i.e., pronouns, are kept. Tagging example: 我吃苹果 (I eat an apple) = (我, p), (苹果, n). Dependency parsing uses the subject-verb-object (SBV) relation and annotates the corresponding words in the sentences of the training text; for example, 我吃苹果 is annotated as (我, Subject), (吃, Predicate), (苹果, Object). The extracted nouns are mapped onto the subject and object constituents, and nouns that fill neither of these two constituents in the sentence are deleted.
It should be noted that pyltp and hanlp are used together to avoid cases where a single library fails to recognize some entities; combining their results improves the precision of entity recognition and extraction.
Dependency syntactic parsing was first proposed by the French linguist L. Tesniere. It analyzes a sentence into a dependency syntax tree that describes the dependency relationships among the words, i.e., the syntactic collocations between words, which are associated with the semantics. In this application, entities are extracted from the training text by means of dependency syntactic parsing.
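The tag-filtering and subject/object rules above can be sketched as a plain-Python filter over parsed tokens. The (word, tag, role) triple structure and the role names below are assumptions for illustration, not the actual output format of pyltp or hanlp.

```python
# Keep only noun-like tags (containing "n") and pronouns ("p"), then keep
# words that fill the Subject or Object slot of an SBV dependency parse.
NOUN_ROLES = {"Subject", "Object"}

def candidate_entities(tokens):
    """tokens: list of (word, pos_tag, dependency_role) triples."""
    kept = []
    for word, pos, role in tokens:
        if "n" not in pos and pos != "p":
            continue  # drop words that are neither nouns nor pronouns
        if role in NOUN_ROLES:
            kept.append(word)
    return kept

# the 我吃苹果 (I eat an apple) example from the text, as a hypothetical parse
parse = [("我", "p", "Subject"), ("吃", "v", "Predicate"), ("苹果", "n", "Object")]
```

Running the filter on the example keeps the subject pronoun and the object noun while dropping the verb, mirroring the SBV rule described above.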
S3: Extract, through the trained relationship extraction model combined with the training text, the relationship between any two adjacent target entities; judge the association between any two target entities; take any two associated target entities as associated entities; and take the association corresponding to the associated entities as a target relationship.
Specifically, the target entities and relationships are converted into (entity A, relationship, entity B) triples; the "relationship" is then changed to 0 or 1, where 1 indicates that the two target entities are associated and 0 indicates that they are not; finally, the association between every two target entities is output. Entities with no association are randomly sampled from the nouns of the sentences of the training text.
The relationship between every two target entities covers the cases that an association exists between them and that no association exists.
A target relationship is an association existing between two entities; the association refers to the state in which the two entities interact with and influence each other in the text.
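The 0/1 relabeling of triples, with unrelated pairs drawn at random from sentence nouns, might look like the following sketch. The exact sampling scheme is an assumption; the text only states that non-associated entity pairs are sampled randomly from the nouns of the training sentences.

```python
import random

def build_pairs(triples, sentence_nouns, n_neg, seed=0):
    """Turn (entity A, relation, entity B) triples into pairs labeled 1,
    then add n_neg randomly sampled unrelated pairs labeled 0."""
    rng = random.Random(seed)
    positives = {(a, b) for a, _, b in triples}
    pairs = [(a, b, 1) for (a, b) in sorted(positives)]
    negatives = 0
    while negatives < n_neg:
        a, b = rng.sample(sentence_nouns, 2)
        if (a, b) in positives or (b, a) in positives:
            continue  # skip pairs that are actually related
        pairs.append((a, b, 0))
        negatives += 1
    return pairs
```

The resulting labeled pairs form the binary training data (related / not related) mentioned for the model at the end of this step.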
Specifically, the relationship extraction model consists of four parts: Embedding, Encoding, Selector, and Classifier. (1) Embedding performs word embedding and position embedding on the input training text to generate vectors, which serve as the input of the whole model. (2) The Encoding layer is composed of a Piecewise-CNN (PCNN); when the context of the training text is input, it is divided into three segments by the current two target entities, and the PCNN obtains the feature vectors extracted from the three segments and concatenates them. (3) The Selector is an attention layer that assigns different weights to the feature vectors for the subsequent training of the relationship extraction model. (4) The Classifier is an ordinary multi-class layer that outputs the probability that the two input target entities are related to each other. Finally, the model is trained with binary annotation data (related / not related) and outputs the relationship between every two target entities.
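The piecewise pooling at the heart of the PCNN encoder in (2) can be illustrated with a minimal pure-Python sketch: the convolution feature map is split into three segments at the two entity positions, each segment is max-pooled column-wise, and the three pooled pieces are concatenated. This is a simplified illustration of the idea, not the patent's full model.

```python
def piecewise_max_pool(feature_map, e1_pos, e2_pos):
    """feature_map: list of per-position feature vectors (lists of floats).
    Split positions into [0, e1], (e1, e2], (e2, end) and max-pool each piece."""
    segments = (feature_map[: e1_pos + 1],
                feature_map[e1_pos + 1 : e2_pos + 1],
                feature_map[e2_pos + 1 :])
    pooled = []
    for seg in segments:
        if seg:  # column-wise max over the segment's rows
            pooled.extend(max(col) for col in zip(*seg))
    return pooled  # 3 * n_filters values when all pieces are non-empty
```

Compared with pooling over the whole sentence, pooling each of the three entity-delimited pieces separately preserves where a feature fired relative to the two entities, which is why PCNN is a common encoder for relation extraction.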
S4: Construct an initial graph with the target entities as nodes and the target relationships as edges.
Specifically, an initial graph is generated from the entities and inter-entity relationships of the training text, with the target entities as nodes and the target relationships as edges, so that the texts to be compared can subsequently be compared in graph form, improving the precision and recognition efficiency of text comparison.
S5: Annotate the target entities and target relationships of the initial graph, take the annotated target entities and target relationships as core information, and cluster the nodes of the initial graph according to the core information to obtain a target graph.
Specifically, the target entities and target relationships in the initial graph are annotated, and the nodes of the initial graph are clustered according to the annotated target entities and target relationships, reducing redundant target entities and target relationships of the initial graph and finally yielding the target graph.
For the target entities and target relationships of the initial graph, a consistent annotation scheme is adopted, i.e., the entities and inter-entity relationships of the graph are annotated according to unified rules or means. Consistent annotation schemes include, but are not limited to, annotation based on historical data and experience and randomly selected annotation. Preferably, annotation is performed based on historical data and experience, so that the best entities and inter-entity relationships are selected for annotation from past data and experience, which helps improve the accuracy of the graph with respect to entities and inter-entity relationships.
It should be noted that about five rounds of annotation are needed, with clustering performed after each round, finally yielding the target graph.
S6: Acquire texts to be compared, input the texts to be compared into the target graph, and count, for each text to be compared, the coverage of the core information in the target graph by the entities and relationships extracted from that text.
Specifically, the texts to be compared are acquired and input into the target graph one by one, and the coverage of the core information in the target graph by the entities and relationships extracted from each text to be compared is counted, which is used in the subsequent step to judge whether the text to be compared and the trained texts are of the same type.
The coverage is the proportion of the nodes and edges of the core information that coincide with the entities and relationships extracted from the text to be compared.
It should be noted that entities and inter-entity relationships are extracted from the input texts to be compared through the steps above.
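The coverage statistic of S6 and the threshold test of S7 reduce to simple set arithmetic; a minimal sketch follows, with the default threshold echoing the preferred 75% value given in the text.

```python
def coverage(extracted_entities, extracted_relations, core_entities, core_relations):
    """Fraction of the target graph's core nodes and edges covered by the
    entities/relations extracted from one document to be compared."""
    core = set(core_entities) | set(core_relations)
    hit = core & (set(extracted_entities) | set(extracted_relations))
    return len(hit) / len(core) if core else 0.0

def same_type(cov, threshold=0.75):
    """S7: documents count as the same type when coverage exceeds the threshold."""
    return cov > threshold
```

Relations are represented here as hashable triples so that nodes and edges can live in one set; any hashable edge encoding works the same way.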
S7: If the coverage exceeds a preset threshold, determine that the texts to be compared are of the same type.
The preset threshold is set according to the actual situation and is not limited here. A preferred preset threshold is 75%; at this threshold, it can be clearly established that the contents of the compared texts differ little.
Whether the main contents of two or more texts express the same meaning is judged; if they do, the two or more texts are of the same type.
In this embodiment, entities and inter-entity relationships are extracted from the texts and built into graphs, and the graphs are then compared to measure text similarity. This condenses the objects being compared, avoids distractors in the original texts, is unaffected by text format, and improves the accuracy of text comparison.
Referring to FIG. 3, FIG. 3 shows a specific implementation of step S2, i.e., the process of performing part-of-speech tagging on the training text and extracting entities from the training text by means of dependency syntactic parsing as target entities, detailed as follows:
S21: Obtain, by regular-expression matching, the text separators contained in the training text.
Specifically, the text separators contained in the training text are obtained for splitting the text in the subsequent step.
Optionally, the text separators include format separators and punctuation separators.
A format separator splits text according to the text encoding type or the structure of the text. Format separators allow the training text to be split according to encoding type or structure, obtaining short sentences of the same encoding type or structure, which facilitates the subsequent acquisition of target entities.
A punctuation separator splits text according to punctuation marks. Punctuation separators allow the training text to be split quickly, improving the efficiency of obtaining short text sentences.
S22: Split the training text with the text separators to obtain short text sentences.
Preferably, the sentence fragments are concatenated into longer short text sentences of a preset length; in the subsequent steps, part-of-speech tagging and entity extraction can be performed on the longer short sentences to improve the efficiency of tagging and extraction. The preset length is set according to the actual situation and is not limited here. A preferred preset length is 300 characters, with one long short sentence concatenated from 1 to 5 sentence fragments.
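The split-then-concatenate preprocessing of S22 might be sketched as follows; the particular set of sentence-ending punctuation marks in the regex is an assumption for illustration.

```python
import re

def split_and_chunk(text, max_len=300):
    """Split on sentence-ending punctuation, then greedily concatenate
    consecutive fragments into chunks of at most max_len characters."""
    fragments = [f for f in re.split(r"(?<=[。!?;!?;])", text) if f]
    chunks, current = [], ""
    for frag in fragments:
        if current and len(current) + len(frag) > max_len:
            chunks.append(current)
            current = ""
        current += frag
    if current:
        chunks.append(current)
    return chunks
```

The lookbehind split keeps each punctuation mark attached to its fragment, so concatenated chunks read as continuous text.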
S23: Tag the nouns and pronouns in the short text sentences by part-of-speech tagging to obtain tagged nouns and pronouns.
It should be noted that only the nouns and pronouns in the short text sentences need to be tagged; other words need no part-of-speech tags, which avoids tagging every part of speech and improves the efficiency of tagging.
S24: Map the tagged nouns and pronouns onto a consistency rule by dependency syntactic parsing, and take the tagged nouns that conform to the consistency rule as initial entities.
The consistency rule uses the subject-verb-object (SBV) relation, annotating the corresponding words. For example, 我吃苹果 (I eat an apple) is annotated as (我, Subject), (吃, Predicate), (苹果, Object); the extracted nouns are mapped onto the subject and object constituents, and nouns that fill neither of these two constituents in the sentence are deleted.
In this embodiment, the text separators contained in the training text are obtained by regular-expression matching, the training text is split with the text separators to obtain short text sentences, the nouns and pronouns in the short sentences are tagged by part-of-speech tagging, and the tagged nouns and pronouns are mapped onto the consistency rule by dependency syntactic parsing, with the tagged nouns conforming to the rule taken as initial entities. This realizes part-of-speech tagging and entity extraction on the training text, provides the basis for the subsequent graph construction, and helps improve the accuracy of text comparison.
Referring to FIG. 4, FIG. 4 shows a specific implementation after step S24, including:
S25: Judge, by counting the cohesion degree of the initial entities in the short text sentences, whether two or more initial entities form a compound word, and obtain a judgment result.
Specifically, the cohesion degree of the initial entities in the short text sentences is counted by tf-idf and co-occurrence analysis.
tf-idf is a statistical method for evaluating the importance of a term to one document in a document set or corpus. The importance of a term increases in proportion to the number of times it appears in the document but decreases in inverse proportion to its frequency in the corpus. Co-word analysis uses the co-occurrence of noun-phrase pairs in a document collection to determine the relationships between the topics of the discipline the collection represents. It is generally held that the more often a pair of terms appears in the same documents, the closer the relationship between the two topics. Thus, counting the pairwise co-occurrence frequencies of the subject terms of a set of documents forms a co-word network of term-pair associations, in which the distance between nodes reflects how closely the topical contents are related. In this application, tf-idf and co-occurrence analysis are used to count the cohesion degree of the initial entities and thereby judge whether two or more initial entities form a compound word.
The cohesion degree is the likelihood that several words form the current phrase slice (i.e., a compound word). By counting the cohesion degree of the initial entities, it is judged whether two or more initial entities form a compound word. For example, given a short text sequence ABC, the frequency of ABC is divided by the frequencies of A, B, C, AB, BC, and AC respectively, and the minimum of these results is taken as the cohesion of the compound word.
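The cohesion computation in the ABC example can be written directly from frequency counts; this is a literal rendering of the rule stated above, with a toy frequency table as input.

```python
def cohesion(freq, a, b, c):
    """freq: mapping from word sequence to corpus frequency.
    For a candidate compound a+b+c, divide its frequency by the frequencies
    of a, b, c, a+b, b+c, and a+c, and keep the minimum ratio."""
    whole = freq[a + b + c]
    parts = [a, b, c, a + b, b + c, a + c]
    return min(whole / freq[p] for p in parts)
```

A high minimum ratio means the three units co-occur almost as often as any of their sub-parts, i.e., they likely form one compound entity; the decision threshold itself is left open by the text.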
S26: If the judgment result is that a compound word is formed, merge the initial entities forming the compound word to obtain a merged entity, and take the merged entity as a target entity.
Specifically, determining whether two or more initial entities form a compound word further determines the target entities.
A merged entity is the entity obtained by merging two or more initial entities that form a compound word.
Further, the initial entities include entities that can form compound words and entities that cannot; initial entities that form a compound word are merged into a target entity, while entities that form no compound word are also taken individually as target entities.
In this embodiment, whether two or more initial entities form a compound word is judged by counting the cohesion degree of the initial entities in the short text sentences; if the judgment result is that a compound word is formed, the initial entities forming it are merged into a merged entity, which is taken as a target entity. This determines the target entities, provides the basis for the subsequent graph construction, and helps improve the accuracy of text comparison.
Referring to FIG. 5, FIG. 5 shows a specific implementation before step S3, including:
S31: Acquire sample text, and perform word embedding and position embedding on the sample text to generate embedding vectors.
Specifically, the generated embedding vectors serve as the input of the relationship extraction model for the subsequent numerical operations.
The sample text is used to train the relationship extraction model to obtain a trained relationship extraction model, facilitating the subsequent extraction of entity relationships.
Word embedding is the collective name for language modeling and representation learning techniques in natural language processing (NLP). Conceptually, it refers to embedding a high-dimensional space, whose dimension is the number of all words, into a continuous vector space of much lower dimension, with each word or phrase mapped to a vector over the real numbers. Position embedding, as opposed to word embedding, embeds the different positions of the sample text.
S32: Divide the context of the sample text into three segments, and obtain the embedding vectors of the three segments as feature vectors.
Specifically, the entities in the sample text are obtained first; when the context of the sample text is input, it is divided into three segments by its two entities, and the embedding vectors of the three segments are obtained as feature vectors.
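Splitting the context into three segments at the two entity positions, as described for S32, can be sketched at the string level; segment boundary conventions (entities included at the end of their segments) are an assumption for illustration.

```python
def split_context(sentence, entity_a, entity_b):
    """Split a sentence into three segments: up to and including entity A,
    from after A through entity B, and everything after B."""
    i = sentence.index(entity_a) + len(entity_a)
    j = sentence.index(entity_b, i) + len(entity_b)
    return sentence[:i], sentence[i:j], sentence[j:]
```

Each of the three segments would then be embedded and encoded separately, matching the three-segment input convention of the PCNN encoder described in step S3.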
The feature vectors are the hidden-layer state vectors output by the neural network; they serve as intermediate results of the relationship extraction model for the numerical operations of the subsequent steps.
S33: Concatenate feature vectors of the same type to obtain a target vector.
Specifically, feature vectors of the same type are concatenated into a feature-vector set, i.e., the target vector; in the subsequent steps, feature-vector sets of different types carry different weights.
S34: Obtain the weight of the target vector, and train the relationship extraction model according to the target vector and its weight to obtain the trained relationship extraction model.
A Selector is chosen as the attention layer for relationship extraction. The reason is that the training data used by relationship extraction models often come from distant supervision, which makes the data noisy. To overcome errors in individual samples, a common approach is to put multiple samples labeled with the same type by distant supervision together into one bag, train the whole bag in the current training batch, and then select, by comparison, the correctly judged samples in each bag. The Selector can assign different weights to different samples in the same bag; it is essentially a weighting mechanism, hence its choice.
In this relationship extraction model, the weight is obtained by computing the difference between the predicted probability that the current sample is true and the correct probability. Assigning different weights to the different types of target vectors improves the recognition accuracy of the relationship extraction model.
In this embodiment, sample text is acquired, word embedding and position embedding are performed on it to generate embedding vectors, the context of the sample text is divided into three segments whose embedding vectors are obtained as feature vectors, feature vectors of the same type are concatenated into a target vector, and finally the weight of the target vector is obtained and the relationship extraction model is trained according to the target vector and its weight, yielding the trained relationship extraction model. This trains the extraction model to output the relationships between the entities in the training text for graph construction, which helps improve the accuracy of text comparison.
Further, before step S4, the knowledge-graph-based text comparison method further includes:
performing clustering operations on the target entities and the target relationships respectively, and merging target entities with the same meaning and target relationships with the same meaning respectively.
Specifically, the target entities and target relationships extracted from the training text are disambiguated and deduplicated, because the same entity may be expressed in completely different ways in different texts, or the entities connected by the same relationship may be expressed differently, causing entity/relationship redundancy. Disambiguation and deduplication are done with the Python open-source library dedupe: all extracted entities and relationships are fed into the tool as (entity A, relationship, entity B) triples, and dedupe merges entities or relationships with the same meaning through clustering.
The clustering operation selects the corresponding target entities and target relationships by grouping duplicates, selects the optimal threshold through similarity-value computation, and finally obtains the target entities and target relationships with the same meaning.
Dedupe is a Python open-source library for knowledge fusion. Its processing flow includes several main steps: entity/relationship description similarity computation (record similarity), smart comparisons, grouping duplicates, and choosing a good threshold. Similarity computation and smart matching use active learning combined with rule matching; grouping duplicates uses hierarchical clustering with centroid linkage; finally these three modules are put into an active learning framework, and through a small amount of annotation dedupe determines the optimal threshold from the annotations.
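Centroid-linkage hierarchical clustering, the grouping step named above, can be illustrated in miniature: repeatedly merge the two clusters whose centroids are closest, until the closest pair lies beyond a threshold. This toy version works on points already embedded as 2-D vectors; dedupe's real pipeline adds active learning and rule matching on top, so this is only a sketch of the clustering idea.

```python
def centroid_linkage_cluster(points, threshold):
    """Agglomerative clustering with centroid linkage on 2-D points."""
    clusters = [[p] for p in points]

    def centroid(c):
        n = len(c)
        return (sum(x for x, _ in c) / n, sum(y for _, y in c) / n)

    def dist(a, b):
        (ax, ay), (bx, by) = centroid(a), centroid(b)
        return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

    while len(clusters) > 1:
        d, i, j = min((dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > threshold:
            break  # no remaining pair is close enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

The stopping threshold plays the role of dedupe's "good threshold": items in one final cluster are treated as the same entity or relationship and merged.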
In this embodiment, clustering operations are performed on the target entities and target relationships respectively, and target entities with the same meaning and target relationships with the same meaning are merged respectively, reducing redundant target entities and/or target relationships, improving the efficiency of the subsequent graph construction, and helping improve the accuracy of text comparison.
Referring to FIG. 6, FIG. 6 shows a specific implementation of step S5, i.e., the process of annotating the target entities and target relationships of the initial graph, taking the annotated target entities and target relationships as core information, and clustering the nodes of the initial graph according to the core information to obtain the target graph, detailed as follows:
S51: Acquire the text information of the annotated target entities and of the unannotated target entities in the training text, obtaining annotated text information and unannotated text information.
Specifically, the target entities and target relationships of the initial graph are annotated, and the text information of the annotated and unannotated target entities in the training text is acquired, yielding the annotated text information and the unannotated text information.
S52: Feed the annotated text information and the unannotated text information into a BERT model for vector acquisition, obtaining annotated vectors and unannotated vectors.
The annotated vectors are obtained by feeding the annotated text information into the BERT model for vector acquisition; the unannotated vectors are obtained by feeding the unannotated text information into the BERT model for vector acquisition.
S53: Compute the similarity value between each unannotated vector and the annotated vectors.
The computation of the similarity value includes, but is not limited to, Minkowski distance, Manhattan distance, Euclidean distance, cosine similarity, Hamming distance, and the like.
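A few of the listed distance and similarity measures, written out directly; cosine similarity is the usual choice for dense vectors such as BERT embeddings.

```python
import math

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def minkowski(u, v, p):
    # p=1 gives Manhattan distance, p=2 gives Euclidean distance
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Note the distances decrease as vectors get closer while cosine similarity increases, so the direction of the S54 threshold comparison depends on which measure is chosen.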
S54: If the similarity value exceeds a preset threshold, delete the unannotated target entities and target relationships in the initial graph corresponding to the unannotated vector, obtaining the target graph.
The preset threshold is set according to the actual situation and is not limited here.
In this embodiment, the text information of the annotated and unannotated target entities in the training text is acquired, yielding annotated and unannotated text information, which is fed into the BERT model for vector acquisition to obtain annotated and unannotated vectors; the similarity value between each unannotated vector and the annotated vectors is then computed, and if the similarity value exceeds the preset threshold, the unannotated target entities and target relationships in the initial graph corresponding to the unannotated vector are deleted, yielding the target graph. This realizes the construction of the target graph, facilitates comparing the texts to be compared, and improves the accuracy of text comparison.
It should be emphasized that, to further ensure the privacy and security of the texts to be compared, the texts to be compared may also be stored in a node of a blockchain.
Those of ordinary skill in the art can understand that all or part of the processes of the above method embodiments can be completed by instructing the relevant hardware through computer-readable instructions, which can be stored in a computer-readable storage medium; when executed, the computer-readable instructions may include the processes of the above method embodiments. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, or a read-only memory (ROM), or a random access memory (RAM).
Referring to FIG. 7, as an implementation of the method shown in FIG. 2 above, this application provides an embodiment of a knowledge-graph-based text comparison apparatus. The apparatus embodiment corresponds to the method embodiment shown in FIG. 2, and the apparatus can be applied to various electronic devices.
As shown in FIG. 7, the knowledge-graph-based text comparison apparatus of this embodiment includes: a training text acquisition module 71, a target entity acquisition module 72, a target relationship acquisition module 73, an initial graph construction module 74, a target graph construction module 75, a core information comparison module 76, and a same-type text judgment module 77, wherein:
the training text acquisition module 71 is configured to collect training corpus of a preset field and perform text preprocessing on the training corpus to obtain training text;
the target entity acquisition module 72 is configured to perform part-of-speech tagging on the training text and extract entities from the training text by means of dependency syntactic parsing as target entities;
the target relationship acquisition module 73 is configured to extract, through the trained relationship extraction model combined with the training text, the relationship between any two adjacent target entities, judge the association between any two target entities, take any two associated target entities as associated entities, and take the association corresponding to the associated entities as a target relationship;
the initial graph construction module 74 is configured to construct an initial graph with the target entities as nodes and the target relationships as edges;
the target graph construction module 75 is configured to annotate the target entities and target relationships of the initial graph, take the annotated target entities and target relationships as core information, and cluster the nodes of the initial graph according to the core information to obtain a target graph;
the core information comparison module 76 is configured to acquire texts to be compared, input the texts to be compared into the target graph, and count, for each text to be compared, the coverage of the core information in the target graph by the entities and relationships extracted from that text;
the same-type text judgment module 77 is configured to determine that the texts to be compared are of the same type if the coverage exceeds the preset threshold.
Further, the target entity acquisition module 72 includes:
a text separator obtaining unit, configured to obtain, by regular-expression matching, the text separators contained in the training text;
a short text sentence obtaining unit, configured to split the training text with the text separators to obtain short text sentences;
a part-of-speech tagging unit, configured to tag the nouns and pronouns in the short text sentences by part-of-speech tagging to obtain tagged nouns and pronouns;
an initial entity determination unit, configured to map the tagged nouns and pronouns onto a consistency rule by dependency syntactic parsing, and take the tagged nouns that conform to the consistency rule as initial entities.
Further, after the initial entity determination unit, the target entity acquisition module 72 further includes:
a cohesion degree statistics unit, configured to judge, by counting the cohesion degree of the initial entities in the short text sentences, whether two or more initial entities form a compound word, and obtain a judgment result;
a compound word judgment unit, configured to, if the judgment result is that a compound word is formed, merge the initial entities forming the compound word to obtain a merged entity, and take the merged entity as a target entity.
Further, before the target relationship acquisition module 73, the above knowledge-graph-based text comparison apparatus further includes:
a sample text acquisition module, configured to acquire sample text and perform word embedding and position embedding on the sample text to generate embedding vectors;
a feature vector acquisition module, configured to divide the context of the sample text into three segments and obtain the embedding vectors of the three segments as feature vectors;
a target vector acquisition module, configured to concatenate feature vectors of the same type to obtain a target vector;
a target extraction model training module, configured to obtain the weight of the target vector and train the relationship extraction model according to the target vector and its weight to obtain the trained relationship extraction model.
Further, before the initial graph construction module 74, the above knowledge-graph-based text comparison apparatus further includes:
a clustering operation module, configured to perform clustering operations on the target entities and target relationships respectively, and merge target entities with the same meaning and target relationships with the same meaning respectively.
Further, the target graph construction module 75 includes:
a text information acquisition unit, configured to acquire the text information of the annotated target entities and of the unannotated target entities in the training text, obtaining annotated text information and unannotated text information;
a vector acquisition unit, configured to feed the annotated and unannotated text information into the BERT model for vector acquisition, obtaining annotated vectors and unannotated vectors;
a similarity value statistics unit, configured to compute the similarity value between each unannotated vector and the annotated vectors;
a similarity value judgment unit, configured to, if the similarity value exceeds the preset threshold, delete the unannotated target entities and target relationships in the initial graph corresponding to the unannotated vector, obtaining the target graph.
It should be emphasized that, to further ensure the privacy and security of the above target data, the target data may also be stored in a node of a blockchain.
To solve the above technical problem, an embodiment of this application further provides a computer device. Referring to FIG. 8, FIG. 8 is a block diagram of the basic structure of the computer device of this embodiment.
The computer device 8 includes a memory 81, a processor 82, and a network interface 83 communicatively connected to each other through a system bus. It should be pointed out that the figure only shows the computer device 8 with the three components, memory 81, processor 82, and network interface 83, but it should be understood that implementing all of the shown components is not required; more or fewer components may be implemented instead. Those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, and the like.
The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or other computing device. The computer device can interact with users through a keyboard, mouse, remote control, touchpad, voice-activated device, or the like.
The memory 81 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, and the like. In some embodiments, the memory 81 may be an internal storage unit of the computer device 8, such as the hard disk or memory of the computer device 8. In other embodiments, the memory 81 may also be an external storage device of the computer device 8, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 8. Of course, the memory 81 may also include both the internal storage unit of the computer device 8 and its external storage device. In this embodiment, the memory 81 is generally used to store the operating system and various application software installed on the computer device 8, such as the computer-readable instructions of the knowledge-graph-based text comparison method. In addition, the memory 81 can also be used to temporarily store various types of data that have been output or will be output.
The processor 82 may, in some embodiments, be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 82 is generally used to control the overall operation of the computer device 8. In this embodiment, the processor 82 is configured to run the computer-readable instructions stored in the memory 81 or process data, for example to run the computer-readable instructions of the knowledge-graph-based text comparison method.
The network interface 83 may include a wireless network interface or a wired network interface, and is generally used to establish communication connections between the computer device 8 and other electronic devices.
This application also provides another embodiment: a computer-readable storage medium storing computer-readable instructions, where the computer-readable instructions can be executed by at least one processor to cause the at least one processor to execute the steps of the knowledge-graph-based text comparison method described above. The computer-readable storage medium may be non-volatile or volatile.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. The blockchain may include the underlying blockchain platform, the platform product service layer, the application service layer, and the like.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, or of course by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, or optical disc) and includes several instructions to cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the methods of the various embodiments of this application.
Obviously, the embodiments described above are only some, not all, of the embodiments of this application. The drawings show preferred embodiments of this application but do not limit its patent scope. This application may be implemented in many different forms; rather, these embodiments are provided so that the understanding of the disclosure of this application will be thorough and complete. Although this application is described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments or replace some of the technical features with equivalents. Any equivalent structure made using the contents of the specification and drawings of this application, applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of this application.

Claims (20)

  1. A knowledge-graph-based text comparison method, comprising:
    collecting training corpus of a preset field, and performing text preprocessing on the training corpus to obtain training text;
    performing part-of-speech tagging on the training text, and extracting entities from the training text by means of dependency syntactic parsing as target entities;
    extracting, through a trained relationship extraction model combined with the training text, the relationship between any two adjacent target entities, judging the association between any two target entities, taking any two target entities between which an association exists as associated entities, and taking the association corresponding to the associated entities as a target relationship;
    constructing an initial graph with the target entities as nodes and the target relationships as edges;
    annotating the target entities and target relationships of the initial graph, taking the annotated target entities and target relationships as core information, and clustering the nodes of the initial graph according to the core information to obtain a target graph;
    acquiring texts to be compared, inputting the texts to be compared into the target graph, and counting, for each of the texts to be compared, the coverage of the core information in the target graph by the entities and relationships extracted from that text; and
    if the coverage exceeds a preset threshold, determining that the texts to be compared are of the same type.
  2. The knowledge-graph-based text comparison method according to claim 1, wherein performing part-of-speech tagging on the training text and extracting entities from the training text by means of dependency syntactic parsing as target entities comprises:
    obtaining, by regular-expression matching, the text separators contained in the training text;
    splitting the training text with the text separators to obtain short text sentences;
    tagging the nouns and pronouns in the short text sentences by part-of-speech tagging to obtain tagged nouns and pronouns; and
    mapping the tagged nouns and pronouns onto a consistency rule by dependency syntactic parsing, and taking the tagged nouns that conform to the consistency rule as initial entities.
  3. The knowledge-graph-based text comparison method according to claim 2, wherein after mapping the tagged nouns and pronouns onto the consistency rule by dependency syntactic parsing and taking the tagged nouns that conform to the consistency rule as initial entities, the method further comprises:
    judging, by counting the cohesion degree of the initial entities in the short text sentences, whether two or more of the initial entities form a compound word, and obtaining a judgment result; and
    if the judgment result is that a compound word is formed, merging the initial entities forming the compound word to obtain a merged entity, and taking the merged entity as a target entity.
  4. The knowledge-graph-based text comparison method according to claim 1, wherein before extracting, through the trained relationship extraction model combined with the training text, the relationship between any two adjacent target entities, judging the association between any two target entities, taking any two associated target entities as associated entities, and taking the association corresponding to the associated entities as a target relationship, the method further comprises:
    acquiring sample text, and performing word embedding and position embedding on the sample text to generate embedding vectors;
    dividing the context of the sample text into three segments, and obtaining the embedding vectors of the three segments as feature vectors;
    concatenating feature vectors of the same type to obtain a target vector; and
    obtaining the weight of the target vector, and training the relationship extraction model according to the target vector and the weight of the target vector to obtain the trained relationship extraction model.
  5. The knowledge-graph-based text comparison method according to claim 1, wherein before constructing the initial graph with the target entities as nodes and the target relationships as edges, the method further comprises:
    performing clustering operations on the target entities and the target relationships respectively, and merging target entities with the same meaning and target relationships with the same meaning respectively.
  6. The knowledge-graph-based text comparison method according to any one of claims 1 to 5, wherein annotating the target entities and target relationships of the initial graph, taking the annotated target entities and target relationships as core information, and clustering the nodes of the initial graph according to the core information to obtain the target graph comprises:
    acquiring the text information of the annotated target entities and of the unannotated target entities in the training text, obtaining annotated text information and unannotated text information;
    feeding the annotated text information and the unannotated text information into a BERT model for vector acquisition, obtaining annotated vectors and unannotated vectors;
    computing the similarity value between each of the unannotated vectors and the annotated vectors; and
    if the similarity value exceeds a preset threshold, deleting the unannotated target entities and target relationships in the initial graph corresponding to the unannotated vector, obtaining the target graph.
  7. A knowledge-graph-based text comparison apparatus, comprising:
    a training text acquisition module, configured to collect training corpus of a preset field and perform text preprocessing on the training corpus to obtain training text;
    a target entity acquisition module, configured to perform part-of-speech tagging on the training text and extract entities from the training text by means of dependency syntactic parsing as target entities;
    a target relationship acquisition module, configured to extract, through a trained relationship extraction model combined with the training text, the relationship between any two adjacent target entities, judge the association between any two target entities, take any two associated target entities as associated entities, and take the association corresponding to the associated entities as a target relationship;
    an initial graph construction module, configured to construct an initial graph with the target entities as nodes and the target relationships as edges;
    a target graph construction module, configured to annotate the target entities and target relationships of the initial graph, take the annotated target entities and target relationships as core information, and cluster the nodes of the initial graph according to the core information to obtain a target graph;
    a core information comparison module, configured to acquire texts to be compared, input the texts to be compared into the target graph, and count, for each of the texts to be compared, the coverage of the core information in the target graph by the entities and relationships extracted from that text; and
    a same-type text judgment module, configured to determine that the texts to be compared are of the same type if the coverage exceeds a preset threshold.
  8. The knowledge-graph-based text comparison apparatus according to claim 7, wherein the target entity acquisition module comprises:
    a text separator obtaining unit, configured to obtain, by regular-expression matching, the text separators contained in the training text;
    a short text sentence obtaining unit, configured to split the training text with the text separators to obtain short text sentences;
    a part-of-speech tagging unit, configured to tag the nouns and pronouns in the short text sentences by part-of-speech tagging to obtain tagged nouns and pronouns; and
    an initial entity determination unit, configured to map the tagged nouns and pronouns onto a consistency rule by dependency syntactic parsing, and take the tagged nouns that conform to the consistency rule as initial entities.
  9. A computer device, comprising: at least one processor; and
    a memory communicatively connected to the at least one processor, wherein
    the memory stores computer-readable instructions, and the processor implements the following steps when executing the computer-readable instructions:
    collecting training corpus of a preset field, and performing text preprocessing on the training corpus to obtain training text;
    performing part-of-speech tagging on the training text, and extracting entities from the training text by means of dependency syntactic parsing as target entities;
    extracting, through a trained relationship extraction model combined with the training text, the relationship between any two adjacent target entities, judging the association between any two target entities, taking any two associated target entities as associated entities, and taking the association corresponding to the associated entities as a target relationship;
    constructing an initial graph with the target entities as nodes and the target relationships as edges;
    annotating the target entities and target relationships of the initial graph, taking the annotated target entities and target relationships as core information, and clustering the nodes of the initial graph according to the core information to obtain a target graph;
    acquiring texts to be compared, inputting the texts to be compared into the target graph, and counting, for each of the texts to be compared, the coverage of the core information in the target graph by the entities and relationships extracted from that text; and
    if the coverage exceeds a preset threshold, determining that the texts to be compared are of the same type.
  10. The computer device according to claim 9, wherein the performing part-of-speech tagging on the training text and extracting entities from the training text as target entities according to dependency syntactic parsing comprises:
    acquiring, by means of regular-expression matching, the text separators contained in the training text;
    performing text segmentation on the training text by means of the text separators to obtain short text sentences;
    annotating the nouns and pronouns in the short text sentences by means of part-of-speech tagging to obtain annotated nouns and pronouns; and
    mapping the annotated nouns and pronouns to a consistency rule according to dependency syntactic parsing, and taking the annotated nouns that conform to the consistency rule as initial entities.
  11. The computer device according to claim 10, wherein after the mapping the annotated nouns and pronouns to a consistency rule according to dependency syntactic parsing and taking the annotated nouns that conform to the consistency rule as initial entities, the method further comprises:
    determining, by counting the cohesion degree of the initial entities in the short text sentences, whether two or more of the initial entities form a compound word, to obtain a determination result; and
    if the determination result is that a compound word is formed, merging the initial entities that form the compound word to obtain a merged entity, and taking the merged entity as a target entity.
  12. The computer device according to claim 9, wherein before the extracting, by means of a trained relation extraction model in combination with the training text, the relationship between any two adjacent target entities, determining the association relationship between any two target entities, taking any two target entities having an association relationship as associated entities, and taking the association relationship corresponding to the associated entities as a target relationship, the method further comprises:
    acquiring sample text, and performing word embedding and position embedding on the sample text to generate embedding vectors;
    dividing the context of the sample text into three text segments, and acquiring the embedding vectors of the three text segments as feature vectors;
    concatenating feature vectors of the same type to obtain a target vector; and
    acquiring a weight of the target vector, and training a relation extraction model according to the target vector and the weight of the target vector to obtain the trained relation extraction model.
  13. The computer device according to claim 9, wherein before the constructing an initial graph with the target entities as nodes and the target relationships as edges, the method further comprises:
    performing a clustering operation on the target entities and on the target relationships respectively, and respectively merging target entities having the same meaning and target relationships having the same meaning.
  14. The computer device according to any one of claims 9 to 13, wherein the annotating target entities and target relationships of the initial graph, taking the annotated target entities and target relationships as core information, and clustering the nodes of the initial graph according to the core information to obtain a target graph comprises:
    acquiring text information of the annotated target entities and of the unannotated target entities in the training text, to obtain annotated text information and unannotated text information;
    inputting the annotated text information and the unannotated text information into a BERT model for vector acquisition, to obtain annotated vectors and unannotated vectors;
    computing a similarity value between each unannotated vector and the annotated vectors; and
    if the similarity value exceeds a preset threshold, deleting from the initial graph the unannotated target entities and target relationships corresponding to the unannotated vector, to obtain the target graph.
  15. A computer-readable storage medium having computer-readable instructions stored thereon which, when executed by a processor, cause the processor to perform the following steps:
    collecting training corpora of a preset field, and performing text preprocessing on the training corpora to obtain training text;
    performing part-of-speech tagging on the training text, and extracting entities from the training text as target entities according to dependency syntactic parsing;
    extracting, by means of a trained relation extraction model in combination with the training text, the relationship between any two adjacent target entities, determining the association relationship between any two target entities, taking any two target entities having an association relationship as associated entities, and taking the association relationship corresponding to the associated entities as a target relationship;
    constructing an initial graph with the target entities as nodes and the target relationships as edges;
    annotating target entities and target relationships of the initial graph, taking the annotated target entities and target relationships as core information, and clustering the nodes of the initial graph according to the core information to obtain a target graph;
    acquiring texts to be compared, inputting the texts to be compared into the target graph, and computing, for each text to be compared, the coverage rate of the entities and relationships extracted from the text over the core information in the target graph; and
    if the coverage rate exceeds a preset threshold, determining that the texts to be compared are texts of the same class.
  16. The computer-readable storage medium according to claim 15, wherein the performing part-of-speech tagging on the training text and extracting entities from the training text as target entities according to dependency syntactic parsing comprises:
    acquiring, by means of regular-expression matching, the text separators contained in the training text;
    performing text segmentation on the training text by means of the text separators to obtain short text sentences;
    annotating the nouns and pronouns in the short text sentences by means of part-of-speech tagging to obtain annotated nouns and pronouns; and
    mapping the annotated nouns and pronouns to a consistency rule according to dependency syntactic parsing, and taking the annotated nouns that conform to the consistency rule as initial entities.
  17. The computer-readable storage medium according to claim 16, wherein after the mapping the annotated nouns and pronouns to a consistency rule according to dependency syntactic parsing and taking the annotated nouns that conform to the consistency rule as initial entities, the method further comprises:
    determining, by counting the cohesion degree of the initial entities in the short text sentences, whether two or more of the initial entities form a compound word, to obtain a determination result; and
    if the determination result is that a compound word is formed, merging the initial entities that form the compound word to obtain a merged entity, and taking the merged entity as a target entity.
  18. The computer-readable storage medium according to claim 15, wherein before the extracting, by means of a trained relation extraction model in combination with the training text, the relationship between any two adjacent target entities, determining the association relationship between any two target entities, taking any two target entities having an association relationship as associated entities, and taking the association relationship corresponding to the associated entities as a target relationship, the method further comprises:
    acquiring sample text, and performing word embedding and position embedding on the sample text to generate embedding vectors;
    dividing the context of the sample text into three text segments, and acquiring the embedding vectors of the three text segments as feature vectors;
    concatenating feature vectors of the same type to obtain a target vector; and
    acquiring a weight of the target vector, and training a relation extraction model according to the target vector and the weight of the target vector to obtain the trained relation extraction model.
  19. The computer-readable storage medium according to claim 15, wherein before the constructing an initial graph with the target entities as nodes and the target relationships as edges, the method further comprises:
    performing a clustering operation on the target entities and on the target relationships respectively, and respectively merging target entities having the same meaning and target relationships having the same meaning.
  20. The computer-readable storage medium according to any one of claims 15 to 19, wherein the annotating target entities and target relationships of the initial graph, taking the annotated target entities and target relationships as core information, and clustering the nodes of the initial graph according to the core information to obtain a target graph comprises:
    acquiring text information of the annotated target entities and of the unannotated target entities in the training text, to obtain annotated text information and unannotated text information;
    inputting the annotated text information and the unannotated text information into a BERT model for vector acquisition, to obtain annotated vectors and unannotated vectors;
    computing a similarity value between each unannotated vector and the annotated vectors; and
    if the similarity value exceeds a preset threshold, deleting from the initial graph the unannotated target entities and target relationships corresponding to the unannotated vector, to obtain the target graph.
PCT/CN2021/096862 2020-07-27 2021-05-28 Knowledge-graph-based text comparison method, apparatus, device and storage medium WO2022022045A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010734571.3A CN111897970B (zh) 2020-07-27 2020-07-27 Knowledge-graph-based text comparison method, apparatus, device and storage medium
CN202010734571.3 2020-07-27

Publications (1)

Publication Number Publication Date
WO2022022045A1 true WO2022022045A1 (zh) 2022-02-03

Family

ID=73190588

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/096862 WO2022022045A1 (zh) 2020-07-27 2021-05-28 Knowledge-graph-based text comparison method, apparatus, device and storage medium

Country Status (2)

Country Link
CN (1) CN111897970B (zh)
WO (1) WO2022022045A1 (zh)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111897970B (zh) * 2020-07-27 2024-05-10 平安科技(深圳)有限公司 Knowledge-graph-based text comparison method, apparatus, device and storage medium
CN113051407B (zh) * 2021-03-26 2022-10-21 烽火通信科技股份有限公司 Method and apparatus for collaborative construction and sharing of a knowledge graph for intelligent network operation and maintenance
CN113220827B (zh) * 2021-04-23 2023-03-28 哈尔滨工业大学 Agricultural corpus construction method and apparatus
CN113128231A (zh) * 2021-04-25 2021-07-16 深圳市慧择时代科技有限公司 Data quality inspection method and apparatus, storage medium and electronic device
CN113408271B (zh) * 2021-06-16 2021-11-30 北京来也网络科技有限公司 RPA- and AI-based information extraction method, apparatus, device and medium
CN113742495B (zh) * 2021-09-07 2024-02-23 平安科技(深圳)有限公司 Prediction-model-based rating feature weight determination method and apparatus, and electronic device
CN113590846B (zh) * 2021-09-24 2021-12-17 天津汇智星源信息技术有限公司 Legal knowledge graph construction method and related device
CN114547327A (zh) * 2022-01-19 2022-05-27 北京吉威数源信息技术有限公司 Spatio-temporal big data relationship graph generation method, apparatus, device and storage medium
CN114925210B (zh) * 2022-03-21 2023-12-08 中国电信股份有限公司 Knowledge graph construction method, apparatus, medium and device
CN114880023B (zh) * 2022-07-11 2022-09-30 山东大学 Technical-feature-oriented source code comparison method, system and program product
CN117475086A (zh) * 2023-12-22 2024-01-30 知呱呱(天津)大数据技术有限公司 Diffusion-model-based method and system for generating figures for scientific and technical literature

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5371807A (en) * 1992-03-20 1994-12-06 Digital Equipment Corporation Method and apparatus for text classification
CN107633005A (zh) * 2017-08-09 2018-01-26 广州思涵信息科技有限公司 Knowledge graph construction and comparison system and method based on classroom teaching content
CN110825882A (zh) * 2019-10-09 2020-02-21 西安交通大学 Knowledge-graph-based information system management method
CN111259897A (zh) * 2018-12-03 2020-06-09 杭州翼心信息科技有限公司 Knowledge-aware text recognition method and system
CN111428044A (zh) * 2020-03-06 2020-07-17 中国平安人寿保险股份有限公司 Method, apparatus, device and storage medium for multimodal acquisition of supervision recognition results
CN111897970A (zh) * 2020-07-27 2020-11-06 平安科技(深圳)有限公司 Knowledge-graph-based text comparison method, apparatus, device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241538B (zh) * 2018-09-26 2022-12-20 上海德拓信息技术股份有限公司 Chinese entity relation extraction method based on keywords and verb dependency
CN109284396A (zh) * 2018-09-27 2019-01-29 北京大学深圳研究生院 Medical knowledge graph construction method, apparatus, server and storage medium
CN110543571A (zh) * 2019-08-07 2019-12-06 北京市天元网络技术股份有限公司 Knowledge graph construction method and apparatus for water conservancy informatization
CN111177393B (zh) * 2020-01-02 2023-03-24 广东博智林机器人有限公司 Knowledge graph construction method and apparatus, electronic device and storage medium

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114138985B (zh) * 2022-02-08 2022-04-26 深圳希施玛数据科技有限公司 Text data processing method and apparatus, computer device and storage medium
CN114138985A (zh) * 2022-02-08 2022-03-04 深圳希施玛数据科技有限公司 Text data processing method and apparatus, computer device and storage medium
CN114661872A (zh) * 2022-02-25 2022-06-24 北京大学 Beginner-oriented API adaptive recommendation method and system
CN114661872B (zh) * 2022-02-25 2023-07-21 北京大学 Beginner-oriented API adaptive recommendation method and system
CN114741522A (zh) * 2022-03-11 2022-07-12 北京师范大学 Text analysis method and apparatus, storage medium and electronic device
CN114741468B (zh) * 2022-03-22 2024-03-29 平安科技(深圳)有限公司 Text deduplication method, apparatus, device and storage medium
CN114372732A (zh) * 2022-03-22 2022-04-19 杭州杰牌传动科技有限公司 Collaborative manufacturing method and system for gear reduction motors realizing intelligent matching of user requirements
CN114741468A (zh) * 2022-03-22 2022-07-12 平安科技(深圳)有限公司 Text deduplication method, apparatus, device and storage medium
CN114496115A (zh) * 2022-04-18 2022-05-13 北京白星花科技有限公司 Method and system for automatically generating entity relationship annotations
CN114496115B (zh) * 2022-04-18 2022-08-23 北京白星花科技有限公司 Method and system for automatically generating entity relationship annotations
CN114742029A (zh) * 2022-04-20 2022-07-12 中国传媒大学 Chinese text comparison method, storage medium and device
US20230359825A1 (en) * 2022-05-06 2023-11-09 Sap Se Knowledge graph entities from text
CN114707005A (zh) * 2022-06-02 2022-07-05 浙江建木智能***有限公司 Knowledge graph construction method and system for ship equipment
CN114707005B (zh) * 2022-06-02 2022-10-25 浙江建木智能***有限公司 Knowledge graph construction method and system for ship equipment
WO2023246849A1 (zh) * 2022-06-22 2023-12-28 青岛海尔电冰箱有限公司 Feedback data graph generation method and refrigerator
CN114783559B (zh) * 2022-06-23 2022-09-30 浙江太美医疗科技股份有限公司 Medical imaging report information extraction method and apparatus, electronic device and storage medium
CN114783559A (zh) * 2022-06-23 2022-07-22 浙江太美医疗科技股份有限公司 Medical imaging report information extraction method and apparatus, electronic device and storage medium
CN115129719A (zh) * 2022-06-28 2022-09-30 深圳市规划和自然资源数据管理中心 Knowledge-graph-based qualitative location spatial extent construction method
CN114996389B (zh) * 2022-08-04 2022-10-11 中科雨辰科技有限公司 Annotation category consistency checking method, storage medium and electronic device
CN114996389A (zh) * 2022-08-04 2022-09-02 中科雨辰科技有限公司 Annotation category consistency checking method, storage medium and electronic device
CN115358341A (zh) * 2022-08-30 2022-11-18 北京睿企信息科技有限公司 Relation-model-based coreference disambiguation training method and system
CN115909386A (zh) * 2023-01-06 2023-04-04 中国石油大学(华东) Method, device and storage medium for completion and error correction of piping and instrumentation diagrams
CN115909386B (zh) * 2023-01-06 2023-05-12 中国石油大学(华东) Method, device and storage medium for completion and error correction of piping and instrumentation diagrams
CN115880120A (zh) * 2023-02-24 2023-03-31 江西微博科技有限公司 Online government affairs service system and service method
CN116703441A (zh) * 2023-05-25 2023-09-05 云内控科技有限公司 Knowledge-graph-based visual analysis method for medical project cost accounting
CN116882408B (zh) * 2023-09-07 2024-02-27 南方电网数字电网研究院有限公司 Transformer graph model construction method and apparatus, computer device and storage medium
CN116882408A (zh) * 2023-09-07 2023-10-13 南方电网数字电网研究院有限公司 Transformer graph model construction method and apparatus, computer device and storage medium
CN117195913B (zh) * 2023-11-08 2024-02-27 腾讯科技(深圳)有限公司 Text processing method and apparatus, electronic device, storage medium and program product
CN117195913A (zh) * 2023-11-08 2023-12-08 腾讯科技(深圳)有限公司 Text processing method and apparatus, electronic device, storage medium and program product
CN117332282A (zh) * 2023-11-29 2024-01-02 之江实验室 Knowledge-graph-based event matching method and apparatus
CN117332282B (zh) * 2023-11-29 2024-03-08 之江实验室 Knowledge-graph-based event matching method and apparatus
CN117371534A (zh) * 2023-12-07 2024-01-09 同方赛威讯信息技术有限公司 BERT-based knowledge graph construction method and system
CN117371534B (zh) * 2023-12-07 2024-02-27 同方赛威讯信息技术有限公司 BERT-based knowledge graph construction method and system
CN117454884A (zh) * 2023-12-20 2024-01-26 上海蜜度科技股份有限公司 Historical figure information error correction method, system, electronic device and storage medium
CN117454884B (zh) * 2023-12-20 2024-04-09 上海蜜度科技股份有限公司 Historical figure information error correction method, system, electronic device and storage medium
CN118171727A (zh) * 2024-05-16 2024-06-11 神思电子技术股份有限公司 Triple generation method, apparatus, device, medium and program product

Also Published As

Publication number Publication date
CN111897970A (zh) 2020-11-06
CN111897970B (zh) 2024-05-10

Similar Documents

Publication Publication Date Title
WO2022022045A1 (zh) Knowledge-graph-based text comparison method, apparatus, device and storage medium
US20230142217A1 (en) Model Training Method, Electronic Device, And Storage Medium
US9317498B2 (en) Systems and methods for generating summaries of documents
WO2021068339A1 (zh) 文本分类方法、装置及计算机可读存储介质
US10025819B2 (en) Generating a query statement based on unstructured input
US10740545B2 (en) Information extraction from open-ended schema-less tables
WO2021121198A1 (zh) 基于语义相似度的实体关系抽取方法、装置、设备及介质
Bhargava et al. Atssi: Abstractive text summarization using sentiment infusion
Al-Anzi et al. Beyond vector space model for hierarchical Arabic text classification: A Markov chain approach
US20130060769A1 (en) System and method for identifying social media interactions
US9477756B1 (en) Classifying structured documents
WO2020232943A1 (zh) 用于事件预测的知识图构建方法与事件预测方法
Zhang et al. A comprehensive survey of abstractive text summarization based on deep learning
US20210073257A1 (en) Logical document structure identification
Han et al. Text Summarization Using FrameNet‐Based Semantic Graph Model
Beheshti et al. Big data and cross-document coreference resolution: Current state and future opportunities
WO2015084757A1 (en) Systems and methods for processing data stored in a database
Makrynioti et al. PaloPro: a platform for knowledge extraction from big social data and the news
Khalid et al. Reference terms identification of cited articles as topics from citation contexts
US20230282018A1 (en) Generating weighted contextual themes to guide unsupervised keyphrase relevance models
Mishra et al. A novel approach to capture the similarity in summarized text using embedded model
Mishra et al. Similarity search based on text embedding model for detection of near duplicates
SCALIA Network-based content geolocation on social media for emergency management
Sonbhadra et al. Email classification via intention-based segmentation
Croce et al. Grammatical Feature Engineering for Fine-grained IR Tasks.

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21849774

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21849774

Country of ref document: EP

Kind code of ref document: A1