WO2022222300A1 - Open relation extraction method and apparatus, electronic device and storage medium - Google Patents

Open relation extraction method and apparatus, electronic device and storage medium

Info

Publication number
WO2022222300A1
Authority
WO
WIPO (PCT)
Prior art keywords
entity
relationship
original
open
data set
Prior art date
Application number
PCT/CN2021/109488
Other languages
English (en)
Chinese (zh)
Inventor
朱昱锦
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2022222300A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to an open relationship extraction method, apparatus, electronic device, and computer-readable storage medium.
  • Relation extraction is an important supporting technology in information extraction and knowledge graph construction, with many practical scenarios, such as building large-scale general or vertical-domain graphs and extracting information from application forms for pre-loan review.
  • traditional relation extraction faces two problems that make it difficult to put into practice: 1) it requires a large amount of labeled data to train the relation classification model, resulting in high data and labeling costs; 2) the relation types usually must be defined by the business, are limited in number, and cannot be changed, while many requirements have no predefined relation set.
  • open relation extraction takes a piece of text as input and automatically outputs all possible relation triples (head entity, relation, tail entity) and pairs (head entity, tail entity) contained in it.
  • the "relationship" field in the triple is the descriptor that comes with the context.
  • Open relation extraction has always been intractable due to type uncertainty.
  • existing solutions fall into two categories: 1. classic methods such as ReVerb, OLLIE, and OpenIE, but most of these solutions target English, are difficult to migrate to Chinese text, and use strict matching rules and inflexible processing; 2. a scheme that uses two layers of network blocks to process text, first extracting the head entity and then, from the head-entity output and the hidden layer, jointly extracting the tail entity and determining the relation type, which forms a matrix whose rows are the relation classes and whose columns are the text length;
  • because the relation is taken from the text itself, the number of relation types becomes the text length, so the model must compute a tensor of size (number of batch samples) × (number of head entities) × (text length) × (text length);
  • this scheme handles the multi-triple problem in text and improves accuracy, but it occupies a large amount of computing resources and is extremely inefficient.
  • An open relation extraction method comprising:
  • segmenting the text to be classified to obtain segmented text, and extracting entities in the segmented text by using the open entity extraction model;
  • predicting the entity relationship of the entities by using the open relation extraction model, and clustering the entities and the entity relationships to obtain a relation extraction result.
  • An apparatus for extracting open relationships comprising:
  • a training set construction module, used to obtain an original entity data set and an original relationship data set, perform remote supervision on the original entity data set and the original relationship data set respectively, and perform entity linking between the supervised original entity data set and the original relationship data set to obtain an original training set;
  • an entity reinforcement module, used to sequentially perform strategy labeling and entity reinforcement processing on the original training set to obtain a standard training set;
  • a model building module, used to obtain a pre-trained language model, perform entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and perform relation fine-tuning on the language model using the standard training set to obtain an open relation extraction model;
  • an entity extraction module, used to segment the text to be classified to obtain segmented text, and extract entities in the segmented text by using the open entity extraction model;
  • a relationship extraction module, used to predict the entity relationship of the entities by using the open relation extraction model, and cluster the entities and the entity relationships to obtain a relation extraction result.
  • An electronic device comprising:
  • a memory storing instructions; and a processor that executes the instructions stored in the memory to implement the following steps:
  • segmenting the text to be classified to obtain segmented text, and extracting entities in the segmented text by using the open entity extraction model;
  • predicting the entity relationship of the entities by using the open relation extraction model, and clustering the entities and the entity relationships to obtain a relation extraction result.
  • the present application also provides a computer-readable storage medium, where at least one instruction is stored in the computer-readable storage medium, and the at least one instruction is executed by a processor in an electronic device to implement the following steps:
  • segmenting the text to be classified to obtain segmented text, and extracting entities in the segmented text by using the open entity extraction model;
  • predicting the entity relationship of the entities by using the open relation extraction model, and clustering the entities and the entity relationships to obtain a relation extraction result.
  • the present application can solve the problem of low extraction efficiency of open relationships.
  • FIG. 1 is a schematic flowchart of an open relationship extraction method provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of the detailed implementation flow of one of the steps in FIG. 1;
  • FIG. 3 is a schematic diagram of the detailed implementation flow of another step in FIG. 1;
  • FIG. 4 is a schematic diagram of the detailed implementation flow of another step in FIG. 1;
  • FIG. 5 is a schematic diagram of the detailed implementation flow of another step in FIG. 1;
  • FIG. 6 is a functional block diagram of an apparatus for extracting open relationships provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an electronic device implementing the method for extracting an open relationship according to an embodiment of the present application.
  • the embodiment of the present application provides an open relationship extraction method.
  • the execution subject of the open relationship extraction method includes, but is not limited to, at least one of the electronic devices, such as a server and a terminal, that can be configured to execute the method provided by the embodiments of the present application.
  • the open relationship extraction method can be executed by software or hardware installed in a terminal device or a server device, and the software can be a blockchain platform.
  • the server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
  • the open relationship extraction method includes:
  • S1: obtain the original entity data set and the original relationship data set, perform remote supervision on the original entity data set and the original relationship data set respectively, and perform entity linking between the supervised original entity data set and the original relationship data set to obtain the original training set.
  • the obtaining of the original entity data set and the original relational data set includes:
  • the preset data crawling tool may be the Hawk data crawling tool;
  • the source websites may be portal websites and professional websites in different fields, including finance, law, medical care, education, entertainment, sports, etc.;
  • the text data in the source websites can be directly crawled.
  • 3 sentences may be set as the minimum segmentation unit of the text data, with each sentence no longer than 256 characters; when a segment exceeds this length, it is cut down to 2 sentences or even 1 sentence, or skipped directly.
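  • As an illustration, the following is a minimal sketch of this segmentation strategy (the punctuation-based sentence splitting is an assumption; the publication does not specify how sentence boundaries are detected):

```python
import re

def segment_text(text: str, max_sents: int = 3, max_len: int = 256):
    """Split raw text into segments of up to `max_sents` sentences.

    A window whose sentences each fit within `max_len` characters is kept;
    otherwise the window falls back to 2 sentences, then 1, then the
    over-long sentence is skipped, mirroring the strategy described above.
    """
    # Naive sentence splitting on Chinese/Western terminal punctuation.
    sentences = [s for s in re.split(r"(?<=[。！？.!?])\s*", text) if s]
    segments, i = [], 0
    while i < len(sentences):
        placed = False
        for k in range(max_sents, 0, -1):  # try 3 sentences, then 2, then 1
            window = sentences[i:i + k]
            if window and all(len(s) <= max_len for s in window):
                segments.append("".join(window))
                i += k
                placed = True
                break
        if not placed:  # a single sentence already exceeds max_len: skip it
            i += 1
    return segments
```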
  • the open source entity data sets may include data sets such as the Chinese General Encyclopedia Knowledge Graph (CN-DBpedia); CN-DBpedia mainly extracts entity information from the plain-text pages of Chinese encyclopedia websites (such as Baidu Encyclopedia, Interactive Encyclopedia, Chinese Wikipedia, etc.), and after filtering, fusion, inference and other operations, finally forms a high-quality structured data set.
  • the graph contains not only (head entity, relationship, tail entity) triple information, but also entity description information (from Baidu Encyclopedia, etc.).
  • performing deduplication processing on the triplet information to obtain deduplicated triples includes:
  • if the target triplet is not repeated, reselecting a target triplet from the entity data set for calculation;
  • if the target triplet is repeated, deleting the target triplet to obtain deduplicated triples.
  • the following distance algorithm is used to calculate the distance value between the target triplet and all unselected triplet information in the entity data set, where d is the distance value, w_j is the j-th target triple, w_k is any unselected triple in the entity data set, and n is the number of triples in the entity data set.
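  • The publication text does not reproduce the distance formula itself at this point, so the sketch below stands in a plain Levenshtein edit distance over the concatenated triple fields, and treats a distance of 0 as "repeated"; both choices are assumptions for illustration only:

```python
def edit_distance(a: str, b: str) -> int:
    # Standard dynamic-programming Levenshtein distance.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def deduplicate(triples):
    """Keep one copy of each triple; `triples` is a list of (head, rel, tail)."""
    kept = []
    for w_j in triples:
        s_j = "|".join(w_j)
        # d == 0 against an already-kept triple marks w_j as a repeat (assumption).
        if all(edit_distance(s_j, "|".join(w_k)) > 0 for w_k in kept):
            kept.append(w_j)
    return kept
```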
  • performing remote supervision on the original entity data set and the original relationship data set respectively, and performing entity linking between the supervised original entity data set and the original relationship data set to obtain the original training set, includes:
  • finally, the original training set is obtained by summarizing the text segments and the triplet information.
  • remote supervision refers to a method of using ready-made triples in an open-source knowledge graph to perform automatic labeling without manual participation, obtaining a large labeled data set.
  • the triples in the original entity data set are matched against the text segments in the original relationship data set, requiring at least that the head entity and the tail entity of a triple appear in the context of the current text segment, and the positions of the entities in the current text segment are labeled (e.g., "text": text segment, "entity_idx": {entity_1: [start, end], entity_2: [start, end], ...}, where "text" represents the current text segment and "entity_idx" represents the positions of the entities in it); the matched triples and text segments are aggregated to obtain the matching data.
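  • A minimal sketch of this matching step, producing records in the format above (exact string matching of entity mentions is assumed):

```python
def distant_supervision(triples, segments):
    """Match knowledge-graph triples against text segments.

    A segment is labeled only when both the head and the tail entity of a
    triple literally occur in it; entity positions are recorded as
    [start, end) character offsets, following the format described above.
    """
    samples = []
    for head, rel, tail in triples:
        for seg in segments:
            h, t = seg.find(head), seg.find(tail)  # first occurrence only
            if h == -1 or t == -1:
                continue
            samples.append({
                "text": seg,
                "triple": (head, rel, tail),
                "entity_idx": {
                    head: [h, h + len(head)],
                    tail: [t, t + len(tail)],
                },
            })
    return samples
```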
  • a pre-built disambiguation model can be used to perform the entity linking; the pre-built disambiguation model can be a BERT model trained on open-source short-text matching. With the BERT model as the backbone, the text in the matching data (the triplet and the text segment where it occurs) is spliced with the description of the triplet in the original entity data set as input, and a matching probability is output. The preset threshold can be 0.5: when the matching probability is greater than 0.5, the entities in the matching data and in the original entity data set are of the same type.
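  • A sketch of this linking step with the Hugging Face Transformers library as a stand-in backbone; the checkpoint name and the two-way classification head are assumptions, and in practice the model would first be fine-tuned on a short-text-matching corpus:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Assumed checkpoint: any Chinese BERT fine-tuned for short-text matching.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)
model.eval()

def entity_link(mention_context: str, kg_description: str, threshold: float = 0.5) -> bool:
    """Return True when the mention and the KG entity are judged the same type."""
    # Splice the matching-data text and the KG description as a sentence pair.
    inputs = tokenizer(mention_context, kg_description,
                       truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    match_prob = torch.softmax(logits, dim=-1)[0, 1].item()
    return match_prob > threshold
```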
  • sequentially performing strategy labeling and entity reinforcement processing on the original training set to obtain a standard training set includes:
  • strategy labeling may be performed based on the MTB (Matching The Blank) method, wherein the preset labeling symbols may be <tag> and </tag>, and the part enclosed by <tag> and </tag> is the mention of the entity or relation in the sentence.
  • the classification sample can be [CLS]XXX<entity_head>XXX</entity_head>XXX<rel>XXX</rel>XXX<entity_tail>XXX</entity_tail>XXX[SEP], where entity_head, rel, and entity_tail represent the head entity, relation, and tail entity respectively.
  • [CLS] and [SEP] are the special spacer tokens:
  • [CLS] is the classification bit; this position outputs the binary classification result 0/1, indicating whether a relationship currently exists between the two entities;
  • [SEP] is the termination bit, indicating the end of the sentence.
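  • For illustration, a sketch that builds such a sample from character offsets; the span bookkeeping is ours, and only the marker format comes from the description above:

```python
MARKS = {"head": ("<entity_head>", "</entity_head>"),
         "rel": ("<rel>", "</rel>"),
         "tail": ("<entity_tail>", "</entity_tail>")}

def build_mtb_sample(text: str, spans: dict) -> str:
    """Wrap the head entity, relation mention and tail entity in MTB-style
    markers, then enclose the sentence in [CLS] ... [SEP].

    `spans` maps "head"/"rel"/"tail" to [start, end) character offsets;
    spans are processed right-to-left so earlier offsets stay valid.
    """
    for name, (start, end) in sorted(spans.items(), key=lambda kv: -kv[1][0]):
        open_tag, close_tag = MARKS[name]
        text = text[:start] + open_tag + text[start:end] + close_tag + text[end:]
    return "[CLS]" + text + "[SEP]"

# build_mtb_sample("库克就任苹果公司CEO", {"head": [0, 2], "rel": [2, 4], "tail": [4, 11]})
```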
  • the BIO sequence labeling mode may be used to label the entities in the classification samples: the tokens of an entity mention are labeled B or I, and non-entity tokens are labeled O. Since this is open entity recognition, there are only two categories: entity or not.
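  • A sketch of the character-level BIO labeling for this open (type-free) setting:

```python
def bio_labels(text: str, entity_spans):
    """Character-level BIO tags: only B/I/O are needed, because the task is a
    binary "is an entity / is not an entity" decision, as noted above."""
    labels = ["O"] * len(text)
    for start, end in entity_spans:  # [start, end) character offsets
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels

# bio_labels("库克就任苹果公司CEO", [(0, 2)])  ->  ['B', 'I', 'O', 'O', ...]
```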
  • the preset natural language processing library can be the HanLP natural language processing library; the dependency syntax parsing tool in HanLP is used to analyze the prefix of the current entity and enhance it. For example, if the current entity is "Cook" and is prefixed with "Apple CEO", the enhanced entity is "Apple CEO Cook".
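  • A sketch of the prefix-based enhancement; to avoid depending on a particular HanLP API, it operates on pre-parsed dependency arcs (token list plus head indices), which HanLP's dependency parser can supply:

```python
def enhance_entity(tokens, heads, entity_index):
    """Expand an entity mention with its contiguous pre-modifiers.

    `tokens` is the word list of the sentence, `heads[i]` is the index of
    token i's syntactic head (-1 for the root), and `entity_index` points at
    the entity token. Tokens immediately before the entity that attach to it,
    directly or transitively, are prepended to the mention.
    """
    def attaches_to_entity(i):
        seen = set()
        while i not in seen and 0 <= i < len(tokens):
            if i == entity_index:
                return True
            seen.add(i)
            i = heads[i]
        return False

    start = entity_index
    while start > 0 and attaches_to_entity(start - 1):
        start -= 1
    return "".join(tokens[start:entity_index + 1])

# enhance_entity(["苹果", "CEO", "库克", "宣布"], [2, 2, 3, -1], 2)  ->  "苹果CEO库克"
```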
  • by performing strategy labeling and entity enhancement processing on the original training set, the embodiments of the present application can improve the accuracy of model training.
  • the pre-trained language model may be a large-scale unsupervised pre-trained language model based on the BERT algorithm in the open-source Transformers project.
  • the model is written using the PyTorch framework and has been pre-trained on a large-scale open-source Chinese corpus. The training process uses a cloze task to determine the error: a few characters in the input Chinese corpus text are intentionally masked, the model is checked on whether it predicts the masked characters from the unmasked context, and the difference between the model's predictions and the true values is computed until the difference falls below a pre-specified threshold.
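  • A minimal sketch of this cloze check using the Hugging Face Transformers masked-language-model head (the checkpoint name is an assumption):

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
mlm = BertForMaskedLM.from_pretrained("bert-base-chinese")
mlm.eval()

text = "库克就任苹果公司[MASK][MASK][MASK]"  # intentionally mask a few characters
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = mlm(**inputs).logits

# Recover the model's guess for each masked position from the unmasked context.
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().squeeze(-1)
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.convert_ids_to_tokens(predicted_ids.tolist()))
```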
  • using the standard training set to fine-tune the language model to obtain the open entity extraction model and the open relation extraction model includes:
  • the relation span can be represented by a one-hot vector, and the [CLS] bit is fed to the binary classification linear layer to determine the prediction result between the entities: 0 indicates that the relationship does not exist, 1 indicates that it exists.
  • relation prediction is thus simplified into a finite binary classification problem, which greatly simplifies the training process of the model.
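  • A sketch of this formulation as a PyTorch module: a BERT encoder with a two-way linear layer on the [CLS] position (the encoder checkpoint is an assumption):

```python
import torch.nn as nn
from transformers import BertModel

class OpenRelationClassifier(nn.Module):
    """BERT encoder plus a binary linear head on [CLS]: output 1 means a
    relationship exists between the marked entities, 0 means it does not."""

    def __init__(self, pretrained: str = "bert-base-chinese"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(pretrained)
        self.cls_head = nn.Linear(self.encoder.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        cls_vec = hidden[:, 0]         # the [CLS] token is always position 0
        return self.cls_head(cls_vec)  # logits for {no relation, relation}
```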
  • the text to be classified is segmented to obtain segmented text, and the open entity extraction model is used to extract entities in the segmented text, including:
  • entities in the text to be classified can be rapidly extracted through the open entity extraction model, which improves the speed of entity relationship prediction.
  • predicting the entity relationship of the entities by using the open relation extraction model, and clustering the entities and the entity relationships to obtain a relation extraction result, includes:
  • the open relation extraction model is used to extract the relationships in the segmented sentences to be classified, and the entities to be classified that have no relationship are filtered out to obtain predicted triples;
  • the predicted triples are clustered by using a preset clustering method to obtain a plurality of clusters, wherein the clusters include the relation extraction result.
  • a triple: (head entity, relationship, tail entity)
  • a pair: (head entity, None, tail entity)
  • the preset clustering method may be a K-means clustering method.
  • the K-means clustering method vectorizes the relations in the predicted triples through the word2vec algorithm and calculates the distances between the vectors; according to these distances, the predicted triples are gathered around K central points to form K clusters. A type name is then manually summarized for each cluster, so as to classify the predicted triples.
  • when each cluster is stable (no longer changing), the mean of all relation vectors in the cluster is computed, and each new relation is compared with the mean of each existing cluster. If its similarity to one or more clusters (which can be measured by Euclidean distance) is higher than the predefined similarity threshold, it is classified into the most similar cluster; if its similarity to all clusters is lower than the predefined similarity threshold, it is placed in a separate "unknown" class. When the relations in the "unknown" class accumulate to a certain amount (usually 70% of the known-class relations), the K-means clustering and manual type definition are repeated for the unknown relations.
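  • A sketch of the clustering and of the incremental assignment of new relations; the relation vectors are assumed to come from word2vec as described above, and the concrete similarity measure and threshold are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_relations(relation_vecs: np.ndarray, k: int):
    """Gather predicted relation vectors into K clusters; the returned
    cluster centers are the per-cluster means of the relation vectors."""
    km = KMeans(n_clusters=k, n_init=10).fit(relation_vecs)
    return km.labels_, km.cluster_centers_

def assign_new_relation(vec: np.ndarray, cluster_means: np.ndarray,
                        sim_threshold: float = 0.8) -> int:
    """Route a new relation vector to its most similar stable cluster, or to
    the 'unknown' class (returned as -1) when nothing is close enough.
    Cosine similarity is used here as the (assumed) similarity measure."""
    sims = cluster_means @ vec / (
        np.linalg.norm(cluster_means, axis=1) * np.linalg.norm(vec) + 1e-12)
    best = int(np.argmax(sims))
    return best if sims[best] >= sim_threshold else -1
```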
  • the extracted open relationships can be automatically classified, which improves the efficiency of open relationship extraction.
  • Referring to FIG. 6, it is a functional block diagram of an apparatus for extracting an open relationship provided by an embodiment of the present application.
  • the open relationship extraction apparatus 100 described in this application may be installed in an electronic device. According to the implemented functions, the open relationship extraction apparatus 100 may include a training set construction module 101 , an entity enhancement module 102 , a model construction module 103 , an entity extraction module 104 and a relationship extraction module 105 .
  • the modules described in this application may also be referred to as units, which refer to a series of computer program segments that can be executed by the processor of an electronic device and can perform fixed functions, and are stored in the memory of the electronic device.
  • each module/unit is as follows:
  • the training set construction module 101 is used to obtain the original entity data set and the original relationship data set, perform remote supervision on the original entity data set and the original relationship data set respectively, and perform entity linking between the supervised original entity data set and the original relationship data set to obtain the original training set.
  • in detail, the training set construction module 101 crawls the source websites with the preset data crawling tool, segments the crawled text data, deduplicates the triplet information of the open-source entity data sets and summarizes it with the corresponding description information to obtain the original entity data set, and performs remote supervision and entity linking in the same manner as described for step S1 above, which will not be repeated here.
  • the entity reinforcement module 102 is configured to sequentially perform strategy labeling and entity reinforcement processing on the original training set to obtain a standard training set.
  • in detail, the entity reinforcement module 102 performs MTB-based strategy labeling, BIO sequence labeling and dependency-based entity enhancement in the same manner as described above, which will not be repeated here.
  • the model building module 103 is used to obtain a pre-trained language model, perform entity fine-tuning on the language model using the standard training set to obtain an open entity extraction model, and perform relation fine-tuning on the language model using the standard training set to obtain an open relation extraction model.
  • in detail, the model building module 103 fine-tunes the pre-trained BERT language model described above with the standard training set, and a preset binary classification linear layer outputs the prediction result between entities (whether a relationship exists), in the same manner as described above, which will not be repeated here.
  • the entity extraction module 104 is configured to segment the text to be classified, obtain segmented text, and extract entities in the segmented text by using the open entity extraction model.
  • in detail, the entity extraction module 104 uses the open entity extraction model to extract all entities in the text to be classified to obtain the entities to be classified, as described above.
  • the relationship extraction module 105 is configured to use the open relationship extraction model to predict the entity relationship of the entity, and to cluster the entity and the entity relationship to obtain a relationship extraction result.
  • in detail, the relationship extraction module 105 extracts the relationships in the segmented sentences to be classified, filters out the entities that have no relationship to obtain predicted triples, and clusters the predicted triples with the K-means method in the same manner as described above, which will not be repeated here.
  • Referring to FIG. 7, it is a schematic structural diagram of an electronic device implementing the open relationship extraction method provided by an embodiment of the present application.
  • the electronic device 1 may include a processor 10, a memory 11 and a bus, and may also include a computer program stored in the memory 11 and executable on the processor 10, such as an open relationship extraction program 12.
  • the memory 11 includes at least one type of readable storage medium, including flash memory, mobile hard disk, multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disk, optical disc, etc.
  • the memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, such as a mobile hard disk of the electronic device 1 .
  • the memory 11 may also be an external storage device of the electronic device 1, such as a pluggable mobile hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card (Flash Card) equipped on the electronic device 1.
  • the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 can not only be used to store application software installed in the electronic device 1 and various types of data, such as the code of the open relationship extraction program 12, etc., but also can be used to temporarily store data that has been output or will be output.
  • the processor 10 may be composed of integrated circuits, for example, a single packaged integrated circuit, or multiple integrated circuits packaged with the same or different functions, including one or more central processing units (CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like.
  • the processor 10 is the control core (Control Unit) of the electronic device; it connects the various components of the entire electronic device through various interfaces and lines, and executes the various functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (for example, the open relationship extraction program) and calling the data stored in the memory 11.
  • the bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
  • the bus can be divided into an address bus, a data bus, a control bus and so on.
  • the bus is configured to implement connection and communication between the memory 11, the at least one processor 10, and other components.
  • FIG. 7 only shows an electronic device with some components; those skilled in the art will understand that the structure shown in FIG. 7 does not limit the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
  • the electronic device 1 may also include a power supply (such as a battery) for powering the various components; preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device.
  • the power source may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other components.
  • the electronic device 1 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
  • the electronic device 1 may also include a network interface; optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface, a Bluetooth interface, etc.), which is usually used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the electronic device 1 may further include a user interface; the user interface may be a display (Display) or an input unit (e.g., a keyboard); optionally, the user interface may also be a standard wired interface or a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, and the like.
  • the display may also be appropriately called a display screen or a display unit, which is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
  • the open relationship extraction program 12 stored in the memory 11 of the electronic device 1 is a combination of multiple instructions, and when run by the processor 10, it can realize:
  • segmenting the text to be classified to obtain segmented text, and extracting entities in the segmented text by using the open entity extraction model;
  • predicting the entity relationship of the entities by using the open relation extraction model, and clustering the entities and the entity relationships to obtain a relation extraction result.
  • the modules/units integrated in the electronic device 1 may be stored in a computer-readable storage medium.
  • the computer-readable storage medium may be volatile or non-volatile.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, or a read-only memory (ROM).
  • the present application also provides a computer-readable storage medium, which may be volatile or non-volatile; the readable storage medium stores a computer program that, when executed by the processor of an electronic device, can achieve:
  • segmenting the text to be classified to obtain segmented text, and extracting entities in the segmented text by using the open entity extraction model;
  • predicting the entity relationship of the entities by using the open relation extraction model, and clustering the entities and the entity relationships to obtain a relation extraction result.
  • modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.
  • artificial intelligence (AI) is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • the basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms.
  • blockchain, essentially a decentralized database, is a series of data blocks associated using cryptographic methods; each data block contains a batch of network transaction information, used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to artificial intelligence technology and discloses an open relation extraction method, comprising: obtaining an original training set by means of remote supervision and entity linking techniques; performing strategy labeling and entity reinforcement processing on the original training set to obtain a standard training set; using the standard training set to perform entity fine-tuning and relation fine-tuning on a pre-trained language model to obtain an open entity extraction model and an open relation extraction model; extracting entities in the text to be classified using the open entity extraction model; and predicting the entity relationship between the entities using the open relation extraction model, and clustering the entities and the entity relationship to obtain a relation extraction result. In addition, the present application further relates to blockchain technology, and the relation extraction result can be stored in a node of a blockchain. The present application further relates to an open relation extraction apparatus, an electronic device and a computer-readable storage medium. The present application can solve the problem of relatively low extraction efficiency of open relations.
PCT/CN2021/109488 2021-04-21 2021-07-30 Open relation extraction method and apparatus, electronic device and storage medium WO2022222300A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110428927.5 2021-04-21
CN202110428927.5A CN113051356B (zh) 2021-04-21 2021-04-21 开放关系抽取方法、装置、电子设备及存储介质

Publications (1)

Publication Number Publication Date
WO2022222300A1 true WO2022222300A1 (fr) 2022-10-27

Family

ID=76519844

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/109488 WO2022222300A1 (fr) 2021-04-21 2021-07-30 Open relation extraction method and apparatus, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN113051356B (fr)
WO (1) WO2022222300A1 (fr)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051356B (zh) * 2021-04-21 2023-05-30 深圳壹账通智能科技有限公司 开放关系抽取方法、装置、电子设备及存储介质
CN113704429A (zh) * 2021-08-31 2021-11-26 平安普惠企业管理有限公司 基于半监督学习的意图识别方法、装置、设备及介质
CN113553854B (zh) * 2021-09-18 2021-12-10 航天宏康智能科技(北京)有限公司 实体关系的联合抽取方法和联合抽取装置
CN114528418B (zh) * 2022-04-24 2022-10-14 杭州同花顺数据开发有限公司 一种文本处理方法、***和存储介质


Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140032209A1 (en) * 2012-07-27 2014-01-30 University Of Washington Through Its Center For Commercialization Open information extraction
US11693873B2 (en) * 2016-02-03 2023-07-04 Global Software Innovation Pty Ltd Systems and methods for using entity/relationship model data to enhance user interface engine
US11210324B2 (en) * 2016-06-03 2021-12-28 Microsoft Technology Licensing, Llc Relation extraction across sentence boundaries
CN109472033B (zh) * 2018-11-19 2022-12-06 华南师范大学 文本中的实体关系抽取方法及***、存储介质、电子设备
CN112487203B (zh) * 2019-01-25 2024-01-16 中译语通科技股份有限公司 一种融入动态词向量的关系抽取***
US10943068B2 (en) * 2019-03-29 2021-03-09 Microsoft Technology Licensing, Llc N-ary relation prediction over text spans
CN111291185B (zh) * 2020-01-21 2023-09-22 京东方科技集团股份有限公司 信息抽取方法、装置、电子设备及存储介质
CN111339774B (zh) * 2020-02-07 2022-11-29 腾讯科技(深圳)有限公司 文本的实体关系抽取方法和模型训练方法
CN111324743A (zh) * 2020-02-14 2020-06-23 平安科技(深圳)有限公司 文本关系抽取的方法、装置、计算机设备及存储介质
CN111881256B (zh) * 2020-07-17 2022-11-08 中国人民解放军战略支援部队信息工程大学 文本实体关系抽取方法、装置及计算机可读存储介质设备
CN111950269A (zh) * 2020-08-21 2020-11-17 清华大学 文本语句处理方法、装置、计算机设备和存储介质
CN112214610B (zh) * 2020-09-25 2023-09-08 中国人民解放军国防科技大学 一种基于跨度和知识增强的实体关系联合抽取方法
CN112507125A (zh) * 2020-12-03 2021-03-16 平安科技(深圳)有限公司 三元组信息提取方法、装置、设备及计算机可读存储介质
CN112507061A (zh) * 2020-12-15 2021-03-16 康键信息技术(深圳)有限公司 多关系医学知识提取方法、装置、设备及存储介质
CN112632975B (zh) * 2020-12-29 2024-06-07 北京明略软件***有限公司 上下游关系的抽取方法、装置、电子设备及存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150178273A1 (en) * 2013-12-20 2015-06-25 Microsoft Corporation Unsupervised Relation Detection Model Training
CN110209836A (zh) * 2019-05-17 2019-09-06 北京邮电大学 远程监督关系抽取方法及装置
CN110619053A (zh) * 2019-09-18 2019-12-27 北京百度网讯科技有限公司 实体关系抽取模型的训练方法和抽取实体关系的方法
CN113051356A (zh) * 2021-04-21 2021-06-29 深圳壹账通智能科技有限公司 开放关系抽取方法、装置、电子设备及存储介质

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116881471A (zh) * 2023-07-07 2023-10-13 深圳智现未来工业软件有限公司 一种基于知识图谱的大语言模型微调方法及装置
CN116881471B (zh) * 2023-07-07 2024-06-04 深圳智现未来工业软件有限公司 一种基于知识图谱的大语言模型微调方法及装置
CN116776886A (zh) * 2023-08-15 2023-09-19 浙江同信企业征信服务有限公司 一种信息抽取方法、装置、设备及存储介质
CN116776886B (zh) * 2023-08-15 2023-12-05 浙江同信企业征信服务有限公司 一种信息抽取方法、装置、设备及存储介质
CN117725223A (zh) * 2023-11-20 2024-03-19 中国科学院成都文献情报中心 面向知识发现的科学实验知识图谱构建方法及***

Also Published As

Publication number Publication date
CN113051356A (zh) 2021-06-29
CN113051356B (zh) 2023-05-30

Similar Documents

Publication Publication Date Title
WO2022222300A1 (fr) Open relation extraction method and apparatus, electronic device and storage medium
WO2021212682A1 (fr) Knowledge extraction method and apparatus, electronic device, and storage medium
WO2022022045A1 (fr) Knowledge-graph-based text comparison method and apparatus, device, and storage medium
CN108875051B (zh) 面向海量非结构化文本的知识图谱自动构建方法及***
CN108399228B (zh) 文章分类方法、装置、计算机设备及存储介质
WO2021068339A1 (fr) Text classification method and device, and computer-readable storage medium
US10997369B1 (en) Systems and methods to generate sequential communication action templates by modelling communication chains and optimizing for a quantified objective
WO2021121198A1 (fr) Semantic-similarity-based entity relation extraction method and apparatus, device, and medium
WO2020108063A1 (fr) Feature word determination method and apparatus, and server
WO2020252919A1 (fr) Resume identification method and apparatus, computer device, and storage medium
JP7164701B2 (ja) セマンティックテキストデータをタグとマッチングさせる方法、装置、及び命令を格納するコンピュータ読み取り可能な記憶媒体
CN108804423B (zh) 医疗文本特征提取与自动匹配方法和***
US11580119B2 (en) System and method for automatic persona generation using small text components
US20180068221A1 (en) System and Method of Advising Human Verification of Machine-Annotated Ground Truth - High Entropy Focus
CN110413787B (zh) 文本聚类方法、装置、终端和存储介质
WO2021208703A1 (fr) Question parsing method and apparatus, electronic device, and storage medium
WO2022116435A1 (fr) Title generation method and apparatus, electronic device, and storage medium
CN111539193A (zh) 基于本体的文档分析和注释生成
CN112989208B (zh) 一种信息推荐方法、装置、电子设备及存储介质
CN113704429A (zh) 基于半监督学习的意图识别方法、装置、设备及介质
CN113051914A (zh) 一种基于多特征动态画像的企业隐藏标签抽取方法及装置
US20230114673A1 (en) Method for recognizing token, electronic device and storage medium
CN111177375A (zh) 一种电子文档分类方法及装置
JP2020173779A (ja) 文書における見出しのシーケンスの識別
WO2022073341A1 (fr) Speech-semantics-based disease entity matching method and apparatus, and computer device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21937524

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 24.01.2024)