CN115080742B - Text information extraction method, apparatus, device, storage medium, and program product - Google Patents

Text information extraction method, apparatus, device, storage medium, and program product

Info

Publication number
CN115080742B
Authority
CN
China
Prior art keywords
target
entity
text
node
slot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210732269.3A
Other languages
Chinese (zh)
Other versions
CN115080742A (en)
Inventor
孙建东
史亚冰
蒋烨
柴春光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210732269.3A (CN115080742B)
Publication of CN115080742A
Priority to KR1020220188247A (KR20230009345A)
Priority to JP2023003753A (JP2023040248A)
Application granted
Publication of CN115080742B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2323 Non-hierarchical techniques based on graph theory, e.g. minimum spanning trees [MST] or graph cuts
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Discrete Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a text information extraction method, apparatus, device, storage medium, and program product, and relates to the technical field of data processing, in particular to artificial intelligence technologies such as knowledge graphs and natural language processing. The specific implementation is as follows: an input text is represented with a tree structure to obtain a text tree structure, the text tree structure comprising at least one tree structure path from a root node to each leaf node; a target path corresponding to a target relationship is determined from the at least one tree structure path according to the target relationship; and a target subject and a target object related to the target relationship are determined according to the target path, to obtain a text information triplet associated with the target subject, the target relationship, and the target object.

Description

Text information extraction method, apparatus, device, storage medium, and program product
Technical Field
The disclosure relates to the technical field of data processing, in particular to the technical field of artificial intelligence such as knowledge graph and natural language processing.
Background
In the technical fields of artificial intelligence such as knowledge graph, natural language processing and the like, extracting specific text information from input text is an important application branch.
Disclosure of Invention
The present disclosure provides a text information extraction method, apparatus, device, storage medium, and program product.
According to an aspect of the present disclosure, there is provided a text information extraction method, including: representing the input text by using a tree structure to obtain a text tree structure, wherein the text tree structure comprises at least one tree structure path from a root node to each leaf node; determining a target path corresponding to the target relationship from at least one tree structure path according to the target relationship; and determining a target subject and a target object related to the target relation according to the target path to obtain a text information triplet related to the target subject, the target relation and the target object.
According to another aspect of the present disclosure, there is provided a text information extraction apparatus including a text tree structure determining module, a target path determining module, and a text information triplet determining module. The text tree structure determining module is configured to characterize the input text with a tree structure to obtain a text tree structure, the text tree structure including at least one tree structure path from the root node to each leaf node. The target path determining module is configured to determine, according to the target relationship, a target path corresponding to the target relationship from the at least one tree structure path. The text information triplet determining module is configured to determine a target subject and a target object related to the target relationship according to the target path, to obtain a text information triplet associated with the target subject, the target relationship, and the target object.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor and a memory communicatively coupled to the at least one processor. Wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program stored on at least one of a readable storage medium and an electronic device, the computer program, when executed by a processor, implementing the method of the embodiments of the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates a system architecture diagram of a text information extraction method and apparatus according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a text information extraction method according to another embodiment of the present disclosure;
FIG. 3 schematically illustrates a schematic diagram of a text information extraction method according to another embodiment of the present disclosure;
FIG. 4 schematically illustrates a schematic diagram of an input text that is a technical specification;
FIG. 5 schematically illustrates a schematic diagram of a text tree structure of an input text in the case where the input text is the technical specification shown in FIG. 4;
FIG. 6 schematically illustrates a schematic diagram of a text information extraction method according to still another embodiment of the present disclosure;
FIG. 7 schematically illustrates a diagram of determining text information triples according to an embodiment of the present disclosure;
FIG. 8 schematically illustrates a block diagram of a text information extracting apparatus according to an embodiment of the present disclosure; and
FIG. 9 schematically illustrates a block diagram of an electronic device in which a text information extraction method of an embodiment of the present disclosure may be implemented.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where expressions like "at least one of A, B, and C" are used, the expressions should generally be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B, and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together).
Text information extraction techniques may extract specific text information from the input text. The following description takes as an example the extraction, from the input text, of text information triplets associated with a subject, an object, and the relationship between the subject and the object.
Text information extraction techniques may be used, for example, to support information processing and information retrieval needs such as intelligent question answering and intelligent customer service.
In some implementations, specific information in the input text may be extracted by manually defined rules. In this implementation, the manually defined rules have the disadvantage of poor generalization, for example, when the input text has new properties or new expressions, the predefined extraction rules are no longer applicable.
In some implementations, text information extraction is performed based on sentence-level information extraction techniques. For example, where the specific text information is located in a single sentence, the input text may be processed into a collection of sentences, and text information is extracted from each sentence. This implementation cannot cover the case where the specific text information is scattered across different sentences.
In some implementations, entity information including a subject and an object is extracted based on a sentence-level information extraction technique, the entity information is then combined into entity pairs, and the relation of each entity pair is determined by classifying it with a document-level classification model. This implementation suffers from error accumulation: errors made when extracting entities are passed on to the relation classification step. Furthermore, entity information extracted with sentence-level information extraction techniques cannot cover the case where entities are distributed over multiple sentences.
Fig. 1 schematically illustrates a system architecture of a text information extraction method and apparatus according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.
As shown in fig. 1, a system architecture 100 according to this embodiment may include clients 101, 102, 103, a network 104, and a server 105. The network 104 is the medium used to provide communication links between the clients 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 105 through the network 104 using clients 101, 102, 103 to receive or send messages, etc. Various communication client applications may be installed on clients 101, 102, 103, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, and the like (by way of example only).
The clients 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like. The clients 101, 102, 103 of the disclosed embodiments may, for example, run applications.
The server 105 may be a server providing various services, such as a background management server (by way of example only) that provides support for websites browsed by users using clients 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the client. In addition, the server 105 may also be a cloud server, i.e. the server 105 has cloud computing functionality.
It should be noted that the text information extraction method provided by the embodiment of the present disclosure may be executed by the server 105. Accordingly, the text information extracting device provided by the embodiment of the present disclosure may be provided in the server 105. The text information extraction method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the clients 101, 102, 103 and/or the server 105. Accordingly, the text information extraction device provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the clients 101, 102, 103 and/or the server 105.
In one example, the server 105 may obtain input text from the clients 101, 102, 103 over the network 104 and extract text information triples associated with the target subject, target relationship, and target object based on the input text.
It should be understood that the number of clients, networks, and servers in fig. 1 is merely illustrative. There may be any number of clients, networks, and servers, as desired for implementation.
It should be noted that, in the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of users' personal information all comply with the relevant laws and regulations and do not violate public order and good morals.
In the technical scheme of the disclosure, the authorization or consent of the user is obtained before the personal information of the user is obtained or acquired.
The embodiment of the present disclosure provides a text information extraction method, which is described below with reference to the accompanying drawings in conjunction with the system architecture of fig. 1. The text information extraction method of the embodiment of the present disclosure may be performed by the server 105 shown in fig. 1, for example.
Fig. 2 schematically illustrates a flowchart of a text information extraction method according to an embodiment of the present disclosure.
As shown in fig. 2, the text information extraction method 200 of the embodiment of the present disclosure may include, for example, operations S210 to S230.
In operation S210, an input text is characterized by a tree structure, resulting in a text tree structure.
A tree structure may be understood as a set of several nodes having a hierarchical relationship, where the tree structure includes a root node and a leaf node, where the leaf node may be understood as a node with a degree of 0, the root node may be understood as a node without a parent node, and the degree may be understood as the number of child nodes that a node contains.
The input text may be text of various formats and lengths; for example, the input text may be in a news format, a technical specification format, or the like.
In the text information extraction method according to the embodiment of the present disclosure, information of an input text may be represented by, for example, an entity, a relation (also called a predicate), an attribute, and the like. An entity may be an object in the real world, an attribute may be understood as a characteristic of an entity, and a relation or predicate characterizes a relationship between entities. Entities may include subjects and objects; an object may be understood as an entity that has a relationship with the subject or an entity that has a certain attribute value.
The text tree structure includes at least one tree structure path from the root node to each leaf node. It will be appreciated that the number of tree structure paths of the text tree structure is the same as the number of leaf nodes.
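The relationship between leaf nodes and tree structure paths can be illustrated with a minimal sketch. The `Node` class, the helper functions, and the news-style sample tree below are hypothetical names introduced only for illustration; they are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a text tree; `text` is the partial text it stands for."""
    text: str
    children: List["Node"] = field(default_factory=list)

    @property
    def degree(self) -> int:
        # Degree = number of child nodes; a leaf node has degree 0.
        return len(self.children)


def root_to_leaf_paths(root: Node) -> List[List[Node]]:
    """Return every path from the given root down to a leaf, depth first."""
    if root.degree == 0:
        return [[root]]
    paths = []
    for child in root.children:
        for tail in root_to_leaf_paths(child):
            paths.append([root] + tail)
    return paths


def count_leaves(root: Node) -> int:
    """Count the nodes of degree 0 in the tree."""
    if root.degree == 0:
        return 1
    return sum(count_leaves(child) for child in root.children)


# Hypothetical news-style text: a headline, one short paragraph,
# and one paragraph that contains two sentences.
tree = Node("headline", [
    Node("paragraph 1"),
    Node("paragraph 2", [Node("sentence 2a"), Node("sentence 2b")]),
])
paths = root_to_leaf_paths(tree)
```

Each leaf terminates exactly one path, so `len(paths)` equals `count_leaves(tree)` for any tree built this way.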
In operation S220, a target path corresponding to the target relationship is determined from the at least one tree-structured path according to the target relationship.
The target relationship may be predetermined, for example. For example, the target relationship may be predetermined at the client by a person desiring a triplet of text information, and the server performs the text information extraction method of the embodiments of the present disclosure in response to an instruction to determine the target relationship.
The target path may be understood as a tree-structured path associated with the target relationship, and the target path may be one or more of the tree-structured paths.
In operation S230, a target subject and a target object related to the target relationship are determined according to the target path, and a text information triplet related to the target subject, the target relationship, and the target object is obtained.
The input text is formed by organizing the text according to specific logic. For example, text information corresponding to a certain topic is concentrated in a certain portion of text, the title of a certain document is a brief description of the document, and the body of the document is a detailed description of the title.
According to the text information extraction method of the embodiment of the present disclosure, the input text is characterized with a tree structure; the resulting text tree structure has a clear hierarchy and can reflect the specific logic of the input text. The target path, determined from the tree structure paths according to the target relationship, is only a part of the text tree structure, so determining the text information triplet according to the target path avoids traversing the whole input text. In addition, the target path is highly relevant to the target relationship, so the text information triplet can be determined quickly and accurately from the target path. The text information extraction method of the embodiment of the present disclosure therefore achieves higher text information extraction efficiency.
Fig. 3 schematically illustrates a schematic diagram of a text information extraction method according to another embodiment of the present disclosure.
As shown in fig. 3, a specific example of determining a target subject and a target object related to the target relationship according to the target path, to obtain a text information triplet associated with the target subject, the target relationship, and the target object, may be implemented using the following embodiment.
In operation S331, entity identification is performed on each target node of the target path 304, and an entity identification result is obtained.
The entity identification result characterizes the relevance between the partial text corresponding to the target node and the entity attribute.
A target node may be understood as a node located on a target path. It is understood that a text tree structure is a representation of the tree structure of an input text. Each node of the text tree structure corresponds to a portion of the input text. For example, when the input text is a news format text, the input text may include, for example, a news headline and a piece of news body, and the corresponding text tree structure may include, for example, a root node and a leaf node, where a portion of the text corresponding to the root node is a news headline text, and a portion of the text corresponding to the leaf node is a news body text.
In operation S332, relationship recognition is performed on each target node of the target path, so as to obtain a relationship recognition result.
The relation recognition result represents the relevance between the partial text corresponding to the target node and the target relation.
In operation S333, the text information triplet 305 is determined according to the entity recognition result and the relationship recognition result.
The target entity is determined according to the entity attribute and the target relation, and comprises a target subject and a target object.
According to the text information extraction method of the embodiment of the present disclosure, performing entity identification on each target node of the target path yields an entity identification result that characterizes the relevance between the partial text corresponding to the target node and the entity attribute, and performing relationship identification on each target node of the target path yields a relationship identification result that characterizes the relevance between the partial text corresponding to the target node and the target relationship. By combining the target relationship and the entity attribute, the target subject and the target object can be determined, and the text information triplet can then be determined. Because entity identification and relationship identification do not affect each other, errors in entity identification are not accumulated into relationship identification, so the accuracy and efficiency of text information extraction are higher.
In the example of fig. 3, the target path 304 includes two target nodes, a root node NR and a leaf node NL, respectively. And carrying out entity recognition and relation recognition on partial texts corresponding to the target node NR to respectively obtain an entity recognition result E_R and a relation recognition result R_R, and carrying out entity recognition and relation recognition on the target node NL to respectively obtain an entity recognition result E_L and a relation recognition result R_L.
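The independence of entity recognition and relationship recognition can be sketched as follows. This is a toy illustration under stated assumptions, not the disclosed models: the keyword-based recognizers, the `KNOWN_SUBJECTS` vocabulary, the rule of pairing subjects and objects found in the same node, and the two sample node texts are all hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical vocabulary of subject names; a real system would use a trained model.
KNOWN_SUBJECTS = ("equipment bracket", "transformer")


def recognize_entities(node_text: str) -> Dict[str, List[str]]:
    """Toy entity recognition: known names are subject candidates, numbers are object candidates."""
    subjects = [name for name in KNOWN_SUBJECTS if name in node_text]
    objects = re.findall(r"\d+(?:\.\d+)?", node_text)
    return {"subject": subjects, "object": objects}


def recognize_relation(node_text: str, target_relation: str) -> bool:
    """Toy relation recognition: does the node text mention the target relation?"""
    return target_relation in node_text


def extract_triples(path_texts: List[str], target_relation: str) -> List[Tuple[str, str, str]]:
    # S331 and S332 each read only the node texts, so neither step
    # consumes (or inherits errors from) the other step's output.
    entity_results = [recognize_entities(t) for t in path_texts]
    relation_results = [recognize_relation(t, target_relation) for t in path_texts]
    # S333: combine the two results node by node (an assumed, simplistic rule).
    triples = []
    for entities, related in zip(entity_results, relation_results):
        if related:
            for subject in entities["subject"]:
                for obj in entities["object"]:
                    triples.append((subject, target_relation, obj))
    return triples


path_texts = [
    "transformer technical specification",                                  # stands in for root node NR
    "the power amplification coefficient of the equipment bracket is 1.2",  # stands in for leaf node NL
]
triples = extract_triples(path_texts, "power amplification coefficient")
```

The sketch only pairs a subject and an object that occur in the same node; handling entities spread over several nodes would need a different combination rule.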
Fig. 3 also schematically illustrates operations S310 to S330.
In operation S310, the input text 301 is characterized by a tree structure, resulting in a text tree structure 302. The text tree structure includes at least one tree structure path from the root node to each leaf node. In the example of fig. 3, an example is schematically shown in which text tree structure 302 includes x tree structure paths of tree structure path p1 through tree structure path px.
In operation S320, a target path 304 corresponding to the target relationship is determined from the at least one tree-structured path according to the target relationship 303.
In operation S330, a target subject and a target object related to the target relationship 303 are determined according to the target path 304, and a text information triplet 305 associated with the target subject, the target relationship, and the target object is obtained.
Operations S310 to S330 are similar to operations S210 to S230 of the above embodiment, and are not described here again.
The text information extraction method according to the embodiment of the present disclosure will be described below with reference to the technical specification shown in fig. 4 as an example of input text.
Fig. 5 schematically illustrates a schematic diagram of a text tree structure of an input text in the case where the input text of the text information extraction method according to an embodiment of the present disclosure is the technical specification shown in fig. 4.
For example, the technical specification shown in fig. 4 is characterized with a tree structure according to the hierarchy "document title→chapter→section→clause→body→sentence".
In connection with the examples of fig. 4 and 5, the root node nr_0 of the text tree structure shown in fig. 5 corresponds to the document title of the technical specification of fig. 4, and the corresponding partial text is "10-50 kV oil-immersed distribution transformer technical specification".
The root node nr_0 corresponds to x child nodes, which are node n_1, node n_2, node n_3, node n_4 to node n_x, respectively. The partial text of the technical specification of fig. 4 corresponding to the node n_1 is "directory", the partial text of the technical specification of fig. 4 corresponding to the node n_2 is "1. General rule", the partial text of the technical specification of fig. 4 corresponding to the node n_3 is "2. Working range", the partial text of the technical specification of fig. 4 corresponding to the node n_4 is "3. Use condition", and the partial text of the technical specification of fig. 4 corresponding to the node n_x is "appendix".
Node n_4 corresponds to y child nodes, node n_41, node n_42 through node n_4y, respectively. The partial text of the technical specification of fig. 4 corresponding to the node n_41 is "temperature and humidity", and the partial text of the technical specification of fig. 4 corresponding to the node n_42 is "shock resistance".
Node n_42 corresponds to z child nodes, node n_421 through node n_42z, respectively. The partial text of the technical specification of fig. 4 corresponding to node n_421 is "transformer with rated capacity of 30kVA to 2500 kVA".
Node n_421 corresponds to 2 child nodes, node nl_1 and node nl_2, which are two leaf nodes of the text tree structure. The partial text of the technical specification of fig. 4 corresponding to the node nl_1 is "resonance, sine beat wave experimental method: excitation is performed 5 times, each time lasts for 5 cycles, each time is separated by 2 s, and the influence of vibration and tension of the end connection wire is considered". The partial text of the technical specification of fig. 4 corresponding to the node nl_2 is "the safety coefficient of the 10 kV transformer is not less than 1.67, and the power amplification coefficient of the equipment bracket is 1.2, calculated according to the horizontal acceleration of the equipment body".
In the examples of fig. 4 and 5, in the case where the target relationship is "the shock resistance safety coefficient" or "the power amplification coefficient", the tree structure path of "nr_0→n_4→n_42→n_421→nl_2" may be taken as the target path p_m. Each node of the target path is a target node.
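The selection of the target path in this example can be sketched as follows. The node names follow fig. 5, but the tree is abridged (nodes whose children are not described in the text are treated as leaves, and the leaf texts are shortened), and the word-matching rule in `select_target_paths` is an assumed heuristic for illustration; this passage does not specify how the target path is actually determined.

```python
# Abridged FIG. 5 example: node name -> (partial text, child node names).
TREE = {
    "nr_0": ("10-50 kV oil-immersed distribution transformer technical specification", ["n_2", "n_4"]),
    "n_2": ("1. General rule", []),
    "n_4": ("3. Use condition", ["n_41", "n_42"]),
    "n_41": ("temperature and humidity", []),
    "n_42": ("shock resistance", ["n_421"]),
    "n_421": ("transformer with rated capacity of 30kVA to 2500 kVA", ["nl_1", "nl_2"]),
    "nl_1": ("excitation is performed 5 times, each time lasts for 5 cycles", []),
    "nl_2": ("the safety coefficient is not less than 1.67, and the power "
             "amplification coefficient of the equipment bracket is 1.2", []),
}


def paths_from(name):
    """Enumerate all paths from node `name` down to a leaf."""
    _, children = TREE[name]
    if not children:
        return [[name]]
    return [[name] + tail for child in children for tail in paths_from(child)]


def select_target_paths(target_relation, root="nr_0"):
    """Keep paths whose concatenated node texts contain every word of the relation (assumed heuristic)."""
    words = target_relation.lower().split()
    selected = []
    for path in paths_from(root):
        path_text = " ".join(TREE[name][0] for name in path).lower()
        if all(word in path_text for word in words):
            selected.append(path)
    return selected


target_paths = select_target_paths("power amplification coefficient")
print(" -> ".join(target_paths[0]))
```

Matching against the whole path rather than the leaf alone lets words from ancestor headings, such as "shock resistance" at n_42, contribute to the match.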
Illustratively, the partial text corresponding to the target node includes at least one sentence, and the text information extraction method may further include: in the case where the partial text corresponding to the target node includes a plurality of sentences, splitting the target node according to each sentence to obtain a plurality of target split nodes.
In connection with the examples of fig. 4 and 5, for example, the partial text corresponding to the target node n_421 includes two sentences. The target node n_421 may be split according to each sentence, and the leaf node nl_1 and the leaf node nl_2 are the two target split nodes corresponding to the target node n_421.
According to the text information extraction method of the embodiment of the present disclosure, when the partial text corresponding to the target node includes a plurality of sentences, splitting the target node by sentence yields a plurality of target split nodes that characterize the input text at a finer granularity. Each target split node corresponds to less text, so that, for example, the amount of text to be processed when handling a target split node is reduced. In addition, because the target split nodes characterize the input text at a finer granularity, the target path determined according to the target relationship is more accurate.
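The sentence-level splitting can be sketched as follows. The `Node` class, the `_s1`, `_s2` naming of the split nodes, and the sentence boundary rule (terminal punctuation followed by whitespace) are assumptions for illustration; a production system would likely use a proper sentence segmenter for the language of the input text.

```python
import re
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    text: str


def split_node(node):
    """Split a node into one node per sentence; return [node] if it has one sentence."""
    # Boundary rule (assumed): '.', '!' or '?' followed by whitespace.
    # A decimal such as "1.67" is not split because no whitespace follows the dot.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", node.text.strip()) if s]
    if len(sentences) <= 1:
        return [node]
    return [Node(f"{node.name}_s{i}", s) for i, s in enumerate(sentences, start=1)]


example = Node(
    "n_x",  # hypothetical node name
    "Excitation is performed 5 times. The safety coefficient is not less than 1.67.",
)
for part in split_node(example):
    print(part.name, "|", part.text)
```

Each returned node carries a single sentence, so later steps operate on shorter text.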
Fig. 6 schematically illustrates a diagram of obtaining an entity recognition result according to a text information extraction method according to still another embodiment of the present disclosure.
As shown in fig. 6, entity recognition may be performed on each target node of the target path, to obtain the entity recognition result, using, for example, the following embodiment.
In operation S611, the target path is encoded to obtain a target path vector.
In the example of fig. 6, the target path p_m of "nr_0→n_4→n_42→n_421→nl_2" illustrated in fig. 5 is taken as an example. The target path p_m is encoded to obtain a target path vector.
Illustratively, the target path may be encoded with an encoder to obtain the target path vector. The encoder may include, for example, a pre-trained model such as ERNIE (Enhanced Representation with Informative Entities, i.e., language representation enhanced with informative entities) or BERT (Bidirectional Encoder Representations from Transformers, i.e., Transformer-based bidirectional encoder representations).
In operation S612, the target path vector is decoded to obtain an entity recognition result matrix of each target node of the target path.
Illustratively, the target path vector may be decoded with a decoder. The decoder may include, for example, a pre-trained model such as ERNIE or BERT.
In the example of fig. 6, the entity recognition result matrix mr_e of the target node nr_0 of the target path p_m and the entity recognition result matrix ml_e of the target node nl_2 are schematically shown.
The first dimension of the entity recognition result matrix characterizes word unit vectors.
The word unit vector is a vector representation obtained by encoding a word unit of the partial text corresponding to the target node, and the word units are obtained by word segmentation of that partial text. In the example of fig. 6, taking the target node nr_0 as an example, the partial text corresponding to the target node is "10-50 kV oil immersed distribution transformer technical specification book", and the word units obtained after word segmentation include, for example, "10", "-", "50", "kV", "oil", "immersed", and "book".
In the example of fig. 6, the word unit vectors may be obtained by encoding with a first encoder Enco1.
The second dimension of the entity recognition result matrix characterizes the entity attribute slots. The elements of the entity recognition result matrix characterize whether the word unit corresponding to the element index corresponds to the initial bit and the termination bit of the slot value of an entity attribute slot.
At least one of the category and the number of entity attribute slots may be customizable. The category of the entity attribute slot is the category of the entity attribute corresponding to the entity attribute slot.
In the example of fig. 6, two entity attributes, nominal voltage and numerical unit, are schematically shown, corresponding to two entity attribute slots, respectively.
A slot may be understood as abstract content. Each slot corresponds to a filling condition, which defines an attribute common to the slot values corresponding to that slot, and an entity may be defined by one or more entity attribute slots. The attribute may include at least one of part of speech, type, and character length, and the part of speech may include nouns, verbs, adjectives, and the like.
A slot value is specific content. In the example of fig. 6, "-S" in the second dimension of the entity recognition result matrix characterizes the initial bit of the slot value of an entity attribute slot, and "-E" characterizes the termination bit of the slot value. For example, the word unit "10" is the initial bit of the slot value of the nominal voltage slot (the entity attribute slot whose entity attribute is the nominal voltage), and the word unit "kV" is the termination bit of that slot value. The slot value of the nominal voltage slot is the word unit sequence "10-50 kV" from the initial bit to the termination bit.
In the example of fig. 6, the element value of the entity recognition result matrix is 1 or 0, where 1 indicates that the word unit corresponding to the element index corresponds to the slot value of the entity attribute slot, and 0 indicates that the word unit corresponding to the element index does not correspond to the slot value of the entity attribute slot.
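To illustrate how slot values can be read off such a matrix, the following sketch pairs each initial-bit flag ("-S") with the next termination-bit flag ("-E") of the same slot. The function name, the column labels, and the matrix values are illustrative assumptions and are not taken from fig. 6.

```python
def decode_slots(tokens, columns, matrix):
    """Return {slot: [word unit span, ...]} from "-S" / "-E" flags.

    tokens  : word units (first dimension of the matrix)
    columns : labels such as "nominal_voltage-S" (second dimension)
    matrix  : rows of 0/1 values, one row per word unit
    """
    spans = {}
    slots = sorted({label[:-2] for label in columns})  # strip "-S" / "-E"
    for slot in slots:
        start_col = columns.index(slot + "-S")
        end_col = columns.index(slot + "-E")
        starts = [i for i, row in enumerate(matrix) if row[start_col] == 1]
        ends = [i for i, row in enumerate(matrix) if row[end_col] == 1]
        for start in starts:
            # Pair each initial bit with the first termination bit at or after it.
            end = next((e for e in ends if e >= start), None)
            if end is not None:
                spans.setdefault(slot, []).append(tokens[start:end + 1])
    return spans


# Illustrative values only.
tokens = ["10", "-", "50", "kV", "oil", "immersed"]
columns = ["nominal_voltage-S", "nominal_voltage-E",
           "numerical_unit-S", "numerical_unit-E"]
matrix = [
    [1, 0, 0, 0],  # "10" starts the nominal voltage slot value
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 1],  # "kV" ends it and is itself a numerical unit
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(decode_slots(tokens, columns, matrix))
```

Because each slot has its own pair of columns, the number and category of slots can be changed by adding or removing columns without changing the decoding logic.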
According to the text information extraction method of the embodiment of the present disclosure, the entity recognition result matrix accurately characterizes the correlation between the partial text corresponding to the target node and the entity attributes. Specifically, the first dimension of the entity recognition result matrix characterizes the word unit vectors, and the second dimension characterizes, for each entity attribute slot, the initial bit and the termination bit of its slot value, so that the elements of the matrix determine, word unit by word unit, whether the partial text corresponding to the target node includes a slot value of an entity attribute slot and where that slot value is located.
The entity attribute slots may be used to determine an entity; for example, a subject or an object may include a plurality of entity attribute slots. The text information extraction method of the embodiment of the present disclosure performs entity recognition on the partial text corresponding to each target node, determines a plurality of entity attribute slots, and determines the entity from them, and can therefore cover application scenarios in which an entity is distributed over different positions of the input text.
The entity recognition result matrix can be obtained with a pre-trained model, so text information extraction rules need not be designed manually and the labor cost is lower.
For example, operations S611 to S612 may be performed after the above-described operation S220 or operation S320.
Fig. 6 also schematically illustrates a diagram of a result of relationship recognition by a text information extraction method according to still another embodiment of the present disclosure.
As shown in fig. 6, relationship recognition may be performed on each target node of the target path, to obtain the relationship recognition result, using, for example, the following embodiment.
In operation S613, the target path vector is decoded to obtain a relationship recognition result matrix of the target nodes of the target path.
In the example of fig. 6, the relationship-identifying result matrix mr_r of the target node nr_0 of the target path p_m and the relationship-identifying result matrix ml_r of the target node nl_2 are schematically shown.
The first dimension of the relationship recognition result matrix characterizes word unit vectors.
The second dimension of the relationship recognition result matrix characterizes the relationship class. The relationship category is determined from the target relationship.
In the example of fig. 6, taking the target node nr_0 as an example, after the partial text corresponding to the target node is identified by the relationship, the second dimension of the obtained relationship identification result matrix mr_r represents two relationship categories, namely, the "shock-resistant safety coefficient" and the "power amplification coefficient".
The elements of the relationship recognition result matrix characterize whether the entity attribute slot to which the word unit corresponding to the element index belongs corresponds to an entity category related to the relationship category, where the entity categories include a subject and an object.
In the example of fig. 6, the element value of the relationship recognition result matrix is 1 or 0, where 1 indicates that the entity attribute slot to which the word unit corresponding to the element index belongs corresponds to the entity category related to the relationship category, and 0 indicates that the entity attribute slot to which the word unit corresponding to the element index belongs does not correspond to the entity category related to the relationship category.
In the example of fig. 6, taking the target node nr_0 as an example, the partial text corresponding to the target node is "10-50 kV oil immersed distribution transformer technical specification book", in which the entity attribute slot to which the word unit "10" belongs is the nominal voltage slot. In the relationship recognition result matrix, "-S" characterizes the subject related to a relationship category, and "-O" characterizes the object related to the relationship category. Take the element value "1" in the first row and first column of the relationship recognition result matrix mr_r of the target node nr_0 as an example: in the first dimension, the corresponding word unit of the partial text is "10"; in the second dimension, the corresponding relationship category is "shock-resistant safety coefficient"; and the element value "1" characterizes that the nominal voltage slot to which the word unit "10" belongs corresponds to the subject of the relationship category "shock-resistant safety coefficient".
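The reading of the relationship recognition result matrix described above can be sketched as follows. The function name, the per-word-unit slot assignment, and the matrix values are illustrative assumptions and do not reproduce fig. 6.

```python
def slot_roles(token_slots, columns, matrix):
    """Map (relationship category, role) to the set of slots flagged with 1.

    token_slots : entity attribute slot of each word unit (None if it has none)
    columns     : labels "<relationship category>-S" (subject) or "-O" (object)
    matrix      : rows of 0/1 values, one row per word unit
    """
    roles = {}
    for slot, row in zip(token_slots, matrix):
        if slot is None:
            continue
        for label, flag in zip(columns, row):
            if flag == 1:
                relation, suffix = label.rsplit("-", 1)
                role = "subject" if suffix == "S" else "object"
                roles.setdefault((relation, role), set()).add(slot)
    return roles


# Illustrative values only: "10 - 50 kV" belongs to the nominal voltage slot,
# and "oil immersed" is assumed to belong to the insulating medium slot.
token_slots = ["nominal_voltage"] * 4 + ["insulating_medium"] * 2
columns = [
    "shock-resistant safety coefficient-S",
    "shock-resistant safety coefficient-O",
    "power amplification coefficient-S",
    "power amplification coefficient-O",
]
matrix = [
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
print(slot_roles(token_slots, columns, matrix))
```

Since every relationship category has its own subject and object columns, one matrix can describe several relationship categories at once.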
According to the text information extraction method of the embodiment of the present disclosure, the relationship recognition result matrix accurately characterizes the correlation between the partial text corresponding to the target node and the target relationship. Specifically, the first dimension of the relationship recognition result matrix characterizes the word unit vectors, the second dimension characterizes the relationship categories, and each element characterizes whether the entity attribute slot to which the word unit corresponding to the element index belongs corresponds to an entity category related to the relationship category, so that whether the partial text corresponding to the target node corresponds to the subject or the object of a relationship category can be determined word unit by word unit.
According to the text information extraction method disclosed by the embodiment of the invention, the second dimension of the relation recognition result matrix is related to the relation category, so that the situation that the target relation comprises a plurality of relation categories can be adapted, and the application scenes that the subjects or objects corresponding to the relation categories are distributed at different positions of the input text can be covered.
The relationship recognition result matrix can likewise be obtained with a pre-trained model, so text information extraction rules need not be designed manually and the labor cost is lower.
For example, operation S613 may be performed after operation S612 described above.
Fig. 6 also illustrates a specific example of encoding a target path resulting in a target path vector.
In operation S614, a portion of text corresponding to each target node is encoded to obtain a target node vector.
In operation S615, the hierarchy of each target node of the target path is encoded to obtain a target node hierarchy vector.
The hierarchy is determined from a text tree structure.
In operation S617, a target path vector is obtained from the target node vector and the target node hierarchy vector.
In the example of fig. 6, a first encoder Enco1 may be used to encode the partial text corresponding to each target node to obtain a target node vector. For example, in the case where the first encoder and the decoder are pre-trained deep learning models, the first encoder may share parameters across the partial texts corresponding to the target nodes, which reduces the number of model parameters, helps avoid overfitting, and saves resources during encoding and decoding. A second encoder Enco2 may be used to encode the level of each target node to obtain a target node level vector. A third encoder Enco3 may be used to obtain the target path vector from the target node vector and the target node level vector.
For example, the target path vector may be determined using the following equation (1), shown here in general form with n denoting the number of target nodes of the target path.

$$e_{path} = f\left(\left\{\left(e^{level}_{i},\ e^{node}_{i}\right)\right\}_{i=1}^{n}\right) \tag{1}$$

Wherein $e^{level}_{i}$ represents the target node level vector corresponding to the i-th target node of the target path, $e^{node}_{i}$ represents the target node vector corresponding to the i-th target node of the target path, and $e_{path}$ represents the target path vector corresponding to the target path.
For example, $e^{token}_{i,j}$ may be used to represent the vector representation of the j-th word unit (token) of the partial text corresponding to the i-th target node in the target path.
f is a mapping function, which may be implemented, for example, by averaging over the target nodes of the target path or by using a feed-forward neural network.
According to the text information extraction method of the embodiment of the present disclosure, the target path is encoded hierarchically, at the level of the target nodes, the target node levels, and the target path. The resulting target path vector reflects each target node and its level and is therefore more expressive, so that the entity recognition result matrix and the relationship recognition result matrix obtained by decoding the target path vector are more accurate.
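The hierarchical encoding can be illustrated with a toy sketch in plain Python. Everything here is a stand-in: the node vector is taken as the mean of its word unit vectors (in place of the first encoder), the level vectors are supplied directly (in place of the second encoder), node and level vectors are combined by element-wise addition (an assumption, since the description does not fix how they are combined), and the mapping function f is the averaging variant mentioned above.

```python
def mean(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def encode_path(token_vectors_per_node, level_vectors):
    """Toy target path vector built from word unit vectors and level vectors."""
    # Stand-in for the first encoder: node vector = mean of its word unit vectors.
    node_vectors = [mean(tokens) for tokens in token_vectors_per_node]
    # Assumed combination: element-wise sum of node vector and level vector.
    combined = [
        [n + l for n, l in zip(node_vec, level_vec)]
        for node_vec, level_vec in zip(node_vectors, level_vectors)
    ]
    # Mapping function f implemented as averaging over the target nodes.
    return mean(combined)


# Two target nodes with 2-dimensional toy vectors.
token_vectors_per_node = [
    [[1.0, 0.0], [3.0, 2.0]],  # word unit vectors of the first target node
    [[0.0, 4.0]],              # word unit vectors of the second target node
]
level_vectors = [[0.0, 1.0], [1.0, 0.0]]  # one level vector per target node
print(encode_path(token_vectors_per_node, level_vectors))
```

In practice the three encoders would be learned models such as those named in the description; the sketch only shows how node content and node level both feed into one path vector.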
By way of example, the text information triplet may be determined from the entity recognition result and the relationship recognition result using the following embodiment.
Entity attribute slots respectively corresponding to the subject and the object of each relationship category are determined according to the relationship recognition result matrix and the entity recognition result matrix; a target subject is determined according to the slot values of the entity attribute slots corresponding to the subject; a target object is determined according to the slot values of the entity attribute slots corresponding to the object; and the text information triplet associated with the target subject, the target relationship, and the target object is determined according to the target subject and the target object of each relationship category.
In the example of fig. 6, it is schematically shown that, in the case where the target relationship is the relationship category "shock-resistant safety coefficient", the slot values of the entity attribute slots SL corresponding to the subject and the object of "shock-resistant safety coefficient" may be determined from the entity recognition result matrix and the relationship recognition result matrix of the text corresponding to each target node. The target subject may be determined according to the slot values of the entity attribute slots corresponding to the subject, and the target object may be determined according to the slot value of the entity attribute slot corresponding to the object. In the example of fig. 6, the entity attribute slots corresponding to the subject include a nominal voltage slot, an insulating medium slot, a type slot, a name slot, a rated capacity slot, and a rated voltage slot. The entity attribute slots corresponding to the object include a numerical slot. The text information triplet SPO includes a target subject of "10kV, 30 kVA-2500 kVA, oil immersed, power distribution, transformer", a target relationship of "shock-resistant safety coefficient", and a target object of "1.67".
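The assembly of the triplet from slot values can be sketched as follows. The function name and the dictionary layout are hypothetical, and the assignment of each value to a particular slot (for instance, "power distribution" to the type slot and "transformer" to the name slot) is an assumption made for the example; the description lists the slots and the resulting subject but not the individual pairing.

```python
def build_triplet(relation, subject_slots, object_slots, slot_values):
    """Join the slot values of the subject and object slots into an SPO triplet."""
    subject = ", ".join(slot_values[s] for s in subject_slots if s in slot_values)
    obj = ", ".join(slot_values[s] for s in object_slots if s in slot_values)
    return (subject, relation, obj)


# Slot-to-value pairing assumed for illustration.
slot_values = {
    "nominal voltage": "10kV",
    "rated capacity": "30 kVA-2500 kVA",
    "insulating medium": "oil immersed",
    "type": "power distribution",
    "name": "transformer",
    "numerical": "1.67",
}
subject_slots = ["nominal voltage", "rated capacity", "insulating medium",
                 "type", "name", "rated voltage"]  # rated voltage has no value here
object_slots = ["numerical"]

spo = build_triplet("shock-resistant safety coefficient",
                    subject_slots, object_slots, slot_values)
print(spo)
```

Slots without a recognized value, such as the rated voltage slot in this example, are simply skipped when the subject is assembled.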
Fig. 7 schematically illustrates a schematic diagram of determining a text information triplet according to another embodiment of the disclosure.
As shown in fig. 7, the text information triplet may be determined from the entity recognition result and the relationship recognition result using, for example, the following embodiment.
In operation S731, a target attribute slot is determined from among the candidate entity attribute slots according to the distance between each candidate entity attribute slot and the entity-associated slot.
The plurality of candidate entity attribute slots correspond to the same entity attribute, the plurality of candidate entity attribute slots correspond to one of the subject and the object of the relationship category, and the entity association slot corresponds to the other of the subject and the object of the relationship category.
In the example of fig. 7, m candidate entity attribute slots s_ci1 to s_cim correspond to the same entity attribute a_i, the m candidate entity attribute slots also correspond to the subject of the relationship category, and the entity association slot s_c corresponds to the object of the relationship category.
For example, in the example of fig. 6, there are two entity attribute slots whose entity attribute is the nominal voltage (nominal voltage slots), each corresponding to the subject of the relationship category "shock-resistant safety coefficient", and the slot values of the two nominal voltage slots are "10-50 kV" and "10kV", respectively. The two nominal voltage slots are two candidate entity attribute slots. The numerical slot corresponding to the object of the relationship category "shock-resistant safety coefficient" (its slot value is "1.67") is the entity association slot. In the example of fig. 7, the candidate entity attribute slots correspond to the subject of the relationship category, and the entity association slot corresponds to the object of the relationship category.
For example, a target attribute slot may be determined from the candidate entity attribute slots based on the node distance and the text distance.
The text distance characterizes a word unit based amount of deviation between a slot value of a candidate entity attribute slot and a slot value of an entity associated slot. The node distance characterizes the distance between the target node corresponding to the candidate entity attribute slot and the target node corresponding to the entity association slot.
For example, the distance D between a candidate entity attribute slot and the entity association slot may be determined using the following equation (2), and, for example, the candidate entity attribute slot having the shortest distance from the entity association slot may be selected as the target attribute slot.
$$D = W_1 \times N \times B + W_2 \times U \tag{2}$$

N characterizes the absolute value of the depth difference, based on the text tree structure, between the target node corresponding to the candidate entity attribute slot and the target node corresponding to the entity association slot. B characterizes the unit node distance base, and the value of B can be preset. U characterizes the word-unit-based offset value between the candidate entity attribute slot and the entity association slot. $W_1$ and $W_2$ characterize weights.
In the example of fig. 6, a first candidate entity attribute slot having a slot value of "10-50 kV" corresponds to the target node nr_0, a second candidate entity attribute slot having a slot value of "10kV" corresponds to the target node nl_2, and the entity association slot (a numerical slot having a slot value of "1.67") corresponds to the target node nl_2. By comparing the distance between the first candidate entity attribute slot and the entity association slot with the distance between the second candidate entity attribute slot and the entity association slot, the second candidate entity attribute slot, which is closer to the entity association slot, is determined as the target attribute slot.
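The selection in this example can be worked through numerically with equation (2). The node depths follow the target path nr_0→n_4→n_42→n_421→nl_2 (depth 0 for nr_0, depth 4 for nl_2); the weights, the base B = 100, and the word unit offsets of 40 and 6 are hypothetical values chosen only to make the arithmetic concrete.

```python
def slot_distance(candidate_depth, association_depth, token_offset,
                  w1=1.0, w2=1.0, base=100.0):
    """D = W1 * N * B + W2 * U, following equation (2)."""
    n = abs(candidate_depth - association_depth)  # N: depth difference in the tree
    return w1 * n * base + w2 * token_offset      # U: word unit offset


# Candidate nominal voltage slots, keyed by slot value (offsets are hypothetical).
candidates = {
    "10-50 kV": {"depth": 0, "token_offset": 40},  # on node nr_0
    "10kV": {"depth": 4, "token_offset": 6},       # on node nl_2
}
association_depth = 4  # the numerical slot "1.67" is on node nl_2

distances = {
    value: slot_distance(info["depth"], association_depth, info["token_offset"])
    for value, info in candidates.items()
}
target_slot_value = min(distances, key=distances.get)
print(distances, target_slot_value)
```

With these values the first candidate has D = 1 × 4 × 100 + 1 × 40 = 440 and the second has D = 0 + 6 = 6, so the slot with value "10kV" is selected. A large base B makes the node distance dominate, so the word unit offset mainly breaks ties between candidates at the same depth.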
In operation S732, the target subject S and the target object O are determined according to the slot values of the entity attribute slots and the relationship recognition result matrix.
The entity attribute slots include a target attribute slot and an entity association slot.
As shown in fig. 7, for example, through the above operation S731 and in conjunction with the example of fig. 6, a target attribute slot s_t may be determined from the two nominal voltage slots whose slot values are "10-50 kV" and "10kV", respectively; the slot value v_t of the target attribute slot s_t is "10kV". According to the relationship recognition result matrix m_r (corresponding to ml_r in fig. 6), it may be determined that the target attribute slot s_t corresponds to the subject of the relationship category "shock-resistant safety coefficient", and that the entity association slot s_c is the numerical slot whose slot value v_c is "1.67".
In the example of fig. 6, in the case where the predetermined entity attribute slots further include an insulating medium slot s_j, the target subject S may be determined to be "10kV, oil immersed" according to the slot value v_j of the insulating medium slot s_j, which corresponds to the subject of the relationship category "shock-resistant safety coefficient", and the slot value v_t of the target attribute slot s_t. The target object O may be determined to be "1.67" according to the slot value v_c of the entity association slot s_c, which corresponds to the object of the relationship category "shock-resistant safety coefficient".
In operation S733, a text information triplet is determined according to the target subject S, the target object O, and the target relationship P.
When the target subject S is "10kV, oil immersed", and the target object O is "1.67", the corresponding target relationship P is "shock-resistant safety coefficient".
When there are a plurality of candidate entity attribute slots corresponding to the same entity attribute, their slot values may be redundant. The text information extraction method of the embodiment of the present disclosure can determine, from the plurality of candidate entity attribute slots, the target attribute slot that is most closely correlated with the entity association slot, and thereby extract the text information accurately.
For example, operations S731 to S733 may be performed after the above-described operation S613.
Fig. 8 schematically illustrates a block diagram of a text information extracting device according to an embodiment of the present disclosure.
As shown in fig. 8, the text information extraction apparatus 800 of the embodiment of the present disclosure includes, for example, a text tree structure determination module 810, a target path determination module 820, and a text information triplet determination module 830.
The text tree structure determination module 810 is configured to characterize the input text using a tree structure, resulting in a text tree structure that includes at least one tree structure path from the root node to each leaf node.
The target path determining module 820 is configured to determine, according to the target relationship, a target path corresponding to the target relationship from at least one tree structure path.
The text information triplet determining module 830 is configured to determine, according to the target path, a target subject and a target object related to the target relationship, and obtain a text information triplet related to the target subject, the target relationship, and the target object.
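The data flow between the three modules can be sketched as a simple composition. The class name, the constructor parameters, and the toy stand-in functions are all hypothetical; they only show that the output of each module is the input of the next, and do not implement the actual tree building, path selection, or triplet extraction.

```python
class TextInformationExtractor:
    """Chains three stages that mirror modules 810, 820 and 830."""

    def __init__(self, build_tree, find_target_path, extract_triplet):
        self.build_tree = build_tree              # stands in for module 810
        self.find_target_path = find_target_path  # stands in for module 820
        self.extract_triplet = extract_triplet    # stands in for module 830

    def run(self, text, relation):
        tree = self.build_tree(text)
        path = self.find_target_path(tree, relation)
        return self.extract_triplet(path, relation)


# Toy stand-ins, only to show the data flow between the stages.
extractor = TextInformationExtractor(
    build_tree=lambda text: {"root": text.split(". ")},
    find_target_path=lambda tree, rel: [s for s in tree["root"] if rel in s],
    extract_triplet=lambda path, rel: ("<subject>", rel, path[0] if path else None),
)
print(extractor.run("A transformer. safety coefficient is 1.67", "safety coefficient"))
```

Passing the stages in as callables keeps each one replaceable, which matches the way the description treats the sub-modules as separable units.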
According to an embodiment of the present disclosure, the text information triplet determination module includes an entity recognition sub-module, a relationship recognition sub-module, and a text information triplet determination sub-module.
The entity recognition sub-module is configured to perform entity recognition on each target node of the target path to obtain an entity recognition result. The entity recognition result characterizes the correlation between the partial text corresponding to the target node and the entity attributes.
The relationship recognition sub-module is configured to perform relationship recognition on each target node of the target path to obtain a relationship recognition result. The relationship recognition result characterizes the correlation between the partial text corresponding to the target node and the target relationship.
The text information triplet determination sub-module is configured to determine the text information triplet according to the entity recognition result and the relationship recognition result.
According to an embodiment of the present disclosure, the entity recognition sub-module includes a target path vector determining unit and an entity recognition result matrix determining unit.
The target path vector determining unit is configured to encode the target path to obtain a target path vector.
The entity recognition result matrix determining unit is configured to decode the target path vector to obtain an entity recognition result matrix of each target node of the target path. A first dimension of the entity recognition result matrix characterizes word unit vectors, where a word unit vector is a vector representation obtained by encoding a word unit of the partial text corresponding to the target node, and the word units are obtained by word segmentation of that partial text. A second dimension of the entity recognition result matrix characterizes entity attribute slots. The elements of the entity recognition result matrix characterize whether the word unit corresponding to the element index corresponds to the initial bit and the termination bit of the slot value of an entity attribute slot. At least one of the number and the category of the entity attribute slots is customizable.
According to an embodiment of the present disclosure, the relationship recognition sub-module includes a relationship recognition result matrix determining unit.
The relationship recognition result matrix determining unit is configured to decode the target path vector to obtain a relationship recognition result matrix of the target nodes of the target path. A first dimension of the relationship recognition result matrix characterizes word unit vectors, and a second dimension of the relationship recognition result matrix characterizes relationship categories. The elements of the relationship recognition result matrix characterize whether the entity attribute slot to which the word unit corresponding to the element index belongs corresponds to an entity category related to the relationship category, where the entity categories include a subject and an object, and the relationship category is determined according to the target relationship.
According to an embodiment of the present disclosure, the text information triplet determination sub-module includes a target attribute slot determining unit, a target subject and target object determining unit, and a text information triplet determining unit.
The target attribute slot determining unit is configured to determine a target attribute slot from a plurality of candidate entity attribute slots according to the distance between each candidate entity attribute slot and the entity association slot, where the plurality of candidate entity attribute slots correspond to the same entity attribute, the plurality of candidate entity attribute slots correspond to one of the subject and the object of the relationship category, and the entity association slot corresponds to the other of the subject and the object of the relationship category.
The target subject and target object determining unit is configured to determine a target subject and a target object according to the slot values of the entity attribute slots and the relationship recognition result matrix, where the entity attribute slots include the target attribute slot and the entity association slot.
The text information triplet determining unit is configured to determine the text information triplet according to the target subject, the target object, and the target relationship.
According to an embodiment of the present disclosure, the target attribute slot determining unit includes a target attribute slot determining subunit.
The target attribute slot determining subunit is configured to determine the target attribute slot from the candidate entity attribute slots according to a node distance and a text distance, where the node distance characterizes the distance between the target node corresponding to the candidate entity attribute slot and the target node corresponding to the entity association slot, and the text distance characterizes the word-unit-based offset value between the slot value of the candidate entity attribute slot and the slot value of the entity association slot.
According to an embodiment of the present disclosure, the target path vector determining unit includes a target node vector determining subunit, a target node level vector determining subunit, and a target path vector determining subunit.
The target node vector determining subunit is configured to encode the partial text corresponding to each target node to obtain a target node vector.
The target node level vector determining subunit is configured to encode the level of each target node of the target path to obtain a target node level vector, where the level is determined according to the text tree structure.
The target path vector determining subunit is configured to obtain the target path vector according to the target node vector and the target node level vector.
According to an embodiment of the present disclosure, the partial text corresponding to the target node includes at least one sentence, and the text information extraction apparatus further includes a target node splitting module.
The target node splitting module is configured to, in the case where the partial text corresponding to the target node includes a plurality of sentences, split the target node according to each sentence to obtain a plurality of target split nodes.
It should be understood that the embodiments of the apparatus portion of the present disclosure correspond to the same or similar embodiments of the method portion of the present disclosure, and the technical problems to be solved and the technical effects to be achieved also correspond to the same or similar embodiments, which are not described herein in detail.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in Fig. 9, the device 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. The RAM 903 can also store various programs and data required for the operation of the device 900. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Various components in device 900 are connected to I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above, such as a text information extraction method. For example, in some embodiments, the text information extraction method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the text information extraction method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the text information extraction method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. Such program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read Only Memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (16)

1. A text information extraction method, comprising:
representing the input text with a tree structure, resulting in a text tree structure comprising at least one tree structure path from the root node to each leaf node; each node in the text tree structure corresponds to a portion of text in the input text; the partial text includes at least one sentence;
determining a target path corresponding to the target relationship from the at least one tree structure path according to the target relationship; and
determining, according to the target path, a target subject and a target object related to the target relationship, to obtain a text information triplet related to the target subject, the target relationship and the target object;
wherein the determining, according to the target path, a target subject and a target object related to the target relationship, to obtain a text information triplet related to the target subject, the target relationship and the target object comprises:
performing entity identification on each target node of the target path to obtain an entity identification result, wherein the entity identification result represents the correlation between the partial text corresponding to the target node and entity attributes;
performing relationship identification on each target node of the target path to obtain a relationship identification result, wherein the relationship identification result represents the correlation between the partial text corresponding to the target node and the target relationship; and
determining the text information triplet according to the entity identification result and the relationship identification result;
wherein the determining the text information triplet according to the entity identification result and the relationship identification result comprises:
determining a target subject and a target object according to the entity identification result and the relationship identification result; and
determining the text information triplet according to the target subject, the target object and the target relationship.
2. The method of claim 1, wherein the entity identifying each target node of the target path comprises:
encoding the target path to obtain a target path vector; and
decoding the target path vector to obtain an entity recognition result matrix of each target node of the target path, wherein a first dimension of the entity recognition result matrix represents word unit vectors, a word unit vector being a vector representation obtained by encoding a word unit of the partial text corresponding to the target node, and the word units being obtained by word segmentation of the partial text corresponding to the target node; a second dimension of the entity recognition result matrix represents entity attribute slots; an element of the entity recognition result matrix represents whether the word unit corresponding to the element index is a start position or an end position of a slot value corresponding to the entity attribute slot; and at least one of the number and the category of the entity attribute slots is customizable.
3. The method of claim 2, wherein the performing relationship identification on each target node of the target path, to obtain a relationship identification result, comprises:
decoding the target path vector to obtain a relationship recognition result matrix of a target node of the target path, wherein a first dimension of the relationship recognition result matrix represents the word unit vector, a second dimension of the relationship recognition result matrix represents a relationship class, an element of the relationship recognition result matrix represents whether the entity attribute slot to which the word unit corresponding to the element index belongs corresponds to an entity class related to the relationship class, the entity class comprises a subject and an object, and the relationship class is determined according to the target relationship.
4. The method of claim 3, wherein the determining a target subject and a target object from the entity recognition result and the relationship recognition result comprises:
determining a target attribute slot from a plurality of candidate entity attribute slots according to the distance between each candidate entity attribute slot and an entity association slot, wherein the plurality of candidate entity attribute slots correspond to the same entity attribute, the plurality of candidate entity attribute slots correspond to one of a subject and an object of the relationship class, and the entity association slot corresponds to the other of the subject and the object of the relationship class; and
determining the target subject and the target object according to the slot value of the entity attribute slot and the relationship recognition result matrix, wherein the entity attribute slot comprises the target attribute slot and the entity association slot.
5. The method of claim 4, wherein the determining a target attribute slot from the plurality of candidate entity attribute slots according to the distance between each candidate entity attribute slot and the entity association slot comprises:
determining the target attribute slot from the plurality of candidate entity attribute slots according to a node distance and a text distance, wherein the node distance represents the distance between the target node corresponding to the candidate entity attribute slot and the target node corresponding to the entity association slot, and the text distance represents an offset, measured in word units, between the slot value of the candidate entity attribute slot and the slot value of the entity association slot.
6. The method of claim 2, wherein the encoding the target path to obtain a target path vector comprises:
encoding the partial text corresponding to each target node to obtain a target node vector;
encoding the level of each target node of the target path to obtain a target node level vector, wherein the level is determined according to the text tree structure; and
obtaining the target path vector according to the target node vector and the target node level vector.
7. The method of any of claims 1-6, wherein the text information extraction method further comprises:
splitting the target node by sentence to obtain a plurality of target split nodes in the case where the partial text corresponding to the target node comprises a plurality of sentences.
8. A text information extracting apparatus comprising:
a text tree structure determining module for representing an input text with a tree structure, resulting in a text tree structure, the text tree structure comprising at least one tree structure path from a root node to each leaf node; each node in the text tree structure corresponds to a portion of text in the input text; the partial text includes at least one sentence;
a target path determining module, configured to determine a target path corresponding to a target relationship from the at least one tree structure path according to the target relationship; and
a text information triplet determining module, configured to determine, according to the target path, a target subject and a target object related to the target relationship, to obtain a text information triplet related to the target subject, the target relationship and the target object;
wherein the text information triplet determining module comprises:
an entity identification sub-module, configured to perform entity identification on each target node of the target path to obtain an entity identification result, wherein the entity identification result represents the correlation between the partial text corresponding to the target node and entity attributes;
a relationship identification sub-module, configured to perform relationship identification on each target node of the target path to obtain a relationship identification result, wherein the relationship identification result represents the correlation between the partial text corresponding to the target node and the target relationship; and
a text information triplet determination sub-module, configured to determine the text information triplet according to the entity identification result and the relationship identification result;
wherein the text information triplet determination sub-module is configured to:
determine a target subject and a target object according to the entity identification result and the relationship identification result; and
determine the text information triplet according to the target subject, the target object and the target relationship.
9. The apparatus of claim 8, wherein the entity identification submodule comprises:
a target path vector determining unit, configured to encode the target path to obtain a target path vector; and
an entity recognition result matrix determining unit, configured to decode the target path vector to obtain an entity recognition result matrix of each target node of the target path, wherein a first dimension of the entity recognition result matrix represents word unit vectors, a word unit vector being a vector representation obtained by encoding a word unit of the partial text corresponding to the target node, and the word units being obtained by word segmentation of the partial text corresponding to the target node; a second dimension of the entity recognition result matrix represents entity attribute slots; an element of the entity recognition result matrix represents whether the word unit corresponding to the element index is a start position or an end position of a slot value corresponding to the entity attribute slot; and at least one of the number and the category of the entity attribute slots is customizable.
10. The apparatus of claim 9, wherein the relationship identification submodule comprises:
a relationship recognition result matrix determining unit, configured to decode the target path vector to obtain a relationship recognition result matrix of a target node of the target path, wherein a first dimension of the relationship recognition result matrix represents the word unit vector, a second dimension of the relationship recognition result matrix represents a relationship class, an element of the relationship recognition result matrix represents whether the entity attribute slot to which the word unit corresponding to the element index belongs corresponds to an entity class related to the relationship class, the entity class comprises a subject and an object, and the relationship class is determined according to the target relationship.
11. The apparatus of claim 10, wherein the text information triplet determination submodule comprises:
a target attribute slot determining unit, configured to determine a target attribute slot from a plurality of candidate entity attribute slots according to a distance between each candidate entity attribute slot and an entity association slot, where the plurality of candidate entity attribute slots correspond to the same entity attribute, and the plurality of candidate entity attribute slots correspond to one of a subject and an object of the relationship class, and the entity association slot corresponds to the other one of the subject and the object of the relationship class; and
a target subject and target object determining unit, configured to determine the target subject and the target object according to the slot value of the entity attribute slot and the relationship recognition result matrix, wherein the entity attribute slot comprises the target attribute slot and the entity association slot.
12. The apparatus of claim 11, wherein the target attribute slot determination unit comprises:
a target attribute slot determining subunit, configured to determine the target attribute slot from the plurality of candidate entity attribute slots according to a node distance and a text distance, wherein the node distance represents the distance between the target node corresponding to the candidate entity attribute slot and the target node corresponding to the entity association slot, and the text distance represents an offset, measured in word units, between the slot value of the candidate entity attribute slot and the slot value of the entity association slot.
13. The apparatus of claim 9, wherein the target path vector determination unit comprises:
a target node vector determining subunit, configured to encode a portion of text corresponding to each target node to obtain a target node vector;
a target node level vector determining subunit, configured to encode the level of each target node of the target path to obtain a target node level vector, wherein the level is determined according to the text tree structure; and
a target path vector determining subunit, configured to obtain the target path vector according to the target node vector and the target node level vector.
14. The apparatus according to any one of claims 9-13, wherein the text information extraction apparatus further comprises:
a target node splitting module, configured to split the target node by sentence to obtain a plurality of target split nodes in the case where the partial text corresponding to the target node comprises a plurality of sentences.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-7.
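For readers who want a concrete picture of the start/end matrix and the distance-based slot selection recited in the claims above, the following is a hedged sketch and not the claimed implementation. It makes several assumptions: each matrix element is stored as an `(is_start, is_end)` pair, a start marker is closed by the next end marker in the same slot column, word-unit offsets are counted along the whole target path, and the node distance is compared before the text distance. All function and variable names are hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]       # (start index, end index, slot value)
Candidate = Tuple[int, int, str]  # (node index on path, word-unit offset, slot value)


def decode_slot_values(tokens: List[str], slot_names: List[str],
                       matrix: List[List[Tuple[int, int]]]) -> Dict[str, List[Span]]:
    """Read slot values out of a (word unit x slot) start/end matrix.

    matrix[row][col] is an assumed (is_start, is_end) pair for word unit
    `row` and slot `col`. A start marker opens a span and the next end
    marker in the same column closes it.
    """
    spans: Dict[str, List[Span]] = {name: [] for name in slot_names}
    for col, name in enumerate(slot_names):
        start = None
        for row in range(len(tokens)):
            is_start, is_end = matrix[row][col]
            if is_start:
                start = row
            if is_end and start is not None:
                spans[name].append((start, row, " ".join(tokens[start:row + 1])))
                start = None
    return spans


def pick_target_slot(candidates: List[Candidate], association: Candidate) -> Candidate:
    """Choose the candidate slot closest to the entity association slot.

    Assumption: node distance is compared first and text distance breaks ties.
    """
    def distance(candidate: Candidate) -> Tuple[int, int]:
        node_distance = abs(candidate[0] - association[0])
        text_distance = abs(candidate[1] - association[1])
        return (node_distance, text_distance)

    return min(candidates, key=distance)


# Toy matrix for one node: "Acme" fills the company slot, "Alice Smith" the person slot.
tokens = ["Acme", "was", "founded", "by", "Alice", "Smith"]
slot_names = ["company", "person"]
NONE = (0, 0)
matrix = [
    [(1, 1), NONE],
    [NONE, NONE],
    [NONE, NONE],
    [NONE, NONE],
    [NONE, (1, 0)],
    [NONE, (0, 1)],
]
slot_values = decode_slot_values(tokens, slot_names, matrix)

# Two candidate values for the same attribute; the association slot sits in node 2.
chosen = pick_target_slot([(0, 3, "Beijing"), (2, 41, "Shanghai")], (2, 37, "Acme"))
print(slot_values, chosen)
```

Other ways of combining the two distances, such as a weighted sum, would be equally consistent with the claim wording; the lexicographic ordering is used here only because it is easy to follow.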
CN202210732269.3A 2022-06-24 2022-06-24 Text information extraction method, apparatus, device, storage medium, and program product Active CN115080742B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202210732269.3A CN115080742B (en) 2022-06-24 2022-06-24 Text information extraction method, apparatus, device, storage medium, and program product
KR1020220188247A KR20230009345A (en) 2022-06-24 2022-12-29 Method and apparatus for extracting text information, electronic device, storage medium and computer program
JP2023003753A JP2023040248A (en) 2022-06-24 2023-01-13 Text information extraction method, device, electronic apparatus, storage medium, and computer program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210732269.3A CN115080742B (en) 2022-06-24 2022-06-24 Text information extraction method, apparatus, device, storage medium, and program product

Publications (2)

Publication Number Publication Date
CN115080742A CN115080742A (en) 2022-09-20
CN115080742B true CN115080742B (en) 2023-09-05

Family

ID=83256480

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210732269.3A Active CN115080742B (en) 2022-06-24 2022-06-24 Text information extraction method, apparatus, device, storage medium, and program product

Country Status (3)

Country Link
JP (1) JP2023040248A (en)
KR (1) KR20230009345A (en)
CN (1) CN115080742B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116522935B (en) * 2023-03-29 2024-03-29 北京德风新征程科技股份有限公司 Text data processing method, processing device and electronic equipment
CN116383655B (en) * 2023-04-07 2024-01-05 北京百度网讯科技有限公司 Sample generation method, model training method, text processing method and device
CN116628230A (en) * 2023-07-25 2023-08-22 航天宏图信息技术股份有限公司 Method and device for expressing attribute association relationship, electronic equipment and storage medium
CN117174234B (en) * 2023-11-03 2024-01-05 南京都昌信息科技有限公司 Medical text data analysis method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105701253A (en) * 2016-03-04 2016-06-22 南京大学 Chinese natural language interrogative sentence semantization knowledge base automatic question-answering method
CN111611399A (en) * 2020-04-15 2020-09-01 广发证券股份有限公司 Information event mapping system and method based on natural language processing
CN114595686A (en) * 2022-03-11 2022-06-07 北京百度网讯科技有限公司 Knowledge extraction method, and training method and device of knowledge extraction model

Also Published As

Publication number Publication date
KR20230009345A (en) 2023-01-17
CN115080742A (en) 2022-09-20
JP2023040248A (en) 2023-03-22

Similar Documents

Publication Publication Date Title
CN115080742B (en) Text information extraction method, apparatus, device, storage medium, and program product
JP7301922B2 (en) Semantic retrieval method, device, electronic device, storage medium and computer program
CN114861889B (en) Deep learning model training method, target object detection method and device
CN110852106A (en) Named entity processing method and device based on artificial intelligence and electronic equipment
CN111078842A (en) Method, device, server and storage medium for determining query result
EP4174683A1 (en) Data evaluation method and apparatus, training method and apparatus, and electronic device and storage medium
CN113268560A (en) Method and device for text matching
CN111930915A (en) Session information processing method, device, computer readable storage medium and equipment
CN111368551A (en) Method and device for determining event subject
CN113407851A (en) Method, device, equipment and medium for determining recommendation information based on double-tower model
CN115269768A (en) Element text processing method and device, electronic equipment and storage medium
CN110852057A (en) Method and device for calculating text similarity
CN112906368A (en) Industry text increment method, related device and computer program product
CN114036921A (en) Policy information matching method and device
CN110807097A (en) Method and device for analyzing data
CN116049370A (en) Information query method and training method and device of information generation model
CN115860003A (en) Semantic role analysis method and device, electronic equipment and storage medium
CN111459959B (en) Method and apparatus for updating event sets
CN114841172A (en) Knowledge distillation method, apparatus and program product for text matching double tower model
CN115809313A (en) Text similarity determination method and equipment
CN113806541A (en) Emotion classification method and emotion classification model training method and device
CN113051896A (en) Method and device for correcting text, electronic equipment and storage medium
CN113220841B (en) Method, apparatus, electronic device and storage medium for determining authentication information
CN117909505B (en) Event argument extraction method and related equipment
CN114925185B (en) Interaction method, model training method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant