CN114780649A - Method and device for identifying structured data entity type - Google Patents

Method and device for identifying structured data entity type

Info

Publication number
CN114780649A
Authority
CN
China
Prior art keywords
field
data
node number
data tables
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210457940.8A
Other languages
Chinese (zh)
Inventor
郭徽
王龙
陈立力
周明伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202210457940.8A priority Critical patent/CN114780649A/en
Publication of CN114780649A publication Critical patent/CN114780649A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/288Entity relationship models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Animal Behavior & Ethology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and a device for identifying the type of a structured data entity are used for solving the problem of low accuracy in identifying the type of the structured data entity in the prior art. The method comprises the following steps: acquiring a plurality of data tables of a database; for the structured data in each data table, mapping similar fields to a node number by calculating the similarity between field information or by a trained classifier model; the field information comprises a field name and a field description; aggregating fields mapped to the same node number in the multiple data tables to obtain the node number corresponding to each field of the multiple data tables; and inputting the information of each field of the multiple data tables and the node number corresponding to each field into a trained representation learning model, and predicting the entity type corresponding to the node number to obtain the entity type corresponding to each field of the multiple data tables.

Description

Method and device for identifying type of structured data entity
Technical Field
The present application relates to the field of information extraction technologies, and in particular, to a method and an apparatus for identifying a type of a structured data entity.
Background
With the rapid development of artificial intelligence technology, knowledge graphs have been widely used across industries in the course of digital transformation. A knowledge graph uses a visual graph to show the core structure, development history, frontier areas, and overall knowledge framework of a subject. Identifying and extracting entity types is a preliminary stage of knowledge graph construction and an important link in building a knowledge graph automatically. How to identify and extract entity types from massive, complex structured data and integrate them closely with industry requirements is one of the main tasks of current information extraction technology.
At present, when a knowledge graph is built from database tables, the relation patterns in the data are usually mined with manually written rules, based on templates or rule patterns. On the one hand, because language rules are complex and diverse, writing the rules consumes a large amount of manpower; on the other hand, the same entity type can be expressed in multiple ways, so identifying entity types with templates or rules may yield low accuracy.
Therefore, a solution is needed to solve the problem of low accuracy in identifying the type of the structured data entity in the prior art.
Disclosure of Invention
The application provides a method and a device for identifying a type of a structured data entity, which are used for solving the problem of low accuracy in identifying the type of the structured data entity in the prior art.
In a first aspect, an embodiment of the present application provides a method for identifying a type of a structured data entity, where the method includes: acquiring a plurality of data tables of a database; for the structured data in each data table, mapping similar fields to a node number by calculating the similarity between field information or by a trained classifier model; the field information comprises a field name and a field description; aggregating the fields mapped to the same node number in the multiple data tables to obtain the node numbers corresponding to the fields of the multiple data tables; and inputting the information of each field of the multiple data tables and the node number corresponding to each field into a trained representation learning model, and predicting the entity type corresponding to the node number to obtain the entity type corresponding to each field of the multiple data tables.
In this technical solution, similar fields within each data table are aggregated first, and then the fields mapped to the same node number across all data tables are aggregated, after which the entity type of each node number is predicted. Identifying the entity type of a field through these two rounds of aggregation, combined with both the field name and the field description, can improve the accuracy of entity type identification.
In one possible design, the method further includes: for a field with a long field description, performing word segmentation on the field description to obtain a plurality of word segments.
In this technical solution, a longer field description may contain rich information; predicting the entity type of the field after segmenting the longer description can improve the accuracy of entity type identification.
In one possible design, the method further includes: establishing a word stock model according to the field description of each field, the word segments obtained by word segmentation, and the entity type corresponding to each field.
In this technical solution, as actual service scenarios increase and more data tables are accessed, the word stock model accumulates more data; after a certain amount of data has accumulated, the word stock model can be used independently of the classifier model.
In one possible design, the classifier model is trained as follows: after feature engineering is applied to the field description and field name of each field, they are input into the classifier model to train it.
In one possible design, the representation learning model is trained by: selecting a training set and a test set; marking the entity type of each field of each data table in the training set and the test set, and inputting the information of each field of a plurality of data tables in the training set, the node number corresponding to each field and the marked entity type of each field into a representation learning model for training; the trained representation learning model is evaluated using the test set.
In this technical solution, the finally trained representation learning model is used to identify entity types and has a certain generalization capability for service data accessed later.
In one possible design, before similar fields are mapped to a node number, for the structured data in each data table, by calculating the similarity between field information or by a trained classifier model, the method further includes preprocessing the structured data in each data table; the preprocessing includes data selection and abnormal data processing.
In this technical solution, preprocessing filters out non-standard and obviously invalid field information, so that disordered input data is converted into relatively clean data.
In one possible design, evaluating the representation learning model using the test set includes: evaluating the representation learning model using mean_rank and hit@10 as evaluation indexes.
In this technical solution, mean_rank and hit@10 are used as evaluation indexes to evaluate the representation learning model, so that the model can be corrected in time according to the evaluation result.
In a second aspect, an embodiment of the present application provides an apparatus for identifying a type of a structured data entity, including:
the acquisition module is used for acquiring a plurality of data tables of the database;
the processing module is used for mapping, for the structured data in each data table, similar fields to a node number by calculating the similarity between field information or by a trained classifier model; the field information comprises a field name and a field description;
the processing module is further configured to aggregate the fields mapped to the same node number in the multiple data tables to obtain a node number corresponding to each field of the multiple data tables;
the processing module is further configured to input information of each field of the multiple data tables and a node number corresponding to each field into a trained representation learning model, predict an entity type corresponding to the node number, and obtain an entity type corresponding to each field of the multiple data tables.
In a possible design, the processing module is further configured to perform word segmentation on a field with a long field description to obtain a plurality of word segments.
In a possible design, the processing module is further configured to establish a word stock model according to the field description of each field, the word segments obtained by word segmentation, and the entity type corresponding to each field.
In one possible design, the processing module is further configured to train the classifier model as follows: after feature engineering is applied to the field description and field name of each field, they are input into the classifier model to train it.
In one possible design, the processing module is further configured to train the representation learning model in the following manner: selecting a training set and a test set; marking the entity type of each field of each data table in the training set and the test set, and inputting the information of each field of a plurality of data tables in the training set, the node number corresponding to each field and the marked entity type of each field into a representation learning model for training; the trained representation learning model is evaluated using the test set.
In a possible design, the processing module is further configured to preprocess the structured data in each data table; the preprocessing includes data selection and abnormal data processing.
In one possible design, the processing module is further configured to evaluate the representation learning model using mean_rank and hit@10 as evaluation indexes.
In a third aspect, an embodiment of the present application further provides a computing device, including:
a memory for storing program instructions;
a processor for calling the program instructions stored in the memory and for executing the method as described in the various possible designs of the first aspect in accordance with the obtained program instructions.
In a fourth aspect, the embodiments of the present application further provide a computer-readable storage medium, in which computer-readable instructions are stored, and when the computer reads and executes the computer-readable instructions, the method described in the first aspect or any one of the possible designs of the first aspect is implemented.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a structured data entity type identification system according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for identifying a type of a structured data entity according to an embodiment of the present application;
FIG. 3 is a diagram illustrating processing of fields in a data table according to an embodiment of the present application;
FIG. 4 is a schematic diagram of aggregating a plurality of data table fields according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an apparatus for identifying types of structured data entities according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a computing device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the embodiments of the present application, a plurality means two or more. The terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or order.
FIG. 1 exemplarily shows a schematic diagram of a structured data entity type identification system to which the embodiments of the present application are applicable. As shown in FIG. 1, the system includes a service data module, an industry knowledge module, an information extraction module, a quality index module, and an entity set.
Service data module: this module contains a structured data source of a specified industry, or any other suitable data source, such as an enterprise database or data tables. It provides the training data and test data for model training.
Information extraction module: this module identifies and extracts the entity types of the input service data. It can integrate multiple algorithm models or implementations, and the models and implementations used in it can be adjusted or combined. It involves specific methods from fields such as natural language processing, machine learning, and representation learning.
Industry knowledge module: this module mainly supports the information extraction stage by providing it with prior knowledge and experience.
Quality index module: this module evaluates and controls data quality during information extraction so as to meet application requirements.
Entity set: the entity set is the set of entity types required for constructing the knowledge graph, and is the output of the information extraction module.
Fig. 2 exemplarily illustrates a method for identifying a structured data entity type provided by an embodiment of the present application, and as shown in fig. 2, the method includes the following steps:
step 201, acquiring a plurality of data tables of the database.
In the embodiment of the application, a plurality of data tables in a database of a specified industry are obtained, where the specified industry may be, but is not limited to, the public security industry. Illustratively, Table 1 is a data table provided in this embodiment. The data in the table is structured data: each row is a record, each column is a field, and the character string in the header is the field name, which may be a Chinese field name and/or an English field name. Field names differ from one another, so each uniquely identifies a column. For example, Table 1 has 8 fields, whose names are ID, name, age, identity card number, mobile phone number, address, license plate number, and vehicle color. It should be noted that the purpose of the embodiment is to identify the entity type to which each field belongs; the specific content stored in each record does not need to be known. For example, it is enough to know that the second field in Table 1 stores the name attribute, without knowing whether the name stored in a given record is Zhang San or Li Si.
TABLE 1
ID | Name | Age | Identity card number | Mobile phone number | Address | License plate number | Vehicle color
After a plurality of data tables in a database of a specified industry are obtained, data in each data table is preprocessed. The preprocessing can be divided into data selection and abnormal data processing, wherein the data selection is to define fields needing entity type identification according to actual requirements and then extract each field in a data table based on a defined rule. Specifically, a field containing rich field description content may be extracted as a main field according to field information stored in the data table.
Abnormal data processing includes cleaning the input raw data and filtering abnormal values: non-standard and obviously invalid field information is filtered out, abnormal characters in the field information are deleted (for example, mathematical symbols and emoticons), and the format of the field information is adjusted to a uniform standard. The disordered input data is thus converted into relatively clean data for further processing in subsequent steps.
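The abnormal data processing described above can be sketched roughly as follows. This is a minimal illustration and not the patent's implementation: the function names `clean_field_text` and `preprocess_fields` are hypothetical, and treating every Unicode symbol character as an "abnormal character" is an assumption made here for simplicity.

```python
import re
import unicodedata


def clean_field_text(text: str) -> str:
    """Normalize one field name or field description (illustrative rules only)."""
    # Unify full-width and half-width character variants.
    text = unicodedata.normalize("NFKC", text)
    # Assumption: drop every character in a Unicode symbol category
    # (Sm math symbols, So emoji and other symbols, Sc currency, Sk modifiers).
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("S"))
    # Bring whitespace and letter case to one uniform format.
    return re.sub(r"\s+", " ", text).strip().lower()


def preprocess_fields(fields):
    """Clean (name, description) pairs and drop fields whose name becomes empty."""
    cleaned = []
    for name, description in fields:
        name_c = clean_field_text(name)
        if name_c:  # an empty name is treated as obviously invalid
            cleaned.append((name_c, clean_field_text(description)))
    return cleaned
```

A real deployment would likely need domain-specific rules on top of this, for example for which fields count as invalid.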
Step 202, for the structured data in each data table, mapping similar fields to a node number by calculating the similarity between field information or by using a trained classifier model.
In the embodiment of the present application, the field information includes a field name and a field description. The field description may include one or more of the following: a description of the field's meaning, the data type, the length, the value range, and so on. The description of the field's meaning is the main item extracted from the field description; it can be understood as an explanation of the information stored in the field, and for a field whose name is ambiguous it allows the entity type of the field to be confirmed. For example, suppose a field in a certain data table has the field name "color" and the field description "color of vehicle". From the field name alone it is impossible to know what the stored color refers to, so the entity type of the field cannot be determined. Combined with the field description, it is clear that the field stores the color of a vehicle, and the entity type of the field can be determined to be vehicle. It should be noted that field names and field descriptions may also be referred to as data item names and data item descriptions.
After the data in the data tables is preprocessed, step 202 processes the fields in each data table separately: similar fields in each data table are mapped to the same node, and the nodes are numbered from 0. Specifically, similar fields may be mapped to a node number by calculating the similarity between the field information or by using a trained classifier model. Fields mapped to the same node have high similarity and belong to the same entity type. Before the similarity is calculated, each piece of field information needs to be input into an existing model that converts the field text into a vector representation, so that calculating similarity reduces to the numerical operation of taking the inner product of field vectors. Likewise, before field information is input into the classifier model, it needs to be converted into a vector representation by an existing model.
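The similarity-based variant of this mapping can be sketched as below. The sketch makes several assumptions that are not in the source: normalized character-bigram counts stand in for the "existing model" that turns field text into vectors, fields are assigned greedily to the most similar existing node, and the threshold of 0.5 is arbitrary. The names `embed`, `inner_product`, and `assign_node_numbers` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: L2-normalized character-bigram counts (an assumption)."""
    text = text.lower()
    grams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    norm = math.sqrt(sum(v * v for v in grams.values())) or 1.0
    return {gram: count / norm for gram, count in grams.items()}


def inner_product(u, v):
    """Inner product of two sparse vectors stored as dicts."""
    return sum(weight * v.get(gram, 0.0) for gram, weight in u.items())


def assign_node_numbers(fields, threshold=0.5):
    """Map each (name, description) pair to a node number, numbering nodes from 0."""
    representatives = []  # vector of the first field assigned to each node
    node_numbers = []
    for name, description in fields:
        vec = embed(name + " " + description)
        best_node, best_sim = None, threshold
        for node, rep in enumerate(representatives):
            sim = inner_product(vec, rep)
            if sim >= best_sim:
                best_node, best_sim = node, sim
        if best_node is None:  # nothing similar enough: open a new node
            representatives.append(vec)
            best_node = len(representatives) - 1
        node_numbers.append(best_node)
    return node_numbers
```

Because the vectors are normalized, the inner product here equals cosine similarity, so identical field information scores 1 and field information with no shared bigrams scores 0.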
For better understanding of step 202, fig. 3 exemplarily shows a process of mapping similar fields in each data table to the same node number, as shown in fig. 3, taking two data tables as an example, after calculating the similarity of each field in 2 data tables respectively or inputting each field into a trained classifier model, 3 fields in data table 1, field 101, field 102 and field 103 are mapped to the same node, and the node number is 0; the 3 fields in data table 2, field 201, field 202 and field 203, are also mapped to node number 0, and field 204 and field 205 are mapped to the same node, which is numbered 1.
In a possible implementation manner, for a field with a long field description or rich information, the field description is subjected to word segmentation processing to obtain a plurality of word segments. For example, "sender name" can be broken down into "sender" and "name". The longer description field may contain rich information, and after the word segmentation processing is performed on the longer description field, the entity type of the field is predicted, so that the accuracy of identifying the entity type can be improved.
Further, a word stock model is established according to the field description of each field, the word segments obtained by word segmentation, and the entity type corresponding to each field. Different data tables may express the same attribute in multiple ways. For example, field information such as "identification card number" and "resident identification card number" are different expressions of the same attribute, whose entity type is person. These expressions of "identity card number" are collected into the word stock model. When the field information of a newly accessed data table is "resident identity card number", looking it up in the word stock model shows that the field is the "identity card number" attribute and that the corresponding entity type is person. When the field information of an accessed data table is "continental resident identity card number", this expression is added to the word stock model. As actual service scenarios increase and more data tables are accessed, the word stock model accumulates more data; after a certain amount has accumulated, the word stock model can be used independently of the classifier model, and the two can each take part in actual task processing as independent models.
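As a rough illustration of word segmentation and of the word stock model, the following sketch stores each known expression with its attribute and entity type. The class name `WordStockModel` and its methods are hypothetical, and whitespace splitting is only a stand-in: Chinese field descriptions would need a real word segmenter.

```python
def segment(description):
    """Whitespace segmentation as a stand-in for a real word segmenter."""
    return description.lower().split()


class WordStockModel:
    """Maps known expressions of an attribute to the attribute and its entity type."""

    def __init__(self):
        self._attribute_of = {}  # expression -> canonical attribute
        self._entity_of = {}     # canonical attribute -> entity type

    def add(self, expression, attribute, entity_type):
        """Record one more expression of an attribute."""
        self._attribute_of[expression.lower()] = attribute
        self._entity_of[attribute] = entity_type

    def lookup(self, expression):
        """Return (attribute, entity type), or None for an unknown expression."""
        attribute = self._attribute_of.get(expression.lower())
        if attribute is None:
            return None
        return attribute, self._entity_of[attribute]
```

An unknown expression returns `None`, which is the point at which a caller could fall back to the classifier model and later add the new expression to the word stock.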
The classifier model used in step 202 may be trained by:
(1) Select a training set and a test set.
The ratio of the training set to the test set may be 7:3.
(2) And after the field description and the field name of each field in the training set are subjected to feature engineering treatment, inputting the field description and the field name into a classifier model to train the classifier model.
(3) The classifier model is evaluated using a test set.
Feature engineering here means encoding the field description and field name, converting the text information into numerical information. For example, the field description and field name are converted into a codeword of 7 to 2.
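The three training steps above can be sketched as follows. The source does not name a classifier or an encoding, so everything concrete here is an assumption: a binary bag-of-words encoding stands in for the feature engineering, a nearest-centroid rule stands in for the classifier model, and all function names are hypothetical. Each sample is a `(field name, field description, node number)` tuple.

```python
import random


def split_7_3(samples, seed=0):
    """Shuffle the samples and split them into training and test sets at 7:3."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * 0.7)
    return shuffled[:cut], shuffled[cut:]


def build_vocabulary(samples):
    """Collect every word that appears in a field name or description."""
    words = set()
    for name, description, _label in samples:
        words.update((name + " " + description).lower().split())
    return sorted(words)


def encode(name, description, vocabulary):
    """Feature engineering stand-in: binary bag-of-words vector."""
    tokens = set((name + " " + description).lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]


def train_nearest_centroid(train, vocabulary):
    """Average the encoded vectors of each label (here, each node number)."""
    sums, counts = {}, {}
    for name, description, label in train:
        acc = sums.setdefault(label, [0.0] * len(vocabulary))
        for i, x in enumerate(encode(name, description, vocabulary)):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(vec, centroids[label]))
    return min(centroids, key=distance)


def accuracy(centroids, test, vocabulary):
    """Fraction of test samples whose predicted label matches the marked one."""
    hits = sum(predict(centroids, encode(n, d, vocabulary)) == label for n, d, label in test)
    return hits / len(test)
```

Any classifier that accepts numeric feature vectors could take the place of the nearest-centroid rule; it is used here only because it fits in a few lines.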
Step 203, aggregating the fields mapped to the same node number in the multiple data tables to obtain the node numbers corresponding to the fields of the multiple data tables.
Based on the node numbers obtained for the fields of each data table in step 202, the fields mapped to the same node number across the multiple data tables in the database are aggregated. This step can be understood as aggregating similar fields, which belong to the same entity type, across multiple data tables. FIG. 4 exemplarily shows a process of aggregating fields with the same node number in a plurality of data tables. As shown in FIG. 4, taking 4 data tables as an example, after the fields in data table 1, data table 2, data table 3 and data table 4 are processed separately, the node numbers of the similar fields in each data table are obtained: field 101 in data table 1 corresponds to node number 0, and field 102 corresponds to node number 1; field 201 and field 202 in data table 2 correspond to node number 0, and field 203 corresponds to node number 1; field 301 in data table 3 corresponds to node number 1, and field 302 corresponds to node number 2; fields 401 and 403 in data table 4 correspond to node number 2, and field 402 corresponds to node number 1. After the fields with the same node number in the 4 data tables are aggregated, node number 0 corresponds to field 101 of data table 1 and fields 201 and 202 of data table 2; node number 1 corresponds to field 102 of data table 1, field 203 of data table 2, field 301 of data table 3, and field 402 of data table 4; node number 2 corresponds to field 302 of data table 3, and fields 401 and 403 of data table 4.
Step 204, inputting the information of each field of the multiple data tables and the node number corresponding to each field into the trained representation learning model, and predicting the entity type corresponding to the node number to obtain the entity type corresponding to each field of the multiple data tables.
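The aggregation in the FIG. 4 example amounts to grouping fields by node number. A minimal sketch follows, with the hypothetical function name `aggregate_by_node` and under the assumption, as in FIG. 4, that node numbers already mean the same thing in every table.

```python
from collections import defaultdict


def aggregate_by_node(table_nodes):
    """Group fields from all tables by node number.

    table_nodes maps table name -> {field name: node number}.
    Returns node number -> list of (table name, field name).
    """
    aggregated = defaultdict(list)
    for table, fields in table_nodes.items():
        for field, node in fields.items():
            aggregated[node].append((table, field))
    return dict(sorted(aggregated.items()))


# Per-table node numbers from the FIG. 4 example.
fig4 = {
    "data table 1": {"field 101": 0, "field 102": 1},
    "data table 2": {"field 201": 0, "field 202": 0, "field 203": 1},
    "data table 3": {"field 301": 1, "field 302": 2},
    "data table 4": {"field 401": 2, "field 402": 1, "field 403": 2},
}
nodes = aggregate_by_node(fig4)
```

With this input, `nodes[0]` holds field 101 and fields 201 and 202, matching the aggregation result described for FIG. 4.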
In the embodiment of the application, the entity type is predetermined according to actual business requirements, for example, people, things, vehicles, places, virtual account numbers, and the like.
In one possible implementation, the representation learning model in step 204 may be trained by:
(1) Select a training set and a test set.
The ratio of the training set to the test set may be 7:3.
(2) And marking the entity type of each field of each data table in the training set and the test set, and inputting the information of each field of a plurality of data tables in the training set, the node number corresponding to each field and the entity type of each field marked in advance into a representation learning model for learning.
(3) The representation learning model is evaluated using a test set.
Similarly, the data in each data table in the training set and test set needs to be preprocessed. The preprocessing comprises data selection and abnormal data processing, wherein the data selection is to define fields needing entity type identification according to actual requirements and then extract all fields in a data table based on a defined rule. The abnormal data processing comprises the steps of cleaning input original data, filtering abnormal values and the like, filtering out unnormalized and obviously invalid field information in the input original data and deleting abnormal characters in the field information.
In addition, the preprocessing also needs to mark each field in the test set by using the internal identifier of the data element, so as to provide reference for evaluating the representation learning model. The internal identifier of the data element is determined according to a national standard organization or related industry standards, the field and the internal identifier of the data element are in unique corresponding relation, and the identifiers of the data elements in the same field are the same.
The labeling of the entity type of each field in each data table can be understood as associating the attribute of the field with the entity type, and taking the fields in table 1 as an example, the entity types of the fields of name, age, identification number, mobile phone number and address are labeled as people, and the entity types of the two fields of license plate number and vehicle color are labeled as vehicles. That is, the person has attributes of name, age, identification number, mobile phone number and address, and the vehicle has two attributes of license plate number and vehicle color. It is also understood that the attributes of name, age, identification number, cell phone number and address belong to the entity type of person, and the attributes of license plate number and vehicle color belong to the entity type of vehicle.
In a specific implementation process, training set data can be input into a TransE model for learning, and an objective function expression of the TransE model is as follows:
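$$f_r(h, t) = \lVert l_h + l_r - l_t \rVert_{L_1/L_2}$$
where $l_h$, $l_t$, and $l_r$ respectively denote the vector representations of the head, tail, and relation in each triple instance (head, relation, tail), and the subscript $L_1/L_2$ indicates that either the $L_1$ norm or the $L_2$ norm may be used.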
TransE is a distributed vector representation of entities and relations. During data construction, the field information, the entity types of the fields, and the specified relation types are assembled into triple instances. For example, in the triple (name, name-person, person), the field "name" is the head node with number m, the entity type "person" is the tail node with number n, and the relation type is "name-person". The TransE model treats the relation in each triple instance (head, relation, tail) as a translation (vector addition) from the head entity to the tail entity: by continually adjusting $l_h$, $l_r$, and $l_t$, it makes $(l_h + l_r)$ as close as possible to $l_t$, i.e., $l_h + l_r \approx l_t$.
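The TransE scoring function can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the training code of this application: the vectors are ordinary lists of floats, and the margin-based ranking loss shown here is the loss commonly used to train TransE, which the text above does not itself specify.

```python
import math


def transe_score(head, relation, tail, norm=1):
    """Return ||head + relation - tail|| under the L1 (norm=1) or L2 (norm=2) norm.

    A lower score means the triple is more plausible.
    """
    diff = [h + r - t for h, r, t in zip(head, relation, tail)]
    if norm == 1:
        return sum(abs(d) for d in diff)
    if norm == 2:
        return math.sqrt(sum(d * d for d in diff))
    raise ValueError("norm must be 1 or 2")


def margin_loss(positive_score, negative_score, margin=1.0):
    """Margin-based ranking loss (an assumption, not stated in the text).

    The loss is zero once the corrupted triple scores at least `margin`
    worse than the true triple.
    """
    return max(0.0, margin + positive_score - negative_score)
```

When the head vector plus the relation vector equals the tail vector exactly, the score is zero, which is the condition the model tries to approach for true triples.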
The trained representation learning model is evaluated using mean_rank and hit@10 as evaluation indexes, so that the model can be corrected in time according to the evaluation result. Here, mean_rank is the average rank of the correct result among the candidates, that is, the average number of candidates that must be matched before the correct result is reached; the lower the mean_rank value, the better the representation learning model. hit@10 is the proportion of cases in which the correct result appears among the top 10 candidates; the higher the hit@10 value, the better the representation learning model.
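The two evaluation indexes can be computed from the rank of the correct result in each test case. The following is a minimal sketch; the function names and the convention that candidates are sorted by ascending score (a lower TransE score being more plausible) are assumptions made for illustration.

```python
def rank_of_correct(scores, correct):
    """Return the 1-based rank of the correct candidate.

    `scores` maps each candidate to its score; candidates are sorted by
    ascending score, so the most plausible candidate has rank 1.
    """
    ordered = sorted(scores, key=scores.get)
    return ordered.index(correct) + 1


def mean_rank(ranks):
    """Average rank of the correct result; lower is better."""
    return sum(ranks) / len(ranks)


def hits_at_k(ranks, k=10):
    """Fraction of test cases whose correct result is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)
```

For example, ranks of 1, 3, 12, and 2 over four test cases give a mean rank of 4.5 and a hit@10 of 0.75, since three of the four correct results fall within the top 10.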
The information of each field of the multiple data tables and the node number corresponding to each field are input into the trained representation learning model, which predicts the entity type corresponding to each node number, thereby obtaining the entity type corresponding to each field of the multiple data tables. Taking Fig. 4 as an example, after the fields of the multiple data tables that share the same node number are aggregated, they are input into the trained representation learning model, which predicts that field 101, field 201, and field 202, mapped to node 0, belong to entity type A; that field 102, field 203, field 301, and field 402, mapped to node 1, belong to entity type B; and that field 302, field 401, and field 403, mapped to node 2, belong to entity type C. The entity types of the fields in the multiple data tables of the database are thus obtained and can be used in the subsequent construction of an industry knowledge graph or a general knowledge graph.
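The Fig. 4 example can be written out as a small sketch. The field-to-node mapping reproduces the numbers given in the text; `NODE_TO_ENTITY` merely stands in for the predictions of the trained representation learning model, and both function names are hypothetical.

```python
from collections import defaultdict

# Field-to-node mapping from the Fig. 4 example in the text.
FIELD_TO_NODE = {
    101: 0, 201: 0, 202: 0,
    102: 1, 203: 1, 301: 1, 402: 1,
    302: 2, 401: 2, 403: 2,
}

# Stand-in for the model output: one predicted entity type per node number.
NODE_TO_ENTITY = {0: "A", 1: "B", 2: "C"}


def aggregate_by_node(field_to_node):
    """Group the fields of all data tables by the node number they map to."""
    groups = defaultdict(list)
    for field, node in sorted(field_to_node.items()):
        groups[node].append(field)
    return dict(groups)


def entity_type_per_field(field_to_node, node_to_entity):
    """Propagate the entity type predicted for each node back to its fields."""
    return {field: node_to_entity[node] for field, node in field_to_node.items()}
```

Because the prediction is made once per node number, every field that shares a node number receives the same entity type.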
In the above method for identifying the entity type of structured data, similar fields within each data table are first aggregated, the fields mapped to the same node number across all data tables are then aggregated, and the entity type of each node number is finally predicted. Because the entity type of a field is identified through these two rounds of aggregation combined with multiple kinds of information, namely the field name and the field description, the accuracy of entity type identification is improved.
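The first aggregation, in which similar fields within one data table are mapped to the same node number by similarity between field information, might be sketched as follows. The similarity measure (the standard-library `difflib.SequenceMatcher` ratio averaged over name and description), the greedy assignment strategy, and the threshold of 0.6 are all assumptions chosen for illustration; the text does not prescribe a particular similarity measure.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Average the string similarity of the field names and of the descriptions."""
    name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
    desc_sim = SequenceMatcher(None, a["description"], b["description"]).ratio()
    return (name_sim + desc_sim) / 2


def assign_node_numbers(fields, threshold=0.6):
    """Greedily map each field to the first node whose representative is similar.

    A field that is not similar enough to any existing representative opens
    a new node. Returns one node number per input field, in input order.
    """
    representatives = []
    node_numbers = []
    for field in fields:
        for node, rep in enumerate(representatives):
            if field_similarity(field, rep) >= threshold:
                node_numbers.append(node)
                break
        else:
            representatives.append(field)
            node_numbers.append(len(representatives) - 1)
    return node_numbers
```

Comparing each field only against one representative per node keeps the sketch short; a clustering method or the trained classifier model mentioned in the text could be substituted for this step.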
Based on the same technical concept, fig. 5 exemplarily illustrates an apparatus for identifying a structured data entity type provided by an embodiment of the present application, and the apparatus is used for implementing the above method for identifying a structured data entity type. As shown in fig. 5, the apparatus 500 includes:
an obtaining module 501, configured to obtain multiple data tables of a database;
a processing module 502, configured to map similar fields to a node number by calculating similarity between field information or by using a trained classifier model for the structured data in each data table; the field information comprises a field name and a field description;
the processing module 502 is further configured to aggregate the fields mapped to the same node number in the multiple data tables, so as to obtain a node number corresponding to each field of the multiple data tables;
the processing module 502 is further configured to input information of each field of the multiple data tables and a node number corresponding to each field into a trained representation learning model, and predict an entity type corresponding to the node number to obtain an entity type corresponding to each field of the multiple data tables.
In a possible design, the processing module 502 is further configured to perform word segmentation on a field with a long field description to obtain a plurality of word segments.
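A simple way to picture the word segmentation of long field descriptions is shown below. This is only a sketch: the length threshold is hypothetical, and splitting on whitespace and punctuation is a stand-in for a dedicated word segmenter, which would be needed for text such as Chinese that has no spaces between words.

```python
import re

MAX_DESCRIPTION_LENGTH = 20  # hypothetical threshold for a "long" description


def segment_description(description, max_length=MAX_DESCRIPTION_LENGTH):
    """Split a long field description into word segments.

    Short descriptions are returned unchanged as a single segment.
    """
    if len(description) <= max_length:
        return [description]
    # Split on runs of whitespace and common separators, dropping empty pieces.
    return [token for token in re.split(r"[\s,;:/()_\-]+", description) if token]
```

The resulting word segments can then be compared or indexed individually instead of treating the long description as one string.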
In a possible design, the processing module 502 is further configured to establish a word library model according to the field description of each field, the multiple word segments after the word segmentation processing, and the entity type corresponding to each field.
In one possible design, the processing module 502 is further configured to train the classifier model in the following manner: and after the field description and the field name of each field are subjected to feature engineering treatment, inputting the field description and the field name into a classifier model to train the classifier model.
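The training of a classifier on engineered features of the field name and field description could be sketched as follows. The text does not name a feature engineering method or a classifier type, so both are assumptions here: character bigram counts serve as the features, and a nearest-centroid classifier with cosine similarity stands in for the classifier model. The class and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def featurize(name, description):
    """Feature engineering stand-in: character bigram counts of name and description."""
    text = f"{name} {description}".lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a, b):
    """Cosine similarity between two bigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CentroidClassifier:
    """Nearest-centroid classifier mapping a field to a node number."""

    def fit(self, samples, labels):
        # Sum the feature counts of all training fields that share a label.
        self.centroids = defaultdict(Counter)
        for (name, description), label in zip(samples, labels):
            self.centroids[label].update(featurize(name, description))
        return self

    def predict(self, name, description):
        # Return the label whose centroid is most similar to the field.
        features = featurize(name, description)
        return max(self.centroids,
                   key=lambda label: cosine(features, self.centroids[label]))
```

A usage example under the same assumptions: after `fit` is called on a few (field name, field description) pairs labeled with node numbers, `predict` returns the node number whose training fields look most like the new field.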
In one possible design, the processing module 502 is further configured to train the representation learning model in the following manner: selecting a training set and a test set; marking the entity type of each field of each data table in the training set and the test set, and inputting the information of each field of a plurality of data tables in the training set, the node number corresponding to each field and the marked entity type of each field into a representation learning model for training; the trained representation learning model is evaluated using the test set.
In one possible design, the processing module 502 is further configured to preprocess the structured data in each data table; the preprocessing comprises: data selection and abnormal data processing.
In one possible design, the processing module 502 is further configured to evaluate the representation learning model by using mean_rank and hit@10 as evaluation indexes.
Based on the same technical concept, the embodiment of the present application provides a computing device, as shown in fig. 6, including at least one processor 601 and a memory 602 connected to the at least one processor, where a specific connection medium between the processor 601 and the memory 602 is not limited in the embodiment of the present application, and a bus connection between the processor 601 and the memory 602 in fig. 6 is taken as an example. The bus may be divided into an address bus, a data bus, a control bus, etc.
In the embodiment of the present application, the memory 602 stores instructions executable by the at least one processor 601, and the at least one processor 601 may execute the steps of the method for identifying the type of the structured data entity by executing the instructions stored in the memory 602.
The processor 601 is the control center of the computer device. It can connect the various parts of the computer device by using various interfaces and lines, and performs resource setting by running or executing the instructions stored in the memory 602 and calling the data stored in the memory 602. Optionally, the processor 601 may include one or more processing units, and the processor 601 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, and the like, and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor may also not be integrated into the processor 601. In some embodiments, the processor 601 and the memory 602 may be implemented on the same chip; in other embodiments, they may be implemented on separate chips.
The processor 601 may be a general-purpose processor, such as a Central Processing Unit (CPU), a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present Application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in a processor.
The memory 602, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 602 may include at least one type of storage medium, for example, a flash memory, a hard disk, a multimedia card, a card-type memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read Only Memory (PROM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 602 may be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 602 in the embodiments of the present application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
Based on the same technical concept, embodiments of the present application further provide a computer-readable storage medium storing a computer-executable program, where the computer-executable program is configured to enable a computer to perform the method for identifying a type of a structured data entity listed in any of the above manners.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the scope of the present application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method of identifying a structured data entity type, the method comprising:
acquiring a plurality of data tables of a database;
for the structured data in each data table, mapping similar fields to a node number by calculating the similarity between field information or by a trained classifier model; the field information comprises a field name and a field description;
aggregating the fields mapped to the same node number in the multiple data tables to obtain the node numbers corresponding to the fields of the multiple data tables;
and inputting the information of each field of the multiple data tables and the node number corresponding to each field into a trained representation learning model, and predicting the entity type corresponding to the node number to obtain the entity type corresponding to each field of the multiple data tables.
2. The method of claim 1, further comprising:
and for the field with long field description, performing word segmentation processing on the field description to obtain a plurality of word segments.
3. The method of claim 2, further comprising:
and establishing a word stock model according to the field description of each field, the multiple word segments after word segmentation processing and the entity types corresponding to each field.
4. The method of claim 1, wherein the classifier model is trained by: and after the field description and the field name of each field are subjected to feature engineering treatment, inputting the field description and the field name into a classifier model to train the classifier model.
5. The method of claim 1, wherein the representation learning model is trained by:
selecting a training set and a test set;
marking the entity type of each field of each data table in the training set and the test set;
inputting information of each field of a plurality of data tables in the training set, node numbers corresponding to each field and entity types of each marked field into a representation learning model for training;
the trained representation learning model is evaluated using the test set.
6. The method according to claim 1, wherein before the step of mapping similar fields to a node number by calculating similarity between field information or by a trained classifier for the structured data in each data table, the method further comprises preprocessing the structured data in each data table;
the preprocessing comprises: data selection and abnormal data processing.
7. The method of any one of claims 1-6, wherein said evaluating said representation learning model using said test set comprises:
the representation learning model is evaluated using mean_rank and hit@10 as evaluation indexes.
8. An apparatus for identifying structured data entity types, the apparatus comprising:
the acquisition module is used for acquiring a plurality of data tables of the database;
the processing module is used for mapping similar fields to a node number by calculating the similarity between field information or a trained classifier model aiming at the structured data in each data table; the field information comprises a field name and a field description;
the processing module is further configured to aggregate fields mapped to the same node number in the multiple data tables to obtain a node number corresponding to each field of the multiple data tables;
the processing module is further configured to input information of each field of the multiple data tables and a node number corresponding to each field into a trained representation learning model, predict an entity type corresponding to the node number, and obtain an entity type corresponding to each field of the multiple data tables.
9. A computing device, comprising:
a memory for storing program instructions;
a processor for calling program instructions stored in said memory and for executing the method of any one of claims 1 to 7 in accordance with the obtained program instructions.
10. A computer readable storage medium comprising computer readable instructions which, when read and executed by a computer, cause the computer to perform the method of any one of claims 1 to 7.
CN202210457940.8A 2022-04-27 2022-04-27 Method and device for identifying structured data entity type Pending CN114780649A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210457940.8A CN114780649A (en) 2022-04-27 2022-04-27 Method and device for identifying structured data entity type

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210457940.8A CN114780649A (en) 2022-04-27 2022-04-27 Method and device for identifying structured data entity type

Publications (1)

Publication Number Publication Date
CN114780649A true CN114780649A (en) 2022-07-22

Family

ID=82434013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210457940.8A Pending CN114780649A (en) 2022-04-27 2022-04-27 Method and device for identifying structured data entity type

Country Status (1)

Country Link
CN (1) CN114780649A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117827991A (en) * 2024-03-06 2024-04-05 南湖实验室 Method and system for identifying personal identification information in semi-structured data

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117827991A (en) * 2024-03-06 2024-04-05 南湖实验室 Method and system for identifying personal identification information in semi-structured data
CN117827991B (en) * 2024-03-06 2024-05-31 南湖实验室 Method and system for identifying personal identification information in semi-structured data

Similar Documents

Publication Publication Date Title
CN106776538A (en) The information extracting method of enterprise's noncanonical format document
CN111597348B (en) User image drawing method, device, computer equipment and storage medium
CN112070138B (en) Construction method of multi-label mixed classification model, news classification method and system
CN108959474B (en) Entity relation extraction method
CN112836509A (en) Expert system knowledge base construction method and system
CN113360768A (en) Product recommendation method, device and equipment based on user portrait and storage medium
CN113360654A (en) Text classification method and device, electronic equipment and readable storage medium
CN114780649A (en) Method and device for identifying structured data entity type
CN114357195A (en) Knowledge graph-based question-answer pair generation method, device, equipment and medium
CN113935880A (en) Policy recommendation method, device, equipment and storage medium
CN112395407A (en) Method and device for extracting enterprise entity relationship and storage medium
CN109657710B (en) Data screening method and device, server and storage medium
CN112148735A (en) Construction method for structured form data knowledge graph
CN111597336A (en) Processing method and device of training text, electronic equipment and readable storage medium
CN114898156B (en) Cross-modal semantic representation learning and fusion-based image classification method and system
CN110851597A (en) Method and device for sentence annotation based on similar entity replacement
CN110750712A (en) Software security requirement recommendation method based on data driving
CN113705201B (en) Text-based event probability prediction evaluation algorithm, electronic device and storage medium
CN112528674B (en) Text processing method, training device, training equipment and training equipment for model and storage medium
CN115146073A (en) Test question knowledge point marking method for cross-space semantic knowledge injection and application
CN111046934B (en) SWIFT message soft clause recognition method and device
CN113888265A (en) Product recommendation method, device, equipment and computer-readable storage medium
CN112541357A (en) Entity identification method and device and intelligent equipment
CN111400413A (en) Method and system for determining category of knowledge points in knowledge base
CN113836244B (en) Sample acquisition method, model training method, relation prediction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination