CN110515926A

CN110515926A - Method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis

Info

Publication number: CN110515926A
Application number: CN201910802454.3A
Authority: CN
Inventors: 马世乾; 闫卫国; 王刚; 尚学军; 王伟臣; 李国栋; 郭悦; ***; 杨晓静; 黄志刚; 崇志强; 王天昊
Original assignee: State Grid Tianjin Electric Power Co Ltd; Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd
Current assignee: State Grid Tianjin Electric Power Co Ltd; Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd
Priority date: 2019-08-28
Filing date: 2019-08-28
Publication date: 2019-11-29

Abstract

The present invention relates to a method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis. Its technical features are the following steps: extracting data from the heterogeneous data sources; formatting the extracted data using a common data representation format; combing the data structure of the heterogeneous source data; parsing and cleaning the data using the natural language processing techniques of word segmentation and semantic dependency analysis; and outputting the combed result set of the heterogeneous source data. The invention integrates the data of various systems across multiple classes of heterogeneous data sources and processes the data structures uniformly. Using word segmentation and semantic dependency analysis, it merges duplicate data records across the multi-source systems, creates associations between the data of each system, merges data between the different systems of each discipline, screens and merges duplicate attributes of all data types, automatically analyzes data validity, and performs data cleaning on the ranking results. At the same time, it can substantially reduce labor costs and improve working efficiency.

Description

Method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis

Technical field

The present invention relates to the field of electric power information and communication technology, and in particular to a method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis.

Background art

In recent years, with the development of information technologies such as big data, cloud computing, and the Internet of Things, the electric power field has gained many new means of information analysis, making it possible to analyze massive power grid data and to uncover the relationships within the data.

Legacy electric power systems were each built on their own business model according to their own needs. As a result, although the objective subjects are the same across systems, the data structures differ greatly, each table has a different focus, and associating, interfacing, and using the data is difficult. Existing techniques mostly target migration between different databases and lack both optimization of the data structure and cleaning of the data. Moreover, when such work is handled manually, the workload is heavy and cannot cope with diverse long-term business requirements.

Summary of the invention

The object of the present invention is to overcome the above shortcomings of the prior art and to provide an accurate, reliable, and efficient method, requiring little manual effort, for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis.

The present invention solves its technical problem through the following technical solution:

A method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis, comprising the following steps:

Step 1: extract data from the heterogeneous data sources;

Step 2: format the extracted data using a common data representation format;

Step 3: comb the data structure of the heterogeneous source data;

Step 4: parse and clean the data using the natural language processing techniques of word segmentation and semantic dependency analysis;

Step 5: output the combed result set of the heterogeneous source data.

Further, the heterogeneous data sources include relational databases, non-relational databases, and data stored in document files.

Further, step 1 is implemented through the following steps:

(1) using a configurable tool, invoke different drivers through ODBC or JDBC connections, connect to database instances by dynamic reflection, and extract the data of the different kinds of database instances;

(2) using a configurable tool, dynamically identify the file type, connect through different drivers to read documents, text documents, and spreadsheet documents, and extract the data in the different files.
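As a rough illustration of step 1, the sketch below picks a reader for each source from a configuration entry. It is only a sketch under stated assumptions: Python's standard `sqlite3` and `csv` modules stand in for the ODBC/JDBC drivers and document readers named above, and `READERS`, `extract`, and the configuration keys `name`, `kind`, `path`, and `table` are hypothetical names that do not come from the patent.

```python
import csv
import sqlite3

def read_sqlite(cfg):
    """Read all rows of one table as dictionaries (stand-in for an ODBC/JDBC source)."""
    conn = sqlite3.connect(cfg["path"])
    try:
        conn.row_factory = sqlite3.Row
        # The table name comes from trusted configuration in this sketch.
        rows = conn.execute(f'SELECT * FROM "{cfg["table"]}"').fetchall()
        return [dict(row) for row in rows]
    finally:
        conn.close()

def read_csv(cfg):
    """Read a delimited text file as dictionaries (stand-in for a spreadsheet document)."""
    with open(cfg["path"], newline="", encoding="utf-8") as handle:
        return [dict(row) for row in csv.DictReader(handle)]

# Hypothetical registry: the configured "kind" selects the reader at run time.
READERS = {"sqlite": read_sqlite, "csv": read_csv}

def extract(configs):
    """Return {source name: list of records} for every configured source."""
    result = {}
    for cfg in configs:
        reader = READERS.get(cfg["kind"])
        if reader is None:
            raise ValueError(f"no reader registered for kind {cfg['kind']!r}")
        result[cfg["name"]] = reader(cfg)
    return result
```

Supporting a further kind of source then only requires registering one more reader in `READERS`; the calling code stays unchanged.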

Further, step 2 is implemented through the following steps:

(1) format the extracted data using the common data representation format XML, converting the data objects of the heterogeneous data sources into unified global data objects that meet the requirements of the common data representation format;

(2) by recognizing the keywords "ID", "identifier", and "name", automatically mark the corresponding attribute as the globally unique identifier attribute.
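A minimal sketch of step 2 follows, using Python's standard `xml.etree.ElementTree`. The element and attribute names (`objects`, `object`, `attribute`, `identifier`) are assumptions chosen for illustration, since the patent does not define an XML schema, and the English keywords are assumed renderings of the three keywords named above.

```python
import re
import xml.etree.ElementTree as ET

# Assumed English renderings of the identifier keywords named in the text.
ID_KEYWORDS = {"id", "identifier", "name"}

def find_identifier(columns):
    """Return the first column whose name contains an identifier keyword as a whole word."""
    for column in columns:
        words = re.split(r"[^a-z0-9]+", column.lower())
        if ID_KEYWORDS.intersection(words):
            return column
    return None

def to_xml(object_type, source, records):
    """Format extracted records as one XML tree of global data objects."""
    root = ET.Element("objects", {"type": object_type, "source": source})
    key = find_identifier(list(records[0].keys())) if records else None
    for record in records:
        obj = ET.SubElement(root, "object")
        for column, value in record.items():
            attr = ET.SubElement(obj, "attribute", {"name": column})
            if column == key:
                # Candidate unique identifier; a person may still need to confirm it.
                attr.set("identifier", "true")
            attr.text = "" if value is None else str(value)
    return root

rows = [{"Voltage rating": 235, "Device name": "Station A No. 1 main transformer"}]
tree = to_xml("Transformer", "A", rows)
xml_text = ET.tostring(tree, encoding="unicode")
```

Matching whole words rather than substrings avoids false hits such as the letters "id" inside "width".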

Further, step 3 is implemented through the following steps:

(1) establish the complete attribute set of the global data object, containing all attributes possessed by the global data object in each heterogeneous data source;

(2) parse the attributes of the global data object using word segmentation and semantic dependency analysis; if attributes of the global data object are duplicated across the heterogeneous data sources, the complete attribute set retains only one instance of each duplicated attribute;

(3) take the complete attribute set as the final attribute set of this class of global data object and generate the corresponding data structure;

(4) use the unique identifier of the global data object as the criterion for generating records, and integrate the records across the heterogeneous data sources.

Further, step 4 is implemented through the following steps:

(1) integrate records according to the unique-identifier marking in the global data object and create a result set; every record has a globally unique identifier attribute, and if the globally unique identifier attribute is repeated across different data sources, the result set retains only one record;

(2) rank the heterogeneous data sources by data-use priority for each duplicated attribute, the ranking criterion being data completeness;

(3) with reference to the data-use priority, comb the data of the global data objects: re-enter the data of every record according to the complete attribute set, with the actual value of each attribute chosen in descending order of priority, preferring the corresponding attribute of the global data object in the data source with the higher priority.
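The ranking in sub-step (2) can be sketched as follows. This is an illustrative sketch, not the patented implementation: `rank_sources` is a hypothetical name, the tie-breaking order (more records first, then source name) follows the rule given later in the embodiment, and the toy data is invented purely to exercise the function.

```python
def rank_sources(sources, attribute):
    """Order source names by data-use priority for one attribute, highest first.

    `sources` maps a source name to its list of records (dicts).
    A value of None or "" counts as empty.
    """
    def sort_key(name):
        records = sources[name]
        filled = sum(1 for r in records if r.get(attribute) not in (None, ""))
        completeness = filled / len(records) if records else 0.0
        # Higher completeness first, then more records, then source name
        # (sorting by the full name also orders by its first letter).
        return (-completeness, -len(records), name)
    return sorted(sources, key=sort_key)

# Toy data, invented for illustration only.
toy = {
    "X": [{"v": 1}, {"v": None}],   # completeness 0.5
    "Y": [{"v": 2}],                # completeness 1.0, one record
    "Z": [{"v": 3}, {"v": 4}],      # completeness 1.0, two records
}
order = rank_sources(toy, "v")      # ["Z", "Y", "X"]
```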

Further, step 5 is implemented through the following steps:

(1) output the complete attribute set of the global data object in structured form, forming a standard and complete data structure;

(2) configure the output according to specific business needs, perform storage format conversion and row/column conversion, and output the result set after combing the attributes of the global data object.

The advantages and positive effects of the present invention are:

The present invention is rationally designed. It integrates the data of various systems across multiple classes of heterogeneous data sources and processes the data structures uniformly. Using word segmentation and semantic dependency analysis, it merges duplicate data records across the multi-source systems, creates associations between the data of each system, merges data between the different systems of each discipline, screens and merges duplicate attributes of all data types, automatically analyzes data validity, and performs data cleaning on the ranking results. Most steps of this process are automatic, which can substantially reduce labor costs and improve working efficiency.

Brief description of the drawings

Fig. 1 is a process flow diagram of the present invention.

Detailed description of the embodiments

The embodiments of the present invention are further described below with reference to the drawing.

The design idea of the present invention is as follows: a configurable program dynamically invokes different connection drivers to extract data from the heterogeneous data sources; the extracted data is formatted using a common data representation format; the data structure of the heterogeneous source data is then combed; next, the data is parsed and cleaned using the natural language processing techniques of word segmentation and semantic dependency analysis, so that the data records of the multiple sources are integrated and the data attributes are re-entered; finally, the combed result set is output in a specified format.

In the present invention, word segmentation means the process of recombining a continuous character sequence into a word sequence according to a given specification. The specification used here includes, but is not limited to, part-of-speech tagging, that is, determining the part of speech of each word in a sentence (adjective, verb, noun, and so on), also called simply tagging. Semantic dependency analysis means analyzing the semantic associations between the linguistic units of a sentence and presenting those associations as a dependency structure.
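To make the notion of word segmentation concrete, the sketch below applies forward maximum matching, a classic dictionary-based technique. The patent does not name a segmentation algorithm, so this choice is an assumption made for illustration, and the small vocabulary `VOCAB` is hypothetical.

```python
def forward_max_match(text, vocabulary):
    """Greedy left-to-right segmentation: at each position take the longest vocabulary word."""
    max_len = max((len(word) for word in vocabulary), default=1)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Unknown single characters pass through as one-character words.
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words

# Hypothetical vocabulary covering the column names of the example.
VOCAB = {"变压器", "设备", "名称", "额定", "电压"}

forward_max_match("变压器名称", VOCAB)  # ["变压器", "名称"]  (transformer + name)
forward_max_match("设备名称", VOCAB)    # ["设备", "名称"]    (device + name)
```

Both column names end in the same word 名称 ("name"), which is the kind of cue a subsequent semantic analysis could use when deciding whether two attributes refer to the same thing.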

Based on the above design idea, the method of the present invention for combing massive data from heterogeneous data sources comprises the following steps:

Step 1: extract data from the heterogeneous data sources, as follows:

(1) An executable program is written in C# for the .NET environment. Using Open Database Connectivity (ODBC), it connects to two different types of database, SQL Server and MySQL (simulating the heterogeneous data sources of a real system), and reads the data in the table named TransFormer (transformer) as data source A and data source B respectively.

(2) The transformer data in a spreadsheet (xls) document containing transformer equipment information is read as data source C.

Step 2: format the extracted data using a common data representation format, as follows:

(1) The three original data sources A, B, and C are given a structured format using the common data representation format (XML).

(2) The program automatically marks the name field in the transformer table as the unique identifier; after manual confirmation, processing moves on to the next stage.

Transformer table in data source A

Device name (*)	Voltage rating	Manufacturer	Device model
Station A No. 1 main transformer	235	A factory	I type
Station A No. 2 main transformer	235	B factory	II type
Station B No. 1 main transformer	121	C factory	I type

Transformer table in data source B

Transformer name (*)	Voltage rating	Rated capacity	Rated frequency
Station A No. 1 main transformer	220	100000	50
Station B No. 1 main transformer	110	50000	50

Transformer table in data source C

Name (*)	Date of putting into operation	Operating status	Voltage rating
Station A No. 1 main transformer	2010.06	In operation	235
Station B No. 1 main transformer	2010.07	In operation	
Station B No. 2 main transformer	2010.01	Decommissioned	235

Step 3: comb the data structure of the heterogeneous source data, as follows:

(1) Establish the complete attribute set of the global data object (the transformer object); it should contain all the data fields present in data sources A, B, and C.

(2) Parse the attributes of the global data object using word segmentation and semantic dependency analysis. If attributes of the global data object are duplicated across the heterogeneous data sources, the complete attribute set retains only one instance of each duplicated attribute. For example, data sources A, B, and C each have a column for the equipment name, called device name, transformer name, and name respectively; since all of them semantically refer to the name of the transformer equipment, only one attribute of this kind is retained.

(3) Take the complete attribute set as the final attribute set of this class of global data object and generate the corresponding data structure. For the transformer object in this embodiment, the attributes should be (device name, voltage rating, rated current, manufacturer, device model, date of putting into operation, equipment state).
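The attribute union of step 3 can be sketched with the column headers of the three example tables. The semantic judgement is the part the patent assigns to word segmentation and semantic dependency analysis; here a hand-written lookup table, `SYNONYMS`, stands in for it, which is purely an assumption made to keep the sketch self-contained. All function names are hypothetical.

```python
# Column headers of the three example sources, in table order.
SOURCE_COLUMNS = {
    "A": ["Device name", "Voltage rating", "Manufacturer", "Device model"],
    "B": ["Transformer name", "Voltage rating", "Rated capacity", "Rated frequency"],
    "C": ["Name", "Date of putting into operation", "Operating status", "Voltage rating"],
}

# Hypothetical stand-in for the NLP step: names judged to mean the same attribute.
SYNONYMS = {"transformer name": "device name", "name": "device name"}

def canonical(column):
    """Map a column header to its canonical attribute name."""
    key = column.strip().lower()
    return SYNONYMS.get(key, key)

def attribute_union(source_columns):
    """Union of all attributes across sources, keeping one instance of each duplicate."""
    merged = []
    for columns in source_columns.values():
        for column in columns:
            name = canonical(column)
            if name not in merged:
                merged.append(name)
    return merged

attributes = attribute_union(SOURCE_COLUMNS)
```

For the twelve column headers of the example tables, `attributes` contains eight entries: the three name columns collapse into one, and voltage rating, which appears in all three sources, is kept once.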

Step 4: parse and clean the data using the natural language processing techniques of word segmentation and semantic dependency analysis, as follows:

(1) Integrate records according to the unique-identifier marking in the global data object and create a result set. Every record has a globally unique identifier attribute, and if the globally unique identifier attribute is repeated across different data sources, the result set retains only one record. In this embodiment, four records are retained: Station A No. 1 main transformer, Station A No. 2 main transformer, Station B No. 1 main transformer, and Station B No. 2 main transformer.

(2) Rank the data sources by data-use priority for the duplicated attributes of the heterogeneous data sources described in claim 4. The ranking criterion is data completeness, that is, the proportion of global data objects in a data source for which the attribute is non-empty, out of all global data objects in that data source. If the proportions are equal, the data source with the larger total amount of data ranks first; if the totals are also equal, the data sources are ordered by the first letter of their names. For example, the priority for the voltage rating attribute, from high to low, is: data source A, data source B, data source C.

(3) With reference to the data-use priority, comb the data of the global data objects: re-enter the data of every record according to the complete attribute set, with the actual value of each attribute chosen in descending order of priority, preferring the corresponding attribute of the global data object in the data source with the higher priority, which yields the combed result.
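Sub-steps (1) and (3) can be sketched on the data of the three example tables. This is a hedged illustration rather than the patented implementation: `merge_records` is a hypothetical name, column names are assumed to be already unified, and a single priority order (A, B, C, the order stated above for voltage rating) is applied to every attribute, which is harmless here because every other attribute occurs in only one source.

```python
ATTRIBUTES = ["voltage rating", "manufacturer", "device model", "rated capacity",
              "rated frequency", "date of putting into operation", "operating status"]

N_A1 = "Station A No. 1 main transformer"
N_A2 = "Station A No. 2 main transformer"
N_B1 = "Station B No. 1 main transformer"
N_B2 = "Station B No. 2 main transformer"

SOURCES = {
    "A": [
        {"device name": N_A1, "voltage rating": 235, "manufacturer": "A factory", "device model": "I type"},
        {"device name": N_A2, "voltage rating": 235, "manufacturer": "B factory", "device model": "II type"},
        {"device name": N_B1, "voltage rating": 121, "manufacturer": "C factory", "device model": "I type"},
    ],
    "B": [
        {"device name": N_A1, "voltage rating": 220, "rated capacity": 100000, "rated frequency": 50},
        {"device name": N_B1, "voltage rating": 110, "rated capacity": 50000, "rated frequency": 50},
    ],
    "C": [
        {"device name": N_A1, "date of putting into operation": "2010.06", "operating status": "In operation", "voltage rating": 235},
        {"device name": N_B1, "date of putting into operation": "2010.07", "operating status": "In operation", "voltage rating": None},
        {"device name": N_B2, "date of putting into operation": "2010.01", "operating status": "Decommissioned", "voltage rating": 235},
    ],
}

def merge_records(sources, attributes, priority, key="device name"):
    """Build one record per unique identifier; each attribute takes the value of the
    highest-priority source that has a non-empty value for it."""
    merged = {}
    for source in priority:  # highest priority first
        for record in sources[source]:
            target = merged.setdefault(record[key], {a: None for a in attributes})
            for attr in attributes:
                value = record.get(attr)
                if target[attr] is None and value not in (None, ""):
                    target[attr] = value
    return merged

merged = merge_records(SOURCES, ATTRIBUTES, priority=["A", "B", "C"])
```

With this data, `merged` holds four records. The voltage rating of Station A No. 1 main transformer is 235, taken from data source A rather than the 220 recorded in data source B, while its rated capacity and rated frequency come from data source B and its date and status from data source C.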

Step 5: output the combed result set of the heterogeneous source data, as follows:

(1) Output the complete attribute set of the global data object in structured form, forming a standard and complete data structure:

Transformer: {device name, voltage rating, manufacturer, device model, rated capacity, rated frequency, date of putting into operation, operating status}

(2) Configure the output according to specific business needs. The storage format can be selected: the combed result set can be exported into the integrated data structure, or back into the original tables in the three data sources A, B, and C (data for attribute fields that an original table does not contain will be discarded). Output the result set after combing the attributes of the global data object.
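The output options of step 5 can be sketched as three small helpers. The function names are hypothetical, and CSV is used only as an example storage format, since the patent leaves the target format to configuration.

```python
import csv
import io

def project(result_set, columns):
    """Keep only the listed columns of each record; other attributes are dropped.

    This mirrors writing back to an original table that lacks some attributes.
    """
    return [{c: record.get(c) for c in columns} for record in result_set]

def transpose(result_set, columns):
    """Row/column conversion: one output row per attribute, one column per record."""
    return [[c] + [record.get(c) for record in result_set] for c in columns]

def to_csv(result_set, columns):
    """Serialize records as CSV text; missing or None values become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(result_set)
    return buffer.getvalue()

# Two combed records, abbreviated to three attributes for the example.
rows = [
    {"device name": "T1", "voltage rating": 235, "manufacturer": "A factory"},
    {"device name": "T2", "voltage rating": None, "manufacturer": "B factory"},
]
text = to_csv(rows, ["device name", "voltage rating"])
```

Here `text` contains a header line followed by `T1,235` and `T2,`; the manufacturer column is left out because it is not among the requested columns.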

Matters not addressed in the present invention are as in the prior art.

It should be emphasized that this embodiment is illustrative. In real heterogeneous data sources the data structures are mostly more complex, with objects that have many attributes, and the data volume is that of massive data records; the example here only describes how the method is carried out.

It should be emphasized that the embodiments described herein are illustrative rather than restrictive. The present invention therefore includes, but is not limited to, the embodiments described in the detailed description; any other embodiment derived by those skilled in the art from the technical solution of the present invention also falls within the scope of protection of the present invention.

Claims

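1. A method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis, characterized by comprising the following steps:

Step 1: extract data from the heterogeneous data sources;

Step 2: format the extracted data using a common data representation format;

Step 3: comb the data structure of the heterogeneous source data;

Step 4: parse and clean the data using the natural language processing techniques of word segmentation and semantic dependency analysis;

Step 5: output the combed result set of the heterogeneous source data.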

2. The method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis according to claim 1, characterized in that the heterogeneous data sources include relational databases, non-relational databases, and data stored in document files.

3. The method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis according to claim 1 or 2, characterized in that step 1 is implemented through the following steps:

(1) using a configurable tool, invoke different drivers through ODBC or JDBC connections, connect to database instances by dynamic reflection, and extract the data of the different kinds of database instances;

(2) using a configurable tool, dynamically identify the file type, connect through different drivers to read documents, text documents, and spreadsheet documents, and extract the data in the different files.

4. The method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis according to claim 1, characterized in that step 2 is implemented through the following steps:

(1) format the extracted data using the common data representation format XML, converting the data objects of the heterogeneous data sources into unified global data objects that meet the requirements of the common data representation format;

(2) by recognizing the keywords "ID", "identifier", and "name", automatically mark the corresponding attribute as the globally unique identifier attribute.

5. The method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis according to claim 1, characterized in that step 3 is implemented through the following steps:

(1) establish the complete attribute set of the global data object, containing all attributes possessed by the global data object in each heterogeneous data source;

(2) parse the attributes of the global data object using word segmentation and semantic dependency analysis; if attributes of the global data object are duplicated across the heterogeneous data sources, the complete attribute set retains only one instance of each duplicated attribute;

(3) take the complete attribute set as the final attribute set of this class of global data object and generate the corresponding data structure;

(4) use the unique identifier of the global data object as the criterion for generating records, and integrate the records across the heterogeneous data sources.

6. The method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis according to claim 1, characterized in that step 4 is implemented through the following steps:

(1) integrate records according to the unique-identifier marking in the global data object and create a result set; every record has a globally unique identifier attribute, and if the globally unique identifier attribute is repeated across different data sources, the result set retains only one record;

(2) rank the heterogeneous data sources by data-use priority for each duplicated attribute, the ranking criterion being data completeness;

(3) with reference to the data-use priority, comb the data of the global data objects: re-enter the data of every record according to the complete attribute set, with the actual value of each attribute chosen in descending order of priority, preferring the corresponding attribute of the global data object in the data source with the higher priority.

7. The method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis according to claim 1, characterized in that step 5 is implemented through the following steps:

(1) output the complete attribute set of the global data object in structured form, forming a standard and complete data structure;

(2) configure the output according to specific business needs, perform storage format conversion and row/column conversion, and output the result set after combing the attributes of the global data object.