CN110515926A - Heterogeneous data source mass data carding method based on participle and semantic dependency analysis - Google Patents

Heterogeneous data source mass data carding method based on participle and semantic dependency analysis

Info

Publication number
CN110515926A
Authority
CN
China
Prior art keywords
data
attribute
data source
participle
heterogeneous
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910802454.3A
Other languages
Chinese (zh)
Inventor
马世乾
闫卫国
王刚
尚学军
王伟臣
李国栋
郭悦
***
杨晓静
黄志刚
崇志强
王天昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Tianjin Electric Power Co Ltd
Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd
Original Assignee
State Grid Tianjin Electric Power Co Ltd
Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Tianjin Electric Power Co Ltd and Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd
Priority to CN201910802454.3A
Publication of CN110515926A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis. Its technical features comprise the following steps: extracting data from the heterogeneous data sources; formatting the extracted data using a common data representation format; combing the data structure of the heterogeneous data source data; parsing and cleaning the data using the natural language processing techniques of word segmentation and semantic dependency analysis; and outputting the combed result set of the heterogeneous data source data. The invention integrates data from various systems across multiple classes of heterogeneous data sources and processes the data structures in a unified way. Using word segmentation and semantic dependency analysis, it merges duplicate records from multiple source systems, creates associations between the data of the individual systems, fuses data across the systems of different disciplines, and screens out and merges duplicated attributes of each data type. It automatically analyzes data validity and cleans the data according to the ranking results, and it can substantially reduce labor costs and improve working efficiency.

Description

Method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis
Technical field
The present invention relates to the field of electric power information and communication technology, and in particular to a method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis.
Background
In recent years, with the development of information technologies such as big data, cloud computing, and the Internet of Things, many new information analysis tools have been introduced in the power sector. These make it possible to analyze massive volumes of power grid data and to uncover the intrinsic relationships within the data.
Legacy power systems each built their own business models according to their own needs. As a result, although the real-world objects described by the various systems are the same, their data structures differ greatly and each table has a different focus, which makes the systems difficult to associate, interface, and use together. The prior art mostly addresses migration between different kinds of databases and lacks both optimization of data structures and cleaning of the data. Such work is also labor-intensive when controlled manually and cannot cope with diverse long-term business requirements.
Summary of the invention
The object of the present invention is to overcome the above shortcomings of the prior art and to provide a method for combing massive data from heterogeneous data sources, based on word segmentation and semantic dependency analysis, that is accurate, reliable, and efficient and requires little manual effort.
The present invention solves its technical problem with the following technical solution:
A method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis, comprising the following steps:
Step 1: extract data from the heterogeneous data sources;
Step 2: format the extracted data using a common data representation format;
Step 3: comb the data structure of the heterogeneous data source data;
Step 4: parse and clean the data using the natural language processing techniques of word segmentation and semantic dependency analysis;
Step 5: output the combed result set of the heterogeneous data source data.
Further, the heterogeneous data sources include relational databases, non-relational databases, and data stored in document-type files.
Further, step 1 is implemented as follows:
(1) using a configurable tool, invoke different drivers through an ODBC or JDBC connection, connect to database instances by dynamic reflection, and extract data from database instances of different kinds;
(2) using a configurable tool, dynamically identify the file type, connect through different drivers to read documents, text documents, and spreadsheet documents, and extract the data in the different files.
Further, step 2 is implemented as follows:
(1) format the extracted data using the common data representation format XML, converting the heterogeneous data source data objects into unified global data objects that meet the requirements of the common data representation format;
(2) by recognizing the keywords "ID", "identifier", and "name", automatically mark the corresponding attribute as the globally unique identifier attribute.
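The keyword-based marking in (2) might look like the sketch below. It assumes English column names, treats the keyword order as a priority order, and matches whole words so that a column such as "Valid" is not mistaken for an ID; the function name `find_identifier_attribute` and these matching rules are illustrative assumptions, not details given in the text.

```python
import re

# English stand-ins for the keywords named in the text, in assumed priority order.
ID_KEYWORDS = ("id", "identifier", "name")


def find_identifier_attribute(columns, keywords=ID_KEYWORDS):
    """Return the first column containing a keyword as a whole word, or None."""
    tokenized = [(column, re.findall(r"[a-z0-9]+", column.lower())) for column in columns]
    for keyword in keywords:
        for column, tokens in tokenized:
            if keyword in tokens:
                return column
    return None


candidate = find_identifier_attribute(
    ["Voltage rating", "Device name", "Manufacturer", "Device model"]
)
```

The returned column is only a candidate; the embodiment still has a person confirm the automatic mark before processing continues.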
Further, step 3 is implemented as follows:
(1) establish the full attribute set of the global data object; it contains all attributes that the global data object has in any of the heterogeneous data sources;
(2) parse the attributes of the global data object using word segmentation and semantic dependency analysis; if attributes of the global data object are duplicated across heterogeneous data sources, the full attribute set retains only one instance of each duplicated attribute;
(3) use the full attribute set as the final attribute set of this class of global data object and generate the corresponding data structure;
(4) use the unique identifier of the global data object as the criterion for generating records, and integrate the records across the heterogeneous data sources.
Further, step 4 is implemented as follows:
(1) integrate records according to the unique identifier mark of the global data object and create a result set; every record has a globally unique identifier attribute, and if the same identifier value appears in different data sources, the result set retains only one record for it;
(2) rank the data-use priority of the data sources for each attribute that is duplicated across heterogeneous data sources; the ranking criterion is data completeness;
(3) following the data-use priority, comb the data of the global data objects: re-enter the data of every record according to the full attribute set, taking the actual value of each attribute in descending priority order, so that the corresponding attribute of the global data object in the higher-priority data source is used first.
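The completeness-based ranking in (2) can be sketched as a sort key. The tie-breaking rules (larger total amount of data first, then the first letter of the data source name) follow the embodiment later in this document; reading "total amount of data" as the record count, sorting by the whole name rather than only its first letter, and the name `rank_sources` are assumptions made for illustration.

```python
def rank_sources(sources, attribute):
    """Order source names by completeness of one attribute, highest priority first.

    sources maps a source name to a list of record dicts.
    """
    def sort_key(name):
        records = sources[name]
        filled = sum(1 for record in records if record.get(attribute) not in (None, ""))
        completeness = filled / len(records) if records else 0.0
        # Higher completeness first, then more records, then alphabetical name.
        return (-completeness, -len(records), name)

    candidates = [
        name for name, records in sources.items()
        if any(attribute in record for record in records)
    ]
    return sorted(candidates, key=sort_key)


voltage_sources = {
    "C": [{"voltage rating": 235}, {"voltage rating": None}, {"voltage rating": 235}],
    "B": [{"voltage rating": 220}, {"voltage rating": 110}],
    "A": [{"voltage rating": 235}, {"voltage rating": 235}, {"voltage rating": 121}],
}
order = rank_sources(voltage_sources, "voltage rating")
```

Here sources A and B are both fully populated, so A ranks first because it has more records, and C ranks last because one of its three values is empty.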
Further, step 5 is implemented as follows:
(1) output the full attribute set of the global data object in structured form, producing a standard and complete data structure;
(2) configure the output according to specific business needs, perform storage format conversion and row-column conversion, and output the result set obtained after combing the global data object attributes.
The advantages and positive effects of the present invention are:
The present invention is reasonably designed. It integrates data from various systems across multiple classes of heterogeneous data sources and processes the data structures in a unified way. Using word segmentation and semantic dependency analysis, it merges duplicate records from multiple source systems, creates associations between the data of the individual systems, fuses data across the systems of different disciplines, and screens out and merges duplicated attributes of each data type. It automatically analyzes data validity and cleans the data according to the ranking results. Because the vast majority of the steps are automatic, the method can substantially reduce labor costs and improve working efficiency.
Description of the drawings
Fig. 1 is the process flow diagram of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in further detail below with reference to the drawing.
The design idea of the present invention is as follows: a configurable program dynamically invokes different connection drivers to extract data from heterogeneous data sources; the extracted data are formatted using a common data representation format; the data structure of the heterogeneous data source data is then combed; the data are parsed and cleaned using the natural language processing techniques of word segmentation and semantic dependency analysis; records from the multiple sources are integrated and their attribute data re-entered; and finally the combed result set is output in a specified format.
In the present invention, word segmentation means the process of recombining a continuous character sequence into a word sequence according to a certain specification. The specification used here includes, but is not limited to, part-of-speech tagging, that is, determining the part of speech of each word in a sentence, such as adjective, verb, or noun; this is also simply called tagging. Semantic dependency analysis means analyzing the semantic associations between the linguistic units of a sentence and presenting those associations as a dependency structure.
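To make the notion of word segmentation concrete, the sketch below applies forward maximum matching, one common dictionary-based approach, to two Chinese column names. The text does not say which segmenter or parser is used, so the algorithm, the tiny hand-made `LEXICON`, and its part-of-speech tags are all illustrative assumptions.

```python
# Hand-made toy lexicon: word -> part of speech (all entries are assumptions).
LEXICON = {
    "变压器": "noun",  # transformer
    "设备": "noun",    # device
    "名称": "noun",    # name
}


def segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon word."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def tag(words, lexicon):
    """Attach the lexicon's part of speech to each word, or 'unknown'."""
    return [(word, lexicon.get(word, "unknown")) for word in words]


transformer_name = segment("变压器名称", LEXICON)  # "transformer name"
device_name = segment("设备名称", LEXICON)         # "device name"
same_final_word = transformer_name[-1] == device_name[-1]
```

Both column names end in the same word, 名称 ("name"). Comparing the final word is only a crude stand-in for a real semantic dependency parser, but it hints at how two differently named columns can be recognized as the same attribute.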
Based on the above design idea, the method of the present invention for combing massive data from heterogeneous data sources comprises the following steps:
Step 1: extract data from the heterogeneous data sources, as follows:
(1) An executable program for the .NET environment is written in C#. Using Open Database Connectivity (ODBC), it connects to two databases of different types, SQL Server and MySQL (simulating the heterogeneous data sources of a real system), and reads the data in the table named TransFormer (transformer) from each, as data source A and data source B respectively.
(2) The transformer data in a spreadsheet-format (xls) document containing transformer equipment information are read as data source C.
Step 2: format the extracted data using a common data representation format, as follows:
(1) The three original data sources A, B, and C are formatted into a structured form using the common data representation format (XML).
(2) The program automatically marks the name field of each transformer table as the unique identifier; after manual confirmation, processing moves to the next stage.
Transformer table in data source A
Device name (*) | Voltage rating | Manufacturer | Device model
Station A No. 1 main transformer | 235 | Factory A | Type I
Station A No. 2 main transformer | 235 | Factory B | Type II
Station B No. 1 main transformer | 121 | Factory C | Type I
Transformer table in data source B
Transformer name (*) | Voltage rating | Rated capacity | Rated frequency
Station A No. 1 main transformer | 220 | 100000 | 50
Station B No. 1 main transformer | 110 | 50000 | 50
Transformer table in data source C
Name (*) | Commissioning date | Operating status | Voltage rating
Station A No. 1 main transformer | 2010.06 | In service | 235
Station B No. 1 main transformer | 2010.07 | In service | (empty)
Station B No. 2 main transformer | 2010.01 | Out of service | 235
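One way to picture the XML formatting of step 2 is the sketch below, which wraps a single row of data source A as a global data object and flags its unique identifier. The element and attribute names (`object`, `attribute`, `unique_id`) are a hypothetical schema chosen for illustration; the text does not specify the XML layout.

```python
import xml.etree.ElementTree as ET


def to_global_object(source, object_type, record, id_attribute):
    """Wrap one source record as an XML element with a flagged identifier."""
    obj = ET.Element("object", {"type": object_type, "source": source})
    for name, value in record.items():
        attr = ET.SubElement(obj, "attribute", {"name": name})
        if name == id_attribute:
            attr.set("unique_id", "true")
        attr.text = "" if value is None else str(value)
    return obj


row_a = {
    "Device name": "Station A No. 1 main transformer",
    "Voltage rating": 235,
    "Manufacturer": "Factory A",
    "Device model": "Type I",
}
transformer_xml = to_global_object("A", "transformer", row_a, "Device name")
xml_text = ET.tostring(transformer_xml, encoding="unicode")
```

Once every source is expressed in the same shape, the later steps can treat rows from a database and rows from a spreadsheet identically.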
Step 3: comb the data structure of the heterogeneous data source data, as follows:
(1) Establish the full attribute set of the global data object (the transformer object); it should contain all data fields present in data sources A, B, and C.
(2) Parse the attributes of the global data object using word segmentation and semantic dependency analysis. If attributes of the global data object are duplicated across heterogeneous data sources, the full attribute set retains only one instance of each duplicated attribute. For example, data sources A, B, and C each have a column for the device name, under different names such as "Device name" and "Name"; since all of them semantically refer to the transformer device name, only one such attribute is retained.
(3) Use the full attribute set as the final attribute set of this class of global data object and generate the corresponding data structure. For example, the transformer object in this embodiment should contain the attributes (device name, voltage rating, rated current, manufacturer, device model, commissioning date, equipment status).
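The attribute combing above amounts to taking the union of all source columns after mapping synonymous column names onto one canonical attribute. In the sketch below the `CANONICAL` mapping is written by hand and stands in for the output of the semantic analysis; it, the function name, and the English column names are assumptions for illustration.

```python
# Hypothetical outcome of the semantic analysis: source column -> canonical attribute.
CANONICAL = {
    "Device name": "device name",
    "Transformer name": "device name",
    "Name": "device name",
}

SOURCE_COLUMNS = {
    "A": ["Device name", "Voltage rating", "Manufacturer", "Device model"],
    "B": ["Transformer name", "Voltage rating", "Rated capacity", "Rated frequency"],
    "C": ["Name", "Commissioning date", "Operating status", "Voltage rating"],
}


def build_attribute_set(source_columns, canonical):
    """Union of all columns, keeping one instance per canonical attribute."""
    attributes = []
    for columns in source_columns.values():
        for column in columns:
            name = canonical.get(column, column.lower())
            if name not in attributes:
                attributes.append(name)
    return attributes


attribute_set = build_attribute_set(SOURCE_COLUMNS, CANONICAL)
```

Applied to the column headers of the three transformer tables, this yields eight attributes, with the three name columns and the three voltage columns each collapsed into one.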
Step 4: parse and clean the data using the natural language processing techniques of word segmentation and semantic dependency analysis, as follows:
(1) Integrate records according to the unique identifier mark of the global data object and create a result set. Every record has a globally unique identifier attribute, and if the same identifier value appears in different data sources, the result set retains only one record for it. For example, 4 records are retained in this embodiment: Station A No. 1 main transformer, Station A No. 2 main transformer, Station B No. 1 main transformer, and Station B No. 2 main transformer.
(2) Rank the data-use priority of the data sources for the duplicated attributes of the heterogeneous data sources described in claim 4. The ranking criterion is data completeness, that is, the proportion of global data objects in a data source for which the attribute is non-empty, out of all global data objects in that data source. If the proportions are equal, the data source with the larger total amount of data ranks first; if the totals are also equal, the data sources are ordered by the first letter of the data source name. For example, the priority order for the voltage rating attribute, from high to low, is: data source A, data source B, data source C.
(3) Following the data-use priority, comb the data of the global data objects: re-enter the data of every record according to the full attribute set, taking the actual value of each attribute in descending priority order, so that the corresponding attribute of the global data object in the higher-priority data source is used first. The result is:
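As a minimal sketch of how record integration and priority-based re-entry could fit together on the three transformer tables, the code below merges records by identifier and fills each attribute from the highest-priority source that has a non-empty value. The canonical attribute names, the `merge` function, and the fallback order for attributes without an explicit priority (alphabetical by source name) are assumptions; the output is what this sketch computes, not a table taken from the patent.

```python
SOURCES = {
    "A": [
        {"device name": "Station A No. 1 main transformer", "voltage rating": 235,
         "manufacturer": "Factory A", "device model": "Type I"},
        {"device name": "Station A No. 2 main transformer", "voltage rating": 235,
         "manufacturer": "Factory B", "device model": "Type II"},
        {"device name": "Station B No. 1 main transformer", "voltage rating": 121,
         "manufacturer": "Factory C", "device model": "Type I"},
    ],
    "B": [
        {"device name": "Station A No. 1 main transformer", "voltage rating": 220,
         "rated capacity": 100000, "rated frequency": 50},
        {"device name": "Station B No. 1 main transformer", "voltage rating": 110,
         "rated capacity": 50000, "rated frequency": 50},
    ],
    "C": [
        {"device name": "Station A No. 1 main transformer", "commissioning date": "2010.06",
         "operating status": "In service", "voltage rating": 235},
        {"device name": "Station B No. 1 main transformer", "commissioning date": "2010.07",
         "operating status": "In service", "voltage rating": None},
        {"device name": "Station B No. 2 main transformer", "commissioning date": "2010.01",
         "operating status": "Out of service", "voltage rating": 235},
    ],
}
ATTRIBUTES = ["device name", "voltage rating", "manufacturer", "device model",
              "rated capacity", "rated frequency", "commissioning date", "operating status"]
# Priority order for the duplicated attribute, as stated in the embodiment.
PRIORITY = {"voltage rating": ["A", "B", "C"]}


def merge(sources, id_attribute, attributes, priority):
    """Keep one record per identifier; fill attributes by source priority."""
    identifiers = []
    for records in sources.values():
        for record in records:
            if record[id_attribute] not in identifiers:
                identifiers.append(record[id_attribute])
    merged = []
    for identifier in identifiers:
        row = {attribute: None for attribute in attributes}
        row[id_attribute] = identifier
        for attribute in attributes:
            if attribute == id_attribute:
                continue
            for name in priority.get(attribute, sorted(sources)):
                match = next((r for r in sources[name] if r[id_attribute] == identifier), None)
                if match is not None and match.get(attribute) not in (None, ""):
                    row[attribute] = match[attribute]
                    break
        merged.append(row)
    return merged


result_set = merge(SOURCES, "device name", ATTRIBUTES, PRIORITY)
```

With these inputs the sketch keeps four records, matching the count given in (1), and each transformer takes its voltage rating from data source A where A has a value, falling back to B and then C otherwise.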
Step 5: output the combed result set of the heterogeneous data source data, as follows:
(1) Output the full attribute set of the global data object in structured form, producing a standard and complete data structure:
Transformer: {device name, voltage rating, manufacturer, device model, rated capacity, rated frequency, commissioning date, operating status}
(2) Configure the output according to specific business needs; the storage format can be selected. The combed result set can be exported to the integrated data structure, or written back to the original tables in the three data sources A, B, and C (attribute field data that an original table does not contain are discarded), thereby outputting the result set obtained after combing the global data object attributes.
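The two output options in (2) can be sketched as follows: either serialize the integrated structure as it is, or project each combed record back onto the columns of one original table and drop everything that table never had. JSON as the storage format, the column map for data source B, and the function name `project_to_source` are illustrative assumptions.

```python
import json

combed_records = [
    {"device name": "Station A No. 1 main transformer", "voltage rating": 235,
     "manufacturer": "Factory A", "device model": "Type I",
     "rated capacity": 100000, "rated frequency": 50,
     "commissioning date": "2010.06", "operating status": "In service"},
]

# Hypothetical mapping from canonical attributes to data source B's own columns.
COLUMN_MAP_B = {
    "device name": "Transformer name",
    "voltage rating": "Voltage rating",
    "rated capacity": "Rated capacity",
    "rated frequency": "Rated frequency",
}


def project_to_source(records, column_map):
    """Rename attributes to a source table's columns; unmapped fields are dropped."""
    return [
        {column: record.get(attribute) for attribute, column in column_map.items()}
        for record in records
    ]


# Option 1: export the integrated data structure.
integrated_json = json.dumps(combed_records, ensure_ascii=False, indent=2)
# Option 2: write back in the shape of an original table.
rows_for_b = project_to_source(combed_records, COLUMN_MAP_B)
```

Writing back is lossy by design: the manufacturer, model, commissioning date, and operating status are discarded for data source B because its table has no such columns.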
Matters not addressed in the present invention are as in the prior art.
It should be emphasized that this embodiment is illustrative: the data structures in real heterogeneous data sources are mostly more complex objects with many attributes, and the method is suited to massive volumes of data records; the embodiment here only describes how the method is carried out.
It should be emphasized that the embodiments described herein are illustrative rather than restrictive. The present invention therefore includes, but is not limited to, the embodiments described in the detailed description; any other embodiment derived by a person skilled in the art from the technical solution of the present invention also falls within the scope of protection of the present invention.

Claims (7)

1. A method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis, characterized in that it comprises the following steps:
Step 1: extract data from the heterogeneous data sources;
Step 2: format the extracted data using a common data representation format;
Step 3: comb the data structure of the heterogeneous data source data;
Step 4: parse and clean the data using the natural language processing techniques of word segmentation and semantic dependency analysis;
Step 5: output the combed result set of the heterogeneous data source data.
2. The method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis according to claim 1, characterized in that the heterogeneous data sources include relational databases, non-relational databases, and data stored in document-type files.
3. The method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis according to claim 1 or 2, characterized in that step 1 is implemented as follows:
(1) using a configurable tool, invoke different drivers through an ODBC or JDBC connection, connect to database instances by dynamic reflection, and extract data from database instances of different kinds;
(2) using a configurable tool, dynamically identify the file type, connect through different drivers to read documents, text documents, and spreadsheet documents, and extract the data in the different files.
4. The method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis according to claim 1, characterized in that step 2 is implemented as follows:
(1) format the extracted data using the common data representation format XML, converting the heterogeneous data source data objects into unified global data objects that meet the requirements of the common data representation format;
(2) by recognizing the keywords "ID", "identifier", and "name", automatically mark the corresponding attribute as the globally unique identifier attribute.
5. The method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis according to claim 1, characterized in that step 3 is implemented as follows:
(1) establish the full attribute set of the global data object; it contains all attributes that the global data object has in any of the heterogeneous data sources;
(2) parse the attributes of the global data object using word segmentation and semantic dependency analysis; if attributes of the global data object are duplicated across heterogeneous data sources, the full attribute set retains only one instance of each duplicated attribute;
(3) use the full attribute set as the final attribute set of this class of global data object and generate the corresponding data structure;
(4) use the unique identifier of the global data object as the criterion for generating records, and integrate the records across the heterogeneous data sources.
6. The method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis according to claim 1, characterized in that step 4 is implemented as follows:
(1) integrate records according to the unique identifier mark of the global data object and create a result set; every record has a globally unique identifier attribute, and if the same identifier value appears in different data sources, the result set retains only one record for it;
(2) rank the data-use priority of the data sources for each attribute that is duplicated across heterogeneous data sources; the ranking criterion is data completeness;
(3) following the data-use priority, comb the data of the global data objects: re-enter the data of every record according to the full attribute set, taking the actual value of each attribute in descending priority order, so that the corresponding attribute of the global data object in the higher-priority data source is used first.
7. The method for combing massive data from heterogeneous data sources based on word segmentation and semantic dependency analysis according to claim 1, characterized in that step 5 is implemented as follows:
(1) output the full attribute set of the global data object in structured form, producing a standard and complete data structure;
(2) configure the output according to specific business needs, perform storage format conversion and row-column conversion, and output the result set obtained after combing the global data object attributes.
CN201910802454.3A 2019-08-28 2019-08-28 Heterogeneous data source mass data carding method based on participle and semantic dependency analysis Pending CN110515926A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910802454.3A CN110515926A (en) 2019-08-28 2019-08-28 Heterogeneous data source mass data carding method based on participle and semantic dependency analysis

Publications (1)

Publication Number Publication Date
CN110515926A true CN110515926A (en) 2019-11-29

Family

ID=68628425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910802454.3A Pending CN110515926A (en) 2019-08-28 2019-08-28 Heterogeneous data source mass data carding method based on participle and semantic dependency analysis

Country Status (1)

Country Link
CN (1) CN110515926A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767325A (en) * 2020-09-03 2020-10-13 国网浙江省电力有限公司营销服务中心 Multi-source data deep fusion method based on deep learning
CN112612840A (en) * 2020-12-29 2021-04-06 清华大学 Heterogeneous data processing method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107193858A (en) * 2017-03-28 2017-09-22 福州金瑞迪软件技术有限公司 Towards the intelligent Service application platform and method of multi-source heterogeneous data fusion
CN107402976A (en) * 2017-07-03 2017-11-28 国网山东省电力公司经济技术研究院 Power grid multi-source data fusion method and system based on multi-element heterogeneous model
CN107958086A (en) * 2017-12-18 2018-04-24 北京睿力科技有限公司 The multi-source heterogeneous database data for solving data semantic Heterogeneity integrates method
CN109635107A (en) * 2018-11-19 2019-04-16 北京亚鸿世纪科技发展有限公司 The method and device of semantic intellectual analysis and the event scenarios reduction of multi-data source
CN110147357A (en) * 2019-05-07 2019-08-20 浙江科技学院 The multi-source data polymerization methods of sampling and system under a kind of environment based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191129