CN112052260B - Mass virtual-real data comparison method based on multi-process data stream - Google Patents


Info

Publication number
CN112052260B
Authority
CN
China
Prior art keywords
data, virtual, row, file, actual
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011051147.5A
Other languages
Chinese (zh)
Other versions
CN112052260A (en)
Inventor
袁景凌
肖骅
罗忆
李新平
罗佩
江春鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-09-29
Filing date
2020-09-29
Publication date
2024-01-26
Application filed by Wuhan University of Technology WUT
Priority to CN202011051147.5A
Publication of CN112052260A
Application granted
Publication of CN112052260B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24532 Query optimisation of parallel queries
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G06F16/248 Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for rapidly comparing massive virtual and measured data based on multi-process data streams. First, the virtual data are organized into a standard data file, and the measured data are preprocessed to the standard data format so that the data lengths are aligned. The virtual and measured data are then merged, according to their corresponding unique codes, into one row per code, forming a single data file. The file is read as a data stream and traversed in a single loop, and the front half and rear half of each row are compared. Finally, the data stream is partitioned into blocks directly in memory according to the hardware environment, and the blocks are traversed in parallel by multiple threads to obtain block comparison result files. By effectively combining multi-process parallel processing with data stream computation, the invention compares virtual and measured data at high speed in memory and avoids the memory limits imposed by very large data volumes, which improves the efficiency of mass data comparison in production or engineering projects.

Description

Mass virtual-real data comparison method based on multi-process data stream
Technical Field
The invention belongs to the technical field of data processing, relates to a mass data comparison analysis method, and in particular relates to a mass virtual-real data comparison method based on multi-process data streams.
Background
Mass data comparison and analysis is an important means of data preprocessing in fields such as scientific research and engineering monitoring. In a growing number of scientific and production fields, such as biology, medicine, astronomy, high-energy physics, and meteorology, and in industries closely tied to daily life, such as banking, securities, telecommunications, insurance, energy, and Internet social networks, the data generated each day is growing explosively. Mass data is becoming ever more widespread, but how to process it quickly and effectively remains an important problem in practical scientific research and engineering.
Raw production data in science, engineering, production, and other fields are mostly stored and processed in cloud computing centers. Mass data processing there is inefficient, yet subsequent scientific computing and engineering work depends heavily on the results of mass data comparison. This contradiction is a serious obstacle to mass data analysis.
Disclosure of Invention
The invention addresses the problem of fast, real-time comparison of massive virtual and measured data. It combines high-speed data stream operation with multi-process parallel processing to construct a multi-process data stream parallel comparison model. By effectively fusing iterative data stream computation with block parallelization of the merged data rows, it compares virtual data and measured data rapidly in memory. This improves the efficiency of mass data comparison in production or engineering projects, provides a practical method for mass data comparison in scientific research and engineering monitoring, and supports the smooth progress of subsequent project implementation and research.
The technical scheme adopted by the invention is a mass virtual-real data comparison method based on multi-process data streams, comprising the following steps:
Step 1: organize the virtual data into a standard data file, preprocess the measured data according to the standard data file format, and align the data length. The virtual data are simulation experiment data created and maintained by the researchers themselves; the measured data are implementation data and actual research data generated in real production environments;
Step 2: merge the virtual data and the measured data into one row per corresponding unique code ID, forming a single data file;
Step 3: read the file as a data stream, traverse the whole data file in a single loop, and compare the front half and rear half of each row; divide the merged data stream into blocks according to the hardware environment and process them in parallel with multiple threads;
Step 4: partition the data stream into blocks directly in memory according to the hardware environment, and traverse the blocked in-memory data in parallel with multiple threads to obtain block comparison result files.
By effectively fusing multi-process parallel processing with data stream computation, the invention achieves efficient interaction and comparison of virtual data and measured data, which can greatly improve the efficiency of mass data comparison in production or engineering projects. With this technique, fast virtual-real comparison is achieved for very large text files containing more than one hundred million lines of virtual and real data, producing comparison and analysis results that are more accurate, complete, and reliable than those of traditional comparison methods.
Drawings
Fig. 1 is a flow chart of an embodiment of the present invention.
Detailed Description
To help those of ordinary skill in the art understand and practice the invention, it is described in further detail below with reference to the drawings and examples. It should be understood that the examples described herein are for illustration and explanation only and are not intended to limit the invention.
Referring to fig. 1, the method for comparing massive virtual and real data based on multi-process data stream provided by the invention comprises the following steps:
Step 1: organize the virtual data into a standard data file, preprocess the measured data according to the standard data file format, and align the data length. The virtual data are simulation experiment data created and maintained by the researchers themselves; the measured data are implementation data and actual research data generated in real production environments;
In this embodiment, the measured data are preprocessed according to the virtual data format. The specific implementation includes the following sub-steps:
Step 1.1: integrate historical data, domain expert knowledge, and simulation experiment data to form the virtual data, and extract the specified data attributes to form a standard data file. The standard data file format comprises at least: a unique code ID, attribute values ($P_1 \sim P_n$), and state values ($S_1 \sim S_n$). The attribute values ($P_1 \sim P_n$) record and describe the original state and physical meaning of the data, and the state values ($S_1 \sim S_n$) record changes in the data attributes;
Step 1.2: design a standardization algorithm for the measured data according to the standard data file format. Missing attribute values are filled from the attribute values of the virtual data row with the corresponding ID, missing state values are filled with zero, and the data length is aligned so that every row of the measured data has the same format as the standard data file.
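The standardization in step 1.2 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the comma-separated layout `[ID, P1..Pn, S1..Sn]`, the use of empty strings for missing fields, the function name `normalize_measured_row`, and the in-memory lookup table `virtual_by_id` are all assumptions made for the example (a lookup table of this kind would only be practical for small inputs).

```python
from typing import Dict, List


def normalize_measured_row(
    row: List[str],
    virtual_by_id: Dict[str, List[str]],
    n_attrs: int,
    n_states: int,
) -> List[str]:
    """Pad one measured row to the assumed layout [ID, P1..Pn, S1..Sn]."""
    width = 1 + n_attrs + n_states
    # Align the data length: pad short rows with empty fields, cut extras.
    fields = (row + [""] * width)[:width]
    reference = virtual_by_id.get(fields[0])
    for i in range(1, 1 + n_attrs):
        # Missing attribute value: copy it from the virtual row with the same ID.
        if fields[i] == "" and reference is not None:
            fields[i] = reference[i]
    for i in range(1 + n_attrs, width):
        # Missing state value: fill with zero.
        if fields[i] == "":
            fields[i] = "0"
    return fields


# Hypothetical data: one virtual row with two attributes and two state values.
virtual = {"7": ["7", "1.5", "2.5", "0.10", "0.20"]}
print(normalize_measured_row(["7", "", "2.5", "0.12"], virtual, n_attrs=2, n_states=2))
```

In this example the missing first attribute is taken from the virtual row and the missing second state value becomes `0`, so the measured row ends up with the same five fields as the standard row.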
Step 2: merge the virtual data and the measured data into one row per corresponding unique code ID, forming a single data file;
In this embodiment, the specific implementation of step 2 includes the following sub-steps:
Step 2.1: merge the preprocessed measured data and the virtual data row by row. First check the ID of the measured data row to be merged. When the virtual data row ID does not correspond to the measured data row ID, copy the virtual data row to generate a padding row with the corresponding ID in the merged format; the measured data row is then carried over to be merged with the next virtual data row;
Step 2.2: repeat step 2.1 until the ID of the measured data row to be merged corresponds to the ID of the virtual data row to be merged, execute the algorithm of step 1.2, and merge the two rows, so that the IDs of the merged measured data and virtual data finally correspond one to one and form the new merged data.
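One way to read steps 2.1 and 2.2 is as a single forward pass over two ID-ordered inputs. The Python sketch below illustrates that reading under stated assumptions: both inputs are sorted by ID, every measured ID also occurs in the virtual data, and a missing measured row is padded with a verbatim copy of the virtual row. The function name `merge_rows` is hypothetical.

```python
from typing import Iterable, Iterator, List


def merge_rows(
    virtual_rows: Iterable[List[str]],
    measured_rows: Iterable[List[str]],
) -> Iterator[List[str]]:
    """Yield one merged row (virtual half + measured half) per virtual ID."""
    measured_iter = iter(measured_rows)
    pending = next(measured_iter, None)  # measured row waiting to be merged
    for virtual in virtual_rows:
        if pending is not None and pending[0] == virtual[0]:
            # IDs correspond: merge, then advance to the next measured row.
            yield virtual + pending
            pending = next(measured_iter, None)
        else:
            # IDs differ: pad with a copy of the virtual row; the pending
            # measured row is carried over to the next virtual row.
            yield virtual + list(virtual)


# Hypothetical rows of the form [ID, state value]; ID 2 has no measurement.
virtual = [["1", "0.1"], ["2", "0.2"], ["3", "0.3"]]
measured = [["1", "0.1"], ["3", "0.5"]]
for merged in merge_rows(virtual, measured):
    print(",".join(merged))
```

Because the function is a generator, merged rows can be written out one at a time without holding either input in memory.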
Step 3: read the file as a data stream, traverse the whole data file in a single loop, and compare the front half and rear half of each row; divide the merged data stream into blocks according to the hardware environment and process them in parallel with multiple threads;
In this embodiment, the specific implementation of step 3 includes the following sub-steps:
Step 3.1: read the new merged data file as a data stream and traverse it iteratively, writing out each row as soon as it has been processed, so that the file is never stored in memory as a whole. The entire data file is traversed in a single loop, and the front half and rear half of each row are compared;
Step 3.2: for step 3.1, a row-by-row traversal comparison algorithm is designed that traverses the entire data file in a single loop and compares the differences within each row;
The row-by-row traversal comparison splits each row of the data stream in half, compares the corresponding state values $S_1 \sim S_n$ of the first half and the second half, and computes their absolute differences. The result for each row is written to a new file as a stream as soon as it is computed, and an error set file is finally output.
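The row-by-row streaming comparison described above can be sketched in Python as follows. It is an illustration under assumptions, not the patent's code: rows are comma-separated, each half has the layout `[ID, P1..Pn, S1..Sn]`, and only rows with at least one non-zero state difference are written to the error file. The function name `compare_stream` and the file names are hypothetical.

```python
import os
import tempfile


def compare_stream(merged_path: str, error_path: str, n_attrs: int, n_states: int) -> int:
    """Stream merged rows and write those whose state values differ; return the count."""
    half = 1 + n_attrs + n_states  # fields in each half of a merged row
    first_state = 1 + n_attrs      # index of S1 inside a half
    errors = 0
    with open(merged_path, encoding="utf-8") as src, \
            open(error_path, "w", encoding="utf-8") as dst:
        for line in src:  # the file object yields one line at a time
            fields = line.rstrip("\n").split(",")
            front, back = fields[:half], fields[half:]
            diffs = [abs(float(a) - float(b))
                     for a, b in zip(front[first_state:], back[first_state:])]
            if any(d > 0 for d in diffs):
                # Write the result immediately instead of collecting it in memory.
                dst.write(front[0] + "," + ",".join(f"{d:g}" for d in diffs) + "\n")
                errors += 1
    return errors


with tempfile.TemporaryDirectory() as tmp:
    merged_file = os.path.join(tmp, "merged.csv")
    error_file = os.path.join(tmp, "errors.csv")
    with open(merged_file, "w", encoding="utf-8") as f:
        f.write("1,10,1.0,1,10,1.0\n2,20,2.0,2,20,2.5\n")
    count = compare_stream(merged_file, error_file, n_attrs=1, n_states=1)
    with open(error_file, encoding="utf-8") as f:
        report = f.read()
print(count, repr(report))
```

Memory use stays bounded by the length of a single row, because neither the input nor the output is ever held as a whole.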
Step 4: partition the data stream into blocks directly in memory according to the hardware environment, and traverse the blocked in-memory data in parallel with multiple threads to obtain block comparison result files.
In this embodiment, the specific implementation of step 4 includes the following sub-steps:
Step 4.1: read the merged file as a data stream, split it by row into n files of equal size, and start n threads, where n is the number of parallel programs supported by the hardware;
Step 4.2: operate on the n files simultaneously, repeating step 3.2 on each;
Step 4.3: obtain the comparison result files of the multiple processes and merge the n result files;
Step 4.4: according to the actual research and engineering indices, analyze the differences in the attributes and states corresponding to each ID in the merged result file, and identify abnormal data whose difference exceeds the specified index.
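Steps 4.1 to 4.4 can be sketched in Python as below, working on an in-memory list of merged rows for brevity. The sketch makes several assumptions: `ThreadPoolExecutor` stands in for the n threads (in CPython, threads do not give CPU parallelism for this kind of work, so `ProcessPoolExecutor` or a JVM thread pool would be closer to the patent's setting), the largest absolute state difference is used as the per-row result, and the names `compare_block` and `compare_in_blocks` and the threshold value are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def compare_block(lines: List[str], half: int, first_state: int) -> List[Tuple[str, float]]:
    """Return (ID, largest absolute state difference) for each row of a block."""
    results = []
    for line in lines:
        fields = line.strip().split(",")
        front, back = fields[:half], fields[half:]
        diffs = [abs(float(a) - float(b))
                 for a, b in zip(front[first_state:], back[first_state:])]
        results.append((front[0], max(diffs, default=0.0)))
    return results


def compare_in_blocks(lines, n_workers, half, first_state, threshold):
    """Split rows into blocks, compare blocks concurrently, merge, then filter."""
    size = max(1, -(-len(lines) // n_workers))  # ceiling division
    blocks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() preserves block order, so the merged result keeps row order.
        per_block = list(pool.map(lambda b: compare_block(b, half, first_state), blocks))
    merged = [item for block in per_block for item in block]
    # Step 4.4: keep only rows whose difference exceeds the specified index.
    return [(row_id, diff) for row_id, diff in merged if diff > threshold]


rows = ["1,1.0,1,1.0", "2,2.0,2,2.5", "3,3.0,3,3.25", "4,4.0,4,4.0"]
abnormal = compare_in_blocks(rows, n_workers=2, half=2, first_state=1, threshold=0.3)
print(abnormal)
```

In practice `n_workers` would be chosen from the hardware, for example with `os.cpu_count()`, which matches the description of n as the degree of parallelism the hardware supports.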
This embodiment performs fast virtual-real comparison on very large text files containing billions of lines of virtual data and real data. The data are organized by row, and each row has the structure [node number, x original coordinate, x displacement, y displacement, z displacement]. If the three displacement coordinates for the same node number differ between the virtual data and the real data, the node is output as an error point. Because the engineering environment requires real-time results, the time between collecting the measured data and finding the errors must be kept as short as possible.
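For the row structure given above, the error-point check on one merged row might look like the following Python sketch. It assumes comma-separated fields, a merged row made of a five-field virtual half followed by a five-field real half, and that a node counts as an error point when any of its three displacements differs; the function name `check_node` is hypothetical.

```python
from typing import Optional, Tuple


def check_node(merged_line: str) -> Optional[Tuple[str, float, float, float]]:
    """Return (node number, dx, dy, dz) if any displacement differs, else None."""
    fields = merged_line.strip().split(",")
    virtual, real = fields[:5], fields[5:]
    # Fields 2..4 of each half hold the x, y, and z displacements.
    deltas = tuple(abs(float(v) - float(r)) for v, r in zip(virtual[2:], real[2:]))
    if any(d != 0.0 for d in deltas):
        return (virtual[0],) + deltas
    return None


# Hypothetical merged rows for node 101: only the y displacement differs.
print(check_node("101,12.0,0.5,0.25,0.0,101,12.0,0.5,0.75,0.0"))
print(check_node("101,12.0,0.5,0.25,0.0,101,12.0,0.5,0.25,0.0"))
```

With real sensor data, an exact inequality test on floating-point values is usually too strict, so a tolerance would normally replace `d != 0.0`.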
Experiment 1: conventional serial comparison
The files exceed 2 GB (about 5 GB for a single file), and the content of these very large text files is processed one line at a time. Reading and writing such a file in the usual way would require loading it into memory, and the memory of a single machine is limited, so the files are read through a buffer instead. The experimental software and hardware environment comprises: (1) hardware: Windows 10 operating system, 32 GB memory, 32-core CPU; (2) software: IntelliJ IDEA 2019.3.1, Java 8 development environment. Java was chosen because its toolkit provides several read/write modes, including several buffered ones.
The scheme used here is row iteration: each line of the file is traversed, processed, and then discarded.
This scheme traverses all rows in the file and processes each row without keeping a reference to it, so memory does not overflow. The comparison result was correct, the run took more than 30 minutes, and the time complexity is $O(n^2)$.
Experiment 2: merged data stream comparison
The experimental software and hardware environment comprises: (1) hardware: Windows 10 operating system, 32 GB memory, 32-core CPU; (2) software: IntelliJ IDEA 2019.3.1, Java 8 development environment, Spark toolkit.
First, the text files are read by path and stored in RDDs, which split them automatically by row so that the rows of the two RDDs can be compared. However, an RDD must first be held in memory and is released only after reading finishes, so memory overflow occurred.
To remove this bottleneck, the invention reads the file as a stream with scala.io.Source.fromFile and traverses it line by line with getLines(), abandoning the step of storing the data in memory, which effectively solves the memory overflow problem. The invention first merges the two files row by row into one file, then traverses the file as a read stream, splits each row in half, and compares the halves and computes their absolute differences. The result for each row is written to a new file as a write stream as soon as it is computed, and a complete file of all error points is finally output.
Because the files are merged beforehand, the full traversal needs only one loop, and the time complexity is reduced from $O(n^2)$ to $O(n)$. The experiment completed successfully in about 6 minutes, about 5 times faster than Experiment 1.
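The source does not spell out where the quadratic cost in Experiment 1 comes from. One plausible reading, taken here as an assumption, is that without pre-merging each measured row has to be matched against the virtual rows by ID, which leads to a nested scan. The Python sketch below simply counts the work in both cases; the function names are hypothetical.

```python
def nested_scan_checks(virtual_ids, measured_ids):
    """Count ID checks when every measured row rescans the virtual rows."""
    checks = 0
    for m in measured_ids:
        for v in virtual_ids:
            checks += 1
            if v == m:
                break  # found the matching virtual row
    return checks


def single_pass_visits(merged_rows):
    """Count row visits when pre-merged rows are compared in one loop."""
    return sum(1 for _ in merged_rows)


ids = list(range(1000))
nested = nested_scan_checks(ids, ids)
single = single_pass_visits(zip(ids, ids))
print(nested, single)  # 500500 1000
```

For 1000 rows the nested scan performs 1000 × 1001 / 2 = 500500 checks, while the merged file needs 1000 visits, which is the difference between quadratic and linear growth.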
Experiment 3: block-parallel comparison
The merged file from the Spark read/write-stream scheme of Experiment 2 was split by row into 32 files of equal size, 32 threads were started, and the Experiment 2 procedure was run on the 32 files simultaneously. The experimental environment was the same as above, and the code is omitted. The run took 40.9 s, about 9 times faster than Experiment 2.
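The speed-up figures reported for the three experiments can be checked with a few lines of Python arithmetic. The only assumption is that the "more than 30 minutes" of Experiment 1 is taken as exactly 30 minutes, so ratios involving it are lower bounds.

```python
serial_s = 30 * 60   # Experiment 1: more than 30 minutes, used as a lower bound
merged_s = 6 * 60    # Experiment 2: about 6 minutes
parallel_s = 40.9    # Experiment 3: 40.9 s with 32 threads

speedup_1_to_2 = round(serial_s / merged_s, 1)
speedup_2_to_3 = round(merged_s / parallel_s, 1)
speedup_1_to_3 = round(serial_s / parallel_s, 1)
efficiency = round(merged_s / parallel_s / 32, 2)  # speed-up per thread

print(speedup_1_to_2, speedup_2_to_3, speedup_1_to_3, efficiency)
```

The ratios come out at 5.0, 8.8, and 44.0, matching the reported "about 5 times" and "about 9 times". The per-thread efficiency of roughly 0.28 shows that the gain from 32 threads is well below linear, which is common when file input and output dominate the run time.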
Experiments show that the efficient interaction and comparison of virtual data and actual measurement data are realized through the effective fusion of the multi-process parallel processing and the data stream calculation, so that the mass data comparison efficiency in production or engineering projects is greatly improved.
It should be understood that the foregoing description of the preferred embodiments is not intended to limit the scope of the invention, which is defined by the appended claims. Those skilled in the art can make substitutions or modifications without departing from the scope of the invention as set forth in the appended claims.

Claims (5)

1. A mass virtual-real data comparison method based on multi-process data streams, characterized by comprising the following steps:
Step 1: organize the virtual data into a standard data file, preprocess the measured data according to the standard data file format, and align the data length. The virtual data are simulation experiment data created and maintained by the researchers themselves; the measured data are implementation data and actual research data generated in real production environments;
Step 2: merge the virtual data and the measured data into one row per corresponding unique code ID, forming a single data file;
Step 3: read the file as a data stream, traverse the whole data file in a single loop, and compare the front half and rear half of each row; divide the merged data stream into blocks according to the hardware environment and process them in parallel with multiple threads;
Step 4: partition the data stream into blocks directly in memory according to the hardware environment, and traverse the blocked in-memory data in parallel with multiple threads to obtain block comparison result files.
2. The mass virtual-real data comparison method based on multi-process data streams according to claim 1, wherein the preprocessing of the measured data according to the standard data format in step 1 specifically comprises the following sub-steps:
Step 1.1: integrate historical data, domain expert knowledge, and simulation experiment data to form the virtual data, and extract the specified data attributes to form a standard data file. The standard data file format comprises at least: a unique code ID, attribute values ($P_1 \sim P_n$), and state values ($S_1 \sim S_n$). The attribute values ($P_1 \sim P_n$) record and describe the original state and physical meaning of the data, and the state values ($S_1 \sim S_n$) record changes in the data attributes;
Step 1.2: design a standardization algorithm for the measured data according to the standard data file format. Missing attribute values are filled from the attribute values of the virtual data row with the corresponding ID, missing state values are filled with zero, and the data length is aligned so that every row of the measured data has the same format as the standard data file.
3. The mass virtual-real data comparison method based on multi-process data streams according to claim 2, wherein the specific implementation of step 2 comprises the following sub-steps:
Step 2.1: merge the preprocessed measured data and the virtual data row by row. First check the ID of the measured data row to be merged. When the virtual data row ID does not correspond to the measured data row ID, copy the virtual data row to generate a padding row with the corresponding ID in the merged format; the measured data row is then carried over to be merged with the next virtual data row;
Step 2.2: repeat step 2.1 until the ID of the measured data row to be merged corresponds to the ID of the virtual data row to be merged, execute the algorithm of step 1.2, and merge the two rows, so that the IDs of the merged measured data and virtual data finally correspond one to one and form the new merged data.
4. The mass virtual-real data comparison method based on multi-process data streams according to claim 3, wherein the specific implementation of step 3 comprises the following sub-steps:
Step 3.1: read the new merged data file as a data stream and traverse it iteratively, writing out each row as soon as it has been processed, so that the file is never stored in memory as a whole. The entire data file is traversed in a single loop, and the front half and rear half of each row are compared;
Step 3.2: for step 3.1, a row-by-row traversal comparison algorithm is designed that traverses the entire data file in a single loop and compares the differences within each row;
The row-by-row traversal comparison splits each row of the data stream in half, compares the corresponding state values $S_1 \sim S_n$ of the first half and the second half, and computes their absolute differences. The result for each row is written to a new file as a stream as soon as it is computed, and an error set file is finally output.
5. The mass virtual-real data comparison method based on multi-process data streams according to claim 4, wherein the specific implementation of step 4 comprises the following sub-steps:
Step 4.1: read the merged file as a data stream, split it by row into n files of equal size, and start n threads, where n is the number of parallel programs supported by the hardware;
Step 4.2: operate on the n files simultaneously, repeating step 3.2 on each;
Step 4.3: obtain the comparison result files of the multiple processes and merge the n result files;
Step 4.4: according to the actual research and engineering indices, analyze the differences in the attributes and states corresponding to each ID in the merged result file, and identify abnormal data whose difference exceeds the specified index.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011051147.5A (CN112052260B, en) | 2020-09-29 | 2020-09-29 | Mass virtual-real data comparison method based on multi-process data stream

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202011051147.5A (CN112052260B, en) | 2020-09-29 | 2020-09-29 | Mass virtual-real data comparison method based on multi-process data stream

Publications (2)

Publication Number | Publication Date
CN112052260A (en) | 2020-12-08
CN112052260B (en) | 2024-01-26

Family

ID=73605894

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202011051147.5A (Active, CN112052260B, en) | Mass virtual-real data comparison method based on multi-process data stream | 2020-09-29 | 2020-09-29

Country Status (1)

Country | Link
CN (1) | CN112052260B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113326292B (en) * | 2021-06-25 | 2024-06-07 | 深圳前海微众银行股份有限公司 | Data stream merging method, device, equipment and computer storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN101510203A (en) * | 2009-02-25 | 2009-08-19 | 南京联创科技股份有限公司 | Big data quantity high performance processing implementing method based on parallel process of split mechanism
CN103914868A (en) * | 2013-12-20 | 2014-07-09 | 柳州腾龙煤电科技股份有限公司 | Method for mass model data dynamic scheduling and real-time asynchronous loading under virtual reality
CN107679104A (en) * | 2017-09-12 | 2018-02-09 | 杭州美创科技有限公司 | Big surface low formula parallel high-speed data comparison method
CN110070911A (en) * | 2019-04-12 | 2019-07-30 | 内蒙古农业大学 | A kind of parallel comparison method of gene order based on Hadoop

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US7076319B2 (en) * | 2003-12-11 | 2006-07-11 | Taiwan Semiconductor Manufacturing Co., Ltd. | Method and database structure for managing technology files for semiconductor manufacturing operations

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Engineering database design for an intelligent monitoring *** of large space grid structures; 钟珞, 肖诗轶, 袁景凌, 瞿伟廉; 微机发展 (Microcomputer Development), No. 01; pp. 9-11 *

Also Published As

Publication number Publication date
CN112052260A (en) 2020-12-08

Similar Documents

Publication Publication Date Title
CN102508880B (en) Method for joining files and method for splitting files
Gligorov et al. Efficient, reliable and fast high-level triggering using a bonsai boosted decision tree
CN104504105B (en) A kind of storage method of real-time data base
CN107609350A (en) A kind of data processing method of two generations sequencing data analysis platform
US7827179B2 (en) Data clustering system, data clustering method, and data clustering program
US11841839B1 (en) Preprocessing and imputing method for structural data
US11941534B2 (en) Genome sequence alignment system and method
CN110389950B (en) Rapid running big data cleaning method
CN103440246A (en) Intermediate result data sequencing method and system for MapReduce
CN112052260B (en) Mass virtual-real data comparison method based on multi-process data stream
CN107025175A (en) A kind of fuzz testing seed use-case variable-length field pruning method
CN112148359B (en) Distributed code clone detection and search method, system and medium based on subblock filtering
CN115185818A (en) Program dependence cluster detection method based on binary set
Ilić et al. A comparative analysis of smart metering data aggregation performance
CN106802787B (en) MapReduce optimization method based on GPU sequence
CN113139712B (en) Machine learning-based extraction method for incomplete rules of activity attributes of process logs
Yuan et al. Decision tree algorithm optimization research based on MapReduce
CN116204542B (en) Quick reading and writing processing method for database
Duvignau et al. Piecewise linear approximation in data streaming: Algorithmic implementations and experimental analysis
CN114420210B (en) Rapid trimming method and system for biological sequencing sequence
CN106528916A (en) Collaboration system for adaptive subspace iterative segmentation applied to fusion reactor nuclear analysis
CN115221045A (en) Multi-target software defect prediction method based on multi-task and multi-view learning
CN114677052A (en) Natural gas load fluctuation asymmetry analysis method and system based on TARCH model
US20210081424A1 (en) Joiner for distributed databases
CN107391560B (en) Method and device for constructing variance optimization histogram

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant