CN112052260B

CN112052260B - Mass virtual-real data comparison method based on multi-process data stream

Info

Publication number: CN112052260B
Application number: CN202011051147.5A
Authority: CN
Inventors: 袁景凌; 肖骅; 罗忆; 李新平; 罗佩; 江春鹏
Original assignee: Wuhan University of Technology WUT
Current assignee: Wuhan University of Technology WUT
Priority date: 2020-09-29
Filing date: 2020-09-29
Publication date: 2024-01-26
Anticipated expiration: 2040-09-29
Also published as: CN112052260A

Abstract

The invention discloses a method for rapidly comparing massive virtual and measured data based on multi-process data streams. First, the virtual data are organized into a standard data file, and the measured data are preprocessed according to the standard data format so that the data lengths are aligned. The virtual data and the measured data are then merged, row by row according to their corresponding unique codes, into a single data file. The file is read in as a data stream and traversed in a single loop, comparing the first half and the second half of each row. Finally, the data stream is partitioned into blocks directly in memory according to the hardware environment, and the blocks are traversed in parallel by multiple threads to produce block comparison result files. By combining multi-process parallel processing with data stream computation, the invention achieves high-speed in-memory comparison of virtual and measured data and avoids the memory limits imposed by very large data volumes, thereby improving the efficiency of mass data comparison in production or engineering projects.

Description

Mass virtual-real data comparison method based on multi-process data stream

Technical Field

The invention belongs to the technical field of data processing, relates to a mass data comparison analysis method, and in particular relates to a mass virtual-real data comparison method based on multi-process data streams.

Background

Mass data comparison analysis is an important means of preprocessing data in scientific research, engineering monitoring and similar fields. In a growing number of scientific and production fields, such as biology, medicine, astronomy, high-energy physics and meteorology, and in industries closely tied to daily life, such as banking, securities, telecommunications, insurance, energy and internet social networks, the volume of data generated each day is growing explosively. Mass data is becoming ever more widespread, but how to process it quickly and effectively remains an important problem in practical scientific research and engineering.

Raw production data in science, engineering, production and other fields is mostly stored and operated on efficiently through cloud computing centers, yet processing it at massive scale remains inefficient. Because subsequent scientific computing or engineering work depends heavily on the results of mass data comparison, this contradiction seriously hinders mass data analysis.

Disclosure of Invention

The invention addresses the problem of fast, real-time comparison of massive virtual and measured data. It combines high-speed data stream operation with multi-process parallel processing to construct a multi-process data stream parallel comparison model. By fusing iterative data stream computation with block parallelization of the merged data rows, it achieves fast in-memory comparison of virtual data and measured data. This improves the efficiency of mass data comparison in production or engineering projects, provides a practical method for mass data comparison in scientific research and engineering monitoring, and supports the smooth progress of subsequent project implementation and research.

The technical scheme adopted by the invention is as follows: the mass virtual-real data comparison method based on the multi-process data stream is characterized by comprising the following steps of:

Step 1: organize the virtual data into a standard data file, preprocess the measured data according to the standard data file format, and align the data lengths. The virtual data are simulation experiment data created and maintained by the researchers themselves; the measured data are implementation data and real research data generated by actual production environments;

Step 2: merge the virtual data and the measured data into one piece of data per row according to the corresponding unique code ID, forming a single data file;

Step 3: read the file in as a data stream, traverse the whole data file in a single loop, and compare the first half and the second half of each piece of data; divide the merged data stream into blocks according to the hardware environment and process the blocks in parallel with multiple threads;

Step 4: partition the data stream into blocks directly in memory according to the hardware environment, and traverse the in-memory blocks in parallel with multiple threads to obtain the block comparison result files.

By combining multi-process parallel processing with data stream computation, the invention achieves efficient interaction and comparison between virtual data and measured data, which can greatly improve the efficiency of mass data comparison in production or engineering projects. With this technique, fast virtual-real comparison is achieved on very large text files containing more than one hundred million lines of virtual and measured data, producing comparison results that are more accurate, complete and reliable than those of traditional comparison methods.

Drawings

Fig. 1 is a flow chart of an embodiment of the present invention.

Detailed Description

To help those of ordinary skill in the art understand and practice the invention, it is described in further detail below with reference to the drawings and embodiments. The embodiments described here serve only to illustrate and explain the invention and are not intended to limit it.

Referring to fig. 1, the method for comparing massive virtual and real data based on multi-process data stream provided by the invention comprises the following steps:

In this embodiment, the measured data are preprocessed according to the virtual data format; the specific implementation includes the following sub-steps:

Step 1.1: integrate historical data, domain expert knowledge and simulation experiment data to form the virtual data, and extract the specified data attributes to form a standard data file. The standard data file format comprises at least: a unique code ID, attribute values (P₁ to Pₙ) and state values (S₁ to Sₙ). The attribute values (P₁ to Pₙ) record and describe the original state and physical meaning of the data, and the state values (S₁ to Sₙ) record changes in the data attributes;

Step 1.2: design a standardization algorithm for the measured data according to the standard data file format: fill each missing attribute value with the attribute value of the virtual data row that has the same ID, fill each missing state value with zero, and align the data length so that every row of the measured data has the same format as the standard data file.
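As an illustration only, the filling rule of step 1.2 can be sketched in Python. The row layout `[ID, P_1..P_n, S_1..S_n]`, the use of `None` to mark a missing value, and the function name `standardize` are assumptions made for this sketch and do not come from the original text.

```python
from typing import List, Optional, Sequence


def standardize(measured: Sequence[Optional[float]],
                virtual: Sequence[float],
                n_attrs: int,
                n_states: int) -> List[float]:
    """Return the measured row in the standard layout [ID, P_1..P_n, S_1..S_n].

    Assumes `virtual` is the complete standard row with the same ID and that
    missing measured values are either absent (short row) or given as None.
    """
    width = 1 + n_attrs + n_states
    # Align the length first: truncate long rows, pad short rows with None.
    row = list(measured[:width]) + [None] * (width - len(measured))
    for i in range(1, width):
        if row[i] is None:
            # Missing attribute: copy it from the virtual row with the same ID.
            # Missing state: fill with zero.
            row[i] = virtual[i] if i <= n_attrs else 0.0
    return row
```

For example, with two attributes and two states, a measured row `[7, 1.5, None, 0.2]` is padded to five fields; the missing attribute is taken from the virtual row and the missing state becomes zero.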

In this embodiment, the specific implementation of step 2 includes the following sub-steps:

Step 2.1: merge the preprocessed measured data with the virtual data row by row. First check the ID of the measured data row to be merged. When the ID of the virtual data row does not match the ID of the measured data row, copy the virtual data row to generate a filled, merged row for that ID; the measured data row is then carried over to be merged with the next virtual data row;

Step 2.2: repeat step 2.1 until the ID of the measured data row to be merged matches the ID of the virtual data row to be merged, apply the algorithm of step 1.2, and merge the two rows. In the end, the IDs of the merged measured data and virtual data correspond one to one, forming the new merged data.
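One possible reading of steps 2.1 and 2.2 is sketched below in Python. It assumes that both inputs are sorted by ID, that every measured ID also occurs in the virtual data, and that each row is a list whose first element is the ID; the name `merge_by_id` is hypothetical.

```python
from typing import Iterable, Iterator, List


def merge_by_id(virtual_rows: Iterable[List[float]],
                measured_rows: Iterable[List[float]]) -> Iterator[List[float]]:
    """Yield one merged row (virtual half + measured half) per virtual ID."""
    measured_iter = iter(measured_rows)
    pending = next(measured_iter, None)  # measured row waiting to be merged
    for v in virtual_rows:
        if pending is not None and pending[0] == v[0]:
            # IDs match: merge the two rows and advance the measured data.
            yield v + pending
            pending = next(measured_iter, None)
        else:
            # No measured row for this ID: fill with a copy of the virtual row.
            # The pending measured row is carried over to the next virtual row.
            yield v + v
```

Because the function is a generator, the merged rows can be written to the single data file one at a time without holding either input in memory.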

In this embodiment, the specific implementation of step 3 includes the following sub-steps:

Step 3.1: read the new merged data file as a data stream and traverse it iteratively, writing out the result for each row as soon as that row has been processed instead of storing the data file in memory; traverse the whole data file in a single loop and compare the first half and the second half of each piece of data;

Step 3.2: design the row-by-row traversal comparison algorithm used in step 3.1, which traverses the whole data file in a single loop and compares the differences within each row of data;

The row-by-row traversal comparison splits each row of the data stream in half, compares the corresponding state values S₁ to Sₙ of the first half and the second half, and computes their absolute differences. The result for each row is written promptly to a new file as an output stream, and an error set file is output at the end.
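The row-by-row traversal comparison can be sketched in Python as follows. The comma separator, the output format `ID,|ΔS_1|,...,|ΔS_n|`, and the choice to write only rows with a nonzero difference are assumptions of this sketch; `compare_stream` is a hypothetical name.

```python
from typing import Iterable, TextIO


def compare_stream(lines: Iterable[str], out: TextIO,
                   n_states: int, sep: str = ",") -> int:
    """Stream merged rows, write differing rows to `out`, return rows written."""
    written = 0
    for line in lines:  # single pass; only the current row is held in memory
        fields = line.strip().split(sep)
        if len(fields) < 2:
            continue  # skip blank lines
        half = len(fields) // 2
        first, second = fields[:half], fields[half:]
        # The state values are assumed to be the last n_states fields of each half.
        diffs = [abs(float(a) - float(b))
                 for a, b in zip(first[-n_states:], second[-n_states:])]
        if any(d > 0 for d in diffs):
            out.write(sep.join([first[0]] + [repr(d) for d in diffs]) + "\n")
            written += 1
    return written
```

Passing an open file object as `lines` iterates over it lazily, so the sketch never loads the whole merged file; `out` can be any writable text stream.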

In this embodiment, the specific implementation of step 4 includes the following sub-steps:

Step 4.1: read the merged file as a data stream, split it by rows into n files of equal size, and start n threads, where n is the number of parallel threads supported by the hardware;

Step 4.2: operate on the n files simultaneously, repeating step 3.2 on each;

Step 4.3: collect the comparison result files produced by the parallel workers and merge the n result files;

Step 4.4: according to the actual scientific research and engineering indices, analyze the differences in the attributes and states for each ID in the merged result file, and identify abnormal data whose difference exceeds the specified index.
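Steps 4.1 to 4.4 can be sketched in Python under simplifying assumptions: the merged rows are already parsed into lists of numbers and held in a list rather than split into n files, `n_states` is at least 1, and the "specified index" is a single threshold on the largest state difference. All function names are hypothetical. The sketch uses threads, as the text does; in CPython a `ProcessPoolExecutor` would be the usual substitute for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Result = Tuple[float, List[float]]  # (ID, absolute state differences)


def compare_block(block: List[List[float]], n_states: int) -> List[Result]:
    """Step 3.2 applied to one block of merged rows."""
    results = []
    for row in block:
        half = len(row) // 2
        diffs = [abs(a - b)
                 for a, b in zip(row[half - n_states:half], row[-n_states:])]
        results.append((row[0], diffs))
    return results


def split_blocks(rows: List[List[float]], n: int) -> List[List[List[float]]]:
    """Step 4.1: split the rows into at most n contiguous blocks of equal size."""
    if not rows:
        return []
    size = -(-len(rows) // n)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def parallel_compare(rows: List[List[float]], n_workers: int,
                     n_states: int, threshold: float) -> List[Result]:
    blocks = split_blocks(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Step 4.2: run the block comparison on all blocks concurrently.
        parts = list(pool.map(compare_block, blocks, [n_states] * len(blocks)))
    # Step 4.3: merge the per-block results (map preserves block order).
    merged = [item for part in parts for item in part]
    # Step 4.4: keep only rows whose largest difference exceeds the threshold.
    return [(rid, d) for rid, d in merged if max(d) > threshold]
```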

This embodiment performs fast virtual-real comparison on very large text files containing billions of lines of virtual and measured data. The data are organized by rows, each with the structure [node number, x original coordinates, x displacement, y displacement, z displacement]. If the three displacement coordinates for the same node number differ between the virtual data and the measured data, the node is output as an error point. Because the engineering environment requires real-time results, the time between collecting the measured data and reporting the errors must be kept as short as possible.
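The error point rule can be sketched in Python as follows, taking the row structure given above literally (five fields, with the three displacements last). The function name `is_error_point` and the optional tolerance `tol` are assumptions of this sketch; the text itself only says the displacements must not differ.

```python
from typing import Sequence


def is_error_point(virtual: Sequence[float], measured: Sequence[float],
                   tol: float = 0.0) -> bool:
    """True if any of the three displacements differs by more than `tol`."""
    if virtual[0] != measured[0]:
        raise ValueError("rows belong to different node numbers")
    # The last three fields are assumed to be the x, y and z displacements.
    return any(abs(v - m) > tol for v, m in zip(virtual[-3:], measured[-3:]))
```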

Experiment 1: conventional serial comparison

Each file exceeds 2 GB (about 5 GB per file), and the content of these very large text files is processed line by line. Reading or writing such a file conventionally requires loading it into memory, and the memory of a single machine is often limited, so buffered reading is used instead. The experimental software and hardware environment was: (1) hardware: Windows 10 operating system, 32 GB of memory, 32-core CPU; (2) software: IntelliJ IDEA 2019.3.1 with a Java 8 development environment. Java was chosen as the implementation language because its toolkit provides many read and write modes, including several buffered ones.

The scheme adopted is row iteration: traverse each line of the file, process it, and then discard it.

This scheme traverses all rows in the file and processes each row without keeping a reference to it, so memory does not overflow. The comparison result is correct, the run takes more than 30 minutes, and the time complexity is O(n²).

Experiment 2: merged data stream comparison

The experimental software and hardware environment was: (1) hardware: Windows 10 operating system, 32 GB of memory, 32-core CPU; (2) software: IntelliJ IDEA 2019.3.1, Java 8 development environment, Spark toolkit.

First, the texts were read by path and stored in RDDs, which divide the text by rows automatically so that the rows of the two RDDs can be compared. However, an RDD must first hold the data in memory and releases it only after reading has finished, so memory overflowed.

To remove this bottleneck, the invention reads the file as a stream with scala.io.Source.fromFile and traverses it line by line with getLines(), dropping the step of storing the data in memory, which resolves the memory overflow. The two files are merged row by row into one file in advance. The merged file is then read as a stream, each row is split in half, the halves are compared and their absolute difference is computed, and the result for each row is written promptly to a new file as an output stream. The final output is a complete error point file.

Because the files are merged beforehand, the full traversal needs only one loop, and the time complexity falls from O(n²) to O(n). The experiment completed without problems in about 6 minutes, roughly 5 times faster than Experiment 1.
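The text does not spell out where the quadratic cost of Experiment 1 comes from. One plausible reading, assumed here, is that without merging, each row of one file triggers a scan of the other file for its matching ID. The Python sketch below only counts loop steps under that assumption; both function names are hypothetical.

```python
from typing import Iterable, Sequence


def nested_lookups(virtual_ids: Sequence[int], measured_ids: Sequence[int]) -> int:
    """Unmerged files: scan the measured IDs from the start for every virtual ID."""
    steps = 0
    for v in virtual_ids:
        for m in measured_ids:
            steps += 1
            if m == v:
                break  # match found, move to the next virtual ID
    return steps


def merged_pass(merged_rows: Iterable[object]) -> int:
    """Merged file: every row is visited exactly once."""
    return sum(1 for _ in merged_rows)
```

For two files of 1000 matching IDs, the nested scan takes 1000 × 1001 / 2 = 500500 steps, while the merged pass takes 1000.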

Experiment 3: block parallel comparison

The merged file from the Spark read-write stream scheme of Experiment 2 was split by rows into 32 files of equal size, 32 threads were started, and the read-write stream procedure of Experiment 2 was run on the 32 files simultaneously. The experimental environment was the same as above, and the code is omitted. The run took 40.9 s, about 9 times faster than Experiment 2.
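The reported speedups can be checked from the stated run times. Writing T_1, T_2 and T_3 for the run times of Experiments 1, 2 and 3, and treating "more than 30 minutes" as a lower bound and "about 6 minutes" as 360 s (both approximations), the ratios are:

```latex
\begin{align*}
\frac{T_1}{T_2} &\gtrsim \frac{30\ \text{min}}{6\ \text{min}} = 5, \\
\frac{T_2}{T_3} &\approx \frac{360\ \text{s}}{40.9\ \text{s}} \approx 8.8, \\
\frac{T_1}{T_3} &\gtrsim \frac{1800\ \text{s}}{40.9\ \text{s}} \approx 44.
\end{align*}
```

The last line is derived from the figures above rather than stated in the text; it suggests an overall speedup of roughly 44 times or more between the serial and block parallel schemes.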

Experiments show that the efficient interaction and comparison of virtual data and actual measurement data are realized through the effective fusion of the multi-process parallel processing and the data stream calculation, so that the mass data comparison efficiency in production or engineering projects is greatly improved.

It should be understood that the foregoing description of the preferred embodiments is not intended to limit the scope of the invention, which is defined by the appended claims, and that those skilled in the art can make substitutions or modifications without departing from the scope of the invention as set forth in the appended claims.

Claims

1. The mass virtual-real data comparison method based on the multi-process data stream is characterized by comprising the following steps of:

2. The method for comparing massive virtual and real data based on multi-process data streams according to claim 1, wherein the preprocessing of the measured data according to the standard data format in step 1 specifically comprises the following sub-steps:

3. The method for comparing massive virtual and real data based on multi-process data streams according to claim 2, wherein the specific implementation of step 2 comprises the following sub-steps:

4. The method for comparing massive virtual and real data based on multi-process data streams according to claim 3, wherein the specific implementation of step 3 comprises the following sub-steps:

the row-by-row traversal comparison splits each row of the read-in data stream in half, compares the corresponding state values S₁ to Sₙ of the first half and the second half, and computes their absolute differences; the result for each row is written promptly to a new file as an output stream, and an error set file is output at the end.

5. The method for comparing massive virtual and real data based on multi-process data streams according to claim 4, wherein the specific implementation of step 4 comprises the following sub-steps: