CN110647531A - Data synchronization method, device, equipment and computer readable storage medium - Google Patents

Data synchronization method, device, equipment and computer readable storage medium

Info

Publication number
CN110647531A
Authority
CN
China
Prior art keywords
data
task
task data
check
check code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910751919.7A
Other languages
Chinese (zh)
Inventor
许浩奇
刘广平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Property and Casualty Insurance Company of China Ltd
Original Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Property and Casualty Insurance Company of China Ltd
Priority to CN201910751919.7A
Publication of CN110647531A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of big data processing and discloses a data synchronization method. Incremental task data in a task data table to be synchronized is determined from the task data table and a second check code of its task data, together with the historical data and a first check code of the task data in the historical data, so that only the incremental task data needs to be synchronized. The invention also provides a data synchronization device, equipment and a computer-readable storage medium. This synchronization mode reduces the amount of data handled during synchronization, shortens the synchronization time, lowers the resource occupancy of the equipment during synchronization, and improves the resource utilization of the equipment.

Description

Data synchronization method, device, equipment and computer readable storage medium
Technical Field
The present invention relates to the field of big data processing technologies, and in particular, to a data synchronization method, apparatus, device, and computer-readable storage medium.
Background
Hive is a Hadoop-based data warehouse tool for processing structured data in Hadoop. To some extent it can be regarded as a user programming interface (API): it does not store or process data itself, but relies on HDFS to store data and on MapReduce (MR) to process it. It provides a set of tools for extracting, transforming and loading data, and the Hive query language is used to process structured data through MapReduce. Hive is not a real-time query language.
In actual development, Hive is often used for data cleaning, storage and analysis. For example, data is processed on Hive and then exported to a real-time query engine such as HBase or Elasticsearch, so that upper-layer applications can query it in real time. A Hive task runs with a certain validity period, and data is updated and synchronized after that period expires. Most existing synchronization schemes use full synchronization: on every update, the historical data and the new data must be exported and updated together. As the data volume keeps growing, the resources and memory consumed by each update grow as well, and the tasks occupy cluster resources for a long time, which wastes resources. Moreover, because this update mode generally has high latency, repeated cumulative updates can cause a full export to run past the validity period. Since the Hive task refreshes its data as soon as the validity period expires, data that has not yet been fully exported is overwritten by the refresh, which leads to data loss.
Disclosure of Invention
The main object of the invention is to provide a data synchronization method, a data synchronization device, data synchronization equipment and a computer-readable storage medium, so as to solve the technical problem that, in existing data synchronization modes, each synchronization transfers a large amount of repeated data and therefore takes too long.
In order to achieve the above object, the present invention provides a data synchronization method applied to a data storage device, the data synchronization method comprising the steps of:
detecting whether the data validity of the data storage device meets an updating condition, wherein the updating condition is that the data validity reaches an expiration time;
if the data validity meets the updating condition, acquiring a task data table to be synchronized to the data storage device at the current time and a first check code of task data in historical data, wherein the task data table at least comprises the new data and the historical data which are changed currently, the historical data is the task data stored in the data storage device currently, the first check code comprises N first sub-check codes, the first sub-check codes are calculated based on at least one row of data or at least one task data in the historical data, and N is greater than or equal to 1;
calculating a second check code of the task data in the task data table by using a calculation algorithm which is the same as that used for calculating the first check code, wherein the second check code comprises M second sub check codes, the second sub check codes are calculated based on at least one row of data or at least one task data in the task data table, and M is greater than or equal to N;
comparing whether the second check code is consistent with the first check code, so as to judge whether the task data currently stored in the data storage device is abnormal;
if an abnormality exists, screening the abnormal task data from the task data table, retaining the historical data, and synchronizing the abnormal task data into the data storage device, wherein the abnormal task data is incremental task data;
and if no abnormality exists, replacing the first check code with the second check code, and deleting the historical data and the first check code in the task data table.
In other embodiments, the calculation algorithm includes a Hamming code check algorithm and a hash algorithm.
In other embodiments, the step of calculating the second check code of the task data in the task data table by using the same calculation algorithm as that used for calculating the first check code includes:
if the calculation algorithm is a hash algorithm, calculating an overall check value of the task data in the task data table according to the hash algorithm;
extracting each row of data or each task data in the task data table, and calculating a check code of each row of data or each task data according to the hash algorithm to obtain the second sub-check codes, wherein each second sub-check code is an MD (message-digest) code;
and storing all the MD codes separately in a preset check code table in the task data table.
In some other embodiments, the step of comparing whether the second check code is consistent with the first check code to determine whether the task data currently stored in the data storage device has an exception includes:
extracting a first check value and the first sub-check codes from the first check code, wherein the first check value is an overall check value of the historical data;
comparing whether the overall check value is the same as the first check value;
if the overall check value is the same as the first check value, determining that no incremental task data exists in the task data table;
and if the overall check value is different from the first check value, determining that incremental task data exists in the task data table.
In other embodiments, after the step of determining that the incremental task data exists in the task data table, the method further includes:
comparing each MD code in the second check code with the corresponding first sub-check code in the first check code one by one, and screening out, based on the comparison result, the MD codes in the second check code that differ.
In other embodiments, the step of screening out abnormal task data from the task data table, retaining the historical data, and synchronizing the abnormal task data to the data storage device includes:
extracting the corresponding incremental task data from the task data table according to the screened MD codes that differ from the first sub-check codes, inserting the incremental task data into a preset incremental data table, setting an attribute of the inserted incremental task data as a distinguishing field, and recording the insertion time of the distinguishing field;
and retaining the task data currently stored in the data storage device, determining, in the incremental data table, the incremental task data whose insertion time differs least from the current time, and synchronizing that incremental task data to the data storage device.
In other embodiments, after the steps of screening out abnormal task data from the task data table, retaining the historical data, and synchronizing the abnormal task data to the data storage device, the method further comprises:
calculating the time deviation between the insertion time of each other field in the incremental data table and the insertion time of the distinguishing field;
judging whether the time deviation is greater than a preset time difference;
and if the time deviation is greater than the preset time difference, deleting from the incremental data table the incremental task data corresponding to all fields whose time deviation is greater than the preset time difference.
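As a non-limiting illustration, the staging of incremental task data in an incremental data table and the removal of stale entries described above can be sketched as follows. The names IncrementRow, latest_batch and prune_stale, the use of seconds as the time unit, and the tolerance values are assumptions made for this sketch and do not appear in the original disclosure.

```python
from dataclasses import dataclass


@dataclass
class IncrementRow:
    key: str            # hypothetical row identifier
    payload: str        # the incremental task data itself
    inserted_at: float  # recorded insertion time, in seconds


def latest_batch(rows, now, tolerance):
    """Select rows whose insertion time is within `tolerance` of the current time."""
    return [r for r in rows if abs(now - r.inserted_at) <= tolerance]


def prune_stale(rows, reference_time, max_offset):
    """Keep only rows whose deviation from the reference insertion time
    does not exceed the preset time difference `max_offset`."""
    return [r for r in rows if abs(reference_time - r.inserted_at) <= max_offset]


rows = [
    IncrementRow("t1", "old change", inserted_at=1000.0),
    IncrementRow("t2", "new change", inserted_at=5000.0),
    IncrementRow("t3", "another new change", inserted_at=5001.0),
]

# Rows to synchronize now: those inserted close to the current time.
to_sync = latest_batch(rows, now=5002.0, tolerance=10.0)

# Rows left in the incremental data table after stale entries are deleted.
remaining = prune_stale(rows, reference_time=5001.0, max_offset=3600.0)
```

In this sketch the row inserted at time 1000.0 is neither synchronized nor retained, because its insertion time deviates from the reference by more than the assumed preset time difference.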
In addition, to achieve the above object, the present invention also provides a data synchronization apparatus, including:
the detection module is used for detecting whether the data validity of the data storage equipment meets an updating condition, wherein the updating condition is that the data validity reaches the expiration time;
the data acquisition module is used for acquiring a task data table to be synchronized to the data storage device at the current time and a first check code of task data in historical data when the data validity meets an updating condition, wherein the task data table at least comprises new data and historical data which change currently, the historical data is the task data stored in the data storage device currently, the first check code comprises N first sub-check codes, the first sub-check codes are obtained by calculation based on at least one row of data or at least one task data in the historical data, and N is greater than or equal to 1;
the calculation module is used for calculating a second check code of the task data in the task data table by using a calculation algorithm which is the same as that used for calculating the first check code, wherein the second check code comprises M second sub check codes, the second sub check codes are calculated based on at least one row of data or at least one task data in the task data table, and M is greater than or equal to N;
the comparison module is used for comparing whether the second check code is consistent with the first check code, so as to judge whether the task data currently stored in the data storage device is abnormal;
the synchronization module is used for screening abnormal task data from the task data table when an abnormality exists, retaining the historical data and synchronizing the abnormal task data into the data storage device, wherein the abnormal task data is incremental task data; and for replacing the first check code with the second check code and deleting the historical data and the first check code in the task data table when no abnormality exists.
Optionally, the calculation algorithm includes a Hamming code check algorithm and a hash algorithm.
Optionally, the calculation module includes an operation unit and a storage unit, wherein,
when the calculation algorithm is a hash algorithm, the operation unit is used for calculating an overall check value of the task data in the task data table according to the hash algorithm; extracting each row of data or each task data in the task data table, and calculating a check code of each row of data or each task data according to the hash algorithm to obtain the second sub-check codes, wherein each second sub-check code is an MD code;
the storage unit is used for storing all the MD codes into a preset check code table in the task data table.
Optionally, the comparing module includes an extracting unit and a comparing unit, wherein,
the extraction unit is used for extracting a first check value and the first sub-check codes from the first check code, wherein the first check value is an overall check value of the historical data;
the comparison unit is used for comparing whether the overall check value is the same as the first check value;
if the overall check value is the same as the first check value, determining that no incremental task data exists in the task data table;
and if the overall check value is different from the first check value, determining that incremental task data exists in the task data table.
Optionally, the comparison module further includes a screening unit, configured to compare each MD code in the second check code with the corresponding first sub-check code in the first check code one by one, and to screen out, based on the comparison result, the MD codes in the second check code that differ.
Optionally, the synchronization module is configured to extract the corresponding incremental task data from the task data table according to the screened MD codes that differ from the first sub-check codes, insert the incremental task data into a preset incremental data table, set an attribute of the inserted incremental task data as a distinguishing field, and record the insertion time of the distinguishing field; and to retain the task data currently stored in the data storage device, determine, in the incremental data table, the incremental task data whose insertion time differs least from the current time, and synchronize that incremental task data into the data storage device.
Optionally, the calculation module is further configured to calculate the time deviation between the insertion time of each other field in the incremental data table and the insertion time of the distinguishing field;
the comparison module is further used for judging whether the time deviation is greater than a preset time difference;
and the synchronization module is further used for deleting, from the incremental data table, the incremental task data corresponding to all fields whose time deviation is greater than the preset time difference.
In addition, to achieve the above object, the present invention also provides data synchronization equipment, including: a memory, a processor and a data synchronization program stored on the memory and executable on the processor, the data synchronization program, when executed by the processor, implementing the steps of the data synchronization method according to any one of the above.
Furthermore, to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a data synchronization program which, when executed by a processor, implements the steps of the data synchronization method according to any one of the above.
In the present invention, the incremental task data in the task data table to be synchronized is determined from the task data table to be synchronized and the second check code of its task data, together with the historical data and the first check code of the task data in the historical data, so that only the incremental task data needs to be synchronized. This reduces the amount of data handled during synchronization, shortens the synchronization time, lowers the resource occupancy of the equipment during synchronization, and improves the resource utilization of the equipment.
Drawings
FIG. 1 is a schematic structural diagram of an operating environment of a data synchronization system according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of another data synchronization system according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a first embodiment of a data synchronization method according to the present invention;
FIG. 4 is a flowchart illustrating a data synchronization method according to a second embodiment of the present invention;
FIG. 5 is a functional block diagram of a data synchronization apparatus according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a data synchronization device.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an operating environment of a data synchronization device according to an embodiment of the present invention.
As shown in fig. 1, the data synchronization apparatus includes: a processor 101, e.g. a CPU, a communication bus 102, a user interface 103, a network interface 104, a memory 105. Wherein the communication bus 102 is used for enabling connection communication between these components. The user interface 103 may comprise a Display screen (Display), an input unit such as a Keyboard (Keyboard), and the network interface 104 may optionally comprise a standard wired interface, a wireless interface (e.g. WI-FI interface). The memory 105 may be a high-speed RAM memory or a non-volatile memory (e.g., a disk memory). The memory 105 may alternatively be a storage device separate from the processor 101 described above.
Those skilled in the art will appreciate that the hardware configuration of the data synchronization system shown in fig. 1 does not constitute a limitation of the data synchronization system, and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
As shown in fig. 1, the memory 105, which is a kind of computer-readable storage medium, may include therein an operating system, a network communication module, a user interface module, and a data synchronization program for implementing data synchronization. The operating system is a program for managing and controlling the data synchronization system and software resource calls in the system and the memory, and supports the operation of the data synchronization program and other software and/or programs.
In the hardware configuration of the data synchronization system shown in fig. 1, the network interface 104 is mainly used for accessing a network; the user interface 103 is mainly used to detect whether a data synchronization operation is required in the system or monitor whether data in the system has information such as update, exception, etc., and the processor 101 may be used to call a data synchronization program stored in the memory 105 and perform the following operations of various embodiments of the data synchronization method.
In the embodiment of the present invention, the implementation shown in fig. 1 may also be a networking system composed of a mobile terminal and a server, where the mobile terminal is used as a device for generating a task data table, and the server is used for synchronizing the task data table on the mobile terminal, and a processor of the server recognizes incremental task data by reading a program code, which is stored in a buffer or a storage unit and can implement a data synchronization method, and implements the synchronization operation.
Further, the data synchronization method provided by the present invention can also be used to synchronize database data between multiple servers S and terminals TE, or between several terminals TE and several servers S, where a synchronization database can be broadly understood as a storage device that can store any data. When synchronization is performed between a terminal TE and a server S, the terminal TE acts as the device that generates the task data table to be synchronized and the server S acts as the synchronization server. The server S is typically a web server or a PC. The terminal TE is typically a mobile phone, a PC (personal computer), a laptop or a PDA device.
The first example, shown in fig. 2(a), is a synchronization system in which a client device TE and a synchronization server S are connected to a local area network LAN. The client device TE connected to the network LAN comprises functional components for communicating with the devices of the network LAN, such as a network card and software controlling data transmission. The local area network LAN may be any kind of local area network, and the TE may also be connected to the server S via the Internet, typically through a firewall FW. The terminal TE may also be connected wirelessly to the local area network LAN via an access point AP.
The second example, shown in fig. 2(b), is a synchronization system in which a client device TE communicates with a server S via a mobile network MNW. The terminal TE connected to the network MNW comprises functionality for communicating wirelessly with the network MNW. There may also be other networks, such as a local area network LAN, between the mobile network MNW and the server S. The mobile network MNW may be any known wireless network, for example a network supporting GSM services, a network supporting GPRS (general packet radio service), a third generation mobile network such as a network defined according to 3GPP (third generation partnership project), a wireless local area network WLAN, or a private network. It should be noted that the server S itself may comprise the database it synchronizes, or the database may be located in another device; in fig. 2(b), the server S and the database DB are shown separately for convenience of explanation. Synchronization configurations other than those illustrated in fig. 2 are also possible.
Based on the hardware structure of the data synchronization system, the embodiments of the data synchronization method of the present invention are provided.
Referring to fig. 3, fig. 3 is a flowchart of a data synchronization method according to an embodiment of the present invention. In this embodiment, the data synchronization method specifically includes the following steps:
Step S310, detecting whether the data validity of the data storage device meets an updating condition, wherein the updating condition is that the data validity reaches an expiration time;
In this step, data validity can be checked in two ways. In the first, the server that synchronizes the data checks whether the data on the mobile terminal has been updated, or whether the time for updating the data has been reached, in order to judge whether the data stored on the server has expired. In the second, the server performs a self-check: it judges whether the data has expired by detecting whether the difference between the start time of the data currently stored on the server and the current time is greater than a preset value, or whether the data synchronization timer on the server has expired.
In practical applications, the check may also be whether the data in the data storage device has reached its update time, or whether a data update request has been detected on the data storage device. For automatic updates, a timer is set to trigger updates at regular intervals; once the data storage device detects that the timer has expired, it treats the stored data as no longer valid and obtains new data from the data source device connected to it in order to update.
For actively triggered updates, the data storage device determines whether the data needs to be updated by detecting a data update request; if a request is detected, the original data stored in the device needs to be updated with the new data.
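As a non-limiting illustration, the update condition of step S310 can be sketched as a single predicate that covers both the timed case and the actively triggered case. The function name needs_update, its parameters and the use of seconds are assumptions made for this sketch only.

```python
import time


def needs_update(last_sync, ttl_seconds, update_requested=False, now=None):
    """Return True when the stored data should be refreshed.

    last_sync        -- start time of the currently stored data (seconds)
    ttl_seconds      -- assumed validity period of the stored data
    update_requested -- True when an explicit data update request was detected
    now              -- current time; defaults to the system clock
    """
    now = time.time() if now is None else now
    expired = (now - last_sync) >= ttl_seconds  # validity reached its expiration time
    return expired or update_requested


timed = needs_update(last_sync=0.0, ttl_seconds=60.0, now=61.0)
still_valid = needs_update(last_sync=0.0, ttl_seconds=60.0, now=30.0)
forced = needs_update(last_sync=0.0, ttl_seconds=60.0, update_requested=True, now=30.0)
```

Passing the current time as a parameter keeps the check deterministic, which makes the expiry logic easy to exercise without waiting for a real timer.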
Step S320, if the data validity meets the update condition, acquiring a task data table to be synchronized to the data storage device at the current time and a first check code of task data in historical data, where the task data table at least includes new data and historical data that change currently, the historical data is the task data currently stored in the data storage device, the first check code includes N first sub-check codes, the first sub-check codes are calculated based on at least one row or at least one task data in the historical data, and N is greater than or equal to 1;
in this embodiment, the acquired task data refers to data on a user terminal or a server connected to a data storage device, and the data storage device may be understood as a backup database, a data buffer, a database in the server, or other storage devices, or may be a user terminal itself with a data saving function, where the user terminal may generate data in real time, and the data generated in real time may be retained by the terminal for a short time.
In practical applications, the acquired task data may be a data source acquired from a user terminal connected to the data storage device, and the historical data and the first check code of the historical data may be acquired from the data storage device itself, and at this time, for the data storage device, simultaneous storage of the data and the corresponding check code is required each time the data is synchronized.
In this embodiment, the first check code may be understood as a set of sub-check codes, each of which corresponds to one row of data or one task data, that is, each sub-check code is calculated from one row of data or one task data. To reduce the number of check codes, a single sub-check code may instead be calculated from multiple rows of data or multiple task data, which further reduces the amount of calculation and the power consumption of the data storage device.
In this embodiment, several calculation algorithms can be used for the first sub-check codes, such as a Hamming check algorithm or a hash algorithm. Hash algorithms include MD2, MD4 and MD5; the hash value that such a method produces (as used, for example, in digital signatures) is called a message digest. Another option is the Secure Hash Algorithm (SHA), a standard algorithm that produces a larger (160-bit) message digest and is somewhat similar to the MD4 algorithm.
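As a non-limiting illustration, a sub-check code for one row can be computed with the hash functions of the Python standard library. The row content and its field separator are made up for this sketch; MD5 yields a 128-bit digest and SHA-1 a 160-bit digest.

```python
import hashlib

# A hypothetical row of task data, serialized as text before hashing.
row = "task-42|2019-08-15|DONE".encode("utf-8")

md5_code = hashlib.md5(row).hexdigest()    # MD code of the row
sha1_code = hashlib.sha1(row).hexdigest()  # SHA-1 digest of the same row

md5_bits = hashlib.md5().digest_size * 8    # digest length in bits
sha1_bits = hashlib.sha1().digest_size * 8  # digest length in bits
```

Whichever algorithm is chosen, the same one has to be used for the first and second check codes, since digests from different algorithms cannot be compared.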
Step S330, calculating a second check code of the task data in the task data table by using a calculation algorithm which is the same as that used for calculating the first check code, wherein the second check code comprises M second sub check codes, the second sub check codes are calculated based on at least one row of data or at least one task data in the task data table, and M is greater than or equal to N;
Step S340, comparing whether the second check code is consistent with the first check code, so as to judge whether the task data currently stored in the data storage device is abnormal;
In this embodiment, the second check code is also a set of sub-check codes. To keep the check codes correct and comparable, it is calculated with the same algorithm that was used for the sub-check codes in the first check code: if a hash algorithm was used there, a hash algorithm is used here, and if a Hamming code check algorithm was used there, the corresponding Hamming code check algorithm is used here.
In this embodiment, calculating the second sub-check codes in the second check code also requires the data division rule that was used when the first sub-check codes were calculated. That is, when calculating the first sub-check codes, the device divides the historical data into small sets of rows or small sets of task data and then calculates a first sub-check code for each small set with the corresponding check algorithm. Before the second sub-check codes are calculated, the division rule the device applied to the historical data is determined, the current data is divided according to the same rule, and the second sub-check codes are then calculated with the same check algorithm.
In this case, if the current data and the historical data are the same, every calculated second sub-check code matches its corresponding first sub-check code; if they differ, at least one second sub-check code does not match, which indicates that the data are not the same.
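As a non-limiting illustration, applying one division rule to both the historical data and the current data can be sketched as follows. The rule assumed here (a fixed number of rows per small set), the newline join used to serialize a set, and the name sub_check_codes are assumptions made for this sketch.

```python
import hashlib


def sub_check_codes(rows, rows_per_group):
    """Divide rows into fixed-size small sets and return one MD5 code per set."""
    codes = []
    for start in range(0, len(rows), rows_per_group):
        group = rows[start:start + rows_per_group]
        joined = "\n".join(group).encode("utf-8")
        codes.append(hashlib.md5(joined).hexdigest())
    return codes


history = ["a,1", "b,2", "c,3", "d,4"]
current = ["a,1", "b,2", "c,9", "d,4", "e,5"]

# The same division rule (two rows per set) is applied to both data sets.
first = sub_check_codes(history, rows_per_group=2)   # N first sub-check codes
second = sub_check_codes(current, rows_per_group=2)  # M second sub-check codes

# A set is changed when it has no counterpart or its code differs.
changed_groups = [
    i for i, code in enumerate(second)
    if i >= len(first) or code != first[i]
]
```

Here the first small set is unchanged, the second contains a modified row, and the third exists only in the current data, so M is greater than N.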
Step S350, if an abnormality exists, screening the abnormal task data from the task data table, retaining the historical data, and synchronizing the abnormal task data into the data storage device, wherein the abnormal task data is incremental task data;
Step S360, if no abnormality exists, replacing the first check code with the second check code, and deleting the historical data and the first check code in the task data table.
In this embodiment, an abnormality specifically means that the second check code differs from the first check code. In that case the current data is compared with the historical data, the part that differs is determined, and finally that part is synchronized into the data storage device.
In practical applications, the differing data falls into two cases. In the first, existing data has been modified: the corresponding part of the historical data is replaced with the corresponding part of the current data, so only that part needs to be synchronized while the rest is retained, which reduces the amount of data synchronized and improves synchronization efficiency. In the second, the data is newly added: the new data is synchronized directly into the device, that is, appended to the data the data storage device already stores.
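As a non-limiting illustration, the two cases of differing data can be sketched with plain dictionaries standing in for the data storage device and the task data table. For brevity this sketch compares row values directly rather than check codes; the name apply_increment and the keys used are hypothetical.

```python
def apply_increment(store, task_table):
    """Synchronize only modified and newly added rows into `store`.

    Returns the two groups of incremental task data that were applied.
    """
    # Case 1: the key already exists in the store but its content changed.
    modified = {k: v for k, v in task_table.items() if k in store and store[k] != v}
    # Case 2: the key does not exist in the store yet.
    added = {k: v for k, v in task_table.items() if k not in store}

    store.update(modified)  # replace only the changed part of the historical data
    store.update(added)     # append the new data to what is already stored
    return modified, added


store = {"t1": "open", "t2": "open"}
task_table = {"t1": "open", "t2": "closed", "t3": "open"}
modified, added = apply_increment(store, task_table)
```

The unchanged row is never rewritten, which is the source of the saving over full synchronization.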
In this embodiment, data is synchronized by an incremental scheme: only the incremental part of the updated data is synchronized, which shortens the data synchronization time compared with a full synchronization and improves resource utilization.
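The two cases can be illustrated with a minimal sketch, assuming that both the stored data and the incremental data are keyed by a primary key. The function name `apply_increment` and the dictionary layout are hypothetical and are chosen only for illustration.

```python
def apply_increment(store, increment):
    """Apply incremental task data to the stored data in place.

    Keys already present in the store are changed rows (first case);
    keys that are absent are newly added rows (second case).
    Rows not mentioned in the increment are left untouched.
    """
    changed = sorted(k for k in increment if k in store)
    added = sorted(k for k in increment if k not in store)
    store.update(increment)
    return changed, added


# Hypothetical data: k2 was modified, k3 is new, k1 is unchanged.
store = {"k1": ("a", "b"), "k2": ("c", "d")}
increment = {"k2": ("c", "x"), "k3": ("e", "f")}
changed, added = apply_increment(store, increment)
```

Only the two rows in `increment` are written; the unchanged row `k1` is never touched.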
In this embodiment, the algorithm for calculating the check code may be an existing symmetric algorithm, an asymmetric algorithm, or another algorithm, as long as the algorithm can perform an encryption calculation on the data so that the results obtained for different data can be compared; a Hamming code check algorithm or a hash algorithm is preferred.
If a hash algorithm is adopted, the step of calculating the second check code of the task data in the task data table by using the same calculation algorithm as that used for the first check code specifically comprises:
calculating an overall check value of the task data in the task data table according to the hash algorithm;
extracting each row of data or each piece of task data in the task data table, and calculating a check code of each row of data or each piece of task data according to the hash algorithm to obtain the second sub-check codes, wherein each second sub-check code is an MD code;
and storing all the MD codes separately in a preset check code table in the task data table.
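The three steps above can be sketched as follows. This is an illustration only, assuming MD5 from Python's `hashlib` as the hash algorithm, rows stored as tuples of strings, and `&` as the field separator. The names `row_code`, `overall_code`, and `build_check_table` are hypothetical.

```python
import hashlib


def row_code(row):
    """Sub-check code of one row: MD5 hex digest of its fields joined by '&'."""
    return hashlib.md5("&".join(row).encode("utf-8")).hexdigest()


def overall_code(rows):
    """Overall check value: MD5 over every key and row code, taken in key order."""
    digest = hashlib.md5()
    for key in sorted(rows):
        digest.update(key.encode("utf-8"))
        digest.update(row_code(rows[key]).encode("utf-8"))
    return digest.hexdigest()


def build_check_table(rows):
    """Check code table kept apart from the data: primary key -> sub-check code."""
    return {key: row_code(fields) for key, fields in rows.items()}


# Hypothetical task data table keyed by primary key.
task_table = {"k1": ("a", "b"), "k2": ("c", "d")}
check_table = build_check_table(task_table)
overall = overall_code(task_table)
```

Sorting the keys before hashing is a design choice of this sketch: it makes the overall value independent of the order in which rows happen to be read.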
In this embodiment, either of the following two calculation schemes may be adopted when calculating the second check code of the task data table; the two schemes differ in whether the task data is divided into small sets:
the first calculates a check code over the current task data table as a whole;
the second calculates check codes over parts of the data in the task data table.
The check code of the first calculation scheme is mainly used to judge whether the current data differs from the historical data; if it does, the check codes of the second calculation scheme are used to locate the differing parts of the data in the task data table.
In this embodiment, calculating the check codes of the second calculation scheme also requires the partitioning rule for the task data in the task data table. That is, the way the historical data was divided when the first sub-check codes were calculated must be determined, for example division row by row or division task by task, with one check code calculated for each row of data or for each task. Once the partitioning of the historical data has been determined, the same partitioning is applied to the task data table, and the second sub-check codes are then calculated.
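A minimal sketch of reusing one partitioning rule for both data sets is given below. It assumes rows are dictionaries with hypothetical `id`, `task`, and `value` fields, and the rule names `"row"` and `"task"` are likewise assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def partition(rows, rule):
    """Divide rows into small sets, one per row ('row') or one per task ('task')."""
    groups = defaultdict(list)
    for row in rows:
        key = row["id"] if rule == "row" else row["task"]
        groups[key].append(row)
    return groups


def sub_codes(rows, rule):
    """One MD5 sub-check code per small set produced by the given rule."""
    codes = {}
    for key, members in partition(rows, rule).items():
        digest = hashlib.md5()
        for row in sorted(members, key=lambda r: r["id"]):
            digest.update(f'{row["id"]}&{row["task"]}&{row["value"]}'.encode("utf-8"))
        codes[key] = digest.hexdigest()
    return codes


history = [
    {"id": "1", "task": "t1", "value": "a"},
    {"id": "2", "task": "t1", "value": "b"},
    {"id": "3", "task": "t2", "value": "c"},
]
current = history[:2] + [{"id": "3", "task": "t2", "value": "changed"}]

rule = "task"                      # the rule recorded when the first codes were made
first = sub_codes(history, rule)   # first sub-check codes
second = sub_codes(current, rule)  # second sub-check codes, same rule
```

Because both sides use the same rule, the codes can be compared key by key; here only the small set for task `t2` differs.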
In this case, the step of comparing whether the second check code is consistent with the first check code to determine whether the task data currently stored in the data storage device is abnormal includes:
extracting a first check value and the first sub-check codes from the first check code, wherein the first check value is the overall check value of the historical data;
comparing whether the overall check value is the same as the first check value;
if the overall check value is the same as the first check value, determining that no incremental task data exists in the task data table;
and if the overall check value is different from the first check value, determining that incremental task data exists in the task data table.
In this embodiment, when the data storage device judges whether the task data is abnormal, it is in fact identifying whether the data to be synchronized contains new data, and it does so by means of the overall check code of the data. Here, new data may mean either newly added data or original historical data that has been changed. In either case, if the two check values are not equal, it is determined that new data exists, and the process jumps to step S350.
In this embodiment, after the step of determining that incremental task data exists in the current data, the method further includes:
comparing each MD code in the second check code one by one with the corresponding first sub-check code in the first check code, and screening out, based on the comparison result, the MD codes in the second check code that differ from the first sub-check codes in the first check code.
In this embodiment, besides judging the data as abnormal as a whole and then comparing the data directly to screen out the differing data, the differing data can also be screened by means of the sub-check codes. In practical application, when the overall check codes of the historical data and of the current data of the data storage device are calculated, the data is further subdivided, for example by arranging the data in its stored format and calculating a sub-check code for each row of data with the same check algorithm used for the overall check code. After the data has been judged to contain new data, the sub-check codes of the current data (the second sub-check codes) and the sub-check codes of the historical data (the first sub-check codes) are obtained and compared with each other, and the data is extracted according to the comparison result to obtain the differing part. Finally, only that part is synchronized, which reduces the amount of data to synchronize, reduces resource consumption, and improves the efficiency of data synchronization.
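The two-stage comparison described above can be sketched as follows, assuming each check code is held as a dictionary with an `overall` value and a `sub` mapping from primary key to sub-check code. The layout and the name `find_incremental_keys` are hypothetical, and the code strings are placeholders rather than real MD5 values.

```python
def find_incremental_keys(first_code, second_code):
    """Return the keys whose second sub-check code differs from the first.

    Stage 1 compares the overall check values; stage 2 runs only when they
    differ and compares the sub-check codes one by one. A key missing from
    the first check code counts as newly added data.
    """
    if first_code["overall"] == second_code["overall"]:
        return []  # no incremental task data
    old = first_code["sub"]
    return sorted(k for k, code in second_code["sub"].items() if old.get(k) != code)


# Placeholder codes: k2 was changed and k3 was newly added.
first_code = {"overall": "aaa", "sub": {"k1": "m1", "k2": "m2"}}
second_code = {"overall": "bbb", "sub": {"k1": "m1", "k2": "m2x", "k3": "m3"}}
different = find_incremental_keys(first_code, second_code)
```

The overall comparison acts as a cheap early exit, so the row-by-row pass is skipped whenever nothing has changed.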
In practical applications, the MD5 algorithm is preferred among hash algorithms for the above calculation, so the calculated check code is an MD5 code. To ensure the accuracy of data comparison, this embodiment preferably divides the data by task: for example, when the obtained current data contains several pieces of task data, each piece of task data is treated as one small data set and its check code is calculated to obtain the sub-check code of that task data. This preserves the integrity of each piece of task data and reduces errors in calculating the sub-check codes.
Of course, if the data volume is small, the data can be divided directly by row, so that differences are located down to the individual row, which reduces the amount of data to synchronize and improves data synchronization efficiency.
In this embodiment, the steps of screening out abnormal task data from the current data, retaining the historical data, and synchronizing the abnormal task data to the data storage device include:
extracting the corresponding incremental task data from the current data according to the differing MD codes that were screened out, inserting the incremental task data into a preset incremental data table, setting an attribute of the inserted incremental task data as a distinguishing field, and recording the insertion time in the distinguishing field;
and retaining the task data currently stored in the data storage device, selecting from the incremental data table the incremental task data whose insertion time is closest to the current time, and synchronizing that incremental task data into the data storage device.
In this embodiment, to speed up the data synchronization operation, after the sub-check codes in the current data that differ from those of the historical data have been identified, the corresponding data is extracted according to those sub-check codes and stored in an incremental data table. The incremental data is the new data added to the current data when the data is updated, and the incremental data table is the table that is called when data synchronization is performed.
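A minimal sketch of the incremental data table is given below, assuming the table is a list of dictionaries and the distinguishing field holds the insertion day. The field name `partition` and the function names are hypothetical.

```python
from datetime import date


def insert_increment(delta_table, task_table, changed_keys, insert_day):
    """Copy the differing rows into the incremental table, stamped with the day."""
    for key in changed_keys:
        delta_table.append(
            {"key": key, "fields": task_table[key], "partition": insert_day}
        )


def select_for_sync(delta_table, today):
    """Return the rows whose insertion day is closest to the current day."""
    if not delta_table:
        return []
    closest = min(delta_table, key=lambda r: abs((today - r["partition"]).days))
    return [r for r in delta_table if r["partition"] == closest["partition"]]


# Hypothetical data: k2 changed on 14 August, k3 was added on 15 August.
task_table = {"k1": ("a", "b"), "k2": ("c", "x"), "k3": ("e", "f")}
delta_table = []
insert_increment(delta_table, task_table, ["k2"], date(2019, 8, 14))
insert_increment(delta_table, task_table, ["k3"], date(2019, 8, 15))
to_sync = select_for_sync(delta_table, date(2019, 8, 15))
```

Only the rows stamped with the most recent day are selected, so earlier increments that were already synchronized are not sent again.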
Further, after the steps of screening abnormal task data from the current data according to the abnormality, retaining the historical data, and synchronizing the abnormal task data to the data storage device, the method further includes:
calculating the time deviation between the insertion times of the other fields in the incremental data table and the insertion time of the distinguishing field;
judging whether the time deviation is greater than a preset time difference;
and if the time deviation is greater than the preset time difference, deleting from the incremental data table the incremental task data corresponding to all fields whose time deviation is greater than the preset time difference.
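The clean-up can be sketched as follows. The sketch assumes a preset time difference of 10 days (the value used in the Hive example of this description) and the same hypothetical list-of-dictionaries layout with a `partition` field holding the insertion day.

```python
from datetime import date, timedelta


def prune(delta_table, reference_day, max_age=timedelta(days=10)):
    """Keep only rows whose time deviation from the reference day is within max_age."""
    return [r for r in delta_table if reference_day - r["partition"] <= max_age]


# Hypothetical rows inserted 14, 10, and 0 days before the reference day.
delta_table = [
    {"key": "k1", "partition": date(2019, 8, 1)},
    {"key": "k2", "partition": date(2019, 8, 5)},
    {"key": "k3", "partition": date(2019, 8, 15)},
]
kept = prune(delta_table, reference_day=date(2019, 8, 15))
```

A row exactly 10 days old is kept, because only deviations greater than the preset time difference are deleted.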
In the data synchronization method provided by the embodiment of the invention, when data needs to be updated and synchronized, the incremental task data in the data to be synchronized is determined by comparing the check code of the data to be synchronized with the check code of the historical data, and only the incremental task data is then synchronized to the storage device.
The data synchronization method provided by this embodiment is described below with a data warehouse infrastructure built on Hadoop (Hive, as named in the steps below) as the execution subject of the method. The data warehouse infrastructure provides a set of tools for data extraction, transformation, and loading (ETL), a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. It also defines a simple SQL-like query language, called HQL, that allows users familiar with SQL to query the data, as shown in fig. 4.
In this embodiment, the data involved in the synchronization operation is exchanged in the form of data tables. For example, the historical data and the current data exist as data tables: after the historical data and the current data are obtained, they form a full data table, a table is also established for the incremental data, and the check codes are likewise held in a table. The specific process is as follows:
Step S40: this task is the original Hive full-data batch task.
Step S41: this task is the batch task for the increment (delta) table. The increment table retains the incremental data of the last 10 days. The main processing logic is to calculate the difference between the history table and the latest data table of the current day and insert the difference data into the incremental Hive table. The increment table has a partition field that records the day of insertion. Finally, historical incremental data older than 10 days is deleted.
Step S42: after the task of step S41 completes, the incremental data export task is started, and the small amount of changed data is updated to the real-time query system according to the incremental Hive table.
Step S43: this task calculates an MD5 code over all fields of each row in the Hive full table and inserts the MD5 codes into the Hive MD5 table, for use by the next run of the Hive increment table task of step S41.
The tables used in steps S40-S43 are illustrated below by way of example.
The current data to be synchronized is taken as the task data table, the corresponding MD5 codes are calculated at the same time, and the data and the codes are stored in their respective tables, as shown in the following two tables:
Table A_full (full-data table)
{
PRIMARY_KEY STRING COMMENT 'physical primary key'
COL1 STRING COMMENT 'field 1'
COL2 STRING COMMENT 'field 2'
}
Table A_MD5
{
PRIMARY_KEY STRING COMMENT 'physical primary key'
MD5 STRING COMMENT 'MD5 code of all fields'
}
After the check codes of the task data table and of the historical data are compared, the identified incremental data is collected into an increment table; during synchronization only this table is called, and the data is selected and synchronized in time order. The table is as follows:
Table A_delta (increment table)
{
PRIMARY_KEY STRING COMMENT 'physical primary key'
COL1 STRING COMMENT 'field 1'
COL2 STRING COMMENT 'field 2'
PARTITION STRING COMMENT 'partition field'
}
The following description takes a retention period of 10 days as an example:
1) Hive full-data task: responsible for calculating statistics over the full data.
2) Hive increment table task:
Table A_full (the full-data table) is taken as the base table and joined to table A_MD5 on the primary key. A_MD5 records, for each row, the MD5 code of the CONCAT of all fields of A_full as of the previous day. After the join, the MD5 code of each row of the latest A_full is compared with the MD5 code of the previous day; rows whose codes are not equal are the data newly added or changed on the current day and are inserted into table A_delta (the increment table), with the partition field set to the current day.
Finally, incremental history data older than 10 days is deleted.
3) Hive full-table MD5 task:
This task generates MD5 codes from the Hive full-table data, for use by the next day's incremental task.
The MD5 code is calculated as follows: MD5(CONCAT(NVL(COL1, ''), '&', NVL(COL2, '')))
4) Incremental data export task: acquires the data of the current day from table A_delta and synchronizes the small amount of incremental data to the real-time system. Because the increment table retains the incremental data of the last 10 days, in the extreme case where the export task cannot finish on the current day, the missing data can be backfilled the next day by means of the partition field.
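The increment table task and the MD5 task can be tried out with the following sketch. It is not Hive code: SQLite (through Python's `sqlite3`) stands in for Hive as an assumption of this example, with `IFNULL` and `||` standing in for `NVL` and `CONCAT`, and an `md5` SQL function registered from `hashlib`. The lower-case table names and the column names `md5_code` and `partition_day` are hypothetical.

```python
import hashlib
import sqlite3


def md5_hex(text):
    """MD5 hex digest of a string, exposed to SQL as md5()."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.create_function("md5", 1, md5_hex)
conn.executescript("""
    CREATE TABLE a_full  (primary_key TEXT, col1 TEXT, col2 TEXT);
    CREATE TABLE a_md5   (primary_key TEXT, md5_code TEXT);
    CREATE TABLE a_delta (primary_key TEXT, col1 TEXT, col2 TEXT, partition_day TEXT);
""")

# Previous day's codes for k1 = (a, b) and k2 = (c, NULL).
conn.executemany("INSERT INTO a_md5 VALUES (?, ?)",
                 [("k1", md5_hex("a&b")), ("k2", md5_hex("c&"))])
# Current day's full table: k1 unchanged, k2 changed, k3 newly added.
conn.executemany("INSERT INTO a_full VALUES (?, ?, ?)",
                 [("k1", "a", "b"), ("k2", "c", "x"), ("k3", "e", "f")])

# Increment table task: join on the primary key and keep the rows whose
# code is missing or differs from the previous day's code.
conn.execute("""
    INSERT INTO a_delta
    SELECT f.primary_key, f.col1, f.col2, ?
    FROM a_full AS f
    LEFT JOIN a_md5 AS m ON f.primary_key = m.primary_key
    WHERE m.md5_code IS NULL
       OR m.md5_code <> md5(IFNULL(f.col1, '') || '&' || IFNULL(f.col2, ''))
""", ("2019-08-15",))

# MD5 task: refresh the codes for the next day's comparison.
conn.execute("DELETE FROM a_md5")
conn.execute("""
    INSERT INTO a_md5
    SELECT primary_key, md5(IFNULL(col1, '') || '&' || IFNULL(col2, ''))
    FROM a_full
""")

delta_keys = [row[0] for row in
              conn.execute("SELECT primary_key FROM a_delta ORDER BY primary_key")]
md5_rows = conn.execute("SELECT COUNT(*) FROM a_md5").fetchone()[0]
```

The left join is what lets newly added rows through: a row with no previous code has a NULL `md5_code` and is treated as incremental data.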
Further, in this embodiment, the task data table, the check code table, and the increment table are associated with each other through the physical primary key column defined in each table, so that once the differing sub-check codes are determined, the corresponding tables and data can be called directly to perform the data synchronization operation.
Because the method synchronizes only the data that has been added or changed, the data synchronization time is greatly shortened and resource utilization is improved.
In the actual task scheduling design, the Hive full-table MD5 task is moved to the end of the run. The incremental data export task does not depend on it and can be started as soon as the Hive increment table task completes, which reduces task waiting time.
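This scheduling order can be sketched with Python's standard `graphlib` module. The task names are hypothetical, and the dependency of the MD5 task on the increment table task is an assumption of this sketch, made because the increment table task reads the previous day's MD5 codes before they are refreshed.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it must wait for.
dependencies = {
    "hive_full": set(),
    "hive_delta_table": {"hive_full"},
    "delta_export": {"hive_delta_table"},   # does not wait for the MD5 task
    "hive_md5": {"hive_delta_table"},       # refreshed after the comparison
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

# Collect the tasks that become runnable at each scheduling step.
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)
```

The export task and the MD5 task become runnable in the same step, so the export never waits for the MD5 refresh.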
Comparing MD5 codes avoids having to compare every field individually in the Hive increment task, which improves execution performance and code readability.
The Hive increment table retains incremental history data for 10 days, which guards against the case where the export task cannot finish on the same day and the cross-day data would otherwise be overwritten.
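A minimal sketch of the backfill is given below, assuming the export task remembers the last day it exported successfully. The function name `partitions_to_export` and the `partition` field are hypothetical.

```python
from datetime import date


def partitions_to_export(delta_table, last_exported, today):
    """Partition days after the last successful export, oldest first."""
    return sorted({row["partition"] for row in delta_table
                   if last_exported < row["partition"] <= today})


# Hypothetical case: the export for 14 August failed, so on 15 August
# both the 14 August and the 15 August partitions must be exported.
delta_table = [
    {"key": "k1", "partition": date(2019, 8, 13)},
    {"key": "k2", "partition": date(2019, 8, 14)},
    {"key": "k3", "partition": date(2019, 8, 15)},
]
pending = partitions_to_export(delta_table, date(2019, 8, 13), date(2019, 8, 15))
```

Because the increment table still holds the earlier partition, the missed day can be exported together with the current day.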
In order to solve the above problem, an embodiment of the present invention further provides a data synchronization apparatus, and referring to fig. 5, fig. 5 is a schematic diagram of functional modules of the data synchronization apparatus provided in the embodiment of the present invention. In this embodiment, the apparatus comprises:
a detecting module 51, configured to detect whether data validity of the data storage device meets an update condition, where the update condition is that the data validity has reached an expiration time;
a data acquisition module 52, configured to acquire, when the data validity meets the update condition, a task data table to be synchronized to the data storage device at the current time and a first check code of the task data in the historical data, where the task data table includes at least the currently changed new data and the historical data, the historical data is the task data currently stored in the data storage device, the first check code includes N first sub-check codes, each first sub-check code is calculated based on at least one row of data or at least one piece of task data in the historical data, and N is greater than or equal to 1;
a calculating module 53, configured to calculate a second check code of the task data in the task data table by using a same calculation algorithm for calculating the first check code, where the second check code includes M second sub-check codes, the second sub-check code is calculated based on at least one row of data or at least one task data in the task data table, and M is greater than or equal to N;
a comparing module 54, configured to compare whether the second check code is consistent with the first check code, so as to determine whether the task data currently stored in the data storage device is abnormal;
a synchronization module 55, configured to, when an abnormality exists, screen the abnormal task data out of the task data table, retain the historical data, and synchronize the abnormal task data to the data storage device, where the abnormal task data is incremental task data; and, when no abnormality exists, to replace the first check code with the second check code and delete the historical data and the first check code in the task data table.
Because the apparatus is based on the same embodiment as the data synchronization method described above, the details of the apparatus embodiment are not repeated here.
The invention also provides a computer readable storage medium.
In this embodiment, the computer readable storage medium has a data synchronization program stored thereon, and the data synchronization program, when executed by a processor, implements the steps of the data synchronization method as described in any one of the above embodiments. The method implemented by the data synchronization program when executed by the processor may refer to various embodiments of the data synchronization method of the present invention, and therefore, redundant description is not repeated.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM), and includes instructions for causing a terminal (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The present invention is described in connection with the accompanying drawings, but the present invention is not limited to the above embodiments, which are only illustrative and not restrictive, and those skilled in the art can make various changes without departing from the spirit and scope of the invention as defined by the appended claims, and all changes that come within the meaning and range of equivalency of the specification and drawings that are obvious from the description and the attached claims are intended to be embraced therein.

Claims (10)

1. A data synchronization method applied to a data storage device is characterized by comprising the following steps:
detecting whether the data validity of the data storage device meets an updating condition, wherein the updating condition is that the data validity reaches an expiration time;
if the data validity meets the updating condition, acquiring a task data table to be synchronized to the data storage device at the current time and a first check code of task data in historical data, wherein the task data table at least comprises the new data and the historical data which are changed currently, the historical data is the task data stored in the data storage device currently, the first check code comprises N first sub-check codes, the first sub-check codes are calculated based on at least one row of data or at least one task data in the historical data, and N is greater than or equal to 1;
calculating a second check code of the task data in the task data table by using a calculation algorithm which is the same as that used for calculating the first check code, wherein the second check code comprises M second sub check codes, the second sub check codes are calculated based on at least one row of data or at least one task data in the task data table, and M is greater than or equal to N;
comparing whether the second check code is consistent with the first check code or not to judge whether the task data currently stored in the data storage equipment is abnormal or not;
if an abnormality exists, screening the abnormal task data out of the task data table, retaining the historical data, and synchronizing the abnormal task data into the data storage device, wherein the abnormal task data is incremental task data;
and if no abnormality exists, replacing the first check code with the second check code, and deleting the historical data and the first check code in the task data table.
2. The data synchronization method of claim 1, wherein the calculation algorithm comprises a Hamming code check algorithm and a hash algorithm.
3. The data synchronization method of claim 2, wherein the step of calculating the second check code of the task data in the task data table using the same calculation algorithm as that used to calculate the first check code comprises:
if the calculation algorithm is a hash algorithm, calculating an overall check value of the task data in the task data table according to the hash algorithm;
extracting each row of data or each piece of task data in the task data table, and calculating a check code of each row of data or each piece of task data according to the hash algorithm to obtain the second sub-check codes, wherein each second sub-check code is an MD code;
and storing all the MD codes separately in a preset check code table in the task data table.
4. The data synchronization method of claim 3, wherein the step of comparing whether the second check code is consistent with the first check code to determine whether the task data currently stored in the data storage device is abnormal comprises:
extracting a first check value and a first sub check code from the first check code, wherein the first check value is an integral check value of the historical data;
comparing whether the overall check value is the same as the first check value;
if the overall check value is the same as the first check value, determining that the incremental task data does not exist in the task data table;
and if the overall check value is different from the first check value, determining that the incremental task data exists in the task data table.
5. The data synchronization method of claim 4, wherein after the step of determining that the incremental task data exists in the task data table, further comprising:
and comparing each MD code in the second check code with the corresponding first sub-check code in the first check code one by one, and screening out different MD codes in the second check code based on the comparison result.
6. The data synchronization method of claim 5, wherein the step of screening the task data table for anomalous task data, and retaining the historical data and synchronizing the anomalous task data to the data storage device comprises:
extracting the corresponding incremental task data from the task data table according to the screened MD codes that differ from the first sub-check codes, inserting the incremental task data into a preset incremental data table, setting an attribute of the inserted incremental task data as a distinguishing field, and recording the insertion time in the distinguishing field;
and retaining the task data currently stored in the data storage device, determining in the incremental data table the incremental task data whose insertion time is closest to the current time, and synchronizing that incremental task data to the data storage device.
7. The data synchronization method of claim 6, further comprising, after the steps of screening the task data table for anomalous task data, preserving the historical data, and synchronizing the anomalous task data to the data storage device:
calculating the time deviation between the insertion times of the other fields in the incremental data table and the insertion time of the distinguishing field;
judging whether the time deviation is greater than a preset time difference;
and if the time deviation is greater than the preset time difference, deleting from the incremental data table the incremental task data corresponding to all fields whose time deviation is greater than the preset time difference.
8. A data synchronization apparatus, characterized in that the data synchronization apparatus comprises:
the detection module is used for detecting whether the data validity of the data storage equipment meets an updating condition, wherein the updating condition is that the data validity reaches the expiration time;
the data acquisition module is used for acquiring a task data table to be synchronized to the data storage device at the current time and a first check code of task data in historical data when the data validity meets an updating condition, wherein the task data table at least comprises new data and historical data which change currently, the historical data is the task data stored in the data storage device currently, the first check code comprises N first sub-check codes, the first sub-check codes are obtained by calculation based on at least one row of data or at least one task data in the historical data, and N is greater than or equal to 1;
the calculation module is used for calculating a second check code of the task data in the task data table by using a calculation algorithm which is the same as that used for calculating the first check code, wherein the second check code comprises M second sub check codes, the second sub check codes are calculated based on at least one row of data or at least one task data in the task data table, and M is greater than or equal to N;
the comparison module is used for comparing whether the second check code is consistent with the first check code or not so as to judge whether the task data currently stored in the data storage equipment is abnormal or not;
the synchronization module is used for screening the abnormal task data out of the task data table when an abnormality exists, retaining the historical data, and synchronizing the abnormal task data into the data storage device, wherein the abnormal task data is incremental task data; and for replacing the first check code with the second check code and deleting the historical data and the first check code in the task data table when no abnormality exists.
9. A data synchronization apparatus, characterized in that the data synchronization apparatus comprises: memory, a processor and a data synchronization program stored on the memory and executable on the processor, the data synchronization program when executed by the processor implementing the steps of the data synchronization method of any of claims 1-7.
10. A computer-readable storage medium, having stored thereon a data synchronization program, which when executed by a processor implements the steps of the data synchronization method of any one of claims 1-7.
CN201910751919.7A 2019-08-15 2019-08-15 Data synchronization method, device, equipment and computer readable storage medium Pending CN110647531A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910751919.7A CN110647531A (en) 2019-08-15 2019-08-15 Data synchronization method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910751919.7A CN110647531A (en) 2019-08-15 2019-08-15 Data synchronization method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN110647531A true CN110647531A (en) 2020-01-03

Family

ID=69009586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910751919.7A Pending CN110647531A (en) 2019-08-15 2019-08-15 Data synchronization method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN110647531A (en)


Similar Documents

Publication Publication Date Title
CN110647531A (en) Data synchronization method, device, equipment and computer readable storage medium
KR102476531B1 (en) Data Synchronization Method and Apparatus, Media, and Electronic Device for Distributed Systems
CN107038162B (en) Real-time data query method and system based on database log
AU2017101864A4 (en) Method, device, server and storage apparatus of reviewing SQL
CN107515874B (en) Method and equipment for synchronizing incremental data in distributed non-relational database
WO2017162032A1 (en) Method and device for executing data recovery operation
US20130346364A1 (en) Method and system for capturing change of data
CN109376196B (en) Method and device for batch synchronization of redo logs
CN106874281B (en) Method and device for realizing database read-write separation
CN110096522B (en) Block chain data processing method, device and equipment supporting relational retrieval
CN108965383B (en) File synchronization method and device, computer equipment and storage medium
CN110245145A (en) Structure synchronization method and apparatus from relational database to Hadoop database
CN115408370B (en) Database migration evaluation method and system, computer device and storage medium
CN111222089A (en) Data processing method, data processing device, computer equipment and storage medium
CN112732756A (en) Data query method, device, equipment and storage medium
CN113672692B (en) Data processing method, data processing device, computer equipment and storage medium
JP6329552B2 (en) Reference data segmentation from single table to multiple tables
CN111046081A (en) Access method and system for industrial time sequence data
CN114791927A (en) Data analysis method and device
WO2024060677A1 (en) Data verification method and electronic device
CN112148705A (en) Data migration method and device
US10447807B1 (en) Dynamic middleware source selection for optimizing data retrieval from network nodes
US20150088958A1 (en) Information Processing System and Distributed Processing Method
CN115827777A (en) Self-adaptive synchronization and difference identification method, device and equipment for multiple data sources
CN113032368A (en) Data migration method and device, storage medium and platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200103