CN111930837A

CN111930837A - Mass data processing method and system based on preposed distributed database

Info

Publication number: CN111930837A
Application number: CN202010703239.0A
Authority: CN
Inventors: 刘跃红; 余丽玲; 管正爽; 郭倩
Original assignee: Yinsheng Payment Service Co Ltd
Current assignee: Yinsheng Payment Service Co Ltd
Priority date: 2020-07-21
Filing date: 2020-07-21
Publication date: 2020-11-13

Abstract

An embodiment of the invention provides a mass data processing method based on a preposed (front-end) distributed database, comprising the following steps. Step one: acquire the current time and the transaction date recorded in a distributed database, wherein the distributed database comprises a front cluster and a full cluster. Step two: compare the distance between the current time and the transaction date with a threshold. Step three: if the distance between the current time and the transaction date is greater than the threshold, the data corresponding to the transaction date is cold data and is stored in the full cluster. Step four: if the distance between the current time and the transaction date is less than or equal to the threshold, the data corresponding to the transaction date is hot data and is stored in the front cluster. The embodiment makes it easy to separate hot data from cold data and reduces the load on the distributed database.

Description

Mass data processing method and system based on preposed distributed database

Technical Field

The invention relates to the field of distributed databases, in particular to a mass data processing method and system based on a preposed distributed database.

Background

With the rapid development of services and the rapid increase of data volume, data storage and data use gradually become the bottleneck of the system.

Current solutions to the database bottleneck include master-slave synchronization with read-write separation, and splitting data across multiple databases and tables. Read-write separation can be combined with a distributed database, which effectively improves read performance and reduces the impact of frequent read requests on the write function of the database. Splitting databases and tables can solve the data-writing problem when the data volume is not massive; however, when the data volume reaches the PB level and write requests are frequent, write performance degrades severely because of the cost of selecting the node to write to.

Summary of the invention

In order to overcome the defects of the prior art, the invention provides a mass data processing method based on a preposed distributed database, which separates hot data from cold data and addresses the heavy impact of mass data on the database.

The technical scheme adopted by the invention to solve the technical problem is a mass data processing method based on a preposed distributed database, comprising the following steps. Step one: acquire the current time and the transaction date recorded in a distributed database, wherein the distributed database comprises a front cluster and a full cluster. Step two: compare the distance between the current time and the transaction date with a threshold. Step three: if the distance between the current time and the transaction date is greater than the threshold, the data corresponding to the transaction date is cold data and is stored in the full cluster. Step four: if the distance between the current time and the transaction date is less than or equal to the threshold, the data corresponding to the transaction date is hot data and is stored in the front cluster.

Preferably, before acquiring the current time and the transaction date recorded in the distributed database, the method further includes:

creating an interface of the current application program based on the different tables in the distributed database and the different permissions of different users.

Preferably, before creating the interface of the current application program, the method further includes:

creating tables, indexes and shard structures with consistent structures in the front cluster and the full cluster.

Preferably, after the data corresponding to the transaction date has been determined to be hot data and stored in the front cluster, the method further includes:

periodically storing the data in the front cluster to the full cluster through a script program.

Preferably, after the data in the front cluster is periodically stored to the full cluster through a script program, the method further includes:

periodically clearing expired data in the front cluster through a script program according to a custom routing rule.

A mass data processing system based on a preposed distributed database, the system comprising:

an acquisition unit, used for acquiring the current time and the transaction date recorded in a distributed database, wherein the distributed database comprises a front cluster and a full cluster;

a comparison unit, used for comparing the distance between the current time and the transaction date with a threshold;

a first storage unit, used for storing the data corresponding to the transaction date in the full cluster if the distance between the current time and the transaction date is greater than the threshold, in which case the data corresponding to the transaction date is cold data;

and a second storage unit, used for storing the data corresponding to the transaction date in the front cluster if the distance between the current time and the transaction date is less than or equal to the threshold, in which case the data corresponding to the transaction date is hot data.

Preferably, the system further comprises:

a first creating unit, used for creating an interface of the current application program based on the different tables in the distributed database and the different permissions of different users.

Preferably, the system further comprises:

a second creating unit, used for creating tables, indexes and shard structures with consistent structures in the front cluster and the full cluster.

Preferably, the system further comprises:

a third storage unit, used for periodically storing the data in the front cluster to the full cluster through a script program.

Preferably, the system further comprises:

a clearing unit, used for periodically clearing expired data in the front cluster through a script program according to the custom routing rule.

The beneficial effects of the invention are as follows: the distance between the current time and the transaction date is compared with a threshold to determine whether the current data is cold data or hot data, and the data is stored in the corresponding cluster accordingly, so that hot data and cold data are separated and the load on the distributed database is reduced.

Drawings

Fig. 1 is a flow chart of a mass data processing method based on a preposed distributed database.

Fig. 2 is a functional block diagram of a mass data processing system based on a preposed distributed database.

Fig. 3 is another functional block diagram of a mass data processing system based on a preposed distributed database.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The following detailed description of specific implementations of the present invention is provided in conjunction with specific embodiments:

Embodiment one:

Fig. 1 shows the implementation flow of the mass data processing method based on a preposed distributed database according to the first embodiment of the present invention. For convenience of description, only the parts related to this embodiment are shown, detailed as follows:

In step S101: acquire the current time and the transaction date recorded in a distributed database, wherein the distributed database comprises a front cluster and a full cluster.

In this embodiment, the current time and the transaction date recorded in the distributed database are acquired. Two clusters are deployed in the distributed database system: a front cluster and a full cluster. The front cluster stores a small amount of data, and the full cluster stores the full data. The distributed database contains the transaction journal tables of the various merchants; the transaction date and the current time are obtained from these tables, so that the merchant transaction journal tables can be conveniently classified by transaction date.

Preferably, before the current time and the transaction date recorded in the distributed database are acquired, an interface of the current application program is created based on the different tables in the distributed database and the different permissions of different users. Further preferably, before the interface of the current application program is created, tables, indexes and shards with consistent structures are created in the front cluster and the full cluster.

To keep the data secure and controllable, different interfaces are developed for different tables and for the different permissions of different users, and these interfaces provide services externally.

The MongoDB cluster M1 serves as the preposed distributed storage platform and stores the data of the last 10 days; the cluster M2 serves as the full distributed storage platform and stores the full data. For example, a TRADE_DETAIL table is created in the ODS database of both the M1 and M2 clusters, a compound index on the date + sort fields is created, and date + sort is used as the shard key, that is, as the distribution rule.
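The requirement that both clusters carry identical table, index and shard definitions can be illustrated with a minimal sketch. It only builds the definitions and does not connect to any database. The function name `build_schema_plan` and the layout of the returned dictionary are assumptions made for illustration; the database name ODS, the table name TRADE_DETAIL and the date + sort key come from the description above, and the `shardCollection` document follows the shape of MongoDB's admin command of that name.

```python
from typing import Any, Dict, List

DB_NAME = "ODS"                    # database named in the description
COLLECTION = "TRADE_DETAIL"        # table named in the description
SHARD_FIELDS = ("date", "sort")    # compound index / shard key fields


def build_schema_plan(cluster_names: List[str]) -> Dict[str, Dict[str, Any]]:
    """Return the same index and shard-key definition for every cluster.

    Hypothetical helper: it only prepares the definitions so that they can
    be checked for consistency before being applied to M1 and M2.
    """
    key = {field: 1 for field in SHARD_FIELDS}  # 1 = ascending order
    plan: Dict[str, Dict[str, Any]] = {}
    for name in cluster_names:
        plan[name] = {
            # Key list for a compound index on date + sort.
            "index_keys": list(key.items()),
            # Document shaped like MongoDB's shardCollection admin command.
            "shard_command": {
                "shardCollection": f"{DB_NAME}.{COLLECTION}",
                "key": dict(key),
            },
        }
    return plan


plan = build_schema_plan(["M1", "M2"])
# Both clusters receive structurally identical definitions.
print(plan["M1"] == plan["M2"])
```

Generating the definitions from one place, rather than writing them separately for each cluster, is one simple way to keep the two structures from drifting apart.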

In step S102: comparing a distance between the current time and the transaction date to a threshold;

In this embodiment, the distance between the acquired current time and the transaction date recorded in the distributed database is compared with the threshold. The threshold is generally set to 10 days according to actual business requirements, so the distance between the transaction date and the current time is compared with 10 working days. The type of the data is then determined from the comparison result, so that the data can be stored in different storage platforms according to its type.

In step S103: if the distance between the current time and the transaction date is larger than a threshold value, the data corresponding to the transaction date is cold data, and the data corresponding to the transaction date is stored in the full-scale cluster;

In this embodiment, if the distance between the current time and the transaction date is greater than the threshold, the data corresponding to the transaction date is judged to be cold data, and the merchant transaction journal table corresponding to that transaction date is stored in the full cluster.

In step S104: if the distance between the current time and the transaction date is less than or equal to the threshold, the data corresponding to the transaction date is hot data, and the data corresponding to the transaction date is stored in the front cluster.

In this embodiment, when the distance between the current time and the transaction date is less than or equal to the threshold, the data corresponding to the transaction date is hot data, and the merchant transaction journal table corresponding to the transaction date is stored in the front cluster. The front cluster holds only the small amount of data from the last 10 days, so its response speed is high, which facilitates create, delete, update and query operations on transaction detail data. According to the actual business situation, data is routed to different distributed storage platforms according to a custom rule. Because most create, delete, update and query operations on transaction detail data concern data from the last few days, date is used as the routing rule: the front cluster M1 stores the data of the last 10 days, and the full cluster M2 stores the full data. Operations on data from the last 10 days (current date - 10 < date) are routed to the front cluster M1, and operations on data older than 10 days (date < current date - 10) are routed to the full cluster M2. This relieves the impact and pressure that highly concurrent access to mass data places on the database, reduces the database load, and improves query efficiency.
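The comparison in steps S102 to S104 can be sketched as a small routing function. This is an illustrative sketch, not the patented implementation: the function name `route` is hypothetical, the distance is measured in calendar days (an assumption, since the text also mentions working days), and a distance exactly equal to the threshold is treated as hot data, following the wording of step S104.

```python
from datetime import date

THRESHOLD_DAYS = 10    # threshold used in this embodiment
FRONT_CLUSTER = "M1"   # front cluster: data of the last 10 days
FULL_CLUSTER = "M2"    # full cluster: full data


def route(transaction_date: date, today: date,
          threshold_days: int = THRESHOLD_DAYS) -> str:
    """Return the cluster that should handle data with this transaction date."""
    # Distance in calendar days (assumption; working days are not modelled).
    distance = (today - transaction_date).days
    if distance > threshold_days:
        return FULL_CLUSTER   # cold data
    return FRONT_CLUSTER      # hot data: distance <= threshold


today = date(2020, 7, 21)
print(route(date(2020, 7, 15), today))  # distance 6  -> M1
print(route(date(2020, 7, 11), today))  # distance 10 -> M1
print(route(date(2020, 7, 1), today))   # distance 20 -> M2
```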

The full cluster M2 stores the full data to improve fault tolerance: if the front cluster develops a problem, the full cluster already holds most of the front cluster's data, which reduces the impact.

Preferably, after the data corresponding to the transaction date has been determined to be hot data and stored in the front cluster, the data in the front cluster is periodically stored to the full cluster through a script program. Further preferably, expired data in the front cluster is periodically cleared through a script program according to the custom routing rule. This ensures high availability of the system: when the front cluster is being upgraded or behaves abnormally, the full storage platform can be brought in at any time to provide services externally.
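The availability behaviour described above, where the full storage platform takes over when the front cluster is being upgraded or is abnormal, might look like the following sketch. The function `choose_cluster` and the idea of passing in a set of currently available clusters are assumptions for illustration; the text does not say how availability is detected.

```python
FRONT_CLUSTER = "M1"
FULL_CLUSTER = "M2"


def choose_cluster(preferred: str, available: set) -> str:
    """Pick the cluster that serves a request, falling back to the full cluster.

    `preferred` is the cluster chosen by the date-based routing rule and
    `available` is the set of clusters currently able to serve requests
    (how that set is obtained is outside the scope of this sketch).
    """
    if preferred in available:
        return preferred
    if FULL_CLUSTER in available:
        # The full cluster holds the full data, so it can stand in for M1.
        return FULL_CLUSTER
    raise RuntimeError("no cluster is available")


print(choose_cluster(FRONT_CLUSTER, {"M1", "M2"}))  # normal operation -> M1
print(choose_cluster(FRONT_CLUSTER, {"M2"}))        # M1 down -> M2
```

Because the front cluster is copied to the full cluster only periodically, a fallback of this kind serves the data as of the last synchronization, which matches the statement that the full cluster holds most, not all, of the front cluster's data.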

A timing script imports the data in the front cluster M1 into the cluster M2 in batches at a fixed time every day. Because the data undergoes a large number of update operations, the front cluster is taken as the authoritative source and its data is compared with the data in the full cluster by primary key and last update time: if a record does not exist in the full cluster, it is inserted directly; if a record has been updated, the copy in the full cluster is updated directly. The front cluster thus improves the efficiency of data access, and the full cluster improves the accuracy of data access.
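The insert-or-update comparison by primary key and last update time can be sketched with plain dictionaries standing in for the two clusters. The function name `sync_front_to_full`, the record field `last_update` and the returned counters are hypothetical; a real script would read from M1 and write to M2 in batches.

```python
from datetime import datetime
from typing import Dict


def sync_front_to_full(front: Dict[str, dict], full: Dict[str, dict]) -> Dict[str, int]:
    """Copy records from the front store to the full store.

    Both stores map a primary key to a record holding a 'last_update'
    timestamp. The front store is treated as the authoritative source.
    """
    stats = {"inserted": 0, "updated": 0, "unchanged": 0}
    for key, record in front.items():
        existing = full.get(key)
        if existing is None:
            full[key] = dict(record)           # missing in full: insert
            stats["inserted"] += 1
        elif record["last_update"] > existing["last_update"]:
            full[key] = dict(record)           # newer in front: update
            stats["updated"] += 1
        else:
            stats["unchanged"] += 1            # already up to date
    return stats


front = {
    "t1": {"amount": 120, "last_update": datetime(2020, 7, 20, 9, 0)},
    "t2": {"amount": 75, "last_update": datetime(2020, 7, 20, 10, 0)},
    "t3": {"amount": 30, "last_update": datetime(2020, 7, 19, 8, 0)},
}
full = {
    "t2": {"amount": 70, "last_update": datetime(2020, 7, 19, 10, 0)},
    "t3": {"amount": 30, "last_update": datetime(2020, 7, 19, 8, 0)},
}
stats = sync_front_to_full(front, full)
print(stats)  # {'inserted': 1, 'updated': 1, 'unchanged': 1}
```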

The custom routing rule may use date as the determination field: data older than 10 days (date < current date - 10) is periodically cleared from the front cluster M1, so that the TRADE_DETAIL table in the front cluster always holds only the data of the last 10 days.
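The periodic clearing step can be sketched in the same in-memory style. The function name `purge_expired` and the row layout are assumptions; the sketch also assumes that the front cluster has already been copied to the full cluster, since the description performs the clearing after that synchronization.

```python
from datetime import date, timedelta
from typing import List


def purge_expired(front_rows: List[dict], today: date, keep_days: int = 10) -> List[dict]:
    """Return the rows that remain in the front cluster after clearing.

    Rows whose 'date' is earlier than today - keep_days are expired.
    Rows dated exactly today - keep_days are kept, consistent with
    treating a distance equal to the threshold as hot data.
    """
    cutoff = today - timedelta(days=keep_days)
    return [row for row in front_rows if row["date"] >= cutoff]


rows = [
    {"id": "t1", "date": date(2020, 7, 20)},
    {"id": "t2", "date": date(2020, 7, 11)},  # exactly 10 days old: kept
    {"id": "t3", "date": date(2020, 7, 10)},  # 11 days old: cleared
]
remaining = purge_expired(rows, today=date(2020, 7, 21))
print([row["id"] for row in remaining])  # ['t1', 't2']
```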

It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by relevant hardware instructed by a program, and the program may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc.

Embodiment two:

Fig. 2 shows the structure of a mass data processing system based on a preposed distributed database according to the second embodiment of the present invention. For convenience of description, only the parts related to this embodiment are shown, detailed as follows:

an obtaining unit 201, configured to obtain a current time and a transaction date recorded in a distributed database, where the distributed database includes a pre-cluster and a full-scale cluster;

a comparing unit 202, configured to compare a distance between the current time and the transaction date with a threshold;

a first storage unit 203, configured to, if a distance between the current time and the transaction date is greater than a threshold, determine that data corresponding to the transaction date is cold data, and store the data corresponding to the transaction date in the full-size cluster;

the second storage unit 204 is configured to, if a distance between the current time and the transaction date is less than or equal to a threshold, determine that data corresponding to the transaction date is hot data, and store the data corresponding to the transaction date to the front cluster.

In this embodiment of the invention, the current time and the transaction date recorded in a distributed database are acquired, wherein the distributed database comprises a front cluster and a full cluster, and the distance between the current time and the transaction date is compared with a threshold. If the distance is greater than the threshold, the data corresponding to the transaction date is cold data and is stored in the full cluster. If the distance is less than or equal to the threshold, the data corresponding to the transaction date is hot data and is stored in the front cluster. Hot data and cold data are thereby conveniently separated, and the load on the distributed database is reduced. For the detailed implementation of each unit, refer to the description of the first embodiment, which is not repeated here.

Embodiment three:

Fig. 3 shows another structure of a mass data processing system based on a preposed distributed database according to the third embodiment of the present invention. For convenience of description, only the parts related to this embodiment are shown, which include:

a first creating unit 301, configured to create an interface of a current application program based on different tables in the distributed database and different permissions of different users;

a second creating unit 302, configured to create tables, indexes, and shard structures with consistent structures in the front cluster and the full cluster;

an obtaining unit 303, configured to obtain a current time and a transaction date recorded in a distributed database, where the distributed database includes a pre-cluster and a full-scale cluster;

a comparison unit 304, configured to compare a threshold value with a distance between the current time and the transaction date;

a first storage unit 305, configured to, if a distance between the current time and the transaction date is greater than a threshold, determine that data corresponding to the transaction date is cold data, and store the data corresponding to the transaction date in the full-size cluster;

a second storage unit 305, configured to, if a distance between the current time and the transaction date is less than or equal to a threshold, determine that data corresponding to the transaction date is hot data, and store the data corresponding to the transaction date in the pre-cluster;

a third storage unit 306, configured to store the data in the pre-cluster to the full-scale cluster at regular time by using a script program;

and a clearing unit 307, configured to clear the stale data in the pre-cluster at regular time according to the custom routing rule through a script program.

In this embodiment of the present invention, an interface of the current application program is created based on the different tables in the distributed database and the different permissions of different users, and tables, indexes, and shard structures with consistent structures are created in the front cluster and the full cluster. The current time and the transaction date recorded in the distributed database, which includes the front cluster and the full cluster, are acquired, and the distance between the current time and the transaction date is compared with a threshold. If the distance is greater than the threshold, the data corresponding to the transaction date is cold data and is stored in the full cluster. If the distance is less than or equal to the threshold, the data corresponding to the transaction date is hot data and is stored in the front cluster. The data in the front cluster is periodically stored to the full cluster through a script program, and expired data in the front cluster is periodically cleared through a script program according to a custom routing rule. Hot data and cold data are thereby separated, and the efficiency of querying the data is improved.

In this embodiment of the present invention, the processing of mass data based on the preposed distributed database may be implemented by corresponding hardware or software units. Each unit may be an independent software or hardware unit, or the units may be integrated into a single software or hardware unit, which is not limited herein.

Claims

1. A mass data processing method based on a preposed distributed database is characterized by comprising the following steps:

step one: acquiring the current time and the transaction date recorded in a distributed database, wherein the distributed database comprises a front cluster and a full cluster;

step two: comparing the distance between the current time and the transaction date with a threshold;

step three: if the distance between the current time and the transaction date is greater than the threshold, the data corresponding to the transaction date is cold data, and the data corresponding to the transaction date is stored in the full cluster;

step four: if the distance between the current time and the transaction date is less than or equal to the threshold, the data corresponding to the transaction date is hot data, and the data corresponding to the transaction date is stored in the front cluster.

2. The method for processing mass data based on the pre-distributed database according to claim 1, wherein before the obtaining the current time and the transaction date recorded in the distributed database, the steps further include:

3. The method for processing mass data based on the pre-distributed database according to claim 2, wherein before the creating of the interface of the current application program, the steps further include:

4. The method according to claim 1, wherein after the data corresponding to the transaction date is hot data and is stored in the pre-cluster if the distance between the current time and the transaction date is less than or equal to a threshold value, the method further comprises:

5. The massive data processing method based on the preposed distributed database as claimed in claim 4, wherein after the data in the preposed cluster is stored to the full-scale cluster at regular time by a script program, the steps further comprise:

6. A mass data processing system based on a pre-populated distributed database, the system comprising:

7. The system for processing mass data based on pre-distributed database according to claim 6, further comprising:

8. The system for processing mass data based on pre-distributed database according to claim 7, further comprising:

9. The system for processing mass data based on pre-distributed database according to claim 8, further comprising:

10. The system for processing mass data based on pre-distributed database according to claim 7, further comprising: