CN116910079A

CN116910079A - Method, system, device and storage medium for realizing delay association of Flink with respect to CDC data dimension table

Info

Publication number: CN116910079A
Application number: CN202310948871.5A
Authority: CN
Inventors: 战亚楠; 王刚; 王新根; 汪陈笑; 马顺华
Original assignee: Hangzhou Bangrui Technology Co ltd; Zhejiang Bangsheng Technology Co ltd
Current assignee: Hangzhou Bangrui Technology Co ltd; Zhejiang Bangsheng Technology Co ltd
Priority date: 2023-07-31
Filing date: 2023-07-31
Publication date: 2023-10-20

Abstract

The invention discloses a method, a system, a device and a storage medium for realizing delay association of Flink with respect to a CDC data dimension table. Fact table and dimension table data are synchronized from a database, and dimension table data is updated in near real time. When the dimension table is synchronized later than the fact table, the data can still be associated successfully within a certain time range. The invention also solves the data disorder problem that arises when fact table data synchronized through CDC contains several records with the same primary key, the association of an earlier record fails, and the association of a later record succeeds. The invention mainly targets business scenarios that use a database for data storage and CDC technology for data collection, and reduces the data update delay of the dimension table from the hour level to the second level; it ensures the accuracy of data association when dimension table data arrives later than fact table data within a specific time range; and when data fails to be associated within that time range, the abnormal data can be collected, which ensures data integrity.

Description

Method, system, device and storage medium for realizing delay association of Flink with respect to CDC data dimension table

Technical Field

The invention relates to the field of real-time computing, and in particular to a method, a system, a device and a storage medium for realizing delay association of Flink with respect to a CDC (change data capture) data dimension table.

Background

CDC: change data capture refers to the process of identifying and capturing changes made to data in a database (including insertion, update and deletion of data or data tables), recording these changes completely in their order of occurrence, and transmitting them in real time through message middleware to downstream processes or systems. In this way, CDC can provide efficient, low-latency data transfer to the data warehouse, so that information is converted and delivered in time to the applications that analyze it. The order matters: if the changes for the same primary key arrive out of order, the accuracy of the final data is affected.

Fact table: a fact table is used to record and organize data for analysis and reporting. It mainly contains numerical data used to answer business questions, such as sales, quantities and profits. Fact tables are typically large because they contain a large amount of numerical data.

Dimension table: a dimension table is a table in a data warehouse that stores dimension information related to a business, such as time, place and product. Dimension tables are typically small because they contain descriptive data. They are used to analyze and filter the data in fact tables.

Wide table: a fact table is typically associated with multiple dimension tables, because business questions often require analysis from different angles. Based on the dimension table primary keys recorded in the fact table, the dimension content needed for analysis is looked up and stored redundantly in each record, so that the data can be analyzed directly later and the association time is reduced.

In the conventional scheme, fact table data is mainly log-type data that is append-only and stored in a Kafka message queue, while the storage for dimension table information is chosen according to data volume: small volumes are stored in a database and large volumes in HBase. To process the data into the desired wide table, the Kafka data is consumed, the dimension table information is queried according to the associated primary key and added to the current record, and the record is passed downstream. However, the dimension table data in this scheme is generally updated by offline scheduled jobs, so the scheme is mainly suited to scenarios where dimensions are updated infrequently.

Chinese patent application No. 202211457131.3 discloses a method, equipment and medium for realizing a FlinkSQL dimension table join. The method monitors whether a middleware has new data and, if so, obtains an SQL source table from the middleware; generates interface information of an enterprise service bus (ESB) according to the SQL source table; queries and acquires dimension table data according to the ESB interface information; and processes the dimension table data through a preset operation to obtain a processing result, which is stored at a preset position to realize the join of the dimension table. That method improves the timeliness of dimension table data for general dimension table association scenarios. However, it still does not cover the scenario in which both the fact table and the dimension table data are generated in a database and synchronized through CDC. In that scenario the dimension table data may lag behind the fact table data during synchronization, so the dimension table data may not yet be ready when the fact table is associated, and there is as yet no good way to collect statistics on the data that fails to be associated.

Disclosure of Invention

In view of the defects of the prior art, the invention provides a method, a system, a device and a storage medium for realizing delay association of Flink with respect to a CDC data dimension table, which solve the following problems: slow update of dimension table data when the fact table and dimension table data are synchronized from a database; data association failure within a certain time range when the dimension table is synchronized later than the fact table; data disorder when several records with the same primary key are associated in CDC mode and the association of an earlier record fails while that of a later record succeeds; and the lack of statistics on association failures for abnormal data.

The aim of the invention is realized by the following technical scheme: a method for realizing delay association of a Flink with respect to a CDC data dimension table comprises the following steps:

(1) Generating and storing the fact table and dimension table data in a service database;

(2) The service data is synchronized to the Kafka queue: realizing streaming data synchronization through a Flink framework and a CDC technology, and respectively collecting change logs of a dimension table and a fact table into different topics of Kafka;

(3) Dimension table data synchronization: by configuring a Flink job, the dimension table data in Kafka is consumed and output into a dimension table database;

(4) Consuming real-time table data: by configuring the Flink operation, data are consumed from the Topic of the fact table and the Topic of the delay queue respectively;

(5) Associating dimension tables according to conditions: after the data of step (4) is received, dimension information is looked up in the dimension table database according to the dimension table primary key, and the association condition, processed into an association primary key, is added to the data;

(6) Judging whether the dimension table association succeeded: whether the association succeeded is judged according to the query result; if it succeeded, the data continues to subsequent processing; if it failed, a retry operation is required and the data is stored in a cache system;

(7) Whether an associated cache tag exists: after the dimension table association succeeds, the cache system is checked according to the association primary key. If the primary key is not present in the cache, the data has been associated normally and is output directly to the wide table queue. If it is present, there is data whose association failed, and further judgment is needed to keep the data order of the same primary key consistent. When the data does not carry a serial number of the cache system, the data is later in order than the data in the cache system and is output to the cache system. If the data carries a serial number of the cache system, it is judged whether that serial number is smaller than the current serial number of the cache system; if so, the data is earlier in order than the data in the cache system and is output directly to the wide table queue; if not, the data continues to be output to the cache system, which ensures that the earlier data is associated and output first;

(8) Wide table Kafka queue: the final data, obtained after the fact table data has been associated with the dimension table information, is output to a Kafka topic.

Further, in step (3), the Flink job runs continuously, and whenever dimension table data in the business table changes, the changed data is synchronized into the dimension table database of the analysis system in real time.

Further, in step (6), the data structure of the cache system consists of a primary key list and sorted queues: the information stored for each primary key includes the minimum time in that primary key's queue, and each primary key corresponds to one queue sorted by time; when the record with the minimum time is removed, the time stored with the primary key is refreshed to the new minimum value in the queue.

Further, in step (6), a scheduled program obtains expired primary keys from the cache system and obtains the expired data according to each primary key; the number of retries recorded in the data is compared with the maximum number of retries in the configuration; if the data has not reached the maximum number of retries, the next expiration time in the data is updated according to the configured retry interval and the data is output to the delay queue; if the data has reached the maximum number of retries, the data is output to a timeout queue.

In a second aspect, the invention also provides a system for realizing delay association of Flink with respect to a CDC data dimension table, which comprises a business data synchronization module, a dimension table data synchronization module, a real-time table data consumption module, a conditional dimension table association module, a dimension table association success judgment module, an associated cache tag judgment module and a wide table Kafka queue module;

the business data synchronization module is used for generating and storing the fact table and dimension table data in a business database, realizing streaming data synchronization through the Flink framework and CDC technology, and collecting the change logs of the dimension table and the fact table into different Kafka topics;

the dimension table data synchronization module is used for consuming the dimension table data in Kafka and outputting it to a dimension table database by configuring a Flink job;

the real-time table data consumption module is used for consuming data from the topic of the fact table and the topic of the delay queue respectively by configuring a Flink job;

the conditional dimension table association module is used for looking up dimension information in the dimension table database according to the dimension table primary key after receiving the data of the real-time table data consumption module, and adding the association condition, processed into an association primary key, to the data;

the dimension table association success judgment module is used for judging whether the association succeeded according to the query result; if it succeeded, the data continues to subsequent processing; if it failed, a retry operation is required and the data is stored in the cache system;

the associated cache tag judgment module is used for checking the cache system according to the association primary key after the dimension table association succeeds. If the primary key is not present in the cache, the data has been associated normally and is output directly to the wide table queue. If it is present, there is data whose association failed, and further judgment is needed to keep the data order of the same primary key consistent. When the data does not carry a serial number of the cache system, the data is later in order than the data in the cache system and is output to the cache system. If the data carries a serial number of the cache system, it is judged whether that serial number is smaller than the current serial number of the cache system; if so, the data is earlier in order than the data in the cache system and is output directly to the wide table queue; if not, the data continues to be output to the cache system, which ensures that the earlier data is associated and output first;

the wide table Kafka queue module is used for outputting the final data, obtained after the fact table data has been associated with the dimension table information, to a Kafka topic.

In a third aspect, the present invention further provides a device for implementing delay association of a Flink with respect to a CDC data dimension table, including a memory and one or more processors, where the memory stores executable codes, and when the processor executes the executable codes, the processor implements the method for implementing delay association of a Flink with respect to a CDC data dimension table.

In a fourth aspect, the present invention further provides a computer readable storage medium, where a program is stored, where the program, when executed by a processor, implements the method for implementing the delay association of a Flink with respect to a CDC data dimension table.

The invention has the following beneficial effects: it mainly targets business scenarios that use a database for data storage and CDC technology for data collection, and reduces the data update delay of the dimension table from the hour level to the second level; it ensures the accuracy of data association when dimension table data arrives later than fact table data within a specific time range; and when data fails to be associated within that time range, the abnormal data can be collected, which ensures data integrity.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method for implementing a delay association of a Flink with respect to a CDC data dimension table in accordance with the present invention;

FIG. 2 is a schematic diagram of a data structure of a cache system according to the present invention;

FIG. 3 is a block diagram of a device for implementing a delay association of a Flink with respect to a CDC data dimension table according to the present invention.

Detailed Description

The following describes the embodiments of the present invention in further detail with reference to the drawings.

As shown in FIG. 1, the implementation method of delay association of a Flink with respect to a CDC data dimension table provided by the invention comprises the following specific steps:

1. Business objective: the fact table and dimension table data are generated and stored in a business database, and the data needs to be synchronized from the business system to an analysis system, where data processing is completed. After the data has been processed and associated with the dimension table data to form a wide table, the analysis system (ClickHouse, Presto, Doris) runs SQL on the wide table data to perform point queries and business analysis.

2. The service data is synchronized to the Kafka queue: streaming data synchronization is realized through the Flink framework (a real-time stream processing engine) together with CDC technology (data changes generate an operation log, and the data is reconstructed from that log), and the change logs of the dimension table and the fact table are collected into different Kafka topics. CDC has minimal impact on the business database system while offering the highest timeliness of data synchronization. Synchronizing the data to Kafka buffers fluctuations in data volume and also allows multiple business parties to subscribe, so that one copy of the data serves multiple consumers, repeated development is avoided and system pressure is reduced. Storing the dimension table and the fact table in separate topics shortens the synchronization time of dimension table data. Because the volume of dimension table data is generally small and the volume of fact table data is large, and latency and throughput must be traded off against each other, latency can be reduced preferentially for the dimension table data while throughput can be increased for the fact table.

3. Dimension table data synchronization: by configuring a Flink job, the dimension table data in Kafka is consumed and output into a dimension table database. The Flink job runs continuously, and whenever dimension table data in the business table changes, the changed data is synchronized into the dimension table database of the analysis system in real time.

4. Consuming real-time table data: by configuring a Flink job, data is consumed from the topic of the fact table and the topic of the delay queue respectively. In the stream processing framework, the collected data is passed to downstream operators in real time, and the delay queue is a Kafka message queue that stores delayed data. Because CDC-type log data is processed, there are strict requirements on the order of data with the same primary key. For example: inserting 1 row produces 1 record marked I; updating 1 row is equivalent to deleting the original row and inserting 1 new row, so 2 records are produced, marked D and I respectively; and deleting 1 row produces 1 record marked D. Likewise, when data is output to the dimension table database, rows are inserted, deleted and modified according to these marks. If the order is disturbed, the result will therefore be inaccurate.
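The ordering requirement on I and D marks can be illustrated with a minimal Python sketch. The in-memory dictionary standing in for the dimension table database, the `apply_changes` function and the sample rows are hypothetical and are not part of the described method; the sketch only shows that replaying the same marks in a different order yields a different table.

```python
from typing import Dict, List, Tuple

# Each change event is (mark, primary_key, row).
# "I" inserts a row and "D" deletes it; an update is a "D" followed by an "I".
ChangeEvent = Tuple[str, int, dict]


def apply_changes(events: List[ChangeEvent]) -> Dict[int, dict]:
    """Replay change events, in the given order, onto an empty table."""
    table: Dict[int, dict] = {}
    for mark, key, row in events:
        if mark == "I":
            table[key] = row
        elif mark == "D":
            table.pop(key, None)
        else:
            raise ValueError(f"unknown mark: {mark}")
    return table


# Insert a row, then update it (D for the old row, I for the new row).
in_order: List[ChangeEvent] = [
    ("I", 1, {"city": "Hangzhou"}),
    ("D", 1, {"city": "Hangzhou"}),
    ("I", 1, {"city": "Ningbo"}),
]
# The same events with the D and the second I swapped.
out_of_order = [in_order[0], in_order[2], in_order[1]]

correct = apply_changes(in_order)      # the updated row survives
wrong = apply_changes(out_of_order)    # the late D removes the updated row
```

In the swapped sequence the delete arrives last and removes the updated row, which is the kind of inaccuracy that per-key ordering is meant to prevent.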

5. Associating dimension tables according to conditions: after the data of step 4 is received, dimension information is looked up in the dimension table database according to the dimension table primary key, and the association condition, processed into an association primary key, is added to the data.
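As a rough illustration of this lookup, the following Python sketch uses a dictionary as a stand-in for the dimension table database. The names `dim_db`, `associate`, `is_associated`, `join_key` and the sample fields are assumptions made for the example and do not come from the described method.

```python
from typing import Optional

# Hypothetical in-memory stand-in for the dimension table database.
dim_db = {"p-100": {"product_name": "keyboard", "category": "peripherals"}}


def associate(fact: dict, dim_key_field: str = "product_id") -> dict:
    """Look up dimension info by primary key and attach it to a copy of the record."""
    join_key = fact[dim_key_field]
    dim_row: Optional[dict] = dim_db.get(join_key)
    enriched = dict(fact)
    enriched["join_key"] = join_key  # association primary key carried with the data
    enriched["dim"] = dim_row        # None means the dimension row is not there yet
    return enriched


def is_associated(record: dict) -> bool:
    """The success judgment of the next step: did the lookup return a row?"""
    return record["dim"] is not None


hit = associate({"order_id": 1, "product_id": "p-100"})
miss = associate({"order_id": 2, "product_id": "p-200"})
```

Carrying the join key inside the record is what allows later steps to check the cache system by the same key.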

6. Judging whether the dimension table association succeeded: the program judges whether the association succeeded according to the query result; if it succeeded, the data continues to subsequent processing; if it failed, a retry operation is required and the data is stored in a cache system.

The data structure of the cache system is shown in FIG. 2. It consists of a primary key list and sorted queues: the information stored for each primary key includes the minimum time in that primary key's queue, and each primary key corresponds to one queue sorted by time. When the record with the minimum time is removed, the time stored with the primary key is refreshed to the new minimum value in the queue.
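One possible in-memory reading of this structure is sketched below in Python. The class name `RetryCache` and its methods are hypothetical, and a binary heap per primary key is an assumed substitute for the sorted queue; the minimum time of a key is read from the head of its heap rather than stored separately, so it is refreshed automatically when the head is removed.

```python
import heapq
from typing import Dict, List, Optional, Tuple


class RetryCache:
    """Primary key -> queue of pending records ordered by expiration time."""

    def __init__(self) -> None:
        # Each heap entry is (expire_time, insertion_counter, record).
        self._queues: Dict[str, List[Tuple[float, int, dict]]] = {}
        self._counter = 0  # tie-breaker so records themselves are never compared

    def put(self, key: str, expire_time: float, record: dict) -> None:
        self._counter += 1
        heapq.heappush(self._queues.setdefault(key, []),
                       (expire_time, self._counter, record))

    def contains(self, key: str) -> bool:
        return key in self._queues

    def min_time(self, key: str) -> Optional[float]:
        """Minimum expiration time for a key, or None if the key is absent."""
        queue = self._queues.get(key)
        return queue[0][0] if queue else None

    def expired_keys(self, now: float) -> List[str]:
        """Keys whose earliest record has expired at time `now`."""
        return [k for k, q in self._queues.items() if q[0][0] <= now]

    def pop_min(self, key: str) -> dict:
        """Remove and return the earliest record; drop the key when its queue is empty."""
        queue = self._queues[key]
        _, _, record = heapq.heappop(queue)
        if not queue:
            del self._queues[key]
        return record


cache = RetryCache()
cache.put("k1", 30.0, {"id": "b"})
cache.put("k1", 10.0, {"id": "a"})
cache.put("k2", 50.0, {"id": "c"})
```

A production deployment would more likely keep this state in an external cache or in Flink state; the sketch only shows the key-to-sorted-queue relationship.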

7. Scheduled task fetch: a scheduled program obtains expired primary keys from the cache system and obtains the expired data according to each primary key.

8. Whether the maximum number of retries is reached: after the expired data is obtained in step 7, the number of retries recorded in the data is compared with the maximum number of retries in the configuration. If the data has not reached the maximum number of retries, the next expiration time in the data is updated according to the configured retry interval and the data is output to the delay queue; if the data has reached the maximum number of retries, the data is output to a timeout queue.
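The retry decision in steps 7 and 8 can be sketched as follows in Python. The `Pending` record, the constants `MAX_RETRIES` and `RETRY_INTERVAL_S`, and the function `route_expired` are hypothetical names; incrementing the retry counter when a record is sent to the delay queue is an assumption, since the text only states that the counter is compared with the configured maximum.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Pending:
    key: str            # association primary key
    retries: int        # retries performed so far
    next_expire: float  # time at which the record expires next


MAX_RETRIES = 3         # assumed configuration value
RETRY_INTERVAL_S = 5.0  # assumed configuration value, in seconds


def route_expired(item: Pending, now: float) -> Tuple[str, Pending]:
    """Return the target queue ("delay" or "timeout") and the record to emit."""
    if item.retries >= MAX_RETRIES:
        # Retries exhausted: collect the record for failure analysis.
        return "timeout", item
    # Assumption: the counter is incremented when the record is re-queued.
    updated = replace(item,
                      retries=item.retries + 1,
                      next_expire=now + RETRY_INTERVAL_S)
    return "delay", updated


first_try = route_expired(Pending("k1", 0, 0.0), now=100.0)
exhausted = route_expired(Pending("k1", 3, 0.0), now=100.0)
```

With these assumed values, a record stays eligible for association for roughly `MAX_RETRIES * RETRY_INTERVAL_S` seconds before it is routed to the timeout queue.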

9. Timeout Kafka queue: each record is re-associated according to the configured number of retries and retry interval; once the maximum number of retries is reached, the failed data is collected in this queue for analysis.

10. Kafka delay queue: data that still meets the retry condition is stored in this queue, so that the data processing job can consume and process it again.

11. Whether an associated cache tag exists: after the dimension table association succeeds, the cache system is checked according to the association primary key. If the primary key is not present in the cache, the data has been associated normally and is output directly to the wide table queue; if it is present, there is data whose association failed, and step 12 is executed for further judgment in order to keep the data order of the same primary key consistent.

12. Whether the data carries a cache tag: when the data does not carry a serial number of the cache system, the data is later in order than the data in the cache system and is output to the cache system. If the data carries a serial number of the cache system, it is judged whether that serial number is smaller than the current serial number of the cache system; if so, the data is earlier in order than the data in the cache system and is output directly to the wide table queue; if not, the data continues to be output to the cache system, which ensures that the earlier data is associated and output first.
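The decisions of steps 11 and 12 reduce to a small routing function, sketched here in Python. The function name `route_after_success` and the use of `None` to model "no serial number" or "primary key not in the cache" are assumptions made for illustration; how serial numbers are assigned is not specified in the text.

```python
from typing import Optional


def route_after_success(record_seq: Optional[int],
                        cached_seq: Optional[int]) -> str:
    """Decide where a successfully associated record goes.

    record_seq: cache serial number carried by the record, or None for a
                fresh record that has never been cached.
    cached_seq: current serial number held by the cache for the same
                association primary key, or None if the key is not cached.
    """
    if cached_seq is None:
        # No failed record is pending for this key: output directly.
        return "wide"
    if record_seq is None:
        # A fresh record must wait behind the cached, earlier records.
        return "cache"
    # A retried record is output only if it precedes everything still cached.
    return "wide" if record_seq < cached_seq else "cache"


routes = [
    route_after_success(None, None),  # key not cached
    route_after_success(None, 7),     # fresh record, key cached
    route_after_success(5, 7),        # retried record earlier than the cache
    route_after_success(9, 7),        # retried record later than the cache
]
```

Holding back every record that is not earlier than the cached ones is what keeps the output order of a primary key consistent when an earlier association failed.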

13. Wide table Kafka queue: the final data, obtained after the fact table data has been associated with the dimension table information, is output to a Kafka topic.

The cache system guarantees the consistency of the data association order, and the configured number of retries and retry interval ensure that the fact table data can eventually be associated with the dimension table data even when the dimension table data arrives later than the fact table data. Collecting the timeout queue makes it possible to count the specific records whose association failed, which facilitates data management.

Corresponding to the embodiment of the implementation method of the delay association of the Flink with respect to the CDC data dimension table, the invention also provides the embodiment of the implementation device of the delay association of the Flink with respect to the CDC data dimension table.

Referring to FIG. 3, a device for implementing delay association of a Flink with respect to a CDC data dimension table according to an embodiment of the present invention includes a memory and one or more processors, where the memory stores executable code, and the processors, when executing the executable code, implement the method for implementing delay association of a Flink with respect to a CDC data dimension table in the foregoing embodiment.

The device for realizing delay association of Flink with respect to a CDC data dimension table provided by the embodiment of the invention can be applied to any apparatus with data processing capability, such as a computer. The device embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking software implementation as an example, the device in a logical sense is formed by the processor of the apparatus with data processing capability reading the corresponding computer program instructions from a nonvolatile memory into memory. In terms of hardware, FIG. 3 shows a hardware structure diagram of the apparatus with data processing capability in which the device for realizing delay association of Flink with respect to a CDC data dimension table is located; in addition to the processor, memory, network interface and nonvolatile memory shown in FIG. 3, the apparatus may further include other hardware according to its actual function, which is not described here.

The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.

For the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present invention. Those of ordinary skill in the art will understand and implement the present invention without undue burden.

The embodiment of the invention also provides a computer readable storage medium, wherein a program is stored, and when the program is executed by a processor, the method for realizing delay association of the Flink with respect to the CDC data dimension table in the embodiment is realized.

The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any apparatus with data processing capability described in the previous embodiments. The computer readable storage medium may also be an external storage device of such an apparatus, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card or a flash memory card (Flash Card) provided on the apparatus. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of the apparatus. The computer readable storage medium is used for storing the computer program and other programs and data required by the apparatus, and may also be used for temporarily storing data that has been output or is to be output.

The above-described embodiments are intended to illustrate the present invention, not to limit it, and any modifications and variations made thereto are within the spirit of the invention and the scope of the appended claims.

Claims

1. A method for realizing delay association of a Flink with respect to a CDC data dimension table is characterized by comprising the following steps:

(3) Dimension table data synchronization: consuming the data of the dimension table in Kafka by configuring the Flink operation and outputting the data into a dimension table database;

(5) Associating dimension tables according to conditions: after the step (4) receives the data, associating dimension information from a dimension table database according to the dimension table main key, and adding the association condition processed as an association main key into the data;

(6) Judging whether the dimension table association succeeded: whether the association succeeded is judged according to the query result; if it succeeded, the data continues to subsequent processing; if it failed, a retry operation is required and the data is stored in a cache system;

(7) Whether an associated cache tag exists: after the dimension table association succeeds, the cache system is checked according to the association primary key; if the primary key is not present in the cache, the data has been associated normally and is output directly to the wide table queue; if it is present, there is data whose association failed, and further judgment is needed to keep the data order of the same primary key consistent; when the data does not carry a serial number of the cache system, the data is later in order than the data in the cache system and is output to the cache system; if the data carries a serial number of the cache system, it is judged whether that serial number is smaller than the current serial number of the cache system; if so, the data is earlier in order than the data in the cache system and is output directly to the wide table queue; if not, the data continues to be output to the cache system, which ensures that the earlier data is associated and output first;

2. The method for implementing delay association of a Flink with respect to a CDC data dimension table of claim 1, wherein in step (3), the Flink operation is always running, and the changed data can be synchronized into a dimension table database of an analysis system in real time as long as there is a change in dimension table data in a service table.

3. The method for implementing delay association of a Flink with respect to a CDC data dimension table according to claim 1, wherein in step (6), the data structure of the cache system consists of a primary key list and sorted queues, wherein the information stored for each primary key includes the minimum time in that primary key's queue, and each primary key corresponds to one queue sorted by time; when the record with the minimum time is removed, the time stored with the primary key is refreshed to the new minimum value in the queue.

4. The method for implementing delay association of a Flink with respect to a CDC data dimension table according to claim 3, wherein in step (6), an expired primary key is acquired from a cache system through a timing program, and expired data is acquired according to the primary key, and the number of retries in the data is compared with the maximum number of retries in configuration, and if the data does not reach the maximum number of retries, the next expired time in the data needs to be updated according to the configured retry time interval, and the data is output to a delay queue; if it is determined that the data has reached the maximum number of retries, the data is output to a timeout queue.

5. A system for realizing delay association of Flink with respect to a CDC data dimension table, characterized by comprising a business data synchronization module, a dimension table data synchronization module, a real-time table data consumption module, a conditional dimension table association module, a dimension table association success judgment module, an associated cache tag judgment module and a wide table Kafka queue module;

the dimension table data synchronization module is used for consuming dimension table data in Kafka to be output to a dimension table database through configuration of the Flink operation;

the association dimension table success judging module is used for judging whether the association is successful or not according to the query result, if the association is successful, the data continues to execute subsequent processing, and if the association is failed, retry operation is required to be executed, and the data is stored in the cache system;

6. A device for implementing a delay association of a Flink with respect to a CDC data dimension table, comprising a memory and one or more processors, wherein the memory stores executable code, and wherein the processor implements a method for implementing a delay association of a Flink with respect to a CDC data dimension table according to any one of claims 1-4 when executing the executable code.

7. A computer readable storage medium having a program stored thereon, wherein the program, when executed by a processor, implements a method for implementing a delay association of a Flink with respect to a CDC data dimension table as claimed in any one of claims 1 to 4.