CN112800091B - Flow batch integrated calculation control system and method - Google Patents

Flow batch integrated calculation control system and method

Info

Publication number
CN112800091B
Authority
CN
China
Prior art keywords
batch
data
streaming
data source
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110105453.0A
Other languages
Chinese (zh)
Other versions
CN112800091A (en
Inventor
张玮霖
王泽东
于政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd
Priority to CN202110105453.0A
Publication of CN112800091A
Application granted
Publication of CN112800091B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24568Data stream processing; Continuous queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5022Mechanisms to release resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a flow batch integrated calculation control system and method. The system comprises a control device and a computing device. The control device converts metadata of a batch data source into dimension table metadata, continuously reads batch offline data from the batch data source, and generates a real-time dimension table based on the batch offline data each time batch offline data is read. The computing device performs streaming computation jointly on streaming real-time data of a streaming data source and the real-time dimension table to obtain a streaming computation result. Because the whole system performs all computation in the streaming system, it avoids the additional computation and operation and maintenance cost of maintaining a streaming system and a batch system side by side, as is required when stream and batch processing are separated. In addition, the real-time dimension table serves as a buffer between the batch offline data and the streaming real-time data, which narrows the access latency gap between the two and relieves the high system load caused by loading batch offline data into the memory of the streaming system.

Description

Flow batch integrated calculation control system and method
Technical Field
The application relates to the technical field of data processing, in particular to a flow batch integrated calculation control system and method.
Background
In the data age, data is an important factor in productivity. How efficiently a business operates is inseparable from how timely its supporting data is, and capturing data promptly makes the business agile, enabling efficient feedback and quick response. In the power industry, devices generate large amounts of data. These data reflect the operating state of the devices, so the device data must be computed and analyzed in real time. Devices of the same model have different parameters in different installation environments, so even when two devices of the same model at different locations emit the same signal data, that data means different things. If a single piece of data is analyzed by real-time streaming computation alone, the meaning behind the data is difficult to discover; if the device data is analyzed together with its historical data by batch computation, the computation takes time and the real-time nature of the device data is lost.
The prior art has the following two solutions:
1) Flow batch separation calculation scheme: on top of the original batch computation, a streaming computing module that executes the same computation is added. When real-time data enters the system, the streaming computing module computes a real-time result and at the same time stores the data in an offline data warehouse. After the periodically triggered batch computation produces its result, the batch result overwrites the real-time result of the streaming computation. In this scheme the same data must be computed twice, which consumes more system resources; in addition, a second system with the same computation logic, the streaming system, must be maintained, which incurs extra operation and maintenance cost.
2) Stream-replaces-batch calculation scheme: batch computation is regarded as a special streaming computation whose input is bounded, and the original batch computation process is replaced entirely by streaming computation. When offline data needs to be processed, it is read into the streaming system as a bounded data stream, and a new streaming computation task is started to complete the original batch computation. Although this scheme needs no extra maintenance work, replacing batch computation with streaming computation means the batch data must be loaded into the system as a stream, which creates a heavy dependence on the caching capacity of the middleware. Moreover, to guarantee correctness, a large amount of data must be loaded into the system at the same time, which increases system load and may produce incorrect results if data is lost.
Disclosure of Invention
Therefore, the present application aims to provide a flow batch integrated calculation control system and method. Because the whole system performs all computation in the streaming system, it avoids the additional computation and operation and maintenance cost of maintaining a streaming system and a batch system side by side in a stream-batch separation scenario. In addition, the real-time dimension table serves as a buffer between the batch offline data and the streaming real-time data, which narrows the access latency gap between the two and relieves the high system load caused by loading batch offline data into the memory of the streaming system.
In a first aspect, an embodiment of the present application provides a flow batch integrated calculation control system, including:
a control device, configured to convert metadata of a batch data source into dimension table metadata, continuously read batch offline data from the batch data source, and generate a real-time dimension table based on the batch offline data each time batch offline data is read from the batch data source; and
a computing device, configured to perform streaming computation jointly on streaming real-time data of a streaming data source and the real-time dimension table to obtain a streaming computation result.
In one possible embodiment, the control device includes:
a metadata import module, configured to acquire metadata from the batch data source and the streaming data source respectively, and to import the metadata of both into the metadata catalog module;
a metadata catalog module, configured to store and retrieve the metadata of the streaming data source and the batch data source, and to convert the metadata of the batch data source into dimension table metadata; and
a dimension table synchronization module, configured to continuously read batch offline data from the batch data source and, each time batch offline data is read, generate a real-time dimension table based on the batch offline data, the dimension table metadata, and a dimension table synchronization strategy configured by a user; the dimension table synchronization strategy controls how often data is synchronized, so as to balance the timeliness of the data in the real-time dimension table against the system consumption incurred by synchronization.
In one possible embodiment, the control device further includes:
a metadata management module, configured to trigger the metadata import module to start running, and to generate the dimension table synchronization strategy based on a configuration operation of the user.
In one possible embodiment, the control device further includes:
an SQL statement parsing module, configured to convert standard SQL statements configured by a user into an abstract semantic tree; and
an execution plan generation module, configured to generate an execution plan based on the abstract semantic tree, the metadata of the batch data source, and the metadata of the streaming data source; the execution plan comprises a topologically sorted directed acyclic computation flow graph, and each vertex in the graph corresponds to one streaming computing thread.
In one possible implementation, the computing device includes:
a connection information temporary storage module, configured to temporarily store the data source connection information of the streaming real-time data of the streaming data source and of the real-time dimension table, and to release the memory space occupied by the data source connection information after the streaming calculation module finishes computing; and
a streaming calculation module, configured to read the streaming real-time data of the streaming data source and the real-time dimension table based on the data source connection information, and to perform streaming computation jointly on them based on the execution plan to obtain a streaming computation result.
In a possible implementation manner, the connection information temporary storage module is further configured to temporarily store data source connection information of batch offline data of the batch data source, and release a memory space occupied by the data source connection information after the calculation of the streaming calculation module is completed;
The streaming calculation module is further configured to read the batch offline data of the batch data source based on the data source connection information, and to perform streaming computation on the batch offline data based on the execution plan to obtain a streaming computation result.
In a possible implementation manner, the connection information temporary storage module is further configured to temporarily store data source connection information of streaming real-time data of the streaming data source, and release a memory space occupied by the data source connection information after the streaming calculation module finishes calculating;
The streaming calculation module is further used for reading streaming real-time data of the streaming data source based on the data source connection information, and carrying out streaming calculation on the streaming real-time data of the streaming data source based on the execution plan to obtain streaming calculation results.
In a second aspect, an embodiment of the present application further provides a method for controlling flow batch integrated calculation, including:
converting metadata of a batch data source into dimension table metadata;
continuously reading batch offline data from the batch data source, and generating a real-time dimension table based on the batch offline data each time batch offline data is read from the batch data source; and
performing streaming computation jointly on streaming real-time data of a streaming data source and the real-time dimension table to obtain a streaming computation result.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory in communication via the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps in the second aspect described above.
In a fourth aspect, embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the second aspect described above.
The embodiment of the application provides a flow batch integrated calculation control system comprising a control device and a computing device. The control device converts metadata of the batch data source into dimension table metadata, continuously reads batch offline data from the batch data source, and generates a real-time dimension table based on the batch offline data each time batch offline data is read from the batch data source; the computing device performs streaming computation jointly on streaming real-time data of the streaming data source and the real-time dimension table to obtain a streaming computation result. The system uses the real-time dimension table to connect the streaming real-time data with the batch offline data: the real-time dimension table brings the offline data online, so that the streaming system can access the batch offline data quickly. With computation driven jointly by batch offline data and streaming real-time data, bounded batch data and unbounded streaming data are managed, computed, and stored in a unified way; supplementing the stream with dimension data makes it possible to mine and increase the value of high-density data, strengthens real-time feedback, reduces resource consumption, and improves organizational coordination. On the one hand, the whole system performs all computation in the streaming system, which avoids the extra computation and operation and maintenance cost of maintaining a streaming system and a batch system side by side in a stream-batch separation scenario.
On the other hand, the real-time dimension table serves as a buffer between the batch offline data and the streaming real-time data, which narrows the access latency gap between the two and relieves the high system load caused by loading batch offline data into the memory of the streaming system.
In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a flow batch integrated computing control system according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for controlling flow batch integrated computation according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present application.
In the power industry, devices generate large amounts of data. These data reflect the operating state of the devices, so the device data must be computed and analyzed in real time. Devices of the same model have different parameters in different installation environments, so even when two devices of the same model at different locations emit the same signal data, that data means different things. If a single piece of data is analyzed by real-time streaming computation alone, the meaning behind the data is difficult to discover; if the device data is analyzed together with its historical data by batch computation, the computation takes time and the real-time nature of the device data is lost. The prior art has the following two solutions:
1) Flow batch separation calculation scheme: on top of the original batch computation, a streaming computing module that executes the same computation is added. When real-time data enters the system, the streaming computing module computes a real-time result and at the same time stores the data in an offline data warehouse. After the periodically triggered batch computation produces its result, the batch result overwrites the real-time result of the streaming computation. In this scheme the same data must be computed twice, which consumes more system resources; in addition, a second system with the same computation logic, the streaming system, must be maintained, which incurs extra operation and maintenance cost.
2) Stream-replaces-batch calculation scheme: batch computation is regarded as a special streaming computation whose input is bounded, and the original batch computation process is replaced entirely by streaming computation. When offline data needs to be processed, it is read into the streaming system as a bounded data stream, and a new streaming computation task is started to complete the original batch computation. Although this scheme needs no extra maintenance work, replacing batch computation with streaming computation means the batch data must be loaded into the system as a stream, which creates a heavy dependence on the caching capacity of the middleware. Moreover, to guarantee correctness, a large amount of data must be loaded into the system at the same time, which increases system load and may produce incorrect results if data is lost.
Based on this, the embodiments of the present application provide a flow batch integrated calculation control system and method, which are described in the following embodiments.
To facilitate understanding of the present embodiment, the flow batch integrated calculation control system disclosed in the embodiments of the present application is first described in detail.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a flow batch integrated computing control system according to an embodiment of the application. As shown in fig. 1, the system may include:
Control means 10 for converting metadata of a batch data source into dimension table metadata, continuously reading batch offline data from the batch data source, and generating a real-time dimension table based on the batch offline data each time batch offline data is read from the batch data source;
The computing device 20 is configured to perform streaming computation on streaming real-time data of the streaming data source and the real-time dimension table together, so as to obtain a streaming computation result.
The control device 10 and the calculation device 20 are described in detail below.
As shown in fig. 1, the control device 10 may include the following modules:
The metadata import module 101 is configured to obtain metadata from a batch data source and a streaming data source, and to import that metadata into the metadata catalog module. Metadata of a batch data source is data that describes batch offline data; it generally includes the table structure of the batch data tables and configuration information of the batch data source, such as connection and sharding settings. Metadata of a streaming data source is data that describes streaming real-time data; it generally includes the table structure of the streaming data tables and configuration information of the streaming data source, such as connection and sharding settings.
The metadata catalog module 102 is configured to store and retrieve metadata (data field information and data source connection information) of a streaming data source and a batch data source, and convert the metadata of the batch data source into dimension table metadata. The dimension table is a copy of a subset of the batch data table for fusion with the streaming real-time data to obtain a list of the incoming data. Dimension table metadata refers to data obtained by converting a batch data table structure.
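The conversion from batch table metadata to dimension table metadata can be illustrated with a minimal Python sketch. The patent does not specify the data structures, so the class names, field names, and the example table below are all assumptions made for illustration; the sketch only shows the idea that dimension table metadata describes a subset of the batch table's columns together with a join key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchTableMetadata:
    """Hypothetical description of a batch data table (names are assumptions)."""
    table: str
    columns: dict      # column name -> type name
    connection: str    # connection string of the batch data source


@dataclass(frozen=True)
class DimTableMetadata:
    """Hypothetical dimension table metadata derived from the batch table."""
    source_table: str
    key_column: str
    value_columns: tuple


def to_dim_metadata(meta, key_column, wanted):
    """Keep only the join key and the requested columns (a subset of the table)."""
    if key_column not in meta.columns:
        raise KeyError(f"unknown key column: {key_column}")
    missing = [c for c in wanted if c not in meta.columns]
    if missing:
        raise KeyError(f"unknown columns: {missing}")
    return DimTableMetadata(meta.table, key_column, tuple(wanted))


# Illustrative batch table; the table name and connection string are made up.
batch_meta = BatchTableMetadata(
    table="device_install",
    columns={"device_id": "string", "site": "string",
             "rated_kv": "double", "notes": "string"},
    connection="jdbc:hypothetical://warehouse/device_install",
)
dim_meta = to_dim_metadata(batch_meta, "device_id", ["site", "rated_kv"])
```

Here the `notes` column is left out of the dimension table, which reflects the statement that the dimension table is a copy of a subset of the batch data table.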
The dimension table synchronization module 103 is configured to continuously read batch offline data from a batch data source and, each time batch offline data is read, generate a real-time dimension table based on the batch offline data, the dimension table metadata, and a dimension table synchronization strategy configured by a user. The real-time dimension table is an in-memory key-value database. Because the batch offline data entries in a batch data source keep growing, the real-time dimension table must continuously synchronize, or incrementally read, new data from the batch data source. A synchronization strategy is therefore needed to control how often data is synchronized, balancing the timeliness of the data in the real-time dimension table against the system consumption incurred by synchronization.
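The trade-off that the synchronization strategy controls can be sketched in Python as follows. This is only an illustration under stated assumptions: the patent does not define the form of the strategy, so `SyncPolicy` (a minimum interval between syncs), `RealTimeDimTable`, and the append-only list standing in for the batch data source are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SyncPolicy:
    """Hypothetical policy: the minimum number of seconds between two syncs."""
    min_interval_s: float


class RealTimeDimTable:
    """In-memory key-value table that is refreshed incrementally."""

    def __init__(self, key_column, policy):
        self.key_column = key_column
        self.policy = policy
        self.rows = {}          # key -> row dict
        self.last_offset = 0    # how many batch rows have been consumed
        self.last_sync = None   # time of the last sync, None before the first

    def maybe_sync(self, batch_rows, now):
        """Read rows appended since the last sync, if the policy allows it.

        Returns the number of rows applied (0 when the sync is skipped).
        """
        too_soon = (self.last_sync is not None
                    and now - self.last_sync < self.policy.min_interval_s)
        if too_soon:
            return 0
        new_rows = batch_rows[self.last_offset:]
        for row in new_rows:
            self.rows[row[self.key_column]] = row
        self.last_offset = len(batch_rows)
        self.last_sync = now
        return len(new_rows)


batch_source = [{"device_id": "d1", "site": "north"}]
dim = RealTimeDimTable("device_id", SyncPolicy(min_interval_s=60))
first = dim.maybe_sync(batch_source, now=0)      # initial load
batch_source.append({"device_id": "d2", "site": "south"})
skipped = dim.maybe_sync(batch_source, now=30)   # too soon, the policy skips it
second = dim.maybe_sync(batch_source, now=60)    # interval elapsed, d2 is picked up
```

A shorter interval keeps the dimension table fresher at the cost of more frequent reads from the batch data source; a longer interval does the opposite.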
It should be noted that when the real-time requirement on the data is not too high, or when the batch offline data is very large, the real-time dimension table may instead use an RDBMS or a distributed data storage system. The access latency of an RDBMS or a distributed data storage system is higher than that of an in-memory key-value database, but such systems can store larger amounts of data and provide better stability. In those cases, the RDBMS or distributed data storage system can therefore replace the in-memory key-value database as the real-time dimension table.
In one possible embodiment, the system may further comprise:
The metadata management module 104 is configured to trigger the metadata import module 101 to start running, and to generate the dimension table synchronization strategy based on a configuration operation of a user. The user may configure the parameters of the modules in the control device 10 through the metadata management module 104. Specifically, through the metadata management module 104 the user may schedule the import of metadata from the batch data source and the streaming data source, or trigger it manually. The user may also perform configuration operations on the metadata management module 104 through a front-end page, and the metadata management module 104 generates the dimension table synchronization strategy based on those operations.
In one possible embodiment, the system may further comprise:
The SQL statement parsing module 105 is configured to convert standard SQL statements configured by a user into an abstract semantic tree (Abstract Semantic Tree, AST).
An execution plan generation module 106, configured to generate an execution plan based on the abstract semantic tree, the metadata of the batch data source, and the metadata of the streaming data source. The execution plan comprises a topologically sorted directed acyclic computation flow graph, which corresponds to a topological connection and allocation scheme for a group of streaming computing resources. A streaming computing resource is a streaming computing thread located in a distributed host cluster, and corresponds to a vertex in the directed acyclic computation flow graph. Data is transmitted between streaming computing resources over TCP, and each such transmission corresponds to a directed edge in the graph. A single host may hold multiple computing resources.
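The notion of a topologically sorted directed acyclic computation flow graph can be illustrated with Python's standard `graphlib` module. The operator names below are hypothetical and are not taken from the patent; the sketch only shows how an ordering is derived in which every vertex appears after the vertices whose output it consumes.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical operators for a query that joins a stream with a dimension table.
# Each key is a vertex (one streaming computing thread); each value is the set
# of vertices whose output it consumes (the directed edges).
plan_graph = {
    "read_stream": set(),
    "read_dim_table": set(),
    "join": {"read_stream", "read_dim_table"},
    "filter": {"join"},
    "sink": {"filter"},
}

# static_order() yields the vertices so that producers come before consumers.
execution_order = list(TopologicalSorter(plan_graph).static_order())
position = {name: i for i, name in enumerate(execution_order)}
```

If the graph contained a cycle, `TopologicalSorter` would raise `graphlib.CycleError`, which matches the requirement that the computation flow graph be acyclic.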
As shown in fig. 1, the computing device 20 may include the following modules:
The connection information temporary storage module 201 is configured to temporarily store the data source connection information of the streaming real-time data of the streaming data source and of the real-time dimension table (which may reside on different hosts), and to release the memory space occupied by the data source connection information after the streaming calculation module 202 finishes computing. The data source connection information is represented as an in-memory temporary data view. Similar to the concept of a view in a relational database management system (Relational Database Management System, RDBMS), the in-memory temporary data view is a virtual table that temporarily holds the data source connection information; it stores no data and automatically releases the memory it occupies once the computation task ends.
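The "hold connection information only for the duration of a computation" behaviour can be sketched with a Python context manager. The class name, method name, and connection strings are assumptions for illustration; the patent does not describe how the temporary view is implemented.

```python
from contextlib import contextmanager


class ConnectionViewRegistry:
    """Holds temporary views: names mapped to connection info, never to data."""

    def __init__(self):
        self._views = {}

    @contextmanager
    def temporary_views(self, **connections):
        """Register views for one computation and drop them when it ends."""
        self._views.update(connections)
        try:
            yield self._views
        finally:
            # Release the entries even if the computation raised an error.
            for name in connections:
                self._views.pop(name, None)

    def __len__(self):
        return len(self._views)


registry = ConnectionViewRegistry()
with registry.temporary_views(
    stream="tcp://hypothetical-host-a:9092/device_signals",
    dim_table="tcp://hypothetical-host-b:6379/device_install",
) as views:
    views_during = sorted(views)   # both views are visible to the computation
views_after = len(registry)        # nothing is retained once the task has ended
```

Using `try`/`finally` means the connection information is released whether the computation finishes normally or fails.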
The streaming calculation module 202 is configured to read streaming real-time data of the streaming data source and the real-time dimension table based on the data source connection information, and perform streaming calculation on the streaming real-time data of the streaming data source and the real-time dimension table based on the execution plan together, so as to obtain a streaming calculation result.
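Joint streaming computation over streaming records and the real-time dimension table can be pictured as a lookup join, sketched below in Python. The field names and sample values are assumptions, and passing unmatched records through unchanged is one possible design choice rather than something the patent prescribes.

```python
def stream_join(stream, dim_table, key):
    """Enrich each streaming record with the matching dimension row, one at a time."""
    for record in stream:
        dim_row = dim_table.get(record[key])
        if dim_row is None:
            # No dimension data for this key: pass the record through unchanged.
            yield dict(record)
        else:
            # Fields of the streaming record win if both sides share a name.
            yield {**dim_row, **record}


dim_table = {
    "d1": {"device_id": "d1", "site": "north", "rated_kv": 110.0},
    "d2": {"device_id": "d2", "site": "south", "rated_kv": 35.0},
}
signals = iter([
    {"device_id": "d1", "signal": 7},
    {"device_id": "d2", "signal": 7},
    {"device_id": "d9", "signal": 3},
])
results = list(stream_join(signals, dim_table, "device_id"))
```

The first two records carry the same signal value but come out with different installation context, which is the situation the background section describes for devices of the same model at different locations.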
In a possible implementation manner, the connection information temporary storage module 201 is further configured to temporarily store data source connection information of batch offline data of the batch data source, and release a memory space occupied by the data source connection information after the streaming calculation module 202 finishes calculating;
The streaming calculation module 202 is further configured to read batch offline data of the batch data source based on the data source connection information, and perform streaming calculation on the batch offline data of the batch data source based on the execution plan, so as to obtain a streaming calculation result.
In another possible implementation manner, the connection information temporary storage module 201 is further configured to temporarily store the data source connection information of the streaming real-time data of the streaming data source, and release the memory space occupied by the data source connection information after the streaming calculation module 202 finishes calculating;
The streaming calculation module 202 is further configured to read streaming real-time data of the streaming data source based on the data source connection information, and perform streaming calculation on the streaming real-time data of the streaming data source based on the execution plan, so as to obtain a streaming calculation result.
The flow batch integrated calculation control system provided by this embodiment can not only perform flow batch integrated computation on streaming real-time data and batch offline data by means of the real-time dimension table, but can also perform streaming computation on the batch offline data of a batch data source alone, or on the streaming real-time data of a streaming data source alone, and is therefore compatible with existing systems.
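That one streaming computation can serve both a bounded batch input and an unbounded streaming input can be shown with Python generators. The running-maximum computation and the synthetic sources are assumptions chosen only to keep the sketch small.

```python
import itertools


def running_max(records, field):
    """Streaming computation: emit the largest value seen so far, per record."""
    best = None
    for record in records:
        value = record[field]
        best = value if best is None else max(best, value)
        yield best


def unbounded_source():
    """Hypothetical endless stream of readings."""
    for i in itertools.count():
        yield {"reading": i % 4}


# Bounded input: batch offline data is consumed as a finite stream.
batch_rows = [{"reading": 3}, {"reading": 1}, {"reading": 5}]
batch_result = list(running_max(batch_rows, "reading"))

# Unbounded input: the same function runs forever; only five outputs are taken here.
stream_result = list(itertools.islice(running_max(unbounded_source(), "reading"), 5))
```

The computation itself does not need to know whether its input ever ends, which is why the same module can be reused for either kind of data source.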
The flow batch integrated calculation control system provided by this embodiment comprises a control device and a computing device. The control device converts metadata of a batch data source into dimension table metadata, continuously reads batch offline data from the batch data source, and generates a real-time dimension table based on the batch offline data each time batch offline data is read from the batch data source; the computing device then performs streaming computation jointly on streaming real-time data of the streaming data source and the real-time dimension table to obtain a streaming computation result. The system uses the real-time dimension table to connect the streaming real-time data with the batch offline data: the real-time dimension table brings the offline data online, so that the streaming system can access the batch offline data quickly. On the one hand, the whole system performs all computation in the streaming system, which avoids the extra computation and operation and maintenance cost of maintaining a streaming system and a batch system side by side in a stream-batch separation scenario. On the other hand, the real-time dimension table serves as a buffer between the batch offline data and the streaming real-time data, which narrows the access latency gap between the two and relieves the high system load caused by loading batch offline data into the memory of the streaming system.
Based on the same technical concept, the embodiment of the application also provides a flow batch integrated calculation control method, electronic equipment, a computer storage medium and the like, and particularly can be seen in the following embodiments.
Referring to fig. 2, fig. 2 is a flowchart of a method for controlling flow batch integrated calculation according to an embodiment of the present application. As shown in fig. 2, the method may include the steps of:
S210, converting metadata of a batch data source into dimension table metadata;
S220, continuously reading batch offline data from the batch data source;
S230, determining whether new batch offline data has been read from the batch data source; if so, proceeding to step S240, and if not, returning to step S220;
S240, generating a real-time dimension table based on the batch offline data;
S250, jointly performing streaming calculation on streaming real-time data of the streaming data source and the real-time dimension table to obtain a streaming calculation result.
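The steps above can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `BatchSource`, `to_dim_metadata`, `build_dim_table`, and `run` are hypothetical, the batch source is simulated by an in-memory list of snapshots, and the batch source is polled once per streaming record purely to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class BatchSource:
    """Hypothetical batch source: a queue of offline snapshots, one consumed per poll."""
    pending: list = field(default_factory=list)

    def read_new(self):
        # S220/S230: return the next unread batch, or None if nothing new arrived.
        return self.pending.pop(0) if self.pending else None


def to_dim_metadata(batch_metadata):
    # S210: reuse the batch source's schema as dimension table metadata.
    return {"key": batch_metadata["key"],
            "columns": batch_metadata["columns"],
            "role": "dimension"}


def build_dim_table(rows, dim_metadata):
    # S240: index the batch rows by key so the streaming side can look them up.
    return {row[dim_metadata["key"]]: row for row in rows}


def run(batch_source, batch_metadata, stream_events):
    dim_metadata = to_dim_metadata(batch_metadata)                # S210
    dim_table = {}
    results = []
    for event in stream_events:
        new_rows = batch_source.read_new()                        # S220
        if new_rows is not None:                                  # S230
            dim_table = build_dim_table(new_rows, dim_metadata)   # S240
        # S250: join each streaming record with the current dimension table.
        dim_row = dim_table.get(event[dim_metadata["key"]], {})
        results.append({**dim_row, **event})
    return results
```

In this sketch a streaming record whose key is missing from the dimension table passes through unchanged; a real system would choose between inner-join and outer-join behavior explicitly.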
In one possible implementation, step S210 further includes: obtaining metadata from the batch data source and the streaming data source, respectively.
In one possible implementation, step S240 specifically includes: generating the real-time dimension table based on the batch offline data, the dimension table metadata, and a dimension table synchronization strategy configured by the user; the dimension table synchronization strategy is used for controlling the frequency of data synchronization, so as to balance the timeliness of the data in the real-time dimension table against the system consumption generated in the process of synchronizing the data.
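One way to picture such a synchronization strategy is a minimum interval between refreshes. The Python sketch below is an assumption made for illustration only: the source does not specify the form of the strategy, and `SyncPolicy`, `min_interval_s`, and `sync_dim_table` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyncPolicy:
    """Hypothetical user-configured policy: at most one refresh per min_interval_s seconds."""
    min_interval_s: float
    last_sync: Optional[float] = None

    def should_sync(self, now: float) -> bool:
        # Allow the first sync, then only syncs that are far enough apart.
        if self.last_sync is None or now - self.last_sync >= self.min_interval_s:
            self.last_sync = now
            return True
        return False


def sync_dim_table(dim_table, new_rows, key, policy, now):
    """Apply new batch rows to the dimension table only when the policy allows it."""
    if not policy.should_sync(now):
        # Skipped: the table stays slightly stale, but no synchronization cost is paid.
        return False
    dim_table.clear()
    dim_table.update({row[key]: row for row in new_rows})
    return True
```

A larger `min_interval_s` lowers the synchronization cost at the price of staler dimension data, and a smaller value does the opposite, which is the trade-off the strategy is meant to balance.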
In one possible implementation, after step S240 and before step S250, the method further includes: converting a standard SQL statement configured by the user into an abstract semantic tree; and generating an execution plan based on the abstract semantic tree, the metadata of the batch data source, and the metadata of the streaming data source; the execution plan comprises a topologically sorted directed acyclic computing flow graph, and each vertex in the directed acyclic computing flow graph corresponds to one streaming computing thread.
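The topological ordering of the computing flow graph can be sketched with Python's standard `graphlib` module (Python 3.9 or later). The operator graph below is a hand-written, hypothetical example for a query that joins a stream with a dimension table; SQL parsing and the mapping of vertices to threads are not shown.

```python
from graphlib import TopologicalSorter

# Hypothetical operator graph. Each key is a vertex of the computing flow graph;
# its value is the set of vertices whose output it consumes.
plan_graph = {
    "scan_stream": set(),
    "scan_dim_table": set(),
    "join": {"scan_stream", "scan_dim_table"},
    "aggregate": {"join"},
    "sink": {"aggregate"},
}


def topological_plan(graph):
    """Return the vertices in an order where every vertex follows all of its inputs."""
    return list(TopologicalSorter(graph).static_order())


order = topological_plan(plan_graph)
```

Starting one worker per vertex in this order would guarantee that every operator's inputs are already running when the operator itself starts. If the graph contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is consistent with the requirement that the graph be acyclic.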
In one possible implementation, step S250 specifically includes: temporarily storing data source connection information of the streaming real-time data of the streaming data source and of the real-time dimension table, and releasing the memory space occupied by the data source connection information after the streaming calculation module finishes calculation; and reading the streaming real-time data of the streaming data source and the real-time dimension table based on the data source connection information, and jointly performing streaming calculation on the streaming real-time data of the streaming data source and the real-time dimension table based on the execution plan to obtain a streaming calculation result.
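The store-then-release behavior of the connection information can be sketched with a Python context manager. This is an illustrative assumption rather than the disclosed design: `ConnectionInfoStore`, `hold`, and `stream_join` are hypothetical names, and `read` stands for any caller-supplied function that turns connection information into rows.

```python
from contextlib import contextmanager


class ConnectionInfoStore:
    """Hypothetical in-memory holder for data source connection information."""

    def __init__(self):
        self._info = {}

    @contextmanager
    def hold(self, **sources):
        # Temporarily store connection info for the duration of one computation.
        self._info.update(sources)
        try:
            yield self._info
        finally:
            # Release the stored entries once the computation has finished,
            # whether it succeeded or raised.
            for name in sources:
                self._info.pop(name, None)

    def __len__(self):
        return len(self._info)


def stream_join(store, stream_conn, dim_conn, read, key):
    """Read both sources through the stored connection info and join them on `key`."""
    with store.hold(stream=stream_conn, dim_table=dim_conn) as info:
        dim_table = {row[key]: row for row in read(info["dim_table"])}
        return [{**dim_table.get(e[key], {}), **e} for e in read(info["stream"])]
```

Using `try`/`finally` inside the context manager means the connection information is dropped even when the calculation fails, so stale entries do not accumulate in memory.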
In one possible implementation, step S250 further includes: temporarily storing the data source connection information of the batch offline data of the batch data source, and releasing the memory space occupied by the data source connection information after the streaming calculation module finishes calculation; and reading the batch offline data of the batch data source based on the data source connection information, and performing streaming calculation on the batch offline data of the batch data source based on the execution plan to obtain a streaming calculation result.
In one possible implementation, step S250 further includes: temporarily storing the data source connection information of the streaming real-time data of the streaming data source, and releasing the memory space occupied by the data source connection information after the streaming calculation module finishes calculation; and reading the streaming real-time data of the streaming data source based on the data source connection information, and performing streaming calculation on the streaming real-time data of the streaming data source based on the execution plan to obtain a streaming calculation result.
In the flow batch integrated calculation control method provided by this embodiment, metadata of a batch data source is first converted into dimension table metadata, batch offline data is continuously read from the batch data source, and a real-time dimension table is generated based on the batch offline data each time batch offline data is read from the batch data source; streaming calculation is then performed jointly on the streaming real-time data of the streaming data source and the real-time dimension table to obtain a streaming calculation result. The method uses the real-time dimension table to connect the streaming real-time data with the batch offline data and to bring the offline data online, so that the streaming system can access the batch offline data quickly. On the one hand, the whole method uses only the streaming system for calculation, which avoids the extra computing consumption and operation and maintenance cost of maintaining a streaming system and a batch system at the same time in a stream-batch separation scenario. On the other hand, the real-time dimension table serves as a buffer between the batch offline data and the streaming real-time data, which narrows the access latency gap between the two and relieves the high system load caused by loading the batch offline data into the memory of the streaming system.
An embodiment of the application discloses an electronic device which, as shown in fig. 3, comprises: a processor 301, a memory 302 and a bus 303, said memory 302 storing machine readable instructions executable by said processor 301, said processor 301 and said memory 302 communicating via the bus 303 when the electronic device is running. The machine readable instructions, when executed by the processor 301, perform the method described in the foregoing method embodiments; for the specific implementation, refer to the method embodiments, which are not repeated here.
The embodiment of the application provides a computer program product of a flow batch integrated calculation control method, which comprises a computer readable storage medium storing non-volatile program codes executable by a processor, wherein the program codes comprise instructions for executing the method described in the previous method embodiment, and specific implementation can be seen in the method embodiment and will not be repeated herein.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The above-described apparatus embodiments are merely illustrative. For example, the division of the units is merely a division by logical function, and other divisions are possible in actual implementation; as a further example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interface, device, or unit, and may be electrical, mechanical, or in another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a processor-executable non-volatile computer readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
Finally, it should be noted that the above examples are only specific embodiments of the present application and are not intended to limit its scope, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing examples, those skilled in the art should understand that anyone familiar with the technical field may, within the technical scope disclosed by the present application, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes, or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. A flow batch integrated computing control system, comprising:
a control device, configured to convert metadata of a batch data source into dimension table metadata, continuously read batch offline data from the batch data source, and generate a real-time dimension table based on the batch offline data each time batch offline data is read from the batch data source;
a computing device, configured to jointly perform streaming calculation on streaming real-time data of a streaming data source and the real-time dimension table to obtain a streaming calculation result;
wherein the control device includes:
a metadata import module, configured to obtain metadata from the batch data source and the streaming data source respectively, and to import the metadata of the batch data source and the streaming data source into a metadata catalog module;
the metadata catalog module, configured to store and retrieve the metadata of the streaming data source and the batch data source, and to convert the metadata of the batch data source into the dimension table metadata;
a dimension table synchronization module, configured to continuously read batch offline data from the batch data source, and to generate the real-time dimension table based on the batch offline data, the dimension table metadata, and a user-configured dimension table synchronization strategy each time batch offline data is read from the batch data source; wherein the dimension table synchronization strategy is used for controlling the frequency of data synchronization, so as to balance the timeliness of the data in the real-time dimension table against the system consumption generated in the process of synchronizing the data.
2. The system of claim 1, wherein the control device further comprises:
a metadata management module, configured to trigger the metadata import module to start operating and to generate the dimension table synchronization strategy based on a configuration operation of the user.
3. The system of claim 1, wherein the control device further comprises:
an SQL statement analysis module, configured to convert a standard SQL statement configured by the user into an abstract semantic tree;
an execution plan generation module, configured to generate an execution plan based on the abstract semantic tree, the metadata of the batch data source, and the metadata of the streaming data source; wherein the execution plan comprises a topologically sorted directed acyclic computing flow graph, and each vertex in the directed acyclic computing flow graph corresponds to one streaming computing thread.
4. The system of claim 3, wherein the computing device comprises:
a connection information temporary storage module, configured to temporarily store data source connection information of the streaming real-time data of the streaming data source and of the real-time dimension table, and to release the memory space occupied by the data source connection information after a streaming calculation module finishes calculation;
the streaming calculation module, configured to read the streaming real-time data of the streaming data source and the real-time dimension table based on the data source connection information, and to jointly perform streaming calculation on the streaming real-time data of the streaming data source and the real-time dimension table based on the execution plan to obtain a streaming calculation result.
5. The system of claim 4, wherein:
the connection information temporary storage module is further configured to temporarily store the data source connection information of the batch offline data of the batch data source, and to release the memory space occupied by the data source connection information after the streaming calculation module finishes calculation;
the streaming calculation module is further configured to read the batch offline data of the batch data source based on the data source connection information, and to perform streaming calculation on the batch offline data of the batch data source based on the execution plan to obtain a streaming calculation result.
6. The system of claim 5, wherein:
the connection information temporary storage module is further configured to temporarily store the data source connection information of the streaming real-time data of the streaming data source, and to release the memory space occupied by the data source connection information after the streaming calculation module finishes calculation;
the streaming calculation module is further configured to read the streaming real-time data of the streaming data source based on the data source connection information, and to perform streaming calculation on the streaming real-time data of the streaming data source based on the execution plan to obtain a streaming calculation result.
7. A flow batch integrated calculation control method, comprising the following steps:
obtaining metadata from a batch data source and a streaming data source, respectively;
converting the metadata of the batch data source into dimension table metadata;
continuously reading batch offline data from the batch data source, and generating a real-time dimension table based on the batch offline data each time batch offline data is read from the batch data source; wherein the continuously reading batch offline data from the batch data source and generating a real-time dimension table based on the batch offline data each time batch offline data is read from the batch data source comprises: continuously reading batch offline data from the batch data source, and generating the real-time dimension table based on the batch offline data, the dimension table metadata, and a user-configured dimension table synchronization strategy each time batch offline data is read from the batch data source; the dimension table synchronization strategy is used for controlling the frequency of data synchronization, so as to balance the timeliness of the data in the real-time dimension table against the system consumption generated in the process of synchronizing the data;
and jointly performing streaming calculation on streaming real-time data of the streaming data source and the real-time dimension table to obtain a streaming calculation result.
8. An electronic device, comprising: a processor, a storage medium storing machine-readable instructions executable by the processor, the processor in communication with the storage medium via a bus when the electronic device is running, the processor executing the machine-readable instructions to perform the steps of the method of claim 7.
9. A computer-readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, performs the steps of the method according to claim 7.
CN202110105453.0A 2021-01-26 2021-01-26 Flow batch integrated calculation control system and method Active CN112800091B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110105453.0A CN112800091B (en) 2021-01-26 2021-01-26 Flow batch integrated calculation control system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110105453.0A CN112800091B (en) 2021-01-26 2021-01-26 Flow batch integrated calculation control system and method

Publications (2)

Publication Number Publication Date
CN112800091A CN112800091A (en) 2021-05-14
CN112800091B true CN112800091B (en) 2024-06-11

Family

ID=75811918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110105453.0A Active CN112800091B (en) 2021-01-26 2021-01-26 Flow batch integrated calculation control system and method

Country Status (1)

Country Link
CN (1) CN112800091B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113868525A (en) * 2021-09-27 2021-12-31 支付宝(杭州)信息技术有限公司 Method, device and equipment for determining accumulative independent access amount based on batch streaming coordination
CN117435596B (en) * 2023-12-20 2024-04-02 杭州网易云音乐科技有限公司 Streaming batch task integration method and device, storage medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677752A (en) * 2015-12-30 2016-06-15 深圳先进技术研究院 Streaming computing and batch computing combined processing system and method
CN106909598A (en) * 2016-07-01 2017-06-30 阿里巴巴集团控股有限公司 It is a kind of to ensure processing method, the apparatus and system for calculating data consistency
CN109189835A (en) * 2018-08-21 2019-01-11 北京京东尚科信息技术有限公司 The method and apparatus of the wide table of data are generated in real time
CN109522341A (en) * 2018-11-27 2019-03-26 北京京东金融科技控股有限公司 Realize method, apparatus, the equipment of the stream data processing engine based on SQL
CN110309848A (en) * 2019-05-08 2019-10-08 重庆天蓬网络有限公司 The method that off-line data and stream data real time fusion calculate

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11232105B2 (en) * 2019-02-28 2022-01-25 Microsoft Technology Licensing, Llc Unified metrics computation platform

Also Published As

Publication number Publication date
CN112800091A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
US11216187B2 (en) Data writing and reading method and apparatus, and distributed object storage cluster
WO2017080431A1 (en) Log analysis-based database replication method and device
CN105824957A (en) Query engine system and query method of distributive memory column-oriented database
CN112800091B (en) Flow batch integrated calculation control system and method
US11030196B2 (en) Method and apparatus for processing join query
CA3057038C (en) Data filtering method, apparatus, electronic apparatus and storage medium
US20160170462A1 (en) Resource capacity management in a cluster of host computers using power management analysis
CN114722119A (en) Data synchronization method and system
US11308063B2 (en) Data structure to array conversion
CN107066205B (en) Data storage system
CN112328592A (en) Data storage method, electronic device and computer readable storage medium
US20150254109A1 (en) System, method, and apparatus for coordinating distributed electronic discovery processing
Dai et al. Research and implementation of big data preprocessing system based on Hadoop
CN113051102A (en) File backup method, device, system, storage medium and computer equipment
US10552419B2 (en) Method and system for performing an operation using map reduce
CN102855297B (en) A kind of method of control data transmission and connector
CN112433812A (en) Method, system, equipment and computer medium for virtual machine cross-cluster migration
CN112506869A (en) File processing method, device and system
US20200081869A1 (en) File storage method and storage apparatus
CN110941658A (en) Data export method, device, server and storage medium
CN115630122A (en) Data synchronization method and device, storage medium and computer equipment
CN109902067B (en) File processing method and device, storage medium and computer equipment
CN109063201B (en) Impala online interactive query method based on mixed storage scheme
CN113377791A (en) Data processing method, system and computing equipment
CN114691766A (en) Data acquisition method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant