CN112597200B - Batch and stream combined data processing method and device - Google Patents


Publication number
CN112597200B
Authority
CN
China
Prior art keywords
data
batch
node
model
computing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011529842.8A
Other languages
Chinese (zh)
Other versions
CN112597200A (en)
Inventor
陈卓
孙启明
汪利鹏
李延明
李侃
郭显宽
胡鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Three Eye Spirit Information Technology Co ltd
Original Assignee
Nanjing Three Eye Spirit Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Three Eye Spirit Information Technology Co ltd
Priority to CN202011529842.8A
Publication of CN112597200A
Application granted
Publication of CN112597200B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/211 Schema design and management
    • G06F16/212 Schema design and management with details for data modelling support
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Stored Programmes (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the application provides a batch and stream combined data processing method and device. The method comprises: determining a corresponding computing model according to the type and number of data nodes on which computation is to be performed, wherein the data node types comprise batch data nodes and streaming data nodes, a single-source computing model being executed if there is a single data node and a multi-source computing model being executed otherwise; and performing data processing according to the computing model and a preset execution mode. The application can effectively combine batch data and streaming data and improve data processing efficiency.

Description

Batch and stream combined data processing method and device
Technical Field
The present application relates to the field of data processing, and in particular, to a method and apparatus for processing data by combining batch and stream.
Background
In the digital age, the value of data is continually being discovered. With the emergence of big data technology in particular, data analysis has become a required discipline for further development in many fields, and data analysis modeling has become a popular research direction.
In the field of data processing, batch and streaming are the most common forms of data, and over the years a number of dedicated computing tools have emerged. For batch data, these range from the SQL tools of traditional relational databases to batch computing engines on big data platforms such as Hive and Impala; for streaming data, they range from message middleware such as Kafka and RabbitMQ to streaming frameworks such as Storm and Flink. Each has its own characteristics, and together they provide powerful computing means for batch and streaming data analysis across many angles and scenarios.
The inventors found that there are drawbacks and deficiencies in the prior art:
(1) Streaming data has low utilization rate in data modeling
Taking business and efficiency factors together, most data stored in streaming form has a single business attribute and simple content, and in applications it serves a single scenario and a narrowly targeted function. Much streaming data is even written directly to storage once collected, turning it into batch data and losing its valuable timeliness. In addition, streaming data is less flexible to define and use than batch data and is harder to control, so it is used less often than batch data in data modeling.
(2) Batch and streaming computing lacks a means of integration
The differences between batch and streaming data are evident: batch data is stored in various databases and file systems and is accessed and used as a whole, in batches; streaming data is stored in various message middleware or even in memory and is processed one item at a time or in small batches. The two also differ greatly in data format, so format conversion is often needed in advance before they can be computed together. Combined with the differences in flow control, batch and streaming computation lack a suitable mode of integration.
Disclosure of Invention
Aiming at the problems in the prior art, the application provides a batch and streaming combined data processing method and device, which can effectively combine batch data and streaming data and improve data processing efficiency.
In order to solve at least one of the above problems, the present application provides the following technical solutions:
in a first aspect, the present application provides a batch and stream combined data processing method, including:
determining a corresponding calculation model according to the data node type and the number of data nodes required to be calculated, wherein the data node type comprises batch data nodes and stream data nodes, if the number of the data nodes is single, executing a single-source calculation model, otherwise, executing a multi-source calculation model;
and carrying out data processing according to the calculation model and a preset execution mode.
Further, the data processing according to the calculation model and the preset execution mode includes:
if the computing model is a single-source computing model and the data node type is a batch data node, acquiring target data of the batch data node in batch at one time and performing data processing according to a preset execution mode;
if the computing model is a single-source computing model and the data node type is a streaming data node, sequentially acquiring each item of target data of the streaming data node, or acquiring the target data of the streaming data node within a set time period, and performing data processing according to a preset execution mode.
Further, the data processing according to the calculation model and the preset execution mode includes:
if the computing model is a multi-source computing model and the type of the data node needing to be computed is a plurality of batch data nodes, performing data processing on target data of each batch data node through a preset computing engine or a preset data index rule;
if the computing model is a multi-source computing model and the data node type requiring computing comprises batch data nodes and stream data nodes, carrying out data indexing on the batch data nodes in advance, determining target data matched with the stream data nodes according to preset rules, and carrying out data processing;
if the computing model is a multi-source computing model and the type of the data node needing to execute computation is a plurality of stream data nodes, performing data processing through a preset computing engine or a preset data index rule according to target data of the stream data nodes in a set time period and target data of the batch data nodes.
Further, the preset execution mode includes at least one of: a one-time execution mode, a continuous execution mode for a set time period, and a timed update execution mode.
In a second aspect, the present application provides a batch-to-stream combined data processing apparatus, comprising:
the computing node determining module is used for determining a corresponding computing model according to the type and the number of data nodes required to execute computation, wherein the data node type comprises batch data nodes and stream data nodes, if the number of the data nodes is single, executing a single-source computing model, otherwise, executing a multi-source computing model;
and the target data calculation module is used for carrying out data processing according to the calculation model and a preset execution mode.
Further, the computing node determination module includes:
the batch data node single-source computing unit is used for acquiring target data of batch data nodes in batch at one time and performing data processing according to a preset execution mode if the computing model is a single-source computing model and the data node type is a batch data node;
and the streaming data node single-source computing unit is used for sequentially acquiring each item of target data of the streaming data node or acquiring target data of the streaming data node in a set time period if the computing model is a single-source computing model and the data node type is the streaming data node, and performing data processing according to a preset execution mode.
Further, the computing node determination module includes:
the multi-batch data node multi-source computing unit is used for carrying out data processing on target data of each batch data node through a preset computing engine or a preset data index rule if the computing model is a multi-source computing model and the type of the data node needing to be computed is a plurality of batch data nodes;
the batch data node and streaming data node combined computing unit is used for indexing the data of the batch data nodes in advance, determining the target data matched with the streaming data nodes according to preset rules, and performing data processing, if the computing model is a multi-source computing model and the data node types to be computed comprise both batch data nodes and streaming data nodes;
and the multi-source computing unit of the multi-stream data node is used for carrying out data processing through a preset computing engine or a preset data index rule according to the target data of the stream data node in a set time period and the target data of the batch data nodes if the computing model is a multi-source computing model and the type of the data node needing to be computed is a plurality of stream data nodes.
Further, the preset execution mode includes at least one of: a one-time execution mode, a continuous execution mode for a set time period, and a timed update execution mode.
In a third aspect, the present application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the batch and stream combined data processing method when the program is executed.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the batch and stream combined data processing method.
As can be seen from the above technical solution, the present application provides a batch and stream combined data processing method and apparatus, which determine a corresponding computing model according to the type and number of data nodes on which computation is to be performed, where the data node types include batch data nodes and streaming data nodes; a single-source computing model is executed if there is a single data node, and a multi-source computing model otherwise. Data processing is then performed according to the computing model and a preset execution mode. In this way, batch data and streaming data can be effectively combined and data processing efficiency improved.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below show only some embodiments of the present application; a person skilled in the art may derive other drawings from them without inventive effort.
FIG. 1 is a flow chart of a method of batch and stream combined data processing in an embodiment of the present application;
FIG. 2 is a block diagram of a batch and stream combined data processing apparatus in an embodiment of the present application;
FIG. 3 is a schematic diagram of a batch and stream combined data processing apparatus architecture in an embodiment of the present application;
FIG. 4 is a schematic diagram of single source calculation of batch data nodes in an embodiment of the present application;
FIG. 5 is a schematic diagram of single-source computation of a streaming data node in an embodiment of the present application;
FIG. 6 is a schematic diagram of multi-source computing of multi-batch data nodes in an embodiment of the present application;
FIG. 7 is a schematic diagram illustrating a batch data node and stream data node combination calculation in an embodiment of the present application;
FIG. 8 is a schematic diagram of multi-source computation of a multi-stream data node according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort fall within the scope of the present application.
In view of the problems in the prior art, the present application provides a batch and stream combined data processing method and apparatus, which determine a corresponding computing model according to the type and number of data nodes on which computation is to be performed, where the data node types include batch data nodes and streaming data nodes; a single-source computing model is executed if there is a single data node, and a multi-source computing model otherwise. Data processing is then performed according to the computing model and a preset execution mode. In this way, batch data and streaming data can be effectively combined and data processing efficiency improved.
In order to effectively combine batch data and streaming data and improve data processing efficiency, the present application provides an embodiment of a batch-streaming combined data processing method, referring to fig. 1, wherein the batch-streaming combined data processing method specifically includes the following contents:
step S101: determining a corresponding calculation model according to the data node type and the number of data nodes required to be calculated, wherein the data node type comprises batch data nodes and stream data nodes, if the number of the data nodes is single, executing a single-source calculation model, otherwise, executing a multi-source calculation model;
step S102: and carrying out data processing according to the calculation model and a preset execution mode.
As can be seen from the foregoing description, the batch and stream combined data processing method provided by the embodiments of the present application determines a corresponding computing model according to the type and number of data nodes on which computation is to be performed, where the data node types include batch data nodes and streaming data nodes; a single-source computing model is executed if there is a single data node, and a multi-source computing model otherwise. Data processing is then performed according to the computing model and a preset execution mode. In this way, batch data and streaming data can be effectively combined and data processing efficiency improved.
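Steps S101 and S102 can be sketched as a small dispatch routine. The following is a minimal illustration only, not the patented implementation: the names `NodeType`, `ComputingModel`, `select_model`, and `select_strategy`, as well as the strategy labels, are hypothetical and are introduced here purely for the example.

```python
from enum import Enum


class NodeType(Enum):
    BATCH = "batch"
    STREAM = "stream"


class ComputingModel(Enum):
    SINGLE_SOURCE = "single-source"
    MULTI_SOURCE = "multi-source"


def select_model(node_types):
    """Step S101: one data node -> single-source, otherwise multi-source."""
    if not node_types:
        raise ValueError("at least one data node is required")
    if len(node_types) == 1:
        return ComputingModel.SINGLE_SOURCE
    return ComputingModel.MULTI_SOURCE


def select_strategy(node_types):
    """Name the processing branch for step S102 (labels are illustrative)."""
    model = select_model(node_types)
    kinds = set(node_types)
    if model is ComputingModel.SINGLE_SOURCE:
        return "batch" if kinds == {NodeType.BATCH} else "stream"
    if kinds == {NodeType.BATCH}:
        return "batch-vs-batch"
    if kinds == {NodeType.STREAM}:
        return "stream-vs-stream"
    return "batch-vs-stream"
```

For example, `select_strategy([NodeType.BATCH, NodeType.STREAM])` selects the multi-source model and the batch-to-stream branch.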
In an embodiment of the batch and stream combined data processing method of the present application, the performing data processing according to the computing model and a preset execution mode includes:
step S201: if the computing model is a single-source computing model and the data node type is a batch data node, acquiring target data of the batch data node in batch at one time and performing data processing according to a preset execution mode;
step S202: if the computing model is a single-source computing model and the data node type is a streaming data node, sequentially acquiring each item of target data of the streaming data node, or acquiring the target data of the streaming data node within a set time period, and performing data processing according to a preset execution mode.
In an embodiment of the batch and stream combined data processing method of the present application, the performing data processing according to the computing model and a preset execution mode includes:
step S301: if the computing model is a multi-source computing model and the type of the data node needing to be computed is a plurality of batch data nodes, performing data processing on target data of each batch data node through a preset computing engine or a preset data index rule;
Step S302: if the computing model is a multi-source computing model and the data node type requiring computing comprises batch data nodes and stream data nodes, carrying out data indexing on the batch data nodes in advance, determining target data matched with the stream data nodes according to preset rules, and carrying out data processing;
step S303: if the computing model is a multi-source computing model and the type of the data node needing to execute computation is a plurality of stream data nodes, performing data processing through a preset computing engine or a preset data index rule according to target data of the stream data nodes in a set time period and target data of the batch data nodes.
In an embodiment of the batch and stream combined data processing method of the present application, the preset execution mode includes at least one of: a one-time execution mode, a continuous execution mode for a set time period, and a timed update execution mode.
In order to effectively combine batch data and streaming data and improve data processing efficiency, the present application provides an embodiment of a batch and stream combined data processing apparatus for implementing all or part of the batch and stream combined data processing method. Referring to fig. 2, the apparatus specifically includes:
a computing node determining module 10, configured to determine a corresponding computing model according to the type and number of data nodes on which computation is to be performed, where the data node types include batch data nodes and streaming data nodes; a single-source computing model is executed if there is a single data node, and a multi-source computing model otherwise;
and a target data computing module 20, configured to perform data processing according to the computing model and a preset execution mode.
As can be seen from the foregoing description, the batch and stream combined data processing apparatus provided in the embodiments of the present application determines a corresponding computing model according to the type and number of data nodes on which computation is to be performed, where the data node types include batch data nodes and streaming data nodes; a single-source computing model is executed if there is a single data node, and a multi-source computing model otherwise. Data processing is then performed according to the computing model and a preset execution mode. In this way, batch data and streaming data can be effectively combined and data processing efficiency improved.
In one embodiment of the batch and stream combined data processing apparatus of the present application, the computing node determining module 10 includes:
The batch data node single-source computing unit 11 is configured to obtain target data of the batch data node in batch at one time and perform data processing according to a preset execution mode if the computing model is a single-source computing model and the data node type is a batch data node;
and the single-source computing unit 12 of the streaming data node is configured to sequentially obtain each item of target data of the streaming data node or obtain target data of the streaming data node in a set time period if the computing model is a single-source computing model and the data node type is the streaming data node, and perform data processing according to a preset execution mode.
In one embodiment of the batch and stream combined data processing apparatus of the present application, the computing node determining module 10 includes:
a multi-batch data node multi-source computing unit 13, configured to perform data processing on target data of each batch data node by using a preset computing engine or a preset data index rule if the computing model is a multi-source computing model and the type of the data node to be computed is a plurality of batch data nodes;
the batch data node and stream data node combination calculation unit 14 is configured to, if the calculation model is a multi-source calculation model and the data node type requiring calculation includes a batch data node and a stream data node, perform data indexing on the batch data node in advance, and determine target data matched with the stream data node according to a preset rule and perform data processing;
and a multi-stream data node multi-source computing unit 15, configured to perform data processing, by using a preset computing engine or a preset data index rule, according to the target data of the streaming data nodes in a set time period and the target data of the batch data nodes, if the computing model is a multi-source computing model and the data nodes to be computed are a plurality of streaming data nodes.
In an embodiment of the batch and stream combined data processing apparatus of the present application, the preset execution mode includes at least one of: a one-time execution mode, a continuous execution mode for a set time period, and a timed update execution mode.
To further explain the scheme, the application also provides a specific application example in which the batch and stream combined data processing apparatus is used to implement the batch and stream combined data processing method, as follows:
Referring to FIG. 3, the element nodes and their computation relationships form a directed acyclic graph (DAG). Such a graph, together with the information it expresses, is referred to as a computing model. The model supports computation relationships among multi-level nodes (with any number of nodes); building up these multi-level nodes is the process of constructing the computing model. Once construction is finished, the model runs on its own, following the order and direction of the graph (data computation is carried out step by step).
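Running a model "according to the order and direction of the graph" amounts to evaluating the nodes in a topological order of the DAG. The sketch below illustrates this with Python's standard `graphlib` module (Python 3.9+). The node names and the `run_order` helper are hypothetical and do not come from the patent.

```python
from graphlib import TopologicalSorter

# Hypothetical computing model: each key is a computing node and each value is
# the set of source nodes (data nodes or other computing nodes) it reads from.
model = {
    "clean_orders": {"orders_batch"},               # data cleaning on a batch node
    "collide": {"clean_orders", "clicks_stream"},   # batch-vs-stream collision
    "predict": {"collide"},                         # algorithm prediction
}


def run_order(graph):
    """Return one valid execution order; every node follows all of its sources."""
    return list(TopologicalSorter(graph).static_order())


order = run_order(model)
```

Data nodes such as `orders_batch` appear in the order before any computing node that depends on them. A graph containing a cycle is rejected by `TopologicalSorter`, which matches the acyclic requirement.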
The specific design scheme is as follows:
1. Element composition
The scheme divides the elements in the model into the following parts:
(1) Data node
Data nodes are the initial source of data in the model and fall into two categories:
Batch data node:
Represents batch data and its metadata. The node's data is updated in batches at irregular, business-dependent intervals. Its structure is as follows:
Data format: a two-dimensional table made up of rows (data) and columns (fields), a hierarchically structured entity, a custom structured list, etc.
Metadata: data description and structure description (field names, field meanings, field types, etc.)
Carrier: a relational database, a distributed file system, etc.
Streaming data node:
Represents streaming data and its metadata. The node's data flows continuously and carries a time attribute. Its structure is as follows:
Data format: message ID + message body. The message ID is unique; the message body contains all data values of the current message (row-oriented storage, in any of several formats such as a Key/Value structure or a JSON structure). This scheme uses the Key/Value structure as the default format.
Metadata: data description and message body parsing structure (schema, including key names, data types, parsing methods, etc.)
Carrier: message middleware of various types, or memory
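The two data node structures described above can be modeled as plain record types. This is only a sketch under stated assumptions: the class and attribute names (`FieldMeta`, `BatchDataNode`, `StreamMessage`, `StreamDataNode`, `parse`) are hypothetical, and the schema is simplified to a mapping from key name to a Python conversion function.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class FieldMeta:
    name: str          # field name
    dtype: str         # field type, e.g. "int" or "text"
    meaning: str = ""  # field meaning


@dataclass
class BatchDataNode:
    description: str
    fields: list[FieldMeta]   # structure description (metadata)
    carrier: str              # e.g. "relational database"
    rows: list[tuple] = field(default_factory=list)  # two-dimensional table data


@dataclass
class StreamMessage:
    message_id: str           # unique message ID
    body: dict[str, Any]      # Key/Value message body (the default format)


@dataclass
class StreamDataNode:
    description: str
    schema: dict[str, Callable[[Any], Any]]  # key name -> parsing function
    carrier: str              # e.g. "message middleware"

    def parse(self, msg: StreamMessage) -> dict[str, Any]:
        """Apply the parsing schema to one message body."""
        return {key: cast(msg.body[key]) for key, cast in self.schema.items()}
```

For instance, a streaming node with the schema `{"user": str, "amount": float}` turns the raw body `{"user": "u1", "amount": "3.5"}` into typed values before any computation runs on it.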
(2) Computing node
A computing node has two meanings: first, it is a computing operation applied to its source nodes (data nodes or other computing nodes); second, it is the data set generated by that computation, which may be held in either a batch carrier or a streaming carrier.
The supported computing types include:
Data cleaning: cleaning the data with basic processing logic (e.g., merging, truncation).
Algorithm prediction: applying a pre-trained algorithm model to the source node to generate prediction results.
Operator computation: processing the source node data with an external operator (an encapsulated computation procedure) to generate processing results.
Data collision: matching the data of two or more nodes against each other under a rule (e.g., intersection or union) to produce collision results.
Relation mining: computing the relationships among the data of two or more nodes to obtain relationship data.
Other extension types: any other type of computation that can be invoked independently, such as external interfaces, SDKs, etc.
The node stores the basic information required by each computation, such as the field-matching relationships for a data collision, the feature selection for an algorithm prediction, or the source fields for an operator computation.
In this scheme, both data nodes and computing nodes may use different physical platforms to hold their data, provided the data can be accessed when the model runs.
(3) Connecting wire
A connecting line expresses, in the logic diagram, the relationship between nodes in a data computation and the running direction of the model. It can be used for display in the interface (all information needed to actually run the model is held in the nodes).
2. Computing mode
The computing type of a node is closely tied to its computing mode. This scheme provides the following modes, which correspond to different data scenarios and usage patterns under each computing type (the internal details of specific computations are not covered):
Single-source computing:
Computation on the data set of a single node. Most such computations are of the data cleaning, algorithm prediction, or operator computation types.
(1) Batch data calculation (Batch)
Referring to fig. 4, the node data is acquired, transmitted, and computed in a single batch, and the computation result is written into the carrier as a whole.
(2) Stream data calculation (Stream)
Referring to fig. 5, according to real-time requirements, it can be classified into:
Real-time mode: each piece of data in the streaming carrier is processed independently; acquisition, transmission, and computation are completed in sequence, and the result is written to the result carrier immediately. This mode places high demands on data processing performance.
Quasi-real-time mode: data in the streaming carrier is accumulated until a certain amount is reached or a certain time has elapsed, and the resulting small batch is then computed together and written to the result carrier. This mode sacrifices some timeliness in exchange for computational efficiency close to that of batch processing.
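The difference between the two streaming modes can be illustrated with a short sketch. It assumes that each message arrives as a `(timestamp, payload)` pair and that the thresholds `max_size` and `max_wait` are configuration values; these names, and the functions `real_time` and `micro_batches`, are hypothetical.

```python
def real_time(messages, compute):
    """Real-time mode: compute on every message individually, in arrival order."""
    for _, payload in messages:
        yield compute(payload)


def micro_batches(messages, max_size, max_wait):
    """Quasi-real-time mode: group messages into small batches.

    A batch is emitted when it reaches max_size items, or when a newly
    arrived message is at least max_wait time units later than the oldest
    buffered message.
    """
    buffer = []
    for ts, payload in messages:
        if buffer and ts - buffer[0][0] >= max_wait:
            yield [p for _, p in buffer]   # flush on timeout
            buffer = []
        buffer.append((ts, payload))
        if len(buffer) >= max_size:
            yield [p for _, p in buffer]   # flush on size
            buffer = []
    if buffer:
        yield [p for _, p in buffer]       # flush whatever remains at the end


events = [(0, "a"), (1, "b"), (7, "c"), (8, "d")]
batches = list(micro_batches(events, max_size=3, max_wait=5))
```

In this sketch the timeout is only checked when a new message arrives; a production system would typically also flush on a timer so that a quiet stream does not hold data back indefinitely.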
Multi-source computing:
Computation across the data sets of multiple nodes, in which the data of several nodes must be brought in at the same time. Most such computations are of the relation mining or data collision types.
(1) Batch-to-batch calculation (Batch VS Batch)
Referring to FIG. 6, computation between batch nodes can be implemented in two ways:
Engine mode
The data is mapped using a computing engine that fits the data format (for example, an SQL engine for structured two-dimensional tables, or a NoSQL engine for unstructured data), the computation between batches is converted into operations supported by the engine, and the computation result is obtained.
Index mode
The batch data is indexed, with the index built on the fields/key values involved in the computing operation. The computation between batches is converted into matching between indexes, and the original data is then retrieved in reverse to generate the next result set (computing node).
Both approaches support simultaneous computation over multiple nodes (two or more), which reduces the number of computation steps and the amount of intermediate temporary data generated. In this mode, the generated result data is of batch type.
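As a rough illustration of the two implementations, the following Python sketch computes the same inner join between two small batches once through an SQL engine (the standard-library `sqlite3` module stands in for "a computing engine") and once through key indexes. The table names, field names and sample rows are hypothetical and do not come from the patent.

```python
import sqlite3
from collections import defaultdict

# Hypothetical batch nodes: (person_id, name) and (person_id, city).
people = [(1, "Ann"), (2, "Bob"), (3, "Cid")]
trips = [(1, "NKG"), (1, "PEK"), (3, "SHA")]


def engine_mode(people, trips):
    """Engine mode: load both batches into an SQL engine and express the computation as a join."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE people (person_id INTEGER, name TEXT)")
    con.execute("CREATE TABLE trips (person_id INTEGER, city TEXT)")
    con.executemany("INSERT INTO people VALUES (?, ?)", people)
    con.executemany("INSERT INTO trips VALUES (?, ?)", trips)
    rows = con.execute(
        "SELECT p.name, t.city FROM people p JOIN trips t "
        "ON p.person_id = t.person_id ORDER BY p.name, t.city"
    ).fetchall()
    con.close()
    return rows


def build_index(rows, key_pos):
    """Map each key value to the positions of the rows that carry it."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[key_pos]].append(pos)
    return index


def index_mode(people, trips):
    """Index mode: match index against index, then pull the original rows back."""
    idx_p = build_index(people, 0)
    idx_t = build_index(trips, 0)
    result = []
    for key in idx_p.keys() & idx_t.keys():  # keys present in both indexes
        for i in idx_p[key]:
            for j in idx_t[key]:
                result.append((people[i][1], trips[j][1]))
    return sorted(result)
```

Both functions return a batch-type result, and on the sample data they produce the same rows.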
(2) Batch-to-stream calculation (Batch VS Stream)
Referring to FIG. 7, to improve computing efficiency, the batch nodes are indexed in advance, with the index built on the fields/key values involved in the computing operation. Data is then read continuously from the data carrier of the streaming node, the index is scanned for rows matching that data according to the rule set specified for the current computing type, and the retained field data is retrieved in reverse and stored in the result node.
This mode supports simultaneous computation of multiple batch nodes and one stream node: the batch nodes are first computed together to form a single integrated batch node, which is then computed against the stream node. In this mode, the generated result data is of streaming type.
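A minimal Python sketch of this batch-to-stream pattern follows. It assumes an exact-match rule on a single key, which is only one possible "rule set"; the function names, the `keep_fields` parameter and the record layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def build_index(batch: List[dict], key: str) -> Dict[object, List[dict]]:
    """Index the batch node in advance on the field used by the computation."""
    index: Dict[object, List[dict]] = defaultdict(list)
    for row in batch:
        index[row[key]].append(row)
    return index


def batch_vs_stream(
    index: Dict[object, List[dict]],
    stream: Iterable[dict],
    key: str,
    keep_fields: List[str],
) -> Iterator[dict]:
    """Read stream records one by one, look them up in the index, emit retained fields."""
    for event in stream:
        for row in index.get(event[key], []):  # assumed rule: exact key match
            merged = {**row, **event}
            yield {field: merged[field] for field in keep_fields}
```

Because the function is a generator, results are produced as stream records arrive, which matches the statement that the result data is of streaming type.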
(3) Stream-to-stream calculation (Stream VS Stream)
Referring to FIG. 8, because streaming data inherently carries a time attribute, full computation over the entire time range of the streams is not considered. Instead, the computation is converted into batch-to-stream computation from a specified point in time onward, with one of the streams converted into continuously growing batch data.
The specific implementation process comprises the following steps:
One stream, DATA 1, is selected. Starting at the computation start time T0, its data is continuously read and written to a batch data carrier, forming a continuously growing intermediate node, DATA X. For efficiency, a timeout or over-threshold strategy may instead be used to persist the data to disk batch by batch at subsequent time points T1, T2, ..., Tn.
The other stream, DATA 2, is computed against DATA X using the preceding batch-to-stream strategy; each record of DATA 2 is computed against all data/indexes held in DATA X at the current moment.
To limit the storage space of the intermediate node, old data can be cleared on a timeout-eviction basis to free space (the data must still satisfy the timeliness requirement). In this mode, the generated result data is of streaming type.
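The stream-to-stream conversion can be sketched in Python as follows. This is a single-threaded illustration under stated assumptions, not the patented implementation: the two streams are assumed to arrive as one time-ordered sequence of `(timestamp, source, row)` tuples, matching is by exact key, and the class name `GrowingBatch` and the `ttl` parameter are hypothetical.

```python
from collections import defaultdict, deque


class GrowingBatch:
    """Stands in for DATA X: records of stream 1 landed into a keyed, batch-style store."""

    def __init__(self, key, ttl):
        self.key = key
        self.ttl = ttl                   # hypothetical retention period
        self.rows = deque()              # (timestamp, row) in arrival order
        self.index = defaultdict(list)   # key value -> [(timestamp, row), ...]

    def append(self, ts, row):
        self.rows.append((ts, row))
        self.index[row[self.key]].append((ts, row))

    def evict(self, now):
        """Timeout eviction: drop rows older than ttl to bound storage."""
        while self.rows and now - self.rows[0][0] > self.ttl:
            ts, row = self.rows.popleft()
            bucket = self.index[row[self.key]]
            bucket.remove((ts, row))
            if not bucket:
                del self.index[row[self.key]]

    def match(self, value):
        return [row for _, row in self.index.get(value, [])]


def stream_vs_stream(events, key, ttl):
    """events: time-ordered iterable of (ts, source, row), where source is 1 or 2."""
    data_x = GrowingBatch(key, ttl)
    for ts, source, row in events:
        data_x.evict(ts)
        if source == 1:
            data_x.append(ts, row)       # DATA 1 keeps growing DATA X
        else:
            for left in data_x.match(row[key]):
                yield (left, row)        # DATA 2 is computed against current DATA X
```

A record of DATA 2 only matches what DATA X holds at that moment, so a DATA 1 record that arrives later, or one that has already been evicted, produces no result.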
3. Execution mode
After the model is built, an execution mode is needed to start the model and begin data computation. The scheme provides the following execution modes:
(1) One-time mode
This mode is used only by models that contain only batch data. The model is executed once in full from the initial node and then finishes, and the data of each computing node is the final result.
(2) Continuous mode
The model must contain streaming data. The model is executed continuously from the initial node with no end condition, as follows:
Batch data computing node: computed only once, and the result is final.
Streaming data computing node: computed continuously, with the computation results updated into the result set.
(3) Timed update mode
This mode places no requirement on the data type. Like continuous mode, the model runs continuously once started, with no end condition, but a data update interval is added, as follows:
Batch data computing node: executed once immediately after the model starts to give the current result, and re-executed each time the data update interval elapses; the new result (produced by data source updates between the two executions) is updated into the current result set.
Streaming data computing node: computed continuously; if the associated batch data changes, the latest updated data set is used for the computation.
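The three execution modes and their data-type constraints can be summarized in a short Python sketch. The enum, the function names and the `rounds` parameter are assumptions made for illustration; in particular, the described timed update mode has no end condition, and `rounds` exists only so that the sketch terminates.

```python
import time
from enum import Enum
from typing import Callable, List


class ExecutionMode(Enum):
    ONE_TIME = "one_time"
    CONTINUOUS = "continuous"
    TIMED_UPDATE = "timed_update"


def allowed_modes(has_stream_node: bool) -> List[ExecutionMode]:
    """One-time mode needs a batch-only model; continuous mode needs streaming data."""
    if has_stream_node:
        return [ExecutionMode.CONTINUOUS, ExecutionMode.TIMED_UPDATE]
    return [ExecutionMode.ONE_TIME, ExecutionMode.TIMED_UPDATE]


def run_timed_update(
    run_batch: Callable[[], object],
    interval_s: float,
    rounds: int,                                   # hypothetical stop condition for the sketch
    sleep: Callable[[float], None] = time.sleep,
) -> List[object]:
    """Run a batch node immediately, then re-run it after every update interval."""
    results = [run_batch()]
    for _ in range(rounds - 1):
        sleep(interval_s)
        results.append(run_batch())
    return results
```

The last element of the returned list plays the role of the current result set after the most recent update.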
With the above, the present application can at least achieve the following technical effects:
1. The proposed model construction method supports introducing and using both batch data and streaming data, so that streaming data can be widely applied to data modeling and its distinctive timeliness advantage can be exploited.
2. The method implements several fused computation methods for batch data and streaming data, so that the two can be combined to handle a growing number of increasingly complex practical problems, broadening the application fields of each and improving their practical effectiveness.
In order to effectively combine batch data and streaming data and improve data processing efficiency from a hardware aspect, the application provides an embodiment of an electronic device for implementing all or part of contents in a batch and streaming combined data processing method, where the electronic device specifically includes the following contents:
A processor, a memory, a communication interface, and a bus; the processor, the memory and the communication interface communicate with one another through the bus; the communication interface is used for information transmission between the batch and stream combined data processing device and related equipment such as a core service system, a user terminal and a related database; the electronic device may be a desktop computer, a tablet computer, a mobile terminal, etc., and this embodiment is not limited thereto. In this embodiment, the electronic device may be implemented with reference to the embodiments of the batch and stream combined data processing method and of the batch and stream combined data processing apparatus, the contents of which are incorporated herein and are not repeated.
It is understood that the user terminal may include a smart phone, a tablet electronic device, a network set-top box, a portable computer, a desktop computer, a personal digital assistant (PDA), a vehicle-mounted device, a smart wearable device, etc. The smart wearable device may include smart glasses, a smart watch, a smart bracelet, etc.
In practical applications, part of the batch and streaming combined data processing method may be performed on the electronic device side as described above, or all operations may be performed in the client device. Specifically, the selection may be made according to the processing capability of the client device, and restrictions of the use scenario of the user. The present application is not limited in this regard. If all operations are performed in the client device, the client device may further include a processor.
The client device may have a communication module (i.e. a communication unit) and may be connected to a remote server in a communication manner, so as to implement data transmission with the server. The server may include a server on the side of the task scheduling center, and in other implementations may include a server of an intermediate platform, such as a server of a third party server platform having a communication link with the task scheduling center server. The server may include a single computer device, a server cluster formed by a plurality of servers, or a server structure of a distributed device.
FIG. 9 is a schematic block diagram of a system configuration of an electronic device 9600 of an embodiment of the present application. As shown in FIG. 9, the electronic device 9600 may include a central processor 9100 and a memory 9140; the memory 9140 is coupled to the central processor 9100. Notably, FIG. 9 is exemplary; other types of structures may also be used in addition to or in place of this structure to implement telecommunications functions or other functions.
In one embodiment, batch and stream combined data processing method functionality may be integrated into the central processor 9100. The central processor 9100 may be configured to perform the following control:
Step S101: determining a corresponding calculation model according to the data node type and the number of data nodes required to be calculated, wherein the data node type comprises batch data nodes and stream data nodes, if the number of the data nodes is single, executing a single-source calculation model, otherwise, executing a multi-source calculation model;
Step S102: performing data processing according to the calculation model and a preset execution mode.
As can be seen from the above description, the electronic device provided in the embodiments of the present application determines a corresponding calculation model according to the type and number of data nodes to be computed, where the data node type includes batch data nodes and stream data nodes; if there is a single data node, a single-source calculation model is executed, otherwise a multi-source calculation model is executed; data processing is then performed according to the calculation model and a preset execution mode. In this way, batch data and streaming data can be effectively combined, and data processing efficiency is improved.
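Step S101 amounts to a dispatch on the number and types of the data nodes. The Python sketch below shows one way to express it; the `NodeType` enum, the returned labels and the function name are hypothetical, and the case of several stream nodes mixed with batch nodes is rejected because the description does not address it.

```python
from enum import Enum
from typing import List


class NodeType(Enum):
    BATCH = "batch"
    STREAM = "stream"


def select_model(node_types: List[NodeType]) -> str:
    """Pick a calculation model from the number and types of the data nodes."""
    if not node_types:
        raise ValueError("at least one data node is required")
    streams = sum(1 for t in node_types if t is NodeType.STREAM)
    if len(node_types) == 1:                      # single data node: single-source model
        return "single-source/stream" if streams else "single-source/batch"
    if streams == 0:                              # several batch nodes
        return "multi-source/batch-vs-batch"
    if streams == 1:                              # one or more batch nodes plus one stream node
        return "multi-source/batch-vs-stream"
    if streams == len(node_types):                # several stream nodes only
        return "multi-source/stream-vs-stream"
    # Assumption: several streams mixed with batch nodes is not covered by the description.
    raise ValueError("combination not covered by this sketch")
```

Step S102 would then run the selected model under one of the preset execution modes.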
In another embodiment, the batch and stream combined data processing apparatus may be configured separately from the central processor 9100; for example, it may be configured as a chip connected to the central processor 9100, with the batch and stream combined data processing method function implemented under the control of the central processor.
As shown in fig. 9, the electronic device 9600 may further include: a communication module 9110, an input unit 9120, an audio processor 9130, a display 9160, and a power supply 9170. It is noted that the electronic device 9600 need not include all of the components shown in fig. 9; in addition, the electronic device 9600 may further include components not shown in fig. 9, and reference may be made to the related art.
As shown in fig. 9, the central processor 9100, sometimes referred to as a controller or operational control, may include a microprocessor or other processor device and/or logic device, which central processor 9100 receives inputs and controls the operation of the various components of the electronic device 9600.
The memory 9140 may be, for example, one or more of a buffer, a flash memory, a hard drive, a removable medium, a volatile memory, a non-volatile memory, or other suitable device. It may store information relating to failures, as well as a program for processing that information. The central processor 9100 can execute the program stored in the memory 9140 to realize information storage or processing, and the like.
The input unit 9120 provides input to the central processor 9100. The input unit 9120 is, for example, a key or a touch input device. The power supply 9170 is used to provide power to the electronic device 9600. The display 9160 is used for displaying display objects such as images and characters. The display may be, for example, but not limited to, an LCD display.
The memory 9140 may be a solid-state memory such as a read-only memory (ROM), a random access memory (RAM), a SIM card, etc. It may also be a memory that retains information even when powered off, can be selectively erased, and can be provided with further data; an example of such a memory is sometimes referred to as an EPROM. The memory 9140 may also be some other type of device. The memory 9140 includes a buffer memory 9141 (sometimes referred to as a buffer). The memory 9140 may include an application/function storage portion 9142, which stores application programs and function programs, or a flow for executing operations of the electronic device 9600 by the central processor 9100.
The memory 9140 may also include a data store 9143, the data store 9143 for storing data, such as contacts, digital data, pictures, sounds, and/or any other data used by an electronic device. The driver storage portion 9144 of the memory 9140 may include various drivers of the electronic device for communication functions and/or for performing other functions of the electronic device (e.g., messaging applications, address book applications, etc.).
The communication module 9110 is a transmitter/receiver 9110 that transmits and receives signals via an antenna 9111. A communication module (transmitter/receiver) 9110 is coupled to the central processor 9100 to provide input signals and receive output signals, as in the case of conventional mobile communication terminals.
Based on different communication technologies, a plurality of communication modules 9110, such as a cellular network module, a bluetooth module, and/or a wireless local area network module, etc., may be provided in the same electronic device. The communication module (transmitter/receiver) 9110 is also coupled to a speaker 9131 and a microphone 9132 via an audio processor 9130 to provide audio output via the speaker 9131 and to receive audio input from the microphone 9132 to implement usual telecommunications functions. The audio processor 9130 can include any suitable buffers, decoders, amplifiers and so forth. In addition, the audio processor 9130 is also coupled to the central processor 9100 so that sound can be recorded locally through the microphone 9132 and sound stored locally can be played through the speaker 9131.
The embodiments of the present application further provide a computer readable storage medium capable of implementing all steps in the batch and stream combined data processing method in which the execution subject is a server or a client in the above embodiments, where the computer readable storage medium stores a computer program, and when the computer program is executed by a processor, the computer program implements all steps in the batch and stream combined data processing method in which the execution subject is a server or a client in the above embodiments, for example, the processor implements the following steps when executing the computer program:
Step S101: determining a corresponding calculation model according to the data node type and the number of data nodes required to be calculated, wherein the data node type comprises batch data nodes and stream data nodes, if the number of the data nodes is single, executing a single-source calculation model, otherwise, executing a multi-source calculation model;
Step S102: performing data processing according to the calculation model and a preset execution mode.
As can be seen from the foregoing description, the computer readable storage medium provided in the embodiments of the present application determines a corresponding calculation model according to the type and number of data nodes to be computed, where the data node type includes batch data nodes and stream data nodes; if there is a single data node, a single-source calculation model is executed, otherwise a multi-source calculation model is executed; data processing is then performed according to the calculation model and a preset execution mode. In this way, batch data and streaming data can be effectively combined, and data processing efficiency is improved.
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The principles and embodiments of the present invention have been described in detail with reference to specific examples, which are provided to facilitate understanding of the method and core ideas of the present invention; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in accordance with the ideas of the present invention, the present description should not be construed as limiting the present invention in view of the above.

Claims (4)

1. A batch and stream combined data processing method, the method comprising:
determining a corresponding calculation model according to the data node type and the number of data nodes required to be calculated, wherein the data node type comprises batch data nodes and stream data nodes, if the number of the data nodes is single, executing a single-source calculation model, otherwise, executing a multi-source calculation model;
if the computing model is a single-source computing model and the data node type is a batch data node, acquiring target data of the batch data node in batch at one time and performing data processing according to a preset execution mode;
if the computing model is a single-source computing model and the data node type is a streaming data node, sequentially acquiring each item of target data of the streaming data node or acquiring target data of the streaming data node in a set time period, and performing data processing according to a preset execution mode;
performing data processing according to the calculation model and a preset execution mode;
if the computing model is a multi-source computing model and the type of the data node needing to be computed is a plurality of batch data nodes, performing data processing on target data of each batch data node through a preset computing engine or a preset data index rule;
if the computing model is a multi-source computing model and the data node type requiring computing comprises batch data nodes and stream data nodes, carrying out data indexing on the batch data nodes in advance, determining target data matched with the stream data nodes according to preset rules, and carrying out data processing;
if the computing model is a multi-source computing model and the type of the data node needing to be computed is a plurality of stream data nodes, performing data processing through a preset computing engine or a preset data index rule according to target data of the stream data nodes in a set time period and target data of the batch data nodes;
the construction process of the calculation model is as follows:
determining element constitution, including batch data nodes representing batch data and metadata thereof and stream data nodes representing stream data and metadata thereof;
calculating the data source node to generate a data set, wherein the data set is in a form of two carriers of batch and stream type, and the calculation type comprises: data cleaning, algorithm prediction, operator calculation, data collision and relation mining;
establishing a connecting line according to the node relation calculated by the data and the running direction of the model, and performing interface display;
selecting different computing modes according to application modes of data under different scenes and computing types, wherein the computing modes comprise single-source computing and multi-source computing, the single-source computing comprises batch data computing and streaming data computing, and the multi-source computing comprises batch-to-batch computing, batch-to-stream computing and stream-to-stream computing;
after the model is built, starting the model by adopting a corresponding execution mode to start data calculation, wherein the calculation execution mode comprises a one-time mode, a continuous mode and a timed update mode.
2. A batch and stream combined data processing apparatus comprising:
the computing node determining module is used for determining a corresponding computing model according to the type and the number of data nodes required to execute computation, wherein the data node type comprises batch data nodes and stream data nodes, if the number of the data nodes is single, executing a single-source computing model, otherwise, executing a multi-source computing model;
the target data calculation module is used for carrying out data processing according to the calculation model and a preset execution mode, wherein the preset execution mode comprises at least one of: a one-time execution mode, a continuous execution mode for setting a time period, and a timed update execution mode;
the computing node determination module includes: the system comprises a batch data node single-source computing unit, a stream data node single-source computing unit, a multi-batch data node multi-source computing unit, a batch data node and stream data node combination computing unit and a multi-stream data node multi-source computing unit;
the batch data node single-source computing unit is used for acquiring target data of batch data nodes in batch at one time and performing data processing according to a preset execution mode if the computing model is a single-source computing model and the data node type is a batch data node;
the stream data node single-source computing unit is used for sequentially acquiring each item of target data of the stream data node or acquiring target data of the stream data node in a set time period if the computing model is a single-source computing model and the data node type is the stream data node, and carrying out data processing according to a preset execution mode;
the multi-batch data node multi-source computing unit is used for carrying out data processing on target data of each batch data node through a preset computing engine or a preset data index rule if the computing model is a multi-source computing model and the type of the data node needing to be computed is a plurality of batch data nodes;
the batch data node and stream data node combination calculation unit is used for, if the calculation model is a multi-source calculation model and the data node types needing to be calculated comprise batch data nodes and stream data nodes, carrying out data indexing on the batch data nodes in advance, determining target data matched with the stream data nodes according to a preset rule, and carrying out data processing;
the multi-stream data node multi-source computing unit is used for carrying out data processing through a preset computing engine or a preset data index rule according to the target data of the stream data node in a set time period and the target data of the batch data node if the computing model is a multi-source computing model and the type of the data node needing to be computed is a plurality of stream data nodes;
the construction of the calculation model is to determine element constitution, calculate the data source node, calculate and generate a data set, establish a connecting line according to the node relation of data calculation and the running direction of the model, perform interface display, select different calculation modes according to the application modes of the data under different scenes and calculation types, and start the model by adopting the corresponding execution mode after the model construction is completed.
3. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the batch and stream combined data processing method of claim 1 when the program is executed by the processor.
4. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the batch and stream combined data processing method of claim 1.
CN202011529842.8A 2020-12-22 2020-12-22 Batch and stream combined data processing method and device Active CN112597200B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011529842.8A CN112597200B (en) 2020-12-22 2020-12-22 Batch and stream combined data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011529842.8A CN112597200B (en) 2020-12-22 2020-12-22 Batch and stream combined data processing method and device

Publications (2)

Publication Number Publication Date
CN112597200A CN112597200A (en) 2021-04-02
CN112597200B true CN112597200B (en) 2024-01-12

Family

ID=75200746

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011529842.8A Active CN112597200B (en) 2020-12-22 2020-12-22 Batch and stream combined data processing method and device

Country Status (1)

Country Link
CN (1) CN112597200B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115599524B (en) * 2022-10-27 2023-06-09 中国兵器工业计算机应用技术研究所 Data lake system based on cooperative scheduling processing of stream data and batch data

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103761309A (en) * 2014-01-23 2014-04-30 ***(深圳)有限公司 Operation data processing method and system
CN105677752A (en) * 2015-12-30 2016-06-15 深圳先进技术研究院 Streaming computing and batch computing combined processing system and method
CN105701161A (en) * 2015-12-31 2016-06-22 深圳先进技术研究院 Real-time big data user label system
CN106873945A (en) * 2016-12-29 2017-06-20 中山大学 Data processing architecture and data processing method based on batch processing and Stream Processing
CN107330238A (en) * 2016-08-12 2017-11-07 中国科学院上海技术物理研究所 Medical information collection, processing, storage and display methods and device
CN109889575A (en) * 2019-01-15 2019-06-14 北京航空航天大学 Cooperated computing plateform system and method under a kind of peripheral surroundings

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
IN2013CH01044A (en) * 2013-03-12 2015-08-14 Yahoo Inc

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
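Two privacy-preserving approaches for publishing transaction data systems; Jinyan Wang et al.; IEEE Access; vol. 6; pp. 1-2 *
Research on a real-time fusion method for multi-source heterogeneous flight track data streams; 张瞩熹 et al.; 《物联网学报》; vol. 4 (no. 3); pp. 60-68 *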

Also Published As

Publication number Publication date
CN112597200A (en) 2021-04-02

Similar Documents

Publication Publication Date Title
CN103970861B (en) Information demonstrating method and equipment
CN110990482A (en) Data synchronization method and device between asynchronous databases
CN103312590A (en) Method, device, receiving terminal, transmitting terminal and equipment for group communication
CN102426567A (en) Graphical editing and debugging system of automatic answer system
CN112597200B (en) Batch and stream combined data processing method and device
CN111898698A (en) Object processing method and device, storage medium and electronic equipment
CN106227597A (en) Task priority treating method and apparatus
CN113658597B (en) Voice ordering method, device, electronic equipment and computer readable medium
CN115098262B (en) Multi-neural network task processing method and device
CN115495519A (en) Report data processing method and device
CN112734545B (en) Block chain data sharing method, device and system
CN116362327A (en) Model training method and system and electronic equipment
CN114398883B (en) Presentation generation method and device, computer readable storage medium and server
CN115205009A (en) Account opening business processing method and device based on virtual technology
CN112792808B (en) Industrial robot online track planning method and device based on variable structure filter
CN112417018B (en) Data sharing method and device
CN109598344A (en) Model generating method and device
CN114254563A (en) Data processing method and device, electronic equipment and storage medium
CN110475325A (en) Power distribution method, device, terminal and storage medium
CN111526054B (en) Method and device for acquiring network
CN117291689A (en) Commodity recommendation method and device based on user attribute network
CN113656645A (en) Log consumption method and device
CN116226535A (en) Site recommendation method and device
CN116506506A (en) Service dynamic change method, device, computer equipment and storage medium
CN116560664A (en) Code function calling method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant