CN116521963A

CN116521963A - Method and system for processing calculation engine data based on componentization

Info

Publication number: CN116521963A
Application number: CN202310807717.6A
Authority: CN
Inventors: 王心安
Original assignee: Zhilin Technology Co ltd; Beijing Zhilin Technology Co ltd
Current assignee: Zhilin Technology Co ltd; Beijing Zhilin Technology Co ltd
Priority date: 2023-07-04
Filing date: 2023-07-04
Publication date: 2023-08-01

Abstract

The invention provides a componentization-based calculation engine data processing method and system, relating to the technical field of data processing. Data are read from the data sources of various components, taken as input data of the calculation engine, and input into the calculation engine serially for processing; the calculation engine selects a feature decision algorithm according to the features and categories of the input data, classifies and numbers the data, and sorts the data in numerical order; the real-time data stream sorted by category is received, divided into small batches of data streams for output, subjected to real-time feedback control of the processing rate, and returned to the database.

Description

Method and system for processing calculation engine data based on componentization

Technical Field

The invention relates to the technical field of data processing, in particular to a method and a system for processing calculation engine data based on componentization.

Background

In recent years, the construction of computing engines has attracted attention from large internet companies and researchers at home and abroad, and much exploration has taken place in related technical fields, the most popular computing engine being the distributed computing engine. With the rapid development of the big data field in China, computing engine tools for big data computation have emerged in large numbers, and the application prospects of computing engines are broad. For users, a large amount of data, both structured and unstructured, is already stored in databases, but the data are distributed across different systems. As the requirements and conditions under which each business system fetches data from the databases keep growing, a "spider web" that is difficult to maintain and manage has formed. A unified data management and access platform therefore needs to be established to facilitate unified maintenance and management and to provide a one-stop data access service.

The main idea of the engine data architecture is to divide the big data system architecture into several layers: an offline processing layer (batch layer), a real-time processing layer (speed layer), and a service layer. Data processing in the offline processing layer and in the real-time processing layer is implemented with separate functions. In the prior art, however, these functions have no unified logic layer, so to keep them logically identical, developers must manually maintain at least two sets of core code logic for the offline and real-time processing layers. This increases the difficulty and cost of system maintenance and also causes differences in the quality of the processing results of the two layers. Even powerful management platforms still have the following defects:

1. At the data processing level, a computing framework built on a single data computing engine is often ill-suited to data processing in certain scenarios.

2. At the data storage level, enterprises produce many types of data, and a single storage platform can hardly meet the requirements of high-concurrency, high-throughput, low-latency read-write access for all of an enterprise's data types.

3. In terms of platform usability, mastering a given data processing tool presents a high technical threshold for users, and professionals often need to study for a long time to master a big data platform. Moreover, constructing a data processing task on a big data platform often requires a complex process that is time-consuming and labor-intensive.

Disclosure of Invention

To solve the above technical problems, the invention provides a componentization-based calculation engine data processing method, comprising the following steps:

S1, reading data from the data sources of various components, taking the data as input data of the calculation engine, and inputting the input data into the calculation engine serially for processing;

S2, the computing engine selects a feature decision algorithm according to the features and categories of the input data, classifies and numbers the data, and sorts the data in numerical order;

S3, receiving the real-time data stream sorted by category, dividing it into small batches of data streams for output, performing real-time feedback control on the output processing rate, and returning the data streams to the database.

Further, in step S2: for a data set Z1 having a feature A, an arbitrary sample R is extracted from the data set Z1, and then an adjacent sample S of the same category and a heterogeneous sample D of a different category are randomly extracted. If the distance between sample R and the adjacent sample S is smaller than the distance between sample R and the heterogeneous sample D, feature A helps distinguish nearest neighbors of the same category from nearest neighbors of different categories, and the weight of feature A is increased; conversely, if the distance between sample R and the adjacent sample S is greater than the distance between sample R and the heterogeneous sample D, feature A has a negative effect on distinguishing nearest neighbors of the same category from those of different categories, and the weight of feature A is reduced.

Further, an arbitrary sample R is extracted again and the above process is repeated m times, finally yielding the average weight W(A) of feature A according to the formula:

；

where diff(A, R, S) denotes the difference between sample R and the adjacent sample S on feature A, and diff(A, R, D) denotes the difference between sample R and the heterogeneous sample D on feature A.
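The weighting procedure described above resembles the Relief family of feature-weighting algorithms. The following is a minimal Python sketch under stated assumptions: samples are dictionaries of numeric features, diff is taken as the absolute difference normalised by the feature's value range, and S and D are drawn at random as the text describes. The function names, the normalisation, and the `seed` parameter are illustrative and are not taken from the source.

```python
import random

def diff(feature, a, b, span):
    """Assumed definition: absolute difference on one feature, scaled by its range."""
    return abs(a[feature] - b[feature]) / span if span else 0.0

def relief_weight(samples, labels, feature, m, seed=0):
    """Average weight of `feature` over m random draws of a sample R."""
    rng = random.Random(seed)
    values = [s[feature] for s in samples]
    span = max(values) - min(values)
    weight = 0.0
    for _ in range(m):
        i = rng.randrange(len(samples))
        r = samples[i]
        # Candidates for S (same category as R) and D (different category).
        same = [s for j, s in enumerate(samples) if j != i and labels[j] == labels[i]]
        other = [s for j, s in enumerate(samples) if labels[j] != labels[i]]
        if not same or not other:
            continue  # this draw cannot be scored
        s_adj = rng.choice(same)
        d_het = rng.choice(other)
        # Weight rises when R is closer to S than to D, and falls otherwise.
        weight += (diff(feature, r, d_het, span) - diff(feature, r, s_adj, span)) / m
    return weight

samples = [
    {"x": 0.0, "y": 0.0},
    {"x": 0.1, "y": 1.0},
    {"x": 0.9, "y": 0.0},
    {"x": 1.0, "y": 1.0},
]
labels = ["a", "a", "b", "b"]
w_x = relief_weight(samples, labels, "x", m=20)
w_y = relief_weight(samples, labels, "y", m=20)
```

In this toy data set, feature `x` separates the two categories and receives a positive weight, while feature `y` does not and receives a weight that is zero or negative.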

Further, k samples of the same class C are extracted from the data set Z1 to form a data set Z2, and an arbitrary sample B is extracted from the data set Z2; the weight W(C) of class C is then:

；

where p(C) is the ratio of the number of data items of class C in the data set Z2 to the total number of data items in the data set Z1, p(B) is the ratio of the number of samples B in the data set Z2 to the total number of data items in the data set Z2, and the remaining term denotes the j-th sample of class C that is not in sample B, the number of samples j being k.

Further, step S3 includes:

S3.1, receiving the real-time data stream sorted by category, dividing the sorted real-time data stream into small batches of data streams according to data category, and outputting the small batches of data streams;

S3.2, monitoring the processing rate of the output data stream in real time, and processing the data stream according to the upper limit of the data processing rate;

S3.3, after the processing of the data stream batch is completed, returning the processed data stream batch to the database.

Further, in step S3.1, the real-time data stream sorted by category is received and segmented by data category into $L_1, L_2, L_3, \dots, L_{n-1}, L_n$, where the data amounts of the n segments are $R_1, R_2, R_3, \dots, R_{n-1}, R_n$, the minimum data amount being $R_{\min}$ and the maximum data amount being $R_{\max}$.

The data amount of a single segment is set to be no lower than $R_0$, and the data amount of a single segment is set to satisfy the following formula:

。

Further, in step S3.2, at the start of a batch, the start time $t_1$ of the current batch is submitted to the rate controller;

at the completion of the batch, the processing end time $t_2$ of the current batch is submitted to the rate controller; the time taken to process the batch is $t$, the waiting time of the batch in the batch queue is $t_w$, and the data amount of the batch is $NR_n$;

when the current batch is submitted, the upper limit $V$ of the data processing rate for the current batch is calculated:

$V = V_{late} - K_p \times e_r - K_i \times E_r - K_d \times d_r$;

$V_{late} = (t + t_w) / NR_n$;

where $V_{late}$ is the upper limit of the data output rate of the most recently processed batch; $K_p$ is the proportional coefficient and $e_r$ is the estimated error of the data processing rate; $K_i$ is the integral coefficient and $E_r$ is the cumulative error of the data processing rate; $K_d$ is the derivative coefficient, which reduces the effect of noisy data on the system; and $d_r$ is the change in the processing rate.

The invention also provides a componentization-based calculation engine data processing system for implementing the calculation engine data processing method, the system comprising: a plurality of micro-engines, a feature decision unit, a control unit, and a database;

the plurality of micro-engines correspond to the plurality of components, read data from the data sources of the various components as input data of the computing engine, and input the input data into the computing engine serially for processing;

the computing engine, through the feature decision unit, classifies and numbers the data according to the features and categories of the input data, and sorts the data in numerical order;

the control unit receives the real-time data stream sorted by category, divides it into small batches of data streams for output, performs real-time feedback control on the output processing rate, and returns the data streams to the database.

Further, the control unit includes a segmentation module and a rate controller;

the segmentation module receives the real-time data stream sorted by category, divides the sorted real-time data stream into small batches of data streams according to data category, and outputs the small batches of data streams;

the rate controller monitors the processing rate of the output data stream in real time and processes the data stream according to the upper limit of the data processing rate.

Compared with the prior art, the invention has the following beneficial technical effects: data are read from the data sources of various components, taken as input data of the computing engine, and input into the computing engine serially for processing; the computing engine selects a feature decision algorithm according to the features and categories of the input data, classifies and numbers the data, and sorts the data in numerical order; the real-time data stream sorted by category is received, divided into small batches of data streams, subjected to real-time feedback control of the processing rate, and returned to the database.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person skilled in the art may obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of the componentization-based computing engine data processing method of the present invention.

FIG. 2 is a flow chart of the present invention for dividing the data stream into small batches and performing real-time feedback control of the processing rate.

FIG. 3 is a block diagram of the componentization-based computing engine data processing system of the present invention.

Detailed Description

To make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the present disclosure without inventive effort fall within the scope of the present disclosure.

In the drawings of the specific embodiments of the present invention, the connection relationships between the parts of the device are shown in order to describe the working principle of each element of the system more clearly. The drawings only distinguish the relative positional relationships between the elements and are not to be construed as limiting the signal transmission direction, the connection sequence, or the size, dimensions, and shape of any part of an element or structure.

FIG. 1 is a flow chart of the componentization-based computing engine data processing method of the present invention, which comprises the following steps:

S1, reading data from the data sources of various components, taking the data as input data of the computing engine, and inputting the input data into the computing engine serially for processing.
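Step S1 can be illustrated with a short Python sketch. It assumes that "serially" means the records of each component's data source are handed to the engine one after another through a single stream; the function name `read_serially`, the record layout, and the `component` tag are hypothetical and not taken from the source.

```python
from typing import Dict, Iterable, Iterator

def read_serially(sources: Dict[str, Iterable[dict]]) -> Iterator[dict]:
    """Yield records from each component's data source, one source after another."""
    for component, records in sources.items():
        for record in records:
            # Tag each record with the component it came from (illustrative field).
            yield {"component": component, **record}

sources = {
    "orders": [{"value": 1}, {"value": 2}],
    "logs": [{"value": 3}],
}
engine_input = list(read_serially(sources))
```

A generator keeps the input a single ordered stream, so the engine consumes one record at a time regardless of how many component sources exist.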

The componentized computing engine contains a plurality of micro-engines corresponding to the plurality of components. All threads of a micro-engine share one microcode program memory and execute the same microcode program; if the threads of a micro-engine are to perform different tasks, branch jumps are made on the thread number so that different threads execute different parts of the microcode program instructions.

Each micro-engine supports M threads, and each thread has four states: idle, running, ready, and sleep. An unused thread is in the idle state; in the running state, the thread occupies the micro-engine to execute the microcode program; a thread that is ready to run is in the ready state; a thread in the sleep state waits for the arrival of an external event signal. Once the currently running thread enters the sleep state, the thread switcher selects one thread from all ready-state threads to start running. At any moment each micro-engine can have only one thread in the running state, while multiple threads may be in the other states, so the thread switcher switches tasks in a preemptive mode.
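The four thread states and the switching rule can be sketched in Python as a small state machine. This is an illustration only: the class and method names are hypothetical, the switcher is assumed to pick ready threads in first-in, first-out order (the text only says that one ready thread is selected), and preemption is not modelled.

```python
from collections import deque
from enum import Enum, auto

class ThreadState(Enum):
    IDLE = auto()
    RUNNING = auto()
    READY = auto()
    SLEEP = auto()

class MicroEngine:
    def __init__(self, num_threads):
        self.states = [ThreadState.IDLE] * num_threads
        self.ready_queue = deque()
        self.running = None  # at most one running thread at any moment

    def make_ready(self, tid):
        """An idle thread is started, or a sleeping thread receives its event."""
        if self.states[tid] in (ThreadState.IDLE, ThreadState.SLEEP):
            self.states[tid] = ThreadState.READY
            self.ready_queue.append(tid)
            self._switch()

    def sleep_current(self):
        """The running thread goes to sleep; the switcher picks a ready thread."""
        if self.running is not None:
            self.states[self.running] = ThreadState.SLEEP
            self.running = None
        self._switch()

    def _switch(self):
        # Only fill the running slot when it is free (assumed FIFO choice).
        if self.running is None and self.ready_queue:
            tid = self.ready_queue.popleft()
            self.states[tid] = ThreadState.RUNNING
            self.running = tid

engine = MicroEngine(num_threads=3)
engine.make_ready(0)    # thread 0 starts running
engine.make_ready(1)    # thread 1 waits in the ready state
engine.sleep_current()  # thread 0 sleeps, thread 1 takes over
```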

S2, the computing engine selects a feature decision algorithm according to the features and categories of the input data, classifies and numbers the data, and sorts the data in numerical order.

For a data set Z1 having a feature A, an arbitrary sample R is extracted from the data set Z1, and then an adjacent sample S of the same category and a heterogeneous sample D of a different category are randomly extracted. If the distance between sample R and the adjacent sample S is smaller than the distance between sample R and the heterogeneous sample D, feature A helps distinguish nearest neighbors of the same category from nearest neighbors of different categories, and the weight of feature A is increased; conversely, if the distance between sample R and the adjacent sample S is greater than the distance between sample R and the heterogeneous sample D, feature A has a negative effect on distinguishing nearest neighbors of the same category from those of different categories, and the weight of feature A is reduced.

An arbitrary sample R is then extracted again and the process is repeated m times, finally yielding the average weight W(A) of feature A according to the formula:

；

Next, k samples of the same category C are extracted from the data set Z1 to form a data set Z2, and an arbitrary sample B is extracted from the data set Z2; the weight W(C) of category C is then:

；

The data are sorted according to the value of the weight W(C) and are classified and numbered accordingly, which improves calculation efficiency.
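The numbering and sorting step can be sketched in Python as follows. The sketch assumes that the category weights W(C) have already been computed, that categories are numbered in descending order of weight (the text does not state the direction), and that each record carries a `category` field; these names and choices are illustrative only.

```python
def number_by_category_weight(records, category_weights):
    """Number categories by descending weight, then sort records by that number."""
    ranked = sorted(category_weights, key=category_weights.get, reverse=True)
    number = {category: idx + 1 for idx, category in enumerate(ranked)}
    numbered = [{**rec, "number": number[rec["category"]]} for rec in records]
    # sorted() is stable, so records keep their arrival order within a category.
    return sorted(numbered, key=lambda rec: rec["number"])

records = [
    {"id": 1, "category": "b"},
    {"id": 2, "category": "a"},
    {"id": 3, "category": "b"},
]
weights = {"a": 0.2, "b": 0.7}  # hypothetical W(C) values
ordered = number_by_category_weight(records, weights)
```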

S3, receiving the real-time data stream sorted by category, dividing it into small batches of data streams for output, performing real-time feedback control on the output processing rate, and returning the data streams to the database. As shown in FIG. 2, this comprises the following steps:

S3.1, receiving the real-time data stream sorted by category, dividing the sorted real-time data stream into small batches of data streams according to data category, and outputting the small batches of data streams.

The real-time data stream sorted by category is received and segmented by data category into $L_1, L_2, L_3, \dots, L_{n-1}, L_n$, where the data amounts of the n segments are $R_1, R_2, R_3, \dots, R_{n-1}, R_n$, the minimum data amount being $R_{\min}$ and the maximum data amount being $R_{\max}$.

。
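Segmenting a category-sorted stream into per-category batches can be sketched in Python with `itertools.groupby`. The sketch only covers the split into segments and the minimum and maximum segment sizes; it does not implement the constraint on the single-segment data amount, and the function name and record layout are hypothetical.

```python
from itertools import groupby

def segment_by_category(sorted_stream):
    """Split a category-sorted stream into segments, one per run of a category."""
    segments = [
        (category, list(group))
        for category, group in groupby(sorted_stream, key=lambda rec: rec["category"])
    ]
    sizes = [len(items) for _, items in segments]  # R_1 ... R_n
    r_min = min(sizes) if sizes else 0
    r_max = max(sizes) if sizes else 0
    return segments, r_min, r_max

stream = [
    {"category": "a", "v": 1},
    {"category": "a", "v": 2},
    {"category": "b", "v": 3},
]
segments, r_min, r_max = segment_by_category(stream)
```

`groupby` only merges adjacent records, so it relies on the stream already being sorted by category, which is what step S2 provides.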

S3.2, monitoring the processing rate of the output data stream in real time, and processing the data stream according to the upper limit of the data processing rate.

The maximum number of data items that can be read into the message queue is determined from the current upper limit of the output data processing rate, so that the amount of data read never exceeds the maximum processing capacity.

The processing rate is monitored in real time using a slow-start approach, with the initial upper limit of the data processing rate set to N segments of data per second;

at the start of a batch, the start time $t_1$ of the current batch is submitted to the rate controller;

at the completion of the batch, the processing end time $t_2$ of the current batch is submitted to the rate controller; the time taken to process the batch is $t$, the waiting time of the batch in the batch queue is $t_w$, and the data amount of the batch is $NR_n$;

$V = V_{late} - K_p \times e_r - K_i \times E_r - K_d \times d_r$;

$V_{late} = (t + t_w) / NR_n$
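The two formulas above can be transcribed into a small Python sketch of the rate controller. It follows the formulas as written and makes two assumptions that the text does not state: the batch processing time t is taken as the difference between the end time and the start time, and the error terms are supplied by the caller, because the text does not say how they are measured. The class and method names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RateController:
    k_p: float  # proportional coefficient K_p
    k_i: float  # integral coefficient K_i
    k_d: float  # derivative coefficient K_d

    def latest_limit(self, t1, t2, t_w, data_amount):
        """V_late = (t + t_w) / NR_n, with t assumed to be t2 - t1."""
        t = t2 - t1
        return (t + t_w) / data_amount

    def upper_limit(self, v_late, e_r, cum_e_r, d_r):
        """V = V_late - K_p*e_r - K_i*E_r - K_d*d_r (cum_e_r stands for E_r)."""
        return v_late - self.k_p * e_r - self.k_i * cum_e_r - self.k_d * d_r

controller = RateController(k_p=1.0, k_i=0.5, k_d=0.0)
v_late = controller.latest_limit(t1=0.0, t2=2.0, t_w=1.0, data_amount=6)
v = controller.upper_limit(v_late, e_r=0.1, cum_e_r=0.2, d_r=0.0)
```

Keeping the error terms as inputs leaves the measurement policy open, so the same controller can be driven by whatever monitoring the surrounding system provides.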

FIG. 3 is a schematic structural diagram of the componentization-based computing engine data processing system of the present invention, which implements the computing engine data processing method and comprises: a plurality of micro-engines, a feature decision unit, a control unit, and a database.

The control unit includes a segmentation module and a rate controller.

The segmentation module receives the real-time data streams sequenced according to the categories, segments the sequenced real-time data streams into small batches of data streams according to different data categories, and outputs the small batches of data streams.

The rate controller monitors the processing rate of the output data stream in real time, and processes the data stream according to the upper limit of the data processing rate.
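How the four parts of the system fit together can be sketched in Python as a simple composition. This is a structural illustration only: each part is represented by a plain callable or list standing in for the real unit, and the class name `ComputeEngineSystem` and its `run` method are hypothetical.

```python
class ComputeEngineSystem:
    def __init__(self, micro_engines, feature_decision, control_unit, database):
        self.micro_engines = micro_engines        # callables, one per component
        self.feature_decision = feature_decision  # classifies, numbers, and sorts
        self.control_unit = control_unit          # splits into small batches
        self.database = database                  # stand-in: a list of stored items

    def run(self):
        # Micro-engines read their component's source; inputs are fed serially.
        data = [rec for engine in self.micro_engines for rec in engine()]
        ordered = self.feature_decision(data)
        for batch in self.control_unit(ordered):
            self.database.extend(batch)  # return each processed batch to storage
        return self.database

system = ComputeEngineSystem(
    micro_engines=[lambda: [3, 1], lambda: [2]],
    feature_decision=sorted,
    control_unit=lambda xs: [xs[i:i + 2] for i in range(0, len(xs), 2)],
    database=[],
)
stored = system.run()
```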

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in or transmitted across a computer-readable storage medium. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), a semiconductor medium (e.g., a solid state disk (SSD)), or the like.

While the invention has been described with reference to certain preferred embodiments, those skilled in the art will understand that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A componentization-based computing engine data processing method, comprising the following steps:

2. The computing engine data processing method according to claim 1, wherein in step S2: for a data set Z1 having a feature A, an arbitrary sample R is extracted from the data set Z1, and then an adjacent sample S of the same category and a heterogeneous sample D of a different category are randomly extracted; if the distance between sample R and the adjacent sample S is smaller than the distance between sample R and the heterogeneous sample D, feature A helps distinguish nearest neighbors of the same category from nearest neighbors of different categories, and the weight of feature A is increased; conversely, if the distance between sample R and the adjacent sample S is greater than the distance between sample R and the heterogeneous sample D, feature A has a negative effect on distinguishing nearest neighbors of the same category from those of different categories, and the weight of feature A is reduced.

3. The computing engine data processing method according to claim 2, wherein an arbitrary sample R is extracted again and the process is repeated m times, finally yielding the average weight W(A) of feature A according to the formula:

；

4. The computing engine data processing method according to claim 3, wherein k samples of the same class C are extracted from the data set Z1 to form a data set Z2, and an arbitrary sample B is extracted from the data set Z2, the weight W(C) of class C then being:

；

wherein p(C) is the ratio of the number of data items of class C in the data set Z2 to the total number of data items in the data set Z1, p(B) is the ratio of the number of samples B in the data set Z2 to the total number of data items in the data set Z2, and the remaining term denotes the j-th sample of class C that is not in sample B, the number of samples j being k.

5. The computing engine data processing method according to claim 1, wherein step S3 includes:

6. The method of claim 5, wherein in step S3.1, the real-time data stream sorted by category is received and segmented by data category into $L_1, L_2, L_3, \dots, L_{n-1}, L_n$, where the data amounts of the n segments are $R_1, R_2, R_3, \dots, R_{n-1}, R_n$, the minimum data amount being $R_{\min}$ and the maximum data amount being $R_{\max}$; the data amount of a single segment is set to be no lower than $R_0$, and the data amount of a single segment is set to satisfy the following formula:

。

7. The method of claim 5, wherein in step S3.2, at the start of a batch, the start time $t_1$ of the current batch is submitted to the rate controller;

$V = V_{late} - K_p \times e_r - K_i \times E_r - K_d \times d_r$;

$V_{late} = (t + t_w) / NR_n$;

8. A componentization-based computing engine data processing system for implementing the computing engine data processing method according to any one of claims 1-7, comprising: a plurality of micro-engines, a feature decision unit, a control unit, and a database;

9. The computing engine data processing system of claim 8, wherein the control unit further comprises: a segmentation module and a rate controller;