CN111526188A - System and method for ensuring zero data loss based on Spark Streaming in combination with Kafka - Google Patents

System and method for ensuring zero data loss based on Spark Streaming in combination with Kafka

Info

Publication number
CN111526188A
Authority
CN
China
Prior art keywords
data
kafka
copy
leader
offset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010281180.0A
Other languages
Chinese (zh)
Other versions
CN111526188B (en
Inventor
王婧妍
徐晶
石波
胡佳
谢小明
施雪成
丁卫星
李渊
杨坤崇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Computer Technology and Applications
Original Assignee
Beijing Institute of Computer Technology and Applications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Computer Technology and Applications filed Critical Beijing Institute of Computer Technology and Applications
Priority to CN202010281180.0A priority Critical patent/CN111526188B/en
Publication of CN111526188A publication Critical patent/CN111526188A/en
Application granted granted Critical
Publication of CN111526188B publication Critical patent/CN111526188B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/568 Storing data temporarily at an intermediate stage, e.g. caching
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/04 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L63/0407 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the identity of one or more communicating identities is hidden
    • H04L63/0421 Anonymous communication, i.e. the party's identifiers are hidden from the other party or parties, e.g. using an anonymizer
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1095 Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a system and a method for ensuring zero data loss based on Spark Streaming combined with Kafka, and belongs to the technical field of real-time processing and sorting of streaming data. The system comprises a data caching module and a stream calculation module. The data caching module uses Kafka to cache data acquired from different sources and forwards it to the stream calculation module, setting parameters at Kafka's data production end, zookeeper cluster end, and data consumption end to prevent data loss. The stream calculation module uses the Kafka Direct API, a unique data ID, and partition offsets added to each piece of data so that no data is lost and no data is consumed twice. The two modules complement each other to guarantee zero data loss and reliable message transmission.

Description

System and method for ensuring zero data loss based on Spark Streaming in combination with Kafka
Technical Field
The invention belongs to the technical field of real-time processing and sorting of streaming data, and particularly relates to a system and a method for ensuring zero data loss based on Spark Streaming combined with Kafka.
Background
With the advent and popularization of the information age, data informatization is closely tied to daily life and work. The daily operation of an enterprise often generates TB-level data, with sources covering every type of data that Internet appliances can capture. Faced with such a huge volume of logs, traditional log processing frameworks can no longer meet current demands. Big data analysis is the analysis of data at enormous scale, and the real-time requirements that system services place on data are gradually increasing. Real-time big data analysis uses big data technology to complete the analysis of huge data sets efficiently and quickly, achieving near real-time results and reflecting the value and significance of the data in a more timely way. Real-time processing is widely applied, for example in the real-time recommendation scenarios of business departments, the real-time reports of data departments, and the real-time monitoring of operation and maintenance departments. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the streaming data in a consumer-scale website; through Kafka's real-time data streams, feedback can be obtained promptly when an event occurs. Spark Streaming realizes the processing of real-time streaming data with high throughput and a fault-tolerance mechanism: it supports acquiring data from multiple data sources, processing it with high-level functions and complex algorithms, and finally storing the results in a file system or database. Kafka, used as a message system, provides message persistence capability, but it carries a hidden danger of data loss. How to configure Kafka to ensure zero data loss, and how to guarantee reliable message transmission when combining it with Spark Streaming, is a technical problem to be solved urgently.
The support technology of the real-time data platform mainly comprises four aspects: real-time data collection (e.g., FLUME), message middleware (e.g., Kafka), stream computation frameworks (e.g., Storm, Spark, Flink, and Beam), and real-time storage of data (e.g., HBase for column family storage). The most central technology of the real-time data platform is stream computing.
Disclosure of Invention
Technical problem to be solved
The technical problem to be solved by the invention is: how to prevent Kafka from losing data during transmission and consumption, and how to ensure reliability of the data consumption process.
(II) technical scheme
In order to solve the above technical problem, the invention provides a system for real-time stream processing based on the Kafka partitioning technology and Spark Streaming, which comprises a data caching module and a stream calculation module, wherein the data caching module is used for caching data acquired from different sources and forwarding the data to the stream calculation module, and the stream calculation module is used for processing the data after the data are read.
Preferably, the data caching module is specifically configured to use Kafka to cache the data acquired from different sources and forward it to the stream calculation module; when doing so, parameters are set from three aspects, namely Kafka's data production end, the zookeeper cluster end, and the data consumption end, so as to prevent data loss.
Preferably, the setting of parameters at the zookeeper cluster end by the data caching module to prevent data loss is specifically as follows. Kafka guarantees the order of messages within a partition: messages sent to a Kafka partition first are consumed first within that partition. Each Kafka topic has multiple partitions, each partition has multiple copies, and the copies are divided into one leader copy and the remaining follower copies. All messages are sent to the leader copy, and message consumption is also served from the leader copy and then synchronized to the other copies. When the leader copy becomes unavailable, a follower copy is elected as the new leader copy. A follower copy that keeps synchronized with the leader copy is a synchronized copy; one that cannot keep up is a non-synchronized copy. If the leader copy goes down, a follower copy must be elected as leader, and if a non-synchronized copy becomes the leader, part of the data is lost; this behavior is called unclean leader election. For such situations, the corresponding parameter is set to false to prevent unclean leader election, or the minimum number of synchronized copies is set to 1 to ensure that 1 synchronized copy exists when the host goes down.
Preferably, the setting of parameters at the data production end by the data caching module to prevent data loss is specifically as follows: after receiving a message, Kafka returns an ack parameter; with ack=1, once the leader copy has successfully written the message, the zookeeper cluster end, acting as the server side, feeds back a success response, so ack is set to 1.
Preferably, the setting of parameters at the data consumption end by the data caching module to prevent data loss is specifically as follows: a manual offset update is configured, either committing after a batch has been consumed, or using an accumulator so that when an exception occurs the offset of the record that failed processing is committed and the next consumption starts from that committed offset.
Preferably, the stream calculation module specifically uses Kafka Direct API, sets a data unique ID, and adds a partition offset to the data to solve the data loss problem.
Preferably, the method by which the stream calculation module uses the Kafka Direct API to solve the data loss problem is specifically: the Kafka Direct API uses the Spark Driver to calculate the range of offsets in Kafka that the next batch needs to process, and consumes data directly from the Kafka topic partitions.
Preferably, the method by which the stream calculation module sets a unique data ID to solve the data loss problem is specifically: when writing to the database, an upsert statement is adopted, updating the record if it already exists and inserting it if it does not; this method also sets the offset commit mode of Direct DStream consumption.
Preferably, the method by which the stream calculation module adds partition offsets to solve the data loss problem is specifically: the offset of each partition is added to each piece of data, and if the program goes down, the latest partition offset information is read from the database after restart.
The invention also provides a method for realizing zero data loss in the data storage and transmission process by using the system.
(III) advantageous effects
By means of the high scalability and high reliability of Kafka, the invention collects and summarizes the data acquired from data sources and persists it to disk, reducing the probability of data loss. The Kafka offset mechanism combined with Spark Streaming solves the problem of data loss during Kafka transmission and consumption, so that the data can be processed efficiently in real time while its reliability and reliable message transmission are guaranteed.
Drawings
FIG. 1 is a schematic diagram of zookeeper cluster replica classification in the present invention;
FIG. 2 is a schematic diagram of the ack mechanism at the Kafka production end in the present invention;
FIG. 3 is a schematic diagram of the Kafka Direct API of the present invention.
Detailed Description
In order to make the objects, contents, and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.
The invention provides a system and a corresponding method for ensuring zero data loss based on the combination of Kafka and Spark Streaming. The system for real-time stream processing based on the Kafka partitioning technology and Spark Streaming comprises a data caching module and a stream calculation module. The data caching module caches the data acquired from different sources and forwards it to the stream calculation module. After reading the data, the stream calculation module processes it so that no data is lost and no data is consumed twice. The two modules complement each other to guarantee zero data loss. The invention uses the Kafka offset mechanism in combination with Spark Streaming to solve the problem of data loss during Kafka transmission and consumption and to ensure reliability of the data consumption process.
1. Data caching module
Kafka is a distributed, highly available, high-throughput messaging system. It has excellent message persistence capability: messages are persisted to local disk, partitioned, to prevent loss, and constant-time access performance is maintained even for data above the TB level. Running on zookeeper, it is fault tolerant and allows nodes in the cluster to fail without data loss. By sending compressed data in batches, data transmission overhead is reduced and throughput improved. Partitions are supported, and messages are ordered within the same partition, although global message ordering cannot be achieved. Every message in a Kafka partition carries a continuous sequence number, the offset, which uniquely identifies the message and records the sequence number of the next message to be provided to the consumer. Kafka carries a hidden danger of data loss: if the consumer has finished reading and the offset has already been committed, but Spark Streaming crashes before processing is finished, the offset has nevertheless been advanced and the unprocessed data can no longer be consumed, so that data is lost. The invention prevents data loss from three aspects: Kafka's data production end, the zookeeper cluster end, and the data consumption end.
1.1 Zookeeper cluster end
Kafka guarantees the order of messages within a partition: messages sent to a Kafka partition first are consumed first within that partition. Each Kafka topic has multiple partitions, each partition has multiple copies, one of which is the leader copy and the rest follower copies. All messages are sent to the leader copy, and message consumption is also served from the leader copy and then synchronized to the other copies. When the leader copy becomes unavailable, a follower copy is elected to become the new leader copy. As shown in FIG. 1, a follower copy that keeps synchronized with the leader copy is a synchronized copy, and one that cannot keep up is a non-synchronized copy. If the leader copy goes down, a follower copy must be elected as leader, and if a non-synchronized copy becomes the leader, part of the data is lost. This behavior is called unclean leader election. For such situations, the corresponding parameter (in Kafka, unclean.leader.election.enable) is set to false to prevent unclean leader election. Alternatively, the minimum number of synchronized copies can be set to 1, ensuring that 1 synchronized copy exists when the host goes down.
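The data loss caused by electing a non-synchronized copy as leader (unclean leader election) can be illustrated with a small in-memory sketch. This is a toy model of the semantics only, not actual broker code; `Replica` and `elect_leader` are hypothetical names:

```python
# Toy model of a partition with one leader and two follower replicas.
# Illustrates why electing an out-of-sync replica as leader loses data.

class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []          # messages this replica has persisted

def elect_leader(replicas, in_sync, allow_unclean):
    """Pick a new leader; with allow_unclean=False only in-sync replicas qualify."""
    candidates = [r for r in replicas if allow_unclean or r in in_sync]
    if not candidates:
        raise RuntimeError("no eligible leader (partition unavailable)")
    return candidates[0]

leader = Replica("leader")
f1, f2 = Replica("f1"), Replica("f2")

# f1 keeps up with the leader; f2 lags behind (non-synchronized).
for msg in ["m1", "m2", "m3"]:
    leader.log.append(msg)
    f1.log.append(msg)
f2.log = ["m1"]                # f2 only replicated the first message

# The leader crashes. An unclean election may pick the lagging f2,
# silently discarding m2 and m3; a clean election must pick f1.
survivors, isr = [f2, f1], {f1}
unclean = elect_leader(survivors, isr, allow_unclean=True)
clean = elect_leader(survivors, isr, allow_unclean=False)
print(unclean.name, unclean.log)   # f2 ['m1']  -> m2, m3 are gone
print(clean.name, clean.log)       # f1 ['m1', 'm2', 'm3'] -> no loss
```

In real Kafka the second behavior corresponds to setting unclean.leader.election.enable to false on the broker, usually together with min.insync.replicas.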
1.2 Data production end
After Kafka receives a message, it returns an ack parameter. The ack parameter takes the values 0, 1, and all, representing different acknowledgement modes, as shown in FIG. 2. With ack=0, the producer does not wait for any response from the server after sending a message. With ack=1, the server side (the zookeeper cluster end) feeds back a success response once the leader copy has successfully written the message. With ack=all, the server side feeds back a success response only after all copies in Kafka have successfully written the data. ack=all ensures high reliability but reduces throughput. Therefore ack is set to 1 in this step, which ensures data reliability while preserving Kafka's high throughput.
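The trade-off between the ack modes can be sketched with a toy simulation of what has been durably replicated at acknowledgement time. This is illustrative only; `produce` and its arguments are hypothetical names, not the Kafka producer API:

```python
# Toy model of producer acks: which messages were acknowledged vs. which
# actually survive a leader crash before replication has caught up.

def produce(messages, acks, replicate):
    """Simulate sending messages; `replicate` says whether followers catch up
    before the leader crashes. Returns (acked, surviving) message lists."""
    leader, followers = [], [[], []]
    acked = []
    for m in messages:
        leader.append(m)
        if replicate:
            for f in followers:
                f.append(m)
        # acks=1 confirms after the leader write alone; acks='all' confirms
        # only once every in-sync follower also holds the message.
        if acks == 1 or (acks == "all" and replicate):
            acked.append(m)
    # Leader crashes; a follower takes over, so only replicated data survives.
    surviving = followers[0]
    return acked, surviving

# acks=1: all three messages were acked, yet none were replicated -> silent loss.
acked1, kept1 = produce(["a", "b", "c"], acks=1, replicate=False)
# acks='all': nothing is acked until followers have it -> no silent loss.
ackedA, keptA = produce(["a", "b", "c"], acks="all", replicate=False)
print(acked1, kept1)   # ['a', 'b', 'c'] []
print(ackedA, keptA)   # [] []
```

This is why acks=all is the most reliable mode, while acks=1, as chosen here, trades a small loss window for higher throughput.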
1.3 Data consumption end
A manual offset update is configured: the automatic-commit switch (in Kafka, enable.auto.commit) is set to false. With automatic offset commit only, consider pulling 30 records: the offset is committed automatically while only 20 records have been processed, an exception occurs while processing record 21, and when data is pulled again, reading starts after record 30, so records 21 to 30 are lost. To prevent such data loss, automatic commit is changed to manual commit: either commit after a batch has been consumed, or use an accumulator so that when an exception occurs the offset of the record that failed processing is committed and the next consumption starts from that committed offset.
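The 30-record example above can be reproduced with a minimal simulation of the two commit policies. This is a sketch of the offset semantics only; `consume` is a hypothetical helper, not the real consumer API:

```python
# Toy consumer loop contrasting eager auto-commit with manual
# commit-after-processing, using the 30-record scenario from the text.

def consume(records, fail_at, auto_commit):
    """Process records; crash at index `fail_at`. Return the committed offset,
    which is where a restarted consumer would resume reading."""
    committed = 0
    for i, rec in enumerate(records):
        if auto_commit:
            committed = len(records)   # whole fetched batch committed eagerly
        if i == fail_at:
            return committed           # exception: restart resumes at `committed`
        # ... process rec ...
        if not auto_commit:
            committed = i + 1          # commit only what was fully processed

    return committed

records = list(range(30))
# Auto-commit: crash at record 21 (index 20) but offset 30 is already
# committed, so records 21..30 are skipped on restart -> lost.
assert consume(records, fail_at=20, auto_commit=True) == 30
# Manual commit: restart re-reads from record 21 -> nothing lost.
assert consume(records, fail_at=20, auto_commit=False) == 20
```

With the real consumer this corresponds to disabling enable.auto.commit and committing offsets explicitly after each processed batch.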
2. Stream calculation module
Spark Streaming is an extension of the Spark core API that supports scalable, high-throughput, fault-tolerant processing of real-time data streams. Spark Streaming receives the real-time data stream and decomposes it into a series of short batch operations, the DStream, converting each batch into an RDD (resilient distributed dataset), then applying transformations to the RDDs and keeping the results in memory. In the receiver-based approach, Kafka's data is received by a Spark Streaming receiver and stored in Spark; once the data is stored, the Kafka offset in zookeeper is updated, so scenarios of data loss can still arise. The invention uses the Kafka Direct API, sets a unique data ID, and adds the partition offset to the data to solve the data loss problem.
2.1 Kafka Direct API: the streaming data from Kafka is consumed and processed in a way that guarantees zero data loss and prevents repeated consumption. As shown in FIG. 3, the Kafka Direct API uses the Spark Driver to calculate the range of offsets in Kafka that the next batch needs to process, and consumes data directly from the Kafka topic partitions.
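The driver-side offset-range computation can be sketched as follows. `next_offset_ranges` is a hypothetical helper illustrating the idea; in the real Direct API the driver builds per-partition OffsetRange values internally for each micro-batch:

```python
# Sketch of the Direct-API idea: the driver computes, for every batch, the
# exact [from, to) offset range to read from each partition, so each record
# is consumed exactly once per batch without a separate receiver.

def next_offset_ranges(last_processed, latest):
    """Map partition -> (from_offset, to_offset) for the next batch.
    `last_processed` holds offsets already handled; `latest` the current
    log-end offsets reported by the brokers. New partitions start at 0."""
    return {p: (last_processed.get(p, 0), latest[p]) for p in latest}

last = {"p0": 100, "p1": 250}
latest = {"p0": 130, "p1": 250, "p2": 40}   # p2 is a newly seen partition
ranges = next_offset_ranges(last, latest)
print(ranges)   # {'p0': (100, 130), 'p1': (250, 250), 'p2': (0, 40)}
```

An empty range such as p1's (250, 250) simply means that partition contributes no records to this batch.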
2.2 Setting a unique data ID: when writing to the database, an upsert statement is adopted, updating the record if it already exists and inserting it if it does not. This method requires setting the offset commit mode of Direct DStream consumption accordingly.
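The upsert-by-unique-ID idea can be illustrated with a toy sink, where a dict stands in for the database table keyed by the unique ID. A sketch only; the record shape is an assumption:

```python
# Toy idempotent sink: replaying a batch after a failure leaves exactly one
# copy of each record, because writes are keyed by the unique data ID.

def upsert(db, records):
    """Insert each record if its ID is absent, otherwise update it in place."""
    for rec in records:
        db[rec["id"]] = rec["value"]

db = {}
batch = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
upsert(db, batch)
upsert(db, batch)          # batch replayed after a crash: no duplicates
assert db == {1: "a", 2: "b"}
```

Because the write is idempotent, re-consuming a batch whose offset was never committed cannot produce duplicate rows.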
2.3 Adding partition offsets: the offset of each partition is added to each piece of data. If the program goes down, the latest partition offset information is read from the database after restart, which guarantees the atomicity of the data and its offset and solves both the data loss and the repeated consumption problems.
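Storing each record together with its partition offset can be sketched with a toy store; in a real system the two writes below would share a single database transaction, which is what gives the atomicity the text describes:

```python
# Sketch of writing data and its partition offset atomically, so a restarted
# job can resume from exactly the last persisted offset (toy "database").

class Store:
    def __init__(self):
        self.rows = []
        self.offsets = {}                  # partition -> next offset to read

    def write_atomically(self, partition, offset, value):
        # In practice: one transaction covering both the row and the offset.
        self.rows.append(value)
        self.offsets[partition] = offset + 1

store = Store()
for off, val in [(0, "a"), (1, "b"), (2, "c")]:
    store.write_atomically("p0", off, val)

# After a crash and restart, consumption resumes from the persisted offset,
# so nothing is lost and nothing is re-processed.
resume_from = store.offsets["p0"]
assert resume_from == 3
assert store.rows == ["a", "b", "c"]
```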
The invention collects and summarizes the data acquired from the data sources and persists it to disk by utilizing the high scalability and high reliability of Kafka, thereby reducing the probability of data loss. Combined with Spark Streaming data processing, the data can be processed efficiently in real time and its reliability guaranteed.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A system for real-time stream processing based on the Kafka partitioning technology and Spark Streaming, characterized by comprising a data caching module and a stream calculation module, wherein the data caching module is used for caching data acquired from different sources and forwarding the data to the stream calculation module; and the stream calculation module is used for processing the data after the data are read.
2. The system according to claim 1, wherein the data caching module is specifically configured to implement caching of data acquired from different sources by using Kafka and forward the data to the stream computation module, and when implementing caching of data acquired from different sources by using Kafka and forward the data to the stream computation module, parameters are set to prevent data loss in three aspects, specifically, from a data production end, a zookeeper cluster end, and a data consumption end of Kafka.
3. The system of claim 2, wherein the setting of parameters at the zookeeper cluster end by the data caching module to prevent data loss is specifically: Kafka guarantees the order of messages within a partition, and messages sent to a Kafka partition first are consumed first within that partition; each Kafka topic has multiple partitions, each partition has multiple copies, divided into one leader copy and the remaining follower copies; all messages are sent to the leader copy, and message consumption is also served from the leader copy and then synchronized to the other copies; when the leader copy is unavailable, a follower copy is elected as the new leader copy; a follower copy that keeps synchronized with the leader copy is a synchronized copy, and one that cannot keep up is a non-synchronized copy; if the leader copy goes down, a follower copy must be elected as leader, and if a non-synchronized copy becomes the leader, part of the data is lost, a behavior called unclean leader election; for such situations, a parameter is set to false to prevent unclean leader election, or the minimum number of synchronized copies is set to 1 to ensure that 1 synchronized copy exists when the host goes down.
4. The system of claim 2, wherein the setting of parameters at the data production end by the data caching module to prevent data loss is specifically: after receiving a message, Kafka returns an ack parameter; with ack=1, once the leader copy has successfully written the message, the zookeeper cluster end, as the server side, feeds back a success response, so ack is set to 1.
5. The system of claim 2, wherein the setting of parameters at the data consumption end by the data caching module to prevent data loss is specifically: a manual offset update is configured, either committing after a batch has been consumed, or using an accumulator so that when an exception occurs the offset of the record that failed processing is committed and the next consumption starts from that committed offset.
6. The system of claim 2, wherein the stream calculation module solves the data loss problem by using one of the Kafka Direct API, setting a unique data ID, and adding a partition offset to the data.
7. The system of claim 6, wherein the stream calculation module solves the data loss problem specifically using the Kafka Direct API: the Kafka Direct API uses the Spark Driver to calculate the range of offsets in Kafka that the next batch needs to process, consuming data directly from the Kafka topic partitions.
8. The system of claim 6, wherein the stream calculation module solves the data loss problem specifically using the method of setting a unique data ID: when writing to the database, an upsert statement is adopted, updating the record if it already exists and inserting it if it does not, and the method sets the offset commit mode of Direct DStream consumption.
9. The system of claim 6, wherein the stream calculation module solves the data loss problem specifically using the method of adding partition offsets: the offset of each partition is added to each piece of data, and if the program goes down, the latest partition offset information is read from the database after restart.
10. A method of achieving zero data loss during data storage and transmission using the system of any one of claims 1 to 9.
CN202010281180.0A 2020-04-10 2020-04-10 System and method for ensuring zero data loss based on Spark Streaming in combination with Kafka Active CN111526188B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010281180.0A CN111526188B (en) 2020-04-10 2020-04-10 System and method for ensuring zero data loss based on Spark Streaming in combination with Kafka

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010281180.0A CN111526188B (en) 2020-04-10 2020-04-10 System and method for ensuring zero data loss based on Spark Streaming in combination with Kafka

Publications (2)

Publication Number Publication Date
CN111526188A true CN111526188A (en) 2020-08-11
CN111526188B CN111526188B (en) 2022-11-22

Family

ID=71901685

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010281180.0A Active CN111526188B (en) 2020-04-10 2020-04-10 System and method for ensuring zero data loss based on Spark Streaming in combination with Kafka

Country Status (1)

Country Link
CN (1) CN111526188B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112087501A (en) * 2020-08-28 2020-12-15 北京明略昭辉科技有限公司 Transmission method and system for keeping data consistency
CN112269765A (en) * 2020-11-13 2021-01-26 中盈优创资讯科技有限公司 Method and device for improving data source reading performance of Spark structured stream file
CN115604290A (en) * 2022-12-13 2023-01-13 云账户技术(天津)有限公司(Cn) Kafka message execution method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160321308A1 (en) * 2015-05-01 2016-11-03 Ebay Inc. Constructing a data adaptor in an enterprise server data ingestion environment
CN106776855A (en) * 2016-11-29 2017-05-31 上海轻维软件有限公司 The processing method of Kafka data is read based on Spark Streaming
US20190102266A1 (en) * 2017-09-29 2019-04-04 Oracle International Corporation Fault-tolerant stream processing
CN110908788A (en) * 2019-12-02 2020-03-24 北京锐安科技有限公司 Spark Streaming based data processing method and device, computer equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160321308A1 (en) * 2015-05-01 2016-11-03 Ebay Inc. Constructing a data adaptor in an enterprise server data ingestion environment
CN106776855A (en) * 2016-11-29 2017-05-31 上海轻维软件有限公司 The processing method of Kafka data is read based on Spark Streaming
US20190102266A1 (en) * 2017-09-29 2019-04-04 Oracle International Corporation Fault-tolerant stream processing
CN110908788A (en) * 2019-12-02 2020-03-24 北京锐安科技有限公司 Spark Streaming based data processing method and device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王岩等: "一种基于Kafka的可靠的Consumer的设计方案", 《软件》 *
韩德志等: "基于Spark Streaming的实时数据分析***及其应用", 《计算机应用》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112087501A (en) * 2020-08-28 2020-12-15 北京明略昭辉科技有限公司 Transmission method and system for keeping data consistency
CN112087501B (en) * 2020-08-28 2023-10-24 北京明略昭辉科技有限公司 Transmission method and system for maintaining data consistency
CN112269765A (en) * 2020-11-13 2021-01-26 中盈优创资讯科技有限公司 Method and device for improving data source reading performance of Spark structured stream file
CN115604290A (en) * 2022-12-13 2023-01-13 云账户技术(天津)有限公司(Cn) Kafka message execution method, device, equipment and storage medium
CN115604290B (en) * 2022-12-13 2023-03-24 云账户技术(天津)有限公司 Kafka message execution method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111526188B (en) 2022-11-22

Similar Documents

Publication Publication Date Title
CN111526188B (en) System and method for ensuring zero data loss based on Spark Streaming in combination with Kafka
CN105959151B (en) A kind of Stream Processing system and method for High Availabitity
CN106953901B (en) Cluster communication system and method for improving message transmission performance
US9917913B2 (en) Large message support for a publish-subscribe messaging system
US20210112013A1 (en) Message broker system with parallel persistence
US20200059376A1 (en) Eventually consistent data replication in queue-based messaging systems
CN112507029B (en) Data processing system and data real-time processing method
CN101277272B (en) Method for implementing magnanimity broadcast data warehouse-in
US11595474B2 (en) Accelerating data replication using multicast and non-volatile memory enabled nodes
CN105493474B (en) System and method for supporting partition level logging for synchronizing data in a distributed data grid
CN110795503A (en) Multi-cluster data synchronization method and related device of distributed storage system
Spirovska et al. Wren: Nonblocking reads in a partitioned transactional causally consistent data store
CN111787055A (en) Redis-based transaction mechanism and multi-data center oriented data distribution method and system
CN112527844A (en) Data processing method and device and database architecture
Oleson et al. Operational information systems: An example from the airline industry
CN112965839A (en) Message transmission method, device, equipment and storage medium
US8359601B2 (en) Data processing method, cluster system, and data processing program
US20210357275A1 (en) Message stream processor microbatching
CN114676199A (en) Synchronization method, synchronization system, computer equipment and storage medium
CN116226139B (en) Distributed storage and processing method and system suitable for large-scale ocean data
US8201017B2 (en) Method for queuing message and program recording medium thereof
EP2025133B1 (en) Repository synchronization in a ranked repository cluster
CN116304390B (en) Time sequence data processing method and device, storage medium and electronic equipment
CN108989465B (en) Consensus method, server, storage medium and distributed system
US20100275217A1 (en) Global attribute uniqueness (gau) using an ordered message service (oms)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant