CN106776855A

CN106776855A - Processing method for reading Kafka data based on Spark Streaming

Info

Publication number: CN106776855A
Application number: CN201611069230.9A
Authority: CN
Inventors: 程永新; 谢涛; 王仁铮
Original assignee: Shanghai Qingwei Software Co Ltd
Current assignee: Shanghai Qingwei Software Co Ltd
Priority date: 2016-11-29
Filing date: 2016-11-29
Publication date: 2017-05-31
Anticipated expiration: 2036-11-29
Also published as: CN106776855B

Abstract

The invention discloses a processing method for reading Kafka data based on Spark Streaming, comprising the following steps: S1) storing data in topics using Kafka; S2) using Spark Streaming to cut the real-time input data stream into blocks by time slice; S3) setting the Spark Streaming backfill scheduling time in advance according to the Kafka failure records; S4) monitoring in real time the process by which Spark Streaming reads Kafka data; S5) re-reading the Kafka data through Spark Streaming. The invention sets the Spark Streaming backfill scheduling time according to the Kafka failure records, monitors the reading process in real time, and re-reads the failure records to backfill the data, so that a zero-data-loss guarantee can be achieved more flexibly and conveniently.

Description

Processing method for reading Kafka data based on Spark Streaming

Technical field

The present invention relates to a Kafka data processing method, and more particularly to a processing method for reading Kafka data based on Spark Streaming.

Background art

Spark Streaming decomposes a streaming computation into a series of short batch jobs. The batch engine is Spark: the input data of Spark Streaming is divided by batch size (for example, 1 second) into successive segments (a Discretized Stream, or DStream), and each segment is converted into an RDD (Resilient Distributed Dataset) in Spark. Transformation operations on a DStream in Spark Streaming then become Transformation operations on RDDs in Spark, and the intermediate results produced by those operations are kept in memory. Depending on business requirements, the streaming computation can accumulate the intermediate results or store them on external devices. Fig. 1 shows the overall flow of Spark Streaming.
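As a rough illustration of the time-slice cutting described above, the following Python sketch groups timestamped records into fixed-width batches. It is a plain-Python model rather than Spark code; the function `slice_into_batches` and the sample records are hypothetical names introduced only for illustration.

```python
from collections import defaultdict

def slice_into_batches(records, batch_seconds=1.0):
    """Group (timestamp, value) records into batches by time-slice index.

    Hypothetical helper: models how a stream is cut into per-interval
    blocks, each of which would become one RDD in Spark Streaming.
    """
    batches = defaultdict(list)
    for ts, value in records:
        index = int(ts // batch_seconds)  # time slice the record falls in
        batches[index].append(value)
    # Return the non-empty batches in time order
    return [batches[i] for i in sorted(batches)]

records = [(0.1, "a"), (0.7, "b"), (1.2, "c"), (2.5, "d"), (2.9, "e")]
batches = slice_into_batches(records, batch_seconds=1.0)
```

With a 1-second batch size, the five sample records fall into three consecutive slices, so `batches` holds three lists.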

Kafka is a distributed publish-subscribe messaging system. It was originally developed by LinkedIn and later became part of the Apache project. Kafka is a distributed, partitioned, replicated, persistent log service, used mainly to process active streaming data, as shown in Fig. 2.

As is well known, the big data era places ever higher demands on the timeliness, stability, and accuracy of data processing. A combined architecture that has recently become popular connects Spark Streaming to Kafka, using the in-memory iterative computation of Spark Streaming together with the high-concurrency data distribution capability of Kafka to achieve real-time data processing. However, when Spark Streaming is connected to Kafka, potential data-loss scenarios are still unavoidable. The detailed process is as follows:

1. Two Executors receive input data from the receiver and cache it in Executor memory.

2. The receiver notifies the input source that the data has been received.

3. The Executors start processing the cached data according to the application code.

4. At this point the Driver suddenly crashes.

5. By design, once the Driver crashes, all the Executors it maintains are killed as well.

6. Since all the Executors have been killed, the data cached in their memory is lost too. As a result, cached data that has already been acknowledged to the data source but not yet processed is lost.

7. The cached data cannot be recovered, because it was held only in Executor memory, so the data is lost.
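The loss scenario above can be modelled in a few lines of Python. This is only a toy simulation under simplifying assumptions (one in-memory buffer, a crash after a fixed number of processed records); `simulate_receiver_crash` is a hypothetical name, not part of Spark or Kafka.

```python
def simulate_receiver_crash(incoming, processed_before_crash):
    """Toy model: data is acknowledged on receipt, held only in Executor
    memory, and the Driver dies after some records have been processed."""
    acked = list(incoming)                        # steps 1-2: cached and acknowledged
    processed = acked[:processed_before_crash]    # step 3: partial processing
    executor_memory = acked[processed_before_crash:]  # still only in memory
    # Steps 4-6: the Driver crash kills the Executors and wipes their memory
    lost = executor_memory
    executor_memory = []
    return processed, lost

processed, lost = simulate_receiver_crash(["m1", "m2", "m3", "m4"], 1)
```

Every record in `lost` was already acknowledged to the source, which is why the source will not resend it on its own.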

Therefore, a method that achieves zero data loss is urgently needed to ensure the stability of Spark Streaming when it processes Kafka data.

Summary of the invention

The technical problem to be solved by the invention is to provide a processing method for reading Kafka data based on Spark Streaming that effectively prevents data loss and re-consumes data from Kafka after failure recovery, so that a zero-data-loss guarantee can be achieved more flexibly and conveniently even when the Spark Streaming program fails.

To solve the above technical problem, the invention provides a processing method for reading Kafka data based on Spark Streaming, comprising the following steps: S1) storing data in topics using Kafka, each topic containing a configurable number of partitions; S2) using Spark Streaming to cut the real-time input data stream into blocks by time slice, each block generating one Spark Job for processing; S3) setting the Spark Streaming backfill scheduling time in advance according to the Kafka failure records; S4) monitoring in real time the process by which Spark Streaming reads Kafka data; S5) re-reading, through Spark Streaming, the Kafka data lost to failure, according to the Kafka failure records and the scheduling time.

In the above processing method for reading Kafka data based on Spark Streaming, step S3) uses a relational database to create two database tables, a scheduling table and a failure record table. The scheduling table stores the scheduling id, start time, end time, status, and creation time; the failure record table stores the failure record id, offset, Kafka topic, and Kafka node list. The scheduling id in the scheduling table and the failure record id in the failure record table are in a primary key/foreign key relationship.

In the above processing method for reading Kafka data based on Spark Streaming, step S4) includes: while Spark Streaming reads Kafka data, if the corresponding Kafka topic data is not empty, obtaining the offset of the data read from Kafka and storing the data offset, Kafka topic, and Kafka node list in the failure record table of the relational database; if data processing raises an exception, changing the status in the data table to failed.

In the above processing method for reading Kafka data based on Spark Streaming, in step S4) Spark Streaming connects directly to the Kafka nodes in Direct mode and obtains the offset of the data read from Kafka through the createDirectStream method, while marking the status in the scheduling table as in progress; if an exception occurs while Spark Streaming is reading and processing Kafka data and the program cannot execute normally, the status in the scheduling table is changed to failed.

In the above processing method for reading Kafka data based on Spark Streaming, step S5) includes: first scanning the scheduling table using the status field as the query condition, sorting in descending order by the creation-time field, and obtaining the earliest scheduling record; then obtaining its scheduling id and using that field as the condition for querying the failure record table to obtain all Kafka failure records; and then re-reading the Kafka data according to the Kafka topic and offset.

In the above processing method for reading Kafka data based on Spark Streaming, step S4) first reads the scheduling table and the failure record table from the relational database and caches them in memory, then uses a thread to refresh the cached data on a timer for real-time monitoring.
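One possible shape for the cache-and-refresh step is sketched below in Python, with SQLite standing in for the relational database. The class name `TableCache`, its parameters, and the choice of one connection per refresh are assumptions made for illustration; the patent does not specify them.

```python
import sqlite3
import threading

class TableCache:
    """Hypothetical in-memory cache of a query result, refreshed on a timer thread."""

    def __init__(self, db_path, query, interval_seconds):
        self.db_path = db_path
        self.query = query
        self.interval = interval_seconds
        self.rows = []
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def refresh(self):
        # Open a connection per refresh so it is only used by the calling thread
        conn = sqlite3.connect(self.db_path)
        try:
            rows = conn.execute(self.query).fetchall()
        finally:
            conn.close()
        with self._lock:
            self.rows = rows

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called
        while not self._stop.wait(self.interval):
            self.refresh()

    def start(self):
        self.refresh()          # initial load into memory
        self._thread.start()    # periodic refresh afterwards

    def stop(self):
        self._stop.set()
        self._thread.join()

    def snapshot(self):
        with self._lock:
            return list(self.rows)
```

A monitoring service could create one `TableCache` per table, call `start()` once, and read `snapshot()` whenever it needs the current status without querying the database each time.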

Compared with the prior art, the invention has the following beneficial effects: the processing method for reading Kafka data based on Spark Streaming provided by the invention sets the Spark Streaming backfill scheduling time according to the Kafka failure records, monitors the reading process in real time, and re-reads the failure records to backfill the data. It can therefore effectively prevent data loss and re-consume data from Kafka after failure recovery, so that a zero-data-loss guarantee can be achieved more flexibly and conveniently even when the Spark Streaming program fails.

Brief description of the drawings

Fig. 1 is the Spark Streaming architecture diagram used by the invention;

Fig. 2 is a schematic diagram of Kafka processing streaming data, as used by the invention;

Fig. 3 is the model structure diagram of the scheduling table and the failure record table of the invention;

Fig. 4 is the monitoring flow chart of the invention for reading Kafka data based on Spark Streaming;

Fig. 5 is the failure-record backfill flow chart of the invention.

Detailed description of the embodiments

The invention is further described below with reference to the drawings and embodiments.

The processing method for reading Kafka data based on Spark Streaming provided by the invention uses a relational database to create two database tables: a scheduling table (control) and a failure record table (failure). The scheduling table stores scheduling information, including the scheduling id, start time, end time, status, and creation time. The failure record table stores the details of the specific lost data records, including the failure record id, offset, topic, and Kafka node list. The scheduling id in the scheduling table and the id of the failure record table are in a primary key/foreign key relationship.
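A minimal sketch of the two tables is given below, using Python with SQLite as a stand-in relational database. The table names `control` and `failure` come from the text; the column names, column types, status values, and the separate `control_id` foreign-key column are assumptions, since the actual structure is defined only in Fig. 3.

```python
import sqlite3

# Column names and types are illustrative assumptions, not taken from Fig. 3
SCHEMA = """
CREATE TABLE control (
    id          INTEGER PRIMARY KEY,   -- scheduling id
    start_time  TEXT,
    end_time    TEXT,
    status      TEXT,                  -- e.g. 'in_progress', 'failed', 'success'
    create_time TEXT
);
CREATE TABLE failure (
    id           INTEGER PRIMARY KEY,  -- failure record id
    control_id   INTEGER NOT NULL REFERENCES control(id),
    kafka_offset INTEGER,
    topic        TEXT,
    broker_list  TEXT                  -- Kafka node list
);
"""

def create_tables(conn):
    """Create the scheduling table and the failure record table."""
    conn.executescript(SCHEMA)

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
create_tables(conn)
tables = sorted(row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
```

Linking each failure row to its scheduling row lets a backfill run look up all failed offsets that belong to one scheduling record.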

While Spark Streaming is reading and processing Kafka data, it first obtains the offset of the data read from Kafka through the createDirectStream method of Spark Streaming, stores the data offset information in the failure record table of the relational database, and sets the status to in progress.

If an exception occurs while Spark Streaming is reading and processing Kafka data and the program cannot execute normally, the status is changed to failed according to the captured Exception information and the corresponding data offset information; otherwise it is changed to success.
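The record-then-update pattern described above might look like the following Python sketch, again with SQLite standing in for the relational database. The function `process_batch`, the column names, and the decision to keep the status on the same row as the offset are assumptions for brevity; in a real job the offset would come from the direct stream rather than being passed in by hand.

```python
import sqlite3

def process_batch(conn, topic, offset, handler):
    """Record the offset as in progress, run the handler, then store the outcome."""
    cur = conn.execute(
        "INSERT INTO failure (topic, kafka_offset, status) "
        "VALUES (?, ?, 'in_progress')", (topic, offset))
    record_id = cur.lastrowid
    try:
        handler()
        status = "success"
    except Exception:
        status = "failed"   # the offset stays in the table for later backfill
    conn.execute("UPDATE failure SET status = ? WHERE id = ?", (status, record_id))
    conn.commit()
    return status

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE failure (id INTEGER PRIMARY KEY, topic TEXT, "
             "kafka_offset INTEGER, status TEXT)")

def ok():
    pass

def boom():
    raise RuntimeError("simulated processing error")

process_batch(conn, "events", 100, ok)
process_batch(conn, "events", 200, boom)
rows = conn.execute(
    "SELECT kafka_offset, status FROM failure ORDER BY id").fetchall()
```

Writing the offset before processing starts is what makes the failure recoverable: even if the handler raises, the table still says which offset was being read.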

Based on the failure record table, the scheduling table can be set manually to configure backfill for the failures. When the Spark Streaming program is restarted, it scans the scheduling table and the failure record table to obtain the backfill strategy and re-reads the data on the specified Kafka topic.

Spark Streaming can obtain Kafka data in two ways, Receiver and Direct. The Receiver mode connects to the Kafka queue through ZooKeeper, while the Direct mode connects directly to the Kafka nodes to obtain data. The Receiver-based mode uses a Receiver to obtain data. The Receiver is implemented with the high-level Consumer API of Kafka. The data the Receiver obtains from Kafka is stored in the memory of the Spark Executor, and the jobs started by Spark Streaming then process that data. Under the default configuration, however, this mode may lose data because of an underlying failure. To enable the high-reliability mechanism and achieve zero data loss, the write-ahead log mechanism (Write Ahead Log, WAL) of Spark Streaming must be enabled. This mechanism synchronously writes the received Kafka data to a write-ahead log on a distributed file system (such as HDFS).

The Receiver mode has drawbacks, however: (1) the WAL reduces the throughput of the receiver, because the received data must be saved to a reliable distributed file system; (2) for some input sources, the same data is stored twice. For example, when data is read from Kafka, one copy is kept in the Kafka brokers and another copy must also be kept in Spark Streaming.

The technical solution of the invention is carried out on the premise that Spark Streaming obtains Kafka data in Direct mode. Compared with the first zero-data-loss approach, adopting the technical solution of the invention brings significant benefits, specifically: (1) Kafka receivers are no longer needed, and the Executors consume data from Kafka directly using the Simple Consumer API; (2) the WAL mechanism is no longer needed, and data can still be re-consumed from Kafka after failure recovery; (3) exactly-once semantics are preserved, and duplicate data is no longer read from the WAL; (4) a zero-data-loss guarantee can be achieved more flexibly and conveniently even when the Spark Streaming program fails.
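The Direct approach can re-consume after a failure because Kafka retains records that are addressed by partition and offset. The following Python sketch models that property with an in-memory log; `TopicLog` and its methods are hypothetical stand-ins and do not correspond to any Kafka client API.

```python
class TopicLog:
    """Hypothetical in-memory stand-in for a Kafka topic with partitions.

    Each partition is an append-only list; a record's offset is its index.
    """

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, value):
        self.partitions[partition].append(value)
        return len(self.partitions[partition]) - 1  # offset of the new record

    def read_range(self, partition, from_offset, until_offset):
        """Return records with from_offset <= offset < until_offset."""
        return self.partitions[partition][from_offset:until_offset]

log = TopicLog(num_partitions=2)
for value in ["a", "b", "c", "d"]:
    log.append(0, value)

first_attempt = log.read_range(0, 1, 3)  # a batch covering offsets 1 and 2
replayed = log.read_range(0, 1, 3)       # the same range re-read after a failure
```

Because reading does not remove anything from the log, replaying a stored offset range returns the same records as the first attempt, with no separate write-ahead copy.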

The Spark Streaming used by the invention is a real-time computation framework built on Spark. Through the rich API it provides and its in-memory high-speed execution engine, users can combine streaming, batch, and interactive query applications. As big data has developed, the requirements for processing it have risen as well; the original batch framework MapReduce is suited to offline computation but cannot meet businesses with higher real-time requirements. Ensuring that Spark Streaming obtains Kafka data efficiently and stably is therefore very important. To address data loss when Spark Streaming obtains Kafka data, the invention provides a method for backfilling data after Spark Streaming fails to read Kafka, which mainly involves three aspects: design of the scheduling and monitoring model, design of the backfill scheduling center, and design of the monitoring center. The specific implementation process is as follows:

1. Create the scheduling table (control) and the failure record table (failure) in the relational database; the specific table structure is shown in Fig. 3.

2. Implement the monitoring center service in code. While Spark Streaming is reading and processing Kafka data, it first obtains the offset of the data read from Kafka through the createDirectStream method of Spark Streaming, stores the data offset (offset) and related information in the failure record table of the relational database, and sets the status to in progress. If an exception occurs while Spark Streaming is reading and processing Kafka data and the program cannot execute normally, update is called to change the status to failed, according to the captured Exception information and the corresponding data offset information; otherwise the status is changed to success, as shown in Fig. 4.

3. Implement the backfill scheduling center interface, which is the scheduling center program, as shown in Fig. 5. It first scans the scheduling table using the status field as the query condition, sorts in descending order by the creation-time field, and obtains the earliest scheduling record. It then obtains the scheduling id and uses that field as the condition for querying the failure record table to obtain all Kafka failure records, and finally re-reads and processes the Kafka data according to the topic and offset (offset).
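The backfill lookup in step 3 can be sketched as follows in Python with SQLite. Several details are assumptions: the column names, the use of a from/until offset pair per failure row (the text mentions only an offset), and a dictionary of lists standing in for Kafka topics. The text asks for a descending sort and the earliest record, so this sketch takes the last row of the descending result as one way to reconcile the two.

```python
import sqlite3

def backfill(conn, topic_logs):
    """Find a failed scheduling record and re-read the data its failure rows point to."""
    failed = conn.execute(
        "SELECT id FROM control WHERE status = 'failed' "
        "ORDER BY create_time DESC").fetchall()
    if not failed:
        return []
    control_id = failed[-1][0]  # last row of a descending sort is the earliest record
    rows = conn.execute(
        "SELECT topic, from_offset, until_offset FROM failure "
        "WHERE control_id = ? ORDER BY id", (control_id,)).fetchall()
    recovered = []
    for topic, start, end in rows:
        # Re-read by topic and offset range from the stand-in log
        recovered.extend(topic_logs[topic][start:end])
    return recovered

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE control (id INTEGER PRIMARY KEY, status TEXT, create_time TEXT);
CREATE TABLE failure (id INTEGER PRIMARY KEY, control_id INTEGER, topic TEXT,
                      from_offset INTEGER, until_offset INTEGER);
INSERT INTO control VALUES (1, 'failed',  '2016-11-01 10:00:00');
INSERT INTO control VALUES (2, 'success', '2016-11-01 10:05:00');
INSERT INTO control VALUES (3, 'failed',  '2016-11-01 10:10:00');
INSERT INTO failure VALUES (1, 1, 'events', 2, 4);
INSERT INTO failure VALUES (2, 3, 'events', 0, 1);
""")
topic_logs = {"events": ["e0", "e1", "e2", "e3", "e4"]}
recovered = backfill(conn, topic_logs)
```

In this sample data the earliest failed scheduling record is id 1, so the call re-reads offsets 2 and 3 of the `events` topic. A full backfill run would repeat the lookup after marking that record as handled.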

Although the invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. Any person skilled in the art may make minor modifications and improvements without departing from the spirit and scope of the invention; the scope of protection of the invention is therefore defined by the claims.

Claims

1. A processing method for reading Kafka data based on Spark Streaming, characterized by comprising the following steps:

S1) storing data in topics using Kafka, each topic containing a configurable number of partitions;

S2) using Spark Streaming to cut the real-time input data stream into blocks by time slice, each block generating one Spark Job for processing;

S3) setting the Spark Streaming backfill scheduling time in advance according to the Kafka failure records;

S4) monitoring in real time the process by which Spark Streaming reads Kafka data;

S5) re-reading, through Spark Streaming, the Kafka data lost to failure, according to the Kafka failure records and the scheduling time.

2. The processing method for reading Kafka data based on Spark Streaming as claimed in claim 1, characterized in that step S3) uses a relational database to create two database tables, a scheduling table and a failure record table; the scheduling table stores the scheduling id, start time, end time, status, and creation time; the failure record table stores the failure record id, offset, Kafka topic, and Kafka node list; and the scheduling id in the scheduling table and the failure record id in the failure record table are in a primary key/foreign key relationship.

3. The processing method for reading Kafka data based on Spark Streaming as claimed in claim 2, characterized in that step S4) includes: while Spark Streaming reads Kafka data, if the corresponding Kafka topic data is not empty, obtaining the offset of the data read from Kafka and storing the data offset, Kafka topic, and Kafka node list in the failure record table of the relational database; and if data processing raises an exception, changing the status in the data table to failed.

4. The processing method for reading Kafka data based on Spark Streaming as claimed in claim 3, characterized in that in step S4) Spark Streaming connects directly to the Kafka nodes in Direct mode and obtains the offset of the data read from Kafka through the createDirectStream method, while marking the status in the scheduling table as in progress; and if an exception occurs while Spark Streaming is reading and processing Kafka data and the program cannot execute normally, the status in the scheduling table is changed to failed.

5. The processing method for reading Kafka data based on Spark Streaming as claimed in claim 4, characterized in that step S5) includes: first scanning the scheduling table using the status field as the query condition, sorting in descending order by the creation-time field, and obtaining the earliest scheduling record; then obtaining its scheduling id and using that field as the condition for querying the failure record table to obtain all Kafka failure records; and then re-reading the Kafka data according to the Kafka topic and offset.

6. The processing method for reading Kafka data based on Spark Streaming as claimed in claim 3, characterized in that step S4) first reads the scheduling table and the failure record table from the relational database and caches them in memory, then uses a thread to refresh the cached data on a timer for real-time monitoring.