CN107463595A

CN107463595A - Data processing method and system based on Spark

Info

Publication number: CN107463595A
Application number: CN201710335307.0A
Authority: CN
Inventors: 木伟民; 张云; 李名扬; 张明诚; 王伟平
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2017-05-12
Filing date: 2017-05-12
Publication date: 2017-12-12

Abstract

The invention discloses a Spark-based data processing method and system. The method is: 1) a user selects operators according to the requirements of the file to be processed, configures the parameters of the selected operators, and then establishes the connection relationships between the selected operators to generate the XML file of a scene; the XML file of the scene includes the XML content of each selected operator and the connection relationships between the operators; 2) a corresponding directed acyclic graph (DAG) is generated from the XML file of the scene; 3) the DAG is cut into several subtasks (subJobs) that can be executed in a distributed computing environment, and the subJobs obtained after cutting are executed under the Spark computing framework, thereby processing the file to be processed. The invention can interface with various heterogeneous data and improves the flexibility of data processing.

Description

Data processing method and system based on Spark

Technical field

The present invention relates to a Spark-based data processing method and system, and belongs to the technical field of computer software.

Background

Most existing big data preprocessing systems are developed on Hadoop. In Hadoop, intermediate results are stored in the HDFS file system, which incurs considerable extra overhead. Spark instead uses the concept of the RDD, which lets it store data in memory transparently. This greatly reduces disk reads and writes during data processing. Some other big data preprocessing systems are developed on Spark, but they lack generality.

The features of the present system are as follows: it provides a large number of operator interfaces, and users can define custom scenes to process specific files accordingly; users can define custom operators according to their own needs; the system is a further encapsulation of Spark, so users do not need to use Spark's basic API when defining custom operators; the system can move data to HDFS from the different data sources that the user specifies; and the system can handle different types of files. The invention solves the technical problems of low efficiency and lack of generality in existing big data preprocessing systems.

Existing similar work mostly lacks generality: users can only use the functional operators that the system provides and cannot customize them according to their own needs, so such systems cannot be applied to more flexible application scenarios, and all of them have more or less serious problems in performance and scalability.

Summary of the invention

The object of the invention is to provide a Spark-based data processing method and system that can interface with various heterogeneous data.

The technical solution of the invention is:

A data processing method based on a distributed computing platform, comprising the steps of:

1) a user selects operators according to the requirements of the file to be processed, configures the parameters of the selected operators, and then establishes the connection relationships between the selected operators to generate the XML file of a scene; the XML file of the scene includes the XML content of each selected operator and the connection relationships between the operators;

2) a corresponding directed acyclic graph (DAG) is generated from the XML file of the scene;

3) the DAG is cut into several subtasks (subJobs) that can be executed in a distributed computing environment, and the subJobs obtained after cutting are executed under the Spark computing framework, thereby processing the file to be processed.

Further, the method for cutting the DAG into several subJobs is:

21) the XML file of the scene is read, the type of each operator is obtained, and it is determined whether a complex operator is present; a complex operator is an operator whose operation object is the complete data set;

22) if no complex operator is present, the scene is treated as a single subJob; if a complex operator is present, each operator in the DAG is first treated as an independent subJob, and the subJobs are then merged according to set rules. Operators are divided into two classes, adaptation operators and computation operators; adaptation operators include adaptation input operators and adaptation output operators, and computation operators include simple computation operators and complex computation operators. The set rules include:

1) a simple computation operator followed by a simple computation operator: merge;

2) a simple computation operator followed by a complex computation operator: do not merge;

3) a complex computation operator followed by a simple computation operator: do not merge;

4) a complex computation operator followed by a complex computation operator: do not merge;

5) an adaptation input operator followed by a simple computation operator: merge;

6) an adaptation input operator followed by a complex computation operator: do not merge;

7) a simple computation operator followed by an adaptation output operator: merge;

8) a complex computation operator followed by an adaptation output operator: do not merge.

23) for each subJob obtained from step 22): if the end of the subJob is neither an adaptation output operator nor a complex operator, a sink operator is added at the end of the subJob, where the function of the sink operator is to store data into a Hive temporary table; if the start of the subJob is neither an adaptation input operator nor a complex operator, a scan operator is added at the start of the subJob, where the function of the scan operator is to read data from the Hive temporary table.
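The cut-then-merge procedure of step 22) can be sketched in Python as a lookup table plus a union-find pass over the edges. This is only an illustrative sketch: the kind labels (`adapt_in`, `simple`, and so on), the function name `merge_subjobs`, and the use of union-find are assumptions made here, not details given in the patent.

```python
# Operator kinds named in the text (the label strings are assumptions).
ADAPT_IN, ADAPT_OUT, SIMPLE, COMPLEX = "adapt_in", "adapt_out", "simple", "complex"

# The eight rules as a lookup: (upstream kind, downstream kind) -> merge?
MERGE_RULES = {
    (SIMPLE, SIMPLE): True,
    (SIMPLE, COMPLEX): False,
    (COMPLEX, SIMPLE): False,
    (COMPLEX, COMPLEX): False,
    (ADAPT_IN, SIMPLE): True,
    (ADAPT_IN, COMPLEX): False,
    (SIMPLE, ADAPT_OUT): True,
    (COMPLEX, ADAPT_OUT): False,
}


def merge_subjobs(kinds, edges):
    """Start with one subJob per operator, then merge along allowed edges.

    kinds: {operator id: kind}; edges: list of (upstream id, downstream id).
    Returns a list of sets, one set of operator ids per resulting subJob.
    """
    parent = {op: op for op in kinds}  # every operator is its own subJob

    def find(op):
        # Follow parent links to the representative, compressing the path.
        while parent[op] != op:
            parent[op] = parent[parent[op]]
            op = parent[op]
        return op

    for up, down in edges:
        # Pairs not covered by the eight rules are left unmerged.
        if MERGE_RULES.get((kinds[up], kinds[down]), False):
            parent[find(up)] = find(down)

    groups = {}
    for op in kinds:
        groups.setdefault(find(op), set()).add(op)
    return list(groups.values())


# Hypothetical chain: input -> simple -> complex -> simple -> output.
kinds = {"in1": ADAPT_IN, "s1": SIMPLE, "c1": COMPLEX, "s2": SIMPLE, "out": ADAPT_OUT}
edges = [("in1", "s1"), ("s1", "c1"), ("c1", "s2"), ("s2", "out")]
subjobs = merge_subjobs(kinds, edges)
```

In this hypothetical chain the complex operator ends up alone in its own subJob, while the operators on either side of it are grouped with their adaptation operators.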

Further, in step 2), the DAG is checked to determine whether it contains a cycle, a sub-cycle, or a break; if any of these is present, execution stops and the result is fed back to the user's interface.
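A check of this kind could look like the following Python sketch. It assumes that "cycle" and "sub-cycle" can both be caught by a single cycle test (Kahn's algorithm) and that a "break" means the graph is not weakly connected; the patent does not define these terms precisely, and the function names are hypothetical.

```python
from collections import deque


def has_cycle(nodes, edges):
    """Kahn's algorithm: a cycle exists if not every node can be removed."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        n = queue.popleft()
        removed += 1
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    return removed != len(nodes)


def is_broken(nodes, edges):
    """True if the graph falls apart into disconnected pieces."""
    if not nodes:
        return False
    neighbours = {n: set() for n in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        n = stack.pop()
        for m in neighbours[n] - seen:
            seen.add(m)
            stack.append(m)
    return len(seen) != len(nodes)


def validate(nodes, edges):
    """Return 'cycle', 'break', or 'ok' for feedback to the interface."""
    if has_cycle(nodes, edges):
        return "cycle"
    if is_broken(nodes, edges):
        return "break"
    return "ok"
```

A caller would stop execution and report the returned reason whenever the result is not `"ok"`.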

Further, in step 3), before a subJob is executed, it is first scanned; if a Reduce operator is found during scanning, a ReduceSink operator is added before that operator; if none is found, nothing is done; the subJob is executed after scanning.
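The scan can be pictured as an edge rewrite: every edge that leads into a Reduce operator is split so that a ReduceSink sits between the parent and the Reduce operator. The Python sketch below assumes a subJob is given as operator kinds plus an edge list; the kind strings and the generated ReduceSink ids are hypothetical.

```python
def add_reduce_sinks(kinds, edges):
    """Insert a ReduceSink on every edge that enters a Reduce operator.

    kinds: {operator id: kind}; edges: list of (parent id, child id).
    Returns new (kinds, edges); the inputs are left unchanged.
    """
    new_kinds = dict(kinds)
    new_edges = []
    for parent, child in edges:
        if kinds[child] == "reduce":
            # One ReduceSink per parent of the Reduce operator.
            sink_id = f"reduce_sink_{parent}_{child}"
            new_kinds[sink_id] = "reduce_sink"
            new_edges.append((parent, sink_id))
            new_edges.append((sink_id, child))
        else:
            new_edges.append((parent, child))
    return new_kinds, new_edges


# Hypothetical subJob: two branches feeding a Reduce-type merge operator.
kinds = {"add_col": "simple", "add_points": "simple", "merge": "reduce", "out": "adapt_out"}
edges = [("add_col", "merge"), ("add_points", "merge"), ("merge", "out")]
new_kinds, new_edges = add_reduce_sinks(kinds, edges)
```

If no Reduce operator is present, the function returns the edges unchanged, which corresponds to the "nothing is done" case.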

A data processing system based on a distributed computing platform, characterized by comprising a management unit, an execution unit, and a computing unit; wherein,

the management unit is used by the user to select operators according to the requirements of the file to be processed, configure the parameters of the selected operators, and then establish the connection relationships between the selected operators to generate the XML file of a scene; the XML file of the scene includes the XML content of each selected operator and the connection relationships between the operators;

the computing unit is used to generate a corresponding directed acyclic graph (DAG) from the XML file of the scene;

the execution unit is used to cut the DAG into subJobs that can be executed in a distributed computing environment, and then to submit the subJobs to the distributed computing platform for execution.

As shown in Fig. 1, the main processing flow of the system is:

First, according to the requirements of the files to be processed, the user drags operators onto the interface, configures the operator parameters, and connects the operators to generate a scene. (Each operator has its own XML file; when operators are dragged to build a scene, the XML file of the scene is generated according to the connection relationships between the operators in the scene. The XML file of the scene includes the XML content of each operator and the connection relationships between the operators.) When the scene is run, its XML file is submitted to the back end for parsing, and a corresponding DAG (directed acyclic graph) is generated according to the connection order of the operators recorded in the scene's XML file.
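As a rough illustration of this parsing step, the Python sketch below reads a scene XML and builds an adjacency list. The patent does not publish the XML schema, so the element names (`scene`, `operator`, `edge`) and attribute names (`id`, `class`, `from`, `to`) are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical scene XML; the real schema is not given in the patent.
SCENE_XML = """
<scene name="demo">
  <operator id="in1" class="AdapterInput"/>
  <operator id="addcol" class="AddColumn"/>
  <operator id="out" class="AdapterOutput"/>
  <edge from="in1" to="addcol"/>
  <edge from="addcol" to="out"/>
</scene>
"""


def parse_scene(xml_text):
    """Return (operators, successors) for a scene XML string.

    operators: {operator id: class attribute}
    successors: {operator id: [ids of downstream operators]}
    """
    root = ET.fromstring(xml_text)
    operators = {op.get("id"): op.get("class") for op in root.iter("operator")}
    successors = {op_id: [] for op_id in operators}
    for edge in root.iter("edge"):
        successors[edge.get("from")].append(edge.get("to"))
    return operators, successors


operators, successors = parse_scene(SCENE_XML)
```

The resulting `successors` mapping is the DAG in adjacency-list form, which later steps (validation, cutting, scheduling) could consume.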

The system controller then cuts the DAG into multiple subtasks (subJobs) according to the relevant rules and submits the subJobs to the executor, which runs them under the Spark computing framework while feeding the real-time running state and results back to the interface.

Finally, the processed files are published to HDFS so that downstream programs can further analyze and mine them.

The Spark-based offline data processing system provided by the invention can be divided into four parts: the management layer, the execution layer, the computation layer, and system monitoring and O&M (operation and maintenance).

The main functional modules of each part are as follows:

(1) Management layer:

1) Interface:

Provides friendly user interaction. User records can be added, deleted, modified, and queried. The interface lists the information of each operator, making it easy for users to select and use operators. Within their own permissions, users can add, delete, modify, and query operators. While a scene is running, its running status is shown on the interface, so that the progress of the scene is fed back to the user.

2) Process management:

Provides storage and execution control for each scene (execution is either immediate or scheduled). Scheduled execution is controlled by Cron.

3) User management:

Provides management of platform users, including registration, deletion, and allocation of resources and permissions.

4) Operator management:

Provides registration, update, deletion, and similar functions for platform operators.

5) Resource management:

Manages the computing resources and storage resources of each user.

6) Permission management:

Manages each user's operator usage permissions, data access permissions, and execution permissions.

(2) Execution layer: this part parses the user-defined application-layer DAG and converts it into tasks (Spark Jobs) that can be executed in a distributed computing environment, submits them to the Spark framework for execution, and collects the running information of the Spark Jobs.

1) Metadata:

Provides storage for operators, processes, and tasks.

2) Scheduler:

1. Parser

Provides parsing of XML.

2. Controller

Provides control over the execution of tasks.

3) Executor:

Receives execution requests from the controller and executes the tasks accordingly; provides hot standby and load balancing.

(3) Computation layer: the system targets big data. The user generates a scene by dragging operators, configuring operator parameters, and connecting operators, thereby forming an operator DAG. Combined with the Spark computing platform, the input, computation, and output of data are realized. The invention is based on the Spark computing framework.

1) Spark

The Apache Spark computing engine.

2) HDFS

The Hadoop distributed file system.

(4) System monitoring and O&M:

Provides monitoring of scene execution progress and operator running status, and provides O&M for the data preprocessing platform.

Compared with the prior art, the invention has the following advantages:

1. The system provides a large number of operators, and users can define custom scenes;

2. Users can define custom operators according to their own needs;

3. The system is a further encapsulation of Spark, so users do not need to use Spark's basic API;

4. Because the system uses a DAG, it has the scalability and flexibility characteristic of directed acyclic graphs.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the present invention.

Embodiment

The present invention is described in further detail below with reference to a specific embodiment, which does not limit the scope of the invention in any way.

Two student table files are to be processed. Table 1 contains id, name, and Chinese score; Table 2 contains id, name, and mathematics score. The desired final result is: add a "grade" column to the Table 1 file, with every value set to second grade; add 3 points to the mathematics score of every student in the Table 2 file; and finally merge the two tables into one.

The user draws the operators on the interface: two adaptation input operators, one operator implementing the "add column" function, one operator implementing the "add points" function, one operator implementing the "merge two tables" function, and one adaptation output operator.

The user configures the parameters of the operators on the interface. "Add column" operator: the added column is "grade" and its content is "two". "Add points" operator: the column to increase is "mathematics score" and the amount added is "3". Adaptation input operator 1: the file to extract is Table 1. Adaptation input operator 2: the file to extract is Table 2. Merge operator: the two tables are merged on id. Adaptation output operator: specifies the name of the output table.
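To make the intended result concrete, the three computation operators of this example can be mimicked in plain Python on a few made-up rows. The row values, column keys, and function names below are invented for illustration only; in the patent the operators run on Spark, not on Python lists.

```python
# Made-up sample rows standing in for Table 1 and Table 2.
table1 = [
    {"id": 1, "name": "Li", "chinese": 90},
    {"id": 2, "name": "Wang", "chinese": 85},
]
table2 = [
    {"id": 1, "name": "Li", "math": 80},
    {"id": 2, "name": "Wang", "math": 95},
]


def add_column(rows, column, value):
    """'Add column' operator: append a constant-valued column."""
    return [{**row, column: value} for row in rows]


def add_points(rows, column, points):
    """'Add points' operator: increase a numeric column by a fixed amount."""
    return [{**row, column: row[column] + points} for row in rows]


def merge_by_id(left, right):
    """'Merge two tables' operator: inner join on the id column."""
    right_by_id = {row["id"]: row for row in right}
    return [
        {**row, **right_by_id[row["id"]]}
        for row in left
        if row["id"] in right_by_id
    ]


result = merge_by_id(
    add_column(table1, "grade", "two"),
    add_points(table2, "math", 3),
)
```

Each output row then carries id, name, Chinese score, grade, and the increased mathematics score.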

The user connects the operators with lines to establish the upstream and downstream relationships between them. The inputs and outputs of each operator are marked in the operator's XML file, so the connection relationships between operators can be determined in the XML from the fact that the output of one operator is the input of another. The user then clicks Save, which generates the DAG scene. When the user clicks Execute, the XML file of the scene is sent to the back end and parsed to obtain the upstream and downstream relationships between all operators in the scene. The back-end program checks the DAG for a cycle, a sub-cycle, or a break; if any of these is present, the program stops and feeds the result back to the front-end interface; if the DAG is normal, execution continues. The controller then cuts, merges, and combines the DAG to generate subJobs; in this example there is exactly one subJob. After generating the subJob, the controller submits it to the executor. The executor first scans the subJob, with two possible outcomes: if a Reduce operator is found during scanning, a ReduceSink operator is added before it; if none is found, nothing is done. In this example the merge operator is a Reduce operator, so a ReduceSink operator is added between the merge operator and each of its parent nodes. The subJob is then executed. During processing, the progress is displayed on the interface in real time; after processing ends, the result is shown, and the user can retrieve the processed files from HDFS.

The key technical points of the invention are:

1. Facing heterogeneous data, how to process data files of different formats

For heterogeneous data, the system uses different parsing methods so that files of every type are ultimately parsed into one unified format, which makes it easy for the system to recognize and process the files. In the invention, the input operators convert all files into the Avro format.
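The idea of one reader per source format, all producing the same record shape, can be sketched as follows. This Python sketch uses plain dictionaries with string values as a stand-in for Avro records, because the patent does not describe its Avro schema; the format keys `csv` and `jsonl` and the function names are assumptions.

```python
import csv
import io
import json


def read_csv(text):
    """Parse CSV text with a header row into a list of records."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def read_json_lines(text):
    """Parse one JSON object per line; values are cast to strings so
    that records look the same as those coming from CSV."""
    records = []
    for line in text.splitlines():
        if line.strip():
            records.append({key: str(value) for key, value in json.loads(line).items()})
    return records


# One reader per supported source format (the set of formats is an assumption).
READERS = {"csv": read_csv, "jsonl": read_json_lines}


def to_records(text, fmt):
    """Normalize input text of the given format into unified records."""
    return READERS[fmt](text)


from_csv = to_records("id,name\n1,Li\n", "csv")
from_json = to_records('{"id": 1, "name": "Li"}\n', "jsonl")
```

Downstream operators then only need to understand the unified record shape, regardless of where the data came from.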

2. How to convert the scene built by the user into a program that can run on Spark

Determine whether the DAG scene built by the user (the whole scene is one Job) contains a complex operator (complex operator names start with "CO.CO"): read the XML file of the scene and obtain the class attribute of each operator one by one. The class attribute indicates the operator type. The operator type can also be read from the description.xml in the operator's registration package. A complex operator is an operator whose operation object is the complete data set.
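A detection pass of this kind might look like the Python sketch below. It assumes that the "CO.CO" prefix appears at the start of each operator's class attribute and that operators are `operator` elements with `id` and `class` attributes; both are guesses about a schema the patent does not show, and the class names in the sample are invented.

```python
import xml.etree.ElementTree as ET

COMPLEX_PREFIX = "CO.CO"  # naming convention stated in the text


def find_complex_operators(xml_text):
    """Return the ids of operators whose class attribute marks them as complex."""
    root = ET.fromstring(xml_text)
    return [
        op.get("id")
        for op in root.iter("operator")
        if (op.get("class") or "").startswith(COMPLEX_PREFIX)
    ]


# Hypothetical scene with one complex operator.
scene_xml = """
<scene>
  <operator id="in1" class="AI.Input"/>
  <operator id="merge" class="CO.CO.Merge"/>
  <operator id="out" class="AO.Output"/>
</scene>
"""

complex_ids = find_complex_operators(scene_xml)
is_single_subjob = not complex_ids  # no complex operator: whole scene is one subJob
```

An empty result means the whole scene can be handed to the executor as a single subJob.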

If the scene contains no complex operator, the whole scene is one subJob and is sent directly to the executor.

If the scene contains a complex operator, three steps are performed on the Job. First, each operator in the DAG is cut into an independent subJob. Second, the subJobs covered by the eight rules below are merged into one subJob. Operators fall into two main classes, adaptation operators and computation operators; adaptation operators include adaptation input operators and adaptation output operators, and computation operators include simple computation operators and complex computation operators (complex operators for short). Third, a sink (persist) operator or a scan (pick up) operator is added to some subJobs: if the end of a subJob is neither an adaptation output operator nor a complex operator, a sink is added at the end of that subJob (the sink stores data into a Hive temporary table); if the start of a subJob is neither an adaptation input operator nor a complex operator, a scan is added at the start of that subJob (the scan reads data from the Hive temporary table). At this point, construction of the subJobs is complete.
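The third step can be sketched in Python for the simple case where a subJob is a linear chain of operators. The representation of a subJob as a list of `(id, kind)` pairs and the kind strings are assumptions; the actual reading from and writing to the Hive temporary table is not modelled.

```python
def add_sink_and_scan(subjob):
    """Add scan/sink operators at the boundaries of a subJob where needed.

    subjob: list of (operator id, kind) pairs in execution order.
    Returns a new list; the input is left unchanged.
    """
    result = list(subjob)
    # Start is neither an adaptation input nor a complex operator:
    # the data must first be read back from the temporary table.
    if result[0][1] not in ("adapt_in", "complex"):
        result.insert(0, ("scan", "scan"))
    # End is neither an adaptation output nor a complex operator:
    # the intermediate result must be stored for the next subJob.
    if result[-1][1] not in ("adapt_out", "complex"):
        result.append(("sink", "sink"))
    return result


head = add_sink_and_scan([("in1", "adapt_in"), ("s1", "simple")])
middle = add_sink_and_scan([("c1", "complex")])
tail = add_sink_and_scan([("s2", "simple"), ("out", "adapt_out")])
```

Here `head` gains a trailing sink, `middle` is unchanged, and `tail` gains a leading scan.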

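The eight rules are as follows:

1) a simple computation operator followed by a simple computation operator: merge;

2) a simple computation operator followed by a complex computation operator: do not merge;

3) a complex computation operator followed by a simple computation operator: do not merge;

4) a complex computation operator followed by a complex computation operator: do not merge;

5) an adaptation input operator followed by a simple computation operator: merge;

6) an adaptation input operator followed by a complex computation operator: do not merge;

7) a simple computation operator followed by an adaptation output operator: merge;

8) a complex computation operator followed by an adaptation output operator: do not merge.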

After the subJobs have been split, the partial order of the whole scene (Job) is obtained. Each subJob is sent to the executor and scanned; if a Reduce operator is found during scanning, a ReduceSink operator is added before it. A second scan then builds the Transformation execution flow from the RDD (Resilient Distributed Dataset) dependencies, finally forming a Job that can be executed on Spark, i.e. a Spark Job. For example, map(func) applies func to each element of the RDD on which map is called and returns a new RDD; filter(func) applies func to each element of the RDD on which filter is called and returns an RDD consisting of the elements for which func returns true.
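The way transformations are chained through dependencies and only evaluated at the end can be imitated with a small Python class. `LazyDataset` is a toy stand-in written for this illustration, not Spark's RDD API; it only mirrors the `map` and `filter` semantics described above.

```python
class LazyDataset:
    """Toy stand-in for an RDD: each transformation wraps its parent
    and nothing is computed until collect() is called."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function returning an iterator

    @classmethod
    def from_list(cls, items):
        return cls(lambda: iter(items))

    def map(self, func):
        # Apply func to every element; the result depends on this dataset.
        return LazyDataset(lambda: (func(x) for x in self._compute()))

    def filter(self, func):
        # Keep only the elements for which func returns true.
        return LazyDataset(lambda: (x for x in self._compute() if func(x)))

    def collect(self):
        # Walk the dependency chain and materialize the result.
        return list(self._compute())


scores = LazyDataset.from_list([77, 92, 58])
passed = scores.map(lambda s: s + 3).filter(lambda s: s >= 80)
values = passed.collect()  # [80, 95]
```

Building `passed` performs no work; the chain of dependencies is only executed when `collect()` is called, which is the property the second scan relies on.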

When the program starts, all nodes in the DAG that have no predecessor are started simultaneously according to the partial order of the Job, achieving parallel execution; when a node finishes executing, it is deleted from the graph, and the process is repeated until execution ends.
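This scheduling loop can be sketched in Python with a thread pool. The sketch is a simplification: it runs the graph in waves and waits for a whole wave to finish before deleting its nodes, whereas the text deletes each node as soon as it finishes. The function name `run_in_waves` and the node names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_waves(nodes, edges, run):
    """Run all nodes without remaining predecessors in parallel, delete
    them from the graph, and repeat until the graph is empty.

    Returns the list of waves (each a sorted list of node names).
    """
    remaining = set(nodes)
    remaining_edges = set(edges)
    waves = []
    with ThreadPoolExecutor() as pool:
        while remaining:
            blocked = {v for _, v in remaining_edges}
            ready = sorted(remaining - blocked)
            if not ready:
                raise ValueError("graph contains a cycle")
            list(pool.map(run, ready))  # wait for the whole wave to finish
            waves.append(ready)
            remaining -= set(ready)
            # Deleting finished nodes also removes their outgoing edges.
            remaining_edges = {(u, v) for u, v in remaining_edges if u not in ready}
    return waves


executed = []
nodes = ["in1", "in2", "add_col", "add_points", "merge", "out"]
edges = [
    ("in1", "add_col"),
    ("in2", "add_points"),
    ("add_col", "merge"),
    ("add_points", "merge"),
    ("merge", "out"),
]
waves = run_in_waves(nodes, edges, executed.append)
```

For the example scene, the two input branches run side by side and the merge only starts once both have finished.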

Claims

1. A data processing method based on a distributed computing platform, comprising the steps of:

1) a user selects operators according to the requirements of the file to be processed, configures the parameters of the selected operators, and then establishes the connection relationships between the selected operators to generate the XML file of a scene; the XML file of the scene includes the XML content of each selected operator and the connection relationships between the operators;

2) a corresponding directed acyclic graph (DAG) is generated from the XML file of the scene;

3) the DAG is cut into several subtasks (subJobs) that can be executed in a distributed computing environment, and the subJobs obtained after cutting are executed under the Spark computing framework, thereby processing the file to be processed.

2. The method of claim 1, characterized in that the method for cutting the DAG into several subJobs is:

21) the XML file of the scene is read, the type of each operator is obtained, and it is determined whether a complex operator is present; a complex operator is an operator whose operation object is the complete data set;

22) if no complex operator is present, the scene is treated as a single subJob; if a complex operator is present, each operator in the DAG is first treated as an independent subJob, and the subJobs are then merged according to set rules; operators are divided into two classes, adaptation operators and computation operators; adaptation operators include adaptation input operators and adaptation output operators, and computation operators include simple computation operators and complex computation operators;

The set rules include:

1) a simple computation operator followed by a simple computation operator: merge;

2) a simple computation operator followed by a complex computation operator: do not merge;

3) a complex computation operator followed by a simple computation operator: do not merge;

4) a complex computation operator followed by a complex computation operator: do not merge;

5) an adaptation input operator followed by a simple computation operator: merge;

6) an adaptation input operator followed by a complex computation operator: do not merge;

7) a simple computation operator followed by an adaptation output operator: merge;

8) a complex computation operator followed by an adaptation output operator: do not merge;

23) for each subJob obtained from step 22): if the end of the subJob is neither an adaptation output operator nor a complex operator, a sink operator is added at the end of the subJob, where the function of the sink operator is to store data into a Hive temporary table; if the start of the subJob is neither an adaptation input operator nor a complex operator, a scan operator is added at the start of the subJob, where the function of the scan operator is to read data from the Hive temporary table.

3. The method of claim 1 or 2, characterized in that, in step 2), the DAG is checked to determine whether it contains a cycle, a sub-cycle, or a break; if any of these is present, execution stops and the result is fed back to the user's interface.

4. The method of claim 1 or 2, characterized in that, in step 3), before a subJob is executed, it is first scanned; if a Reduce operator is found during scanning, a ReduceSink operator is added before that operator; if none is found, nothing is done; the subJob is executed after scanning.

5. A data processing system based on a distributed computing platform, characterized by comprising a management unit, an execution unit, and a computing unit; wherein,

the management unit is used by the user to select operators according to the requirements of the file to be processed, configure the parameters of the selected operators, and then establish the connection relationships between the selected operators to generate the XML file of a scene; the XML file of the scene includes the XML content of each selected operator and the connection relationships between the operators;

the computing unit is used to generate a corresponding directed acyclic graph (DAG) from the XML file of the scene;

the execution unit is used to cut the DAG into subJobs that can be executed in a distributed computing environment, and then to submit the subJobs to the distributed computing platform for execution.

6. The system of claim 5, characterized in that the computing unit reads the XML file of the scene, obtains the type of each operator, and determines whether a complex operator is present; a complex operator is an operator whose operation object is the complete data set; if no complex operator is present, the scene is treated as a single subJob; if a complex operator is present, each operator in the DAG is first treated as an independent subJob, and the subJobs are then merged according to set rules; operators are divided into two classes, adaptation operators and computation operators; adaptation operators include adaptation input operators and adaptation output operators, and computation operators include simple computation operators and complex computation operators; the set rules include: 1) a simple computation operator followed by a simple computation operator is merged; 2) a simple computation operator followed by a complex computation operator is not merged; 3) a complex computation operator followed by a simple computation operator is not merged; 4) a complex computation operator followed by a complex computation operator is not merged; 5) an adaptation input operator followed by a simple computation operator is merged; 6) an adaptation input operator followed by a complex computation operator is not merged; 7) a simple computation operator followed by an adaptation output operator is merged; 8) a complex computation operator followed by an adaptation output operator is not merged;

then, for each subJob obtained from the above processing: if the end of the subJob is neither an adaptation output operator nor a complex operator, a sink operator is added at the end of the subJob, where the function of the sink operator is to store data into a Hive temporary table; if the start of the subJob is neither an adaptation input operator nor a complex operator, a scan operator is added at the start of the subJob, where the function of the scan operator is to read data from the Hive temporary table.

7. The system of claim 5 or 6, characterized in that the execution unit scans each subJob obtained after cutting; if a Reduce operator is found, a ReduceSink operator is added before that operator; the subJob is then submitted to the distributed computing platform for execution.