CN108196970A - Dynamic memory management method and device for Spark platform - Google Patents

Dynamic memory management method and device for Spark platform

Info

Publication number
CN108196970A
Authority
CN
China
Prior art keywords
memory
data
caused
stages
situation type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711477992.7A
Other languages
Chinese (zh)
Inventor
孙浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp filed Critical Neusoft Corp
Priority to CN201711477992.7A priority Critical patent/CN108196970A/en
Publication of CN108196970A publication Critical patent/CN108196970A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0706 The processing taking place on a specific hardware platform or in a specific software environment
    • G06F 11/073 The processing taking place in a memory management context, e.g. virtual memory or cache management
    • G06F 11/079 Root cause analysis, i.e. error or fault diagnosis
    • G06F 11/0793 Remedial or corrective actions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention proposes a dynamic memory management method and device for the Spark platform. The method includes: determining the first situation type that causes a memory overflow during operation of the Spark platform; and dynamically managing the memory according to a method corresponding to the first situation type. The first situation type includes: a memory overflow in the Map stage, a memory overflow caused by calling the coalesce function, a memory overflow in the Shuffle stage, a memory overflow caused by uneven resource allocation in standalone mode, and a memory overflow caused by data skew. The present invention can effectively avoid memory overflows during operation of the Spark platform and improves the robustness and operating performance of the Spark platform.

Description

Dynamic memory management method and device for Spark platform
Technical field
The present invention relates to the field of computer technology, and in particular to a dynamic memory management method and device for the Spark platform.
Background
On the Spark platform, the memory of one execution node (Executor) is divided into three parts: execution memory, storage memory and other memory. Join-type operations and aggregate-type operations are performed in execution memory; the data of the Shuffle stage is also buffered in execution memory first and written to disk only after execution memory is full, which reduces input/output; the Map stage is also performed in execution memory. Storage memory is used to store the data of broadcast-type operations, cache-type operations and persist-type operations. Other memory is the memory reserved for the execution of programs on the Spark platform. In the related art, memory overflows during operation of the Spark platform are usually avoided by static management, which cannot effectively avoid memory overflows in time while the Spark platform is running.
Summary of the Invention
The present invention aims to solve at least one of the technical problems in the related art.
To this end, one object of the present invention is to provide a dynamic memory management method for the Spark platform that can effectively avoid memory overflows during operation of the Spark platform and improve the robustness and operating performance of the Spark platform.
Another object of the present invention is to provide a dynamic memory management device for the Spark platform.
Another object of the present invention is to provide a non-transitory computer-readable storage medium.
Another object of the present invention is to provide a computer program product.
To achieve the above objects, an embodiment of the first aspect of the present invention proposes a dynamic memory management method for the Spark platform, including: determining the first situation type that causes a memory overflow during operation of the Spark platform; and dynamically managing the memory according to a method corresponding to the first situation type; wherein the first situation type includes: a memory overflow in the Map stage, a memory overflow caused by calling the coalesce function, a memory overflow in the Shuffle stage, a memory overflow caused by uneven resource allocation in standalone mode, and a memory overflow caused by data skew. Dynamically managing the memory according to the method corresponding to the first situation type includes: if the first situation type is a memory overflow in the Map stage or a memory overflow caused by calling the coalesce function, partitioning each Task before the map operation of the Map stage; if the first situation type is a memory overflow in the Shuffle stage, adjusting the number of partitions of the partitioner parameter passed in the Shuffle stage; if the first situation type is a memory overflow caused by uneven resource allocation in standalone mode, configuring the executor-cores parameter or the spark.executor.cores parameter in standalone mode; and if the first situation type is a memory overflow caused by data skew, determining the code position and data distribution that cause the data skew, and dynamically managing the memory according to the code position and data distribution.
With the dynamic memory management method for the Spark platform proposed by the embodiment of the first aspect of the present invention, the first situation type that causes a memory overflow during operation of the Spark platform is determined, and the memory is dynamically managed according to a method corresponding to the first situation type. Since the first situation type covers the various possible causes of memory overflow, a targeted solution is applied to dynamically manage the memory during operation of the Spark platform, so that memory overflows during operation of the Spark platform are effectively avoided and the robustness and operating performance of the Spark platform are improved.
To achieve the above objects, an embodiment of the second aspect of the present invention proposes a dynamic memory management device for the Spark platform, including: a first determining module, configured to determine the first situation type that causes a memory overflow during operation of the Spark platform; and a dynamic management module, configured to dynamically manage the memory according to a method corresponding to the first situation type; wherein the first situation type includes: a memory overflow in the Map stage, a memory overflow caused by calling the coalesce function, a memory overflow in the Shuffle stage, a memory overflow caused by uneven resource allocation in standalone mode, and a memory overflow caused by data skew. The dynamic management module is specifically configured to: if the first situation type is a memory overflow in the Map stage or a memory overflow caused by calling the coalesce function, partition each Task before the map operation of the Map stage; if the first situation type is a memory overflow in the Shuffle stage, adjust the number of partitions of the partitioner parameter passed in the Shuffle stage; if the first situation type is a memory overflow caused by uneven resource allocation in standalone mode, configure the executor-cores parameter or the spark.executor.cores parameter in standalone mode; and if the first situation type is a memory overflow caused by data skew, trigger a second determining module to determine the code position and data distribution that cause the data skew, the dynamic management module dynamically managing the memory according to the code position and data distribution.
With the dynamic memory management device for the Spark platform proposed by the embodiment of the second aspect of the present invention, the first situation type that causes a memory overflow during operation of the Spark platform is determined, and the memory is dynamically managed according to a method corresponding to the first situation type. Since the first situation type covers the various possible causes of memory overflow, a targeted solution is applied to dynamically manage the memory during operation of the Spark platform, so that memory overflows during operation of the Spark platform are effectively avoided and the robustness and operating performance of the Spark platform are improved.
To achieve the above objects, an embodiment of the third aspect of the present invention proposes a non-transitory computer-readable storage medium. When instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to perform a dynamic memory management method for the Spark platform, the method including: determining the first situation type that causes a memory overflow during operation of the Spark platform; and dynamically managing the memory according to a method corresponding to the first situation type; wherein the first situation type includes: a memory overflow in the Map stage, a memory overflow caused by calling the coalesce function, a memory overflow in the Shuffle stage, a memory overflow caused by uneven resource allocation in standalone mode, and a memory overflow caused by data skew. Dynamically managing the memory according to the method corresponding to the first situation type includes: if the first situation type is a memory overflow in the Map stage or a memory overflow caused by calling the coalesce function, partitioning each Task before the map operation of the Map stage; if the first situation type is a memory overflow in the Shuffle stage, adjusting the number of partitions of the partitioner parameter passed in the Shuffle stage; if the first situation type is a memory overflow caused by uneven resource allocation in standalone mode, configuring the executor-cores parameter or the spark.executor.cores parameter in standalone mode; and if the first situation type is a memory overflow caused by data skew, determining the code position and data distribution that cause the data skew, and dynamically managing the memory according to the code position and data distribution.
With the non-transitory computer-readable storage medium proposed by the embodiment of the third aspect of the present invention, the first situation type that causes a memory overflow during operation of the Spark platform is determined, and the memory is dynamically managed according to a method corresponding to the first situation type. Since the first situation type covers the various possible causes of memory overflow, a targeted solution is applied to dynamically manage the memory during operation of the Spark platform, so that memory overflows during operation of the Spark platform are effectively avoided and the robustness and operating performance of the Spark platform are improved.
To achieve the above objects, an embodiment of the fourth aspect of the present invention proposes a computer program product. When instructions in the computer program product are executed by a processor, a dynamic memory management method for the Spark platform is performed, the method including: determining the first situation type that causes a memory overflow during operation of the Spark platform; and dynamically managing the memory according to a method corresponding to the first situation type; wherein the first situation type includes: a memory overflow in the Map stage, a memory overflow caused by calling the coalesce function, a memory overflow in the Shuffle stage, a memory overflow caused by uneven resource allocation in standalone mode, and a memory overflow caused by data skew. Dynamically managing the memory according to the method corresponding to the first situation type includes: if the first situation type is a memory overflow in the Map stage or a memory overflow caused by calling the coalesce function, partitioning each Task before the map operation of the Map stage; if the first situation type is a memory overflow in the Shuffle stage, adjusting the number of partitions of the partitioner parameter passed in the Shuffle stage; if the first situation type is a memory overflow caused by uneven resource allocation in standalone mode, configuring the executor-cores parameter or the spark.executor.cores parameter in standalone mode; and if the first situation type is a memory overflow caused by data skew, determining the code position and data distribution that cause the data skew, and dynamically managing the memory according to the code position and data distribution.
With the computer program product proposed by the embodiment of the fourth aspect of the present invention, the first situation type that causes a memory overflow during operation of the Spark platform is determined, and the memory is dynamically managed according to a method corresponding to the first situation type. Since the first situation type covers the various possible causes of memory overflow, a targeted solution is applied to dynamically manage the memory during operation of the Spark platform, so that memory overflows during operation of the Spark platform are effectively avoided and the robustness and operating performance of the Spark platform are improved.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.
Brief Description of the Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a schematic flowchart of a dynamic memory management method for the Spark platform proposed by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a dynamic memory management method for the Spark platform proposed by another embodiment of the present invention;
Fig. 3 is a schematic flowchart of a dynamic memory management method for the Spark platform proposed by yet another embodiment of the present invention;
Fig. 4 is a schematic diagram of the process of increasing the number of Tasks in an embodiment of the present invention;
Fig. 5 is a schematic diagram of the process of transforming Key values in an embodiment of the present invention;
Fig. 6 is a schematic diagram of the process of splitting and distributing part of the Key values in an embodiment of the present invention;
Fig. 7 is a schematic flowchart of a dynamic memory management method for the Spark platform proposed by yet another embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a dynamic memory management device for the Spark platform proposed by an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of a dynamic memory management device for the Spark platform proposed by another embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the accompanying drawings are exemplary, are only intended to explain the present invention, and should not be construed as limiting the present invention. On the contrary, the embodiments of the present invention include all changes, modifications and equivalents falling within the spirit and scope of the appended claims.
Fig. 1 is a schematic flowchart of a dynamic memory management method for the Spark platform proposed by an embodiment of the present invention.
In this embodiment, the dynamic memory management method for the Spark platform is described as being configured in a dynamic memory management device for the Spark platform.
In this embodiment, the dynamic memory management device for the Spark platform may be provided in a server or in an electronic device, which is not limited by the embodiments of the present invention.
The dynamic memory management method for the Spark platform in the embodiments of the present invention can be used to dynamically manage the memory of the Spark platform so as to avoid a memory overflow (Out Of Memory, OOM).
The electronic device is, for example, a personal computer (Personal Computer, PC), a cloud device or a mobile device, the mobile device being, for example, a smartphone or a tablet computer.
It should be noted that the execution subject of the embodiments of the present invention may be, in terms of hardware, a central processing unit (Central Processing Unit, CPU) in the server or the electronic device and, in terms of software, a background management service in the server or the electronic device, which is not limited herein.
The embodiments of the present invention are illustrated with the dynamic memory management device for the Spark platform provided in an electronic device.
The Spark platform may run on a distributed server or a cluster server; therefore, the Spark platform may correspond to multiple execution nodes (Executors) during operation.
The Spark platform is an open-source cluster computing environment similar to Hadoop. Spark provides the in-memory distributed dataset RDD and supports interactive queries and optimized iterative workloads.
The Spark platform is implemented in the Scala language, which serves as its application framework. The Spark platform is tightly integrated with the Scala language, and Scala can flexibly operate on the distributed dataset RDD.
Referring to Fig. 1, the method includes:
S101: determining the first situation type that causes a memory overflow during operation of the Spark platform.
In a specific implementation, the embodiment of the present invention may read the log files generated during operation of the Spark platform by calling an application programming interface (Application Programming Interface, API), and then analyze the data in the log files and the system core files to determine the first situation type that causes a memory overflow during operation of the Spark platform.
S102: dynamically managing the memory according to a method corresponding to the first situation type.
The first situation type includes: a memory overflow in the Map stage, a memory overflow caused by calling the coalesce function, a memory overflow in the Shuffle stage, a memory overflow caused by uneven resource allocation in standalone mode, and a memory overflow caused by data skew.
The memory of the Spark platform can be described as follows:
The memory of the Spark platform on one execution node (Executor) is divided into three parts: execution memory, storage memory and other memory. Specifically:
Execution memory: join-type operations and aggregate-type operations are performed in execution memory, and the data of the Shuffle stage is also buffered in execution memory first and written to disk only after execution memory is full, which reduces input/output; the Map stage is also performed in execution memory. Storage memory is used to store the data of broadcast-type operations, cache-type operations and persist-type operations. Other memory is the memory reserved for the execution of programs on the Spark platform.
S102 specifically includes:
S1021: if the first situation type is a memory overflow in the Map stage or a memory overflow caused by calling the coalesce function, partitioning each Task before the map operation of the Map stage.
A memory overflow in the Map stage is caused by the Map stage generating a large number of objects.
For example, for the operation rdd.map(x => for (i <- 1 to 10000) yield i.toString), each element of the RDD produces 10,000 objects.
For this situation type of memory overflow, each Task may be partitioned before the map operation of the Map stage without increasing the memory. That is, by reducing the size of each Task, the memory of the corresponding execution node (Executor) remains sufficient even when each Task produces a large number of objects.
A specific approach is, for example, to call the repartition method before a map operation that generates a large number of objects, so that smaller task blocks are passed as Task partitions to the map operation.
For example: rdd.repartition(10000).map(x => for (i <- 1 to 10000) yield i.toString).
S1022: if the first situation type is a memory overflow in the Shuffle stage, adjusting the number of partitions of the partitioner parameter passed in the Shuffle stage.
A memory overflow in the Shuffle stage is specifically a memory overflow caused by a single file occupying too much memory after Shuffle processing.
According to the operating principle of the Spark platform, join-type operations and reduceByKey-type operations on the Spark platform involve a Shuffle process, and a partitioner parameter needs to be passed in during Shuffle processing. For most Shuffle operations on the Spark platform, the default partitioner is HashPartitioner, and its default value is the maximum number of partitions of the parent RDD. The partitioner parameter can be controlled by the spark.default.parallelism parameter (spark.sql.shuffle.partitions is used in Spark SQL); the spark.default.parallelism parameter is only effective for HashPartitioner. If another partitioner, or a custom partitioner, is used, spark.default.parallelism cannot control the parallelism of Shuffle processing, and if such a partitioner causes a memory overflow, the number of partitions can be increased in the code of that partitioner.
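As an illustrative sketch (the application name, input path and partition counts below are hypothetical and not taken from the patent), the shuffle parallelism can be raised either globally through spark.default.parallelism or per operation by passing a partition count to a shuffle operator such as reduceByKey:

    import org.apache.spark.{SparkConf, SparkContext}

    object ShufflePartitionsSketch {
      def main(args: Array[String]): Unit = {
        // Raise the default shuffle parallelism used by HashPartitioner-based operators.
        val conf = new SparkConf()
          .setAppName("shuffle-partitions-sketch")
          .set("spark.default.parallelism", "1000")
        val sc = new SparkContext(conf)

        val pairs = sc.textFile("hdfs:///input/words")   // hypothetical input path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))

        // Alternatively, set the partition count for a single shuffle by passing
        // numPartitions directly to the operator.
        val counts = pairs.reduceByKey(_ + _, 1000)

        counts.saveAsTextFile("hdfs:///output/word-counts") // hypothetical output path
        sc.stop()
      }
    }

Passing the partition count to the operator only affects that shuffle, whereas spark.default.parallelism changes the default for all HashPartitioner-based shuffles of the job.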
S1023: if the first situation type is a memory overflow caused by uneven resource allocation in standalone mode, configuring the executor-cores parameter or the spark.executor.cores parameter in standalone mode.
According to the operating principle of the Spark platform, in standalone mode, if the --total-executor-cores parameter and the --executor-memory parameter are configured but the --executor-cores parameter is not configured, a memory overflow may occur.
For this situation type, the --executor-cores parameter or the spark.executor.cores parameter can additionally be configured to ensure that resources are evenly allocated among the execution nodes (Executors).
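A minimal configuration sketch of this idea (the core and memory figures are hypothetical and not taken from the patent); spark.cores.max is the configuration counterpart of --total-executor-cores in standalone mode:

    import org.apache.spark.{SparkConf, SparkContext}

    object StandaloneResourceSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("standalone-resource-sketch")
          // Total cores requested by the whole application in standalone mode.
          .set("spark.cores.max", "20")
          // Cores per Executor; setting this together with the total avoids one
          // Executor grabbing many cores while holding a single memory quota.
          .set("spark.executor.cores", "4")
          .set("spark.executor.memory", "4g")
        val sc = new SparkContext(conf)
        // ... job logic ...
        sc.stop()
      }
    }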
Further, a memory overflow can also be avoided by sharing objects in the RDD.
For example, an expression such as rdd.flatMap(x => for (i <- 1 to 1000) yield ("key", "value")) may lead to OOM, but in the same situation rdd.flatMap(x => for (i <- 1 to 1000) yield "key" + "value") does not, because every ("key", "value") generates a new Tuple object, whereas "key" + "value", no matter how many times it appears, is only one object pointing into the constant pool.
The above example shows that the tuples ("key", "value") and ("key", "value") exist at different memory locations, i.e. two copies are stored, while "key" + "value" is stored only once at the same address. Therefore, in the embodiment of the present invention, if an RDD contains a large amount of repeated data, or a large amount of repeated data needs to be stored, the repeated data can be converted into a String, which can effectively reduce memory usage.
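The object-sharing claim can be checked with a small standalone Scala snippet (a sketch that is not part of the patent, assuming the Scala compiler constant-folds the concatenation of string literals):

    object SharingDemo {
      def main(args: Array[String]): Unit = {
        // Each tuple literal allocates a new Tuple2 instance on the heap.
        val t1 = ("key", "value")
        val t2 = ("key", "value")
        println(t1 eq t2)   // false: two distinct objects

        // "key" + "value" is folded into the constant "keyvalue", so every
        // occurrence refers to the same interned string in the constant pool.
        val s1 = "key" + "value"
        val s2 = "key" + "value"
        println(s1 eq s2)   // true: one shared object
      }
    }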
S1024: if the first situation type is a memory overflow caused by data skew, determining the code position and data distribution that cause the data skew, and dynamically managing the memory according to the code position and data distribution.
Optionally, determining the code position and data distribution that cause the data skew and dynamically managing the memory according to the code position and data distribution may further include: determining the second situation type that causes the data skew according to the code position and data distribution of the data skew, and dynamically managing the memory according to a method corresponding to the second situation type.
In a specific implementation, the embodiment of the present invention may also read the log files generated during operation of the Spark platform by calling an application programming interface (Application Programming Interface, API), and then analyze the data in the log files and the system core files to determine the second situation type that causes the data skew during operation of the Spark platform.
The specific implementation process is described below.
In this embodiment, the first situation type that causes a memory overflow during operation of the Spark platform is determined, and the memory is dynamically managed according to a method corresponding to the first situation type. Since the first situation type covers the various possible causes of memory overflow, a targeted solution is applied to dynamically manage the memory during operation of the Spark platform, so that memory overflows during operation of the Spark platform are effectively avoided and the robustness and operating performance of the Spark platform are improved.
Fig. 2 is a schematic flowchart of a dynamic memory management method for the Spark platform proposed by another embodiment of the present invention.
Referring to Fig. 2, the method includes:
S201: determining the second situation type that causes the data skew according to the code position and data distribution of the data skew.
The code position where the data skew occurs can be determined in the embodiment of the present invention; a specific example is as follows:
According to the operating principle of the Spark platform, data skew occurs in the Shuffle stage. Operators that may trigger Shuffle stage operations include, for example, distinct, groupByKey, reduceByKey, aggregateByKey, join, cogroup and repartition.
To determine the code position where the data skew occurs, it is first determined in which stage the data skew occurs. If the job is submitted in yarn-client mode, the running log can be viewed directly on the local machine, and the stage currently being run can be found in the log; if the job is submitted in yarn-cluster mode, the stage currently being run can be viewed through the Spark Web UI. In addition, whether yarn-client mode or yarn-cluster mode is used, the amount of data allocated to each task of the current stage can also be checked in the Spark Web UI in the embodiment of the present invention, so as to further determine whether the data skew results from uneven data allocation among the tasks.
Then, after the stage in which the data skew occurs has been determined, the code position corresponding to that stage can be determined in the embodiment of the present invention according to the stage division principle. For example, if a shuffle-type operator, or an SQL statement of Spark SQL that causes a shuffle (such as a group by statement), appears in the Spark code, it can be determined that two stages are delimited with the shuffle-type operator or the shuffle-causing statement as the boundary, and the corresponding code position is determined accordingly.
The data distribution that causes the data skew can be determined in the embodiment of the present invention; a specific example is as follows:
In a specific implementation, after the code position that causes the data skew has been determined, the data distribution that causes the data skew may be determined. For example, the RDD/Hive table on which the shuffle operator is executed and which causes the data skew can be analyzed, and the distribution of its Key values determined, so as to determine the second situation type that causes the data skew.
The second situation type includes:
data skew caused by uneven data distribution in a Hive table;
the number of Key values that cause the data skew being less than or equal to a first preset threshold;
the parallelism of the Shuffle stage not satisfying a preset condition;
data skew caused by executing an aggregation-type operator on an RDD or performing grouped aggregation with a group by statement in Spark SQL;
data skew caused when a join-type operation is used on RDDs or a join statement is used in Spark SQL and the data volume of one RDD or table in the join-type operation is less than a second preset threshold;
data skew caused when the distribution of Key values in two RDDs or Hive tables is such that the data volume of some Key values in one RDD or Hive table is greater than or equal to a third preset threshold while the data volume of the Key values in the other RDD or Hive table is evenly distributed;
data skew caused when, during a join-type operation, the data volume of Key values present in an RDD is greater than or equal to a fourth preset threshold and the number of Key values whose data volume is greater than or equal to the fourth preset threshold is greater than or equal to a fifth preset threshold.
By taking into account the data generated during actual operation of the Spark platform, the embodiment of the present invention can effectively combine the actual operating conditions to provide more targeted solutions for avoiding memory overflow, and provides a technical solution that switches in real time between different solutions for memory overflows caused by different stages during actual operation of the Spark platform, further ensuring the robustness and operating performance of the Spark platform.
S202: dynamically managing the memory according to a method corresponding to the second situation type.
Optionally, in some embodiments, referring to Fig. 3, the above S202 may specifically include:
S301: if the second situation type is data skew caused by uneven data distribution in a Hive table, using Hive ETL to pre-aggregate the data in the Hive table by Key value, or performing a join-type operation between the Hive table and other tables in advance.
In the embodiment of the present invention, data skew caused by uneven data distribution in a Hive table may be, for example, the case where one Key value in the Hive table corresponds to 1,000,000 rows while the other Key values correspond to 10 rows each, and the usage scenario of the Spark platform requires some analysis operation to be performed on the Hive table frequently.
For the above second situation type, the embodiment of the present invention may use Hive ETL to pre-aggregate the data in the Hive table by Key value, or to join the Hive table with other tables in advance, so that the data source used during actual operation of the Spark platform is the pre-processed Hive table. Since the data has already been aggregated or joined in advance, the original shuffle-type operators may no longer need to be executed during actual operation of the Spark platform. This approach is simple to implement, works well, and completely avoids the data skew, so that the performance of the actual Spark job improves significantly. By moving part of the shuffle operations of the actual Spark job forward into Hive ETL, the Spark platform directly reads the pre-processed Hive intermediate tables, which reduces the shuffle operations of Spark as much as possible and can improve the performance of the affected stages by a factor of six or more.
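A minimal sketch of this approach (the table and column names are hypothetical and not taken from the patent): the skewed group-by is executed once as a Hive ETL step, and the analysis job then reads the pre-aggregated middle table instead of shuffling the raw table itself:

    import org.apache.spark.sql.SparkSession

    object HiveEtlPreAggregationSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-etl-pre-aggregation-sketch")
          .enableHiveSupport()
          .getOrCreate()

        // One-off ETL step: aggregate the skewed table by key ahead of time.
        spark.sql(
          """CREATE TABLE IF NOT EXISTS user_visits_agg AS
            |SELECT user_id, COUNT(*) AS visit_cnt
            |FROM user_visits
            |GROUP BY user_id""".stripMargin)

        // The analysis job reads the pre-aggregated middle table and no longer
        // performs the skewed group-by shuffle itself.
        val top = spark.sql(
          "SELECT user_id, visit_cnt FROM user_visits_agg ORDER BY visit_cnt DESC LIMIT 100")
        top.show()

        spark.stop()
      }
    }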
S302: if the second situation type is that the number of Key values causing the data skew is less than or equal to a first preset threshold, directly filtering out the Key values that cause the data skew.
The first preset threshold may be set by the user according to the actual usage requirements of the Spark platform, or may be preset by the factory program of the dynamic memory management device for the Spark platform, which is not limited herein.
In the embodiment of the present invention, the case where the number of Key values causing the data skew is less than or equal to the first preset threshold may be, for example, that 99% of the Key values correspond to 10 rows each while only one Key value corresponds to 1,000,000 rows, thereby causing the data skew.
For the above second situation type, the embodiment of the present invention may directly filter out the Key values that cause the data skew, for example by filtering out the Key values with a where clause in Spark SQL, or by executing the filter operator on the RDD in Spark Core to filter out the Key values.
Further, the embodiment of the present invention may also dynamically determine, at each job execution of the Spark platform, which Key values have the largest data volume and then filter them out. That is, the RDD can be sampled with the sample operator, the number of records handled for each Key value can then be counted, and the key with the largest data volume can be filtered out. This is simple to implement, works well, and completely avoids the data skew.
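A sketch of this dynamic sampling-and-filtering step (paths, the sample fraction and the record layout are hypothetical, not taken from the patent):

    import org.apache.spark.{SparkConf, SparkContext}

    object SkewedKeyFilterSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("skewed-key-filter-sketch"))

        // Hypothetical key/value pairs read from some input.
        val pairs = sc.textFile("hdfs:///input/pairs")   // assumed path
          .map { line => val f = line.split(","); (f(0), f(1)) }

        // Sample a fraction of the data and count records per key to find
        // the key that dominates the data volume.
        val sampled = pairs.sample(withReplacement = false, fraction = 0.1)
        val skewedKey = sampled
          .map { case (k, _) => (k, 1L) }
          .reduceByKey(_ + _)
          .sortBy(_._2, ascending = false)
          .first()._1

        // Filter the dominant key out before the shuffle-heavy computation.
        val cleaned = pairs.filter { case (k, _) => k != skewedKey }
        val counts = cleaned.map { case (k, _) => (k, 1L) }.reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs:///output/counts")   // assumed path
        sc.stop()
      }
    }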
S303: if the second situation type is that the parallelism of the Shuffle stage does not satisfy a preset condition, increasing the value of the spark.sql.shuffle.partitions parameter required by the Shuffle stage, so as to increase the number of Tasks in the Shuffle stage.
In the embodiment of the present invention, the case where the parallelism of the Shuffle stage does not satisfy the preset condition may be, for example, the following: when a shuffle operator is executed on an RDD, a parameter is passed to the shuffle operator, for example the partition count 1000 in reduceByKey(_ + _, 1000), and this parameter sets the number of shuffle read tasks used when this shuffle operator is executed. In an actual usage scenario, for the shuffle-type statements in Spark SQL, such as group by and join, a parameter can be set, namely spark.sql.shuffle.partitions; this parameter represents the parallelism of the shuffle read tasks, and its default value of 200 is too small for many actual usage scenarios.
The preset condition in the embodiment of the present invention may be, for example, that the parallelism of the Shuffle stage is too low.
For the above second situation type, the embodiment of the present invention may increase the value of the spark.sql.shuffle.partitions parameter required by the Shuffle stage so as to increase the number of Tasks in the Shuffle stage, that is, to increase the number of shuffle read tasks, so that the multiple Key values originally allocated to one Task are distributed to multiple Tasks, and each Task processes less data than before. For example, if there were originally 5 Key values, each corresponding to 10 rows, and the 5 Key values were all allocated to one Task, that Task would process 50 rows. After the number of shuffle read tasks is increased, each Task is assigned one Key value, i.e. each Task processes only 10 rows, so the execution time of each Task is shortened. This is simple to implement and can effectively alleviate and mitigate the impact of the data skew.
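An illustrative Spark SQL sketch (the application name, paths and partition count are hypothetical, not taken from the patent) that raises the shuffle-read parallelism from the default of 200:

    import org.apache.spark.sql.SparkSession

    object ShuffleParallelismSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("shuffle-parallelism-sketch")
          // Raise the shuffle-read parallelism used by group by / join in Spark SQL.
          .config("spark.sql.shuffle.partitions", "1000")
          .getOrCreate()

        // Hypothetical table; the group by below now runs with 1000 shuffle tasks,
        // so keys formerly packed into one task are spread over many tasks.
        spark.read.json("hdfs:///input/events")            // assumed path and format
          .createOrReplaceTempView("events")
        val grouped = spark.sql(
          "SELECT user_id, COUNT(*) AS cnt FROM events GROUP BY user_id")
        grouped.write.parquet("hdfs:///output/event-counts") // assumed path

        spark.stop()
      }
    }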
For the specific principle, refer to Fig. 4, which is a schematic diagram of the process of increasing the number of Tasks in an embodiment of the present invention.
S304: if the second situation type is data skew caused by executing an aggregation-type operator on an RDD or performing grouped aggregation with a group by statement in Spark SQL, appending a random prefix to identical Key values during execution, transforming the identical Key values into multiple different Key values, and distributing the transformed Key values to different Tasks for processing.
In the embodiment of the present invention, this is the data skew caused by executing an aggregation-type operator on an RDD or performing grouped aggregation with a group by statement in Spark SQL.
For the above second situation type, the embodiment of the present invention may append a random prefix to identical Key values during execution, transform the identical Key values into multiple different Key values, and distribute the transformed Key values to different Tasks for processing. For example, a random prefix, e.g. a random number within 10, is first appended to each of the identical Key values, so that the originally identical Key values are transformed into multiple different Key values: (hello, 1) (hello, 1) (hello, 1) (hello, 1) is transformed into (1_hello, 1) (1_hello, 1) (2_hello, 1) (2_hello, 1). Then an aggregation operation such as reduceByKey can be performed on the data carrying the random prefixes to carry out a partial (local) aggregation, with the partial aggregation result being (1_hello, 2) (2_hello, 2). The prefix of each Key value is then removed, giving (hello, 2) (hello, 2), and a global aggregation is performed again, yielding the final result (hello, 4). By appending random prefixes, the originally identical Key values become multiple different Key values, so the data originally processed by a single Task is spread across multiple Tasks for partial aggregation, thereby solving the technical problem of a single Task processing too much data. This works well for data skew caused by shuffle operations of the aggregation type: the data skew can be significantly alleviated and the performance of the Spark job improved several times.
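A sketch of this two-stage (local then global) aggregation with random key prefixes (the input path and prefix range are hypothetical, not taken from the patent):

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.util.Random

    object TwoStageAggregationSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("two-stage-aggregation-sketch"))

        // Hypothetical skewed word counts, e.g. very many ("hello", 1) pairs.
        val pairs = sc.textFile("hdfs:///input/words")   // assumed path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))

        // Stage 1: append a random prefix (0..9) so one hot key becomes ten keys,
        // then do a local aggregation spread over many tasks.
        val locallyAggregated = pairs
          .map { case (k, v) => (s"${Random.nextInt(10)}_$k", v) }
          .reduceByKey(_ + _)

        // Stage 2: strip the prefix and aggregate globally to get the true counts.
        val globallyAggregated = locallyAggregated
          .map { case (prefixedKey, v) => (prefixedKey.split("_", 2)(1), v) }
          .reduceByKey(_ + _)

        globallyAggregated.saveAsTextFile("hdfs:///output/word-counts") // assumed path
        sc.stop()
      }
    }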
For the specific principle, refer to Fig. 5, which is a schematic diagram of the process of transforming Key values in an embodiment of the present invention.
S305: if the second situation type is data skew caused when a join-type operation is used on RDDs or a join statement is used in Spark SQL and the data volume of one RDD or table in the join-type operation is less than a second preset threshold, directly implementing the join-type operation with a Broadcast variable and a map-type operator.
The second preset threshold may be set by the user according to the actual usage requirements of the Spark platform, or may be preset by the factory program of the dynamic memory management device for the Spark platform, which is not limited herein.
In the embodiment of the present invention, this may be, for example, the case where a join-type operation is used on RDDs or a join statement is used in Spark SQL, and the data volume of one RDD or table in the join-type operation is only a few hundred megabytes.
For the above second situation type, the embodiment of the present invention may directly implement the join-type operation with a Broadcast variable and a map-type operator; that is, the join-type operation is implemented using a Broadcast variable and a map-type operator, so the shuffle-type operation is avoided entirely and the data skew is completely prevented. The data of the RDD with the smaller data volume is pulled into the memory of the Driver with the collect operator, and a Broadcast variable is created for it; a map-type operator is then executed on the other RDD, and within the map-type operator function the full data of the smaller RDD is obtained from the Broadcast variable and compared with each record of the current RDD according to the join Key; if the join Keys are identical, the records of the two RDDs are joined. This works well for data skew caused by join-type operations; since no shuffle-type operation is involved, no data skew is generated.
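A sketch of this broadcast-based map-side join (paths and record layout are hypothetical, not taken from the patent); because the small side is shipped to every Executor, the join happens inside a map-type operator and no shuffle takes place:

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastJoinSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("broadcast-join-sketch"))

        // Hypothetical datasets: a small table and a large table, keyed by the first field.
        val smallRdd = sc.textFile("hdfs:///input/small")  // assumed path, a few hundred MB
          .map { line => val f = line.split(","); (f(0), f(1)) }
        val largeRdd = sc.textFile("hdfs:///input/large")  // assumed path
          .map { line => val f = line.split(","); (f(0), f(1)) }

        // Pull the small side to the driver and broadcast it to every executor.
        val smallMap = smallRdd.collectAsMap()
        val broadcastSmall = sc.broadcast(smallMap)

        // Map-side join: no shuffle, hence no opportunity for data skew.
        val joined = largeRdd.flatMap { case (key, largeValue) =>
          broadcastSmall.value.get(key).map(smallValue => (key, (largeValue, smallValue)))
        }

        joined.saveAsTextFile("hdfs:///output/joined")     // assumed path
        sc.stop()
      }
    }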
S306: if the second situation type is that the distribution of Key values in two RDDs or Hive tables is such that the data volume of some Key values in one RDD or Hive table is greater than or equal to a third preset threshold while the data volume of the Key values in the other RDD or Hive table is evenly distributed, causing data skew, splitting the skewed Key values out to obtain multiple split RDDs, appending a random prefix to each split RDD so that it is transformed into multiple different RDDs, and distributing the multiple different RDDs to different Tasks for processing.
The third preset threshold may be set by the user according to the actual usage requirements of the Spark platform, or may be preset by the factory program of the dynamic memory management device for the Spark platform, which is not limited herein.
For the above second situation type, the embodiment of the present invention may split the skewed Key values out to obtain multiple split RDDs, append a random prefix to each split RDD so that it is transformed into multiple different RDDs, and distribute the multiple different RDDs to different Tasks for processing.
As an example, refer to Fig. 6, which is a schematic diagram of the process of splitting and distributing part of the Key values in an embodiment of the present invention. Precisely because the right-hand RDD in Fig. 6 is expanded by a factor of 3, no matter which prefix within 3 is appended to the Key values of the left-hand column, every record of the left-hand column and of the right-hand column can still be joined.
In the embodiment of the present invention, if the Key values causing the data skew exist in only one RDD or Hive table, those Key values can be split into a separate RDD and a random prefix appended, transforming them into n parts before the join-type operation is performed. In this way, the data corresponding to those Key values is no longer concentrated on a few Tasks but distributed across multiple Tasks for the join-type operation, and only the data corresponding to the few skewed keys needs to be expanded n times; the full data does not need to be expanded, which avoids occupying excessive memory.
S307: if the second situation type is data skew caused when, during a join-type operation, the data volume of Key values present in an RDD is greater than or equal to a fourth preset threshold and the number of such Key values is greater than or equal to a fifth preset threshold, appending a random prefix to the Key value of every record in the RDD to obtain transformed records, expanding the other RDD, which has no data skew, and performing the join-type operation between the transformed records and the records of the expanded RDD.
The case where, during a join-type operation, the data volume of Key values present in an RDD is greater than or equal to the fourth preset threshold and the number of such Key values is greater than or equal to the fifth preset threshold may be, for example, that multiple Key values in the RDD/Hive table each correspond to more than 10,000 rows.
The fourth preset threshold and the fifth preset threshold may be set by the user according to the actual usage requirements of the Spark platform, or may be preset by the factory program of the dynamic memory management device for the Spark platform, which is not limited herein.
For the above second situation type, the embodiment of the present invention may first check the data distribution in the RDD/Hive table to determine the RDD/Hive table causing the data skew, for example an RDD/Hive table in which multiple Key values each correspond to more than 10,000 rows. A random prefix (within N, where N is a positive integer) is then appended to the Key value of every record in that RDD. At the same time, the other RDD, which has no data skew, is expanded: every record is expanded into N records, and a prefix is appended to each of the expanded records. Finally, the join-type operation is performed on the data of the two processed RDDs. This can essentially handle data skew of the join type effectively, with a notable effect and a good performance improvement.
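A sketch of this salting-and-expansion join (the expansion factor n, paths and record layout are hypothetical, not taken from the patent):

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.util.Random

    object SaltedJoinSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("salted-join-sketch"))
        val n = 10 // hypothetical expansion factor

        // Hypothetical datasets: skewedRdd has many hot keys, otherRdd is even.
        val skewedRdd = sc.textFile("hdfs:///input/skewed") // assumed path
          .map { line => val f = line.split(","); (f(0), f(1)) }
        val otherRdd = sc.textFile("hdfs:///input/other")   // assumed path
          .map { line => val f = line.split(","); (f(0), f(1)) }

        // Salt every key of the skewed side with a random prefix in [0, n).
        val saltedSkewed = skewedRdd.map { case (k, v) =>
          (s"${Random.nextInt(n)}_$k", v)
        }

        // Expand the even side n times, once per possible prefix, so every
        // salted key on the skewed side still finds its matching record.
        val expandedOther = otherRdd.flatMap { case (k, v) =>
          (0 until n).map(prefix => (s"${prefix}_$k", v))
        }

        // The join now spreads each hot key over n tasks instead of one.
        val joined = saltedSkewed.join(expandedOther)
          .map { case (saltedKey, vs) => (saltedKey.split("_", 2)(1), vs) }

        joined.saveAsTextFile("hdfs:///output/joined")       // assumed path
        sc.stop()
      }
    }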
In a specific implementation, if only a relatively simple data skew scenario needs to be handled, one of the above solutions may be used and can solve the problem. If a more complex data skew scenario needs to be handled, several solutions may be combined. For example, for a Spark job in which data skew occurs at multiple points, the steps in S301 and S302 above may be used first to pre-process part of the data and filter part of the data to alleviate the skew; the parallelism of some shuffle operations may then be increased to optimize their performance; and for the different aggregation or join-type operations, one of the above solutions may be selected to optimize their performance. This further improves the flexibility and applicability of the dynamic management.
Fig. 7 is a schematic flowchart of a dynamic memory management method for the Spark platform proposed by another embodiment of the present invention.
Referring to Fig. 7, the method includes:
S701: determining the first situation type that causes a memory overflow during operation of the Spark platform.
S702: if the first situation type is a memory overflow in the Shuffle stage, tuning the Shuffle stage.
In a specific implementation of the embodiment of the present invention, most of the performance cost during operation of the Spark platform is consumed in the Shuffle stage, because the Shuffle stage involves a large amount of disk I/O, serialization and network data transmission. Therefore, if the first situation type is a memory overflow in the Shuffle stage, tuning the Shuffle stage can further improve the operating performance of the Spark platform.
Optionally, tuning the Shuffle stage includes:
creating a Shuffle file group for each Task executed in the first batch of the Shuffle stage, the Shuffle file group corresponding to multiple disk files whose number equals the number of Tasks of the next processing stage of the Shuffle stage; and, after each Task executed in the first batch has finished, starting the execution of the Tasks executed in the second batch and, during the execution of the Tasks executed in the second batch, reusing the Shuffle file group and its corresponding disk files.
As an example, in the embodiment of the present invention, in order to tune the Shuffle stage, the value of the spark.shuffle.consolidateFiles parameter can first be set to "true" to enable the consolidate mechanism. After the consolidate mechanism is enabled, during shuffle write operations a Shuffle file group (shuffleFileGroup) is created for each Task executed in the first batch of the Shuffle stage; each shuffleFileGroup corresponds to multiple disk files whose number equals the number of Tasks of the next processing stage of the Shuffle stage. An execution node (Executor) executes as many Tasks in parallel as it has CPU cores, and each Task of the first batch executed in parallel creates a shuffleFileGroup and writes its data into the corresponding disk files.
In the embodiment of the present invention, when a CPU core of the execution node (Executor) finishes one Task and then executes the next batch of Tasks, the next batch of Tasks reuses the previously existing shuffleFileGroup, including the disk files therein. That is, at this point a Task writes its data into the existing disk files instead of into new disk files. The consolidate mechanism therefore allows Tasks of different batches to reuse the same batch of disk files, which effectively merges the disk files of multiple Tasks to a certain extent, greatly reduces the number of disk files, and thus improves the performance of shuffle write operations.
In the embodiment of the present invention, suppose that the second stage has 100 Tasks, the first stage has 50 Tasks, there are 10 execution nodes (Executors) in total, and each execution node (Executor) executes 5 Tasks.
After the optimization in the embodiment of the present invention, the number of disk files created by each execution node (Executor) is calculated as: number of CPU cores * number of Tasks of the next stage. That is, each execution node (Executor) only creates 100 disk files at this point, and all the execution nodes (Executors) together only create 1,000 disk files.
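A minimal configuration sketch for enabling this mechanism (a sketch only; spark.shuffle.consolidateFiles applies to the hash-based shuffle of older Spark releases and was removed in later versions):

    import org.apache.spark.{SparkConf, SparkContext}

    object ConsolidateShuffleFilesSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("consolidate-shuffle-files-sketch")
          // Enable the consolidate mechanism so tasks of later batches reuse the
          // shuffleFileGroup (and its disk files) created by the first batch.
          .set("spark.shuffle.consolidateFiles", "true")
        val sc = new SparkContext(conf)

        // With consolidation, disk files per Executor = CPU cores x tasks of the
        // next stage (e.g. 100 in the example above), instead of tasks of this
        // stage x tasks of the next stage.
        // ... shuffle-heavy job logic ...
        sc.stop()
      }
    }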
In the present embodiment, by the first situation type for the Shuffle stages memory overflow when, to the Shuffle stages Tuning processing is carried out, can further promote the runnability of Spark platforms.
Optionally, in some embodiments, the first situation type further includes a memory overflow of the Reduce stage, and managing the memory dynamically according to the method corresponding to the first situation type further includes: determining the Reduce node at which the memory overflow occurs; determining, by means of a smoke algorithm, the balance proportion, relative to the Reduce stage, of the Reduce nodes other than the Reduce node at which the memory overflow occurs; and retaining or discarding the data result of the Reduce node at which the memory overflow occurs according to the balance proportion.
The smoke algorithm here is a statistical method that, based on a communication thread, gathers statistics on a certain feature dimension at the Reduce nodes and computes the balance proportion of the data of the different Reduce nodes under that feature.
In the embodiment of the present invention, if the determined balance proportion satisfies a preset value, the data can be considered well balanced; the data result of the Reduce node at which the memory overflow occurs is then discarded, while the complete result data that have already been fetched are retained. If the determined balance proportion does not satisfy the preset value, the data are considered poorly balanced, and the data result of the Reduce node at which the memory overflow occurs is retained.
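The text does not fix the exact statistic behind the balance proportion, so the following is only a hedged sketch of the retain-or-discard decision: the function name, the record counts, the min/max-share statistic and the preset value of 0.8 are all assumptions made for illustration.

```scala
// Hedged sketch of the retain-or-discard decision driven by the balance proportion.
// The node ids, the balance statistic and the preset value are illustrative assumptions.
object BalanceDecision {
  def shouldDiscardOomResult(recordCounts: Map[String, Long],
                             oomNode: String,
                             presetValue: Double = 0.8): Boolean = {
    // statistics of the other Reduce nodes, i.e. those that did not overflow
    val others = recordCounts.filter { case (node, _) => node != oomNode }.values.toSeq
    if (others.isEmpty) return false
    // one simple notion of balance proportion: smallest share over largest share
    val balance = others.min.toDouble / others.max.toDouble
    // well balanced -> the already-fetched results cover the data evenly, so the
    // overflowing node's partial result is discarded; otherwise it is retained
    balance >= presetValue
  }

  def main(args: Array[String]): Unit = {
    val counts = Map("reduce-1" -> 980000L, "reduce-2" -> 1010000L, "reduce-3" -> 120000L)
    println(shouldDiscardOomResult(counts, oomNode = "reduce-3")) // true: discard
  }
}
```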
Fig. 8 is a structural diagram of a dynamic memory management device for a Spark platform proposed by an embodiment of the present invention.
Referring to Fig. 8, the device 800 includes:
a first determining module 801, configured to determine a first situation type of a memory overflow occurring during operation of the Spark platform;
a dynamic management module 802, configured to manage the memory dynamically according to a method corresponding to the first situation type;
wherein the first situation type includes: a memory overflow of the Map stage, a memory overflow caused by calling the coalesce function, a memory overflow of the Shuffle stage, a memory overflow caused by unbalanced resource allocation in standalone mode, and a memory overflow caused by data skew.
The dynamic management module 802 is specifically configured to perform the following (a code sketch of several of these remedies is given after this list):
if the first situation type is a memory overflow of the Map stage or a memory overflow caused by calling the coalesce function, perform partition processing on each Task before the Map operation of the Map stage;
if the first situation type is a memory overflow of the Shuffle stage, adjust the number of partitions of the partitioner passed in to the Shuffle stage;
if the first situation type is a memory overflow caused by unbalanced resource allocation in standalone mode, configure the parameter executor-cores or the parameter spark.executor.cores in standalone mode;
if the first situation type is a memory overflow caused by data skew, trigger the second determining module 803 to determine the code position at which the data skew occurs and the data distribution; the dynamic management module 802 then manages the memory dynamically according to the code position and the data distribution.
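As referenced above, a hedged sketch of several of these remedies follows: repartitioning before the Map operation in place of a bare coalesce, handing the shuffle a partitioner with more partitions, and setting spark.executor.cores. The partition counts, the core count and the paths are illustrative assumptions, not values prescribed by the embodiment.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hedged sketch of several remedies listed above; all concrete numbers are assumptions.
object OomRemedies {
  def heavyTransform(line: String): (String, Int) = (line.split(",")(0), 1)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("oom-remedies-example")
      .set("spark.executor.cores", "2") // balance cores against executor memory (standalone mode)
    val sc = new SparkContext(conf)

    val raw = sc.textFile("hdfs:///input/path") // hypothetical input

    // Map-stage / coalesce-induced overflow: repartition (a shuffled coalesce)
    // before the Map operation so no single Task holds too much data.
    val mapped = raw.repartition(200).map(heavyTransform)

    // Shuffle-stage overflow: pass a partitioner with more partitions into the
    // shuffle operation so each Task's shuffle buffer stays smaller.
    val reduced = mapped.reduceByKey(new HashPartitioner(400), _ + _)
    reduced.saveAsTextFile("hdfs:///output/path") // hypothetical output

    sc.stop()
  }
}
```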
Optionally, in some embodiments, the second determining module 803 is further configured to determine a second situation type of the data skew according to the code position of the data skew and the data distribution (a sampling-based sketch of this determination is given after the list below);
the dynamic management module 802 is specifically configured to manage the memory dynamically according to a method corresponding to the second situation type;
the second situation type includes:
data skew caused by uneven data distribution in a Hive table;
the number of Key values causing the data skew being less than or equal to a first preset threshold;
the degree of parallelism of the Shuffle stage not satisfying a preset condition;
data skew caused by performing an aggregation-class operator on a resilient distributed dataset (RDD) or by performing group aggregation using a group by statement in Spark SQL;
data skew caused when a join-class operation is performed on RDDs or a join statement is used in Spark SQL, and the data volume of one RDD or table in the join operation is less than a second preset threshold;
data skew caused when the distribution of Key values in two RDDs or Hive tables is such that the data volume of some Key values in one RDD or Hive table is greater than or equal to a third preset threshold while the data volume of the Key values in the other RDD or Hive table is evenly distributed;
data skew caused when, during a join-class operation, the data volume of Key values present in an RDD is greater than or equal to a fourth preset threshold.
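As noted above, determining the data distribution can be made concrete with a sampling pass over the skewed RDD; the sketch below estimates per-Key record counts, with the sampling fraction and skew threshold being illustrative assumptions rather than values given in the embodiment.

```scala
import org.apache.spark.rdd.RDD

// Hedged sketch of determining the data distribution: sample the RDD and count
// records per Key to see which Keys are skewed. Fraction and threshold are assumptions.
object SkewDetection {
  def findSkewedKeys(pairs: RDD[(String, Int)],
                     fraction: Double = 0.1,
                     skewThreshold: Long = 1000000L): Seq[(String, Long)] = {
    pairs
      .sample(withReplacement = false, fraction)
      .map { case (k, _) => (k, 1L) }
      .reduceByKey(_ + _)
      .map { case (k, cnt) => (k, (cnt / fraction).toLong) } // scale estimate back up
      .filter { case (_, est) => est >= skewThreshold }
      .collect()
      .toSeq
  }
}
```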
Optionally, in some embodiments, the dynamic management module 802 is further specifically configured to perform the following (code sketches of the random-prefix and broadcast-join remedies follow this list):
if the second situation type is data skew caused by uneven data distribution in a Hive table, pre-aggregate the data in the Hive table by Key value using Hive ETL, or perform the join-class operation between the Hive table and the other tables in advance;
if the second situation type is that the number of Key values causing the data skew is less than or equal to the first preset threshold, directly filter out the Key values that cause the data skew;
if the second situation type is that the degree of parallelism of the Shuffle stage does not satisfy the preset condition, increase the value of the parameter spark.sql.shuffle.partitions required by the Shuffle stage, so as to increase the number of Tasks in the Shuffle stage;
if the second situation type is data skew caused by performing an aggregation-class operator on an RDD or by grouping and aggregating with a group by statement in Spark SQL, attach random prefixes to identical Key values during execution so that the identical Key values are transformed into multiple different Key values, and distribute the resulting different Key values to different Tasks for processing;
if the second situation type is data skew caused when a join-class operation is performed on RDDs or a join statement is used in Spark SQL and the data volume of one RDD or table in the join operation is less than the second preset threshold, implement the join-class operation directly with a Broadcast variable and Map-class operators;
if the second situation type is data skew caused when the distribution of Key values in two RDDs or Hive tables is such that the data volume of some Key values in one RDD or Hive table is greater than or equal to the third preset threshold while the data volume of the Key values in the other RDD or Hive table is evenly distributed, split off those Key values to obtain multiple RDDs, attach a random prefix to each RDD obtained by the split so that they are transformed into multiple different RDDs, and distribute the multiple different RDDs to different Tasks for processing;
if the second situation type is data skew caused when, during a join-class operation, the data volume of Key values present in an RDD is greater than or equal to the fourth preset threshold and the number of Key values whose data volume is greater than or equal to the fourth preset threshold is greater than or equal to a fifth preset threshold, attach a random prefix to the Key value of every record in the RDD to obtain the transformed records, expand the other RDD that has no data skew, and perform the join-class operation between the transformed records and the records of the expanded RDD.
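The random-prefix and broadcast-join remedies referenced in the list above can be sketched as follows; the prefix range of 10, the key and value types and the helper names are assumptions for illustration, not part of the embodiment.

```scala
import scala.util.Random
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hedged sketches of two remedies from the list above; concrete values are assumptions.
object SkewRemedies {

  // Random-prefix ("salting") aggregation: identical Keys are first spread across
  // different Tasks, partially aggregated, then aggregated again after the prefix
  // is stripped off.
  def saltedAggregate(pairs: RDD[(String, Long)], saltRange: Int = 10): RDD[(String, Long)] =
    pairs
      .map { case (k, v) => (s"${Random.nextInt(saltRange)}_$k", v) } // attach random prefix
      .reduceByKey(_ + _)                                             // first-stage aggregation
      .map { case (salted, v) => (salted.split("_", 2)(1), v) }       // strip prefix
      .reduceByKey(_ + _)                                             // final aggregation

  // Broadcast join: when one side is small, collect it to the driver, broadcast it,
  // and replace the join-class operation with a map-side lookup.
  def broadcastJoin(big: RDD[(String, Long)],
                    small: RDD[(String, String)],
                    sc: SparkContext): RDD[(String, (Long, String))] = {
    val smallMap = sc.broadcast(small.collectAsMap())
    big.flatMap { case (k, v) =>
      smallMap.value.get(k).map(s => (k, (v, s))) // inner-join semantics
    }
  }
}
```

The salted aggregation trades one extra shuffle for a more even spread of each hot Key across Tasks, while the broadcast join removes the shuffle entirely when the small side fits in driver and executor memory.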
Optionally, in some embodiments, referring to Fig. 9, the device 800 further includes:
a tuning module 804, configured to perform tuning processing on the Shuffle stage when the first situation type is a memory overflow of the Shuffle stage.
Optionally, in some embodiments, referring to Fig. 9, the tuning module 804 includes:
a creating submodule 8041, configured to create a Shuffle file group for each Task executed in the first batch of the Shuffle stage, where each Shuffle file group corresponds to multiple disk files and the number of disk files equals the number of Tasks of the next processing stage following the Shuffle stage; and
a multiplexing submodule 8042, configured to start execution of the second batch of Tasks after every Task of the first batch has finished, and to reuse the Shuffle file groups and their corresponding disk files during execution of the second batch of Tasks.
Optionally, in some embodiments, the first situation type further includes a memory overflow of the Reduce stage, and the dynamic management module 802 is further configured to:
determine the Reduce node at which the memory overflow occurs, determine, by means of the smoke algorithm, the balance proportion, relative to the Reduce stage, of the Reduce nodes other than the Reduce node at which the memory overflow occurs, and retain or discard the data result of the Reduce node at which the memory overflow occurs according to the balance proportion.
It should be noted that the explanations given for the embodiments of the dynamic memory management method for a Spark platform in Fig. 1 to Fig. 7 also apply to the dynamic memory management device 800 for a Spark platform of this embodiment; the implementation principles are similar and are not repeated here.
In the present embodiment, the first situation type of a memory overflow occurring during operation of the Spark platform is determined, and the memory is managed dynamically according to the method corresponding to the first situation type. Because the first situation type covers the many possible causes of a memory overflow, targeted solutions are applied to manage the memory dynamically during operation of the Spark platform, so that memory overflows during operation of the Spark platform can be effectively avoided and the robustness and running performance of the Spark platform are improved.
To implement the above embodiments, the present invention further proposes a non-transitory computer-readable storage medium. When instructions in the storage medium are executed by a processor of a terminal, the terminal is enabled to perform a dynamic memory management method for a Spark platform, the method including:
determining a first situation type of a memory overflow occurring during operation of the Spark platform;
managing the memory dynamically according to a method corresponding to the first situation type;
wherein the first situation type includes: a memory overflow of the Map stage, a memory overflow caused by calling the coalesce function, a memory overflow of the Shuffle stage, a memory overflow caused by unbalanced resource allocation in standalone mode, and a memory overflow caused by data skew;
the managing the memory dynamically according to the method corresponding to the first situation type including:
if the first situation type is a memory overflow of the Map stage or a memory overflow caused by calling the coalesce function, performing partition processing on each Task before the Map operation of the Map stage;
if the first situation type is a memory overflow of the Shuffle stage, adjusting the number of partitions of the partitioner passed in to the Shuffle stage;
if the first situation type is a memory overflow caused by unbalanced resource allocation in standalone mode, configuring the parameter executor-cores or the parameter spark.executor.cores in standalone mode;
if the first situation type is a memory overflow caused by data skew, determining the code position at which the data skew occurs and the data distribution, and managing the memory dynamically according to the code position and the data distribution.
With the non-transitory computer-readable storage medium of this embodiment, the first situation type of a memory overflow occurring during operation of the Spark platform is determined and the memory is managed dynamically according to the method corresponding to the first situation type. Because the first situation type covers the many possible causes of a memory overflow, targeted solutions are applied to manage the memory during operation of the Spark platform, so that memory overflows during operation of the Spark platform can be effectively avoided and the robustness and running performance of the Spark platform are improved.
To implement the above embodiments, the present invention further proposes a computer program product. When instructions in the computer program product are executed by a processor, a dynamic memory management method for a Spark platform is performed, the method including:
determining a first situation type of a memory overflow occurring during operation of the Spark platform;
managing the memory dynamically according to a method corresponding to the first situation type;
wherein the first situation type includes: a memory overflow of the Map stage, a memory overflow caused by calling the coalesce function, a memory overflow of the Shuffle stage, a memory overflow caused by unbalanced resource allocation in standalone mode, and a memory overflow caused by data skew;
the managing the memory dynamically according to the method corresponding to the first situation type including:
if the first situation type is a memory overflow of the Map stage or a memory overflow caused by calling the coalesce function, performing partition processing on each Task before the Map operation of the Map stage;
if the first situation type is a memory overflow of the Shuffle stage, adjusting the number of partitions of the partitioner passed in to the Shuffle stage;
if the first situation type is a memory overflow caused by unbalanced resource allocation in standalone mode, configuring the parameter executor-cores or the parameter spark.executor.cores in standalone mode;
if the first situation type is a memory overflow caused by data skew, determining the code position at which the data skew occurs and the data distribution, and managing the memory dynamically according to the code position and the data distribution.
With the computer program product of this embodiment, the first situation type of a memory overflow occurring during operation of the Spark platform is determined and the memory is managed dynamically according to the method corresponding to the first situation type. Because the first situation type covers the many possible causes of a memory overflow, targeted solutions are applied to manage the memory during operation of the Spark platform, so that memory overflows during operation of the Spark platform can be effectively avoided and the robustness and running performance of the Spark platform are improved.
It should be noted that in the description of the present invention, the terms "first", "second" and the like are used for descriptive purposes only and shall not be understood as indicating or implying relative importance. In addition, in the description of the present invention, unless otherwise specified, "multiple" means two or more.
Any process or method description in the flowcharts or otherwise described herein may be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present invention includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order, depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
It should be understood that each part of the present invention may be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques known in the art may be used: discrete logic circuits having logic gate circuits for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gate circuits, programmable gate arrays (PGA), field programmable gate arrays (FPGA), and the like.
Those skilled in the art can understand that all or part of the steps of the above method embodiments may be completed by instructing related hardware through a program; the program may be stored in a computer-readable storage medium and, when executed, performs one of or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", "some examples" and the like means that specific features, structures, materials or characteristics described in connection with the embodiment or example are included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although the embodiments of the present invention have been shown and described above, it can be understood that the above embodiments are exemplary and shall not be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, replacements and variations to the above embodiments within the scope of the present invention.

Claims (10)

1. A dynamic memory management method for a Spark platform, characterized by comprising the following steps:
determining a first situation type of a memory overflow occurring during operation of the Spark platform;
managing the memory dynamically according to a method corresponding to the first situation type;
wherein the first situation type comprises: a memory overflow of the Map stage, a memory overflow caused by calling the coalesce function, a memory overflow of the Shuffle stage, a memory overflow caused by unbalanced resource allocation in standalone mode, a memory overflow caused by data skew, and a memory overflow of the Reduce stage;
the managing the memory dynamically according to the method corresponding to the first situation type comprising:
if the first situation type is the memory overflow of the Map stage or the memory overflow caused by calling the coalesce function, performing partition processing on each Task before the Map operation of the Map stage;
if the first situation type is the memory overflow of the Shuffle stage, adjusting the number of partitions of the partitioner passed in to the Shuffle stage, and performing tuning processing on the Shuffle stage;
if the first situation type is the memory overflow caused by unbalanced resource allocation in standalone mode, configuring the parameter executor-cores or the parameter spark.executor.cores in standalone mode;
if the first situation type is the memory overflow caused by data skew, determining the code position at which the data skew occurs and the data distribution, and managing the memory dynamically according to the code position and the data distribution;
if the first situation type is the memory overflow of the Reduce stage, determining the Reduce node at which the memory overflow occurs, determining, by means of a smoke algorithm, the balance proportion, relative to the Reduce stage, of the Reduce nodes other than the Reduce node at which the memory overflow occurs, and retaining or discarding the data result of the Reduce node at which the memory overflow occurs according to the balance proportion.
2. The dynamic memory management method for a Spark platform according to claim 1, characterized in that the determining the code position at which the data skew occurs and the data distribution, and the managing the memory dynamically according to the code position and the data distribution, comprise:
determining a second situation type of the data skew according to the code position of the data skew and the data distribution;
managing the memory dynamically according to a method corresponding to the second situation type;
wherein the second situation type comprises:
data skew caused by uneven data distribution in a Hive table;
the number of Key values causing the data skew being less than or equal to a first preset threshold;
the degree of parallelism of the Shuffle stage not satisfying a preset condition;
data skew caused by performing an aggregation-class operator on a resilient distributed dataset (RDD) or by performing group aggregation using a group by statement in Spark SQL;
data skew caused when a join-class operation is performed on RDDs or a join statement is used in Spark SQL, and the data volume of one RDD or table in the join operation is less than a second preset threshold;
data skew caused when the distribution of Key values in two RDDs or Hive tables is such that the data volume of some Key values in one RDD or Hive table is greater than or equal to a third preset threshold while the data volume of the Key values in the other RDD or Hive table is evenly distributed;
data skew caused when, during a join-class operation, the data volume of Key values present in an RDD is greater than or equal to a fourth preset threshold.
3. The dynamic memory management method for a Spark platform according to claim 1, characterized in that the managing the memory dynamically according to the method corresponding to the second situation type comprises:
if the second situation type is data skew caused by uneven data distribution in a Hive table, pre-aggregating the data in the Hive table by Key value using Hive ETL, or performing the join-class operation between the Hive table and the other tables in advance;
if the second situation type is that the number of Key values causing the data skew is less than or equal to the first preset threshold, directly filtering out the Key values that cause the data skew;
if the second situation type is that the degree of parallelism of the Shuffle stage does not satisfy the preset condition, increasing the value of the parameter spark.sql.shuffle.partitions required by the Shuffle stage, so as to increase the number of Tasks in the Shuffle stage;
if the second situation type is data skew caused by performing an aggregation-class operator on an RDD or by grouping and aggregating with a group by statement in Spark SQL, attaching random prefixes to identical Key values during execution so that the identical Key values are transformed into multiple different Key values, and distributing the resulting different Key values to different Tasks for processing;
if the second situation type is data skew caused when a join-class operation is performed on RDDs or a join statement is used in Spark SQL and the data volume of one RDD or table in the join operation is less than the second preset threshold, implementing the join-class operation directly with a Broadcast variable and Map-class operators;
if the second situation type is data skew caused when the distribution of Key values in two RDDs or Hive tables is such that the data volume of some Key values in one RDD or Hive table is greater than or equal to the third preset threshold while the data volume of the Key values in the other RDD or Hive table is evenly distributed, splitting off those Key values to obtain multiple RDDs, attaching a random prefix to each RDD obtained by the split so that they are transformed into multiple different RDDs, and distributing the multiple different RDDs to different Tasks for processing;
if the second situation type is data skew caused when, during a join-class operation, the data volume of Key values present in an RDD is greater than or equal to the fourth preset threshold and the number of Key values whose data volume is greater than or equal to the fourth preset threshold is greater than or equal to a fifth preset threshold, attaching a random prefix to the Key value of every record in the RDD to obtain the transformed records, expanding the other RDD that has no data skew, and performing the join-class operation between the transformed records and the records of the expanded RDD.
4. The dynamic memory management method for a Spark platform according to claim 1, characterized in that the performing tuning processing on the Shuffle stage comprises:
creating a Shuffle file group for each Task executed in the first batch of the Shuffle stage, wherein each Shuffle file group corresponds to multiple disk files and the number of the disk files equals the number of Tasks of the next processing stage following the Shuffle stage;
after every Task executed in the first batch has finished, starting execution of the second batch of Tasks, and reusing the Shuffle file groups and their corresponding disk files during execution of the second batch of Tasks.
5. A dynamic memory management device for a Spark platform, characterized by comprising:
a first determining module, configured to determine a first situation type of a memory overflow occurring during operation of the Spark platform;
a dynamic management module, configured to manage the memory dynamically according to a method corresponding to the first situation type;
wherein the first situation type comprises: a memory overflow of the Map stage, a memory overflow caused by calling the coalesce function, a memory overflow of the Shuffle stage, a memory overflow caused by unbalanced resource allocation in standalone mode, a memory overflow caused by data skew, and a memory overflow of the Reduce stage;
the dynamic management module being specifically configured to:
if the first situation type is the memory overflow of the Map stage or the memory overflow caused by calling the coalesce function, perform partition processing on each Task before the Map operation of the Map stage;
if the first situation type is the memory overflow of the Shuffle stage, adjust the number of partitions of the partitioner passed in to the Shuffle stage;
if the first situation type is the memory overflow caused by unbalanced resource allocation in standalone mode, configure the parameter executor-cores or the parameter spark.executor.cores in standalone mode;
if the first situation type is the memory overflow caused by data skew, trigger a second determining module to determine the code position at which the data skew occurs and the data distribution, the dynamic management module managing the memory dynamically according to the code position and the data distribution;
the device further comprising:
a tuning module, configured to perform tuning processing on the Shuffle stage when the first situation type is the memory overflow of the Shuffle stage;
the dynamic management module being further configured to:
determine the Reduce node at which the memory overflow occurs, determine, by means of a smoke algorithm, the balance proportion, relative to the Reduce stage, of the Reduce nodes other than the Reduce node at which the memory overflow occurs, and retain or discard the data result of the Reduce node at which the memory overflow occurs according to the balance proportion.
6. The dynamic memory management device for a Spark platform according to claim 5, characterized in that
the second determining module is further configured to determine a second situation type of the data skew according to the code position of the data skew and the data distribution;
the dynamic management module is specifically configured to manage the memory dynamically according to a method corresponding to the second situation type;
the second situation type comprises:
data skew caused by uneven data distribution in a Hive table;
the number of Key values causing the data skew being less than or equal to a first preset threshold;
the degree of parallelism of the Shuffle stage not satisfying a preset condition;
data skew caused by performing an aggregation-class operator on an RDD or by performing group aggregation using a group by statement in Spark SQL;
data skew caused when a join-class operation is performed on RDDs or a join statement is used in Spark SQL, and the data volume of one RDD or table in the join operation is less than a second preset threshold;
data skew caused when the distribution of Key values in two RDDs or Hive tables is such that the data volume of some Key values in one RDD or Hive table is greater than or equal to a third preset threshold while the data volume of the Key values in the other RDD or Hive table is evenly distributed;
data skew caused when, during a join-class operation, the data volume of Key values present in an RDD is greater than or equal to a fourth preset threshold.
7. The dynamic memory management device for a Spark platform according to claim 5, characterized in that the dynamic management module is further specifically configured to:
if the second situation type is data skew caused by uneven data distribution in a Hive table, pre-aggregate the data in the Hive table by Key value using Hive ETL, or perform the join-class operation between the Hive table and the other tables in advance;
if the second situation type is that the number of Key values causing the data skew is less than or equal to the first preset threshold, directly filter out the Key values that cause the data skew;
if the second situation type is that the degree of parallelism of the Shuffle stage does not satisfy the preset condition, increase the value of the parameter spark.sql.shuffle.partitions required by the Shuffle stage, so as to increase the number of Tasks in the Shuffle stage;
if the second situation type is data skew caused by performing an aggregation-class operator on an RDD or by grouping and aggregating with a group by statement in Spark SQL, attach random prefixes to identical Key values during execution so that the identical Key values are transformed into multiple different Key values, and distribute the resulting different Key values to different Tasks for processing;
if the second situation type is data skew caused when a join-class operation is performed on RDDs or a join statement is used in Spark SQL and the data volume of one RDD or table in the join operation is less than the second preset threshold, implement the join-class operation directly with a Broadcast variable and Map-class operators;
if the second situation type is data skew caused when the distribution of Key values in two RDDs or Hive tables is such that the data volume of some Key values in one RDD or Hive table is greater than or equal to the third preset threshold while the data volume of the Key values in the other RDD or Hive table is evenly distributed, split off those Key values to obtain multiple RDDs, attach a random prefix to each RDD obtained by the split so that they are transformed into multiple different RDDs, and distribute the multiple different RDDs to different Tasks for processing;
if the second situation type is data skew caused when, during a join-class operation, the data volume of Key values present in an RDD is greater than or equal to the fourth preset threshold and the number of Key values whose data volume is greater than or equal to the fourth preset threshold is greater than or equal to a fifth preset threshold, attach a random prefix to the Key value of every record in the RDD to obtain the transformed records, expand the other RDD that has no data skew, and perform the join-class operation between the transformed records and the records of the expanded RDD.
8. The dynamic memory management device for a Spark platform according to claim 5, characterized in that the tuning module comprises:
a creating submodule, configured to create a Shuffle file group for each Task executed in the first batch of the Shuffle stage, wherein each Shuffle file group corresponds to multiple disk files and the number of the disk files equals the number of Tasks of the next processing stage following the Shuffle stage; and
a multiplexing submodule, configured to start execution of the second batch of Tasks after every Task executed in the first batch has finished, and to reuse the Shuffle file groups and their corresponding disk files during execution of the second batch of Tasks.
9. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the dynamic memory management method for a Spark platform according to any one of claims 1 to 4.
10. A computer program product, wherein when instructions in the computer program product are executed by a processor, a dynamic memory management method for a Spark platform is performed, the method comprising:
determining a first situation type of a memory overflow occurring during operation of the Spark platform;
managing the memory dynamically according to a method corresponding to the first situation type;
wherein the first situation type comprises: a memory overflow of the Map stage, a memory overflow caused by calling the coalesce function, a memory overflow of the Shuffle stage, a memory overflow caused by unbalanced resource allocation in standalone mode, a memory overflow caused by data skew, and a memory overflow of the Reduce stage;
the managing the memory dynamically according to the method corresponding to the first situation type comprising:
if the first situation type is the memory overflow of the Map stage or the memory overflow caused by calling the coalesce function, performing partition processing on each Task before the Map operation of the Map stage;
if the first situation type is the memory overflow of the Shuffle stage, adjusting the number of partitions of the partitioner passed in to the Shuffle stage, and performing tuning processing on the Shuffle stage;
if the first situation type is the memory overflow caused by unbalanced resource allocation in standalone mode, configuring the parameter executor-cores or the parameter spark.executor.cores in standalone mode;
if the first situation type is the memory overflow caused by data skew, determining the code position at which the data skew occurs and the data distribution, and managing the memory dynamically according to the code position and the data distribution;
if the first situation type is the memory overflow of the Reduce stage, determining the Reduce node at which the memory overflow occurs, determining, by means of a smoke algorithm, the balance proportion, relative to the Reduce stage, of the Reduce nodes other than the Reduce node at which the memory overflow occurs, and retaining or discarding the data result of the Reduce node at which the memory overflow occurs according to the balance proportion.
CN201711477992.7A 2017-12-29 2017-12-29 The dynamic memory management method and device of Spark platforms Pending CN108196970A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711477992.7A CN108196970A (en) 2017-12-29 2017-12-29 The dynamic memory management method and device of Spark platforms

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711477992.7A CN108196970A (en) 2017-12-29 2017-12-29 The dynamic memory management method and device of Spark platforms

Publications (1)

Publication Number Publication Date
CN108196970A true CN108196970A (en) 2018-06-22

Family

ID=62586541

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711477992.7A Pending CN108196970A (en) 2017-12-29 2017-12-29 The dynamic memory management method and device of Spark platforms

Country Status (1)

Country Link
CN (1) CN108196970A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130304725A1 (en) * 2012-05-10 2013-11-14 Pitney Bowes Inc. Systems and methods for dynamically selecting graphical query result display modes
CN103218263A (en) * 2013-03-12 2013-07-24 北京航空航天大学 Dynamic determining method and device for MapReduce parameter
CN107066612A (en) * 2017-05-05 2017-08-18 郑州云海信息技术有限公司 A kind of self-adapting data oblique regulating method operated based on SparkJoin

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
天涯明月: "Spark Performance Optimization: Data Skew Tuning", Docin, HTTPS://WWW.DOCIN.COM/P-2004594993.HTML *
拱头: "Solutions to Spark OOM Problems and an Optimization Summary", CSDN blog, HTTPS://BLOG.CSDN.NET/YHB315279058/ARTICLE/DETAILS/51035631?UTM_MEDIUM=DISTRIBUTE.PC_RELEVANT.NONE-TASK-BLOG-TITLE-2&SPM=1001.2101.3001.4242 *
日月的弯刀: "Spark Performance Tuning: Solving Data Skew", HTTPS://WWW.CNBLOGS.COM/HAOZHENGFEI/P/A073F41D0E4FB055D7438B5D6A4B0312.HTML *
美团技术团队 (Meituan Tech Team): "Spark Performance Optimization Guide: Advanced", HTTPS://TECH.MEITUAN.COM/2016/05/12/SPARK-TUNING-PRO.HTML *
过往记忆大数据: "Spark Performance Optimization: Data Skew Tuning", Spark big data blog, HTTPS://WWW.ITEBLOG.COM/ARCHIVES/1671.HTML *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110673794A (en) * 2019-09-18 2020-01-10 中兴通讯股份有限公司 Distributed data equalization processing method and device, computing terminal and storage medium
WO2021052169A1 (en) * 2019-09-18 2021-03-25 中兴通讯股份有限公司 Equalization processing method and device for distributed data, computing terminal and storage medium
CN111382335A (en) * 2020-03-19 2020-07-07 腾讯科技(深圳)有限公司 Data pulling method and device and storage medium
CN111382335B (en) * 2020-03-19 2023-03-17 腾讯科技(深圳)有限公司 Data pulling method and device and storage medium
CN115495251A (en) * 2022-11-17 2022-12-20 北京滴普科技有限公司 Intelligent control method and system for computing resources in data integration operation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20180622)