CN108280008A

CN108280008A - A real-time detection method for abnormal nodes in Hadoop clusters

Info

Publication number: CN108280008A
Application number: CN201711049620.4A
Authority: CN
Inventors: 田帅; 汪海涛
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2017-10-31
Filing date: 2017-10-31
Publication date: 2018-07-13

Abstract

The present invention relates to a real-time detection method for abnormal nodes in Hadoop clusters and belongs to the technical field of Hadoop cluster anomaly detection. The invention collects the logs that Hadoop outputs in real time, analyzes and classifies the logs and gathers statistics from them, converts those statistics and computes a z-score, and determines whether a node is abnormal by checking whether the score exceeds a threshold. The invention takes full account of the tight coupling between map tasks and reduce tasks in Hadoop jobs, considering both kinds of task and converting between them, which gives higher accuracy. The invention uses map task completeness as the time measure, which gauges the real-time behavior of the method more flexibly.

Description

A real-time detection method for abnormal nodes in Hadoop clusters

Technical field

The present invention relates to a real-time detection method for abnormal nodes in Hadoop clusters and belongs to the technical field of Hadoop cluster anomaly detection.

Background

Scientific and technological progress inevitably brings considerable change to society, and the big data era has emerged along with it. In this environment, frameworks for computing over and storing massive data have appeared one after another. Hadoop is a parallel distributed framework developed by Apache based on the MapReduce idea published by Google; it divides big data evenly into very small parts and distributes them to individual nodes in the cluster to run. As one implementation of the MapReduce framework, Hadoop is used by many research institutions and companies, including Baidu, Huawei, Yahoo, and Facebook, and the Hadoop clusters deployed by these enterprises mostly have thousands of nodes. As cluster scale keeps growing, various problems follow, and node maintenance is one of them. When a cluster has a performance problem, it is very difficult to locate the faulty node promptly and determine the cause. Moreover, some classes of problems do not crash a node outright but only slow it down, significantly reducing efficiency.

Summary of the invention

The technical problem to be solved by the present invention is to provide a real-time detection and diagnosis method for abnormal nodes in Hadoop clusters, which detects the abnormal state of nodes in real time while Hadoop runs job tasks.

The technical solution of the present invention is: a real-time detection method for abnormal nodes in Hadoop clusters, which first collects the logs that Hadoop outputs in real time, then analyzes and classifies the logs and gathers statistics from them, converts those statistics and computes a z-score, and determines whether a node is abnormal by checking whether the score exceeds a threshold.

The method is as follows：

Step 1: Collect in real time the status logs output by Hadoop tasks and extract the relevant information, including: the IDs of the working nodes and the number of map tasks and reduce tasks currently running on each node. Also count how many map tasks and reduce tasks each node has finished running, the run time of each task, and how long each unfinished task has been running;
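The bookkeeping in Step 1 can be sketched as follows. This is a minimal illustration in Python, not the patent's implementation: the `TaskEvent` record, its field names, and the `collect` function are hypothetical, and real Hadoop log lines would first have to be parsed into such records.

```python
from dataclasses import dataclass, field


@dataclass
class TaskEvent:
    """Hypothetical parsed log record; not an actual Hadoop log format."""
    node: str         # node ID
    task_id: str
    task_type: str    # "map" or "reduce"
    kind: str         # "start" or "finish"
    timestamp: float  # seconds


@dataclass
class NodeStats:
    running: dict = field(default_factory=dict)           # task_id -> (task_type, start time)
    map_durations: list = field(default_factory=list)     # finished map tasks, in completion order
    reduce_durations: list = field(default_factory=list)  # finished reduce tasks

    def running_elapsed(self, task_type: str, now: float) -> list:
        """How long each unfinished task of the given type has been running."""
        return [now - start for kind, start in self.running.values() if kind == task_type]


def collect(events) -> dict:
    """Aggregate task events into per-node statistics, keyed by node ID."""
    stats = {}
    for ev in sorted(events, key=lambda e: e.timestamp):
        node = stats.setdefault(ev.node, NodeStats())
        if ev.kind == "start":
            node.running[ev.task_id] = (ev.task_type, ev.timestamp)
        elif ev.kind == "finish" and ev.task_id in node.running:
            task_type, start = node.running.pop(ev.task_id)
            target = node.map_durations if task_type == "map" else node.reduce_durations
            target.append(ev.timestamp - start)
    return stats
```

Keeping finished map durations in completion order makes the duration of the most recently completed map task available as the last list element, which is the quantity Step 2 divides by.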

Step 2: Calculate the logical completion count of each node:

The logical conversion value is defined as the number of map tasks that the node's reduce run time is equivalent to in the node's current state. Specifically, compute the total run time of the reduce tasks on the individual node, including both finished and currently running tasks; dividing this duration by the duration of the map task most recently completed on that node gives the logical conversion value. The logical completion count is the number of map tasks the node has completed so far plus the logical conversion value;
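A minimal Python sketch of the Step 2 calculation follows. The function and parameter names are hypothetical. Truncating the quotient to an integer is an assumption, chosen because the worked example in Embodiment 2 takes 1550/88 as 17.

```python
def logical_completion_count(maps_completed: int,
                             finished_reduce_durations: list,
                             running_reduce_elapsed: list,
                             last_map_duration: float) -> int:
    """Completed map tasks plus reduce run time expressed in map-task units."""
    if last_map_duration <= 0:
        raise ValueError("the node must have completed at least one map task")
    # Total reduce run time covers finished tasks and tasks still running.
    total_reduce_time = sum(finished_reduce_durations) + sum(running_reduce_elapsed)
    # Assumption: truncate toward zero, matching 1550/88 -> 17 in Embodiment 2.
    logical_conversion = int(total_reduce_time // last_map_duration)
    return maps_completed + logical_conversion


# Node 1 of Embodiment 2: 41 completed maps, 1550 s of reduce time, last map took 88 s.
print(logical_completion_count(41, [1550.0], [], 88.0))  # 58
```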

Step 3: Calculate the threshold:

Considering that the method may be deployed on small clusters (30 nodes or fewer), the t distribution is used to determine the threshold; once the confidence level and the degree of freedom are given, the corresponding threshold is determined. The confidence level can be set according to actual conditions: the smaller the value, the higher the precision, but the probability of missed detections also increases. The recommended value is 0.01. The degree of freedom is the number of nodes currently running tasks minus one; for example, if four nodes are running tasks, the degree of freedom is 4-1=3.
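The threshold lookup in Step 3 can be sketched in pure Python as below. This assumes the threshold is the one-sided upper critical value of Student's t distribution at the chosen confidence parameter, which reproduces the 3.74 used in Embodiment 2 for 4 degrees of freedom. The function names are hypothetical, and in practice a statistics library (for example `scipy.stats.t.ppf`) would normally supply this quantile.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_coeff = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_coeff) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x: float, df: int, steps: int = 4000) -> float:
    """CDF for x >= 0: 0.5 plus Simpson's-rule integral of the density over [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_threshold(alpha: float, working_nodes: int) -> float:
    """Upper critical value t with P(T > t) = alpha, df = working_nodes - 1."""
    if working_nodes < 2:
        raise ValueError("at least two working nodes are needed")
    df = working_nodes - 1
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < 1 - alpha:  # grow the bracket until it contains the quantile
        hi *= 2
    for _ in range(60):               # bisection on the monotone CDF
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(round(t_threshold(0.01, 5), 3))  # 3.747 for 5 working nodes (df = 4)
```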

Step 4: Calculate the z-score of each node:

The deviation of node performance is measured by the standard score (z_score_i) under the t distribution. The larger the value, the greater the deviation; when it exceeds the threshold, the node is determined to be an outlier. Consistent with the worked calculation in Embodiment 2, the z-score under the t distribution is calculated as:

$$z\_score_i = \frac{x - \mu}{\sigma / \sqrt{Freedom}}$$

In the formula, x is the logical completion count of the node, μ is the mean of the logical completion counts of all nodes, σ is the corresponding standard deviation, and Freedom is the degree of freedom;

Step 5: Judge whether the z-score is less than the threshold. If it is, the node is currently normal; if not, the node is an abnormal node.
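Steps 4 and 5 can be sketched together as follows. The function names are hypothetical, and two details are assumptions drawn from the worked example in Embodiment 2: the standard deviation divides by the number of nodes (population form), and the magnitude of the deviation is compared with the threshold.

```python
import math
import statistics


def z_scores(logic_counts: dict) -> dict:
    """Per-node score |x - mu| / (sigma / sqrt(Freedom)), Freedom = nodes - 1."""
    values = list(logic_counts.values())
    if len(values) < 2:
        raise ValueError("at least two working nodes are needed")
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return {node: 0.0 for node in logic_counts}  # all nodes identical
    scale = sigma / math.sqrt(len(values) - 1)
    # Assumption: the magnitude of the deviation is what gets compared.
    return {node: abs(x - mu) / scale for node, x in logic_counts.items()}


def abnormal_nodes(logic_counts: dict, threshold: float) -> list:
    """Nodes whose score is not below the threshold."""
    return [node for node, z in z_scores(logic_counts).items() if z >= threshold]


# Logical completion counts from Embodiment 2, with the threshold 3.74 given there.
counts = {"node1": 58, "node2": 71, "node3": 63, "node4": 59, "node5": 32}
print(abnormal_nodes(counts, 3.74))  # ['node5']
```

Without the intermediate rounding used in the embodiment, node 5 scores about 3.75 rather than 3.84, which is still above 3.74 but much closer to it.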

The working principle of the present invention is: the log information that Hadoop outputs in real time is extracted to obtain the running state of each node in the Hadoop cluster. In a homogeneous environment, the state of each node follows a Gaussian distribution. The state of each node is analyzed in real time to judge whether the node is normal.

The beneficial effects of the invention are as follows：

(1) By collecting and classifying real-time output logs, the invention improves the precision of the subsequent per-node analysis.

(2) Based on statistical principles and the tight coupling between map and reduce tasks in Hadoop, the invention considers both kinds of task and converts between them, making the description of node state more reliable.

(3) By determining the abnormal state of nodes, the invention aids cluster maintenance and saves a large amount of job run time.

Description of the drawings

Fig. 1 is a flowchart of the collection phase of the present invention;

Fig. 2 is a flowchart of the analysis phase of the present invention.

Detailed description of the embodiments

The invention is further described below with reference to the accompanying drawings and specific embodiments.

Embodiment 1: As shown in Figs. 1-2, a real-time detection method for abnormal nodes in Hadoop clusters first collects the logs that Hadoop outputs in real time, then analyzes and classifies the logs and gathers statistics from them, converts those statistics and computes a z-score, and determines whether a node is abnormal by checking whether the score exceeds a threshold.

The method is as follows：

Step 2: Calculate the logical completion count of each node:

Step 3: Calculate the threshold:

The t distribution is used to determine the threshold; once the confidence level and the degree of freedom are given, the corresponding threshold is determined. The confidence level can be set according to actual conditions: the smaller the value, the higher the precision, but the probability of missed detections also increases. The degree of freedom is the number of nodes currently running tasks minus one;

Step 4: Calculate the z-score of each node:

The deviation of node performance is measured by the standard score under the t distribution. The larger the value, the greater the deviation; when it exceeds the threshold, the node is determined to be an outlier. Consistent with the worked calculation in Embodiment 2, the z-score under the t distribution is calculated as:

$$z\_score_i = \frac{x - \mu}{\sigma / \sqrt{Freedom}}$$

The recommended confidence level is 0.01.

Embodiment 2: In this embodiment, 6 server nodes are used: one master node and 5 slave nodes. Each node is configured as follows:

CPU model: Intel Xeon E5645; logical CPUs: 24; memory: 32 GB; hard disk: 1 TB; operating system: CentOS 6.8.

The selected job is TeraSort: sorting 12 GB of data generated by TeraGen.

This embodiment uses the t-distribution value at a confidence parameter of 0.01 as the threshold; the calculated threshold is 3.74.

This embodiment uses Spark Streaming as the real-time analysis tool, with a batch interval of 5 s.
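The effect of a 5 s batch interval can be illustrated without Spark. The sketch below is plain Python, not Spark Streaming code; the function name and the `(timestamp, payload)` event shape are hypothetical. It groups timestamped log records into consecutive 5-second batches, each of which would then be analyzed as one unit.

```python
from collections import defaultdict


def batch_by_interval(events, interval_s: float = 5.0) -> list:
    """Group (timestamp, payload) pairs into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        # Records in the same interval_s-wide window share a batch index.
        batches[int(timestamp // interval_s)].append(payload)
    # Return non-empty batches in time order.
    return [batches[index] for index in sorted(batches)]


print(batch_by_interval([(0.5, "a"), (4.9, "b"), (5.0, "c"), (12.0, "d")]))
# [['a', 'b'], ['c'], ['d']]
```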

Because map execution time varies from job to job, map task completeness is used as the time measure.

A CPU hog fault is injected into node 5 when the map task completeness of the running TeraSort job reaches 40%. When the map task completeness reaches 66%, the data of each node are as follows:

Taking the derivation of node 1's data as an example:

Step 1：

Collect the Hadoop run logs in real time. The logs contain the following main information:

1. Which node each submitted job is assigned to for execution.

2. Which node a task starts running on, and which node a task finishes running on.

3. The task type, that is, whether the task is a map task or a reduce task.

Step 2：

Calculate the logical completion count of each node.

The number of completed map tasks is a counted statistic; for node 1 it is 41.

The average map task execution time (s) is the mean of the execution times of all map tasks on the node; for node 1 it is 90.

The total reduce task execution time for node 1 is 1550 s.

The duration of the most recently completed map task is a measured statistic; for node 1 it is 88 s.

Logical conversion value = total reduce task execution time / duration of the most recently completed map task; for node 1 this is 1550/88 ≈ 17.6, taken as 17.

Logical completion count = logical conversion value + number of completed map tasks; for node 1 this is 17+41=58.

Step 3：

Calculate threshold value：

At this moment the degree of freedom is 4 and the confidence level is 0.01, so the threshold under the t distribution is 3.74.

Step 4：

Calculate the z-score of each node

Mean of the logical completion counts of all nodes = Σ node_i_logic_num / node_num; the value is (58+71+63+59+32)/5 = 56.6, which rounds up to 57.

The variance of the logical completion counts is [(58-57)²+(71-57)²+(63-57)²+(59-57)²+(32-57)²]/5 = 172.4, and the standard deviation, its square root, is approximately 13.

The degree of freedom is 5-1=4.

The z-score of node 1 is (58-57)/(13/√4) = (58-57)/(13/2) = 0.15.

Step 5：

Judge whether the node is abnormal.

The z-score of node 1 is 0.15 at this point, which is less than the threshold 3.74, so the node is in a normal state.

The z-score of node 5, calculated in the same way, has a magnitude of 3.84, which exceeds the threshold of 3.74 given above, so node 5 is determined to be abnormal. The abnormal node is output so that the cause can be analyzed further. Table 1 gives the related data of each node at this moment.
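The arithmetic of this worked example can be checked with a short Python script. It is only a verification sketch: it mirrors the intermediate rounding used in the text (mean 56.6 to 57, standard deviation 13.13 to 13) and assumes the magnitude of the deviation is what gets compared with the threshold.

```python
import math

# Logical completion counts of nodes 1 to 5, taken from the mean calculation above.
counts = {1: 58, 2: 71, 3: 63, 4: 59, 5: 32}
threshold = 3.74  # t-distribution threshold stated in the embodiment

mean = round(sum(counts.values()) / len(counts))                        # 56.6 -> 57
variance = sum((x - mean) ** 2 for x in counts.values()) / len(counts)  # 172.4
sigma = round(math.sqrt(variance))                                      # 13.13 -> 13
freedom = len(counts) - 1                                               # 4

# Assumption: compare the magnitude of the deviation with the threshold.
scores = {node: abs(x - mean) / (sigma / math.sqrt(freedom)) for node, x in counts.items()}
abnormal = [node for node, z in scores.items() if z > threshold]

for node, z in scores.items():
    print(f"node {node}: z = {z:.3f}")
print("abnormal nodes:", abnormal)
```

Node 1 evaluates to about 0.154 and node 5 to about 3.846, matching the 0.15 and 3.84 reported in the text when truncated to two decimals; only node 5 exceeds 3.74.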

Table 1

The specific embodiments of the present invention have been explained in detail above with reference to the accompanying drawings, but the present invention is not limited to the above embodiments; within the knowledge of a person of ordinary skill in the art, various changes can also be made without departing from the concept of the present invention.

Claims

1. A real-time detection method for abnormal nodes in Hadoop clusters, characterized in that: the logs that Hadoop outputs in real time are first collected; the logs are then analyzed and classified and statistics are gathered from them; the statistics are converted and a z-score is computed; and whether a node is abnormal is determined by checking whether the score exceeds a threshold.

2. The real-time detection method for abnormal nodes in Hadoop clusters according to claim 1, characterized in that the method is as follows:

Step 1: Collect in real time the status logs output by Hadoop tasks and extract the relevant information, including: the IDs of the working nodes and the number of map tasks and reduce tasks currently running on each node. Also count how many map tasks and reduce tasks each node has finished running, the run time of each task, and how long each unfinished task has been running;

Step 2: Calculate the logical completion count of each node:

The logical conversion value is defined as the number of map tasks that the node's reduce run time is equivalent to in the node's current state. Specifically, compute the total run time of the reduce tasks on the individual node, including both finished and currently running tasks; dividing this duration by the duration of the map task most recently completed on that node gives the logical conversion value. The logical completion count is the number of map tasks the node has completed so far plus the logical conversion value;

Step 3: Calculate the threshold:

The t distribution is used to determine the threshold; once the confidence level and the degree of freedom are given, the corresponding threshold is determined. The confidence level can be set according to actual conditions: the smaller the value, the higher the precision, but the probability of missed detections also increases. The degree of freedom is the number of nodes currently running tasks minus one;

Step 4: Calculate the z-score of each node:

The deviation of node performance is measured by the standard score under the t distribution. The larger the value, the greater the deviation; when it exceeds the threshold, the node is determined to be an outlier. The z-score under the t distribution is calculated as:

$$z\_score_i = \frac{x - \mu}{\sigma / \sqrt{Freedom}}$$

In the formula, x is the logical completion count of the node, μ is the mean of the logical completion counts of all nodes, σ is the corresponding standard deviation, and Freedom is the degree of freedom;

Step 5: Judge whether the z-score is less than the threshold. If it is, the node is currently normal; if not, the node is an abnormal node.

3. The real-time detection method for abnormal nodes in Hadoop clusters according to claim 2, characterized in that: the recommended confidence level is 0.01.