CN108280008A - A real-time detection method for abnormal nodes in a Hadoop cluster - Google Patents

A real-time detection method for abnormal nodes in a Hadoop cluster

Info

Publication number
CN108280008A
Authority
CN
China
Prior art keywords
node
value
time
real
tasks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711049620.4A
Other languages
Chinese (zh)
Inventor
田帅
汪海涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-10-31
Filing date
2017-10-31
Publication date
2018-07-13
Application filed by Kunming University of Science and Technology
Priority to CN201711049620.4A
Publication of CN108280008A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention relates to a real-time detection method for abnormal nodes in a Hadoop cluster and belongs to the technical field of Hadoop cluster anomaly detection. The method collects the logs that Hadoop outputs in real time, analyzes and classifies them, gathers statistics from them, converts those statistics into a z-score for each node, and decides whether a node is abnormal by checking whether its score exceeds a threshold. The invention takes into account the strong coupling between map tasks and reduce tasks in Hadoop jobs and considers both kinds of task together by converting one into the other, which improves accuracy. It uses the map task completion percentage as the measure of time, which gives a more flexible gauge of the method's real-time behavior.

Description

A real-time detection method for abnormal nodes in a Hadoop cluster
Technical field
The invention relates to a real-time detection method for abnormal nodes in a Hadoop cluster and belongs to the technical field of Hadoop cluster anomaly detection.
Background
Advances in science and technology inevitably bring considerable change to society, and the era of big data has emerged along with them. In this setting, frameworks for computing over and storing massive data have appeared one after another. Hadoop is a parallel distributed framework developed by the Apache Software Foundation from the MapReduce ideas published by Google; it splits big data evenly into very small parts and assigns them to individual nodes in a cluster to run. As one implementation of the MapReduce framework, Hadoop is used by many research institutions and companies, including Baidu, Huawei, Yahoo and Facebook, and the Hadoop clusters these enterprises deploy mostly have thousands of nodes. As clusters keep growing, various problems follow, and node maintenance is one of them. When a cluster has a performance problem, promptly locating the faulty node and determining the cause is extremely difficult, and some kinds of problem do not crash a node outright but only slow it down, which greatly reduces efficiency.
Summary of the invention
The technical problem to be solved by the invention is to provide a real-time detection and diagnosis method for abnormal nodes in a Hadoop cluster, which detects the abnormal state of nodes in real time while Hadoop is running jobs.
The technical solution of the invention is a real-time detection method for abnormal nodes in a Hadoop cluster: first collect the logs that Hadoop outputs in real time, then analyze and classify the logs and gather statistics from them, convert those statistics into a z-score, and decide the abnormal state of a node by checking whether the score exceeds a threshold.
The specific steps of the method are as follows:
Step 1: Collect in real time the status logs output by Hadoop tasks and extract the relevant information, including the IDs of the working nodes and the number of map tasks and reduce tasks each node is currently running. Also count how many map tasks and reduce tasks each node has already run, the run time of each task, and how long each unfinished task has been running.
Step 2: Calculate the logical completion number of each node.
The logical conversion value is defined as the number of map tasks that the node's reduce run time is equivalent to in the node's current state. Specifically, compute the total run time of the reduce tasks on a single node, including tasks that have finished and tasks that are still running. Dividing this duration by the run time of the map task the node completed most recently gives the logical conversion value. The logical completion number is the number of map tasks the node has completed so far plus the logical conversion value.
Step 3: Calculate the threshold.
Because the cluster may be small (30 nodes or fewer), the threshold is determined from the t distribution: once the confidence level and the degrees of freedom are given, the corresponding threshold is determined. The confidence level can be set according to actual conditions; the smaller the value, the higher the precision, but the probability of missed detections also increases. The recommended value is 0.01. The degrees of freedom equal the number of nodes currently running tasks minus one; for example, if four nodes are running tasks, the degrees of freedom are 4 - 1 = 3.
Step 4: Calculate the z-score of each node.
The standard score under the t distribution (z_score_i) measures how far a node's performance deviates. The larger the value, the greater the deviation, and when it exceeds the threshold the node is judged an outlier. The z-score under the t distribution is calculated as:
$$\mathrm{z\_score}_i = \frac{|x - \mu|}{\sigma / \sqrt{\mathrm{Freedom}}}$$
where x is the node's logical completion number, μ is the mean of the logical completion numbers of all nodes, σ is the corresponding standard deviation, and Freedom is the degrees of freedom.
Step 5: Determine whether the z-score is less than the threshold. If it is, the node is currently normal; if it is not, the node is abnormal.
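The steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not the patent's implementation: `NodeStats`, `logic_completed`, `z_scores` and `abnormal_nodes` are hypothetical names, the standard deviation is the population form (dividing by the node count), the score uses the magnitude of the deviation so that a slow node below the mean can exceed the threshold, and the threshold is passed in rather than computed.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class NodeStats:
    maps_done: int        # map tasks the node has completed so far (Step 1)
    reduce_secs: float    # total reduce run time, finished plus still running
    last_map_secs: float  # run time of the most recently completed map task


def logic_completed(s: NodeStats) -> float:
    """Step 2: completed map tasks plus the logical conversion value."""
    return s.maps_done + s.reduce_secs / s.last_map_secs


def z_scores(values: list[float]) -> list[float]:
    """Step 4: |x - mean| / (sigma / sqrt(Freedom)) for every node."""
    n = len(values)
    mean = sum(values) / n
    sigma = sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sigma == 0:
        return [0.0] * n              # all nodes identical, nothing deviates
    scale = sigma / sqrt(n - 1)       # Freedom = working nodes minus one
    return [abs(v - mean) / scale for v in values]


def abnormal_nodes(stats: dict[str, NodeStats], threshold: float) -> list[str]:
    """Step 5: nodes whose score is not below the threshold."""
    names = list(stats)
    scores = z_scores([logic_completed(stats[name]) for name in names])
    return [name for name, z in zip(names, scores) if z >= threshold]


# Illustrative data (not from the source): four healthy nodes, one slow node.
stats = {f"node{i}": NodeStats(40, 1800.0, 90.0) for i in range(1, 5)}
stats["node5"] = NodeStats(20, 900.0, 90.0)
print(abnormal_nodes(stats, threshold=3.74))   # ['node5']
```

In this made-up data set the healthy nodes each have a logical completion number of 60 and the slow node has 30, which gives the slow node a score of 4.0 and the others 1.0.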
The working principle of the invention is as follows. The log information that Hadoop outputs in real time is extracted to obtain the running state of each node in the Hadoop cluster. In a homogeneous environment the states of the nodes follow a Gaussian distribution. The state of each node is analyzed in real time to judge whether the node is normal.
The beneficial effects of the invention are:
(1) By collecting and classifying the real-time log output, the invention improves the precision of the later analysis of each node.
(2) Based on statistical principles and the strong coupling between Hadoop map and reduce tasks, the invention considers both kinds of task together through conversion, which makes the description of node state more reliable.
(3) By determining which nodes are abnormal, the invention helps with cluster maintenance and saves a large amount of job run time.
Description of the drawings
Fig. 1 is a flow chart of the collection phase of the invention;
Fig. 2 is a flow chart of the analysis phase of the invention.
Detailed description
The invention is further described below with reference to the drawings and specific embodiments.
Embodiment 1: As shown in Figs. 1-2, a real-time detection method for abnormal nodes in a Hadoop cluster first collects the logs that Hadoop outputs in real time, then analyzes and classifies the logs and gathers statistics from them, converts those statistics into a z-score, and decides the abnormal state of a node by checking whether the score exceeds a threshold.
The specific steps of the method are as follows:
Step 1: Collect in real time the status logs output by Hadoop tasks and extract the relevant information, including the IDs of the working nodes and the number of map tasks and reduce tasks each node is currently running. Also count how many map tasks and reduce tasks each node has already run, the run time of each task, and how long each unfinished task has been running.
Step 2: Calculate the logical completion number of each node.
The logical conversion value is defined as the number of map tasks that the node's reduce run time is equivalent to in the node's current state. Specifically, compute the total run time of the reduce tasks on a single node, including tasks that have finished and tasks that are still running. Dividing this duration by the run time of the map task the node completed most recently gives the logical conversion value. The logical completion number is the number of map tasks the node has completed so far plus the logical conversion value.
Step 3: Calculate the threshold.
The threshold is determined from the t distribution: once the confidence level and the degrees of freedom are given, the corresponding threshold is determined. The confidence level can be set according to actual conditions; the smaller the value, the higher the precision, but the probability of missed detections also increases. The degrees of freedom equal the number of nodes currently running tasks minus one.
Step 4: Calculate the z-score of each node.
The standard score under the t distribution measures how far a node's performance deviates. The larger the value, the greater the deviation, and when it exceeds the threshold the node is judged an outlier. The z-score under the t distribution is calculated as:
$$\mathrm{z\_score}_i = \frac{|x - \mu|}{\sigma / \sqrt{\mathrm{Freedom}}}$$
where x is the node's logical completion number, μ is the mean of the logical completion numbers of all nodes, σ is the corresponding standard deviation, and Freedom is the degrees of freedom.
Step 5: Determine whether the z-score is less than the threshold. If it is, the node is currently normal; if it is not, the node is abnormal.
The recommended value of the confidence level is 0.01.
Embodiment 2: This embodiment uses 6 server nodes: one master node and 5 slave nodes. Each node is configured as follows:
CPU model: Intel Xeon E5645; logical CPUs: 24; memory: 32 GB; hard disk: 1 TB; operating system: CentOS 6.8.
The job is TeraSort, sorting 12 GB of data generated by TeraGen.
This embodiment takes the value of the t distribution at a confidence parameter of 0.01 as the threshold; the calculated threshold is 3.74.
This embodiment uses Spark Streaming as the real-time analysis tool, with a batch interval of 5 s.
Because map execution time varies from job to job, the map task completion percentage is used as the measure of time.
A CPU hog fault is injected into node 5 when the map task completion of the TeraSort job reaches 40%. When map task completion reaches 66%, the data for each node are as follows:
Taking the data obtained for node 1 as an example:
Step 1:
Collect the Hadoop run logs in real time. The logs contain the following main information:
1. Which nodes the submitted job was assigned to for execution.
2. On which node a task started running, and on which node a task finished running.
3. The task type, that is, whether the task is a map task or a reduce task.
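One way to turn such log information into the per-node statistics used in the next step is sketched below in Python. The event tuples are a hypothetical, already-parsed form of the log lines (the actual Hadoop log format is not reproduced here), and `aggregate` and its field names are assumptions for illustration only.

```python
from collections import defaultdict


def aggregate(events, now):
    """Fold (timestamp_s, node, task_id, task_type, kind) events into per-node
    counters, where kind is "start" or "end" and task_type is "map" or "reduce"."""
    started = {}  # task_id -> (start time, node, task type) for unfinished tasks
    stats = defaultdict(
        lambda: {"maps_done": 0, "last_map_secs": None, "reduce_secs": 0.0}
    )
    # Sort by timestamp only; the stable sort keeps input order for ties.
    for ts, node, task_id, task_type, kind in sorted(events, key=lambda e: e[0]):
        if kind == "start":
            started[task_id] = (ts, node, task_type)
        elif kind == "end":
            begun, _, _ = started.pop(task_id)
            if task_type == "map":
                stats[node]["maps_done"] += 1
                stats[node]["last_map_secs"] = ts - begun
            else:
                stats[node]["reduce_secs"] += ts - begun
    # Reduce tasks that are still running count for the time elapsed so far.
    for begun, node, task_type in started.values():
        if task_type == "reduce":
            stats[node]["reduce_secs"] += now - begun
    return dict(stats)


# Illustrative events (not taken from the source).
events = [
    (0, "node1", "m1", "map", "start"),
    (10, "node1", "r1", "reduce", "start"),
    (90, "node1", "m1", "map", "end"),
    (95, "node1", "m2", "map", "start"),
]
print(aggregate(events, now=100))
```

Keeping unfinished tasks in a separate table makes it straightforward to include the elapsed time of running reduce tasks, which the method counts toward the total reduce run time.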
Step 2:
Calculate the logical completion number of each node.
The number of completed map tasks is a counted value; for node 1 it is 41.
The average map task execution time (s) is the mean execution time of all map tasks on the node; for node 1 it is 90.
The total reduce task execution time for node 1 is 1550 s.
The time taken by the most recently completed map task is a counted value; for node 1 it is 88 s.
Logical conversion number = total reduce task execution time / time of the most recently completed map task; for node 1 it is 1550 / 88 ≈ 17.6, taken as 17.
Logical completion number = logical conversion number + number of completed map tasks; for node 1 it is 17 + 41 = 58.
Step 3:
Calculate the threshold.
At this moment the degrees of freedom are 4 and the confidence level is 0.01, so the threshold under the t distribution is 3.74.
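The quoted threshold corresponds to the upper one-tailed critical value of Student's t distribution at 0.01 with 4 degrees of freedom, which standard tables give as 3.747. As a hedged illustration, the Python sketch below recovers that value with only the standard library by integrating the t density with Simpson's rule and bisecting for the quantile. The function names are hypothetical, and the source does not say how its threshold was actually computed.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x: float, df: int, steps: int = 2000) -> float:
    """P(T <= x) for x > 0: Simpson's rule on [0, x] plus the symmetric half."""
    if x <= 0:
        raise ValueError("this sketch only handles x > 0")
    h = x / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_threshold(confidence: float, df: int) -> float:
    """Upper one-tailed critical value t with P(T > t) = confidence."""
    lo, hi = 1e-9, 1000.0  # search bracket; assumes the answer is below 1000
    for _ in range(100):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < 1 - confidence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


threshold = t_threshold(0.01, 5 - 1)  # five working nodes, so Freedom = 4
print(f"{threshold:.3f}")
```

For 4 degrees of freedom this gives about 3.747, which the embodiment quotes as 3.74. Where SciPy is available, `scipy.stats.t.ppf(1 - 0.01, 4)` returns the same quantile.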
Step 4:
Calculate the z-score of each node.
Mean logical completion number of all nodes = Σ node_i_logic_num / node_num; its value is (58 + 71 + 63 + 59 + 32) / 5 = 56.6, which is rounded to 57.
The variance is [(58 - 57)² + (71 - 57)² + (63 - 57)² + (59 - 57)² + (32 - 57)²] / 5 = 172.4, and taking the square root gives a standard deviation of about 13.
The degrees of freedom are 5 - 1 = 4.
The z-score of node 1 is (58 - 57) / (13 / √4) = 1 / 6.5 ≈ 0.15.
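The arithmetic of this step can be reproduced in a few lines of Python. The sketch follows the intermediate rounding used in this embodiment (mean to 57, standard deviation to 13) and takes the magnitude of each deviation, an assumption that is consistent with the positive score the embodiment reports for node 5. Variable names are illustrative.

```python
from math import sqrt

# Logical completion numbers of nodes 1 to 5, as given in the embodiment.
logic_nums = [58, 71, 63, 59, 32]

n = len(logic_nums)
mean = round(sum(logic_nums) / n)                        # 56.6 rounds to 57
variance = sum((x - mean) ** 2 for x in logic_nums) / n  # 172.4
sigma = round(sqrt(variance))                            # about 13.13, rounds to 13
freedom = n - 1                                          # 4
scale = sigma / sqrt(freedom)                            # 13 / 2 = 6.5

z = [abs(x - mean) / scale for x in logic_nums]
for node, score in enumerate(z, start=1):
    print(f"node {node}: z-score {score:.3f}")
```

With these inputs node 1 scores about 0.15 and node 5 scores 25 / 6.5, about 3.85, which the embodiment reports as 3.84.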
Step 5:
Determine whether the node is abnormal.
The z-score of node 1 is 0.15 at this moment, which is less than the threshold of 3.74, so the node is in a normal state.
The calculated z-score of node 5 is 3.84, which exceeds the threshold of 3.74 given above, so node 5 is judged abnormal and is output as an abnormal node for further analysis of the cause. Table 1 gives the relevant data for each node at this moment.
Table 1
The specific embodiments of the invention have been explained in detail above with reference to the drawings, but the invention is not limited to these embodiments. Within the knowledge of a person of ordinary skill in the art, various changes can also be made without departing from the concept of the invention.

Claims (3)

1. A real-time detection method for abnormal nodes in a Hadoop cluster, characterized in that: the logs that Hadoop outputs in real time are first collected, the logs are then analyzed and classified and statistics are gathered from them, the statistics are converted into a z-score, and whether the score exceeds a threshold determines the abnormal state of a node.
2. The real-time detection method for abnormal nodes in a Hadoop cluster according to claim 1, characterized in that the specific steps of the method are as follows:
Step 1: Collect in real time the status logs output by Hadoop tasks and extract the relevant information, including the IDs of the working nodes and the number of map tasks and reduce tasks each node is currently running. Also count how many map tasks and reduce tasks each node has already run, the run time of each task, and how long each unfinished task has been running.
Step 2: Calculate the logical completion number of each node.
The logical conversion value is defined as the number of map tasks that the node's reduce run time is equivalent to in the node's current state. Specifically, compute the total run time of the reduce tasks on a single node, including tasks that have finished and tasks that are still running. Dividing this duration by the run time of the map task the node completed most recently gives the logical conversion value. The logical completion number is the number of map tasks the node has completed so far plus the logical conversion value.
Step 3: Calculate the threshold.
The threshold is determined from the t distribution: once the confidence level and the degrees of freedom are given, the corresponding threshold is determined. The confidence level can be set according to actual conditions; the smaller the value, the higher the precision, but the probability of missed detections also increases. The degrees of freedom equal the number of nodes currently running tasks minus one.
Step 4: Calculate the z-score of each node.
The standard score under the t distribution measures how far a node's performance deviates. The larger the value, the greater the deviation, and when it exceeds the threshold the node is judged an outlier. The z-score under the t distribution is calculated as:
$$\mathrm{z\_score}_i = \frac{|x - \mu|}{\sigma / \sqrt{\mathrm{Freedom}}}$$
where x is the node's logical completion number, μ is the mean of the logical completion numbers of all nodes, σ is the corresponding standard deviation, and Freedom is the degrees of freedom.
Step 5: Determine whether the z-score is less than the threshold. If it is, the node is currently normal; if it is not, the node is abnormal.
3. The real-time detection method for abnormal nodes in a Hadoop cluster according to claim 2, characterized in that the recommended value of the confidence level is 0.01.
CN201711049620.4A 2017-10-31 2017-10-31 A real-time detection method for abnormal nodes in a Hadoop cluster Pending CN108280008A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711049620.4A CN108280008A (en) 2017-10-31 2017-10-31 A real-time detection method for abnormal nodes in a Hadoop cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711049620.4A CN108280008A (en) 2017-10-31 2017-10-31 A real-time detection method for abnormal nodes in a Hadoop cluster

Publications (1)

Publication Number Publication Date
CN108280008A true CN108280008A (en) 2018-07-13

Family

ID=62801296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711049620.4A Pending CN108280008A (en) 2017-10-31 2017-10-31 A real-time detection method for abnormal nodes in a Hadoop cluster

Country Status (1)

Country Link
CN (1) CN108280008A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110154341A1 (en) * 2009-12-20 2011-06-23 Yahoo! Inc. System and method for a task management library to execute map-reduce applications in a map-reduce framework
CN102664961A (en) * 2012-05-04 2012-09-12 北京邮电大学 Method for anomaly detection in MapReduce environment
CN104331520A (en) * 2014-11-28 2015-02-04 北京奇艺世纪科技有限公司 Performance optimization method and device of Hadoop cluster and node state recognition method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李锋刚 等: "基于和声算法异构Hadoop 集群资源分配优化", 《计算机工程与应用》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110460663A (en) * 2019-08-12 2019-11-15 深圳市网心科技有限公司 Data distributing method, device, server and storage medium between distributed node
CN110460663B (en) * 2019-08-12 2022-09-20 深圳市网心科技有限公司 Data distribution method and device among distributed nodes, server and storage medium
CN116796043A (en) * 2023-08-29 2023-09-22 山东通维信息工程有限公司 Intelligent park data visualization method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180713
