CN108280008A - A real-time monitoring method for abnormal nodes in a Hadoop cluster - Google Patents
A real-time monitoring method for abnormal nodes in a Hadoop cluster
- Publication number: CN108280008A
- Application number: CN201711049620.4A
- Authority: CN (China)
- Prior art keywords: node, value, time, real, tasks
- Prior art date
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3006—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
- G06F11/3065—Monitoring arrangements determined by the means or processing involved in reporting the monitored data
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The present invention relates to a real-time monitoring method for abnormal nodes in a Hadoop cluster and belongs to the technical field of Hadoop cluster anomaly detection. The invention collects the logs that Hadoop outputs in real time, analyzes and classifies them, compiles statistics from them, converts those statistics and computes a z-score, and determines whether a node is abnormal by checking whether the score exceeds a threshold. The invention takes full account of the strong coupling between map tasks and reduce tasks in Hadoop jobs and converts the two task types into a single measure, which improves accuracy. It uses map task completion as the time measure, which gauges the real-time performance of the method more flexibly.
Description
Technical field
The present invention relates to a real-time monitoring method for abnormal nodes in a Hadoop cluster and belongs to the technical field of Hadoop cluster anomaly detection.
Background
Advances in science and technology inevitably bring considerable change to society, and the big data era has emerged along with them. Against this background, frameworks for computing on and storing massive data have appeared one after another. Hadoop is a parallel distributed framework developed by Apache from the MapReduce idea published by Google. It divides big data evenly into very small parts and distributes them to individual nodes in a cluster for execution. As one implementation of the MapReduce framework, Hadoop is used by many research institutions and companies, including Baidu, Huawei, Yahoo and Facebook, and the Hadoop clusters these enterprises deploy mostly have thousands of nodes. As cluster size keeps growing, various problems follow, and node maintenance is one of them.
When a cluster has a performance problem, it is very difficult to locate the faulty node in time and determine the cause of the problem. Moreover, some kinds of problems do not crash a node outright but only slow it down, which greatly reduces efficiency.
Summary of the invention
The technical problem to be solved by the present invention is to provide a real-time detection and diagnosis method for abnormal nodes in a Hadoop cluster, which detects the abnormal state of nodes in real time while Hadoop is running jobs.
The technical solution of the invention is a real-time monitoring method for abnormal nodes in a Hadoop cluster: first collect the logs that Hadoop outputs in real time, then analyze and classify the logs and compile statistics from them, convert those statistics and compute a z-score, and determine the abnormal state of a node by judging whether the score exceeds a threshold.
The specific steps of the method are as follows:
Step 1: Collect in real time the status logs output by Hadoop tasks and extract the relevant information, including the identifiers of the working nodes and the number of map tasks and reduce tasks each node is currently running. Also count how many map tasks and reduce tasks each node has already run, the run time of each task, and how long each unfinished task has been running.
Step 2: Calculate the logical completion number of each node.
The logical conversion value is defined as the number of map tasks that the node's reduce run time is equivalent to in the node's current state. Specifically, compute the total run time of the reduce tasks on a single node, including both completed and currently running tasks. Dividing this duration by the run time of the map task that the node most recently completed gives the logical conversion value. The logical completion number is the number of map tasks the node has completed so far plus the logical conversion value.
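As a minimal sketch of Step 2, the conversion can be written as two small Python functions. The function and parameter names are hypothetical (they do not come from the patent), and the ratio is kept as a float here; whether it should be truncated or rounded is left open.

```python
def logical_conversion_value(reduce_total_seconds, last_map_seconds):
    """Express a node's total reduce run time as an equivalent number of map tasks.

    reduce_total_seconds: run time of completed plus still-running reduce tasks.
    last_map_seconds: run time of the map task the node most recently completed.
    """
    if last_map_seconds <= 0:
        raise ValueError("last_map_seconds must be positive")
    return reduce_total_seconds / last_map_seconds


def logical_completion_number(maps_completed, reduce_total_seconds, last_map_seconds):
    """Completed map tasks plus the logical conversion value."""
    return maps_completed + logical_conversion_value(reduce_total_seconds, last_map_seconds)


# Illustrative numbers only: 10 finished maps, 300 s of reduce time,
# and a most recent map task that took 60 s.
print(logical_completion_number(10, 300.0, 60.0))  # 15.0
```

Using the most recent map task as the divisor, rather than a long-run average, keeps the conversion tied to the node's current speed, which is what the definition above asks for.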
Step 3: Calculate the threshold.
Because the cluster may be small (30 nodes or fewer), the threshold is determined from the t distribution: once the confidence level and the degrees of freedom are given, the corresponding threshold is determined. The confidence level can be set according to actual conditions. The smaller the value, the higher the precision, but the probability of missed detections also increases; 0.01 is recommended. The degrees of freedom equal the number of nodes currently running tasks minus one. For example, if four nodes are running tasks, the degrees of freedom are 4-1=3.
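The threshold lookup in Step 3 could be sketched as follows using only the Python standard library. This is an illustration under assumptions, not the patent's implementation: the function names are hypothetical, the confidence value is treated as a one-tailed upper tail probability, and the critical value is found by numerically integrating the t density instead of reading a table.

```python
import math


def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(x, df, steps=2000):
    """P(T > x) for x >= 0: Simpson's rule on [0, x] plus symmetry about zero."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def t_threshold(alpha, working_nodes):
    """Critical value t with P(T > t) = alpha, where df = working_nodes - 1."""
    df = working_nodes - 1
    if df < 1:
        raise ValueError("need at least two working nodes")
    lo, hi = 0.0, 1.0
    while t_upper_tail(hi, df) > alpha:   # grow the bracket until it contains t
        hi *= 2
    for _ in range(60):                   # bisection on the tail probability
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(t_threshold(0.01, working_nodes=5))  # about 3.747 (4 degrees of freedom)
```

Under the one-tailed assumption, five working nodes and a value of 0.01 give roughly 3.747, and four working nodes give roughly 4.541. The threshold rises as nodes drop out, so it would need to be recomputed whenever the number of working nodes changes.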
Step 4: Calculate the z-score of each node.
The deviation of a node's performance is measured by the standard score (z_score_i) under the t distribution. The larger the value, the greater the deviation; when it exceeds the threshold, the node is judged to be an outlier. The z-score under the t distribution is calculated as:
$$z\_score_i = \frac{x - \mu}{\sigma / \sqrt{Freedom}}$$
where x is the logical completion number of the node, μ is the mean of the logical completion numbers of all nodes, σ is the corresponding standard deviation, and Freedom is the degrees of freedom.
Step 5: Judge whether the z-score is less than the threshold. If it is, the node is currently normal; if not, the node is an abnormal node.
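Steps 4 and 5 can be sketched together in Python. Several choices here are assumptions, not statements from the patent: the names `z_scores` and `abnormal_nodes` are hypothetical, the standard deviation is the population form (dividing by the number of nodes), σ is divided by the square root of the degrees of freedom, and the absolute deviation is compared with the threshold so that slow nodes, whose counts fall below the mean, are flagged. The counts and the threshold in the example are made up.

```python
import math
import statistics


def z_scores(logic_counts):
    """Map each node to |x - mean| / (sigma / sqrt(freedom))."""
    values = list(logic_counts.values())
    freedom = len(values) - 1              # working nodes minus one
    if freedom < 1:
        return {node: 0.0 for node in logic_counts}
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)      # population standard deviation (assumption)
    if sigma == 0:
        return {node: 0.0 for node in logic_counts}
    scale = sigma / math.sqrt(freedom)
    return {node: abs(x - mu) / scale for node, x in logic_counts.items()}


def abnormal_nodes(logic_counts, threshold):
    """Nodes whose score exceeds the threshold, in sorted order."""
    scores = z_scores(logic_counts)
    return sorted(node for node, z in scores.items() if z > threshold)


# Made-up logical completion numbers: n5 lags far behind the others.
counts = {"n1": 50, "n2": 52, "n3": 49, "n4": 51, "n5": 20}
print(abnormal_nodes(counts, threshold=3.747))  # ['n5']
```

Because every node is scored against the cluster-wide mean, this check relies on the nodes being comparable (a homogeneous cluster running the same job).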
The working principle of the invention is as follows: the log information that Hadoop outputs in real time is extracted to obtain the running state of each node in the Hadoop cluster. In a homogeneous environment, the states of the nodes follow a Gaussian distribution. The state of each node is analyzed in real time to judge whether the node is normal.
The beneficial effects of the invention are:
(1) By collecting and classifying the real-time output logs, the invention improves the precision of the subsequent analysis of each node.
(2) Based on statistical principles and the strong coupling between Hadoop map and reduce tasks, the invention considers both task types and converts between them, which makes the description of node state more reliable.
(3) By determining the abnormal state of nodes, the invention helps with cluster maintenance and saves a large amount of job run time.
Description of the drawings
Fig. 1 is a flow chart of the collection phase of the invention;
Fig. 2 is a flow chart of the analysis of the invention.
Detailed description
The invention is further described below with reference to the drawings and specific embodiments.
Embodiment 1: As shown in Figs. 1-2, a real-time monitoring method for abnormal nodes in a Hadoop cluster first collects the logs that Hadoop outputs in real time, then analyzes and classifies the logs and compiles statistics from them, converts those statistics and computes a z-score, and determines the abnormal state of a node by judging whether the score exceeds a threshold.
The specific steps of the method are as follows:
Step 1: Collect in real time the status logs output by Hadoop tasks and extract the relevant information, including the identifiers of the working nodes and the number of map tasks and reduce tasks each node is currently running. Also count how many map tasks and reduce tasks each node has already run, the run time of each task, and how long each unfinished task has been running.
Step 2: Calculate the logical completion number of each node.
The logical conversion value is defined as the number of map tasks that the node's reduce run time is equivalent to in the node's current state. Specifically, compute the total run time of the reduce tasks on a single node, including both completed and currently running tasks. Dividing this duration by the run time of the map task that the node most recently completed gives the logical conversion value. The logical completion number is the number of map tasks the node has completed so far plus the logical conversion value.
Step 3: Calculate the threshold.
The threshold is determined from the t distribution: once the confidence level and the degrees of freedom are given, the corresponding threshold is determined. The confidence level can be set according to actual conditions. The smaller the value, the higher the precision, but the probability of missed detections also increases. The degrees of freedom equal the number of nodes currently running tasks minus one.
Step 4: Calculate the z-score of each node.
The deviation of a node's performance is measured by the standard score under the t distribution. The larger the value, the greater the deviation; when it exceeds the threshold, the node is judged to be an outlier. The z-score under the t distribution is calculated as:
$$z\_score_i = \frac{x - \mu}{\sigma / \sqrt{Freedom}}$$
where x is the logical completion number of the node, μ is the mean of the logical completion numbers of all nodes, σ is the corresponding standard deviation, and Freedom is the degrees of freedom.
Step 5: Judge whether the z-score is less than the threshold. If it is, the node is currently normal; if not, the node is an abnormal node.
The recommended value of the confidence level is 0.01.
Embodiment 2: This embodiment uses 6 server nodes: one master node and 5 slave nodes. Each node is configured as follows:
CPU model: Intel Xeon E5645; logical CPUs: 24; memory: 32 GB; hard disk: 1 TB; operating system: CentOS 6.8.
The job selected is TeraSort, sorting 12 GB of data generated by TeraGen.
This embodiment takes the value of the t distribution at a confidence level parameter of 0.01 as the threshold; the calculated threshold is 3.74.
This embodiment uses Spark Streaming as the real-time analysis tool, with a batch interval of 5 s.
Because map execution time varies from job to job, map task completion is used as the time measure.
A CPU hog fault is injected into node 5 when the map task completion of the TeraSort job reaches 40%. When map task completion reaches 66%, the data of each node are as follows:
Taking the calculation for node 1 as an example:
Step 1:
Collect the Hadoop running logs in real time. The logs contain the following main information:
1. Which nodes the submitted job has been assigned to for execution.
2. On which node a given task starts running, and on which node a given task finishes running.
3. The task type, that is, whether the task is a map task or a reduce task.
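The three kinds of log information listed above are enough to maintain per-node counters. The sketch below assumes the log lines have already been parsed into simple event tuples; the tuple layout, the `NodeStats` class and the `aggregate` function are all hypothetical and are not taken from Hadoop or from the patent.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class NodeStats:
    maps_completed: int = 0
    reduces_completed: int = 0
    last_map_seconds: float = 0.0          # run time of the most recently finished map
    finished_reduce_seconds: float = 0.0   # run time of reduce tasks that have finished
    running: dict = field(default_factory=dict)  # task_id -> (task_type, start_time)

    def reduce_total_seconds(self, now):
        """Run time of finished reduce tasks plus reduce tasks still running at `now`."""
        live = sum(now - start for kind, start in self.running.values() if kind == "reduce")
        return self.finished_reduce_seconds + live


def aggregate(events):
    """events: (time_s, node, task_id, task_type, action) tuples in time order,
    where task_type is "map" or "reduce" and action is "start" or "finish"."""
    stats = defaultdict(NodeStats)
    for time_s, node, task_id, task_type, action in events:
        node_stats = stats[node]
        if action == "start":
            node_stats.running[task_id] = (task_type, time_s)
        elif action == "finish" and task_id in node_stats.running:
            _, start = node_stats.running.pop(task_id)
            duration = time_s - start
            if task_type == "map":
                node_stats.maps_completed += 1
                node_stats.last_map_seconds = duration
            else:
                node_stats.reduces_completed += 1
                node_stats.finished_reduce_seconds += duration
    return stats


events = [
    (0, "node1", "m1", "map", "start"),
    (0, "node1", "r1", "reduce", "start"),
    (90, "node1", "m1", "map", "finish"),
    (90, "node1", "m2", "map", "start"),
    (178, "node1", "m2", "map", "finish"),
]
node1 = aggregate(events)["node1"]
print(node1.maps_completed, node1.last_map_seconds, node1.reduce_total_seconds(now=200))
```

In a streaming setting the same update would run once per batch, with `now` taken as the end of the batch so that unfinished reduce tasks contribute their elapsed time.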
Step 2:
Calculate the logical completion number of each node.
Number of completed map tasks: a counted statistic; for node 1 it is 41.
Average map task execution time (s): the mean execution time of all map tasks on the node; for node 1 it is 90.
Total reduce task execution time: the sum of the run times of the node's completed and currently running reduce tasks; for node 1 it is 1550 s.
Time taken by the most recently completed map task: a measured statistic; for node 1 it is 88 s.
Logical conversion number = total reduce task execution time / time taken by the most recently completed map task; for node 1 it is 1550/88 ≈ 17.6, taken as 17.
Logical completion number = logical conversion number + number of completed map tasks; for node 1 it is 17+41=58.
Step 3:
Calculate the threshold.
At this moment the degrees of freedom are 4 and the confidence level is 0.01, so the threshold under the t distribution is 3.74.
Step 4:
Calculate the z-score of each node.
Mean of the logical completion numbers of all nodes = Σ node_i_logic_num / node_num; the value is (58+71+63+59+32)/5 = 56.6, rounded up to 57.
Standard deviation of the logical completion numbers: the mean squared deviation is [(58-57)²+(71-57)²+(63-57)²+(59-57)²+(32-57)²]/5 = 172.4, and its square root is approximately 13.
The degrees of freedom are 5-1=4.
The z-score of node 1 is (58-57)/(13/2) ≈ 0.15.
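The arithmetic of this step can be checked with a few lines of Python. The sketch deliberately reproduces the rounding used in the text (mean rounded to 57, standard deviation rounded to 13). Two points are assumptions: the divisor 2 is read as the square root of the 4 degrees of freedom, and the values 71, 63 and 59 are assigned to nodes 2 to 4 in listing order. Only node 1 (58) and node 5 (32, the value that yields the reported score) are fixed by the text.

```python
import math

# Logical completion numbers from the worked example.
# The mapping of 71, 63, 59 to nodes 2-4 is assumed from the listing order.
logic = {1: 58, 2: 71, 3: 63, 4: 59, 5: 32}

mean = round(sum(logic.values()) / len(logic))                        # 56.6 -> 57
variance = sum((x - mean) ** 2 for x in logic.values()) / len(logic)  # 862 / 5 = 172.4
sigma = round(math.sqrt(variance))                                    # 13.13... -> 13
freedom = len(logic) - 1                                              # 4
scale = sigma / math.sqrt(freedom)                                    # 13 / 2 = 6.5

z = {node: (x - mean) / scale for node, x in logic.items()}

for node, score in z.items():
    print(node, round(score, 2))
```

Node 1 comes out at about 0.15. Node 5 comes out at about -3.85: its magnitude matches the 3.84 quoted for node 5, and the negative sign reflects that the node has completed less work than average.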
Step 5:
Judge whether the node is abnormal.
The z-score of node 1 is 0.15 at this point, which is below the threshold of 3.74, so the node is in a normal state.
The calculated z-score of node 5 is 3.84, which exceeds the threshold of 3.74 given above, so node 5 is judged to be abnormal and is output as an abnormal node for further analysis of the cause. Table 1 gives the related data of each node at this moment.
Table 1
The specific embodiments of the invention have been explained in detail above with reference to the drawings, but the invention is not limited to these embodiments. Within the knowledge of a person of ordinary skill in the art, various changes can be made without departing from the spirit of the invention.
Claims (3)
1. A real-time monitoring method for abnormal nodes in a Hadoop cluster, characterized in that: the logs that Hadoop outputs in real time are first collected; the logs are then analyzed and classified and statistics are compiled from them; the statistics are converted and a z-score is computed; and the abnormal state of a node is determined by judging whether the score exceeds a threshold.
2. The real-time monitoring method for abnormal nodes in a Hadoop cluster according to claim 1, characterized in that the specific steps of the method are as follows:
Step 1: Collect in real time the status logs output by Hadoop tasks and extract the relevant information, including the identifiers of the working nodes and the number of map tasks and reduce tasks each node is currently running; and count how many map tasks and reduce tasks each node has already run, the run time of each task, and how long each unfinished task has been running;
Step 2: Calculate the logical completion number of each node:
The logical conversion value is defined as the number of map tasks that the node's reduce run time is equivalent to in the node's current state; specifically, the total run time of the reduce tasks on a single node is computed, including both completed and currently running tasks; the value obtained by dividing this duration by the run time of the map task that the node most recently completed is the logical conversion value, and the logical completion number is the number of map tasks the node has completed so far plus the logical conversion value;
Step 3: Calculate the threshold:
The threshold is determined from the t distribution; once the confidence level and the degrees of freedom are given, the corresponding threshold is determined; the confidence level can be set according to actual conditions, and the smaller the value, the higher the precision, but the probability of missed detections also increases; the degrees of freedom equal the number of nodes currently running tasks minus one;
Step 4: Calculate the z-score of each node:
The deviation of a node's performance is measured by the standard score under the t distribution; the larger the value, the greater the deviation, and when it exceeds the threshold the node is judged to be an outlier, wherein the z-score under the t distribution is calculated as:
$$z\_score_i = \frac{x - \mu}{\sigma / \sqrt{Freedom}}$$
where x is the logical completion number of the node, μ is the mean of the logical completion numbers of all nodes, σ is the corresponding standard deviation, and Freedom is the degrees of freedom;
Step 5: Judge whether the z-score is less than the threshold; if it is, the node is currently normal; if not, the node is an abnormal node.
3. The real-time monitoring method for abnormal nodes in a Hadoop cluster according to claim 2, characterized in that the recommended value of the confidence level is 0.01.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711049620.4A CN108280008A (en) | 2017-10-31 | 2017-10-31 | A real-time monitoring method for abnormal nodes in a Hadoop cluster |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108280008A true CN108280008A (en) | 2018-07-13 |
Family
ID=62801296
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201711049620.4A CN108280008A (en) | Pending | 2017-10-31 | 2017-10-31 | A real-time monitoring method for abnormal nodes in a Hadoop cluster |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108280008A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110154341A1 (en) * | 2009-12-20 | 2011-06-23 | Yahoo! Inc. | System and method for a task management library to execute map-reduce applications in a map-reduce framework |
CN102664961A (en) * | 2012-05-04 | 2012-09-12 | 北京邮电大学 | Method for anomaly detection in MapReduce environment |
CN104331520A (en) * | 2014-11-28 | 2015-02-04 | 北京奇艺世纪科技有限公司 | Performance optimization method and device of Hadoop cluster and node state recognition method and device |
Non-Patent Citations (1)
Title |
---|
李锋刚 等: "基于和声算法异构Hadoop 集群资源分配优化", 《计算机工程与应用》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110460663A (en) * | 2019-08-12 | 2019-11-15 | 深圳市网心科技有限公司 | Data distributing method, device, server and storage medium between distributed node |
CN110460663B (en) * | 2019-08-12 | 2022-09-20 | 深圳市网心科技有限公司 | Data distribution method and device among distributed nodes, server and storage medium |
CN116796043A (en) * | 2023-08-29 | 2023-09-22 | 山东通维信息工程有限公司 | Intelligent park data visualization method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Borghesi et al. | Anomaly detection using autoencoders in high performance computing systems | |
KR102522005B1 (en) | Apparatus for VNF Anomaly Detection based on Machine Learning for Virtual Network Management and a method thereof | |
US7502971B2 (en) | Determining a recurrent problem of a computer resource using signatures | |
US20070185990A1 (en) | Computer-readable recording medium with recorded performance analyzing program, performance analyzing method, and performance analyzing apparatus | |
CN107943668A (en) | Computer server cluster daily record monitoring method and monitor supervision platform | |
WO2021143268A1 (en) | Electric power information system health assessment method and system based on fuzzy inference theory | |
CN106600115A (en) | Intelligent operation and maintenance analysis method for enterprise information system | |
CN111459700A (en) | Method and apparatus for diagnosing device failure, diagnostic device, and storage medium | |
Ali-Eldin et al. | Workload classification for efficient auto-scaling of cloud resources | |
JP2017111601A (en) | Inspection object identification program and inspection object identification method | |
CN107301118A (en) | A kind of fault indices automatic marking method and system based on daily record | |
Di et al. | Exploring properties and correlations of fatal events in a large-scale hpc system | |
Yin et al. | Cloudscout: A non-intrusive approach to service dependency discovery | |
CN109857618B (en) | Monitoring method, device and system | |
Fu et al. | Performance issue diagnosis for online service systems | |
CN105574032A (en) | Rule matching operation method and device | |
CN108647137A (en) | A kind of transaction capabilities prediction technique, device, medium, equipment and system | |
CN113128076A (en) | Power dispatching automation system fault tracing method based on bidirectional weighted graph model | |
Bang et al. | HPC workload characterization using feature selection and clustering | |
CN108280008A (en) | One kind being directed to Hadoop cluster abnormal nodes method of real-time | |
CN109800130A (en) | A kind of apparatus monitoring method, device, equipment and medium | |
CN102761429B (en) | A kind of abnormal call bill processing method and system | |
CN109582555A (en) | Data exception detection method, device, detection system and storage medium | |
CN117034149A (en) | Fault processing strategy determining method and device, electronic equipment and storage medium | |
JP6458157B2 (en) | Data analysis apparatus and analysis method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180713 |