CN104156296B - System and method for intelligently monitoring the compute nodes of a large-scale data center cluster - Google Patents
System and method for intelligently monitoring the compute nodes of a large-scale data center cluster
- Publication number: CN104156296B
- Application number: CN201410377856.0A
- Authority: CN (China)
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
A system and method for intelligently monitoring the compute nodes of a large-scale data center cluster are proposed. A monitor node in the system collects the hardware micro-architecture metrics of a compute node and the metrics related to the processes of the applications running on it, and sends these metrics to a monitoring device in the system. The monitoring device performs big data analysis and sends the results to a user terminal device for display to the user. The system and method can collect both the micro-architecture metrics and the running-process metrics of the compute nodes, perform intelligent big data analysis, automatically locate a faulty compute node, and report the cause of the failure.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a system and method for intelligently monitoring the compute nodes of a large-scale data center cluster.
Background
With the continuing progress of human society and the development of science and technology, people's understanding of nature has become ever broader, and their demand for further exploration ever more urgent. As a result, the amount of information data that humankind holds has grown sharply, and all of this massive data has to be analyzed and processed in a timely manner. For example, a large astronomical radio telescope array can produce more than 100 GB of cosmic microwave data per second, all of which needs to be analyzed in time. As another example, in particle physics research the data from a single collision in the LHC is measured in terabytes. In addition, fields such as the Human Genome Project, oil exploration, and weather forecasting place ever higher demands on computing power. Against this background, numerical computation has already become a third, extremely important means of scientific exploration alongside experiment and theoretical analysis. This reality has driven every scientific and technological power in the world today to develop supercomputers as vigorously as it can. For instance, China's Tianhe-2 (TH-2), which ranked first in the world TOP500 list issued in December 2013, reached a peak speed of 54.9 PFlops and used more than 16,000 compute nodes in total.
In addition, with the development of new technologies such as cloud computing, big data, and the Internet of Things, more and more large data centers and cloud computing centers have appeared. They routinely contain tens of thousands of compute nodes. For example, Google's data center in The Dalles, Oregon, USA has about 150,000 server nodes. In data centers of this size, performance monitoring of compute nodes, fault location, fault recovery, and statistics on the overall efficiency of the center all pose unprecedented challenges. How to manage and use a large-scale or even ultra-large-scale data center efficiently is therefore an area that countries around the world are actively exploring.
For a long time, the monitoring and management of data centers has been carried out manually or semi-automatically. Operations and maintenance staff have to check the running status of the cluster in real time. When a problem occurs, they can sometimes locate the affected node, but they often cannot pinpoint the failed device, and diagnosis and troubleshooting then depend on staff experience, which is time-consuming and laborious. Cluster users can follow their own jobs through the many job scheduling tools available, but they can rarely obtain historical statistics and analysis of those jobs. Cluster decision-makers often cannot obtain decision-relevant information such as expenditure, usage efficiency, staff productivity, and cost-effectiveness directly from the cluster, and can only base decisions on laborious manual analysis of large volumes of data. Application developers, likewise, often cannot obtain from the cluster the information they urgently need to optimize application software, such as hardware micro-architecture data, system processes, stacks, and module error and crash statistics, and have to derive it empirically through large numbers of experiments, which is again time-consuming and laborious.
Summary of the invention
The present invention proposes a system and method for intelligently monitoring the compute nodes of a large-scale data center cluster, characterized by large scale, multiple functions, and support for multiple user groups. It provides comprehensive intelligent analysis and statistical functions and can supply reference data for the decisions of users at different levels.
The system comprises monitor nodes installed on the compute nodes of the data center cluster, a monitoring device that communicates with each monitor node, and a user terminal device, characterized in that:
The monitor node is configured to collect the hardware micro-architecture metrics of the compute node by obtaining control of the compute node's hardware control registers, to obtain metrics related to the processes of the applications running on the compute node by obtaining control of the operating system kernel, and to send these metrics to the monitoring device;
The monitoring device is configured to receive the metrics, perform big data analysis based on them, and send the results of the analysis to the user terminal device;
The user terminal device is configured to receive the results and display them to the user.
The method comprises:
Starting the monitor node on a compute node;
The monitor node collects the hardware micro-architecture metrics of the compute node by obtaining control of the compute node's hardware control registers, obtains metrics related to the processes of the applications running on the compute node by obtaining control of the operating system kernel, and sends these metrics to the monitoring device;
The monitoring device receives the metrics, performs big data analysis based on them, and sends the results of the analysis to the user terminal device;
The user terminal device receives the results and displays them to the user.
In particular, the analysis comprises: locating the faulty compute node according to the metrics, and determining the cause of the failure.
In particular, the hardware micro-architecture metrics comprise one or more of the following: the real-time floating-point operation speed of the CPU, Streaming SIMD Extensions (SSE) unit utilization, Advanced Vector Extensions (AVX) unit utilization, the vectorization ratio of vector instructions, the number of clock cycles per completed instruction (CPI), the last-level cache (LLC) hit rate, memory bandwidth, PCI Express (PCI-E) device bandwidth, and the cache hit/miss rate. The metrics related to the processes of the applications running on the compute node comprise one or more of the following: the number of process switches, stack information, and heap memory allocation.
In particular, where the metric is the real-time floating-point operation speed of the CPU and/or the number of clock cycles per completed instruction (CPI), the analysis comprises: when the metric remains below a preset threshold throughout a preset time period, determining that the processor has failed and that the cause of the failure is abnormal processor frequency reduction.
In particular, the monitor node also collects the CPU utilization, memory utilization, local disk I/O data, and/or Ethernet throughput provided by the operating system.
In particular, the hardware control registers of the compute node are the MSR control registers in the performance monitoring unit (PMU) of the compute node's processor.
The beneficial effects of the invention are as follows:
The performance monitoring component on each compute node extracts the necessary system-level performance metrics and sends them to the monitoring management node, which is responsible for maintaining them. The monitoring management node has anomaly detection and alerting capabilities; it also mines the recorded historical data separately for each user group and feeds the results back to the users. In addition, the monitoring management node can, on demand and for a given time period, extract hardware micro-architecture characteristics and process and stack information from specified monitor nodes. This makes the monitoring of large-scale clusters multi-user, multi-functional, and intelligent.
To keep monitoring timely, the monitoring client on each compute node refreshes once per second. At the same time, to reduce resource usage on the compute node, each compute node extracts only the minimum set of metrics needed for data analysis, a dozen or so in total, including CPU utilization, memory utilization, local disk reads and writes, and Ethernet throughput.
To provide multiple functions, the intelligent monitoring system also offers monitoring and analysis of metrics related to the hardware micro-architecture, such as floating-point operation speed, vectorization ratio, memory bandwidth, and IB bandwidth. Because monitoring these metrics consumes comparatively more system resources, they are started on demand according to user instructions.
To serve multiple user groups, the intelligent monitoring system provides hierarchical views at four levels: management, operations and maintenance (O&M), application users, and application developers.
To provide intelligence, the intelligent monitoring system introduces a data mining analysis method that computes, from the basic performance monitoring data, the statistical indicators that users at each level care about most.
Brief description of the drawings
Fig. 1 is a system block diagram of the intelligent monitoring of a large-scale data center cluster proposed by the present invention.
Fig. 2 is a flow chart of the method for intelligently monitoring a large-scale data center cluster proposed by the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.
Fig. 1 shows the system for intelligently monitoring the compute nodes of a large-scale data center cluster proposed by the present invention. It comprises monitor nodes installed on the compute nodes of the data center cluster, a monitoring device connected to each monitor node, and a user terminal device. Each compute node has its own hardware, such as a processor (CPU), memory, hard disks, and an Ethernet controller, and runs an operating system and application software. The monitoring device comprises a main monitor node and a database. The main monitor node communicates with each monitor node installed on the compute nodes and can obtain the hardware and software operating data of the compute nodes, such as CPU utilization, memory utilization, local disk I/O data, and Ethernet throughput, as well as the micro-architecture metrics of the compute node hardware and the process-level metrics of the running applications. The main monitor node writes the data it obtains into the database, automatically performs big data mining, and stores the mining results. The user reads the results from the database through the user terminal device, which displays them. The user can also submit a user-defined data mining program to the monitoring device through the user terminal device; the monitoring device then extracts the corresponding metrics of the cluster nodes, performs big data mining according to the user-defined program, and displays the results to the user.
Referring to Fig. 2, the method for intelligently monitoring the compute nodes of a large-scale data center cluster proposed by the present invention consists of several main steps: data acquisition, big data mining, classified display, and fault location and alarm. Data acquisition comprises basic data collection and advanced data collection. Basic data collection is performed automatically by the system and requires no user configuration; advanced data collection is configured according to the user's needs.
1. Data acquisition
Data acquisition means installing a monitor node on each compute node of the data center cluster and extracting that compute node's CPU utilization, memory utilization, local disk I/O data, and Ethernet throughput, as well as the micro-architecture metrics of the compute node hardware and the process-level metrics of the running applications. Collection of the hardware micro-architecture metrics and the process-level metrics is called advanced data collection; collection of the remaining metrics is called basic data collection. Basic data collection is a default step of the system and runs without user intervention; advanced data collection is configured and run according to user needs. To keep the performance metrics timely, the monitor node must be able to refresh its samples at one-second granularity while consuming very few compute node resources.
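As an illustration of basic data collection, the following minimal sketch derives CPU utilization from two successive samples of the aggregate `cpu` line of `/proc/stat`. It assumes a Linux compute node; the function names are hypothetical and are not taken from the patent, which does not specify how the operating-system metrics are read.

```python
# Sketch only: assumes Linux and the /proc/stat field order
# (user nice system idle iowait irq softirq steal guest guest_nice).

def read_cpu_line(path="/proc/stat"):
    """Return the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as f:
        return f.readline()

def parse_cpu_line(line):
    """Return (busy_jiffies, total_jiffies) for one 'cpu' line."""
    fields = [int(x) for x in line.split()[1:]]
    # guest and guest_nice are already counted in user and nice,
    # so only the first eight fields are summed.
    total = sum(fields[:8])
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)  # idle + iowait
    return total - idle, total

def cpu_utilization(prev_line, curr_line):
    """Fraction of time the CPU was busy between two samples (0.0 to 1.0)."""
    busy0, total0 = parse_cpu_line(prev_line)
    busy1, total1 = parse_cpu_line(curr_line)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return (busy1 - busy0) / elapsed
```

A monitor node could call `read_cpu_line()` once per second and pass consecutive samples to `cpu_utilization()`, which keeps the per-sample cost to one small file read.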
The data acquisition method proposed by the present invention differs from the methods of the prior art. In the prior art, data acquisition only gathers the metrics that the operating system itself provides; that is, metric collection depends on the operating system running on the compute node, and the monitor node cannot obtain any metric that the operating system does not provide. The data acquisition method proposed by the present invention can not only collect the metrics provided by the operating system, but can also collect hardware micro-architecture metrics, such as the real-time floating-point operation speed of the CPU, SSE (Streaming SIMD Extensions) unit utilization, AVX (Advanced Vector Extensions) unit utilization, the vectorization ratio of vector instructions, the number of clock cycles per completed instruction (CPI), the last-level cache (LLC) hit rate, translation lookaside buffer (TLB) parameters, memory bandwidth, PCI-E (PCI Express) device bandwidth, and the cache hit/miss rate. It can also collect process-level metrics of applications, such as the number of process switches, stack information, and heap memory allocation. These metrics are of great significance for tuning application performance, analyzing cluster characteristics, and locating software-level faults.
Because hardware-level and process-level metrics have to be collected, the monitor node proposed by the present invention is implemented as a software client. The monitor node collects basic data in the same way as the prior art, which is not repeated here. The advanced data collection process is described in detail as follows:
Extracting the hardware micro-architecture metrics requires control of the relevant hardware registers. For processor micro-architecture metrics, for example, this is achieved mainly by controlling the performance monitoring unit (PMU) in the processor. The monitor node in this invention therefore requires root privileges, the highest level. The control flow for the PMU is as follows:
S1: Obtain control of the MSR (Model Specific Register) control registers in the PMU of the compute node's processor;
S2: Write the code and mask of the event of interest into the MSR control register now under control, and set the control register to start counting the event. For example, to collect the LLC hit rate, first write the code and mask for LLC hits into the MSR control register, then set the register to start counting LLC hits; when counting ends, read the count from the register and compute the LLC hit rate.
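Step S2 can be sketched as follows, assuming an Intel x86 processor with architectural performance monitoring, where an event-select register holds the event code in bits 0-7, the unit mask in bits 8-15, the user-mode and kernel-mode flags in bits 16 and 17, and the enable flag in bit 22. The patent does not name a processor family, so this layout, the two architectural event encodings, and the function names are assumptions of the sketch. It only builds the value to be written and computes the hit rate from raw counts; the privileged register write itself is left out.

```python
# Sketch only: Intel architectural performance-monitoring encodings (assumed).
LLC_REFERENCE = (0x2E, 0x4F)  # (event code, unit mask) for LLC references
LLC_MISS = (0x2E, 0x41)       # (event code, unit mask) for LLC misses

def encode_perfevtsel(event, umask, usr=True, os=True, enable=True):
    """Build the value to write into an event-select control register."""
    value = (event & 0xFF) | ((umask & 0xFF) << 8)
    if usr:
        value |= 1 << 16  # count while running in user mode
    if os:
        value |= 1 << 17  # count while running in kernel mode
    if enable:
        value |= 1 << 22  # start counting
    return value

def llc_hit_rate(references, misses):
    """Hit rate derived from LLC reference and miss counts."""
    if references == 0:
        return 0.0
    return (references - misses) / references
```

Here the hit rate is derived from reference and miss counts because those are the two architectural LLC events; a processor-specific hit event, if available, could be counted directly as the text describes.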
Extracting system-kernel-level metrics requires monitoring the relevant code in the kernel. For example, monitoring process switches requires monitoring the code in the process management part of the kernel that controls the processes concerned. When the compute node starts, monitoring begins once the kernel has loaded successfully. The monitor node must therefore have kernel-level control. Because extracting kernel-level metrics may slightly affect system performance, it can be enabled on demand for specific monitoring scenarios.
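For comparison, the sketch below reads a process's switch counts from the text of `/proc/<pid>/status` on Linux, which exposes `voluntary_ctxt_switches` and `nonvoluntary_ctxt_switches`. This is a user-space stand-in chosen for illustration, not the in-kernel monitoring the text describes, and the function names are hypothetical.

```python
# Sketch only: parses the contents of /proc/<pid>/status (Linux).
CTXT_KEYS = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")

def parse_context_switches(status_text):
    """Return a dict with the context-switch counters found in the text."""
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in CTXT_KEYS:
            counts[key] = int(value.strip())
    return counts

def total_context_switches(status_text):
    """Sum of voluntary and involuntary context switches."""
    return sum(parse_context_switches(status_text).values())

def read_status(pid):
    """Read /proc/<pid>/status for a running process."""
    with open("/proc/%d/status" % pid) as f:
        return f.read()
```

Sampling `total_context_switches(read_status(pid))` at two points in time and subtracting gives the number of switches in that interval.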
2. Big data mining and classified display
The monitor nodes installed on the compute nodes can also send data to the monitoring device, which receives the data from and manages all monitor nodes centrally. The main monitor node in the monitoring device is responsible for receiving the metrics collected by each monitor node and for sending control commands to each monitor node. The control commands include the basic data collection commands generated by system default and the advanced data collection commands generated from user settings; each monitor node collects the corresponding metrics according to the control commands it receives. The main monitor node is also responsible for storing the received metrics in the database in a defined storage format, as the input for the subsequent data mining.
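The storage step can be sketched with a single table keyed by node, timestamp, and metric name. The patent does not specify the database or the storage format, so the use of SQLite, the table layout, and every name below are assumptions made for illustration.

```python
import sqlite3

# Sketch only: one row per (node, timestamp, metric name); schema is assumed.
def open_store(path=":memory:"):
    """Open the metric store and create the table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        "node TEXT NOT NULL, ts INTEGER NOT NULL, "
        "name TEXT NOT NULL, value REAL NOT NULL)"
    )
    return conn

def store_sample(conn, node, ts, metrics):
    """Store one sample (a dict of metric name to value) from a monitor node."""
    rows = [(node, ts, name, value) for name, value in metrics.items()]
    conn.executemany("INSERT INTO metrics VALUES (?, ?, ?, ?)", rows)
    conn.commit()

def average(conn, node, name):
    """Average of one metric on one node, or None if no samples exist."""
    row = conn.execute(
        "SELECT AVG(value) FROM metrics WHERE node = ? AND name = ?",
        (node, name),
    ).fetchone()
    return row[0]
```

A long, narrow table like this lets new metrics be added without schema changes, at the cost of more rows per sample.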
To provide intelligence, the monitoring device also has big data mining capability. It processes the metrics stored in the database according to preset statistics settings and, following a preset classified display scheme, provides statistics and analysis results separately to different users. The monitoring device also has a user interface through which it can receive a custom data mining algorithm and perform data mining according to that algorithm. The preset statistics settings include:
I. Metrics for the management user group
1. Productivity (task throughput)
A. Number of tasks and applications running in real time
B. Number of tasks completed (failed) per day over a week (month, year) [chart, table]
C. Average number of tasks completed (failed) per day over a week (month, year)
D. Total number of tasks completed (failed) over a week (month, year)
E. Time per task
2. O&M cost (energy consumption) (compute, storage, switching, machine room [cooling])
A. Total power consumption in real time
B. Daily energy consumption (kWh) over a week (month, year) [chart, table]
C. Average daily energy consumption (kWh) over a week (month, year)
D. Total energy consumption (kWh) over a week (month, year)
E. Monitoring of equipment depreciation and overall machine-room amortization costs, ratios between the individual cost items, and jobs completed per unit cost
3. Asset utilization efficiency
A. Daily cluster duty cycle over a week (month, year)
B. Average daily cluster duty cycle over a week (month, year)
C. Daily cluster peak hours over a week (month, year) (computed from the hourly cluster duty cycle)
D. Consistently busy hours over a week (month, year) (annual duty cycle over the 24-hour cycle)
E. Number of users online in real time (with special authorization, personal information can be viewed)
F. Daily number of online users over a week (month, year) [chart, table]
G. Average daily number of online users over a week (month, year)
H. Average number of tasks completed per user per day over a week (month, year)
I. Average number of tasks completed per user over a week (month, year)
4. Equipment health
A. Number of failed nodes and failure rate in real time
B. Daily number of failed nodes and failure rate over a week (month, year) [chart, table]
C. Average daily number of failed nodes and failure rate over a week (month, year)
II. Metrics for the cluster operations and maintenance staff user group
1. Fault alarm and location
A. Number of failed nodes and failure rate in real time
B. Daily record of failed nodes and failure rate over a week (month, year) [chart, table]
C. Average number of failures per node and per-node failure rate over a week (month, year) (to identify failure-prone nodes)
D. Real-time location of failed nodes
E. Real-time alarms for failed nodes
F. Classification of failure types for failed nodes: reachable, unreachable, powered off, etc.
G. For failures where the node is still reachable, precise location of the faulty device: position of the failed disk, dropped memory module (position), etc.
2. Equipment running status
A. Overall cluster CPU utilization and centralized storage I/O bandwidth in real time
B. Daily average overall cluster CPU utilization and average centralized storage I/O bandwidth over a week (month, year)
C. Average overall cluster CPU utilization and average centralized storage I/O bandwidth over a week (month, year)
D. Real-time view of the running status of every node: CPU, memory, local disk, network, and other metrics
E. Historical query of the daily running status of all nodes within one year
F. Resource bottleneck analysis (CPU, storage, memory, network [distinguishing storage traffic from data exchange])
3. Billing
A. Statistics of user machine-hours
III. Metrics for the task user group
1. Current task information
A. Number of nodes, number of cores, memory occupied, etc. by the current task
B. Status of the nodes used by the current task: CPU, memory, local disk, network, etc.
C. Number of tasks currently queued
D. Queuing time of the current task
2. Historical task statistics
A. The user's total historical run time
B. The user's average historical run time
C. Number of historical tasks the user has completed (failed)
D. Task success rate (successful tasks / failed tasks)
E. Number of nodes and cores used by the user's historical tasks
F. Average number of nodes and cores used by the user's historical tasks
G. Average queuing time of historical tasks
IV. Metrics for the application software developer user group
1. Program (module) usage statistics
A. Total number of modules processed (failed) per day over a week (month, year)
B. Module failure rate over a week (month, year)
C. Module usage popularity statistics and ranking over a week (month, year), and each module's share of total uses
D. Failed-module popularity statistics and ranking over a week (month, year), and each failed module's share of total failures
2. Performance tracing metrics
A. Load of the services used by all applications (database, file system, job scheduling, intermediate acceleration layer, parallel framework, etc.)
B. Micro-architecture-level information: cache hit/miss rates, TLB
C. Operating-system-level information: number of processes, process switches, stacks, heap memory allocation, etc.
3. Statistics on user habits
A. Data access latency, dwell time, I/O access patterns, etc. of interactive applications
Finally, the monitoring device presents the statistical analysis information mined above on the user terminal devices, separately for each designated user level.
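As one example of the preset statistics, the management-level task throughput items (tasks completed or failed per day, and the daily average) can be sketched as a simple aggregation. The record format, a day label plus a success flag, and the function names are assumptions of this sketch; the patent does not define how task records are represented.

```python
from collections import defaultdict

# Sketch only: each record is (day, succeeded), e.g. ("2014-08-01", True).
def daily_task_counts(records):
    """Count completed and failed tasks for each day."""
    counts = defaultdict(lambda: {"completed": 0, "failed": 0})
    for day, succeeded in records:
        counts[day]["completed" if succeeded else "failed"] += 1
    return dict(counts)

def average_daily_completed(counts):
    """Average number of completed tasks per day over the days present."""
    if not counts:
        return 0.0
    return sum(c["completed"] for c in counts.values()) / len(counts)
```

The same per-day grouping could feed the weekly, monthly, and yearly views by changing the range of days passed in.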
Data mining in the embodiments of the present invention is differentiated by user type. The mining items listed in the invention were compiled after thoroughly analyzing the actual needs and concerns of each type of user. Ordinary monitoring does not provide such indicators, and the data has to be exported and analyzed manually, whereas the implementation proposed by the present invention produces them intelligently and automatically. The implementation proposed by the present invention also includes a reserved interface for custom data mining, through which user-defined data mining programs can be executed.
3. Fault location and alarm
The data mining analysis described above yields the current operating performance metrics of the equipment in each compute node. From these metrics it is possible to determine whether the equipment has failed and why. The fault information can be shown to the relevant users through the intelligent display module of the user terminal device. In addition, a fault alarm module, such as an audio device or a warning light, can be installed on the user terminal device to raise an alarm when equipment fails, prompting maintenance staff to attend to the faulty equipment quickly and clear the fault promptly.
Faults and abnormal conditions in equipment or application software are reflected in the collected performance metrics. For simplicity and ease of use, the present invention locates faults by analyzing anomalies in the performance metrics; this applies in particular to certain performance-related faults that cannot be diagnosed by conventional methods. For example, poor cooling in the cluster may cause processors to run at reduced frequency, which conventional fault monitoring does not report. With the method proposed by the present invention, processor micro-architecture metrics are collected, so the floating-point operation speed of the processor and the number of clock cycles per completed instruction (CPI) can be monitored in real time. When a monitored node is under heavy load and both metrics remain below their preset thresholds for an extended period, the monitoring device determines that a fault has occurred and raises an intelligent alarm, and at the same time identifies the cause of the fault, namely abnormal frequency reduction of the processor.
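The rule described above can be sketched as a check over the most recent samples of each metric. The window length, the thresholds, the use of a load series to represent "heavy load", and all names below are assumptions of this sketch; the patent only states that the metrics stay below preset thresholds for an extended period while the node is heavily loaded.

```python
# Sketch only: each argument is a list of samples, newest last.
def sustained_below(samples, threshold, window):
    """True if the last `window` samples all lie below `threshold`."""
    if window <= 0 or len(samples) < window:
        return False
    return all(value < threshold for value in samples[-window:])

def diagnose(flops, cpi, load, flops_min, cpi_min, load_min, window):
    """Return a fault cause if the frequency-reduction rule fires, else None."""
    heavy = len(load) >= window and all(v >= load_min for v in load[-window:])
    if (heavy
            and sustained_below(flops, flops_min, window)
            and sustained_below(cpi, cpi_min, window)):
        return "abnormal processor frequency reduction"
    return None
```

Requiring the whole window to be below the threshold, rather than a single sample, avoids alarms on brief dips.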
The present invention may of course have various other embodiments. Without departing from the spirit and essence of the present invention, those skilled in the art can make corresponding changes and modifications according to the present invention, and all such changes and modifications shall fall within the scope of the claims of the present invention.
Claims (12)
1. A system for intelligently monitoring compute nodes of a large-scale data center cluster, comprising monitoring nodes installed on the compute nodes of the data center cluster, a monitoring device that communicates with each monitoring node, and a user terminal device, characterized in that:
The monitoring node is configured to collect hardware micro-architecture data indicators of the compute node by obtaining control of the hardware control registers of the compute node, to obtain data indicators related to the processes of the application programs running on the compute node by obtaining control of the operating system kernel, and to send the data indicators to the monitoring device;
The monitoring device is configured to receive the data indicators, perform big data analysis based on the data indicators, and send the result of the analysis to the user terminal device; the user terminal device is configured to receive the result and display it to the user;
The data indicators related to the processes of the application programs running on the compute node include one or a combination of: the number of process switches, stack information, and heap memory allocation;
The monitoring device is further configured to perform big data processing on the data indicators stored in a database according to preset statistical settings, and to provide data statistics and analysis results to different users according to a preset classified display scheme;
The monitoring device also has a user interface, specifically for receiving a user-defined data mining algorithm and performing data mining according to that data mining algorithm.
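As a rough illustration of the process-related indicators named in claim 1, the sketch below reads them from the Linux `/proc/<pid>/status` file. This is a user-space substitute chosen for brevity and is not the kernel-level collection the claim describes; the function names and the returned dictionary keys are hypothetical. `VmData` (data segment, which includes the heap) and `VmStk` (stack) are used as approximations of heap and stack usage.

```python
def parse_proc_status(text: str) -> dict:
    """Extract process indicators from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    def kilobytes(name: str) -> int:
        # Memory fields look like "132 kB"; kernel threads omit them.
        return int(fields.get(name, "0 kB").split()[0])

    return {
        # Voluntary plus involuntary context switches of the process.
        "context_switches": int(fields.get("voluntary_ctxt_switches", 0))
        + int(fields.get("nonvoluntary_ctxt_switches", 0)),
        "stack_kb": kilobytes("VmStk"),
        "data_kb": kilobytes("VmData"),
    }


def read_process_metrics(pid="self") -> dict:
    """Read the indicators for one process on a Linux host."""
    with open(f"/proc/{pid}/status") as f:
        return parse_proc_status(f.read())
```

Separating parsing from file access keeps the parser usable on status text that a monitoring node has already shipped to the monitoring device.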
2. The system of claim 1, characterized in that the analysis includes: locating the failed compute node according to the data indicators, and determining the cause of the failure.
3. The system of claim 1 or 2, characterized in that: the hardware micro-architecture data indicators include one or a combination of: the real-time floating-point operation speed of the CPU, Streaming SIMD Extensions (SSE) unit utilization, Advanced Vector Extensions (AVX) unit utilization, vector instruction vectorization ratio, clock cycles per instruction (CPI), last-level cache (LLC) hit rate, memory bandwidth, PCI Express (PCI-E) bus interface device bandwidth, and cache hit/miss rate.
4. The system of claim 3, characterized in that: the data indicator is the real-time floating-point operation speed of the CPU and/or the clock cycles per instruction (CPI), and the analysis includes: when the data indicator remains below a preset threshold throughout a preset time period, determining that the processor has failed, and determining that the cause of the failure is abnormal frequency reduction of the processor.
5. The system of claim 1, characterized in that: the monitoring node also collects the CPU utilization, memory utilization, local disk I/O data and/or Ethernet throughput provided by the operating system.
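As one example of an operating-system-provided indicator, CPU utilization on Linux can be estimated from two readings of the aggregate `cpu` line in `/proc/stat`. The sketch below is an assumption about how a monitoring node might do this, not something the patent specifies; the function names are hypothetical.

```python
def parse_cpu_line(line: str):
    """Return (total_jiffies, idle_jiffies) from the aggregate 'cpu' line."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    # user, nice, system, idle, iowait, irq, softirq, steal; the guest
    # fields that may follow are already counted in user and nice.
    values = [int(v) for v in parts[1:9]]
    idle = values[3] + (values[4] if len(values) > 4 else 0)
    return sum(values), idle


def cpu_utilization(before: str, after: str) -> float:
    """Fraction of non-idle CPU time between two 'cpu' line readings."""
    total0, idle0 = parse_cpu_line(before)
    total1, idle1 = parse_cpu_line(after)
    total = total1 - total0
    if total <= 0:
        return 0.0
    return (total - (idle1 - idle0)) / total


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the first line of /proc/stat, which aggregates all CPUs."""
    with open(path) as f:
        return f.readline()
```

A caller would take one reading, wait for the sampling interval, take a second reading, and pass both to `cpu_utilization`.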
6. The system of claim 1, characterized in that: the hardware control registers of the compute node are the MSR control registers in the performance monitoring unit (PMU) of the processor of the compute node.
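For orientation, the sketch below shows one way user-space code on Linux can read a model-specific register through the `msr` driver, which exposes each CPU's registers as `/dev/cpu/<n>/msr` with the register address used as the file offset. This is an assumption about a possible implementation, not the patent's own mechanism. It requires root privileges and a loaded `msr` module, and the fixed-function counters only count after they have been enabled through the PMU control registers, which is not shown here. The `device` parameter is a hypothetical hook for testing.

```python
import os
import struct

# Architectural fixed-function counter addresses on Intel processors.
IA32_FIXED_CTR0 = 0x309  # instructions retired
IA32_FIXED_CTR1 = 0x30A  # unhalted core clock cycles


def read_msr(address: int, cpu: int = 0, device=None) -> int:
    """Read one 64-bit MSR value for the given logical CPU."""
    path = device or f"/dev/cpu/{cpu}/msr"
    fd = os.open(path, os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, address)  # the MSR address is the file offset
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError(f"short read from {path} at {address:#x}")
    return struct.unpack("<Q", raw)[0]


def cpi_from_snapshots(before, after) -> float:
    """Compute CPI from two (instructions, cycles) counter snapshots."""
    instructions = after[0] - before[0]
    cycles = after[1] - before[1]
    if instructions <= 0:
        raise ValueError("no instructions retired between snapshots")
    return cycles / instructions
```

Two snapshots of `(read_msr(IA32_FIXED_CTR0), read_msr(IA32_FIXED_CTR1))` taken one sampling interval apart can be passed to `cpi_from_snapshots`; counter wraparound is ignored in this sketch.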
7. A method for intelligently monitoring compute nodes of a large-scale data center cluster, characterized in that:
A monitoring node is started on a compute node;
The monitoring node collects hardware micro-architecture data indicators of the compute node by obtaining control of the hardware control registers of the compute node, obtains data indicators related to the processes of the application programs running on the compute node by obtaining control of the operating system kernel, and sends the data indicators to a monitoring device;
The monitoring device receives the data indicators, performs big data analysis based on the data indicators, and sends the result of the analysis to a user terminal device;
The user terminal device receives the result and displays it to the user;
The data indicators related to the processes of the application programs running on the compute node include one or a combination of: the number of process switches, stack information, and heap memory allocation;
The monitoring device performs big data processing on the data indicators stored in a database according to preset statistical settings, and provides data statistics and analysis results to different users according to a preset classified display scheme;
The monitoring device also has a user interface, receives a user-defined data mining algorithm, and performs data mining according to that data mining algorithm.
8. The method of claim 7, characterized in that the analysis includes: locating the failed compute node according to the data indicators, and determining the cause of the failure.
9. The method of claim 7 or 8, characterized in that: the hardware micro-architecture data indicators include one or a combination of: the real-time floating-point operation speed of the CPU, Streaming SIMD Extensions (SSE) unit utilization, Advanced Vector Extensions (AVX) unit utilization, vector instruction vectorization ratio, clock cycles per instruction (CPI), last-level cache (LLC) hit rate, memory bandwidth, PCI Express (PCI-E) bus interface device bandwidth, and cache hit/miss rate.
10. The method of claim 9, characterized in that: the data indicator is the real-time floating-point operation speed of the CPU and/or the clock cycles per instruction (CPI), and the analysis includes: when the data indicator remains below a preset threshold throughout a preset time period, determining that the processor has failed, and determining that the cause of the failure is abnormal frequency reduction of the processor.
11. The method of claim 10, characterized in that: the monitoring node also collects the CPU utilization, memory utilization, local disk I/O data and/or Ethernet throughput provided by the operating system.
12. The method of claim 11, characterized in that: the hardware control registers of the compute node are the MSR control registers in the performance monitoring unit (PMU) of the processor of the compute node.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410377856.0A CN104156296B (en) | 2014-08-01 | 2014-08-01 | The system and method for intelligent monitoring large-scale data center cluster calculate node |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104156296A (en) | 2014-11-19 |
CN104156296B (en) | 2017-06-30 |
Family
ID=51881801
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410377856.0A Active CN104156296B (en) | 2014-08-01 | 2014-08-01 | The system and method for intelligent monitoring large-scale data center cluster calculate node |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104156296B (en) |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104407959A (en) * | 2014-12-12 | 2015-03-11 | 深圳中兴网信科技有限公司 | Application based monitoring method and monitoring device |
CN106325200B (en) * | 2016-08-30 | 2019-04-23 | 江苏永冠给排水设备有限公司 | A kind of implementation method based on self-service hypochlorite generator's group control of equipment system of networking |
CN107205243A (en) * | 2017-06-05 | 2017-09-26 | 柳州市盛景科技有限公司 | A kind of intelligent gateway for possessing monitoring function |
CN107257305B (en) * | 2017-08-02 | 2020-05-15 | 苏州浪潮智能科技有限公司 | Monitoring method and device for multi-node system |
CN108108282B (en) * | 2017-12-07 | 2020-06-23 | 联想(北京)有限公司 | Information processing method and device and electronic equipment |
CN108319538B (en) * | 2018-02-02 | 2019-11-08 | 世纪龙信息网络有限责任公司 | The monitoring method and system of big data platform operating status |
CN108845878A (en) * | 2018-05-08 | 2018-11-20 | 南京理工大学 | The big data processing method and processing device calculated based on serverless backup |
CN109040478A (en) * | 2018-08-31 | 2018-12-18 | 北京云迹科技有限公司 | The overload alarm method and device of phone box |
CN110928750B (en) * | 2018-09-19 | 2023-04-18 | 阿里巴巴集团控股有限公司 | Data processing method, device and equipment |
CN110928738B (en) * | 2018-09-19 | 2023-04-18 | 阿里巴巴集团控股有限公司 | Performance analysis method, device and equipment |
CN109660537A (en) * | 2018-12-20 | 2019-04-19 | 武汉钢铁工程技术集团通信有限责任公司 | A method of real time monitoring and maintenance cloud platform physical resource service operation state |
CN113574502A (en) * | 2020-02-12 | 2021-10-29 | 深圳元戎启行科技有限公司 | Data acquisition method and device for unmanned vehicle operating system |
CN112148316B (en) * | 2020-09-29 | 2022-04-22 | 联想(北京)有限公司 | Information processing method and information processing device |
CN112306802A (en) * | 2020-10-29 | 2021-02-02 | 平安科技(深圳)有限公司 | Data acquisition method, device, medium and electronic equipment of system |
WO2023279815A1 (en) * | 2021-07-08 | 2023-01-12 | 华为技术有限公司 | Performance monitoring system and related method |
CN117724928A (en) * | 2023-12-15 | 2024-03-19 | 谷技数据(武汉)股份公司 | Intelligent operation and maintenance visual monitoring method and system based on big data |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102945198A (en) * | 2012-10-19 | 2013-02-27 | 浪潮电子信息产业股份有限公司 | Method for characterizing application characteristics of high performance computing |
CN103246569A (en) * | 2013-05-20 | 2013-08-14 | 浪潮(北京)电子信息产业有限公司 | Method and device for representing high-performance calculation application characteristics |
CN103501253A (en) * | 2013-10-18 | 2014-01-08 | 浪潮电子信息产业股份有限公司 | Monitoring organization method for high-performance computing application characteristics |
Non-Patent Citations (1)
Title |
---|
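大规模机群监控***信息采集与存储技术研究 (Research on information collection and storage technology for large-scale cluster monitoring ***); Yi Zhaohua; China Doctoral and Master's Theses Full-text Database; 2006-06-15 (No. 01); pp. I140-86 * |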
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||