CN104156296B

CN104156296B - System and method for intelligently monitoring compute nodes of a large-scale data center cluster

Info

Publication number: CN104156296B
Application number: CN201410377856.0A
Authority: CN
Inventors: 刘羽; 吕文静; 金莲; 陈博文; 于涛
Original assignee: Inspur Beijing Electronic Information Industry Co Ltd
Current assignee: Inspur Beijing Electronic Information Industry Co Ltd
Priority date: 2014-08-01
Filing date: 2014-08-01
Publication date: 2017-06-30
Anticipated expiration: 2034-08-01
Also published as: CN104156296A

Abstract

A system and method for intelligently monitoring the compute nodes of a large-scale data center cluster are proposed. A monitor node in the system collects hardware micro-architecture data indicators of a compute node and data indicators related to the processes of the application programs running on it, and sends the data indicators to a monitoring device in the system. The monitoring device performs big data analysis and sends the result to a user terminal device, which displays it to the user. The system and method can collect micro-architecture data indicators of compute nodes and process-level data indicators of running applications, perform intelligent big data analysis, automatically locate a faulty compute node, and give the cause of the fault.

Description

System and method for intelligently monitoring compute nodes of a large-scale data center cluster

Technical field

The present invention relates to the field of computer technology, and in particular to a system and method for intelligently monitoring the compute nodes of a large-scale data center cluster.

Background art

With the continuous progress of human society and the development of science and technology, people's understanding of nature has grown ever broader, and the demand to explore the outside world has grown ever more urgent. As a result, the amount of information data that humanity holds is growing sharply, and at the same time this massive data must be analyzed and processed in a timely manner. For example, a large astronomical radio telescope array can produce more than 100 GB of cosmic microwave data per second, all of which must be analyzed promptly. As another example, in particle physics research, the data produced by a single collision in the Large Hadron Collider (LHC) is measured in terabytes. In addition, fields such as the Human Genome Project, oil exploration, and weather forecasting place ever higher demands on computing capability. Against this background, numerical computation has become a third, extremely important means of scientific exploration alongside experiment and theoretical analysis. For this reason, the world's scientific and technological powers are all vigorously developing supercomputers. For example, in the TOP500 list issued in December 2013, China's first-ranked Tianhe-2 (TH-2) reached a peak speed of 54.9 PFlops and used more than 16,000 compute nodes in total.

In addition, with the development of new technologies such as cloud computing, big data, and the Internet of Things, more and more large data centers and cloud computing centers have appeared. They often have tens of thousands of computer nodes. For example, Google's data center in The Dalles, Oregon, USA, has about 150,000 server nodes. In a data center of this scale, performance monitoring, fault location, and fault recovery of compute nodes, as well as statistics on the overall efficiency of the center, all face unprecedented challenges. How to efficiently manage and use a large-scale or even ultra-large-scale data center is therefore a popular field that countries around the world are working to explore.

For a long time, the monitoring and management of data centers has been done manually or semi-automatically. Operations and maintenance (O&M) personnel need to check the running status of the cluster in real time. When a problem occurs, they can sometimes locate the node, but often cannot pinpoint the faulty device, and must rely on staff experience to diagnose and troubleshoot, which is time-consuming and laborious. Cluster users can learn the status of their own jobs through various job scheduling software, but can rarely obtain historical statistics and analysis of their jobs. Cluster decision-makers often cannot obtain decision-related information such as expenditure, usage efficiency, staff efficiency, and cost-effectiveness directly from the cluster, and can only make decisions through manual analysis of large amounts of data, which is also time-consuming and laborious. Furthermore, application developers often cannot obtain from the cluster the information they urgently need to optimize application software, such as hardware micro-architecture, system process, stack, and module error and crash statistics, and must obtain it empirically through a large number of experiments, which is likewise time-consuming and laborious.

Summary of the invention

The present invention proposes a system and method for intelligently monitoring the compute nodes of a large-scale data center cluster. It is large-scale, multi-functional, and oriented to multiple user groups. It has comprehensive intelligent analysis and statistics functions and can provide data references for the decisions of users at different levels.

The system includes: monitor nodes installed on the compute nodes of the data center cluster, a monitoring device that communicates with each monitor node, and a user terminal device, characterized in that:

The monitor node is configured to collect hardware micro-architecture data indicators of the compute node by obtaining control of a hardware control register of the compute node, to obtain data indicators related to the processes of the application programs running on the compute node by obtaining control of the operating system kernel, and to send the data indicators to the monitoring device;

The monitoring device is configured to receive the data indicators, perform big data analysis based on the data indicators, and send the result of the analysis to the user terminal device;

The user terminal device is configured to receive the result and display it to the user.

The method includes:

Starting the monitor node on the compute node;

The monitor node collects hardware micro-architecture data indicators of the compute node by obtaining control of a hardware control register of the compute node, obtains data indicators related to the processes of the application programs running on the compute node by obtaining control of the operating system kernel, and sends the data indicators to the monitoring device;

The monitoring device receives the data indicators, performs big data analysis based on the data indicators, and sends the result of the analysis to the user terminal device;

The user terminal device receives the result and displays it to the user.

In particular, the analysis includes: locating the faulty compute node according to the data indicators, and determining the cause of the fault.

In particular, the hardware micro-architecture data indicators include one or more of the following: the real-time floating-point operation speed of the CPU, Streaming SIMD Extensions (SSE) unit utilization, Advanced Vector Extensions (AVX) unit utilization, the vectorization ratio of vector instructions, the number of clock cycles needed to complete each instruction (CPI), the last-level cache (LLC) hit rate, memory bandwidth, PCI Express (PCI-E) device bandwidth, and the cache hit/miss rate. The data indicators related to the processes of the application programs running on the compute node include one or more of the following: the number of process switches, stack information, and heap memory allocation.

In particular, when the data indicator is the real-time floating-point operation speed of the CPU and/or the number of clock cycles needed to complete each instruction (CPI), the analysis includes: when the data indicator stays below a preset threshold throughout a preset time period, determining that the processor has failed, and determining that the cause of the fault is abnormal frequency reduction of the processor.

In particular, the monitor node also collects the CPU utilization, memory usage, local disk I/O data, and/or Ethernet throughput provided by the operating system.

In particular, the hardware control register of the compute node is the MSR control register in the performance monitoring unit (PMU) of the compute node's processor.

The beneficial effects of the invention are as follows:

The performance monitoring apparatus on each compute node extracts the necessary system-level performance indicator information and sends it to the monitoring management node, which is responsible for maintaining it. The monitoring management node has anomaly recognition and alarm capabilities; it also mines the recorded historical data separately for each user group and feeds the results back to the users. In addition, the monitoring management node can, on demand and for a given time period, extract information such as hardware micro-architecture characteristics, processes, and stacks from a specified monitor node. Monitoring of a large-scale cluster thereby becomes multi-user, multi-functional, and intelligent.

To keep monitoring timely, the monitoring client on each compute node refreshes once per second. To reduce resource usage on the compute node, each compute node extracts only the minimum set of indicators necessary for data analysis: a dozen or so indicators including CPU utilization, memory usage, local disk reads and writes, and Ethernet throughput.

To be multi-functional, the intelligent monitoring system also provides monitoring and analysis of indicators related to the hardware micro-architecture, such as floating-point operation speed, vectorization ratio, memory bandwidth, and IB bandwidth. Because monitoring these indicators consumes relatively more system resources, it is started on demand according to user instructions.

To be multi-user, the intelligent monitoring system provides hierarchical views for four levels: the management level, the O&M level, the application user level, and the application development level.

To be intelligent, the intelligent monitoring system introduces a data mining analysis method that computes, from the basic performance monitoring data, the statistical indicators of greatest interest to users at each level.

Brief description of the drawings

Fig. 1 is a block diagram of a system for intelligently monitoring a large-scale data center cluster proposed by the present invention.

Fig. 2 is a flow chart of a method for intelligently monitoring a large-scale data center cluster proposed by the present invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.

Fig. 1 shows the system for intelligently monitoring the compute nodes of a large-scale data center cluster proposed by the present invention. It includes monitor nodes installed on the compute nodes of the data center cluster, a monitoring device connected to each monitor node, and a user terminal device. Each compute node of the data center cluster has the corresponding hardware, such as a processor (CPU), memory, hard disks, and an Ethernet controller, and runs an operating system and application software. The monitoring device includes a main monitor node and a database. The main monitor node communicates with each monitor node on the compute nodes and can obtain the hardware and software operating data of the compute nodes, such as CPU utilization, memory usage, local disk I/O data, and Ethernet throughput, as well as micro-architecture data indicators of the compute node hardware and process-level data indicators of the running application programs. The main monitor node writes the obtained data into the database, automatically performs big data mining, and saves the mining results. The user reads the results from the database through the user terminal device, which displays them. The user can also submit a user-defined data mining program to the monitoring device through the user terminal device; the monitoring device then extracts the corresponding data indicators of the cluster nodes, performs big data mining according to the user-defined program, and displays the result to the user.

Referring to Fig. 2, the method for intelligently monitoring the compute nodes of a large-scale data center cluster proposed by the present invention consists of several main steps: data acquisition, big data mining, classified display, and fault location and alarm. Data acquisition includes basic data acquisition and advanced data acquisition. Basic data acquisition is performed automatically by the system and requires no user configuration; advanced data acquisition is configured according to the user's intent.

1. Data acquisition

Data acquisition means installing a monitor node on each compute node of the data center cluster and extracting the compute node's CPU utilization, memory usage, local disk I/O data, and Ethernet throughput, as well as micro-architecture data indicators of the compute node hardware and process-level data indicators of the running application programs. Collection of the hardware micro-architecture data indicators and the process-level data indicators is called advanced data acquisition; collection of the remaining indicators is called basic data acquisition. Basic data acquisition is a system default step that runs without user intervention, whereas advanced data acquisition is configured and executed according to user needs. Because the performance indicator data must be timely, the monitor node must be able to refresh its readings at one-second granularity while keeping its resource usage on the compute node extremely low.
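As an illustration of the basic acquisition step, the following minimal sketch derives CPU utilization from two successive readings of the aggregate `cpu` line of Linux `/proc/stat`. The patent does not say how the operating-system indicators are read; the use of `/proc/stat` and the names `parse_cpu_line`, `cpu_utilization`, and `sample_once` are assumptions made only for this example.

```python
def parse_cpu_line(line):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = [int(v) for v in line.split()[1:]]
    # Treat idle and iowait as non-busy time.
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
    # user, nice, system, idle, iowait, irq, softirq, steal;
    # guest time is already accounted for inside user time.
    total = sum(fields[:8])
    return total - idle, total


def cpu_utilization(prev, curr):
    """Fraction of non-idle time between two (busy, total) samples."""
    d_busy = curr[0] - prev[0]
    d_total = curr[1] - prev[1]
    return d_busy / d_total if d_total > 0 else 0.0


def sample_once(path="/proc/stat"):
    """Read one (busy, total) sample from the running Linux system."""
    with open(path) as f:
        return parse_cpu_line(f.readline())
```

A monitor node built along these lines would call `sample_once()` once per second and pass consecutive samples to `cpu_utilization`, which keeps the per-refresh cost to a single small file read.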

The data acquisition method proposed by the present invention differs from that of the prior art. In the prior art, data acquisition only collects the indicators that the operating system itself provides; that is, collection depends on the operating system running on the compute node, and a monitor node cannot obtain indicators that the operating system does not provide. The data acquisition method proposed by the present invention can not only collect the indicators provided by the operating system, but can also collect hardware micro-architecture data indicators, such as the real-time floating-point operation speed of the CPU, SSE (Streaming SIMD Extensions) unit utilization, AVX (Advanced Vector Extensions) unit utilization, the vectorization ratio of vector instructions, the number of clock cycles needed to complete each instruction (CPI), the last-level cache (LLC) hit rate, translation lookaside buffer (TLB) parameters, memory bandwidth, PCI Express (PCI-E) device bandwidth, and the cache hit/miss rate. It can further collect process-level data indicators of application programs, such as the number of process switches, stack information, and heap memory allocation. These indicators are of great significance for tuning the performance of application software, analyzing cluster characteristics, and locating software-level faults.

Because hardware-level and process-level data indicators must be collected, the monitor node proposed by the present invention is implemented as a software client. The monitor node collects basic data in the same way as the prior art, which is not repeated here. The advanced data acquisition process is described as follows:

Extracting the hardware micro-architecture data indicators requires controlling the relevant registers in the hardware. For processor micro-architecture data indicators, for example, this is mainly done by controlling the performance monitoring unit (PMU) in the processor. The monitor node of this scheme must therefore have the highest (root) privilege. The control flow for the PMU is as follows:

S1: Obtain control of the MSR (Model-Specific Register) control register in the PMU of the compute node's processor;

S2: Write the code and mask of the relevant event into the MSR control register now under control, and set the control register to start counting the event. For example, to collect the LLC hit rate indicator, first write the code and mask of the LLC hit event into the MSR control register, then set the register to start counting LLC hits; after counting ends, read the count from the control register and compute the LLC hit rate.
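Steps S1 and S2 can be sketched as follows, under the assumption of an Intel x86-64 processor accessed through the Linux `msr` driver (`/dev/cpu/<n>/msr`), neither of which the patent specifies. The register addresses and the LLC event code and unit masks are the architectural values from the Intel Software Developer's Manual; the function names are hypothetical, and on many processors a global counter-enable register must also be set before counting starts, which this sketch omits.

```python
import os
import struct

# Architectural values from the Intel SDM (assumed target platform).
IA32_PERFEVTSEL0 = 0x186    # event-select register for general-purpose counter 0
IA32_PMC0 = 0x0C1           # general-purpose performance counter 0
LLC_EVENT = 0x2E            # last-level cache event code
UMASK_LLC_REFERENCE = 0x4F  # unit mask: LLC references
UMASK_LLC_MISS = 0x41       # unit mask: LLC misses


def encode_perfevtsel(event, umask, user=True, kernel=True, enable=True):
    """Pack an event code and unit mask into an IA32_PERFEVTSELx value."""
    value = (event & 0xFF) | ((umask & 0xFF) << 8)
    value |= (1 << 16) if user else 0    # USR: count while in user mode
    value |= (1 << 17) if kernel else 0  # OS: count while in kernel mode
    value |= (1 << 22) if enable else 0  # EN: enable the counter
    return value


def write_msr(cpu, address, value):
    """Write a 64-bit MSR via the Linux msr driver (root and `modprobe msr` required)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), address)
    finally:
        os.close(fd)


def read_msr(cpu, address):
    """Read a 64-bit MSR via the Linux msr driver."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, address))[0]
    finally:
        os.close(fd)


def llc_hit_rate(references, misses):
    """Hit rate derived from LLC reference and miss counts over one interval."""
    if references <= 0:
        return 0.0
    return (references - misses) / references
```

In this sketch the hit rate is derived from reference and miss counts rather than from a direct hit count, because references and misses are the two LLC events defined architecturally; a monitor node would program one counter for each, wait for the sampling interval, read both counters, and call `llc_hit_rate`.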

Extracting system-kernel-level indicators requires monitoring the relevant code in the kernel. For example, monitoring process switching requires monitoring the relevant control flow in the process management part of the kernel code. Monitoring starts once the kernel has loaded successfully when the compute node boots. The monitor node must therefore have kernel-level control. Because extracting kernel-level indicators may slightly affect system performance, it can be provided on demand for specific monitoring scenarios.
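The patent obtains the process switch count by monitoring kernel code directly. As a lighter-weight stand-in, the sketch below reads the per-process context switch counters that the Linux kernel already exports in `/proc/<pid>/status`. This is a substitute technique chosen for illustration, not the kernel-level monitoring the patent describes, and the function names are hypothetical.

```python
CTXT_KEYS = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")


def parse_context_switches(status_text):
    """Extract the context switch counters from the text of /proc/<pid>/status."""
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in CTXT_KEYS:
            counts[key] = int(value.strip())
    return counts


def total_context_switches(status_text):
    """Sum of voluntary and involuntary context switches for one process."""
    return sum(parse_context_switches(status_text).values())


def read_status(pid):
    """Read the status file of a running process on Linux."""
    with open(f"/proc/{pid}/status") as f:
        return f.read()
```

Sampling `total_context_switches(read_status(pid))` at two points in time and taking the difference gives the number of switches in that interval for the monitored application process.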

2. Big data mining and classified display

The monitor node on each compute node can also send data to the monitoring device, which receives the data and manages all monitor nodes centrally. The main monitor node in the monitoring device receives the data indicators collected by each monitor node and sends control commands to each monitor node. The control commands include the basic data acquisition commands generated by system default and the advanced data acquisition commands generated according to user settings, and each monitor node collects the corresponding data indicators according to the control commands it receives. The main monitor node also stores the received data indicators in the database in a defined storage format, where they serve as input to the subsequent data mining step.
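The storage and command roles of the main monitor node can be sketched as follows. The patent names neither a database nor a storage format; the SQLite table, the `basic`/`advanced` command levels expressed as a dictionary, and all function names below are assumptions for illustration only.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS metrics (
    node  TEXT    NOT NULL,  -- compute node identifier
    ts    INTEGER NOT NULL,  -- sample time, seconds since the epoch
    name  TEXT    NOT NULL,  -- indicator name, e.g. 'cpu_util'
    value REAL    NOT NULL
)
"""


def open_store(path=":memory:"):
    """Open (or create) the indicator database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store_sample(conn, node, ts, indicators):
    """Persist one batch of indicators received from a monitor node."""
    rows = [(node, ts, name, float(value)) for name, value in indicators.items()]
    conn.executemany("INSERT INTO metrics VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def make_command(node, level, indicators):
    """Build a control command: 'basic' (system default) or 'advanced' (user-requested)."""
    if level not in ("basic", "advanced"):
        raise ValueError(f"unknown acquisition level: {level}")
    return {"node": node, "level": level, "indicators": list(indicators)}


def node_average(conn, node, name):
    """Mean value of one indicator for one node, as input to later mining."""
    row = conn.execute(
        "SELECT AVG(value) FROM metrics WHERE node = ? AND name = ?", (node, name)
    ).fetchone()
    return row[0]
```

A narrow table with one row per indicator was chosen here so that advanced indicators switched on by the user later need no schema change.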

To be intelligent, the monitoring device also has big data mining capability: it processes the data indicators stored in the database according to preset statistics settings and, following a preset classified display scheme, provides data statistics and analysis results to each type of user separately. The monitoring device also has a user interface through which it can receive a custom data mining algorithm and perform data mining according to that algorithm. The preset statistics settings include:

I. Management-level user group indicators

1. Throughput (task throughput)

A. Number of tasks and applications running in real time

B. Number of tasks completed (failed) each day over one week (month, year) [chart, table]

C. Average number of tasks completed (failed) per day over one week (month, year)

D. Total number of tasks completed (failed) over one week (month, year)

E. Time per task

2. O&M cost (energy consumption) (computing, storage, switching, machine room [cooling])

A. Real-time total power consumption

B. Daily energy consumption (kWh) over one week (month, year) [chart, table]

C. Average daily energy consumption (kWh) over one week (month, year)

D. Total energy consumption (kWh) over one week (month, year)

E. Monitoring of equipment depreciation and overall machine-room amortized cost, ratio statistics among the cost items, and number of jobs completed per unit cost

3. Asset utilization efficiency

A. Daily cluster duty cycle over one week (month, year)

B. Average daily cluster duty cycle over one week (month, year)

C. Daily cluster peak hours over one week (month, year) (computed from the hourly cluster duty cycle)

D. Recurring busy hours over one week (month, year) (the year-long duty cycle across the 24-hour day)

E. Real-time number of online users (with special authorization, personal information can be viewed)

F. Daily number of online users over one week (month, year) [chart, table]

G. Average daily number of online users over one week (month, year)

H. Average number of tasks completed per user per day over one week (month, year)

I. Average number of tasks completed per user over one week (month, year)

4. Equipment health

A. Real-time number of faulty nodes and failure rate

B. Daily number of faulty nodes and failure rate over one week (month, year) [chart, table]

C. Average daily number of faulty nodes and failure rate over one week (month, year)
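The throughput indicators in item 1 above reduce to simple aggregations over task records. The following minimal sketch computes the daily completed and failed counts (item B) and the per-day average (item C); the record layout with `day` and `status` fields is an assumption, since the patent does not define how task records are stored.

```python
from collections import Counter
from datetime import date


def daily_counts(tasks):
    """Count completed and failed tasks per day.

    Each task is a dict with a 'day' (datetime.date) and a 'status'
    of 'completed' or 'failed'; other statuses are ignored.
    """
    completed, failed = Counter(), Counter()
    for task in tasks:
        if task["status"] == "completed":
            completed[task["day"]] += 1
        elif task["status"] == "failed":
            failed[task["day"]] += 1
    return completed, failed


def average_per_day(counter, days):
    """Average daily count over a reporting window of `days` days."""
    return sum(counter.values()) / days if days > 0 else 0.0
```

Dividing by the length of the reporting window rather than by the number of days that have records keeps idle days in the average, which matches a week, month, or year view.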

II. Cluster equipment management and O&M personnel user group indicators

1. Fault alarm and location

A. Real-time number of faulty nodes and failure rate

B. Daily faulty node records and failure rate over one week (month, year) [chart, table]

C. Average number of failures per node and failure rate per node over one week (month, year) (to identify fault-prone nodes)

D. Real-time location of faulty nodes

E. Real-time alarms for faulty nodes

F. Classification of the failure types of failed nodes: connectable, not connectable, power loss, etc.

G. For connectable failures, pinpointing the faulty device: position of the failed disk, dropped memory module (position), etc.

2. Equipment running status

A. Real-time overall cluster CPU utilization and central storage I/O bandwidth

B. Daily cluster-wide average CPU utilization and average central storage I/O bandwidth over one week (month, year)

C. Cluster-wide average CPU utilization and average central storage I/O bandwidth over one week (month, year)

D. Real-time view of the running status of each node: CPU, memory, local disk, network, and other indicators

E. Historical query of the daily running status of all nodes over the past year

F. Resource bottleneck analysis (CPU, storage, memory, network [distinguishing storage from data exchange])

3. Billing function

A. User machine-hour statistics
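Item 1.C of the list above, identifying fault-prone nodes, can be sketched as a ranking of nodes by recorded failure events. The event layout with a `node` field and the function names are hypothetical; the patent does not define the record format.

```python
from collections import Counter


def failure_ranking(fault_events, top=3):
    """Return the `top` nodes with the most recorded failures, most frequent first."""
    counts = Counter(event["node"] for event in fault_events)
    return counts.most_common(top)


def node_failure_rate(fault_events, node, days):
    """Failures per day for one node over a reporting window of `days` days."""
    failures = sum(1 for event in fault_events if event["node"] == node)
    return failures / days if days > 0 else 0.0
```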

III. Task user group indicators

1. Current task information

A. Number of nodes and cores used by the current task, memory occupied, etc.

B. Status information of the nodes used by the current task: CPU, memory, local disk, network, etc.

C. Number of tasks currently queued

D. Queuing time of the current task

2. Historical task statistics

A. The user's historical running time

B. The user's average historical running time

C. Number of historical tasks the user completed (failed)

D. Task success rate (successful tasks / failed tasks)

E. Number of nodes and cores used by the user's historical tasks

F. Average number of nodes and cores used by the user's historical tasks

G. Average queuing time of historical tasks

IV. Application software developer user group indicators

1. Program (module) usage statistics

A. Total number of modules processed (failed) each day over one week (month, year)

B. Module failure rate over one week (month, year)

C. Module usage popularity statistics and ranking over one week (month, year), and each module's share of total uses

D. Failed-module popularity statistics and ranking over one week (month, year), and each failed module's share of total failures

2. Performance tracking indicators

A. Load of the services used by all applications (database, file system, job scheduling, intermediate acceleration layer, parallel framework, etc.)

B. Micro-architecture-level information: cache hit/miss rates, TLB

C. Operating-system-level information: number of processes, process switching, stack, heap memory allocation, etc.

3. Statistics on user usage habits

A. Data access latency of interactive applications, dwell time, I/O access patterns, etc.

Finally, the monitoring device displays the statistical analysis information mined above on the user terminal device, separately for each specified user level.

In the embodiments of the present invention, data mining is differentiated by user type. The mining items listed in the invention were summarized after fully analyzing the actual needs and concerns of each type of user. Common monitoring tools do not provide such indicators; the data must be exported and analyzed manually, whereas the implementation proposed by the present invention completes this intelligently and automatically. In addition, the implementation proposed by the present invention reserves a custom data mining interface through which user-defined data mining programs can be executed.
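One possible shape for the reserved custom data mining interface is a registry of user-supplied functions that the monitoring device runs over stored records. The patent does not describe the interface itself; the `MiningRegistry` class, the record fields, and the example algorithm `busiest_node` are all hypothetical.

```python
from statistics import mean


class MiningRegistry:
    """Holds named, user-defined mining functions and runs them on demand."""

    def __init__(self):
        self._algorithms = {}

    def register(self, name, func):
        """Accept a user-defined algorithm: any callable taking a list of records."""
        if not callable(func):
            raise TypeError("mining algorithm must be callable")
        self._algorithms[name] = func

    def run(self, name, records):
        """Run a registered algorithm over the given records and return its result."""
        return self._algorithms[name](records)


def busiest_node(records):
    """Example user algorithm: the node with the highest mean CPU utilization."""
    per_node = {}
    for record in records:
        per_node.setdefault(record["node"], []).append(record["cpu_util"])
    return max(per_node, key=lambda node: mean(per_node[node]))
```

A production interface would also need to sandbox and time-limit user code, which this sketch does not attempt.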

3. Fault location and alarm

Through the data mining analysis described above, the current performance indicators of the equipment in each compute node can be obtained, and from these indicators it can be determined whether the equipment has failed and why. On the one hand, the error information can be shown to the relevant user through the intelligent display module of the user terminal device. On the other hand, a fault alarm module, such as an audio device or a light unit, can be installed on the user terminal device to issue a warning when equipment fails, prompting maintenance personnel to attend to the faulty equipment quickly and clear the fault promptly.

Faults in equipment or application software are reflected in the collected performance data indicators. For simplicity and ease of use, the present invention locates faults by analyzing anomalies in these indicators; in particular, some performance-related faults cannot be diagnosed by conventional methods. For example, poor cooling of the cluster can cause processors to run at reduced frequency, which conventional fault monitoring does not report. With the method proposed by the present invention, because processor micro-architecture data indicators are collected, the processor's floating-point operation speed and the number of clock cycles needed to complete each instruction (CPI) can be monitored in real time. When the monitored node is under heavy load and these two indicators remain below the preset thresholds for a prolonged period, the monitoring device determines that a fault has occurred and raises an intelligent alarm, and at the same time the cause of the fault is located, namely abnormal frequency reduction of the processor.
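The frequency-reduction rule described above can be sketched as a check for a sustained run of low readings under heavy load. The sketch uses only the floating-point rate; the sample layout, the `load_floor` parameter, and any threshold values a caller passes in are assumptions, since the patent leaves the preset threshold and time period unspecified.

```python
def detect_throttling(samples, threshold, min_duration, load_floor=0.9):
    """Report abnormal frequency reduction from a time-ordered sample list.

    samples: (timestamp_seconds, cpu_utilization, floating_point_rate) tuples.
    Returns True if the node stays heavily loaded (utilization >= load_floor)
    while the floating-point rate stays below `threshold` for at least
    `min_duration` consecutive seconds.
    """
    run_start = None
    for ts, util, fp_rate in samples:
        if util >= load_floor and fp_rate < threshold:
            if run_start is None:
                run_start = ts  # a suspicious run begins here
            if ts - run_start >= min_duration:
                return True
        else:
            run_start = None  # any normal reading resets the run
    return False
```

Requiring heavy load alongside the low rate avoids flagging an idle node, whose floating-point rate is naturally low.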

Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the present invention, those skilled in the art can make various corresponding changes and modifications according to the present invention, but all such changes and modifications shall fall within the scope of protection of the claims of the present invention.

Claims

1. A system for intelligently monitoring the compute nodes of a large-scale data center cluster, including monitor nodes installed on the compute nodes of the data center cluster, a monitoring device that communicates with each monitor node, and a user terminal device, characterized in that:

The monitor node is configured to collect hardware micro-architecture data indicators of the compute node by obtaining control of a hardware control register of the compute node, to obtain data indicators related to the processes of the application programs running on the compute node by obtaining control of the operating system kernel, and to send the data indicators to the monitoring device;

The monitoring device is configured to receive the data indicators, perform big data analysis based on the data indicators, and send the result of the analysis to the user terminal device; the user terminal device is configured to receive the result and display it to the user;

The data indicators related to the processes of the application programs running on the compute node include one or more of the following: the number of process switches, stack information, and heap memory allocation;

The monitoring device is further configured to process the data indicators stored in the database according to preset statistics settings and, following a preset classified display scheme, to provide data statistics and analysis results to each type of user separately;

The monitoring device also has a user interface, specifically configured to receive a custom data mining algorithm and to perform data mining according to the data mining algorithm.

2. The system as claimed in claim 1, characterized in that the analysis includes: locating the faulty compute node according to the data indicators, and determining the cause of the fault.

3. The system as claimed in claim 1 or 2, characterized in that: the hardware micro-architecture data indicators include one or more of the following: the real-time floating-point operation speed of the CPU, Streaming SIMD Extensions (SSE) unit utilization, Advanced Vector Extensions (AVX) unit utilization, the vectorization ratio of vector instructions, the number of clock cycles needed to complete each instruction (CPI), the last-level cache (LLC) hit rate, memory bandwidth, PCI Express (PCI-E) device bandwidth, and the cache hit/miss rate.

4. The system as claimed in claim 3, characterized in that: the data indicator is the real-time floating-point operation speed of the CPU and/or the number of clock cycles needed to complete each instruction (CPI), and the analysis includes: when the data indicator stays below a preset threshold throughout a preset time period, determining that the processor has failed, and determining that the cause of the fault is abnormal frequency reduction of the processor.

5. The system as claimed in claim 1, characterized in that: the monitor node also collects the CPU utilization, memory usage, local disk I/O data, and/or Ethernet throughput provided by the operating system.

6. The system as claimed in claim 1, characterized in that: the hardware control register of the compute node is the MSR control register in the performance monitoring unit (PMU) of the compute node's processor.

7. A method for intelligently monitoring the compute nodes of a large-scale data center cluster, characterized in that:

A monitor node on a compute node is started;

The monitor node collects hardware micro-architecture data indicators of the compute node by obtaining control of a hardware control register of the compute node, obtains data indicators related to the processes of the application programs running on the compute node by obtaining control of the operating system kernel, and sends the data indicators to a monitoring device;

The monitoring device receives the data indicators, performs big data analysis based on the data indicators, and sends the result of the analysis to a user terminal device;

The user terminal device receives the result and displays it to the user;

The monitoring device processes the data indicators stored in the database according to preset statistics settings and, following a preset classified display scheme, provides data statistics and analysis results to each type of user separately;

The monitoring device also has a user interface, receives a custom data mining algorithm, and performs data mining according to the data mining algorithm.

8. The method as claimed in claim 7, characterized in that the analysis includes: locating the faulty compute node according to the data indicators, and determining the cause of the fault.

9. The method as claimed in claim 7 or 8, characterized in that: the hardware micro-architecture data indicators include one or more of the following: the real-time floating-point operation speed of the CPU, Streaming SIMD Extensions (SSE) unit utilization, Advanced Vector Extensions (AVX) unit utilization, the vectorization ratio of vector instructions, the number of clock cycles needed to complete each instruction (CPI), the last-level cache (LLC) hit rate, memory bandwidth, PCI Express (PCI-E) device bandwidth, and the cache hit/miss rate.

10. The method as claimed in claim 9, characterized in that: the data indicator is the real-time floating-point operation speed of the CPU and/or the number of clock cycles needed to complete each instruction (CPI), and the analysis includes: when the data indicator stays below a preset threshold throughout a preset time period, determining that the processor has failed, and determining that the cause of the fault is abnormal frequency reduction of the processor.

11. The method as claimed in claim 10, characterized in that: the monitor node also collects the CPU utilization, memory usage, local disk I/O data, and/or Ethernet throughput provided by the operating system.

12. The method as claimed in claim 11, characterized in that: the hardware control register of the compute node is the MSR control register in the performance monitoring unit (PMU) of the compute node's processor.