CN106487597A

CN106487597A - Service monitoring system and method based on Zookeeper

Info

Publication number: CN106487597A
Application number: CN201610952315.5A
Authority: CN
Inventors: 邹炜
Original assignee: Nubia Technology Co Ltd
Current assignee: Nubia Technology Co Ltd
Priority date: 2016-10-26
Filing date: 2016-10-26
Publication date: 2017-03-08

Abstract

The invention discloses a service monitoring system based on Zookeeper, comprising: a monitoring data collection center, whose cluster service is built using Zookeeper, which collects performance indicator data of a business service and obtains operating information of the physical server running the business service; a real-time risk calculation center, which determines in real time, based on the performance indicator data and the operating information, whether the business service is at risk; and an alarm platform, which, if the business service is at risk, issues an alarm signal after receiving the alarm information sent by the real-time risk calculation center.

Description

Service monitoring system and method based on Zookeeper

Technical field

The present invention relates to the technical field of business services, and in particular to a service monitoring system and method based on Zookeeper.

Background

With the rapid development of the Internet, popular Internet products of all kinds keep emerging and attract many users to try them. However, most products lack staying power and cannot sustain their popularity for long. There are two reasons: first, the product lacks follow-up innovation, which causes user churn; second, the product's service lacks stability, and frequent service stalls or prolonged downtime give users a very unfriendly experience, which weakens user stickiness and leads to a large loss of users.

Therefore, in the Internet industry, for a product to maintain a long life cycle, besides continual creative innovation, a stable and available service is the most basic hard requirement.

Stable availability seems simple but is in fact rather complex. While a business service product is running online, faults that are hard to characterize are often encountered, some of them quite thorny. Without relevant experience, they may well be impossible to eliminate in a short time, leaving the business service unavailable for a long time, which runs counter to the original goal of a stable and available business service.

Summary of the invention

The main objective of the present invention is to provide a service monitoring system and method based on Zookeeper, aiming to solve the problem that faults occurring while a business service is running online cannot be discovered in time.

To achieve the above objective, the present invention provides a service monitoring system based on Zookeeper, comprising: a monitoring data collection center, whose cluster service is built using Zookeeper, which collects performance indicator data of a business service and obtains operating information of the physical server running the business service; a real-time risk calculation center, which determines in real time, based on the performance indicator data and the operating information, whether the business service is at risk; and an alarm platform, which, if the business service is at risk, issues an alarm signal after receiving the alarm information sent by the real-time risk calculation center.

Further, the system also comprises an idle fault analysis center and a fault-handling memo platform. The fault-handling memo platform includes a fault case database. When the business service is at risk, the idle fault analysis center determines the factor that caused the business service to fail, obtains the corresponding prepared fault case from the fault case database according to the factor, and pushes it to the alarm platform so that the alarm platform displays the prepared fault case.

Further, the system also comprises a management platform and a configuration management database, wherein the management platform configures parameters for the business service and saves the parameters to the configuration management database.

Further, the monitoring data collection center reads the parameters from the configuration management database and persists them to Zookeeper, so that distributed task scheduling of the monitoring data collection center is implemented by means of Zookeeper.

Further, once the fault has been resolved, the fault-handling memo platform records the fault solution used to resolve the fault and stores it in the fault case database as a prepared fault case.

In addition, to achieve the above objective, the present invention also proposes a service monitoring method based on Zookeeper, comprising the following steps: building the cluster service of a monitoring data collection center using Zookeeper; collecting performance indicator data of a business service while obtaining operating information of the physical server running the business service; determining in real time, based on the performance indicator data and the operating information, whether the business service is at risk; and, if the business service is at risk, issuing an alarm signal.

Further, after the alarm signal is issued because the business service is at risk, the method also comprises: determining the factor that caused the business service to fail; and obtaining the corresponding prepared fault case from the fault-handling memo platform according to the factor and pushing it to the alarm platform so that the alarm platform displays the prepared fault case.

Further, before the cluster service of the monitoring data collection center is built using Zookeeper, the method also comprises: configuring parameters for the business service; and saving the parameters.

Further, building the cluster service of the monitoring data collection center using Zookeeper specifically comprises: reading the above parameters; and persisting the parameters to Zookeeper, so that distributed task scheduling of the monitoring data collection center cluster service is implemented by means of Zookeeper.

Further, after the corresponding prepared fault case is obtained from the fault-handling memo platform according to the factor and pushed to the alarm platform, the method also comprises: once the fault has been resolved, recording the fault solution used to resolve the fault and storing it in the fault case database as a prepared fault case.

The service monitoring system based on Zookeeper proposed by the present invention collects and analyzes monitoring data while the business service is running in order to provide real-time early warning of potential risks to the business service, and in particular provides timely assistance in troubleshooting when a fault occurs. It identifies the uncertain factors of online business service products as far as possible, intelligently assists troubleshooting, shortens the fault resolution cycle, ensures the robustness of business service products, and improves their stable availability.

Brief description of the drawings

Fig. 1 is a block diagram of a preferred embodiment of the service monitoring system of the present invention;

Fig. 2 is a flow chart of a preferred embodiment of the service monitoring method of the present invention;

Fig. 3 is a flow chart of another preferred embodiment of the service monitoring method of the present invention.

The realization of the objective, the functional characteristics and the advantages of the present invention will be further described in conjunction with the embodiments and with reference to the drawings.

Detailed description

It should be understood that the specific embodiments described herein are only intended to explain the present invention and are not intended to limit it.

The service monitoring system and method of the embodiments of the present invention are now described with reference to the drawings. In the following description, suffixes such as "center" or "platform" used to denote components are used only to facilitate the description of the invention and have no specific meaning in themselves. Therefore, "center" and "platform" may be used interchangeably.

Zookeeper is a distributed, open-source coordination service for distributed applications. It is an open-source implementation of Google's Chubby and an important component of Hadoop and HBase. It is used to solve data management problems that distributed applications frequently encounter, such as unified naming services, state synchronization services, cluster services, and management of distributed application configuration parameters.

The cluster service refers to the set of components on each node that perform cluster operations, while cluster resources are the hardware and software components in the cluster that are managed by the cluster service.

As shown in Fig. 1, the first embodiment of the present invention provides a service monitoring system based on Zookeeper, in which the cluster service of the monitoring data collection center 10 is built using Zookeeper, thereby achieving distributed monitoring by the monitoring data collection center 10. The system includes the monitoring data collection center 10, a real-time risk calculation center 20 and an alarm platform 30. Specifically, the monitoring data collection center 10 collects performance indicator data of the business service while obtaining operating information of the physical server running the business service. The functions of the monitoring data collection center 10 mainly comprise active pull and passive push. In active pull, the server task monitoring cluster included in the monitoring data collection center 10 periodically collects performance indicator data from the monitored business service and stores that data in the monitoring data warehouse 70, for example an HBase distributed data storage warehouse. In passive push, a corresponding Agent probe is installed on the physical server on which the monitored business service runs, and the Agent probe periodically collects operating information of the host physical machine, such as hardware health status, network conditions and log data. Meanwhile, the monitoring data collection center 10 actively calls a Collector service, which persists the collected performance indicator data and operating information to the HBase data warehouse.
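The two collection modes described above can be sketched roughly as follows. This is only an illustrative sketch, not the implementation disclosed in the patent: the class names `MonitoringStore` and `CollectionCenter`, the metric names, and the in-memory list standing in for the HBase data warehouse are all assumptions made for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class MonitoringStore:
    """In-memory stand-in (assumption) for the monitoring data warehouse."""
    rows: List[dict] = field(default_factory=list)

    def persist(self, kind: str, source: str, data: Metrics) -> None:
        # Each row records what was measured, where it came from, and when.
        self.rows.append(
            {"kind": kind, "source": source, "ts": time.time(), "data": dict(data)}
        )


class CollectionCenter:
    """Hypothetical collection center supporting active pull and passive push."""

    def __init__(self, store: MonitoringStore) -> None:
        self.store = store
        self.services: Dict[str, Callable[[], Metrics]] = {}

    def register_service(self, name: str, fetch_metrics: Callable[[], Metrics]) -> None:
        # fetch_metrics is whatever call returns the service's current indicators.
        self.services[name] = fetch_metrics

    def pull_once(self) -> None:
        # Active pull: query every monitored business service once.
        # A scheduler would call this periodically.
        for name, fetch_metrics in self.services.items():
            self.store.persist("performance", name, fetch_metrics())

    def receive_push(self, host: str, info: Metrics) -> None:
        # Passive push: an agent probe on a physical server reports host information.
        self.store.persist("operation", host, info)
```

A caller would register one fetch function per monitored service, invoke `pull_once` on a timer, and expose `receive_push` to the agent probes; the transport between probe and center is left open here.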

As shown in Fig. 1, the real-time risk calculation center 20 determines in real time, based on the above performance indicator data and operating information, whether the business service is at risk. The main responsibility of the real-time risk calculation center 20 is to analyze in real time the performance indicator data and operating information stored in the HBase distributed data storage warehouse, based on preset indicator thresholds, industry safety specifications and past fault cases (see the prepared fault cases in the fault case database 600), and to predict whether the monitored business service is currently at risk or has a performance bottleneck. If so, it notifies the alarm platform to raise an alarm and inform the person responsible for the project, prompting the timely elimination of hidden dangers in the business service and the removal of the performance bottleneck. Once the performance indicator data and operating information of the business service have been persisted to the HBase distributed data storage warehouse, the real-time risk calculation center 20 starts the risk assessment and performance bottleneck prediction for the business service; when the early-warning conditions are met, it notifies the alarm platform to raise an alarm so that the responsible person can deal with the issue in time.

If the business service is at risk, the alarm platform 30 issues an alarm signal after receiving the alarm information sent by the real-time risk calculation center 20. The main responsibility of the alarm platform 30 is to issue an alarm signal to inform the relevant responsible person. For example, it may sound an audible alarm to alert the operator on duty, or it may send an alarm signal containing the fault information to the responsible person by email or text message, thereby reducing the likelihood that a fault occurs.

As shown in Fig. 1, the service monitoring system further includes an idle fault analysis center 40 and a fault-handling memo platform 60, and the fault-handling memo platform 60 includes a fault case database 600. When the business service is at risk, the idle fault analysis center 40 performs intelligent fault prediction: it compares the situation with similar past cases, determines the factor that caused the business service to fail, obtains the corresponding prepared fault case from the fault case database 600 according to that factor, and pushes it to the alarm platform 30 so that the alarm platform 30 displays the prepared fault case. This helps the responsible person quickly resolve the business service fault and speeds up recovery of the business service. Once the fault has been resolved, the fault-handling memo platform 60 can also record the fault solution used to resolve the fault and store it in the fault case database 600 as a prepared fault case.

Thus, the monitoring data collection center 10 collects the monitoring data of the business service (its performance indicator data and operating information), the real-time risk calculation center 20 uses Spark technology, and the idle fault analysis center 40 uses Hadoop technology to perform big-data statistical analysis. Together they provide timely early warning of potential risks to the business service and assisted troubleshooting when a fault occurs, reduce the uncertain factors of online business service products as far as possible, and shorten the resolution cycle, thereby ensuring the robustness of business service products.

In addition, on the basis of the above preferred embodiment, the service monitoring system further includes a management platform 50 and a configuration management database 80, wherein the management platform 50 configures parameters for the business service and saves the parameters to the configuration management database 80. The monitoring data collection center 10 reads the parameters from the configuration management database 80 and persists them to Zookeeper, so that distributed task scheduling of the monitoring data collection center 10 is implemented by means of Zookeeper. The management platform 50 is mainly responsible for two functions: configuration management and viewing of project service monitoring data. Configuration management is further divided into three parts: conventional RBAC configuration management, monitoring project parameter configuration management, and distributed task scheduling parameter configuration management. The basic information of the target monitored service and the various indicator thresholds are configured in the Admin system.
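As a rough illustration of persisting configured parameters to Zookeeper and using them to divide monitoring tasks among collector nodes, consider the sketch below. It is written against a minimal client interface with `ensure_path`, `set` and `get` methods (the kazoo `KazooClient` class offers methods with these names), but it runs here against an in-memory stand-in so that no Zookeeper server is needed. The znode layout under `/monitor/config`, the JSON encoding, and the round-robin assignment are assumptions of this example; the patent does not specify how tasks are scheduled.

```python
import json
from typing import Dict, Iterable, List

CONFIG_ROOT = "/monitor/config"  # hypothetical znode layout


class InMemoryZk:
    """Tiny stand-in for a Zookeeper client, used so the sketch runs offline."""

    def __init__(self) -> None:
        self.nodes: Dict[str, bytes] = {}

    def ensure_path(self, path: str) -> None:
        self.nodes.setdefault(path, b"")

    def set(self, path: str, value: bytes) -> None:
        if path not in self.nodes:
            raise KeyError(path)
        self.nodes[path] = value

    def get(self, path: str):
        # Mirrors the (data, stat) tuple shape; stat is omitted here.
        return self.nodes[path], None


def persist_parameters(zk, service: str, params: dict) -> str:
    """Write one service's monitoring parameters (e.g. IP and port) to a znode."""
    path = f"{CONFIG_ROOT}/{service}"
    zk.ensure_path(path)
    zk.set(path, json.dumps(params, sort_keys=True).encode("utf-8"))
    return path


def load_parameters(zk, service: str) -> dict:
    data, _stat = zk.get(f"{CONFIG_ROOT}/{service}")
    return json.loads(data.decode("utf-8"))


def assign_tasks(workers: Iterable[str], services: Iterable[str]) -> Dict[str, List[str]]:
    """Round-robin assignment (assumption): every worker that sees the same
    worker and service lists computes the same, disjoint partition."""
    ordered = sorted(workers)
    if not ordered:
        raise ValueError("at least one worker is required")
    plan: Dict[str, List[str]] = {w: [] for w in ordered}
    for i, service in enumerate(sorted(services)):
        plan[ordered[i % len(ordered)]].append(service)
    return plan
```

In a real deployment the worker list would typically come from the coordination service itself, for example from the children of a membership znode, so that the plan is recomputed when a collector node joins or leaves.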

In addition, to achieve the above objective, as shown in Fig. 2, the present invention also provides a service monitoring method based on Zookeeper, comprising the following steps. Step S2: build the cluster service of the monitoring data collection center 10 using Zookeeper. Step S10: collect performance indicator data of the business service while obtaining operating information of the physical server running the business service. Step S20: determine in real time, based on the performance indicator data and the operating information, whether the business service is at risk. Step S30: if the business service is at risk, issue an alarm signal. Step S20 specifically includes step S200: the real-time risk calculation center analyzes the above performance indicator data and operating information in real time according to preset thresholds, industry standards and past fault cases, where the past fault cases are the prepared fault cases stored in the fault case database. Step S201 is then executed: whether a risk exists is judged from the analysis result of step S200. If a risk exists, step S30 is executed. If no risk exists, the current analysis ends, and the process waits for the monitoring data collection center to collect the performance indicator data and operating information of the business service the next time, so that the analysis can be repeated.

In addition, as shown in Fig. 2 step S2：Using Zookeeper build monitoring data collect center 10 cluster service it Before, the service monitoring method, also include：Step S1：Parameter configuration is carried out to business service in management platform, parameter include IP and Port etc., then preserves parameter to configuration management database 80 (shown in Figure 1).Further, built using Zookeeper Monitoring data collects the cluster service at center 10, specifically includes：Read above-mentioned parameter；By in parameter persistence to Zookeeper, Thus the distributed task dispatching that monitoring data collects center cluster service is realized using Zookeeper.

The real-time risk calculation center 20 assesses from three aspects whether the business service is at risk or has a performance bottleneck: (1) preset performance indicator thresholds, (2) industry standard specifications, and (3) past fault cases. If any one of these aspects is met, the responsible person is informed through the alarm platform 30 (see step S30). It can thus be seen that risk early warning for the business service runs from the parameter configuration of the business service, through the collection of monitoring data and the real-time risk assessment calculation, to the final risk alarm.
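The "any one of three aspects" decision can be sketched as below. This is a simplified illustration under stated assumptions: thresholds are treated as upper limits, industry specifications and past fault cases are each modeled as named predicate functions over the metrics, and all names (`Finding`, `assess_risk`, the metric keys) are hypothetical. The patent mentions Spark for the real computation; plain Python is used here only to show the decision logic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]
Rule = Callable[[Metrics], bool]  # returns True when the rule is violated


@dataclass
class Finding:
    aspect: str  # "threshold", "industry_standard" or "past_fault_case"
    detail: str


def check_thresholds(metrics: Metrics, thresholds: Metrics) -> List[Finding]:
    # Aspect 1: a metric above its preset limit counts as a finding.
    return [
        Finding("threshold", f"{name}={metrics[name]} exceeds {limit}")
        for name, limit in thresholds.items()
        if name in metrics and metrics[name] > limit
    ]


def check_rules(metrics: Metrics, rules: Dict[str, Rule], aspect: str) -> List[Finding]:
    # Aspects 2 and 3: each named rule inspects the metrics and flags a violation.
    return [Finding(aspect, name) for name, violated in rules.items() if violated(metrics)]


def assess_risk(
    metrics: Metrics,
    thresholds: Metrics,
    industry_rules: Dict[str, Rule],
    past_case_rules: Dict[str, Rule],
) -> List[Finding]:
    """Return all findings; a non-empty list means the service is at risk."""
    return (
        check_thresholds(metrics, thresholds)
        + check_rules(metrics, industry_rules, "industry_standard")
        + check_rules(metrics, past_case_rules, "past_fault_case")
    )
```

Returning the individual findings rather than a bare boolean lets the alarm carry the reason for the warning, which is useful when the alarm is forwarded to the responsible person.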

Referring to Fig. 2 and Fig. 3, after step S30 (issuing an alarm signal if the business service is at risk), the service monitoring method further comprises step S40: determining the factor that caused the business service to fail. Specifically, the fault is located according to its symptoms, the performance indicator data and so on, and the factor causing the fault is determined; the corresponding prepared fault case (a similar fault case and its solution) is then obtained from the fault-handling memo platform according to the factor and pushed to the alarm platform so that the alarm platform displays the prepared fault case, where the similarity of the fault is calculated against the fault conditions of past fault cases. Step S40 specifically comprises the following. Step S400: the idle fault analysis center 40 checks whether a similar case exists in the fault-handling memo platform 60 (that is, whether the factor data of the current fault matches the factor indicators of a past fault case). If one exists, step S401 is executed: the idle fault analysis center 40 intelligently recommends a link to the solution to the alarm platform 30 to help the responsible person resolve the fault of the business service. Specifically, the idle fault analysis center 40 pushes the link to the responsible person by email, so that the responsible person can view it on the fault-handling memo platform 60 and quickly resolve the risk that was flagged. Step S402 is then executed: it is judged whether the solution intelligently recommended by the idle fault analysis center 40 can resolve the fault encountered. If it can, step S50 is executed; the specific flow of step S50 is described below. If the idle fault analysis center 40 cannot intelligently recommend a solution, or the recommended solution cannot resolve the fault encountered, step S403 is executed: the idle fault analysis center 40 searches the fault-handling memo platform 60 for similar fault cases and solutions. Step S404 is then executed: it is checked whether the preceding step found a similar fault case and solution. If so, step S405 is executed: the fault of the business service is resolved with reference to the similar fault case and solution, the business service is restored, and step S50 is then executed. If no similar fault case and solution can be found, step S50 is still executed.
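The case lookup in steps S400 and S401 can be pictured with the following sketch. The patent does not state how fault similarity is computed, so this example assumes that each fault is described by a set of factor tags and uses Jaccard similarity with a cut-off; `FaultCase`, `recommend_case`, the tag names and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class FaultCase:
    case_id: str
    factors: FrozenSet[str]  # factor tags recorded for a past fault
    solution: str


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def recommend_case(
    current_factors: Iterable[str],
    cases: Iterable[FaultCase],
    min_similarity: float = 0.5,
) -> Optional[FaultCase]:
    """Return the most similar past case at or above the cut-off, else None.

    None corresponds to the branch where no similar case exists and a manual
    search of the memo platform is needed instead.
    """
    current = set(current_factors)
    best: Optional[FaultCase] = None
    best_score = 0.0
    for case in cases:
        score = jaccard(current, case.factors)
        if score >= min_similarity and score > best_score:
            best, best_score = case, score
    return best
```

A set-based measure is only one option; weighted factors or numeric distances over the performance indicators would fit the same interface.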

Further, after step S40 (obtaining the corresponding prepared fault case from the fault-handling memo platform according to the factor and pushing it to the alarm platform), the service monitoring method further comprises step S50 (not shown): once the fault has been resolved, the fault solution used to resolve the fault is recorded and stored in the fault case database as a prepared fault case. Step S50 specifically comprises the following. Step S500: the responsible person writes up the resolution process for this fault on the fault-handling memo platform 60 and submits it to the platform for review. Step S501 is then executed: the write-up is evaluated by dedicated staff and archived in the fault case database as a prepared fault case, to be consulted during subsequent similar risk warnings or troubleshooting. It can thus be seen that the troubleshooting flow comprises: fault prediction by the idle fault analysis center, recommendation of a prepared fault case, resolution of the business service fault, and finally recording of the business service fault and its solution.
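The submit, review and archive sequence of steps S500 and S501 amounts to a small workflow, sketched below under simple assumptions: reports wait in a pending queue until a reviewer decides, and only approved reports enter the case database. The names `FaultReport` and `FaultMemoPlatform` and the in-memory containers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FaultReport:
    report_id: str
    factors: List[str]
    solution: str


@dataclass
class FaultMemoPlatform:
    """Hypothetical memo platform: pending reports plus an archived case list."""
    pending: Dict[str, FaultReport] = field(default_factory=dict)
    case_database: List[FaultReport] = field(default_factory=list)

    def submit(self, report: FaultReport) -> None:
        # The responsible person submits the write-up for review.
        self.pending[report.report_id] = report

    def review(self, report_id: str, approved: bool) -> bool:
        # The reviewer's decision; raises KeyError for an unknown report.
        report = self.pending.pop(report_id)
        if approved:
            # Approved reports become prepared fault cases for later lookups.
            self.case_database.append(report)
        return approved
```

Keeping unreviewed reports out of the case database means later recommendations draw only on solutions that someone has checked.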

As can be seen from the above technical analysis, the idle fault analysis center 40 takes the indicator data of the target service at the time of the fault, compares it with past fault cases, predicts the fault and locates its cause, then retrieves the stored prepared fault cases from the fault-handling memo platform 60 according to fault similarity and informs the responsible person, which speeds up fault resolution. After the fault has been eliminated, the responsible person must record in detail the reasoning and process used to resolve it and submit the record to the fault-handling memo platform 60, where it is evaluated by dedicated staff. If it meets the requirements, it passes review and is persisted to the fault case database 600 for use in subsequent early warning or fault resolution for business services. Repeating this process forms a virtuous cycle that shortens the fault recovery cycle and improves both the efficiency of fault resolution and the availability of the business service.

It should be noted that, herein, the terms "comprise", "include" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or also includes elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device that comprises the element.

The above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.

From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and includes a number of instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device or the like) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

1. A service monitoring system based on Zookeeper, characterized by comprising:

a monitoring data collection center, whose cluster service is built using Zookeeper, which collects performance indicator data of a business service while obtaining operating information of the physical server running the business service;

a real-time risk calculation center, which determines in real time, based on the performance indicator data and the operating information, whether the business service is at risk; and

an alarm platform, which, if the business service is at risk, issues an alarm signal after receiving the alarm information sent by the real-time risk calculation center.

2. The service monitoring system according to claim 1, characterized in that the system further comprises an idle fault analysis center and a fault-handling memo platform, the fault-handling memo platform includes a fault case database, and the idle fault analysis center, when the business service is at risk, determines the factor that caused the business service to fail, obtains the corresponding prepared fault case from the fault case database according to the factor, and pushes it to the alarm platform so that the alarm platform displays the prepared fault case.

3. The service monitoring system according to claim 1 or 2, characterized in that the system further comprises a management platform and a configuration management database, wherein the management platform configures parameters for the business service and saves the parameters to the configuration management database.

4. The service monitoring system according to claim 3, characterized in that the monitoring data collection center reads the parameters from the configuration management database and persists the parameters to the Zookeeper, so that distributed task scheduling of the monitoring data collection center is implemented by means of the Zookeeper.

5. The service monitoring system according to claim 2, characterized in that the fault-handling memo platform, once the fault has been resolved, records the fault solution used to resolve the fault and stores it in the fault case database as a prepared fault case.

6. A service monitoring method based on Zookeeper, characterized in that the method comprises the following steps:

building the cluster service of a monitoring data collection center using Zookeeper;

collecting performance indicator data of a business service while obtaining operating information of the physical server running the business service;

determining in real time, based on the performance indicator data and the operating information, whether the business service is at risk; and

if the business service is at risk, issuing an alarm signal.

7. The service monitoring method according to claim 6, characterized in that, after the alarm signal is issued if the business service is at risk, the method further comprises:

determining the factor that caused the business service to fail; and

obtaining the corresponding prepared fault case from the fault-handling memo platform according to the factor and pushing it to the alarm platform, so that the alarm platform displays the prepared fault case.

8. The service monitoring method according to claim 6 or 7, characterized in that, before the cluster service of the monitoring data collection center is built using Zookeeper, the method further comprises:

configuring parameters for the business service; and

saving the parameters.

9. The service monitoring method according to claim 8, characterized in that building the cluster service of the monitoring data collection center using Zookeeper specifically comprises:

reading the above parameters; and persisting the parameters to Zookeeper, so that distributed task scheduling of the monitoring data collection center cluster service is implemented by means of Zookeeper.

10. The service monitoring method according to claim 7, characterized in that, after the corresponding prepared fault case is obtained from the fault-handling memo platform according to the factor and pushed to the alarm platform, the method further comprises:

once the fault has been resolved, recording the fault solution used to resolve the fault and storing it in the fault case database as a prepared fault case.