CN107181608A

CN107181608A - Method and operation and maintenance management system for service recovery and performance improvement

Info

Publication number: CN107181608A
Application number: CN201610140348.XA
Authority: CN
Inventors: 姚文辉; 刘俊峰; 黄硕; 朱家稷
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba China Co Ltd
Priority date: 2016-03-11
Filing date: 2016-03-11
Publication date: 2017-09-19
Anticipated expiration: 2036-03-11
Also published as: CN107181608B

Abstract

A method and an operation and maintenance (O&M) management system for service recovery and performance improvement. The O&M management system detects state changes of the nodes in a node cluster running a state synchronization protocol and determines the number NN of normal nodes, NN being an integer. If NN changes from greater than or equal to S0 to less than S0, emergency processing is performed to restore normal service. When the normal nodes include the primary node, the emergency processing includes: modifying the value of parameter S saved by the configuration center and the normal nodes to a positive integer less than or equal to NN. Parameter S represents the minimum number of successfully synchronized nodes required for the node cluster to provide normal service, and S0 is the value of parameter S determined according to the state synchronization protocol. The application can effectively solve the unavailability problem caused by hardware errors occurring on multiple nodes at the same time.

Description

Method and operation and maintenance management system for service recovery and performance improvement

Technical field

The present invention relates to distributed systems, and more particularly to a method and an O&M management system for service recovery and performance improvement in a distributed system.

Background

In current large-scale distributed storage systems, most systems adopt centralized metadata management in order to centralize permission authentication and quota control; that is, the metadata of all data in the storage system is stored centrally on a few nodes. In this architecture, the availability of the metadata nodes (also called metadata servers) directly determines the availability of the whole system, so many distributed systems improve the availability of the metadata service through redundancy.

Redundancy introduces multiple nodes, and a state synchronization protocol must be used between the nodes to ensure that the decision made at any time is correct and cannot be repudiated. In a distributed system, if the initial state of every node is the same and every node executes the same sequence of instructions, the nodes eventually reach a consistent state. To ensure that every node executes the same instruction sequence, a "consistency algorithm" must be run for each instruction so that every node sees the same instruction.

The Paxos protocol is regarded as one of the most widely used state synchronization protocols. The problem it solves is how a distributed system reaches agreement on a value (a resolution). For modification operations, the Paxos protocol assigns monotonically increasing numbers to all state modifications and makes each decision across multiple nodes: if a majority of nodes agree to accept the decision, the modification is persisted on those nodes. This design ensures that every resolution is agreed to by a majority of nodes, which guarantees its correctness. Conversely, if a minority of nodes could make a resolution, two resolutions could be produced under the same protocol number, which from the user's perspective appears as a wrong or inconsistent resolution. In addition, because the number and the content of every resolution are persisted, when recovering from an error the earlier resolutions are retained as long as the data on a majority of nodes is not lost, and any later resolution can proceed from a correct basis, so that data consistency and correctness are ensured at all times.

In a distributed storage system that uses multiple metadata nodes as backups, if the Paxos protocol is the theoretical basis for election and log backup, normal metadata service cannot be provided when only a minority of metadata nodes remain. In production systems, the machines hosting the metadata nodes have essentially the same hardware configuration; for example, they all use solid state drives (SSDs) from the same manufacturer with similar erase/write lifetimes, which increases the probability that several machines fail at the same time. Once the disks of more than half of the machines enter read-only mode, the service stops. When a majority of metadata nodes are down, if the primary node is still available, the service of reading metadata can still be provided externally, but no operation that modifies metadata can succeed.

When a simplified form of the Paxos protocol is used in a distributed storage system, the metadata nodes hold an election through the Paxos protocol to produce a primary node (Primary), which provides the metadata service; the other nodes act as slave nodes (Slave) and only receive log synchronization from the primary node. The log produced by the primary node is sent to all slave nodes. If a slave node agrees to accept the log synchronization, the primary node receives its acknowledgment. When a majority of nodes (including the primary node) have synchronized successfully, the primary node returns success to the client (Client) that sent the service request; otherwise the client's request is suspended and the client receives a timeout message, which appears as a service stop. That is, when the Paxos protocol is used to provide redundancy for the metadata service, the whole service stops if a majority of metadata nodes stop serving, even though normal nodes still exist. In addition, if the performance of at least half of the nodes degrades, the performance of the whole service degrades as well: because a client operation completes only after a majority of nodes acknowledge the log synchronization, operation performance depends on the slowest node within that majority.
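The dependence on the slowest node within the majority can be illustrated with a minimal Python sketch. It is not part of the patent. It assumes that acknowledgments arrive in parallel and that the list of latencies includes the primary node's own local write; the function name and the millisecond figures are hypothetical.

```python
def majority_commit_latency(ack_latencies_ms):
    """Return the time at which a majority of nodes has acknowledged.

    ack_latencies_ms: one latency per node (primary included), assumed
    to be measured from the moment the primary starts the log write.
    """
    if not ack_latencies_ms:
        raise ValueError("at least one node is required")
    majority = len(ack_latencies_ms) // 2 + 1  # floor(T/2) + 1
    # The operation completes when the majority-th fastest node answers.
    return sorted(ack_latencies_ms)[majority - 1]


# Five nodes, three of them slow: the third-fastest node is a slow one.
slow_cluster = [1, 2, 50, 60, 70]
# Five nodes, two of them slow: a fast majority still exists.
healthy_cluster = [1, 2, 3, 60, 70]
print(majority_commit_latency(slow_cluster))
print(majority_commit_latency(healthy_cluster))
```

With three slow nodes out of five, every majority contains at least one slow node, so the commit time is bounded below by a slow node's latency.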

A similar situation exists for node clusters running other state synchronization protocols.

Summary of the invention

In view of this, the present invention provides the following schemes.

A method for recovering service, applied to an O&M management system, including:

detecting state changes of the nodes in a node cluster running a state synchronization protocol, and determining the number NN of normal nodes, NN being an integer;

if NN changes from greater than or equal to S0 to less than S0, performing emergency processing to restore normal service, wherein, when the normal nodes include the primary node, the emergency processing includes: modifying the value of parameter S saved by the configuration center and the normal nodes to a positive integer less than or equal to NN;

wherein parameter S represents the minimum number of successfully synchronized nodes required for the node cluster to provide normal service, and S0 is the value of parameter S determined according to the state synchronization protocol.

An O&M management system, including a state detection module, a control module and an emergency processing module, wherein:

the state detection module is configured to detect state changes of the nodes in a node cluster running a state synchronization protocol, determine the number NN of normal nodes and notify the control module, NN being an integer;

the control module is configured to call the emergency processing module to perform emergency processing, so as to restore normal service, after NN changes from greater than or equal to S0 to less than S0;

the emergency processing module is configured to perform the following emergency processing when the normal nodes include the primary node: modifying the value of parameter S saved by the configuration center and the normal nodes to a positive integer less than or equal to NN.

With the above scheme, when the number of normal nodes falls below the minimum number of successfully synchronized nodes required for normal service, service is restored immediately by modifying parameters, and service can be conveniently recovered after the fault is cleared, without data becoming inconsistent or lost in the process. This approach effectively solves the unavailability problem caused by hardware errors occurring on multiple nodes at the same time and reduces economic loss.

A method for performance improvement, including:

determining the low-performance nodes in a node cluster running a state synchronization protocol;

when the synchronization process of the node cluster can succeed only if at least one low-performance node synchronizes successfully, performing performance improvement processing so that the synchronization process can also succeed when no low-performance node synchronizes successfully.

An O&M management system, including a performance management module, wherein:

the performance management module is configured to perform performance improvement processing on a node cluster running a state synchronization protocol, and includes:

a first processing unit, configured to modify the value of parameter S saved by the configuration center to T-SN;

a second processing unit, configured to modify the value of parameter S saved by the nodes in the node cluster to T-SN;

wherein T is the number of nodes in the node cluster, T >= 2, SN is the number of low-performance nodes in the node cluster, and parameter S represents the minimum number of successfully synchronized nodes required for the node cluster to provide normal service.

With the above scheme, when an increase in low-performance nodes slows synchronization, reasonable parameter configuration keeps the performance of the node cluster matched to that of the high-performance nodes.

Brief description of the drawings

Fig. 1 is a schematic diagram of the network architecture of Embodiment one of the present invention;

Fig. 2 is a flow chart of the method for recovering service of Embodiment one of the present invention;

Fig. 3 is a module diagram of the O&M management system of Embodiment one of the present invention;

Fig. 4 is a flow chart of the method for performance improvement of Embodiment two of the present invention;

Fig. 5 is a module diagram of the O&M management system of Embodiment two of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be noted that, provided there is no conflict, the embodiments in this application and the features in the embodiments may be combined with one another.

Embodiment one

This embodiment takes as an example a metadata node cluster running the Paxos protocol or a derivative protocol in a distributed storage system; for convenience of description, a metadata node in the embodiment is also simply called a node. The network architecture related to this embodiment is shown in Fig. 1 and includes a metadata node cluster, a configuration center, an O&M management system and a client. The metadata node cluster in the figure has 3 nodes as an example, and its metadata nodes are divided into a primary node and slave nodes.

Primary node: the node that provides read and write services externally; it converts each write operation into a modification log and synchronizes the log to all slave nodes.

Slave node: receives the log synchronized from the primary node and decides, according to the protocol conventions, whether the log can be accepted. If it can, the slave node returns acceptance success and applies the log in memory; otherwise it returns acceptance failure.

Configuration center: the storage system that persists node startup parameters. When each node starts, it obtains from the configuration center the configuration parameters to be used in the election stage and the service stage, keeps them in memory, and applies them in the corresponding stage.

O&M management system: also called the O&M tool. When necessary, it sends parameter modification requests to any node. On the machine hosting any node, it can extract logs from the logs that the node has recorded locally and synchronize them to other normal nodes. It can check the state of the nodes in the current cluster and determine the number of normal nodes and the number of abnormal nodes; judge whether emergency processing is needed to restore normal service; determine which emergency processing scheme to use; and so on. The O&M management system herein includes the various functional modules used to perform the corresponding operation, maintenance and management functions. Physically, it may be deployed on a metadata node or as a separate device, and it may be implemented with a single device or with multiple devices; the present invention places no limitation on this.

Client (Client): the process that initiates requests to read and write metadata. All such requests are sent to the current primary node; if a write operation is accepted, a modification log is produced on the primary node.

Using the O&M tool, the parameters in the memory of the currently normal metadata nodes are set to the new parameters configured in the second step.

The processing of the metadata node cluster is divided into an election stage and a service stage, wherein:

Election stage:

In this stage, multiple nodes take part in the election process and determine the primary node that provides service externally; the other nodes become slave nodes. In the Paxos protocol, completing an election requires two parameters, T and E. T is the number of nodes in the node cluster, T = FN + NN, where FN is the number of abnormal nodes among the T nodes and NN is the number of normal nodes among the T nodes. E represents the minimum number of normal nodes required for a successful election. The value of E determined according to the state synchronization protocol is denoted E0; in the Paxos protocol, E0 = floor(T/2) + 1, where floor() denotes rounding down. That is, a new primary node can be elected successfully only when a majority, namely more than half, of the nodes are normal.

Service stage:

After the primary node is elected, all read and write requests are handled by the primary node. A write operation produces a modification log, which the primary node synchronizes to all slave nodes over the network. The synchronization stage needs two parameters, T and S, to judge whether synchronization is complete. T has the meaning given above. S represents the minimum number of successfully synchronized nodes required for the node cluster to provide normal service. The value of S determined according to the state synchronization protocol is denoted S0; in the Paxos protocol, S0 = floor(T/2) + 1, for example S0 = 3 when T = 5 and S0 = 5 when T = 9. An operation that modifies metadata succeeds only if the number of nodes that successfully write the corresponding modification log is greater than or equal to S, and the primary node is counted in S. That is, besides writing the log locally, the primary node must successfully synchronize the modification log to S-1 slave nodes to complete the operation. If the operation cannot be completed, the client's request is suspended, which appears as a service stop.
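The quorum arithmetic of the service stage can be sketched in a few lines of Python. This is an illustrative sketch rather than code from the patent; the function names are hypothetical, and the count of acknowledging nodes is assumed to include the primary node's own local write.

```python
def protocol_quorum(total_nodes):
    """S0 (and E0) as determined by the Paxos protocol: floor(T/2) + 1."""
    if total_nodes < 1:
        raise ValueError("T must be a positive integer")
    return total_nodes // 2 + 1


def modification_succeeds(nodes_with_log, s):
    """A metadata modification succeeds once at least S nodes,
    the primary included, have written the modification log."""
    return nodes_with_log >= s


s0 = protocol_quorum(5)              # 3, as in the text
print(s0, protocol_quorum(9))        # the text gives 3 and 5
print(modification_succeeds(3, s0))  # primary + 2 slaves
print(modification_succeeds(2, s0))  # primary + 1 slave: request is suspended
```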

Obviously, the parameters herein that represent numbers of nodes are integers.

As described above, when the number of normal nodes is less than the value S0 required by the Paxos protocol, the abnormal nodes cannot successfully write the modification log, so the node cluster cannot provide normal service. For this reason, in this embodiment the O&M management system makes a decision according to the node states and, when normal service cannot be provided, restores service immediately by modifying parameters.

As shown in Fig. 2, the method for recovering service in this embodiment is applied to the O&M management system and includes:

Step 110: detect state changes of the nodes in a node cluster running a state synchronization protocol, and determine the number NN of normal nodes, NN being an integer;

In this embodiment, the node cluster is a metadata node cluster running the Paxos protocol or a derivative protocol in a distributed storage system, and the synchronization is log synchronization. However, the present invention is not limited to this and can also be used for other node clusters running a state synchronization protocol: because this embodiment restores service by modifying protocol parameters so that synchronization between nodes can succeed, differences in the kind of service the node cluster provides do not affect service recovery. The present invention can also be used for multiple nodes running other state synchronization protocols, such as an HA protocol.

Step 120: if NN changes from greater than or equal to S0 to less than S0, perform emergency processing to restore normal service; when the normal nodes include the primary node, the emergency processing includes: modifying the value of parameter S saved by the configuration center and the normal nodes to a positive integer less than or equal to NN;

wherein parameter S has the meaning given above.

When the primary node is a normal node, the primary node is still in the service state, so recovering service does not need to go through the election stage; it is only necessary to modify the log synchronization parameter S of the service stage. In this embodiment, the value of parameter S is modified to NN (the value after the change). Because there are NN normal nodes in the cluster, NN nodes can synchronize successfully, which meets the required minimum number of successfully synchronized nodes, so normal service can be restored. In another embodiment, the value of parameter S may instead be modified to a value less than NN. Normal service can also be restored in that case, and S need not be modified again if a few more normal nodes later become abnormal. When S is modified to NN as in this embodiment, however, the modification log is successfully synchronized to more nodes, so data safety is better.

In this embodiment, when the normal nodes include the primary node, the emergency processing further includes: modifying the value of parameter E saved by the configuration center and the normal nodes to T-NN'+1, where parameter E represents the minimum number of normal nodes required for the node cluster to elect successfully, and NN' is the positive integer less than or equal to NN to which parameter S is modified. Here, the value of E is modified to T-NN+1, that is, FN+1, so that a successful election requires the participation of at least one normal node, and the log saved by that normal node is complete. This prevents the original abnormal nodes, after recovering to normal, from electing a new primary node among themselves and causing log data to become inconsistent or lost. Ensuring log completeness by modifying E is a convenient approach that needs no additional interfaces, but it is not the only one. For example, the currently normal nodes could be recorded, and a later election could be required to include one of those nodes in order to succeed.
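The parameter values used when the normal nodes include the primary node can be computed as in the following Python sketch. It is an illustration only: the class and function names are hypothetical, S0 is assumed to be the Paxos value floor(T/2) + 1, and NN' defaults to NN as in this embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumParams:
    s: int  # minimum number of successfully synchronized nodes
    e: int  # minimum number of normal nodes for a successful election


def emergency_params_with_primary(t, nn, nn_prime=None):
    """Return S = NN' and E = T - NN' + 1 for the emergency state."""
    s0 = t // 2 + 1  # assumed Paxos value of S0
    if not 1 <= nn < s0:
        raise ValueError("emergency processing applies only when 1 <= NN < S0")
    if nn_prime is None:
        nn_prime = nn  # this embodiment sets S to NN itself
    if not 1 <= nn_prime <= nn:
        raise ValueError("NN' must be a positive integer not larger than NN")
    return QuorumParams(s=nn_prime, e=t - nn_prime + 1)


# T = 5 with 2 normal nodes (FN = 3): S becomes 2 and E becomes FN + 1 = 4.
params = emergency_params_with_primary(t=5, nn=2)
print(params)
# S + E = T + 1, so every election group overlaps every synchronization group.
print(params.s + params.e == 5 + 1)
```

Because S + E exceeds T, any E nodes that elect a new primary include at least one node that took part in each successful synchronization, which matches the reasoning in the paragraph above.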

The O&M management system may start the above flow when the metadata service needs emergency recovery (for example, on receiving an administrator instruction or when triggered by a corresponding policy).

This embodiment also provides an emergency processing scheme for the case where, after emergency processing is triggered, the normal nodes do not include the primary node. In this case, because no primary node exists, an election stage is needed to produce a primary node; the subsequent flow is the same as the emergency processing when the normal nodes include the primary node. Specifically, the emergency processing scheme when the normal nodes do not include the primary node includes:

First step: synchronize the local logs of at least FN-floor(T/2) abnormal nodes to one normal node and, after the synchronization succeeds, modify the value of parameter E saved by the configuration center and the normal nodes to NN;

When the primary node is abnormal, the logs saved by the normal nodes may be inconsistent, so the logs are synchronized first. After the local logs of at least FN-floor(T/2) abnormal nodes are synchronized to one normal node, that normal node together with the other normal nodes holds the logs of (FN-floor(T/2))+NN = T-floor(T/2) nodes, and these nodes together can determine the latest state before the abnormality occurred. Of course, the synchronization can succeed only if the machines hosting at least FN-floor(T/2) abnormal nodes can be logged in to. Although an abnormal node cannot complete log synchronization, it may still be possible to log in to its machine with the O&M tool and copy out its local log.
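The node count in the preceding paragraph can be checked with a short derivation. The first line uses only the definition T = FN + NN from the text. The second line is an added observation, under the assumption that every modification committed before the failure had been written to at least S0 = floor(T/2) + 1 nodes.

```latex
\begin{aligned}
\bigl(FN - \lfloor T/2 \rfloor\bigr) + NN
  &= (FN + NN) - \lfloor T/2 \rfloor
   = T - \lfloor T/2 \rfloor, \\
S_0 + \bigl(T - \lfloor T/2 \rfloor\bigr)
  &= \bigl(\lfloor T/2 \rfloor + 1\bigr) + \bigl(T - \lfloor T/2 \rfloor\bigr)
   = T + 1 > T .
\end{aligned}
```

Since the two group sizes add up to more than T, any S0 nodes that hold a committed modification share at least one node with the T - floor(T/2) nodes whose logs are collected. This is consistent with the statement that the collected logs together determine the latest state.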

In this step, the value of parameter E is modified to NN so that the NN normal nodes can successfully elect a new primary node to recover service.

Second step: after the NN normal nodes elect a new primary node, stop the service and perform the emergency processing described above for the case where the normal nodes include the primary node; normal service is restored after it completes.

After the new primary node is elected, the value of S can be modified according to the emergency processing in step 120 to restore normal service, and the value of E can be modified as well. To avoid producing new modification logs while parameters are being modified, the metadata service should be stopped at this point. A simple way is, before the new primary node is elected, to modify the value of parameter S saved by the configuration center and the normal nodes to a value greater than or equal to FN+1, so that synchronization cannot succeed and the service is stopped. Once the emergency processing is complete, that is, once S has been modified as in step 120, normal service is restored. Other ways of blocking service are also possible, for example adding steps that introduce extra configuration items.
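The two steps for the case without a primary node can be summarized as an ordered plan. The Python sketch below is illustrative and not taken from the patent: the step labels and function name are hypothetical, S0 is assumed to be floor(T/2) + 1, the blocking value of S is taken as exactly FN + 1, and placing the S change before the E change is an assumption (the text only requires it to happen before the new primary node is elected).

```python
def recovery_plan_without_primary(t, nn):
    """Return (step, value) pairs for recovery when no primary node is normal."""
    s0 = t // 2 + 1  # assumed Paxos value of S0
    if not 1 <= nn < s0:
        raise ValueError("plan applies only when 1 <= NN < S0")
    fn = t - nn  # number of abnormal nodes
    return [
        # First step: copy local logs from abnormal nodes to one normal node.
        ("min_abnormal_logs_to_sync", fn - t // 2),
        # Block writes: NN < FN + 1, so synchronization cannot succeed.
        ("set_S_to_block_service", fn + 1),
        # Allow the NN normal nodes to elect a new primary node.
        ("set_E_for_election", nn),
        # Second step: apply the with-primary emergency values.
        ("set_S_emergency", nn),
        ("set_E_emergency", t - nn + 1),
    ]


for step, value in recovery_plan_without_primary(t=5, nn=2):
    print(step, value)
```

For T = 5 and NN = 2 there are FN = 3 abnormal nodes, so at least one abnormal node's log must be copied, S is raised to 4 while the election runs, and the final emergency values are S = 2 and E = 4.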

After emergency processing, the node cluster works in an emergency state, in which a minority of metadata nodes can provide service externally. In the emergency state, node states (normal or abnormal) may change; for example, a normal node may become abnormal, or an abnormal node may become normal after its fault is cleared and it is restarted. This embodiment therefore also provides the following processing scheme for node state changes:

After NN has changed from greater than or equal to S0 to less than S0 and emergency processing has restored normal service, if the state of a node in the node cluster changes again, NN changes as well. The changed NN is then compared with S0:

If NN < S0, the emergency processing is performed again to restore normal service;

If NN >= S0, the values of parameter S and parameter E saved by the configuration center and the normal nodes are modified to S0 and E0, respectively;

where E0 is the value of parameter E determined according to the state synchronization protocol.

That is, if a majority of nodes are normal again, the values of S and E are modified to the values determined by the state synchronization protocol, which restores the initial working state and the redundancy provided by the Paxos protocol; in the Paxos protocol, S0 = E0 = floor(T/2) + 1. If a majority of nodes are still abnormal, emergency processing must be performed again to restore normal service.
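The comparison performed after each further state change can be sketched as a small decision function. This Python sketch is illustrative only; the function name and the returned action labels are hypothetical, and S0 = E0 = floor(T/2) + 1 is assumed as in the Paxos case.

```python
def decide_after_state_change(t, nn):
    """Decide what the O&M management system does when NN changes again
    while the cluster is in the emergency state."""
    s0 = e0 = t // 2 + 1  # assumed Paxos values
    if nn >= s0:
        # A majority is normal again: restore the protocol values.
        return {"action": "restore_protocol_values", "S": s0, "E": e0}
    # Still fewer than S0 normal nodes: run emergency processing again.
    return {"action": "redo_emergency_processing"}


print(decide_after_state_change(t=5, nn=3))
print(decide_after_state_change(t=5, nn=1))
```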

In the above schemes, the O&M management system needs to modify the parameters saved by the metadata nodes. This can be done in one of the following ways:

When the primary node is normal, or after a primary node has been re-elected, the O&M management system sends the parameter modification command to the primary node. The primary node directly modifies the parameters in its own memory, produces a modification log, records it locally and sends it to all slave nodes. Each slave node receives the log synchronization and decides, according to the protocol conventions, whether the log can be accepted; if so, it records the log locally, modifies the parameters in memory and returns success to the primary node.

When the primary node is abnormal, the O&M management system obtains all normal metadata nodes and sends the parameter modification request to each of them. A slave node that receives the request directly modifies the configuration parameters in its own memory but does not produce any modification log.

In this embodiment, in the emergency processing schemes for both of the above cases, while the values of parameter S and/or parameter E saved by the configuration center and the normal nodes are being modified, abnormal nodes are prohibited from starting, or the connection between the configuration center and the abnormal nodes is disconnected, to prevent an abnormal node from starting with the old configuration parameters during the configuration process.

This embodiment also provides an O&M management system which, as shown in Fig. 3, includes a state detection module 10, a control module 20 and an emergency processing module 30, wherein:

The state detection module 10 is configured to detect state changes of the nodes in a node cluster running a state synchronization protocol, determine the number NN of normal nodes and notify the control module, NN being an integer;

The control module 20 is configured to call the emergency processing module to perform emergency processing, so as to restore normal service, after NN changes from greater than or equal to S0 to less than S0;

The emergency processing module 30 is configured to perform the following emergency processing when the normal nodes include the primary node: modifying the value of parameter S saved by the configuration center and the normal nodes to a positive integer less than or equal to NN;

Optionally,

When the normal nodes include the primary node, the emergency processing performed by the emergency processing module further includes: modifying the value of parameter E saved by the configuration center and the normal nodes to T-NN'+1, where parameter E represents the minimum number of normal nodes required for the node cluster to elect successfully, and NN' is the positive integer less than or equal to NN to which parameter S is modified.

Optionally,

The emergency processing module is further configured to perform the following emergency processing when the normal nodes do not include the primary node: synchronizing the local logs of at least FN-floor(T/2) abnormal nodes to one normal node and, after the synchronization succeeds, modifying the value of parameter E saved by the configuration center and the normal nodes to NN; and, after the NN normal nodes elect a new primary node, stopping the service and performing the emergency processing for the case where the normal nodes include the primary node, normal service being restored after it completes; where T is the number of nodes in the node cluster, T >= 2, FN is the number of abnormal nodes in the node cluster, FN = T - NN, and floor() denotes rounding down.

Optionally,

The emergency processing module stops the service after the NN normal nodes elect a new primary node in the following way: before the new primary node is elected, the value of parameter S saved by the configuration center and the normal nodes is modified to a value greater than or equal to FN+1.

Optionally,

The control module is further configured to compare the changed NN with S0 when the NN notified by the state detection module changes again:

If NN < S0, call the emergency processing module to perform emergency processing again to restore normal service;

If NN >= S0, modify the values of parameter S and parameter E saved by the configuration center and the normal nodes to S0 and E0, respectively;

where E0 is the value of parameter E determined according to the state synchronization protocol.

Optionally,

The node cluster is a metadata node cluster running the Paxos protocol or a derivative protocol in a distributed storage system, the synchronization is log synchronization, and S0 = E0 = floor(T/2) + 1, where floor() denotes rounding down.

Optionally,

While the emergency processing module is modifying the values of parameter S and/or parameter E saved by the configuration center and the normal nodes, abnormal nodes are prohibited from starting, or the connection between the configuration center and the abnormal nodes is disconnected.

The above scheme of this embodiment is based on the Paxos state synchronization protocol. By configuring the parameters related to election and log synchronization among multiple metadata servers, metadata read and write service can still be provided when only a minority of metadata nodes in the cluster are normal, service can be conveniently recovered after faults are cleared, and data does not become inconsistent or lost in the process.

Embodiment two

This embodiment also concerns a node cluster running a state synchronization protocol, again taking as an example a metadata node cluster running the Paxos protocol or a derivative protocol in a distributed storage system. Its network architecture is shown in Fig. 1 and is not described again.

This embodiment proposes a performance improvement method for the case where a large number of low-performance nodes in the node cluster degrades the service performance of the whole cluster. As shown in Fig. 4, the method includes:

Step 210: determine the low-performance nodes in a node cluster running a state synchronization protocol;

In this step, low-performance nodes can be determined according to set indicators such as node response speed; for example, they may be determined by an administrator.

Step 220: when the synchronization process of the node cluster can succeed only if at least one low-performance node synchronizes successfully, perform performance improvement processing so that the synchronization process can also succeed when no low-performance node synchronizes successfully.

In this embodiment, if the number of low-performance nodes in the node cluster satisfies SN >= T-S0+1, the synchronization process of the node cluster can succeed only if at least one low-performance node synchronizes successfully. The performance boost processing includes: the operation management system changes the value of parameter S saved by the configuration center and by the nodes in the node cluster to T-SN. Here T is the number of nodes in the node cluster, T >= 2; parameter S denotes the minimum number of nodes that must synchronize successfully for the node cluster to provide normal service; and S0 is the value of parameter S determined by the state synchronization protocol. This embodiment considers the scenario in which the number of normal nodes is greater than S0.

After the value of parameter S is changed to T-SN, a synchronization succeeds as soon as T-SN nodes report success, and those T-SN nodes need not include any low-performance node. The performance of the node cluster can therefore match that of the high-performance nodes instead of being dragged down by the low-performance nodes.
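The trigger condition SN >= T-S0+1 and the new value T-SN of parameter S can be illustrated with a minimal sketch. The function names below are hypothetical and chosen only for illustration, and the default S0 = floor(T/2)+1 assumes a Paxos-style protocol as in this embodiment.

```python
def default_quorum(total_nodes: int) -> int:
    """Protocol default S0 = floor(T/2) + 1 (assumed Paxos-style majority)."""
    return total_nodes // 2 + 1


def needs_performance_boost(total_nodes: int, s0: int, slow_nodes: int) -> bool:
    """True when SN >= T - S0 + 1, i.e. fewer than S0 nodes are not low-performance."""
    return slow_nodes >= total_nodes - s0 + 1


def boosted_sync_quorum(total_nodes: int, slow_nodes: int) -> int:
    """New value of parameter S: T - SN, the number of non-low-performance nodes."""
    new_s = total_nodes - slow_nodes
    if new_s < 1:
        raise ValueError("at least one node must not be low-performance")
    return new_s


if __name__ == "__main__":
    T, SN = 5, 3                      # illustrative cluster: 5 nodes, 3 of them slow
    S0 = default_quorum(T)            # 3 for T = 5
    if needs_performance_boost(T, S0, SN):
        print("new S =", boosted_sync_quorum(T, SN))
```

With T = 5 and SN = 3, the condition holds because SN reaches T-S0+1 = 3, and S becomes 2, so the two remaining nodes alone can complete a synchronization.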

In this embodiment, the performance boost processing may further include at least one of the following:

Processing one: change the value of parameter E saved by the configuration center and by the nodes in the node cluster to SN+1;

Taking the Paxos protocol as an example, with S0 = floor(T/2)+1: if more than half of the nodes are low-performance nodes, performance boost processing is needed and the value of S is changed to T-SN, which is less than the protocol-specified value S0. To avoid log data inconsistency and loss in extreme cases, parameter E is changed to SN+1, which ensures that at least one node that took part in the preceding synchronization takes part in the election.
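The choice E = SN+1 can be checked with a short counting argument. The derivation below is an illustrative addition that assumes S has already been set to T-SN:

```latex
\[
S + E = (T - SN) + (SN + 1) = T + 1 > T
\quad\Longrightarrow\quad
S + E - T = 1 .
\]
```

Because S + E exceeds T, any S nodes that acknowledged a synchronization and any E nodes that take part in an election, both drawn from the same T nodes, must share at least one node.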

The values of parameters S and E saved by the nodes in the node cluster can be modified by means of a request, sent via the master node of the node cluster, to modify the values of parameters S and E; the request carries the target values T-SN and SN+1.

Processing two: if the current master node is a low-performance node, switch the master role to another node in the node cluster that is not a low-performance node.
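Processing two can be sketched as follows. The text does not say how the replacement node is chosen, so picking the first eligible node is an assumption made here, and the function and node names are hypothetical.

```python
from typing import Iterable, Set


def choose_master(nodes: Iterable[str], low_performance: Set[str],
                  current_master: str) -> str:
    """Return the node that should hold the master role after processing two."""
    # Keep the current master if it is not a low-performance node.
    if current_master not in low_performance:
        return current_master
    # Assumption: take the first node that is not low-performance.
    for node in nodes:
        if node not in low_performance:
            return node
    raise ValueError("no node outside the low-performance set is available")


if __name__ == "__main__":
    cluster = ["n1", "n2", "n3", "n4", "n5"]   # illustrative node names
    slow = {"n1", "n2", "n3"}
    print(choose_master(cluster, slow, current_master="n1"))
```

A real system would more likely trigger a re-election restricted to the eligible nodes; the sketch only shows the selection constraint.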

In this embodiment, the node cluster is a metadata node cluster in a distributed storage system that runs the Paxos protocol or a protocol derived from it, the synchronization is log synchronization, and S0 = floor(T/2)+1, where floor() denotes rounding down. However, the scheme of this embodiment can also be applied to node clusters that provide other services.

This embodiment further provides an operation management system including a performance management module, the performance management module being configured to perform performance boost processing on a node cluster running a state synchronization protocol. As shown in Fig. 5, it includes:

A first processing unit 50, configured to change the value of parameter S saved by the configuration center to T-SN;

Optionally,

The performance management module further includes at least one of the following units:

A third processing unit, configured to change the value of parameter E saved by the configuration center and by the nodes in the node cluster to SN+1, where parameter E denotes the minimum number of normal nodes required for an election in the node cluster to succeed;

A fourth processing unit, configured to switch the master role to another node in the node cluster that is not a low-performance node when the current master node is a low-performance node.

Optionally,

The node cluster is a metadata node cluster in a distributed storage system that runs the Paxos protocol or a protocol derived from it, and the synchronization is log synchronization.

When the performance of many metadata nodes degrades and lowers the overall performance of the cluster, this embodiment can improve the service performance of the cluster by modifying parameters.

The above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments. From the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the embodiments of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and are not intended to limit the invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, improvement, or the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for recovering service, applied to an operation management system, comprising:

2. The method of claim 1, characterized in that:

When the normal nodes include the master node, the emergency processing further includes: changing the value of parameter E saved by the configuration center and the normal nodes to T-NN'+1, where parameter E denotes the minimum number of normal nodes required for an election in the node cluster to succeed, and NN' is the positive integer less than or equal to NN to which parameter S was changed.

3. The method of claim 2, characterized in that:

In the performing of emergency processing to restore normal service, when the normal nodes do not include the master node, the emergency processing includes:

synchronizing the local logs of at least FN-floor(T/2) abnormal nodes to one normal node, and, after the synchronization succeeds, changing the value of parameter E saved by the configuration center and the normal nodes to NN; and

after the NN normal nodes elect a new master node, suspending service and performing the emergency processing for the case where the normal nodes include the master node, and restoring normal service after that processing is completed;

where T is the number of nodes in the node cluster, T >= 2, FN is the number of abnormal nodes in the node cluster, FN = T-NN, and floor() denotes rounding down.

4. The method of claim 3, characterized in that:

Suspending service after the NN normal nodes elect a new master node is achieved as follows: before the new master node is elected, the value of parameter S saved by the configuration center and the normal nodes is changed to a value greater than or equal to FN+1.

5. The method of any one of claims 2-4, characterized in that:

After NN changes from greater than or equal to S0 to less than S0 and emergency processing is performed to restore normal service, the method further includes: when NN changes again, comparing the changed NN with S0:

if NN < S0, performing the emergency processing again to restore normal service;

6. The method of claim 5, characterized in that:

The node cluster is a metadata node cluster in a distributed storage system that runs the Paxos protocol or a protocol derived from it, the synchronization is log synchronization, and S0 = E0 = floor(T/2)+1, where floor() denotes rounding down.

7. The method of any one of claims 2-4 and 6, characterized in that:

While the value of parameter S and/or parameter E saved by the configuration center and the normal nodes is being modified, abnormal nodes are prohibited from starting, or the connection between the configuration center and the abnormal nodes is disconnected.

8. An operation management system, characterized by comprising a state detection module, a control module, and an emergency processing module, wherein:

9. The operation management system of claim 8, characterized in that:

10. The operation management system of claim 9, characterized in that:

11. The operation management system of claim 10, characterized in that:

12. The operation management system of any one of claims 8-11, characterized in that:

13. The operation management system of claim 12, characterized in that:

14. The operation management system of any one of claims 8-11 and 13, characterized in that:

15. A performance boost method, comprising:

16. The method of claim 15, characterized in that:

Performing performance boost processing when the synchronization process of the node cluster can succeed only if at least one low-performance node synchronizes successfully includes:

when the number of low-performance nodes in the node cluster satisfies SN >= T-S0+1, performing performance boost processing, the performance boost processing including: the operation management system changing the value of parameter S saved by the configuration center and by the nodes in the node cluster to T-SN;

where T is the number of nodes in the node cluster, T >= 2, parameter S denotes the minimum number of nodes that must synchronize successfully for the node cluster to provide normal service, and S0 is the value of parameter S determined by the state synchronization protocol.

17. The method of claim 16, characterized in that:

The performance boost processing further includes at least one of the following:

changing the value of parameter E saved by the configuration center and by the nodes in the node cluster to SN+1; and

if the current master node is a low-performance node, switching the master role to another node in the node cluster that is not a low-performance node;

where parameter E denotes the minimum number of normal nodes required for an election in the node cluster to succeed.

18. The method of claim 16 or 17, characterized in that:

The node cluster is a metadata node cluster in a distributed storage system that runs the Paxos protocol or a protocol derived from it, the synchronization is log synchronization, and S0 = floor(T/2)+1, where floor() denotes rounding down.

19. An operation management system, characterized by comprising a performance management module, wherein:

20. The operation management system of claim 19, characterized in that:

The performance management module further includes at least one of the following units:

21. The operation management system of claim 19 or 20, characterized in that: