CN107181608A - Method and operation and maintenance management system for recovering service and improving performance - Google Patents
- Publication number
- CN107181608A (application CN201610140348.XA)
- Authority
- CN
- China
- Prior art keywords
- node
- parameter
- normal
- value
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0654—Management of faults, events, alarms or notifications using network fault recovery
- H04L41/0659—Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities
- H04L41/0661—Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities by reconfiguring faulty entities
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0681—Configuration of triggering conditions
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Hardware Redundancy (AREA)
- Telephonic Communication Services (AREA)
Abstract
A method and an operation and maintenance (O&M) management system for recovering service and improving performance. The O&M management system detects state changes of nodes in a node cluster running a state synchronization protocol and determines the number NN of normal nodes, NN being an integer. If NN changes from greater than or equal to S0 to less than S0, emergency handling is performed to recover normal service, wherein, when the normal nodes include the primary node, the emergency handling comprises modifying the value of the parameter S stored by the configuration center and by the normal nodes to a positive integer less than or equal to NN. The parameter S represents the minimum number of nodes that must synchronize successfully for the node cluster to provide normal service, and S0 is the value of S determined according to the state synchronization protocol. The application effectively addresses the unavailability caused by simultaneous hardware errors on multiple nodes.
Description
Technical field
The present invention relates to distributed systems, and more particularly to a method and an operation and maintenance (O&M) management system for recovering service and improving performance in a distributed system.
Background
In today's large-scale distributed storage systems, most designs adopt centralized metadata management in order to implement centralized permission authentication and quota control: the metadata of all data in the storage system is stored centrally on a small number of nodes. In this architecture, the availability of the metadata nodes (also called metadata servers) directly determines the availability of the whole system, so many distributed systems improve the availability of the metadata service through redundancy.
Redundancy introduces multiple nodes, and these nodes must use a state synchronization protocol among themselves to ensure that every decision made at any time is correct and cannot be repudiated. In a distributed system, if all nodes start from the same initial state and execute the same sequence of instructions, they end up in the same state. To ensure that every node executes the same instruction sequence, a "consensus algorithm" must be run for each instruction so that all nodes see the same instruction.
The Paxos protocol is widely regarded as one of the most commonly used state synchronization protocols. The problem it solves is how a distributed system reaches agreement on a value (a resolution). For each modification, Paxos assigns a monotonically increasing number to the modified state and puts the decision to multiple nodes; if a majority of nodes agree to accept the decision, the modification is persisted on those nodes. This design ensures that every resolution is agreed to by a majority of nodes, which guarantees its correctness. If, by contrast, a minority of nodes could pass a resolution, two resolutions could be produced under the same protocol number, which users would see as a wrong or inconsistent resolution. In addition, because each resolution number and the resolution itself are persisted, when the system recovers from an error, as long as the data on a majority of nodes has not been lost, the resolutions made earlier are retained, and any later resolution can build on a correct basis, so the data remains consistent and correct at all times.
In a distributed storage system that uses multiple metadata nodes as backups of one another, if the Paxos protocol is the theoretical basis for election and log backup, normal metadata service cannot be provided when only a minority of metadata nodes remain. In production systems, the machines hosting metadata nodes have essentially identical hardware configurations; for example, they all use solid state drives (SSDs) from the same manufacturer with similar erase/write lifetimes, which raises the probability that several machines fail at the same time. Once the disks on more than half of the machines fall into read-only mode, the service stops. When a majority of metadata nodes are down but the primary node is still available, the service for reading metadata can still be provided externally, but no operation that modifies metadata can succeed.
When a simplified form of the Paxos protocol is used in a distributed storage system, the metadata nodes hold an election through the Paxos protocol to produce a primary node (Primary), which provides the metadata service. The other nodes act as slave nodes (Slave) and only receive log synchronization from the primary node. The primary node sends each log entry it produces to all slave nodes; a slave node that agrees to accept the log synchronization returns an acknowledgement to the primary node. Once a majority of nodes (including the primary node) have synchronized successfully, the primary node returns success to the client (Client) that sent the service request; otherwise the client's request hangs and the client eventually receives a timeout, which appears as a service stoppage. In other words, when the Paxos protocol is used to make the metadata service redundant, the whole service stops if a majority of metadata nodes stop serving, even though some normal nodes remain. Likewise, when at least half of the nodes perform poorly, the performance of the whole service degrades, because a client operation completes only after a majority of nodes have acknowledged the log synchronization, so its latency is determined by the slowest node in that majority.
Node clusters running other state synchronization protocols face similar situations.
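The dependence of write latency on the slowest node of the acknowledging majority can be illustrated with a small calculation. The following is only a sketch under stated assumptions: the function name `commit_latency_ms` and the latency figures are hypothetical, and each normal node is modelled by a single number, the time it takes to write the log entry.

```python
def commit_latency_ms(ack_latencies_ms, s):
    """Return the latency of one write, or None if the request would hang.

    ack_latencies_ms: time for each *normal* node (primary included) to
    write the log entry; abnormal nodes are simply absent from the list.
    s: minimum number of nodes that must write the entry successfully.
    """
    if s < 1:
        raise ValueError("s must be a positive integer")
    if len(ack_latencies_ms) < s:
        return None  # too few nodes can acknowledge: the client times out
    # The write completes when the s-th fastest node has acknowledged.
    return sorted(ack_latencies_ms)[s - 1]


# Five nodes, majority s = 3 (illustrative numbers only).
healthy = commit_latency_ms([1, 4, 5, 6, 7], 3)      # third-fastest ack
degraded = commit_latency_ms([1, 4, 80, 90, 95], 3)  # every majority contains a slow node
stopped = commit_latency_ms([1, 4], 3)               # only two normal nodes remain
```

In the degraded case three of the five nodes are slow, so every possible majority contains a slow node and the write waits for it; in the last case the request cannot complete at all.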
Summary of the invention
In view of this, the present invention provides the following solutions.
A method for recovering service, applied to an operation and maintenance (O&M) management system, comprising:
detecting state changes of nodes in a node cluster running a state synchronization protocol, and determining the number NN of normal nodes, NN being an integer;
if NN changes from greater than or equal to S0 to less than S0, performing emergency handling to recover normal service, wherein, when the normal nodes include the primary node, the emergency handling comprises: modifying the value of the parameter S stored by the configuration center and by the normal nodes to a positive integer less than or equal to NN;
wherein the parameter S represents the minimum number of nodes that must synchronize successfully for the node cluster to provide normal service, and S0 is the value of S determined according to the state synchronization protocol.
An O&M management system, comprising a state detection module, a control module and an emergency handling module, wherein:
the state detection module is configured to detect state changes of nodes in a node cluster running a state synchronization protocol, determine the number NN of normal nodes and notify the control module of it, NN being an integer;
the control module is configured to call the emergency handling module to perform emergency handling to recover normal service after NN changes from greater than or equal to S0 to less than S0;
the emergency handling module is configured to perform the following emergency handling when the normal nodes include the primary node: modifying the value of the parameter S stored by the configuration center and by the normal nodes to a positive integer less than or equal to NN;
wherein the parameter S represents the minimum number of nodes that must synchronize successfully for the node cluster to provide normal service, and S0 is the value of S determined according to the state synchronization protocol.
When the number of normal nodes falls below the minimum number of successfully synchronized nodes required for normal service, the above solution restores service immediately by modifying parameters. After the fault is cleared, normal operation can be restored easily, and no data inconsistency or loss is caused in the meantime. This approach effectively addresses the unavailability caused by simultaneous hardware errors on multiple nodes and reduces economic loss.
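The trigger condition and the parameter change for the case where the primary node is still normal can be sketched as follows. This is a minimal illustration, not the patented implementation: plain dictionaries stand in for the configuration center and for each node's in-memory parameters, and the function names `crossed_below` and `recover_with_primary` are hypothetical.

```python
def crossed_below(prev_nn, nn, s0):
    """True when NN changes from >= S0 to < S0 (the emergency trigger)."""
    return prev_nn >= s0 > nn


def recover_with_primary(nn, config_center, normal_nodes, new_s=None):
    """Emergency handling when the primary is among the normal nodes.

    Writes the new S to the configuration center and to every normal node.
    new_s defaults to NN; any positive integer <= NN is permitted.
    """
    new_s = nn if new_s is None else new_s
    if not 1 <= new_s <= nn:
        raise ValueError("S must be a positive integer <= NN")
    config_center["S"] = new_s
    for node in normal_nodes:
        node["S"] = new_s
    return new_s


# T = 5, S0 = 3: one more node fails and only two normal nodes remain.
config_center = {"S": 3}
normal_nodes = [{"name": "primary", "S": 3}, {"name": "slave-1", "S": 3}]
if crossed_below(prev_nn=3, nn=2, s0=3):
    recover_with_primary(2, config_center, normal_nodes)
```

After the call, both the configuration center and the two surviving nodes hold S = 2, so a write acknowledged by those two nodes can complete again.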
A method for improving performance, comprising:
determining the low-performance nodes in a node cluster running a state synchronization protocol;
when the synchronization process of the node cluster can succeed only if at least one low-performance node synchronizes successfully, performing performance improvement processing so that the synchronization process can succeed even when no low-performance node has synchronized successfully.
An O&M management system, comprising a performance management module, wherein:
the performance management module is configured to perform performance improvement processing on a node cluster running a state synchronization protocol, and comprises:
a first processing unit, configured to modify the value of the parameter S stored by the configuration center to T-SN;
a second processing unit, configured to modify the value of the parameter S stored by the nodes in the node cluster to T-SN;
wherein T is the number of nodes in the node cluster, T >= 2, SN is the number of low-performance nodes in the node cluster, and the parameter S represents the minimum number of nodes that must synchronize successfully for the node cluster to provide normal service.
When the number of low-performance nodes increases and synchronization slows down, the above solution configures the parameters appropriately so that the performance of the node cluster always matches that of the high-performance nodes.
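The effect of setting S to T-SN can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the requirement SN < T is an added assumption so that S stays a positive integer.

```python
def low_perf_on_sync_path(t, sn, s):
    """True if the T - SN high-performance nodes alone cannot reach S."""
    return t - sn < s


def tuned_s(t, sn):
    """Value of S used by the performance improvement processing: T - SN."""
    if t < 2 or not 0 <= sn < t:
        raise ValueError("need T >= 2 and 0 <= SN < T")
    return t - sn


# T = 5 nodes, SN = 3 of them slow, protocol default S0 = 3.
t, sn, s0 = 5, 3, 3
needs_tuning = low_perf_on_sync_path(t, sn, s0)   # 2 fast nodes < 3 required
new_s = tuned_s(t, sn)                            # S becomes 2
still_blocked = low_perf_on_sync_path(t, sn, new_s)
```

With S = T - SN, the high-performance nodes alone are exactly enough to complete synchronization, so a write no longer has to wait for any slow node.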
Brief description of the drawings
Fig. 1 is a schematic diagram of the network architecture of Embodiment one of the present invention;
Fig. 2 is a flow chart of the method for recovering service in Embodiment one of the present invention;
Fig. 3 is a module diagram of the O&M management system of Embodiment one of the present invention;
Fig. 4 is a flow chart of the method for improving performance in Embodiment two of the present invention;
Fig. 5 is a module diagram of the O&M management system of Embodiment two of the present invention.
Detailed description
To make the objects, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be noted that, provided they do not conflict, the embodiments in this application and the features in the embodiments may be combined with one another.
Embodiment one
This embodiment takes as an example a metadata node cluster that runs the Paxos protocol or a derivative of it in a distributed storage system; for ease of description, the metadata nodes in this embodiment are also simply called nodes. The network architecture involved in this embodiment is shown in Fig. 1 and includes a metadata node cluster, a configuration center, an O&M management system, and clients. The metadata node cluster in the figure has three nodes as an example, and its metadata nodes are divided into a primary node and slave nodes.
Primary node: the node that provides read and write service externally; it converts each write operation into a modification log entry and synchronizes it to all slave nodes.
Slave node: receives the log entries synchronized from the primary node and decides, according to the protocol, whether a log entry can be accepted; if so, it returns success and applies the log entry in memory, otherwise it returns failure.
Configuration center: a storage system that persists node startup parameters. When a node starts, it obtains from the configuration center the configuration parameters to be used in the election stage and the service stage, keeps them in memory, and applies them in the corresponding stage.
Operation and maintenance (O&M) management system: also called the O&M tool. When necessary, it sends parameter modification requests to any node. It can run on the machine hosting any node to extract log entries from that node's local log and synchronize them to other normal nodes; it can check the state of the nodes in the current cluster and determine the numbers of normal and abnormal nodes; it judges whether emergency handling is needed to recover normal service; and it decides which emergency handling scheme to use, and so on. The O&M management system in this text comprises the functional modules that carry out the corresponding operation, maintenance and management functions. Physically, it may be deployed on a metadata node or as separate equipment, and may be implemented on a single device or on multiple devices; the present invention places no limitation on this.
Client (Client): the process that initiates requests to read and write metadata. All such requests are sent to the current primary node; a write operation that is accepted produces a modification log entry on the primary node.
Using the O&M tool, the parameters in the memory of the currently normal metadata nodes are set to the new parameters configured in the second step.
The processing of the metadata node cluster is divided into an election stage and a service stage:
Election stage:
This stage lets multiple nodes take part in an election to determine the primary node that provides service externally; the other nodes become slave nodes. In the Paxos protocol, completing an election requires two parameters, T and E. T = FN + NN, where FN is the number of abnormal nodes among the T nodes and NN is the number of normal nodes among the T nodes. E is the minimum number of normal nodes required for an election to succeed; the value of E determined by the state synchronization protocol is denoted E0, and in the Paxos protocol E0 = floor(T/2) + 1, where floor() denotes rounding down. That is, a new primary node can be elected only when a majority, i.e. more than half, of the nodes are normal.
Service stage:
After a primary node has been elected, all read and write requests are handled by the primary node. A write operation produces a modification log entry, which the primary node synchronizes to all slave nodes over the network. Two parameters, T and S, are needed to judge whether synchronization is complete. T is as defined above. S is the minimum number of nodes that must synchronize successfully for the node cluster to provide normal service. The value of S determined by the state synchronization protocol is denoted S0; in the Paxos protocol S0 = floor(T/2) + 1, for example S0 = 3 when T = 5 and S0 = 5 when T = 9. An operation that modifies metadata succeeds only if the number of nodes that have successfully written the corresponding modification log entry is greater than or equal to S, and the primary node counts toward S. That is, besides writing the log entry locally, the primary node must also synchronize it successfully to S-1 slave nodes to complete the operation; if the operation cannot be completed, the client's request hangs, which appears as a service stoppage.
Naturally, all parameters in this text that denote numbers of nodes are integers.
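The two stage checks described above reduce to simple integer arithmetic. The following sketch restates them in Python; the function names are hypothetical and the model ignores everything except node counts.

```python
def protocol_default(t):
    """E0 and S0 under Paxos: floor(T/2) + 1, a strict majority of T."""
    if t < 1:
        raise ValueError("T must be a positive integer")
    return t // 2 + 1


def election_succeeds(normal_participants, e):
    """Election stage: at least E normal nodes must take part."""
    return normal_participants >= e


def write_succeeds(log_written, s):
    """Service stage: log_written holds one bool per node, primary included.

    The write succeeds only if at least S nodes wrote the log entry.
    """
    return sum(log_written) >= s


s0_for_5 = protocol_default(5)
s0_for_9 = protocol_default(9)
# T = 5 with only the primary and one slave able to write the entry.
blocked = write_succeeds([True, True, False, False, False], s0_for_5)
# The same cluster once a third node writes the entry.
ok = write_succeeds([True, True, True, False, False], s0_for_5)
```

The values for T = 5 and T = 9 match the worked figures in the text (3 and 5 respectively).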
As described above, when the number of normal nodes is less than the S0 required by the Paxos protocol, the abnormal nodes cannot successfully write the modification log entry, so the node cluster cannot provide normal service. For this reason, in this embodiment the O&M management system makes a decision based on node state and, when normal service cannot be provided, restores service immediately by modifying parameters.
As shown in Fig. 2, the method for recovering service in this embodiment is applied to the O&M management system and includes:
Step 110: detect state changes of nodes in the node cluster running the state synchronization protocol, and determine the number NN of normal nodes, NN being an integer.
In this embodiment, the node cluster is a metadata node cluster that runs the Paxos protocol or a derivative of it in a distributed storage system, and the synchronization is log synchronization. The present invention is not limited to this, however, and can also be used for other node clusters that run a state synchronization protocol: because this embodiment restores service by modifying protocol parameters so that synchronization between nodes succeeds, the kind of service the node cluster provides does not affect recovery. The present invention can therefore also be applied to multiple nodes running other state synchronization protocols, such as an HA protocol.
Step 120: if NN changes from greater than or equal to S0 to less than S0, perform emergency handling to recover normal service. When the normal nodes include the primary node, the emergency handling includes: modifying the value of the parameter S stored by the configuration center and by the normal nodes to a positive integer less than or equal to NN;
where the parameter S is as defined above.
When the primary node is a normal node, it is still in the service state, so recovering service does not require an election stage; only the log synchronization parameter S of the service stage needs to be modified. This embodiment sets S to NN (the value after the change). Because the cluster contains NN normal nodes, NN nodes can synchronize successfully, which meets the required minimum number of successfully synchronized nodes, so normal service is recovered. In another embodiment, S may be set to a value less than NN; normal service is also recovered, and S need not be modified again if a few more normal nodes become abnormal. Setting S to NN as in this embodiment, however, means that more nodes must synchronize each modification log entry successfully, which gives better data safety.
In this embodiment, when the normal nodes include the primary node, the emergency handling further includes modifying the value of the parameter E stored by the configuration center and by the normal nodes to T-NN'+1, where E is the minimum number of normal nodes required for an election in the node cluster to succeed and NN' is the positive integer less than or equal to NN to which S was set. With NN' = NN, E is set to T-NN+1, i.e. FN+1, so that an election can succeed only with the participation of at least one of the currently normal nodes, whose log is complete. This prevents the original abnormal nodes, after returning to normal, from electing a new primary node among themselves and causing log data inconsistency or loss. Modifying E to guarantee log completeness is convenient and requires no additional interface, but it is not the only way; for example, the currently normal nodes could be recorded, and a later election could be required to include one of them in order to succeed.
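One way to read the choice E = T-NN'+1 is that E + S = T + 1, so any group of E nodes that holds an election must contain at least one of the S nodes that wrote a given log entry. The sketch below checks this by brute force for a small cluster; the function names are hypothetical, and the exhaustive check is only an illustration of the counting argument, not part of the described method.

```python
from itertools import combinations


def emergency_e(t, new_s):
    """E after emergency handling: T - NN' + 1, where NN' is the new S."""
    if not 1 <= new_s <= t:
        raise ValueError("need 1 <= NN' <= T")
    return t - new_s + 1


def quorums_always_overlap(t, s, e):
    """Brute-force check that every set of E nodes shares a node with
    every set of S nodes in a cluster of T nodes."""
    nodes = range(t)
    return all(set(a) & set(b)
               for a in combinations(nodes, s)
               for b in combinations(nodes, e))


# T = 5, NN = 2 normal nodes, S set to NN' = 2.
e_new = emergency_e(5, 2)                            # FN + 1 with FN = 3
safe = quorums_always_overlap(5, 2, e_new)           # elections see the log
unsafe = quorums_always_overlap(5, 2, 5 // 2 + 1)    # default E0 = 3 is not enough
```

With the default E0 = 3, the three formerly abnormal nodes could elect a primary among themselves without any node that holds the entries written while S = 2; raising E to 4 rules this out.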
The O&M management system can start the above flow when the metadata service needs to be recovered urgently, for example on receiving an administrator's instruction or when triggered by a corresponding policy.
This embodiment also provides an emergency handling scheme for the case where, after emergency handling is triggered, the normal nodes do not include the primary node. Because there is no primary node, an election stage is needed to produce one; the subsequent handling is the same as the emergency handling when the normal nodes include the primary node. Specifically, the emergency handling scheme when the normal nodes do not include the primary node is as follows:
First step: synchronize the local logs of at least FN-floor(T/2) abnormal nodes to one normal node and, after the synchronization succeeds, modify the value of the parameter E stored by the configuration center and by the normal nodes to NN.
When the primary node is abnormal, the logs stored by the normal nodes may be inconsistent, so the logs are synchronized first. After the local logs of at least FN-floor(T/2) abnormal nodes have been synchronized to one normal node, that node together with the other normal nodes holds the logs of (FN-floor(T/2))+NN = T-floor(T/2) nodes, which together are enough to determine the latest state before the failure. The precondition for successful synchronization, of course, is that the machines hosting at least FN-floor(T/2) abnormal nodes can be logged in to: although an abnormal node cannot complete log synchronization by itself, the O&M tool may still be able to log in to its machine and copy out its local log.
In this step, E is set to NN so that the NN normal nodes can successfully elect a new primary node to recover service.
Second step: after the NN normal nodes have elected a new primary node, stop the service and perform the emergency handling described above for the case where the normal nodes include the primary node; normal service is recovered once it completes.
Once the new primary node has been elected, S can be modified according to the emergency handling in step 120 to recover normal service, and E can be modified as well. To avoid producing new modification log entries while the parameters are being modified, the metadata service should be stopped at this point. A simple way is, before the new primary node is elected, to set the value of S stored by the configuration center and by the normal nodes to a value greater than or equal to FN+1, so that synchronization cannot succeed and the service is stopped. After the emergency handling is complete, i.e. after S has been modified as in step 120, normal service is recovered. Other ways of suspending service are also possible, for example adding steps that introduce an extra configuration item for this purpose.
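The parameter values used in the two steps above can be tabulated for a concrete cluster. The following sketch simply evaluates the formulas from the text; the function name `no_primary_plan` and the dictionary keys are hypothetical, and the precondition check reflects the assumption that this path is only taken when NN is at least 1 and below the Paxos majority.

```python
def no_primary_plan(t, nn):
    """Parameter values for emergency handling when no primary is normal."""
    if not 1 <= nn < t // 2 + 1:
        raise ValueError("expected 1 <= NN < floor(T/2) + 1")
    fn = t - nn                      # number of abnormal nodes
    logs_needed = fn - t // 2        # abnormal logs to copy to a normal node
    return {
        "abnormal_logs_to_collect": logs_needed,
        "logs_available_after_sync": logs_needed + nn,  # T - floor(T/2)
        "E_for_election": nn,        # lets the NN normal nodes elect a primary
        "min_S_to_pause_service": fn + 1,  # exceeds NN, so no write can complete
    }


plan = no_primary_plan(t=5, nn=2)
```

For T = 5 and NN = 2, one abnormal node's log must be recovered, giving logs from three nodes in total; E is set to 2 for the election, and S is held at 4 or more until the parameters have been changed, so the two normal nodes cannot complete a write in the meantime.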
After emergency handling, the node cluster works in an emergency state in which a minority of metadata nodes can provide service externally. In the emergency state, node states (normal or abnormal) can still change; for example, a normal node may become abnormal, or an abnormal node may be restarted after its fault is cleared and become normal. This embodiment therefore also provides the following scheme for handling node state changes:
After NN has changed from greater than or equal to S0 to less than S0 and emergency handling has been performed to recover normal service, if the state of any node in the node cluster changes again, NN changes too. The changed NN is then compared with S0:
If NN < S0, the emergency handling is performed again to recover normal service;
If NN >= S0, the values of the parameters S and E stored by the configuration center and by the normal nodes are modified to S0 and E0 respectively;
where E0 is the value of E determined according to the state synchronization protocol.
That is, once a majority of nodes are normal again, S and E are restored to the values determined by the state synchronization protocol, which returns the cluster to its initial working state and restores the redundancy provided by the Paxos protocol; in the Paxos protocol, S0 = E0 = floor(T/2)+1. If a majority of nodes are still abnormal, emergency handling must be performed again to recover normal service.
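The comparison of the changed NN with S0 can be written as a small decision function. This is a sketch under stated assumptions: the action labels are hypothetical, S is set to NN and E to T-NN+1 in the emergency branch (the choice made in this embodiment), and NN is assumed to be at least 1.

```python
def reevaluate(t, nn, primary_is_normal):
    """Decide what to do after NN changes again while in the emergency state."""
    if not 1 <= nn <= t:
        raise ValueError("expected 1 <= NN <= T")
    s0 = e0 = t // 2 + 1             # Paxos defaults
    if nn >= s0:
        # A majority is normal again: restore the protocol-determined values.
        return {"action": "restore", "S": s0, "E": e0}
    if primary_is_normal:
        # Still a minority: repeat emergency handling with the new NN.
        return {"action": "emergency_with_primary", "S": nn, "E": t - nn + 1}
    # No primary among the normal nodes: log sync and election come first.
    return {"action": "emergency_without_primary"}


recovered = reevaluate(5, 3, True)
shrunk = reevaluate(5, 1, True)
headless = reevaluate(5, 2, False)
```

The first call models a repaired node rejoining a five-node cluster, the second a further failure that leaves only the primary, and the third the loss of the primary itself.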
In the above scheme, the O&M management system needs to modify the parameters stored by the metadata nodes. This can be done in one of the following ways:
When the primary node is normal, or after a new primary node has been elected, the O&M management system sends the parameter modification command to the primary node. The primary node modifies the parameters in its own memory, produces a modification log entry, records it locally, and sends it to all slave nodes. Each slave node receives the log synchronization and judges according to the protocol whether the log entry can be accepted; if so, it records the log entry locally, modifies the parameters in its memory, and returns success to the primary node.
When the primary node is abnormal, the O&M management system obtains all normal metadata nodes and sends the parameter modification request to each of them. A node that receives the request modifies the configuration parameters in its own memory directly, without producing any modification log entry.
In this embodiment, in the emergency handling schemes for both cases above, while the values of the parameter S and/or the parameter E stored by the configuration center and by the normal nodes are being modified, abnormal nodes are prohibited from starting, or the connection between the configuration center and the abnormal nodes is disconnected, to prevent an abnormal node from starting with the old configuration parameters during the reconfiguration.
This embodiment also provides an O&M management system which, as shown in Fig. 3, includes a state detection module 10, a control module 20 and an emergency handling module 30, wherein:
The state detection module 10 is configured to detect state changes of nodes in the node cluster running the state synchronization protocol, determine the number NN of normal nodes and notify the control module of it, NN being an integer;
The control module 20 is configured to call the emergency handling module to perform emergency handling to recover normal service after NN changes from greater than or equal to S0 to less than S0;
The emergency handling module 30 is configured to perform the following emergency handling when the normal nodes include the primary node: modifying the value of the parameter S stored by the configuration center and by the normal nodes to a positive integer less than or equal to NN;
where the parameter S represents the minimum number of nodes that must synchronize successfully for the node cluster to provide normal service, and S0 is the value of S determined according to the state synchronization protocol.
Optionally,
when the normal nodes include the primary node, the emergency handling performed by the emergency handling module further includes: modifying the value of the parameter E stored by the configuration center and by the normal nodes to T-NN'+1, where E is the minimum number of normal nodes required for an election in the node cluster to succeed, and NN' is the positive integer less than or equal to NN to which S was set.
Optionally,
the emergency handling module is further configured to perform the following emergency handling when the normal nodes do not include the primary node: synchronizing the local logs of at least FN-floor(T/2) abnormal nodes to one normal node and, after the synchronization succeeds, modifying the value of the parameter E stored by the configuration center and by the normal nodes to NN; and, after the NN normal nodes have elected a new primary node, stopping the service and performing the emergency handling for the case where the normal nodes include the primary node, normal service being recovered once it completes; where T is the number of nodes in the node cluster, T >= 2, FN is the number of abnormal nodes in the node cluster, FN = T-NN, and floor() denotes rounding down.
Optionally,
the emergency handling module stops the service after the NN normal nodes have elected a new primary node in the following way: before the new primary node is elected, the value of the parameter S stored by the configuration center and by the normal nodes is modified to a value greater than or equal to FN+1.
Optionally,
the control module is further configured, when the NN notified by the state detection module changes again, to compare the changed NN with S0:
If NN < S0, call the emergency handling module to perform emergency handling again to recover normal service;
If NN >= S0, modify the values of the parameters S and E stored by the configuration center and by the normal nodes to S0 and E0 respectively;
where E0 is the value of E determined according to the state synchronization protocol.
Optionally,
The node cluster is a metadata node cluster running the Paxos protocol or a protocol derived from it in a distributed storage system, the synchronization is log synchronization, and S0=E0=floor(T/2)+1, where floor() denotes rounding down.
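The decision the control module takes when NN changes again can be sketched as follows. This is a minimal sketch under the Paxos defaults S0 = E0 = floor(T/2)+1; the function name and the returned dictionary layout are hypothetical and chosen only for illustration.

```python
def on_normal_count_change(t: int, nn: int) -> dict:
    """Hypothetical control-module decision after the normal-node count changes."""
    s0 = e0 = t // 2 + 1          # protocol-defined values in the Paxos case
    if nn < s0:
        # Still below the protocol quorum: emergency processing must be redone.
        return {"action": "emergency"}
    # Enough normal nodes again: restore the protocol-defined S and E.
    return {"action": "restore", "S": s0, "E": e0}
```

In a five-node cluster, a drop to two normal nodes triggers emergency processing again, while a return to three or more normal nodes restores S = E = 3.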
Optionally,
While modifying the value of parameter S and/or parameter E saved by the configuration center and the normal nodes, the emergency processing module forbids abnormal nodes from starting, or disconnects the configuration center from the abnormal nodes.
The above scheme of the present embodiment is based on the Paxos state synchronization protocol. By configuring the parameters that govern election and log synchronization among multiple metadata servers, the metadata read/write service can still be provided when only a minority of the metadata nodes in the cluster are normal, normal service can be conveniently recovered after the fault is cleared, and no data inconsistency or loss is caused in the meantime.
Embodiment two
The present embodiment also relates to a node cluster running a state synchronization protocol, again taking as an example a metadata node cluster that runs the Paxos protocol or a protocol derived from it in a distributed storage system. Its network architecture is as shown in Fig. 1 and is not described again.
The present embodiment proposes a performance boost method for the case where a node cluster contains many low-performance nodes, which degrades the service performance of the whole cluster. As shown in Fig. 4, the method includes:
Step 210, determine the low-performance nodes in the node cluster running the state synchronization protocol;
In this step, low-performance nodes can be determined according to configured indicators such as node response speed; for example, the determination can be made by an administrator.
Step 220, when the synchronization process of the node cluster can succeed only if at least one low-performance node synchronizes successfully, perform performance boost processing so that the synchronization process can also succeed when no low-performance node synchronizes successfully.
In the present embodiment, if the number SN of low-performance nodes in the node cluster satisfies SN >= T-S0+1, the synchronization process of the node cluster can succeed only if at least one low-performance node synchronizes successfully. The performance boost processing includes: the operation management system revises the value of parameter S saved by the configuration center and by the nodes in the node cluster to T-SN. Wherein, T is the number of nodes in the node cluster, T >= 2, parameter S represents the minimum number of nodes that must synchronize successfully for the node cluster to provide normal service, and S0 is the value of parameter S determined according to the state synchronization protocol. The present embodiment considers the scenario in which the number of normal nodes is greater than S0.
After the value of parameter S is revised to T-SN, a synchronization succeeds as soon as T-SN nodes report success, and those T-SN nodes need not include any low-performance node. The performance of the node cluster can therefore match that of the high-performance nodes instead of being dragged down by the low-performance nodes.
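The trigger condition and the new value of S can be sketched in a few lines. This is an illustrative sketch only: `boosted_sync_quorum` is a hypothetical name, S0 defaults to the Paxos value floor(T/2)+1, and the check that at least one node is not low-performance (SN < T) is an added assumption so that the resulting S stays positive.

```python
from typing import Optional


def boosted_sync_quorum(t: int, sn: int, s0: Optional[int] = None) -> int:
    """Return the value of S to store after the performance-boost check."""
    if s0 is None:
        s0 = t // 2 + 1            # assumed Paxos default
    if t < 2 or not 0 <= sn < t:
        raise ValueError("need T >= 2 and 0 <= SN < T")
    if sn >= t - s0 + 1:
        # Every group of S0 nodes contains a low-performance node: lower S.
        return t - sn
    return s0                      # fast nodes alone already reach S0
```

With T = 5 and three low-performance nodes, S drops from 3 to 2, so the two fast nodes can complete a synchronization on their own; with only two low-performance nodes, S stays at 3.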
In the present embodiment, the performance boost processing can also include at least one of the following:
Processing one, revise the value of parameter E saved by the configuration center and by the nodes in the node cluster to SN+1;
Taking the Paxos protocol as an example, with S0=floor(T/2)+1, if more than half of the nodes are low-performance nodes, performance boost processing is needed and the value of S is revised to T-SN, which is less than the protocol-specified value S0. To avoid log data inconsistency and loss in extreme cases, parameter E is revised to SN+1, which ensures that at least one node that took part in the preceding synchronization takes part in the election.
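The choice E = SN+1 can be checked with a short counting argument. This is an added illustration in the notation of the text, not a derivation taken from the original.

```latex
\[
S + E = (T - SN) + (SN + 1) = T + 1 > T
\]
```

Because S + E exceeds the total number of nodes T, any S nodes that completed a synchronization and any E nodes that take part in an election must share at least one node, so the election always includes a node holding the latest synchronized log.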
The values of parameters S and E saved by the nodes in the node cluster can be modified by sending a request to modify the values of parameters S and E to the master node in the node cluster, the request carrying the target values T-SN and SN+1.
Processing two, if the current master node is a low-performance node, switch the master node to another node in the node cluster that is not a low-performance node.
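Processing two can be sketched as a simple selection. The text does not say how the replacement is chosen, so picking the first eligible node in list order is an assumption, and `choose_master` is a hypothetical name.

```python
from typing import Optional, Sequence, Set


def choose_master(nodes: Sequence[str], low_perf: Set[str],
                  current: str) -> Optional[str]:
    """Keep the current master unless it is low-performance (illustrative only)."""
    if current not in low_perf:
        return current                 # no switch needed
    for node in nodes:
        if node not in low_perf:
            return node                # assumption: first eligible node wins
    return None                        # every node is low-performance
```

For example, with nodes `["a", "b", "c"]`, low-performance set `{"a"}` and current master `"a"`, the sketch returns `"b"`.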
In the present embodiment, the node cluster is a metadata node cluster running the Paxos protocol or a protocol derived from it in a distributed storage system, the synchronization is log synchronization, and S0=floor(T/2)+1, where floor() denotes rounding down. However, the scheme of the present embodiment can also be used for node clusters that provide other services.
The present embodiment additionally provides an operation management system including a performance management module. The performance management module is configured to perform performance boost processing on a node cluster running a state synchronization protocol and, as shown in Fig. 5, includes:
A first processing unit 50, configured to revise the value of parameter S saved by the configuration center to T-SN;
A second processing unit, configured to revise the value of parameter S saved by the nodes in the node cluster to T-SN;
Wherein, T is the number of nodes in the node cluster, T >= 2, SN is the number of low-performance nodes in the node cluster, and parameter S represents the minimum number of nodes that must synchronize successfully for the node cluster to provide normal service.
Optionally,
The performance management module also includes at least one of the following units:
A third processing unit, configured to revise the value of parameter E saved by the configuration center and by the nodes in the node cluster to SN+1, wherein parameter E represents the minimum number of normal nodes required for an election in the node cluster to succeed;
A fourth processing unit, configured to, when the current master node is a low-performance node, switch the master node to another node in the node cluster that is not a low-performance node.
Optionally,
The node cluster is a metadata node cluster running the Paxos protocol or a protocol derived from it in a distributed storage system, and the synchronization is log synchronization.
When the performance of many metadata nodes degrades and lowers the overall performance of the cluster, the present embodiment can improve the service performance of the cluster by modifying parameters.
The above embodiments of the present invention are described for illustration only and do not indicate the relative merits of the embodiments. From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software together with the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the essence of the technical solution of the embodiments of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions that cause a terminal device (which can be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention. For those skilled in the art, the present invention can have various modifications and variations. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (21)
1. A method for recovering service, applied to an operation management system, comprising:
Detecting state changes of the nodes in a node cluster running a state synchronization protocol, and determining the number NN of normal nodes, NN being an integer;
If NN changes from greater than or equal to S0 to less than S0, performing emergency processing to recover normal service, wherein, when the normal nodes include the master node, the emergency processing includes: revising the value of parameter S saved by the configuration center and the normal nodes to a positive integer value less than or equal to NN;
Wherein, parameter S represents the minimum number of nodes that must synchronize successfully for the node cluster to provide normal service, and S0 is the value of parameter S determined according to the state synchronization protocol.
2. The method according to claim 1, characterized in that:
When the normal nodes include the master node, the emergency processing also includes: revising the value of parameter E saved by the configuration center and the normal nodes to T-NN'+1, wherein parameter E represents the minimum number of normal nodes required for an election in the node cluster to succeed, and NN' is the positive integer value, less than or equal to NN, to which parameter S is revised.
3. The method according to claim 2, characterized in that:
In the performing of emergency processing to recover normal service, when the normal nodes do not include the master node, the emergency processing includes:
Synchronizing the local logs of at least FN-floor(T/2) abnormal nodes to one normal node and, after the synchronization succeeds, revising the value of parameter E saved by the configuration center and the normal nodes to NN; and
After the NN normal nodes elect a new master node, stopping the service, performing the emergency processing defined for the case where the normal nodes include the master node, and recovering normal service once that processing is complete;
Wherein, T is the number of nodes in the node cluster, T >= 2, FN is the number of abnormal nodes in the node cluster, FN = T-NN, and floor() denotes rounding down.
4. The method according to claim 3, characterized in that:
Stopping the service after the NN normal nodes elect a new master node is achieved in the following manner: before the new master node is elected, the value of parameter S saved by the configuration center and the normal nodes is revised to a value greater than or equal to FN+1.
5. The method according to any one of claims 2-4, characterized in that:
After NN changes from greater than or equal to S0 to less than S0 and emergency processing is performed to recover normal service, the method also includes: when NN changes again, comparing the changed NN with S0:
If NN<S0, performing the emergency processing again to recover normal service;
If NN >= S0, revising the values of parameter S and parameter E saved by the configuration center and the normal nodes to S0 and E0, respectively;
Wherein, E0 is the value of parameter E determined according to the state synchronization protocol.
6. The method according to claim 5, characterized in that:
The node cluster is a metadata node cluster running the Paxos protocol or a protocol derived from it in a distributed storage system, the synchronization is log synchronization, and S0=E0=floor(T/2)+1, where floor() denotes rounding down.
7. The method according to any one of claims 2-4 and 6, characterized in that:
While the value of parameter S and/or parameter E saved by the configuration center and the normal nodes is being modified, abnormal nodes are forbidden from starting, or the configuration center is disconnected from the abnormal nodes.
8. An operation management system, characterized by including a state detection module, a control module, and an emergency processing module, wherein:
The state detection module is configured to detect state changes of the nodes in a node cluster running a state synchronization protocol, determine the number NN of normal nodes, and notify the control module, NN being an integer;
The control module is configured to, after NN changes from greater than or equal to S0 to less than S0, call the emergency processing module to perform emergency processing to recover normal service;
The emergency processing module is configured to, when the normal nodes include the master node, perform the following emergency processing: revise the value of parameter S saved by the configuration center and the normal nodes to a positive integer value less than or equal to NN;
Wherein, parameter S represents the minimum number of nodes that must synchronize successfully for the node cluster to provide normal service, and S0 is the value of parameter S determined according to the state synchronization protocol.
9. The operation management system according to claim 8, characterized in that:
The emergency processing performed by the emergency processing module when the normal nodes include the master node also includes: revising the value of parameter E saved by the configuration center and the normal nodes to T-NN'+1, wherein parameter E represents the minimum number of normal nodes required for an election in the node cluster to succeed, and NN' is the positive integer value, less than or equal to NN, to which parameter S is revised.
10. The operation management system according to claim 9, characterized in that:
When the normal nodes do not include the master node, the emergency processing module is further configured to perform the following emergency processing: synchronize the local logs of at least FN-floor(T/2) abnormal nodes to one normal node and, after the synchronization succeeds, revise the value of parameter E saved by the configuration center and the normal nodes to NN; and, after the NN normal nodes elect a new master node, stop the service, perform the emergency processing defined for the case where the normal nodes include the master node, and recover normal service once that processing is complete, wherein T is the number of nodes in the node cluster, T >= 2, FN is the number of abnormal nodes in the node cluster, FN = T-NN, and floor() denotes rounding down.
11. The operation management system according to claim 10, characterized in that:
The emergency processing module stops the service after the NN normal nodes elect a new master node in the following manner: before the new master node is elected, the value of parameter S saved by the configuration center and the normal nodes is revised to a value greater than or equal to FN+1.
12. The operation management system according to any one of claims 8-11, characterized in that:
The control module is further configured to, when the NN notified by the state detection module changes again, compare the changed NN with S0:
If NN<S0, call the emergency processing module to perform emergency processing again to recover normal service;
If NN >= S0, revise the values of parameter S and parameter E saved by the configuration center and the normal nodes to S0 and E0, respectively;
Wherein, E0 is the value of parameter E determined according to the state synchronization protocol.
13. The operation management system according to claim 12, characterized in that:
The node cluster is a metadata node cluster running the Paxos protocol or a protocol derived from it in a distributed storage system, the synchronization is log synchronization, and S0=E0=floor(T/2)+1, where floor() denotes rounding down.
14. The operation management system according to any one of claims 8-11 and 13, characterized in that:
While modifying the value of parameter S and/or parameter E saved by the configuration center and the normal nodes, the emergency processing module forbids abnormal nodes from starting, or disconnects the configuration center from the abnormal nodes.
15. A method of performance boost, comprising:
Determining the low-performance nodes in a node cluster running a state synchronization protocol;
When the synchronization process of the node cluster can succeed only if at least one low-performance node synchronizes successfully, performing performance boost processing so that the synchronization process can also succeed when no low-performance node synchronizes successfully.
16. The method according to claim 15, characterized in that:
Performing performance boost processing when the synchronization process of the node cluster can succeed only if at least one low-performance node synchronizes successfully includes:
When the number SN of low-performance nodes in the node cluster satisfies SN >= T-S0+1, performing performance boost processing, the performance boost processing including: the operation management system revising the value of parameter S saved by the configuration center and by the nodes in the node cluster to T-SN;
Wherein, T is the number of nodes in the node cluster, T >= 2, parameter S represents the minimum number of nodes that must synchronize successfully for the node cluster to provide normal service, and S0 is the value of parameter S determined according to the state synchronization protocol.
17. The method according to claim 16, characterized in that:
The performance boost processing also includes at least one of the following:
Revising the value of parameter E saved by the configuration center and by the nodes in the node cluster to SN+1; and
If the current master node is a low-performance node, switching the master node to another node in the node cluster that is not a low-performance node;
Wherein, parameter E represents the minimum number of normal nodes required for an election in the node cluster to succeed.
18. The method according to claim 16 or 17, characterized in that:
The node cluster is a metadata node cluster running the Paxos protocol or a protocol derived from it in a distributed storage system, the synchronization is log synchronization, and S0=floor(T/2)+1, where floor() denotes rounding down.
19. An operation management system, characterized by including a performance management module, wherein:
The performance management module is configured to perform performance boost processing on a node cluster running a state synchronization protocol, and includes:
A first processing unit, configured to revise the value of parameter S saved by the configuration center to T-SN;
A second processing unit, configured to revise the value of parameter S saved by the nodes in the node cluster to T-SN;
Wherein, T is the number of nodes in the node cluster, T >= 2, SN is the number of low-performance nodes in the node cluster, and parameter S represents the minimum number of nodes that must synchronize successfully for the node cluster to provide normal service.
20. The operation management system according to claim 19, characterized in that:
The performance management module also includes at least one of the following units:
A third processing unit, configured to revise the value of parameter E saved by the configuration center and by the nodes in the node cluster to SN+1, wherein parameter E represents the minimum number of normal nodes required for an election in the node cluster to succeed;
A fourth processing unit, configured to, when the current master node is a low-performance node, switch the master node to another node in the node cluster that is not a low-performance node.
21. The operation management system according to claim 19 or 20, characterized in that:
The node cluster is a metadata node cluster running the Paxos protocol or a protocol derived from it in a distributed storage system, and the synchronization is log synchronization.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610140348.XA CN107181608B (en) | 2016-03-11 | 2016-03-11 | Method for recovering service and improving performance and operation and maintenance management system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610140348.XA CN107181608B (en) | 2016-03-11 | 2016-03-11 | Method for recovering service and improving performance and operation and maintenance management system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107181608A (en) | 2017-09-19 |
CN107181608B (en) | 2020-06-09 |
Family
ID=59830377
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610140348.XA Active CN107181608B (en) | 2016-03-11 | 2016-03-11 | Method for recovering service and improving performance and operation and maintenance management system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107181608B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108984635A (en) * | 2018-06-21 | 2018-12-11 | 郑州云海信息技术有限公司 | A kind of HDFS storage system and date storage method |
CN109167690A (en) * | 2018-09-25 | 2019-01-08 | 郑州云海信息技术有限公司 | A kind of restoration methods, device and the relevant device of the service of distributed system interior joint |
CN114328098A (en) * | 2021-12-23 | 2022-04-12 | 北京百度网讯科技有限公司 | Slow node detection method and device, electronic equipment and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1107119A2 (en) * | 1999-12-02 | 2001-06-13 | Sun Microsystems, Inc. | Extending cluster membership and quorum determinations to intelligent storage systems |
US6526432B1 (en) * | 1999-08-31 | 2003-02-25 | International Business Machines Corporation | Relaxed quorum determination for a quorum based operation of a distributed computing system |
US20030120715A1 (en) * | 2001-12-20 | 2003-06-26 | International Business Machines Corporation | Dynamic quorum adjustment |
CN1568467A (en) * | 2001-09-06 | 2005-01-19 | Bea***公司 | Exactly once cache framework |
CN1201245C (en) * | 1999-08-31 | 2005-05-11 | 国际商业机器公司 | Non-strict legal number decision based on legal number operation |
US7120821B1 (en) * | 2003-07-24 | 2006-10-10 | Unisys Corporation | Method to revive and reconstitute majority node set clusters |
WO2011134053A1 (en) * | 2010-04-26 | 2011-11-03 | Locationary, Inc. | Method and system for distributed data verification |
CN104077181A (en) * | 2014-06-26 | 2014-10-01 | 国电南瑞科技股份有限公司 | Status consistent maintaining method applicable to distributed task management system |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6526432B1 (en) * | 1999-08-31 | 2003-02-25 | International Business Machines Corporation | Relaxed quorum determination for a quorum based operation of a distributed computing system |
CN1201245C (en) * | 1999-08-31 | 2005-05-11 | 国际商业机器公司 | Non-strict legal number decision based on legal number operation |
EP1107119A2 (en) * | 1999-12-02 | 2001-06-13 | Sun Microsystems, Inc. | Extending cluster membership and quorum determinations to intelligent storage systems |
CN1568467A (en) * | 2001-09-06 | 2005-01-19 | Bea***公司 | Exactly once cache framework |
US20030120715A1 (en) * | 2001-12-20 | 2003-06-26 | International Business Machines Corporation | Dynamic quorum adjustment |
US7120821B1 (en) * | 2003-07-24 | 2006-10-10 | Unisys Corporation | Method to revive and reconstitute majority node set clusters |
WO2011134053A1 (en) * | 2010-04-26 | 2011-11-03 | Locationary, Inc. | Method and system for distributed data verification |
CN104077181A (en) * | 2014-06-26 | 2014-10-01 | 国电南瑞科技股份有限公司 | Status consistent maintaining method applicable to distributed task management system |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108984635A (en) * | 2018-06-21 | 2018-12-11 | 郑州云海信息技术有限公司 | A kind of HDFS storage system and date storage method |
CN109167690A (en) * | 2018-09-25 | 2019-01-08 | 郑州云海信息技术有限公司 | A kind of restoration methods, device and the relevant device of the service of distributed system interior joint |
CN114328098A (en) * | 2021-12-23 | 2022-04-12 | 北京百度网讯科技有限公司 | Slow node detection method and device, electronic equipment and storage medium |
CN114328098B (en) * | 2021-12-23 | 2023-04-18 | 北京百度网讯科技有限公司 | Slow node detection method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN107181608B (en) | 2020-06-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11360854B2 (en) | Storage cluster configuration change method, storage cluster, and computer system | |
WO2021136422A1 (en) | State management method, master and backup application server switching method, and electronic device | |
CN111581284A (en) | High-availability method, device and system for database and storage medium | |
CN101079896B (en) | A method for constructing multi-availability mechanism coexistence framework of concurrent storage system | |
CN102394914A (en) | Cluster brain-split processing method and device | |
CN104469699B (en) | Cluster quorum method and more cluster coupled systems | |
CN104994168A (en) | distributed storage method and distributed storage system | |
CN106484565A (en) | Method of data synchronization between multiple data centers and relevant device | |
CN111984274B (en) | Method and device for automatically deploying ETCD cluster by one key | |
WO2017097006A1 (en) | Real-time data fault-tolerance processing method and system | |
CN111935244B (en) | Service request processing system and super-integration all-in-one machine | |
CN115794499B (en) | Method and system for dual-activity replication data among distributed block storage clusters | |
CN107181608A (en) | A kind of method and operation management system for recovering service and performance boost | |
CN107357800A (en) | A kind of database High Availabitity zero loses solution method | |
CN114124650A (en) | Master-slave deployment method of SPTN (shortest Path bridging) network controller | |
CN113438111A (en) | Method for restoring RabbitMQ network partition based on Raft distribution and application | |
CN106095618A (en) | The method and system of data manipulation | |
CN104052799B (en) | A kind of method that High Availabitity storage is realized using resource ring | |
CN105323271B (en) | Cloud computing system and processing method and device thereof | |
CN116185697B (en) | Container cluster management method, device and system, electronic equipment and storage medium | |
CN117271227A (en) | Database cluster master node switching method, system and management and control platform | |
CN115878361A (en) | Node management method and device for database cluster and electronic equipment | |
KR101513943B1 (en) | Method and system for operating management of real-time replicated database | |
CN112202601B (en) | Application method of two physical node mongo clusters operated in duplicate set mode | |
CN114124803A (en) | Device management method, device, electronic device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20210402. Address after: Room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province. Patentee after: Alibaba (China) Co.,Ltd. Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands. Patentee before: Alibaba Group Holding Ltd. |