CN103995759A - High-availability computer system failure handling method and device based on core internal-external synergy - Google Patents
- Publication number
- CN103995759A (application CN201410215175.4A)
- Authority
- CN
- China
- Prior art keywords
- fault
- hardware
- trouble report
- service
- operating system
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Debugging And Monitoring (AREA)
- Hardware Redundancy (AREA)
Abstract
The invention discloses a high-availability computer system fault handling method and device based on cooperation inside and outside the kernel. The method comprises: (1) detecting service faults and hardware faults separately and outputting them through a fault reporting interface; (2) monitoring for fault reports, analyzing each received report, handling the hardware fault or service fault according to the analysis result, recording a log, notifying the administrator, and then judging whether dual-computer hot standby is needed, in which case the designated dual-computer hot standby software is notified to perform it. The device comprises a unified fault reporting subsystem and a unified fault handling subsystem, which correspond exactly to the steps of the method. The invention achieves unified management of software and hardware faults and detects both kinds of fault efficiently and promptly; the handling flow is simple, the fault handling rules are easy to extend, and high availability of the computer system is ensured when software or hardware faults occur.
Description
Technical field
The present invention relates to the field of high-availability management technology for computer systems, and in particular to a high-availability computer system fault handling method and device based on cooperation inside and outside the kernel.
Background art
The availability of a computer system is an index of how reliably and stably the system runs, and it is usually measured by the mean time between failures (MTBF). The longer the MTBF, the higher the availability of the computer system. The factors that affect computer system availability include both software and hardware. A software fault usually means that a program or piece of software in the computer system has been disrupted by some factor so that it cannot work normally or its normal use is affected; the scope of impact of a software fault is generally the software itself and the other software or programs that depend on it. A hardware fault usually means that physical hardware of the computer system has been disrupted by some factor so that it cannot work normally or its normal use is affected; hardware faults have a greater impact on the computer system and, in severe cases, can cause the system to crash.
In prior-art computer systems, the detection of hardware faults depends on hardware drivers, while software faults are usually detected by periodically polling service status. Once a fault is detected, it is handled immediately according to the default policy of the driver or program, and each component records its own handling log. However, prior-art computer systems have the following problems in high-availability management: (1) the computer system handles and reports software faults and hardware faults independently and lacks unified management of the two; (2) traditional backup technology monitors software faults inefficiently and cannot perceive hardware faults in time; (3) the handling flow for software and hardware faults is complex, and users cannot define handling rules.
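The background measures availability by MTBF alone. As a hedged illustration of why a longer MTBF means higher availability, the commonly used steady-state definition can be written out; the mean time to repair (MTTR) and the numbers below are assumptions introduced for this example and do not appear in the patent.

```latex
A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}
\qquad
\text{e.g. } \mathrm{MTTR} = 1\,\mathrm{h}:\quad
\mathrm{MTBF} = 1000\,\mathrm{h} \Rightarrow A = \tfrac{1000}{1001} \approx 0.9990,
\quad
\mathrm{MTBF} = 10000\,\mathrm{h} \Rightarrow A = \tfrac{10000}{10001} \approx 0.9999
```

Under this definition, availability also rises when repair time falls, which is the quantity that faster fault detection and handling would be expected to reduce.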
Summary of the invention
The technical problem to be solved by the present invention is as follows: in view of the technical problems in the prior art, to provide a high-availability computer system fault handling method and device, based on cooperation inside and outside the kernel, that achieve unified management of software and hardware faults, detect software faults and hardware faults efficiently and promptly, have a simple handling flow and easily extensible fault handling rules, and ensure high availability of the computer system in the presence of software or hardware faults.
To solve the above technical problems, the technical solution provided by the present invention is:
A high-availability computer system fault handling method based on cooperation inside and outside the kernel, implemented by the following steps:
1) outside the operating system kernel, detect service faults (including system service faults and application service faults), generate fault reports, and output them through the fault reporting interface; at the same time, inside the operating system kernel, detect hardware faults, generate fault reports, and output them through the fault reporting interface established outside the operating system kernel;
2) outside the operating system kernel, monitor the fault reporting interface for fault reports; after a fault report is received, analyze it and, according to the analysis result, either perform fault handling inside the operating system kernel on the hardware corresponding to a hardware fault, or perform fault handling outside the operating system kernel on the service corresponding to a service fault; record a log of the fault handling and send a notification to the administrator; then judge according to preset rules whether dual-computer hot standby needs to be performed and, if so, notify the designated dual-computer hot standby software to perform it.
Preferably, in step 1), detecting service faults (including system service faults and application service faults) outside the operating system kernel, generating fault reports, and outputting them through the fault reporting interface specifically means:
1.1.1) outside the operating system kernel, check the status of the system services and application services in the operating system by polling; if any system service or application service is in an abnormal state, determine that a service fault has occurred;
1.1.2) after determining that a service fault has occurred, generate a fault report from the information about the abnormal state of the system service or application service, and output the fault report through the fault reporting interface.
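Steps 1.1.1) and 1.1.2) can be sketched in user space as one polling round over a set of health probes. This is a minimal illustration only: the `FaultReport` fields, the probe callables, and the `report` callback standing in for the fault reporting interface are all hypothetical names, not structures defined by the patent.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FaultReport:
    kind: str      # "service" here; hardware reports would use "hardware"
    source: str    # name of the failed service (hypothetical field)
    detail: str    # description of the abnormal state
    timestamp: float = field(default_factory=time.time)


def poll_services(probes: Dict[str, Callable[[], bool]],
                  report: Callable[[FaultReport], None]) -> List[str]:
    """Run one polling round and report every service whose probe fails."""
    failed: List[str] = []
    for name, probe in probes.items():
        try:
            healthy = probe()
            detail = "probe returned unhealthy"
        except Exception as exc:  # a crashing probe also counts as abnormal
            healthy = False
            detail = f"probe raised {exc!r}"
        if not healthy:
            failed.append(name)
            # Step 1.1.2: build the report and hand it to the interface.
            report(FaultReport("service", name, detail))
    return failed
```

A caller would invoke `poll_services` on a timer; the polling interval is a tuning choice that the source does not specify.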
Preferably, in step 1), the detailed steps of detecting hardware faults inside the operating system kernel, generating fault reports, and outputting them through the fault reporting interface are as follows:
1.2.1) detect the corresponding hardware status information through a plurality of hardware status monitoring points distributed in advance in the fault injection interface, the fault interrupt handlers, and the hardware drivers; if the hardware status detected by any hardware status monitoring point is abnormal, that monitoring point collects on-site data of the corresponding hardware according to preset rules as hardware fault data;
1.2.2) encapsulate the hardware fault data into a fault report and store it in a preset fault message queue;
1.2.3) schedule and dispatch the fault reports stored in the fault message queue according to the fault message queue;
1.2.4) use a thread to output the scheduled fault reports through the fault reporting interface.
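The pipeline in steps 1.2.1) to 1.2.4) is a producer/consumer arrangement: monitoring points enqueue reports and a dedicated thread sends them. The sketch below models that shape in user-space Python purely for illustration; the patent places this logic inside the kernel, and the names `monitoring_point`, `sender`, `HardwareFaultReport`, and the `None` shutdown sentinel are assumptions made for the example.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class HardwareFaultReport:
    device: str                # hypothetical device identifier
    snapshot: Dict[str, int]   # on-site data captured at detection time


fault_queue: queue.Queue = queue.Queue()  # the preset fault message queue


def monitoring_point(device: str, status: Dict[str, int],
                     is_abnormal: Callable[[Dict[str, int]], bool]) -> bool:
    """Steps 1.2.1 and 1.2.2: check status; on abnormality, snapshot and enqueue."""
    if not is_abnormal(status):
        return False
    fault_queue.put(HardwareFaultReport(device, dict(status)))
    return True


def sender(report_interface: Callable[[HardwareFaultReport], None]) -> None:
    """Steps 1.2.3 and 1.2.4: dispatch queued reports in FIFO order."""
    while True:
        item = fault_queue.get()
        if item is None:       # sentinel used only to stop this demo thread
            break
        report_interface(item)


def run_demo() -> List[HardwareFaultReport]:
    delivered: List[HardwareFaultReport] = []
    worker = threading.Thread(target=sender, args=(delivered.append,))
    worker.start()
    too_many_errors = lambda s: s["ecc_errors"] > 10
    monitoring_point("dimm0", {"ecc_errors": 12}, too_many_errors)
    monitoring_point("dimm1", {"ecc_errors": 0}, too_many_errors)
    fault_queue.put(None)
    worker.join()
    return delivered
```

Decoupling detection from sending through a queue keeps the monitoring point itself short, which matters when it sits in an interrupt handler or driver path.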
Preferably, the detailed steps of step 2) are as follows:
2.1) outside the operating system kernel, monitor the fault reporting interface for fault reports by means of a daemon process;
2.2) after a fault report is received, analyze it outside the operating system kernel and determine its fault type; if the fault type is a service fault, recover the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, judge whether the hardware corresponding to the fault report needs faulty-hardware isolation; if isolation is needed, jump to step 2.3); otherwise judge whether the hardware corresponding to the fault report needs faulty-hardware recovery; if recovery is needed, jump to step 2.4); otherwise jump to step 2.5);
2.3) when the hardware corresponding to the fault report needs faulty-hardware isolation, isolate that hardware inside the operating system kernel;
2.4) when the hardware corresponding to the fault report needs faulty-hardware recovery, recover that hardware inside the operating system kernel;
2.5) record a log of the fault handling;
2.6) send a notification to the administrator;
2.7) judge according to preset rules whether dual-computer hot standby needs to be performed; if so, notify the designated dual-computer hot standby software to perform dual-computer hot standby by calling its notification plug-in.
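The decision flow of steps 2.2) to 2.7) can be sketched as a function that maps one fault report to an ordered action plan. This is an illustrative assumption-laden sketch: the `needs_isolation` and `needs_recovery` flags stand in for the outcome of the analysis step, the action strings are placeholders for real handlers, and isolation and recovery are treated as alternatives because the text says "otherwise".

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FaultReport:
    kind: str                      # "service" or "hardware"
    source: str                    # failed service or device name
    needs_isolation: bool = False  # hypothetical result of the analysis step
    needs_recovery: bool = False   # hypothetical result of the analysis step


def handle_fault(report: FaultReport,
                 standby_rule: Callable[[FaultReport], bool]) -> List[str]:
    """Return the ordered list of actions chosen for one fault report."""
    actions: List[str] = []
    if report.kind == "service":
        actions.append(f"recover service {report.source}")   # step 2.2
    elif report.needs_isolation:
        actions.append(f"isolate hardware {report.source}")  # step 2.3
    elif report.needs_recovery:
        actions.append(f"recover hardware {report.source}")  # step 2.4
    actions.append("write log")                              # step 2.5
    actions.append("notify administrator")                   # step 2.6
    if standby_rule(report):                                 # step 2.7
        actions.append("notify hot standby software")
    return actions
```

Passing `standby_rule` in as a callable mirrors the "preset rules" wording: the switchover policy can change without touching the dispatch logic.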
The present invention also provides a high-availability computer system fault handling device based on cooperation inside and outside the kernel, comprising:
a unified fault reporting subsystem, configured to detect service faults (including system service faults and application service faults) outside the operating system kernel, generate fault reports, and output them through the fault reporting interface, and at the same time to detect hardware faults inside the operating system kernel, generate fault reports, and output them through the fault reporting interface established outside the operating system kernel;
a unified fault handling subsystem, configured to monitor the fault reporting interface for fault reports outside the operating system kernel; after a fault report is received, to analyze it and, according to the analysis result, either perform fault handling inside the operating system kernel on the hardware corresponding to a hardware fault, or perform fault handling outside the operating system kernel on the service corresponding to a service fault; to record a log of the fault handling and send a notification to the administrator; and then to judge according to preset rules whether dual-computer hot standby needs to be performed and, if so, to notify the designated dual-computer hot standby software to perform it.
Preferably, the unified fault reporting subsystem comprises a service detection module configured to detect service faults (including system service faults and application service faults) outside the operating system kernel, generate fault reports, and output them through the fault reporting interface, and the service detection module comprises:
a service status polling detection submodule, configured to check, outside the operating system kernel, the status of the system services and application services in the operating system by polling, and to determine that a service fault has occurred if any system service or application service is in an abnormal state;
a service fault reporting submodule, configured to, after it is determined that a service fault has occurred, generate a fault report from the information about the abnormal state of the system service or application service, and output the fault report through the fault reporting interface.
Preferably, the unified fault reporting subsystem comprises a hardware detection module configured to detect hardware faults inside the operating system kernel, generate fault reports, and output them through the fault reporting interface, and the hardware detection module comprises:
a hardware status monitoring submodule, configured to detect the corresponding hardware status information through a plurality of hardware status monitoring points distributed in advance in the fault injection interface, the fault interrupt handlers, and the hardware drivers; if the hardware status detected by any hardware status monitoring point is abnormal, that monitoring point collects on-site data of the corresponding hardware according to preset rules as hardware fault data;
a hardware fault data encapsulation submodule, configured to encapsulate the hardware fault data into a fault report and store it in a preset fault message queue;
a fault message queue scheduling submodule, configured to schedule and dispatch the fault reports stored in the fault message queue according to the fault message queue;
a hardware fault data reporting submodule, configured to use a thread to output the scheduled fault reports through the fault reporting interface.
Preferably, the unified fault handling subsystem comprises:
a fault management daemon module, configured to monitor the fault reporting interface for fault reports outside the operating system kernel by means of a daemon process;
a fault handling engine, configured to, after a fault report is received, analyze it outside the operating system kernel and determine its fault type; if the fault type is a service fault, to recover the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, to judge whether the hardware corresponding to the fault report needs faulty-hardware isolation and, if so, call the fault isolation module; otherwise to judge whether the hardware corresponding to the fault report needs faulty-hardware recovery and, if so, call the fault recovery module;
a fault isolation module, configured to isolate, inside the operating system kernel, the hardware corresponding to the fault report when that hardware needs faulty-hardware isolation;
a fault recovery module, configured to recover, inside the operating system kernel, the hardware corresponding to the fault report when that hardware needs faulty-hardware recovery;
a logging module, configured to record a log of the fault handling;
a fault notification module, configured to send a notification to the administrator;
a dual-computer hot standby processing module, configured to judge according to preset rules whether dual-computer hot standby needs to be performed and, if so, to notify the designated dual-computer hot standby software to perform dual-computer hot standby by calling its notification plug-in.
The high-availability computer system fault handling method based on cooperation inside and outside the kernel of the present invention has the following technical effects. The present invention detects service faults (including system service faults and application service faults) outside the operating system kernel, generates fault reports and outputs them through the fault reporting interface, and at the same time detects hardware faults inside the operating system kernel, generates fault reports and outputs them through the fault reporting interface established outside the operating system kernel. By combining hardware detection with service detection, it achieves unified reporting of software faults (service faults) and hardware faults; by combining a kernel module with an out-of-kernel daemon that receives the unified reports, it implements the subsequent unified handling mechanism and dual-computer hot standby processing. This solves the problems that, in high-availability management, fault reports cannot be coordinated, software and hardware cannot be managed in a unified way, and traditional hot standby software cannot perceive faults in time. Detection of software faults and hardware faults is efficient and timely, the handling flow is simple, and fault handling rules are easy to extend. On the basis of reading and diagnosing the fault data, faults are handled through fault isolation and fault recovery, and the multi-machine hot standby software is notified to perform inter-machine service switchover, ensuring high availability of the computer system under the abnormal conditions of a software fault or hardware fault.
The high-availability computer system fault handling device based on cooperation inside and outside the kernel of the present invention is the device that corresponds exactly to the method of the present invention; it therefore has the same technical effects as the method, which are not repeated here.
Brief description of the drawings
Fig. 1 is a schematic diagram of the basic flow of the method of an embodiment of the present invention.
Fig. 2 is a schematic flowchart of service fault detection in the method of the embodiment of the present invention.
Fig. 3 is a schematic flowchart of hardware fault detection in the method of the embodiment of the present invention.
Fig. 4 is a schematic diagram of the unified handling process for service faults and hardware faults in the method of the embodiment of the present invention.
Fig. 5 is a schematic diagram of the framework structure of the device of the embodiment of the present invention.
Fig. 6 is a schematic diagram of the framework structure of the unified fault reporting subsystem in the device of the embodiment of the present invention.
Fig. 7 is a schematic diagram of the framework structure of the unified fault handling subsystem in the device of the embodiment of the present invention.
Detailed description of the embodiments
As shown in Fig. 1, the high-availability computer system fault handling method based on cooperation inside and outside the kernel of the present invention is implemented by the following steps:
1) outside the operating system kernel, detect service faults (including system service faults and application service faults), generate fault reports, and output them through the fault reporting interface; at the same time, inside the operating system kernel, detect hardware faults, generate fault reports, and output them through the fault reporting interface established outside the operating system kernel;
2) outside the operating system kernel, monitor the fault reporting interface for fault reports; after a fault report is received, analyze it and, according to the analysis result, either perform fault handling inside the operating system kernel on the hardware corresponding to a hardware fault, or perform fault handling outside the operating system kernel on the service corresponding to a service fault; record a log of the fault handling and send a notification to the administrator; then judge according to preset rules whether dual-computer hot standby needs to be performed and, if so, notify the designated dual-computer hot standby software to perform it.
As shown in Fig. 2, in step 1) of the present embodiment, detecting service faults (including system service faults and application service faults) outside the operating system kernel, generating fault reports, and outputting them through the fault reporting interface specifically means:
1.1.1) outside the operating system kernel, check the status of the system services and application services in the operating system by polling; if any system service or application service is in an abnormal state, determine that a service fault has occurred;
1.1.2) after determining that a service fault has occurred, generate a fault report from the information about the abnormal state of the system service or application service, and output the fault report through the fault reporting interface.
As shown in Fig. 3, in step 1) of the present embodiment, the detailed steps of detecting hardware faults inside the operating system kernel, generating fault reports, and outputting them through the fault reporting interface are as follows:
1.2.1) detect the corresponding hardware status information through a plurality of hardware status monitoring points distributed in advance in the fault injection interface, the fault interrupt handlers, and the hardware drivers; if the hardware status detected by any hardware status monitoring point is abnormal, that monitoring point collects on-site data of the corresponding hardware according to preset rules as hardware fault data;
1.2.2) encapsulate the hardware fault data into a fault report and store it in a preset fault message queue;
1.2.3) schedule and dispatch the fault reports stored in the fault message queue according to the fault message queue;
1.2.4) use a thread to output the scheduled fault reports through the fault reporting interface.
As shown in Fig. 4, the detailed steps of step 2) of the present embodiment are as follows:
2.1) outside the operating system kernel, monitor the fault reporting interface for fault reports by means of a daemon process;
2.2) after a fault report is received, analyze it outside the operating system kernel and determine its fault type; if the fault type is a service fault, recover the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, judge whether the hardware corresponding to the fault report needs faulty-hardware isolation; if isolation is needed, jump to step 2.3); otherwise judge whether the hardware corresponding to the fault report needs faulty-hardware recovery; if recovery is needed, jump to step 2.4); otherwise jump to step 2.5);
2.3) when the hardware corresponding to the fault report needs faulty-hardware isolation, isolate that hardware inside the operating system kernel;
2.4) when the hardware corresponding to the fault report needs faulty-hardware recovery, recover that hardware inside the operating system kernel;
2.5) record a log of the fault handling;
2.6) send a notification to the administrator;
2.7) judge according to preset rules whether dual-computer hot standby needs to be performed; if so, notify the designated dual-computer hot standby software to perform dual-computer hot standby by calling its notification plug-in.
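Step 2.2) recovers a failed service "according to the service dependency description" without saying how that description is used. One plausible reading, offered here only as an assumption, is that the failed service and everything that transitively depends on it are restarted with dependencies first. The sketch below computes such an order with the standard-library `graphlib` module (Python 3.9 or later); the dictionary format and the function name `restart_order` are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def restart_order(depends_on: Dict[str, Set[str]], failed: str) -> List[str]:
    """Order the failed service and its dependents so dependencies come first.

    depends_on maps each service to the set of services it requires.
    """
    # Collect the failed service and every service that transitively needs it.
    affected = {failed}
    changed = True
    while changed:
        changed = False
        for service, requirements in depends_on.items():
            if service not in affected and requirements & affected:
                affected.add(service)
                changed = True
    # Restrict the graph to affected services, then sort topologically.
    subgraph = {s: depends_on.get(s, set()) & affected for s in affected}
    return list(TopologicalSorter(subgraph).static_order())
```

With a description such as `{"db": set(), "api": {"db"}, "web": {"api"}}`, a failure of `db` would lead to restarting `db`, then `api`, then `web`, while unrelated services are left alone.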
As shown in Fig. 5, the high-availability computer system fault handling device based on cooperation inside and outside the kernel of the present embodiment comprises:
a unified fault reporting subsystem, configured to detect service faults (including system service faults and application service faults) outside the operating system kernel, generate fault reports, and output them through the fault reporting interface, and at the same time to detect hardware faults inside the operating system kernel, generate fault reports, and output them through the fault reporting interface established outside the operating system kernel;
a unified fault handling subsystem, configured to monitor the fault reporting interface for fault reports outside the operating system kernel; after a fault report is received, to analyze it and, according to the analysis result, either perform fault handling inside the operating system kernel on the hardware corresponding to a hardware fault, or perform fault handling outside the operating system kernel on the service corresponding to a service fault; to record a log of the fault handling and send a notification to the administrator; and then to judge according to preset rules whether dual-computer hot standby needs to be performed and, if so, to notify the designated dual-computer hot standby software to perform it.
As shown in Fig. 6, the unified fault reporting subsystem of the present embodiment comprises a service detection module configured to detect service faults (including system service faults and application service faults) outside the operating system kernel, generate fault reports, and output them through the fault reporting interface, and the service detection module comprises:
a service status polling detection submodule, configured to check, outside the operating system kernel, the status of the system services and application services in the operating system by polling, and to determine that a service fault has occurred if any system service or application service is in an abnormal state;
a service fault reporting submodule, configured to, after it is determined that a service fault has occurred, generate a fault report from the information about the abnormal state of the system service or application service, and output the fault report through the fault reporting interface.
As shown in Fig. 6, the unified fault reporting subsystem of the present embodiment comprises a hardware detection module configured to detect hardware faults inside the operating system kernel, generate fault reports, and output them through the fault reporting interface, and the hardware detection module comprises:
a hardware status monitoring submodule, configured to detect the corresponding hardware status information through a plurality of hardware status monitoring points distributed in advance in the fault injection interface, the fault interrupt handlers, and the hardware drivers; if the hardware status detected by any hardware status monitoring point is abnormal, that monitoring point collects on-site data of the corresponding hardware according to preset rules as hardware fault data;
a hardware fault data encapsulation submodule, configured to encapsulate the hardware fault data into a fault report and store it in a preset fault message queue;
a fault message queue scheduling submodule, configured to schedule and dispatch the fault reports stored in the fault message queue according to the fault message queue;
a hardware fault data reporting submodule, configured to use a thread to output the scheduled fault reports through the fault reporting interface.
In the present embodiment, the hardware status monitoring submodule detects the corresponding hardware status information through a plurality of hardware status monitoring points distributed in advance in the fault injection interface, the fault interrupt handlers, and the hardware drivers, which improves early warning and rapid discovery of hardware faults and makes hardware fault discovery more timely and efficient.
As shown in Fig. 7, the unified fault handling subsystem of the present embodiment comprises:
a fault management daemon module, configured to monitor the fault reporting interface for fault reports outside the operating system kernel by means of a daemon process;
a fault handling engine, configured to, after a fault report is received, analyze it outside the operating system kernel and determine its fault type; if the fault type is a service fault, to recover the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, to judge whether the hardware corresponding to the fault report needs faulty-hardware isolation and, if so, call the fault isolation module; otherwise to judge whether the hardware corresponding to the fault report needs faulty-hardware recovery and, if so, call the fault recovery module;
a fault isolation module, configured to isolate, inside the operating system kernel, the hardware corresponding to the fault report when that hardware needs faulty-hardware isolation;
a fault recovery module, configured to recover, inside the operating system kernel, the hardware corresponding to the fault report when that hardware needs faulty-hardware recovery;
a logging module, configured to record a log of the fault handling;
a fault notification module, configured to send a notification to the administrator;
a dual-computer hot standby processing module, configured to judge according to preset rules whether dual-computer hot standby needs to be performed and, if so, to notify the designated dual-computer hot standby software to perform dual-computer hot standby by calling its notification plug-in.
The present embodiment designs the unified fault handling subsystem by combining a kernel module with an out-of-kernel daemon. In the unified fault handling subsystem, the fault handling engine describes service dependencies and checks the status of specific services, perceives the running state of the kernel and of key services in real time, and maintains the consistency of system service state, which can improve system stability.
The device of the present embodiment combines hardware detection with service detection and designs a unified fault reporting framework as the concrete implementation of the unified fault reporting subsystem. The unified fault reporting framework inserts hardware status checkpoints into the processor and memory management code and into the bus and device driver code, which improves early warning and rapid discovery of hardware faults. The steps are as follows: 1. a hardware status checkpoint detects an abnormality; 2. hardware status data are collected and encapsulated; 3. the data are put into the message queue; 4. the fault sending thread is called to send the data.
The device of the present embodiment combines a kernel module with an out-of-kernel daemon and designs a unified fault handling framework as the concrete implementation of the unified fault handling subsystem. The unified fault handling framework describes service dependencies and checks the status of specific services, perceives the running state of the kernel and of key services in real time, and maintains the consistency of system service state. The steps are as follows: 1. the kernel or the service detection module reports an abnormality to the fault management daemon; 2. the fault management daemon performs fault handling; 3. the fault isolation module is notified to isolate the faulty hardware; 4. the fault recovery module is notified to complete fault recovery; 5. a fault notification is sent through the fault notification module; 6. a log is recorded through the logging module.
Based on the unified fault reporting framework and the unified fault handling framework, the device of the present embodiment improves on traditional multi-machine hot standby technology: real-time software and hardware fault information is obtained through the unified fault reporting framework, handling is completed by the unified fault handling framework, and the result is passed to the multi-machine hot standby software through an efficient event notification and callback mechanism; the hot standby software then performs inter-machine service switchover or migration. The steps are as follows: 1. the multi-machine hot standby software registers a migration signal and the trigger rule for that migration signal with the fault management daemon; 2. the fault handling module handles the fault and, when the trigger rule is satisfied, sends the migration signal; 3. the multi-machine hot standby software receives the migration signal and performs inter-machine service migration.
The device of the present embodiment implements and integrates unified monitoring of hardware faults and software faults, ensuring that the administrator can obtain software and hardware fault information in real time. The fault diagnosis, fault isolation, and fault recovery modules are applied to handle faults, and the multi-machine hot standby software is notified to perform inter-machine service switchover, ensuring high availability of the computer system under the abnormal conditions of a software fault or hardware fault.
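The three-step hand-off to the hot standby software (register a migration signal and its trigger rule, evaluate the rule after handling, deliver the signal) is a callback registry. The sketch below illustrates that idea under stated assumptions: the class name `FaultManager`, the method names, and the shape of the handling result dictionary are all invented for the example and are not an interface described in the patent.

```python
from typing import Callable, Dict, List, Tuple

Result = Dict[str, object]   # hypothetical shape of a fault handling result


class FaultManager:
    """Stand-in for the fault management daemon's registration table."""

    def __init__(self) -> None:
        self._subscribers: List[Tuple[Callable[[Result], bool],
                                      Callable[[Result], None]]] = []

    def register(self, trigger_rule: Callable[[Result], bool],
                 migration_signal: Callable[[Result], None]) -> None:
        # Step 1: hot standby software registers a signal and its trigger rule.
        self._subscribers.append((trigger_rule, migration_signal))

    def fault_handled(self, result: Result) -> int:
        # Step 2: after handling a fault, fire every signal whose rule matches.
        sent = 0
        for trigger_rule, migration_signal in self._subscribers:
            if trigger_rule(result):
                migration_signal(result)   # step 3 happens in the receiver
                sent += 1
        return sent
```

A standby package might register a rule such as "hardware fault that could not be recovered" and perform the inter-machine migration inside its callback, so the daemon never needs to know which hot standby product is installed.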
The above is only a preferred embodiment of the present invention, and the scope of protection of the invention is not limited to this embodiment: all technical solutions that fall within the concept of the invention belong to its scope of protection. It should be pointed out that, for those of ordinary skill in the art, improvements and modifications made without departing from the principles of the invention shall also be regarded as falling within its scope of protection.
Claims (8)
1. A high-availability computer system fault handling method based on kernel internal-external cooperation, characterized in that the implementation steps are as follows:
1) outside the operating system kernel, detecting service faults, including system service faults and application service faults, generating a fault report, and outputting it through a fault report interface; at the same time, inside the operating system kernel, detecting hardware faults, generating a fault report, and outputting it through the fault report interface, which is established outside the operating system kernel;
2) outside the operating system kernel, monitoring the fault report interface for fault reports; after a fault report is received, analyzing the fault report and, according to the analysis result, either performing fault handling inside the operating system kernel on the hardware corresponding to a hardware fault, or performing fault handling outside the operating system kernel on the service corresponding to a service fault; logging the fault handling and sending a notification to the administrator; then judging according to preset rules whether dual-machine hot standby needs to be performed and, if so, notifying the designated dual-machine hot standby software to perform dual-machine hot standby.
2. The high-availability computer system fault handling method based on kernel internal-external cooperation according to claim 1, characterized in that, in step 1), detecting service faults, including system service faults and application service faults, outside the operating system kernel, generating a fault report and outputting it through the fault report interface specifically comprises:
1.1.1) outside the operating system kernel, performing state detection by polling on the system services and application services of the operating system, and judging that a service fault has occurred if any system service or application service is in an abnormal state;
1.1.2) after a service fault is judged to have occurred, generating a fault report from the information about the abnormal state of the system service or application service, and outputting the fault report through the fault report interface.
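Steps 1.1.1) and 1.1.2) can be illustrated with a short sketch. It assumes that each service exposes some health check callable and that the fault report interface can be modelled as a function accepting a report; `FaultReport`, `poll_services` and the field names are hypothetical and are not taken from the patent.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FaultReport:
    """Hypothetical fault report layout."""
    fault_type: str   # "service" here; hardware reports come from the kernel
    source: str       # name of the abnormal service
    detail: str       # information about the abnormal state
    timestamp: float


def poll_services(checks: Dict[str, Callable[[], bool]],
                  report_interface: Callable[[FaultReport], None]) -> int:
    """Run one polling round over all services; return the faults reported."""
    faults = 0
    for name, is_healthy in checks.items():
        try:
            healthy = is_healthy()
            detail = "health check returned False"
        except Exception as exc:  # a crashing check also counts as abnormal
            healthy = False
            detail = f"health check raised {exc!r}"
        if not healthy:
            # Step 1.1.2: build the report and output it via the interface.
            report_interface(FaultReport("service", name, detail, time.time()))
            faults += 1
    return faults


received: List[FaultReport] = []  # stands in for the fault report interface
checks = {"sshd": lambda: True, "database": lambda: False}
count = poll_services(checks, received.append)
```

A daemon would typically call `poll_services` in a loop with a sleep between rounds; the polling interval is a tuning choice that the claims leave open.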
3. The high-availability computer system fault handling method based on kernel internal-external cooperation according to claim 2, characterized in that, in step 1), the detailed steps of detecting hardware faults inside the operating system kernel, generating a fault report and outputting it through the fault report interface are as follows:
1.2.1) detecting the corresponding hardware state information through a plurality of hardware state monitoring points distributed in advance in fault injection interfaces, fault interrupt handlers and hardware drivers; if the hardware state detected by any hardware state monitoring point is abnormal, that hardware state monitoring point collects on-site data of the corresponding hardware according to preset rules as hardware fault data;
1.2.2) encapsulating the hardware fault data to generate a fault report and storing it in a preset fault message queue;
1.2.3) scheduling and dispatching the fault reports stored in the fault message queue according to the fault message queue;
1.2.4) using a thread to output the scheduled fault reports through the fault report interface.
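The checkpoint, queue and sender-thread pipeline of steps 1.2.1) to 1.2.4) can be modelled in user space as below. This is a sketch under stated assumptions, not kernel code: the patent places the monitoring points inside the operating system kernel, whereas this example uses Python's standard `queue` and `threading` modules, and the names `checkpoint`, `sender` and the report fields are hypothetical.

```python
import queue
import threading
from typing import Callable, Dict, List

# Preset fault message queue shared by all monitoring points (step 1.2.2).
fault_queue: queue.Queue = queue.Queue()


def checkpoint(device: str, status_ok: bool, snapshot: Dict) -> None:
    """A hardware state monitoring point (step 1.2.1).

    If the observed state is abnormal, wrap the collected on-site data
    into a fault report and enqueue it; otherwise do nothing.
    """
    if not status_ok:
        fault_queue.put({"fault_type": "hardware",
                         "device": device,
                         "data": snapshot})


def sender(report_interface: Callable[[Dict], None]) -> None:
    """Sender thread: take reports off the queue in order and output them
    through the fault report interface (steps 1.2.3 and 1.2.4)."""
    while True:
        report = fault_queue.get()
        if report is None:  # sentinel used only to stop this example
            break
        report_interface(report)


delivered: List[Dict] = []  # stands in for the fault report interface
worker = threading.Thread(target=sender, args=(delivered.append,))
worker.start()
checkpoint("memory", True, {})                          # healthy: no report
checkpoint("pci-bus", False, {"error_register": 0x10})  # abnormal: reported
fault_queue.put(None)
worker.join()
```

Decoupling detection from delivery through a queue keeps each monitoring point cheap, which matters when it sits on an interrupt or driver path.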
4. The high-availability computer system fault handling method based on kernel internal-external cooperation according to claim 3, characterized in that the detailed steps of step 2) are as follows:
2.1) monitoring the fault report interface for fault reports by means of a daemon outside the operating system kernel;
2.2) after a fault report is received, analyzing the fault report outside the operating system kernel and judging its fault type; if the fault type is a service fault, recovering the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, judging whether faulty-hardware isolation needs to be performed on the hardware corresponding to the fault report; if isolation is needed, jumping to step 2.3); otherwise judging whether faulty-hardware recovery needs to be performed on the hardware corresponding to the fault report; if recovery is needed, jumping to step 2.4); otherwise jumping to step 2.5);
2.3) when faulty-hardware isolation needs to be performed on the hardware corresponding to the fault report, performing faulty-hardware isolation on that hardware inside the operating system kernel;
2.4) when faulty-hardware recovery needs to be performed on the hardware corresponding to the fault report, performing faulty-hardware recovery on that hardware inside the operating system kernel;
2.5) logging the fault handling;
2.6) sending a notification to the administrator;
2.7) judging according to preset rules whether dual-machine hot standby needs to be performed and, if so, notifying the designated dual-machine hot standby software to perform dual-machine hot standby by calling the notification plug-in of that software.
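The decision flow of steps 2.2) to 2.7) can be sketched as a single dispatch function. The sketch makes one interpretive assumption that should be read as such: steps are taken to fall through in numeric order, so after isolation in step 2.3) the recovery condition of step 2.4) is still evaluated; the claim wording could also be read as making the two branches mutually exclusive. The function name, the `actions` dictionary and the predicate parameters are hypothetical.

```python
from typing import Callable, Dict, List

Action = Callable[[Dict], None]
Predicate = Callable[[Dict], bool]


def handle_fault(report: Dict,
                 actions: Dict[str, Action],
                 needs_isolation: Predicate,
                 needs_recovery: Predicate,
                 needs_hot_standby: Predicate) -> List[str]:
    """Dispatch one fault report and return the names of the steps run."""
    trace: List[str] = []

    def run(step: str) -> None:
        actions[step](report)
        trace.append(step)

    if report["fault_type"] == "service":
        run("recover_service")        # 2.2, guided by service dependencies
    else:
        if needs_isolation(report):
            run("isolate_hardware")   # 2.3, performed inside the kernel
        if needs_recovery(report):
            run("recover_hardware")   # 2.4, performed inside the kernel
    run("write_log")                  # 2.5
    run("notify_admin")               # 2.6
    if needs_hot_standby(report):
        run("notify_hot_standby")     # 2.7, via the notification plug-in
    return trace


def noop(report: Dict) -> None:
    """Placeholder action used only for this example."""


actions = {name: noop for name in (
    "recover_service", "isolate_hardware", "recover_hardware",
    "write_log", "notify_admin", "notify_hot_standby")}

trace = handle_fault({"fault_type": "hardware", "device": "disk0"},
                     actions,
                     needs_isolation=lambda r: True,
                     needs_recovery=lambda r: False,
                     needs_hot_standby=lambda r: True)
```

Logging and administrator notification run on every path, which matches the claim: steps 2.5) and 2.6) are reached whether or not any isolation or recovery took place.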
5. A high-availability computer system fault handling device based on kernel internal-external cooperation, characterized by comprising:
a unified fault reporting subsystem, configured to detect, outside the operating system kernel, service faults including system service faults and application service faults, generate a fault report and output it through a fault report interface, and at the same time to detect hardware faults inside the operating system kernel, generate a fault report and output it through the fault report interface, which is established outside the operating system kernel;
a unified fault handling subsystem, configured to monitor the fault report interface for fault reports outside the operating system kernel; after a fault report is received, to analyze the fault report and, according to the analysis result, either perform fault handling inside the operating system kernel on the hardware corresponding to a hardware fault, or perform fault handling outside the operating system kernel on the service corresponding to a service fault; to log the fault handling and send a notification to the administrator; and then to judge according to preset rules whether dual-machine hot standby needs to be performed and, if so, to notify the designated dual-machine hot standby software to perform dual-machine hot standby.
6. The high-availability computer system fault handling device based on kernel internal-external cooperation according to claim 5, characterized in that the unified fault reporting subsystem comprises a service detection module configured to detect, outside the operating system kernel, service faults including system service faults and application service faults, generate a fault report and output it through the fault report interface, the service detection module comprising:
a service state polling detection submodule, configured to perform state detection by polling, outside the operating system kernel, on the system services and application services of the operating system, and to judge that a service fault has occurred if any system service or application service is in an abnormal state;
a service fault reporting submodule, configured to, after a service fault is judged to have occurred, generate a fault report from the information about the abnormal state of the system service or application service, and output the fault report through the fault report interface.
7. The high-availability computer system fault handling device based on kernel internal-external cooperation according to claim 6, characterized in that the unified fault reporting subsystem comprises a hardware detection module configured to detect hardware faults inside the operating system kernel, generate a fault report and output it through the fault report interface, the hardware detection module comprising:
a hardware state monitoring submodule, configured to detect the corresponding hardware state information through a plurality of hardware state monitoring points distributed in advance in fault injection interfaces, fault interrupt handlers and hardware drivers; if the hardware state detected by any hardware state monitoring point is abnormal, that hardware state monitoring point collects on-site data of the corresponding hardware according to preset rules as hardware fault data;
a hardware fault data encapsulation submodule, configured to encapsulate the hardware fault data to generate a fault report and store it in a preset fault message queue;
a fault message queue scheduling submodule, configured to schedule and dispatch the fault reports stored in the fault message queue according to the fault message queue;
a hardware fault data reporting submodule, configured to use a thread to output the scheduled fault reports through the fault report interface.
8. The high-availability computer system fault handling device based on kernel internal-external cooperation according to claim 7, characterized in that the unified fault handling subsystem comprises:
a fault management daemon module, configured to monitor the fault report interface for fault reports by means of a daemon outside the operating system kernel;
a fault handling engine, configured to, after a fault report is received, analyze the fault report outside the operating system kernel and judge its fault type; if the fault type is a service fault, to recover the system service or application service corresponding to the service fault according to the service dependency description; if the fault type is a hardware fault, to judge whether faulty-hardware isolation needs to be performed on the hardware corresponding to the fault report and, if so, to call the fault isolation module; otherwise to judge whether faulty-hardware recovery needs to be performed on the hardware corresponding to the fault report and, if so, to call the fault recovery module;
a fault isolation module, configured to perform faulty-hardware isolation inside the operating system kernel on the hardware corresponding to the fault report when such isolation is needed;
a fault recovery module, configured to perform faulty-hardware recovery inside the operating system kernel on the hardware corresponding to the fault report when such recovery is needed;
a logging module, configured to log the fault handling;
a fault notification module, configured to send a notification to the administrator;
a dual-machine hot standby processing module, configured to judge according to preset rules whether dual-machine hot standby needs to be performed and, if so, to notify the designated dual-machine hot standby software to perform dual-machine hot standby by calling the notification plug-in of that software.
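The fault handling engine recovers services "according to the service dependency description", but the claims do not spell out how that description is used. One plausible reading, sketched below as an assumption, is that the failed service and everything depending on it are restarted in dependency order. The dictionary format and the function `recovery_order` are hypothetical; `graphlib.TopologicalSorter` is part of the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def recovery_order(depends_on: Dict[str, List[str]], failed: str) -> List[str]:
    """Return the failed service and all services that depend on it,
    ordered so that each service comes after the services it depends on."""
    # Collect the failed service plus its direct and indirect dependents.
    affected: Set[str] = {failed}
    changed = True
    while changed:
        changed = False
        for service, deps in depends_on.items():
            if service not in affected and affected & set(deps):
                affected.add(service)
                changed = True
    # static_order() yields every service after its dependencies.
    full_order = TopologicalSorter(depends_on).static_order()
    return [service for service in full_order if service in affected]


# Hypothetical service dependency description: service -> what it needs.
depends_on = {
    "network": [],
    "database": ["network"],
    "webapp": ["database", "network"],
    "sshd": ["network"],
}
order = recovery_order(depends_on, "database")
```

Here a failure of `database` leads to restarting `database` and then `webapp`, while `network` and `sshd` are left alone because they do not depend on the failed service.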
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410215175.4A CN103995759B (en) | 2014-05-21 | 2014-05-21 | High-availability computer system failure handling method and device based on core internal-external synergy |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103995759A (en) | 2014-08-20 |
CN103995759B (en) | 2015-04-29 |
Family
ID=51309932
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410215175.4A Active CN103995759B (en) | 2014-05-21 | 2014-05-21 | High-availability computer system failure handling method and device based on core internal-external synergy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103995759B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1841341A (en) * | 2005-03-31 | 2006-10-04 | 冲电气工业株式会社 | Information processing device, information processing method and information processing program |
CN101833497A (en) * | 2010-03-30 | 2010-09-15 | 山东高效能服务器和存储研究院 | Computer fault management system based on expert system method |
CN102364448A (en) * | 2011-09-19 | 2012-02-29 | 浪潮电子信息产业股份有限公司 | Fault-tolerant method for computer fault management system |
US8190946B2 (en) * | 2008-05-30 | 2012-05-29 | Fujitsu Limited | Fault detecting method and information processing apparatus |
CN103279367A (en) * | 2013-05-07 | 2013-09-04 | 浪潮电子信息产业股份有限公司 | Kernel drive isolating system |
US20140115575A1 (en) * | 2012-10-18 | 2014-04-24 | Vmware, Inc. | Systems and methods for detecting system exceptions in guest operating systems |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107851054A (en) * | 2015-09-15 | 2018-03-27 | 德克萨斯仪器股份有限公司 | IC chip with multiple kernels |
CN107851054B (en) * | 2015-09-15 | 2022-02-08 | 德克萨斯仪器股份有限公司 | Integrated circuit chip with multiple cores |
US11269742B2 (en) | 2015-09-15 | 2022-03-08 | Texas Instruments Incorporated | Integrated circuit chip with cores asymmetrically oriented with respect to each other |
US11698841B2 (en) | 2015-09-15 | 2023-07-11 | Texas Instruments Incorporated | Integrated circuit chip with cores asymmetrically oriented with respect to each other |
CN106338982A (en) * | 2016-09-26 | 2017-01-18 | 深圳前海弘稼科技有限公司 | Fault processing method, fault processing device and server |
CN106815114A (en) * | 2017-01-12 | 2017-06-09 | 西安科技大学 | A kind of computer system fault handling method based on software-hardware synergism |
CN111367769A (en) * | 2020-03-30 | 2020-07-03 | 浙江大华技术股份有限公司 | Application fault processing method and electronic equipment |
CN111367769B (en) * | 2020-03-30 | 2023-07-21 | 浙江大华技术股份有限公司 | Application fault processing method and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN103995759B (en) | 2015-04-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
TWI746512B (en) | Physical machine fault classification processing method and device, and virtual machine recovery method and system | |
CN106789306B (en) | Method and system for detecting, collecting and recovering software fault of communication equipment | |
JP4882845B2 (en) | Virtual computer system | |
CN104268061B (en) | A kind of storage state monitoring method suitable for virtual machine | |
CN103324565B (en) | Daily record monitoring method | |
CN106598790A (en) | Server hardware failure detection method, apparatus of server, and server | |
CN103607297A (en) | Fault processing method of computer cluster system | |
CN104102572A (en) | Method and device for detecting and processing system faults | |
CN105224888B (en) | A kind of data of magnetic disk array protection system based on safe early warning technology | |
EP3148116A1 (en) | Information system fault scenario information collection method and system | |
CN102364448A (en) | Fault-tolerant method for computer fault management system | |
CN101556679A (en) | Method for processing failures in integrated front-end system and computer equipment | |
CN103995759B (en) | High-availability computer system failure handling method and device based on core internal-external synergy | |
CN105243004A (en) | Failure resource detection method and apparatus | |
CN111143167B (en) | Alarm merging method, device, equipment and storage medium for multiple platforms | |
CN103490919A (en) | Fault management system and fault management method | |
CN103378982A (en) | Internet business operation monitoring method and Internet business operation monitoring system | |
CN116126772A (en) | UART serial port management system and method applied to ARM server | |
CN109062723A (en) | The treating method and apparatus of server failure | |
CN105760241A (en) | Exporting method and system for memory data | |
CN103207825A (en) | Method and device for managing faults of entire equipment cabinet | |
CN111078445A (en) | PSU power failure reason detection method and device | |
CN104283718A (en) | Network device and hardware fault diagnosis method used for network device | |
CN103605592A (en) | Mechanism of detecting malfunctions of distributed computer system | |
JP2014120001A (en) | Monitoring device, monitoring method of monitoring object host, monitoring program, and recording medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |