CN104579802A

CN104579802A - Method for fast fault restoration of multipath server

Info

Publication number: CN104579802A
Application number: CN201510080647.4A
Authority: CN
Inventors: 王岩; 薛广营; 黄小东
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2015-02-15
Filing date: 2015-02-15
Publication date: 2015-04-29

Abstract

The invention provides a method for fast fault recovery of a multipath server and relates to multipath server architecture technology. In the method, the DMI bus of the PCH is connected to a master CPU and a slave CPU through a PCIE switch chip, and the switching of the switch chip is jointly controlled by the PCH and the BMC. When the slave CPU fails, the system masks the slave CPU; when the master CPU fails, the BIOS or the BMC automatically switches the DMI bus to the slave CPU and masks the failed master CPU, so that the system recovers quickly from the fault. Fault masking is thereby achieved for any one CPU in the server. The downtime during fault recovery of the server is shortened, so that the loss caused by system downtime due to a CPU fault is minimized.

Description

A method for fast fault recovery of a multipath server

Technical field

The present invention relates to multipath server architecture technology, and in particular to a method for fast fault recovery of a multipath server.

Background technology

In a common multipath server architecture, the DMI bus of the south bridge chip (PCH) is connected to the master CPU, as shown in Fig. 1. When the system powers on, the PCH obtains the system configuration information, device drivers, self-test programs and so on from the BIOS, and the self-test of all CPUs and memory is carried out over the DMI bus to the master CPU. After the self-test completes, the BIOS starts booting the operating system and startup finishes. In this architecture the system can mask a faulty slave CPU. If the master CPU fails, however, the DMI bus between it and the PCH cannot work, the BIOS program cannot be loaded, and the system cannot mask the master CPU. Fault recovery then requires manually replacing the master CPU, which increases server downtime and is highly undesirable for servers running critical applications.

Summary of the invention

To solve this problem, the present invention proposes a new method for fast fault recovery of a multipath server.

The technical solution of the present invention is as follows:

The DMI bus of the PCH is connected to the master CPU and one slave CPU through a PCIE switch chip, and the switching of the switch chip is jointly controlled by the PCH and the management controller (BMC). Because the DMI bus uses the PCIE protocol, a PCIE switch chip can preserve the signal integrity of the DMI bus. Under this scheme, when the slave CPU fails, the system masks the slave CPU. When the master CPU fails, the BIOS or the BMC automatically switches the DMI bus to the slave CPU and masks the failed master CPU, so that the system recovers quickly from the fault. Fault masking is thereby achieved for any one CPU in the server, which greatly shortens the downtime during fault recovery and minimizes the loss caused by system downtime due to a CPU fault. Giving both the PCH and the BMC control over the switching ensures that the switch chip switches stably and quickly when the master CPU fails.
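The fault policy in the preceding paragraph can be sketched as a small state update. This is only an illustrative model under stated assumptions: the names `Cpu`, `ServerState` and `handle_cpu_fault` are hypothetical and do not come from the patent, and the error raised when both CPUs have failed is an added assumption, since the text only covers a single faulty CPU.

```python
from dataclasses import dataclass, field
from enum import Enum


class Cpu(Enum):
    MASTER = "master"
    SLAVE = "slave"


@dataclass
class ServerState:
    # CPU that the switch chip currently routes the PCH's DMI bus to.
    dmi_target: Cpu = Cpu.MASTER
    # CPUs that have been masked after a fault.
    masked: set = field(default_factory=set)


def handle_cpu_fault(state: ServerState, faulty: Cpu) -> ServerState:
    """Mask the faulty CPU; move the DMI bus away from it if needed."""
    if faulty in state.masked:
        return state  # this fault was already handled
    other = Cpu.SLAVE if faulty is Cpu.MASTER else Cpu.MASTER
    if other in state.masked:
        # Assumption: with both CPUs faulty there is nothing to boot from.
        raise RuntimeError("no healthy CPU left to boot from")
    state.masked.add(faulty)
    if state.dmi_target is faulty:
        # Master-CPU fault: the DMI bus is switched to the other CPU.
        state.dmi_target = other
    return state


# A slave fault only masks the slave; a master fault also moves the DMI bus.
after_slave_fault = handle_cpu_fault(ServerState(), Cpu.SLAVE)
after_master_fault = handle_cpu_fault(ServerState(), Cpu.MASTER)
print(after_slave_fault.dmi_target, after_master_fault.dmi_target)
```

In the prior-art arrangement of Fig. 1 the `dmi_target` field would be fixed to the master CPU, which is why a master fault could not be handled without manual replacement.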

The control signal of the switch chip is jointly controlled by the GPIO ports of the PCH and the BMC. The level of this control signal selects whether the DMI bus of the PCH is connected to the master CPU or to the slave CPU.
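One way to picture the shared control signal is as a line that stays high unless a controller pulls it low. The sketch below assumes open-drain style sharing with a pull-up; the patent does not state the electrical arrangement, only that the signal is high by default, that both GPIO ports release it in the default state, and that either the PCH or the BMC can pull it low. The class and method names are hypothetical.

```python
from enum import Enum


class DmiTarget(Enum):
    MASTER_CPU = "master"
    SLAVE_CPU = "slave"


class SharedSelectLine:
    """Select line of the switch chip, shared by the PCH and BMC GPIO ports.

    Assumption: the line reads high unless at least one controller
    actively pulls it low (wired-AND behaviour).
    """

    DRIVERS = ("PCH", "BMC")

    def __init__(self) -> None:
        # Default state: both controllers have released the line.
        self._pulling_low = {name: False for name in self.DRIVERS}

    def _checked(self, driver: str) -> str:
        if driver not in self.DRIVERS:
            raise ValueError(f"unknown driver: {driver}")
        return driver

    def pull_low(self, driver: str) -> None:
        self._pulling_low[self._checked(driver)] = True

    def release(self, driver: str) -> None:
        self._pulling_low[self._checked(driver)] = False

    @property
    def level(self) -> int:
        # Low if any controller pulls low, otherwise high.
        return 0 if any(self._pulling_low.values()) else 1

    @property
    def dmi_target(self) -> DmiTarget:
        # High selects the master CPU, low selects the slave CPU.
        return DmiTarget.MASTER_CPU if self.level == 1 else DmiTarget.SLAVE_CPU


line = SharedSelectLine()
print(line.dmi_target)   # default: routed to the master CPU
line.pull_low("BMC")
print(line.dmi_target)   # after the BMC pulls low: routed to the slave CPU
```

Under this assumption either controller alone is enough to force the switchover, which matches the stated aim of having two independent switching paths.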

By default the switch chip selects the DMI bus of the master CPU, with the control signal at high level; in this default state the GPIO ports of both the PCH and the BMC release control of the signal. If the master CPU fails while the system is running, the BMC detects the fault, automatically pulls the control signal low, and performs one system restart. The DMI bus switchover is complete after the restart.
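The runtime path can be sketched as a single BMC-side handler. How the BMC detects the fault and how it drives its GPIO or resets the host are not specified in the text, so they are passed in as hypothetical callbacks; only the ordering (detect, pull the signal low, restart once) is taken from the description.

```python
from typing import Callable


def bmc_handle_runtime_fault(
    master_cpu_faulty: Callable[[], bool],
    pull_select_low: Callable[[], None],
    reset_system: Callable[[], None],
) -> bool:
    """Run one BMC check. Return True if a failover was triggered.

    All three callbacks are hypothetical hooks, not a real BMC API.
    """
    if not master_cpu_faulty():
        return False  # master CPU healthy: leave the line released (high)
    pull_select_low()  # route the PCH's DMI bus to the slave CPU
    reset_system()     # one restart; the switchover takes effect afterwards
    return True


# Usage with stand-in callbacks that just record what was requested.
actions = []
triggered = bmc_handle_runtime_fault(
    master_cpu_faulty=lambda: True,
    pull_select_low=lambda: actions.append("select_low"),
    reset_system=lambda: actions.append("reset"),
)
print(triggered, actions)
```

Pulling the signal low before the restart matters in this sketch: the restarted system must already see the DMI bus routed to the slave CPU, otherwise it would boot against the failed master CPU again.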

If the master CPU fails during the power-on self-test, the BIOS responds automatically according to the CPU self-test code: it controls the GPIO port of the PCH to pull the control signal of the switch chip low, switching to the slave CPU, and then performs a warm restart and runs the self-test again, completing the DMI bus switchover.
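The power-on path can be simulated in a few lines. This is a sketch under assumptions: the self-test result is reduced to a boolean per CPU, `Select` and `run_post` are hypothetical names, and the error raised when the slave CPU also fails is an added assumption because the text does not cover that case.

```python
from enum import Enum


class Select(Enum):
    MASTER = 1  # control signal high (default)
    SLAVE = 0   # control signal pulled low by the PCH GPIO


def run_post(cpu_ok: dict) -> tuple:
    """Simulate the self-test with DMI switchover.

    cpu_ok maps each Select value to whether that CPU passes self-test.
    Returns (selection the system boots with, number of warm restarts).
    """
    select = Select.MASTER
    warm_restarts = 0
    while True:
        if cpu_ok[select]:
            return select, warm_restarts
        if select is Select.SLAVE:
            # Assumption: no further fallback once the slave CPU fails too.
            raise RuntimeError("self-test failed on every available CPU")
        # Master CPU failed: pull the control signal low, then warm restart
        # and repeat the self-test against the slave CPU.
        select = Select.SLAVE
        warm_restarts += 1


print(run_post({Select.MASTER: False, Select.SLAVE: True}))
```

With a healthy master CPU the function returns immediately with zero warm restarts, so the switchover costs nothing in the normal boot path.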

With this design, when the master CPU fails, the BIOS or the BMC automatically switches the DMI bus to the slave CPU and masks the failed master CPU, so that the system recovers quickly from the fault. This greatly shortens the downtime during fault recovery of the server and minimizes the loss caused by system downtime due to a CPU fault.

Brief description of the drawings

Fig. 1 is a schematic diagram of the connection structure of the prior art.

Fig. 2 is a schematic diagram of the connection structure of the present invention.

Embodiment

The content of the present invention is described in more detail below.

As shown in Fig. 2:

1. The invention consists of a master CPU, a slave CPU, a switch chip, a PCH and a BMC.

2. The DMI buses of the master CPU and the slave CPU are both connected to the switch chip, and the other end of the switch chip is connected to the PCH of the system. The control signal of the switch chip is jointly controlled by the GPIO ports of the PCH and the BMC, and its level selects whether the DMI bus of the PCH is connected to the master CPU or to the slave CPU.

3. By default the switch chip selects the DMI bus of the master CPU (control signal at high level); in this default state the GPIO ports of both the PCH and the BMC release control of the signal. If the master CPU fails while the operating system is running, the BMC detects the fault, automatically pulls the control signal low, and performs one system restart. The DMI bus switchover is complete after the restart.

4. If the master CPU fails during the power-on self-test, the BIOS responds automatically according to the CPU self-test code: it controls the GPIO port of the PCH to pull the control signal of the switch chip low, switching to the slave CPU, and then performs a warm restart and runs the self-test again, completing the DMI bus switchover.

Claims

1. A method for fast fault recovery of a multipath server, characterized in that:

The DMI bus of the PCH is connected to the master CPU and one slave CPU through a PCIE switch chip, and the switching of the switch chip is jointly controlled by the PCH and the BMC; when the slave CPU fails, the system masks the slave CPU; when the master CPU fails, the BIOS or the BMC automatically switches the DMI bus to the slave CPU and masks the failed master CPU, so that the system recovers quickly from the fault, thereby achieving fault masking of any one CPU in the server.

2. The method according to claim 1, characterized in that the control signal of the switch chip is jointly controlled by the GPIO ports of the PCH and the BMC, and the control signal selects whether the DMI bus of the PCH is connected to the master CPU or to the slave CPU.

3. The method according to claim 2, characterized in that the switch chip selects the DMI bus of the master CPU by default, with the control signal at high level, and in the default state the GPIO ports of both the PCH and the BMC release control of the control signal; when the master CPU fails while the system is running, the BMC detects the fault of the master CPU, automatically pulls the control signal low, and performs one system restart, the DMI bus switchover being complete after the restart.

4. The method according to claim 3, characterized in that, when the master CPU fails during the power-on self-test, the BIOS responds automatically according to the CPU self-test code, controls the GPIO port of the PCH to pull the control signal of the switch chip low, switches to the slave CPU, and performs a warm restart and runs the self-test again, completing the DMI bus switchover.