CN114116280B - Interactive BMC self-recovery method, system, terminal and storage medium - Google Patents


Info

Publication number
CN114116280B
Authority
CN
China
Prior art keywords
bmc
firmware
response information
communication
programmable logic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111334036.XA
Other languages
Chinese (zh)
Other versions
CN114116280A (en
Inventor
丁宇翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202111334036.XA
Publication of CN114116280A
Application granted
Publication of CN114116280B
Active legal status
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751Error or fault detection not based on redundancy
    • G06F11/0754Error or fault detection not based on redundancy by exceeding limits
    • G06F11/0757Error or fault detection not based on redundancy by exceeding limits by exceeding a time limit, i.e. time-out, e.g. watchdogs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793Remedial or corrective actions
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Retry When Errors Occur (AREA)

Abstract

The invention relates to the technical field of servers, and in particular provides an interactive BMC self-recovery method, system, terminal and storage medium. The method comprises: a first BMC confirms that communication with a second BMC is abnormal, acquires from a complex programmable logic device the response information of the second BMC to the watchdog feed signal, and verifies the fault of the second BMC based on the response information; sends a restart instruction to the second BMC; and monitors the communication state with the second BMC, and if communication with the second BMC is not restored within a set period, writes a firmware file into the second BMC. By relying on the communication relationship between the interacting BMCs, the invention can restart a BMC whose firmware is damaged and restore its firmware in time, providing strong support for the normal operation of the server.

Description

Interactive BMC self-recovery method, system, terminal and storage medium
Technical Field
The invention relates to the technical field of servers, in particular to an interactive BMC self-recovery method, an interactive BMC self-recovery system, a terminal and a storage medium.
Background
Currently, a BMC (baseboard management controller) is the baseboard management system on a server. It is responsible for important functions such as asset information display, hardware monitoring, heat dissipation regulation, system configuration, remote monitoring, log collection, fault diagnosis, and system maintenance of the server. Servers can now be paired with satellite devices, such as smart network cards and smart GPUs that carry their own BMC chips. These devices are connected to the server through NCSI cables, so the BMC of the server and the BMC chips of the satellite devices can communicate, and the server can monitor the satellite controller in this way. The BMC uses a dual-flash design for storing firmware: when the firmware stored in one flash chip is damaged, the CPLD's watchdog feeding mechanism restarts the BMC so that it recovers by rolling back to the other chip. This requires one of the chips to be working normally; when both flash chips of the BMC are damaged, recovery is impossible.
In the prior art, a monitoring CPLD is provided. The CPLD sends a watchdog feed signal to each of the two BMCs and performs a hard restart of a BMC that has not responded for 1 hour; when the BMC flash firmware is damaged, the BMC cannot recover. The disadvantages of this approach include: when the BMC flash firmware is damaged, it takes 1 hour for the BMC to return to normal, which is too long; and when the BMC flash firmware is damaged, the BMC cannot recover automatically and can only be re-flashed via the OS, which requires manual operation.
Disclosure of Invention
In view of the above defects in the prior art, the invention provides an interactive BMC self-recovery method, system, terminal and storage medium to solve the above technical problems.
In a first aspect, the present invention provides an interactive BMC self-recovery method, including:
the first BMC confirms that communication with the second BMC is abnormal, acquires from the complex programmable logic device the response information of the second BMC to the watchdog feed signal, and verifies the fault of the second BMC based on the response information;
sending a restart instruction to the second BMC;
and monitoring the communication state with the second BMC, and if communication with the second BMC is not restored within a set period, writing a firmware file into the second BMC.
Further, the first BMC confirming that communication with the second BMC is abnormal, acquiring from the complex programmable logic device the response information of the second BMC to the watchdog feed signal, and verifying the fault of the second BMC based on the response information comprises:
the first BMC and the second BMC periodically send heartbeat signals to each other, and if no heartbeat signal is received within a specified period, communication is judged to be abnormal;
the response information of the second BMC is acquired from the complex programmable logic device, wherein the complex programmable logic device periodically sends a watchdog feed signal to the first BMC and the second BMC and records the response information of the first BMC and the second BMC;
and the response information of the second BMC is matched against the watchdog feed signals sent by the complex programmable logic device, and if they do not match, the second BMC is judged to have failed.
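The matching step above can be sketched as follows. This is only an illustration under stated assumptions: the patent does not specify how the complex programmable logic device records responses, so the record layout (`WatchdogLog` with sequence numbers in `sent` and `answered`), the function name `is_faulty`, and the `window` parameter are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WatchdogLog:
    """Hypothetical per-BMC record kept by the CPLD (format is an assumption)."""
    sent: list[int] = field(default_factory=list)    # sequence numbers of feed signals sent
    answered: set[int] = field(default_factory=set)  # sequence numbers the BMC responded to


def is_faulty(log: WatchdogLog, window: int = 3) -> bool:
    """Judge a fault when any of the last `window` feed signals has no matching response.

    The window size is an illustrative assumption, not a value from the patent.
    """
    recent = log.sent[-window:]
    if not recent:
        return False  # nothing sent yet, so there is nothing to match against
    return any(seq not in log.answered for seq in recent)


healthy = WatchdogLog(sent=[1, 2, 3, 4], answered={1, 2, 3, 4})
hung = WatchdogLog(sent=[1, 2, 3, 4], answered={1, 2})
print(is_faulty(healthy), is_faulty(hung))
```

Keeping the check a pure function of the CPLD record means the first BMC only needs read access to that record, which fits the description of the first BMC querying the CPLD rather than probing the second BMC directly.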
Further, sending a restart instruction to the second BMC comprises:
sending a restart instruction to the second BMC so that the second BMC switches firmware by restarting and performing a rollback operation.
Further, monitoring the communication state with the second BMC, and if communication with the second BMC is not restored within a set period, writing a firmware file into the second BMC comprises:
starting a timer at the moment the restart instruction is sent;
monitoring whether a heartbeat signal sent by the second BMC is received, and resetting the timer if a heartbeat signal sent by the second BMC is received;
and if the timer reaches the set period, judging that all the firmware of the second BMC is damaged, writing a local firmware file into the firmware storage device of the second BMC, and controlling the second BMC to restart.
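The timing logic above can be sketched as a small timer object. This is a minimal sketch, not the patented implementation: the class name `RecoveryTimer`, the function `next_action`, and the returned action strings are hypothetical, and "resetting the timer" is interpreted here as clearing it once a heartbeat arrives. Time is passed in explicitly so the behaviour is deterministic.

```python
from typing import Optional


class RecoveryTimer:
    """Tracks the time elapsed since the restart instruction was sent (hypothetical names)."""

    def __init__(self, set_period: float) -> None:
        self.set_period = set_period
        self.started_at: Optional[float] = None

    def start(self, now: float) -> None:
        # Called at the moment the restart instruction is sent.
        self.started_at = now

    def on_heartbeat(self) -> None:
        # A heartbeat from the second BMC means communication is restored: clear the timer.
        self.started_at = None

    def expired(self, now: float) -> bool:
        # True only if the timer is running and the set period has fully elapsed.
        return self.started_at is not None and now - self.started_at >= self.set_period


def next_action(timer: RecoveryTimer, now: float) -> str:
    """Decide what the first BMC does next; the action names are illustrative."""
    return "write_firmware_and_restart" if timer.expired(now) else "keep_waiting"


timer = RecoveryTimer(set_period=60.0)
timer.start(now=0.0)
print(next_action(timer, now=30.0))
print(next_action(timer, now=60.0))
```

The set period of 60 seconds is an arbitrary value for the example; the source does not state a concrete duration.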
In a second aspect, the present invention provides an interactive BMC self-recovery system comprising:
the fault confirming unit is configured to confirm that communication between the first BMC and the second BMC is abnormal, acquire from the complex programmable logic device the response information of the second BMC to the watchdog feed signal, and verify the fault of the second BMC based on the response information;
the restart control unit is configured to send a restart instruction to the second BMC;
and the firmware repairing unit is configured to monitor the communication state with the second BMC, and to write a firmware file into the second BMC if communication with the second BMC is not restored within a set period.
Further, the fault confirming unit includes:
the abnormality judging module, configured to judge that communication is abnormal if no heartbeat signal is received within a specified period, the first BMC and the second BMC periodically sending heartbeat signals to each other;
the information acquisition module, configured to acquire the response information of the second BMC from the complex programmable logic device, wherein the complex programmable logic device periodically sends a watchdog feed signal to the first BMC and the second BMC and records the response information of the first BMC and the second BMC;
and the fault verification module, configured to match the response information of the second BMC against the watchdog feed signals sent by the complex programmable logic device, and to judge that the second BMC has failed if they do not match.
Further, the restart control unit includes:
the restart control module, configured to send a restart instruction to the second BMC so that the second BMC switches firmware by restarting and performing a rollback operation.
Further, the firmware repairing unit includes:
the timing execution module, configured to start a timer at the moment the restart instruction is sent;
the communication monitoring module, configured to monitor whether a heartbeat signal sent by the second BMC is received, and to reset the timer if a heartbeat signal sent by the second BMC is received;
and the file writing module, configured to judge that all the firmware of the second BMC is damaged if the timer reaches the set period, write the local firmware file into the firmware storage device of the second BMC, and control the second BMC to restart.
In a third aspect, a terminal is provided, including:
a processor, a memory, wherein,
the memory is used for storing a computer program,
the processor is configured to call and run the computer program from the memory, so that the terminal performs the method described above.
In a fourth aspect, there is provided a computer storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method of the above aspects.
The beneficial effects of the interactive BMC self-recovery method, system, terminal and storage medium are as follows. By means of the communication between the server BMC and the satellite controller BMC, when the firmware stored in one flash of one BMC is damaged, the other BMC can quickly restore it to a normal state; when both flashes of the failed BMC are damaged, the other, normally working BMC can write firmware to the damaged BMC to restore its normal working state, providing strong support for the normal operation of the server.
In addition, the invention has reliable design principle, simple structure and very wide application prospect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic flow chart of a method of one embodiment of the invention.
Fig. 2 is another schematic flow chart of a method of one embodiment of the invention.
FIG. 3 is a schematic block diagram of a system of one embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
In order to make the technical solution of the present invention better understood by those skilled in the art, the technical solution of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
The following explains key terms appearing in the present invention.
BMC: the baseboard management controller (Baseboard Management Controller), a remote management controller for servers. It can perform operations such as upgrading firmware and checking machine equipment while the machine is not powered on. Fully implementing IPMI functionality in a BMC requires a powerful 16-bit or 32-bit microcontroller, RAM for data storage, flash memory for non-volatile data storage, and firmware to provide basic remote manageability in terms of secure remote reboot, secure re-power-up, LAN alerting, and system health monitoring. In addition to the basic IPMI and system operation monitoring functions, the mBMC can also enable BIOS flash element selection and protection by storing the previous BIOS in one of the two flash memories. For example, when the system fails to boot after a remote BIOS upgrade, the remote administrator may switch back to the previously working BIOS image to boot the system. Once the BIOS is upgraded, the BIOS image can be locked, effectively preventing viruses from tampering with it.
CPLD: the complex programmable logic device, which uses programming technologies such as CMOS EPROM, EEPROM, flash memory and SRAM to form a high-density, high-speed, low-power programmable logic device.
Flash: flash memory, a type of non-volatile memory device.
CPU: the central processing unit, which serves as the operation and control core of a computer system and is the final execution unit for information processing and program execution.
FIG. 1 is a schematic flow chart of a method of one embodiment of the invention. The execution body of fig. 1 may be an interactive BMC self-recovery system.
As shown in fig. 1, the method includes:
step 110, the first BMC confirms that communication with the second BMC is abnormal, acquires from the complex programmable logic device the response information of the second BMC to the watchdog feed signal, and verifies the fault of the second BMC based on the response information;
step 120, sending a restart instruction to the second BMC;
step 130, monitoring the communication state with the second BMC, and if communication with the second BMC is not restored within a set period, writing a firmware file into the second BMC.
In order to facilitate understanding of the present invention, the following describes the interactive BMC self-recovery method provided by the present invention by combining the fault self-recovery process of the interactive BMC in the embodiment.
Specifically, referring to fig. 2, the method for self-recovery of the interactive BMC includes:
S1, the first BMC confirms that communication with the second BMC is abnormal, acquires from the complex programmable logic device the response information of the second BMC to the watchdog feed signal, and verifies the fault of the second BMC based on the response information.
The first BMC and the second BMC periodically send heartbeat signals to each other, and if no heartbeat signal is received within a specified period, communication is judged to be abnormal. The response information of the second BMC is then acquired from the complex programmable logic device, which periodically sends a watchdog feed signal to the first BMC and the second BMC and records the response information of both. The response information of the second BMC is matched against the watchdog feed signals sent by the complex programmable logic device, and if they do not match, the second BMC is judged to have failed.
For example, when both BMCs work normally, the server BMC sends a "handshake" to the BMC of the satellite device, and the satellite controller returns a "handshake" to the server BMC after receiving it, which shows that the peer BMC is still working normally; the BMC of the satellite controller likewise initiates the same "handshake" toward the server. When the firmware stored in one flash of the second BMC is damaged, communication between the two BMCs becomes abnormal. When the first BMC does not receive the "handshake" returned by the second BMC, it checks with the CPLD whether the second BMC has responded to the watchdog feed operation; if the second BMC has not responded, the second BMC is judged to have failed.
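The two-stage check in this example (a missing handshake first, then a query to the CPLD) can be sketched as below. The names `HeartbeatMonitor`, `communication_abnormal`, and `confirm_fault` are hypothetical, the CPLD query is reduced to a boolean argument, and timestamps are passed in explicitly; none of these details come from the patent.

```python
class HeartbeatMonitor:
    """Remembers when the peer BMC's last heartbeat arrived (hypothetical helper)."""

    def __init__(self, specified_period: float, now: float = 0.0) -> None:
        self.specified_period = specified_period
        self.last_seen = now

    def record(self, now: float) -> None:
        # Called whenever a heartbeat ("handshake") from the peer is received.
        self.last_seen = now

    def communication_abnormal(self, now: float) -> bool:
        # Abnormal when no heartbeat has arrived within the specified period.
        return now - self.last_seen > self.specified_period


def confirm_fault(monitor: HeartbeatMonitor, now: float, responded_to_watchdog: bool) -> bool:
    """The peer is judged faulty only if heartbeats are missing AND the CPLD
    reports no response to the watchdog feed (passed in here as a boolean)."""
    return monitor.communication_abnormal(now) and not responded_to_watchdog


monitor = HeartbeatMonitor(specified_period=5.0)
monitor.record(now=10.0)
print(confirm_fault(monitor, now=12.0, responded_to_watchdog=False))
print(confirm_fault(monitor, now=16.0, responded_to_watchdog=False))
```

Requiring both conditions reflects the description: a lost handshake alone only marks communication as abnormal, and the CPLD record is what verifies that the second BMC itself has failed.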
S2, sending a restarting instruction to the second BMC.
After confirming that the second BMC has failed, a restart instruction is sent to the second BMC so that the second BMC switches firmware by restarting and performing a rollback operation. The BMC uses a dual-flash design for storing firmware: when the firmware stored in one chip is damaged, the BMC can be restarted to perform the rollback, switching to the other flash chip whose firmware is undamaged and thereby returning to a normal working state.
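The rollback behaviour described here can be modelled with a toy class. This is only a sketch under assumptions: the class `DualFlashBmc`, its fields, and the rule "try the active chip first, then the other one" are illustrative and are not taken from the patent, which does not describe how the BMC selects a chip.

```python
from dataclasses import dataclass


@dataclass
class DualFlashBmc:
    """Toy model of a BMC with two firmware flash chips (hypothetical names)."""
    flash_ok: list[bool]   # flash_ok[i] is True when chip i holds intact firmware
    active: int = 0        # index of the chip the BMC currently boots from
    running: bool = False

    def restart(self) -> bool:
        """Try the active chip first; if its firmware is damaged, roll back to the other chip."""
        for chip in (self.active, 1 - self.active):
            if self.flash_ok[chip]:
                self.active = chip
                self.running = True
                return True
        # Both chips are damaged: a restart alone cannot recover the BMC.
        self.running = False
        return False


one_bad = DualFlashBmc(flash_ok=[False, True])
both_bad = DualFlashBmc(flash_ok=[False, False])
print(one_bad.restart(), one_bad.active)
print(both_bad.restart())
```

The second case, where `restart()` fails because both chips are damaged, is the situation that the firmware write in step S3 is meant to handle.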
And S3, monitoring the communication state with the second BMC, and if the communication with the second BMC is not recovered within a set period, writing a firmware file into the second BMC.
If the firmware in the other flash chip of the second BMC is not damaged, the second BMC operates normally after restarting and re-establishes communication with the first BMC, and the first BMC can receive the heartbeat information sent by the second BMC. If the second BMC cannot return to a normal working state after the hard restart and communication between the two BMCs is not restored, the firmware stored in both chips of the second BMC is damaged. In that case the healthy first BMC can write firmware to the damaged second BMC, so that a BMC whose two firmware storage chips are both damaged is restored to normal. As preparation, copies of the firmware of the server BMC and of the satellite device BMC must be made in advance so that the two BMCs can write to each other.
The following method is performed based on the above principle: starting timing while sending a restart instruction; monitoring whether a heartbeat signal sent by the second BMC is received, and resetting the timing time if the heartbeat signal sent by the second BMC is received; if the timing time reaches the set period, judging that all the firmware of the second BMC is damaged, writing a local firmware file into a firmware storage device of the second BMC, and controlling the second BMC to restart.
As shown in fig. 3, the system 300 includes:
the fault confirmation unit 310, configured to confirm that communication with the second BMC is abnormal, acquire from the complex programmable logic device the response information of the second BMC to the watchdog feed signal, and verify the fault of the second BMC based on the response information;
the restart control unit 320, configured to send a restart instruction to the second BMC;
and the firmware repairing unit 330, configured to monitor the communication state with the second BMC, and to write a firmware file into the second BMC if communication with the second BMC is not restored within a set period.
Optionally, as an embodiment of the present invention, the fault confirming unit includes:
the abnormality judging module, configured to judge that communication is abnormal if no heartbeat signal is received within a specified period, the first BMC and the second BMC periodically sending heartbeat signals to each other;
the information acquisition module, configured to acquire the response information of the second BMC from the complex programmable logic device, wherein the complex programmable logic device periodically sends a watchdog feed signal to the first BMC and the second BMC and records the response information of the first BMC and the second BMC;
and the fault verification module, configured to match the response information of the second BMC against the watchdog feed signals sent by the complex programmable logic device, and to judge that the second BMC has failed if they do not match.
Optionally, as an embodiment of the present invention, the restart control unit includes:
the restart control module, configured to send a restart instruction to the second BMC so that the second BMC switches firmware by restarting and performing a rollback operation.
Optionally, as an embodiment of the present invention, the firmware repairing unit includes:
the timing execution module, configured to start a timer at the moment the restart instruction is sent;
the communication monitoring module, configured to monitor whether a heartbeat signal sent by the second BMC is received, and to reset the timer if a heartbeat signal sent by the second BMC is received;
and the file writing module, configured to judge that all the firmware of the second BMC is damaged if the timer reaches the set period, write the local firmware file into the firmware storage device of the second BMC, and control the second BMC to restart.
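How the three units hand off to one another can be sketched as a single function that takes each hardware interaction as a callback. Everything here is hypothetical apart from the unit numbers 310, 320 and 330: the function name `self_recover`, the callback parameters, and the returned status strings are illustrative, and the source does not say what happens when the watchdog responses do match, so that branch simply reports that no fault was verified.

```python
from typing import Callable


def self_recover(
    comm_abnormal: Callable[[], bool],
    watchdog_matched: Callable[[], bool],
    send_restart: Callable[[], None],
    heartbeat_within_period: Callable[[], bool],
    write_firmware: Callable[[], None],
) -> str:
    """Run one pass of the recovery flow on the first BMC (illustrative only)."""
    # Fault confirmation unit (310): heartbeat check, then CPLD record check.
    if not comm_abnormal():
        return "healthy"
    if watchdog_matched():
        return "fault_not_verified"  # assumption: no action when the CPLD record matches
    # Restart control unit (320): trigger restart so the second BMC can roll back.
    send_restart()
    # Firmware repairing unit (330): wait for a heartbeat within the set period.
    if heartbeat_within_period():
        return "recovered_by_rollback"
    write_firmware()   # all firmware judged damaged: write the local firmware file
    send_restart()     # control the second BMC to restart after writing
    return "firmware_rewritten"


actions: list[str] = []
result = self_recover(
    comm_abnormal=lambda: True,
    watchdog_matched=lambda: False,
    send_restart=lambda: actions.append("restart"),
    heartbeat_within_period=lambda: False,
    write_firmware=lambda: actions.append("write"),
)
print(result, actions)
```

Passing the hardware interactions in as callbacks keeps the flow testable without real BMC or CPLD access; a real implementation would replace them with whatever transport the platform provides.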
Fig. 4 is a schematic structural diagram of a terminal 400 according to an embodiment of the present invention, where the terminal 400 may be used to execute the interactive BMC self-recovery method according to the embodiment of the present invention.
The terminal 400 may include: a processor 410, a memory 420, and a communication unit 430. These components communicate via one or more buses. Those skilled in the art will appreciate that the structure shown in the drawing does not limit the invention: it may be a bus structure or a star structure, and it may include more or fewer components than shown, combine certain components, or arrange the components differently.
The memory 420 may be used to store instructions for execution by the processor 410, and may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. When the instructions in the memory 420 are executed by the processor 410, the terminal 400 is enabled to perform some or all of the steps of the method embodiments described above.
The processor 410 is the control center of the terminal. It connects the parts of the whole terminal through various interfaces and lines, and performs the functions of the terminal and/or processes data by running or executing software programs and/or modules stored in the memory 420 and invoking data stored in the memory. The processor may consist of integrated circuits (ICs), for example a single packaged IC, or multiple packaged ICs with the same or different functions connected together. For example, the processor 410 may include only a central processing unit (CPU). In the embodiment of the invention, the CPU may have a single operation core or multiple operation cores.
The communication unit 430 is configured to establish a communication channel so that the terminal can communicate with other terminals, receiving user data sent by other terminals or sending user data to other terminals.
The present invention also provides a computer storage medium in which a program may be stored; when executed, the program may perform some or all of the steps in the embodiments provided by the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
Therefore, by means of the communication between the server BMC and the satellite controller BMC, when the firmware stored in one flash of one BMC is damaged, the other BMC can quickly restore it to a normal state; when both flashes of the failed BMC are damaged, the other, normally working BMC can write firmware to the damaged BMC to restore its normal working state, providing strong support for the normal operation of the server.
It will be apparent to those skilled in the art that the techniques of the embodiments of the present invention may be implemented in software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution in the embodiments of the present invention may be embodied, in essence or in the part that contributes over the prior art, in the form of a software product. The software product is stored in a storage medium, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code, and includes several instructions for causing a computer terminal (which may be a personal computer, a server, a second terminal, a network terminal, or the like) to execute all or part of the steps of the method described in the embodiments of the present invention.
The same or similar parts between the various embodiments in this specification are referred to each other. In particular, for the terminal embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and reference should be made to the description in the method embodiment for relevant points.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative, e.g., the division of the elements is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interface, system or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
Although the present invention has been described in detail by way of preferred embodiments with reference to the accompanying drawings, the present invention is not limited thereto. Those skilled in the art may make various equivalent modifications and substitutions to the embodiments of the present invention without departing from its spirit and scope, and all such modifications and substitutions are intended to fall within the scope of the present invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for self-recovery of an interactive BMC, comprising:
the first BMC confirms that communication with the second BMC is abnormal, acquires from the complex programmable logic device the response information of the second BMC to the watchdog feed signal, and verifies the fault of the second BMC based on the response information;
sending a restart instruction to the second BMC;
and monitoring the communication state with the second BMC, and if communication with the second BMC is not restored within a set period, writing a firmware file into the second BMC.
2. The method of claim 1, wherein the first BMC confirming that communication with the second BMC is abnormal, acquiring from the complex programmable logic device the response information of the second BMC to the watchdog feed signal, and verifying the fault of the second BMC based on the response information comprises:
the first BMC and the second BMC periodically send heartbeat signals to each other, and if no heartbeat signal is received within a specified period, communication is judged to be abnormal;
the response information of the second BMC is acquired from the complex programmable logic device, wherein the complex programmable logic device periodically sends a watchdog feed signal to the first BMC and the second BMC and records the response information of the first BMC and the second BMC;
and the response information of the second BMC is matched against the watchdog feed signals sent by the complex programmable logic device, and if they do not match, the second BMC is judged to have failed.
3. The method of claim 1, wherein sending the restart instruction to the second BMC comprises:
sending the restart instruction to the second BMC so that the second BMC switches firmware by restarting and performing a rollback operation.
4. The method of claim 2, wherein monitoring the communication state with the second BMC and, if communication with the second BMC is not recovered within the set period, writing the firmware file into the second BMC comprises:
starting a timer at the same time as sending the restart instruction;
monitoring whether a heartbeat signal sent by the second BMC is received, and resetting the timer if the heartbeat signal sent by the second BMC is received;
and, if the timer reaches the set period, judging that all firmware of the second BMC is damaged, writing a local firmware file into a firmware storage device of the second BMC, and controlling the second BMC to restart.
5. An interactive BMC self-recovery system, comprising:
a fault confirmation unit, configured to confirm that communication between a first BMC and a second BMC is abnormal, acquire, from a complex programmable logic device, response information of the second BMC in response to a watchdog feeding signal, and verify a fault of the second BMC based on the response information;
a restart control unit, configured to send a restart instruction to the second BMC;
and a firmware repair unit, configured to monitor the communication state with the second BMC and, if communication with the second BMC is not recovered within a set period, write a firmware file into the second BMC.
6. The system of claim 5, wherein the fault confirmation unit comprises:
an abnormality judging module, configured such that the first BMC and the second BMC periodically send heartbeat signals to each other, and communication is judged abnormal if no heartbeat signal is received within a specified period;
an information acquisition module, configured to acquire the response information of the second BMC from the complex programmable logic device, wherein the complex programmable logic device periodically sends the watchdog feeding signal to the first BMC and the second BMC and records the response information of the first BMC and the second BMC;
and a fault verification module, configured to match the response information of the second BMC against the watchdog feeding signal sent by the complex programmable logic device, and to judge that the second BMC has failed if the two do not match.
7. The system of claim 5, wherein the restart control unit comprises:
a restart control module, configured to send the restart instruction to the second BMC so that the second BMC switches firmware by restarting and performing a rollback operation.
8. The system of claim 6, wherein the firmware repair unit comprises:
a timing execution module, configured to start a timer at the same time as the restart instruction is sent;
a communication monitoring module, configured to monitor whether a heartbeat signal sent by the second BMC is received, and to reset the timer if the heartbeat signal sent by the second BMC is received;
and a file writing module, configured to judge, if the timer reaches the set period, that all firmware of the second BMC is damaged, write a local firmware file into a firmware storage device of the second BMC, and control the second BMC to restart.
9. A terminal, comprising:
a processor;
a memory for storing execution instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1-4.
10. A computer readable storage medium storing a computer program, which when executed by a processor implements the method of any one of claims 1-4.
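
For illustration only, the flow recited in method claims 1-4 can be sketched as follows. This is a minimal simulation under stated assumptions, not the patented implementation: the `PeerBmc` class, the function names `peer_failed` and `recover_peer`, the returned status strings, and the modelling of the set period as a count of polling ticks are all hypothetical. In particular, "matching" the response information against the watchdog feeding signals is assumed here to mean that the two recorded sequences are equal, which the claims do not specify.

```python
from dataclasses import dataclass
from typing import Iterator, Sequence


@dataclass
class PeerBmc:
    """Hypothetical stand-in for the second BMC; not an API from the patent."""
    heartbeats: Iterator[bool]  # one entry per polling tick: was a heartbeat seen?
    restarts: int = 0
    firmware_writes: int = 0

    def restart(self) -> None:
        self.restarts += 1

    def write_firmware(self, image: bytes) -> None:
        self.firmware_writes += 1

    def heartbeat_received(self) -> bool:
        # An exhausted iterator is treated as "no heartbeat".
        return next(self.heartbeats, False)


def peer_failed(feed_signals: Sequence[int], responses: Sequence[int]) -> bool:
    """Claim 2 check (assumed semantics): the responses recorded by the CPLD
    must equal the watchdog feeding signals it sent."""
    return list(feed_signals) != list(responses)


def recover_peer(peer: PeerBmc, feed_signals: Sequence[int],
                 responses: Sequence[int], local_image: bytes,
                 timeout_ticks: int) -> str:
    """Run by the first BMC after it has already judged communication abnormal."""
    if not peer_failed(feed_signals, responses):
        return "peer-responsive"    # CPLD record does not confirm a fault
    peer.restart()                  # restart lets the peer roll back its firmware
    for _ in range(timeout_ticks):  # timing starts with the restart instruction
        if peer.heartbeat_received():
            return "recovered"      # heartbeat is back, so timing stops
    # Set period elapsed: judge all firmware on the peer to be damaged.
    peer.write_firmware(local_image)
    peer.restart()
    return "reflashed"
```

In this sketch the firmware file is written only after both checks agree (a missing heartbeat and a mismatched CPLD record) and the rollback restart has failed to restore the heartbeat within the set period, which mirrors the ordering of the claimed steps.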
CN202111334036.XA 2021-11-11 2021-11-11 Interactive BMC self-recovery method, system, terminal and storage medium Active CN114116280B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111334036.XA CN114116280B (en) 2021-11-11 2021-11-11 Interactive BMC self-recovery method, system, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111334036.XA CN114116280B (en) 2021-11-11 2021-11-11 Interactive BMC self-recovery method, system, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN114116280A CN114116280A (en) 2022-03-01
CN114116280B true CN114116280B (en) 2023-08-18

Family

ID=80378452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111334036.XA Active CN114116280B (en) 2021-11-11 2021-11-11 Interactive BMC self-recovery method, system, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN114116280B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115529261A (en) * 2022-08-31 2022-12-27 苏州浪潮智能科技有限公司 Multi-BMC communication method, device, equipment and storage medium
CN115858251B (en) * 2023-01-18 2023-05-16 苏州浪潮智能科技有限公司 Control method and device of substrate control unit, electronic equipment and storage medium
CN116820837A (en) * 2023-06-28 2023-09-29 合芯科技有限公司 Exception handling method and device for system component
CN116737471B (en) * 2023-08-04 2023-11-21 金舟远航(北京)信息产业有限公司 BIOS automatic switching method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2015109465A (en) * 2014-12-02 2016-10-10 ЭйАйСи ИНК. STAND WITH AUTO RECOVERY FUNCTION AND METHOD FOR AUTOMATIC RECOVERY FOR THIS STAND
CN109656739A (en) * 2018-12-10 2019-04-19 英业达科技有限公司 Method of servicing, system, mainboard and computer readable storage medium
CN109976949A (en) * 2019-03-28 2019-07-05 苏州浪潮智能科技有限公司 A kind of BMC failure mirror image rollback method for refreshing, device, terminal and storage medium
CN110209258A (en) * 2019-04-28 2019-09-06 北京达佳互联信息技术有限公司 Repositioning method, device, server cluster, electronic equipment and storage medium
CN111078452A (en) * 2019-12-13 2020-04-28 苏州浪潮智能科技有限公司 BMC firmware image recovery method and device
CN112231140A (en) * 2020-09-18 2021-01-15 苏州浪潮智能科技有限公司 Method, system, terminal and storage medium for fault recovery of BMC (baseboard management controller) of storage device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106936616B (en) * 2015-12-31 2020-01-03 伊姆西公司 Backup communication method and device

Also Published As

Publication number Publication date
CN114116280A (en) 2022-03-01

Similar Documents

Publication Publication Date Title
CN114116280B (en) Interactive BMC self-recovery method, system, terminal and storage medium
WO2022198972A1 (en) Method, system and apparatus for fault positioning in starting process of server
CN109471770B (en) System management method and device
CN102455950A (en) Firmware recovery system and method of base board management controller
WO2018095107A1 (en) Bios program abnormal processing method and apparatus
CN101364193A (en) BIOS automatic recovery method and computer and system using the method
CN102880527B (en) Data recovery method of baseboard management controller
US20130117518A1 (en) System controller, information processing system and method of saving and restoring data in the information processing system
US20200394144A1 (en) Information processing system, information processing device, bios updating method for information processing device, and bios updating program for information processing device
CN111722960A (en) Starting method, system, equipment and medium under CMOS information abnormity
CN113360347A (en) Server and control method thereof
CN115658113A (en) Server self-starting method and device, readable storage medium and electronic equipment
US10824517B2 (en) Backup and recovery of configuration files in management device
JP2017078998A (en) Information processor, log management method, and computer program
CN114116330B (en) Server performance testing method, system, terminal and storage medium
CN111221683A (en) Double-flash hot backup method, system, terminal and storage medium for data center switch
CN110620684A (en) Storage double-control split-brain-preventing method, system, terminal and storage medium
CN115098342A (en) System log collection method, system, terminal and storage medium
CN114116276A (en) BMC hang-up self-recovery method, system, terminal and storage medium
CN111427721B (en) Abnormality recovery method and device
CN113608603A (en) Method, system, equipment and storage medium for repairing PCIe fault equipment
CN114253573A (en) PCIe device firmware batch upgrading method, system, terminal and storage medium
CN114443446B (en) Hard disk indicator lamp control method, system, terminal and storage medium
CN111078452A (en) BMC firmware image recovery method and device
CN114385379B (en) Method, system, terminal and storage medium for detecting on-board information refreshing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant