WO2021190252A1 - Chip starting method and apparatus and computer device - Google Patents

Chip starting method and apparatus and computer device

Info

Publication number
WO2021190252A1
Authority
WO
WIPO (PCT)
Prior art keywords
chip
storage unit
processing
isolated
processing chips
Application number
PCT/CN2021/078549
Other languages
French (fr)
Chinese (zh)
Inventor
方志强
陈志平
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Publication of WO2021190252A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/4401Bootstrapping
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3055Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/445Program loading or initiating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/445Program loading or initiating
    • G06F9/44505Configuring for program initiating, e.g. using registry, configuration files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/445Program loading or initiating
    • G06F9/44505Configuring for program initiating, e.g. using registry, configuration files
    • G06F9/4451User profiles; Roaming

Definitions

  • This application relates to the field of computer technology, and in particular to a chip startup method, device and computer equipment.
  • The computer equipment may include multiple processing chips, which may form a chip topology for cooperative service processing.
  • When a single board is designed, the multiple processing chips can be numbered consecutively.
  • The processing chips can then be started sequentially, beginning with the processing chip with the smallest number.
  • If a certain processing chip fails, the processing chips numbered after it can no longer be started, which wastes processing resources.
  • The present application provides a chip startup method, device, and computer equipment, which allow the multiple processing chips that make up a chip topology to be started independently and save processing resources.
  • The technical solution is as follows:
  • In a first aspect, a chip startup method is provided, which is applied to a baseboard management controller (BMC) included in a computer device.
  • The computer device further includes a plurality of processing chips, a first storage unit, and a second storage unit corresponding to each processing chip, and the second storage unit corresponding to each processing chip stores the startup file of that processing chip.
  • The method includes: determining the chip to be isolated among the plurality of processing chips; sending a first configuration instruction to the first storage unit according to the chip to be isolated, where the first configuration instruction is used to instruct configuration of the enabled state of each processing chip stored in the first storage unit, and the enabled state is used to indicate whether to isolate the corresponding processing chip; and controlling the multiple processing chips to restart according to the enabled states of the multiple processing chips in the first storage unit and the startup file in the second storage unit corresponding to each processing chip.
  • The computer device includes a first storage unit.
  • The BMC can configure the enabled state of each processing chip stored in the first storage unit to indicate which processing chips are isolated and which start normally. In this way, each processing chip can read its enabled state from the first storage unit. If the enabled state indicates isolation, the processing chip does not start. If the enabled state indicates that isolation is not required, the processing chip can restart using the startup file stored in its corresponding second storage unit. This realizes the independent startup of each processing chip and avoids wasting chip resources in the computer equipment.
  • The method further includes: sending a second configuration instruction to the first storage unit according to the chip to be isolated, where the second configuration instruction is used to instruct reconfiguration of the chip numbers of the multiple processing chips stored in the first storage unit. After the reconfiguration, the chip numbers of the processing chips other than the chip to be isolated are continuous.
  • The BMC may also reconfigure the chip numbers stored in the first storage unit, so that links can be re-established according to the chip numbers when the chips are restarted.
  • The method further includes: sending a third configuration instruction to the first storage unit according to the chip to be isolated, where the third configuration instruction is used to instruct configuration of the health status of each processing chip stored in the first storage unit, and the health status is used to indicate whether the processing chip is faulty.
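As a rough illustration of what the first storage unit holds, the following Python sketch models one record per processing chip with an enabled state and a health status, and shows the first and third configuration instructions as two separate writes. The class names, field names, and method names are hypothetical and are not taken from the application; the real first storage unit may be a CPLD register block rather than a software object.

```python
from dataclasses import dataclass


@dataclass
class ChipRecord:
    """Per-chip entry in the first storage unit (hypothetical layout)."""
    enabled: bool = True   # enabled state: False means the chip is isolated
    healthy: bool = True   # health status: False means the chip is faulty


class FirstStorageUnit:
    """Toy stand-in for the first storage unit, indexed by slot ID."""

    def __init__(self, chip_count):
        self.records = [ChipRecord() for _ in range(chip_count)]

    def configure_enabled(self, isolated_slots):
        # First configuration instruction: set every chip's enabled state.
        for slot, record in enumerate(self.records):
            record.enabled = slot not in isolated_slots

    def configure_health(self, faulty_slots):
        # Third configuration instruction: record which chips are faulty.
        for slot, record in enumerate(self.records):
            record.healthy = slot not in faulty_slots


# Example: four chips, the chip in slot 1 is faulty and is isolated.
unit = FirstStorageUnit(4)
unit.configure_enabled({1})
unit.configure_health({1})
```

Keeping the two fields separate reflects that a chip may be isolated for reasons other than its own fault, so the enabled state and the health status need not always agree.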
  • The implementation process of determining the chip to be isolated among the plurality of processing chips may be: after the computer device is powered on, receiving fault detection information and link establishment information of each of the plurality of processing chips, where the fault detection information is used to indicate whether the corresponding processing chip is faulty, and the link establishment information is used to indicate whether a failure occurs when the corresponding processing chip establishes links with other processing chips; and determining the chip to be isolated from the multiple processing chips according to the fault detection information and the link establishment information.
  • The BMC may determine the chip to be isolated based on the fault detection information obtained by the self-check of the processing chips and the link establishment information after the computer device is powered on. That is, the embodiments of the present application can be applied to scenarios in which, when a computer device is powered on, some of the multiple processing chips are faulty and affect system startup, or some chips fail to establish a link, so that a reduced topology is applied to the multiple processing chips.
  • The implementation process of determining the chip to be isolated among the plurality of processing chips may also be: during the operation of the computer device, receiving the operating status information reported by the plurality of processing chips; if there is abnormal information in the operating status information, feeding the abnormal information back to the upper-level operation and maintenance system; receiving a restart instruction issued by the upper-level operation and maintenance system according to the abnormal information and service processing conditions, where the restart instruction carries indication information for indicating the chip to be isolated; and determining the chip to be isolated according to the indication information.
  • The BMC can obtain, in real time, the running status information reported by the multiple processing chips. If there is abnormal information in the running status information, the running status information can be reported to the upper-level operation and maintenance system.
  • The upper-level operation and maintenance system determines the chip to be isolated according to the abnormal information and the service processing conditions. That is, the embodiments of the present application can apply a reduced topology to the multiple processing chips according to service processing conditions while the computer device is running.
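The run-time path can be sketched as a small Python function: the BMC filters the reported status for abnormal entries, hands them to the upper-level operation and maintenance system, and reads the chips to be isolated from the returned restart instruction. Everything concrete here is an assumption made for illustration: the status strings, the dictionary shape of the restart instruction, and the `om_policy` function, which stands in for whatever decision the real operation and maintenance system makes.

```python
def handle_status_reports(reports, decide_isolation):
    """BMC-side handling of operating status reports (illustrative only).

    reports: dict mapping slot ID -> status string ("ok" or an abnormal code).
    decide_isolation: callable standing in for the upper-level operation and
        maintenance system; it receives the abnormal entries and returns a
        restart instruction (a dict with an "isolate" list) or None.
    """
    abnormal = {slot: status for slot, status in reports.items() if status != "ok"}
    if not abnormal:
        return []  # nothing to feed back, no chip is isolated
    restart_instruction = decide_isolation(abnormal)
    if restart_instruction is None:
        return []  # the upper-level system chose not to restart
    return sorted(restart_instruction["isolate"])


def om_policy(abnormal, busy_slots=frozenset({2})):
    # Hypothetical policy: isolate abnormal chips unless they are currently
    # carrying service that should not be interrupted.
    targets = [slot for slot in abnormal if slot not in busy_slots]
    return {"isolate": targets} if targets else None


isolated = handle_status_reports(
    {0: "ok", 1: "ecc_warning", 2: "over_temperature", 3: "ok"}, om_policy
)
```

In this example slots 1 and 2 both report abnormal information, but only slot 1 is isolated because the stand-in policy treats slot 2 as busy.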
  • In a second aspect, a computer device is provided. The computer device includes a baseboard management controller (BMC), a plurality of processing chips, a first storage unit, and a second storage unit corresponding to each processing chip.
  • Each processing chip is connected to its corresponding second storage unit (for example, a flash memory), and the second storage unit corresponding to each processing chip stores the startup file of that processing chip;
  • The BMC is used to determine the chip to be isolated among the multiple processing chips and send a first configuration instruction to the first storage unit, where the first configuration instruction is used to instruct configuration of the enabled states of the multiple processing chips stored in the first storage unit, and the enabled state is used to indicate whether to isolate the corresponding processing chip;
  • The BMC is also used to send a restart instruction to the multiple processing chips; each of the multiple processing chips is used to read its own enabled state from the first storage unit when the restart instruction is received and, if its own enabled state indicates that no isolation is performed, to read the startup file from the corresponding second storage unit and restart according to the read startup file.
  • The computer device includes a first storage unit.
  • The BMC can configure the enabled state of each processing chip stored in the first storage unit to indicate which processing chips are isolated and which start normally.
  • Each processing chip can read its enabled state from the first storage unit. A processing chip that needs to be isolated is not started, and a processing chip that does not need to be isolated is restarted through its corresponding startup file. This realizes the independent startup of each processing chip and avoids wasting chip resources in the computer equipment.
  • The BMC is further configured to send a second configuration instruction to the first storage unit according to the chip to be isolated, where the second configuration instruction is used to instruct reconfiguration of the chip numbers of the multiple processing chips stored in the first storage unit. After the reconfiguration, the chip numbers of the processing chips other than the chip to be isolated are continuous; each restarted processing chip among the multiple processing chips is further configured to read its corresponding chip number from the first storage unit and establish links according to the read chip number.
  • The BMC can also reconfigure the chip numbers stored in the first storage unit, so that when a processing chip is restarted, it can read its chip number from the first storage unit and re-establish links according to that chip number.
  • The computer device further includes a first storage unit; accordingly, the BMC is further configured to send a third configuration instruction to the first storage unit according to the chip to be isolated, where the third configuration instruction is used to instruct configuration of the health status of each processing chip stored in the first storage unit, and the health status is used to indicate whether the processing chip is faulty.
  • The BMC is configured to receive fault detection information and link establishment information of each of the multiple processing chips after the computer device is powered on, where the fault detection information is used to indicate whether the corresponding processing chip is faulty, and the link establishment information is used to indicate whether a failure occurs when the corresponding processing chip establishes links with other processing chips; and to determine the chip to be isolated from the multiple processing chips according to the fault detection information and link establishment information of the multiple processing chips.
  • the BMC may determine the chip to be isolated based on the fault detection information and link establishment information obtained by the processing chip self-check after the computer device is powered on. That is, the embodiments of the present application can be applied to scenarios where there are faulty chips that affect system startup among multiple processing chips when a computer device is powered on, or there are chips that fail to establish a link, so as to perform a reduced topology on multiple processing chips.
  • The BMC is used to receive the operating status information reported by the multiple processing chips during the operation of the computer device and, if there is abnormal information in the operating status information, to feed the abnormal information back to the upper-level operation and maintenance system;
  • The upper-level operation and maintenance system is used to issue a restart instruction based on the abnormal information and service processing conditions, where the restart instruction carries indication information for indicating the chip to be isolated; the BMC is used to determine the chip to be isolated according to the indication information.
  • In a third aspect, a computer device is provided. The computer device includes a plurality of processing chips, a first storage unit, and a second storage unit corresponding to each processing chip; each of the plurality of processing chips is connected to its corresponding second storage unit, and the second storage unit corresponding to each processing chip stores the startup file of that processing chip; the first storage unit is used to store the enabled states of the multiple processing chips, and the enabled state is used to indicate whether to isolate the corresponding processing chip; the plurality of processing chips are connected to the first storage unit, and each non-isolated chip among the plurality of processing chips is used to restart according to the enabled state stored in the first storage unit and the startup file in its corresponding second storage unit.
  • Each of the multiple processing chips in the computer device is connected to a second storage unit, and the second storage unit connected to each processing chip stores the startup file of that processing chip.
  • The multiple processing chips are also connected to the first storage unit, and the first storage unit stores the enabled state of each processing chip. In this way, each processing chip can read its own enabled state from the first storage unit and determine from it whether the processing chip needs to be started.
  • A processing chip that needs to be started can be restarted through the startup file in its corresponding second storage unit, which realizes the independent startup of each processing chip and avoids wasting chip resources in the computer equipment.
  • The first storage unit is a complex programmable logic device (CPLD); the second storage unit is a flash memory.
  • In a fourth aspect, a chip startup device is provided, and the chip startup device has the function of realizing the behavior of the chip startup method in the first aspect.
  • The chip startup device includes at least one module, and the at least one module is used to implement the chip startup method provided in the first aspect.
  • A computer-readable storage medium is provided. The storage medium stores instructions that, when run on a computer device, enable the BMC of the computer device to execute the chip startup method described in the first aspect.
  • A computer program product containing instructions is provided, which, when run on a computer device, causes the BMC of the computer device to execute the chip startup method described in the first aspect.
  • The computer device includes a first storage unit.
  • The BMC can configure the enabled state of each processing chip stored in the first storage unit to indicate which processing chips are isolated and which start normally.
  • The multiple processing chips can read the enabled states in the first storage unit through the BIOS deployed on them. The processing chips that need to be isolated are then not started, and the processing chips that do not need to be isolated are restarted through their corresponding startup files. This realizes the independent startup of each processing chip and avoids wasting chip resources in the computer equipment.
  • FIG. 1 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of another computer device provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of a chip startup method provided by an embodiment of the present application.
  • FIG. 4 is a flowchart of another chip startup method provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a chip activation device provided by an embodiment of the present application.
  • When a single board is designed, multiple processing chips can be numbered consecutively.
  • The processing chips can then be started sequentially, beginning with the processing chip with the smallest number.
  • However, forming a chip topology from multiple processing chips increases the complexity of the single board and, accordingly, its failure rate. For example, one of the multiple processing chips may malfunction. If the malfunctioning chip is near the front of the numbering sequence, the processing chips numbered after it cannot be started when the multiple processing chips are started.
  • FIG. 1 is a schematic structural diagram of a computer device 100 provided by an embodiment of the present application.
  • the computer device 100 includes a plurality of processing chips 101, a first storage unit 102, and a second storage unit 103 corresponding to each processing chip.
  • The description uses four processing chips 101 as an example, but this does not limit the number of processing chips.
  • multiple processing chips 101 can be connected to the first storage unit 102 respectively, and each processing chip 101 can be connected to its corresponding second storage unit 103. Each two processing chips 101 of the plurality of processing chips 101 can be connected to each other, thereby forming a chip topology.
  • the first storage unit 102 stores the enabled state of each processing chip 101, and the enabled state may be used to indicate whether to isolate the corresponding processing chip.
  • the second storage unit 103 corresponding to each processing chip 101 may also store a startup file of the corresponding processing chip. In this way, each processing chip 101 can determine whether it needs to be started by reading its own activation state stored in the first storage unit 102. For the processing chip 101 that needs to be started, it can read the startup file stored in the corresponding second storage unit 103 to restart. In this way, the independent startup of each processing chip in the computer device is realized, and the waste of processing resources is avoided.
  • the multiple processing chips 101 may be CPUs, AI chips, or other processing chips.
  • the first storage unit 102 may be a complex programmable logic device (CPLD) or other types of registers.
  • the second storage unit 103 may be a storage device such as flash memory, which is not limited in the embodiment of the present application.
  • Alternatively, each first storage unit 102 is connected to one processing chip 101 and stores the enabled state of the corresponding processing chip 101.
  • the computer device 100 may also include a BMC 104.
  • the BMC 104 can be connected to multiple processing chips 101 respectively, and the BMC 104 can also be connected to the first storage unit 102.
  • the BMC 104 can determine the chip to be isolated among the multiple processing chips 101.
  • the chip to be isolated may refer to a faulty chip detected after the computer device is powered on that affects the startup of the computer device, or may refer to a chip that is in a sub-health state and needs to be isolated according to business processing conditions.
  • The BMC 104 can receive the fault detection information and link establishment information of each of the multiple processing chips 101. The fault detection information can be used to indicate whether the corresponding processing chip is faulty. The link establishment information can be used to indicate whether a failure occurs when the corresponding processing chip establishes links with other processing chips; that is, it can indicate with which processing chips the corresponding processing chip established a link successfully and with which it failed.
  • The BMC 104 may determine the chip to be isolated from the multiple processing chips according to the received fault detection information and link establishment information.
  • A basic input/output system (BIOS) may be deployed on the multiple processing chips.
  • the BIOS can perform self-checks on the multiple processing chips 101.
  • the BIOS may send the fault detection information of the multiple processing chips 101 to the BMC 104.
  • the BIOS program may be stored in a separate BIOS chip, or the BIOS program may be directly stored in a second storage unit (for example, a flash chip) connected to the processing chip.
  • the BIOS program is loaded on the processing chip to run. At this time, the BIOS will perform a self-check and send the self-check result to the BMC 104.
  • The BMC 104 can determine the chips that fail to start due to their own failures among the multiple processing chips 101, and the chips that fail to establish a link with other processing chips. The BMC 104 can use these chips as the chips to be isolated.
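The power-on decision described above can be sketched in Python as follows. The sketch assumes each chip's BIOS reports a self-check result plus the set of peers with which link establishment failed; the `PowerOnReport` type and its field names are hypothetical. Following the text, a chip is treated as a chip to be isolated if its own self-check failed or if it reported any failed link.

```python
from dataclasses import dataclass, field


@dataclass
class PowerOnReport:
    """What a chip's BIOS reports to the BMC after power-on (hypothetical)."""
    slot_id: int
    self_check_ok: bool                               # fault detection information
    failed_links: set = field(default_factory=set)    # peers it could not link to


def chips_to_isolate(reports):
    """Return the slot IDs of chips that are faulty or failed link establishment."""
    isolate = []
    for report in reports:
        if not report.self_check_ok or report.failed_links:
            isolate.append(report.slot_id)
    return sorted(isolate)


reports = [
    PowerOnReport(0, True),
    PowerOnReport(1, False),        # failed its own self-check
    PowerOnReport(2, True, {3}),    # healthy, but could not link to slot 3
    PowerOnReport(3, True),
]
to_isolate = chips_to_isolate(reports)
```

With these reports, slots 1 and 2 are selected. A real implementation might weigh both ends of a failed link before deciding which chip to isolate; the text does not specify this, so the sketch keeps the simplest rule.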
  • the BIOS of the computer device 100 can also monitor the operating status of each processing chip 101 in real time, and report the operating status information of each processing chip 101 to the BMC 104.
  • the BMC 104 can detect whether there is abnormal information in the received operating status information. If there is abnormal information, the BMC 104 can determine the processing chip corresponding to the abnormal information as the chip to be isolated.
  • an upper-level operation and maintenance system may also be deployed on the multiple processing chips 101.
  • an upper-level operation and maintenance system may be deployed on the management server corresponding to the computer device.
  • the upper-level operation and maintenance system refers to a business system for fault information management.
  • After the BMC 104 receives the operating status information of each processing chip 101 reported by the BIOS in real time, it can also send the received operating status information to the upper-level operation and maintenance system.
  • the upper-level operation and maintenance system can detect whether there is abnormal information in the operating status information. If there is abnormal information, the upper-level operation and maintenance system can determine whether to isolate the processing chip corresponding to the abnormal information in combination with the current business processing situation.
  • the upper-level operation and maintenance system can isolate it.
  • The upper-level operation and maintenance system can send a restart instruction to the BMC 104, and the restart instruction can carry indication information for indicating the chip to be isolated.
  • The BMC 104 can determine the chip to be isolated according to the indication information carried in the restart instruction.
  • the indication information may be the slot ID of the chip to be isolated.
  • When the number of chips to be isolated is determined to be 1, another processing chip paired with the chip to be isolated can also be determined as a chip to be isolated. For example, assuming that the determined chip to be isolated is the processing chip with a slot ID of 1, then both processing chips in the pair with slot IDs 0 and 1 may be determined as chips to be isolated.
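The pairing rule in the example above can be sketched in Python. The text only gives the pair with slot IDs 0 and 1; the sketch assumes, purely for illustration, that chips are paired as (0, 1), (2, 3), and so on, so that a chip's partner is found by flipping the lowest bit of its slot ID. The function name is hypothetical.

```python
def expand_to_pairs(isolated_slots):
    """Add the paired chip of every chip to be isolated.

    Assumption (not stated in the application): chips are paired as
    (0, 1), (2, 3), ..., so the partner of a slot is slot XOR 1.
    """
    expanded = set()
    for slot in isolated_slots:
        expanded.add(slot)
        expanded.add(slot ^ 1)  # partner within the same pair
    return sorted(expanded)


pair = expand_to_pairs([1])
```

For the example in the text, isolating slot ID 1 yields the pair of slot IDs 0 and 1.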
  • the BMC 104 may send a first configuration instruction to the first storage unit 102 according to the chip to be isolated to configure the activation state of each processing chip stored in the first storage unit 102.
  • the first storage unit 102 can establish a communication connection with multiple processing chips 101.
  • the multiple processing chips 101 can read the corresponding enabled state from the first storage unit 102.
  • the enabled state can be used to indicate whether to isolate the corresponding processing chip.
  • A first tag field may be stored in the first storage unit 102. The first tag field includes a plurality of tag bits, each tag bit corresponds to one processing chip, the tag on each tag bit indicates the enabled state of the corresponding processing chip, and the tag bits can be arranged in ascending order of the slot IDs of the processing chips. In this way, after the BMC 104 determines the chip to be isolated, it can generate a tag field according to the slot ID of the chip to be isolated.
  • The generated tag field includes multiple tag bits, which are also arranged in ascending order of the slot IDs of the processing chips. In this tag field, the tag bit corresponding to the chip to be isolated carries the first tag, which indicates that the chip is to be isolated, and the tag bits corresponding to the other chips, which do not need to be isolated, carry the second tag, which indicates that these chips start normally.
  • The BMC 104 can send the tag field as a configuration instruction to the first storage unit 102 to overwrite the first tag field stored there, thereby configuring the enabled states of the multiple processing chips.
  • For example, the BMC 104 can generate a tag field of 1101, where 1 indicates no isolation and 0 indicates isolation.
  • the BMC 104 may send the tag field to the first storage unit 102 to overwrite the first tag field stored in the first storage unit 102.
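A minimal Python sketch of generating such a tag field is shown below. It represents the field as a string of `1` and `0` characters in ascending slot ID order and assumes slot IDs start at 0, so the `1101` example corresponds to isolating the chip with slot ID 2 in a four-chip system; that reading is an inference from the bit ordering, not something the text states. The function names are hypothetical.

```python
def build_enable_field(chip_count, isolated_slots):
    """Build the tag field: one character per chip, ascending slot ID.

    '1' means the chip is not isolated, '0' means it is isolated.
    """
    return "".join(
        "0" if slot in isolated_slots else "1" for slot in range(chip_count)
    )


def is_isolated(enable_field, slot):
    """Read back the enabled state of one chip from the tag field."""
    return enable_field[slot] == "0"


enable_field = build_enable_field(4, {2})  # "1101"
```

Overwriting the stored field with the newly built one is then a single assignment on whatever object models the first storage unit.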
  • Each processing chip 101 may correspond to one first storage unit 102; that is, each processing chip 101 may have its own independent first storage unit 102.
  • Each processing chip 101 is connected to its corresponding first storage unit 102.
  • The first storage unit 102 can be used to store the enabled state of the corresponding processing chip 101.
  • The enabled state stored in the first storage unit 102 corresponding to the chip to be isolated can be configured as the isolated state, and the enabled state in the first storage units 102 corresponding to the other processing chips can be configured as the non-isolated state.
  • The enabled state in the first storage unit 102 can also be implemented by means of tag bits. For example, for a chip to be isolated, a first tag can be stored in the corresponding first storage unit 102 to indicate that the chip is to be isolated, and for a chip that does not need to be isolated, a second tag can be stored in the corresponding first storage unit 102 to indicate that the corresponding processing chip is not isolated.
  • The BMC 104 can control the multiple processing chips 101 to restart according to the enabled state in the first storage unit 102 and the startup file in the second storage unit 103 corresponding to each processing chip 101.
  • The BMC 104 can send a restart instruction to the BIOS.
  • After receiving the restart instruction, the BIOS can control the multiple processing chips to read their respective enabled states from the first storage unit 102.
  • Alternatively, when each processing chip has its own first storage unit 102, the BIOS can control each processing chip to read its enabled state from its corresponding first storage unit 102.
  • A processing chip whose enabled state indicates isolation, that is, a chip to be isolated, is simply not started. For a processing chip whose enabled state indicates no isolation, the startup file stored in the second storage unit 103 corresponding to that processing chip can be read.
  • The processing chip is then reinitialized through the startup file so as to restart the processing chip.
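The restart step can be sketched in Python as a loop over the enabled states. The sketch reuses the `1`/`0` tag field representation, models each second storage unit as a file name, and returns a description of what happens to each chip instead of actually booting anything. The function name and file names are hypothetical.

```python
def restart_all(enable_field, startup_files):
    """Return the outcome per slot ID after a restart instruction.

    enable_field: string of '1' (start normally) and '0' (isolated),
        one character per chip in ascending slot ID order.
    startup_files: list of startup file names, one per second storage unit.
    """
    outcome = {}
    for slot, bit in enumerate(enable_field):
        if bit == "0":
            outcome[slot] = "isolated"  # chip to be isolated is not started
        else:
            # Non-isolated chip: read its own startup file and reinitialize.
            outcome[slot] = f"booted from {startup_files[slot]}"
    return outcome


result = restart_all("1101", ["boot0.bin", "boot1.bin", "boot2.bin", "boot3.bin"])
```

Because each chip consults only its own enabled state and its own startup file, the chip in slot 3 still boots even though the chip in slot 2 is skipped.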
  • the first storage unit 102 may also be used to store the chip number of each processing chip 101.
  • The BMC 104 may also send a second configuration instruction to the first storage unit 102, where the second configuration instruction is used to instruct reconfiguration of the chip numbers of the multiple processing chips 101 stored in the first storage unit 102.
  • After the reconfiguration, the chip numbers of the processing chips other than the chip to be isolated are continuous.
  • the first storage unit 102 may store a second flag field, and the second flag field can also be stored in the second flag field. It includes a plurality of mark bits, and each mark bit is used to record the original chip number of a processing chip 101. Each mark bit can be arranged in sequence from small to large or from large to small according to the slot ID of each processing chip, and the original chip numbers on the multiple mark bits are continuous and increase sequentially.
  • the second flag field can be 0123. If the flag bits are arranged in ascending order of the slot IDs of the processing chips, the second flag field indicates that the original chip number of the processing chip with slot ID 0 is 0, that of the processing chip with slot ID 1 is 1, that of the processing chip with slot ID 2 is 2, and that of the processing chip with slot ID 3 is 3.
  • if the flag bits are instead arranged in descending order of slot ID, the flag field indicates that the original chip number of the processing chip with slot ID 3 is 0, that of the processing chip with slot ID 2 is 1, that of the processing chip with slot ID 1 is 2, and that of the processing chip with slot ID 0 is 3.
  • after the BMC 104 determines the chip to be isolated, it can generate a flag field including multiple flag bits according to the above rules. The chip number on the flag bit corresponding to the chip to be isolated remains unchanged, while the chips on the other flag bits are renumbered.
  • the BMC 104 can number the chips on the other flag bits consecutively from 0 to obtain the modified flag field, and send that flag field as a configuration instruction to the first storage unit 102 to overwrite the second flag field stored there, thereby completing the reconfiguration of the chip numbers of the processing chips.
  • for example, suppose the processing chip ranked first by slot ID is the chip to be isolated, so that it corresponds to the first flag bit in the flag field 0123.
  • the original chip number 0 on the first flag bit can be kept unchanged, and the chips on the remaining flag bits are numbered consecutively from 0, so that the changed flag field is 0012.
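The renumbering rule can be illustrated with a short sketch. The function name `renumber` and the list representation of the flag field are assumptions made for illustration; the text only specifies that the isolated chip keeps its number and the remaining chips are numbered consecutively from 0.

```python
def renumber(original, isolated_positions):
    """Return a new flag field in which non-isolated chips are numbered from 0.

    original:           chip numbers by flag-bit position, e.g. [0, 1, 2, 3]
    isolated_positions: set of flag-bit positions of the chips to be isolated
    """
    result = list(original)
    next_number = 0
    for position in range(len(result)):
        if position in isolated_positions:
            continue  # the isolated chip keeps its original chip number
        result[position] = next_number
        next_number += 1
    return result


# The chip on the first flag bit is isolated: 0123 becomes 0012.
print(renumber([0, 1, 2, 3], {0}))  # [0, 0, 1, 2]
```

Note that the isolated chip may share a number with an active chip, as in 0012; this is harmless under the scheme described here because the isolated chip is never started.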
  • the chip number of the processing chip whose HCCS port is M0 in these processing chips can be changed to 0, where the HCCS port refers to the built-in
  • the corresponding processing chip connects and communicates with other processing chips, including three ports M0, M1, and M2.
  • a chip number of 0 can indicate that the processing chip is the master chip, and the remaining processing chips are the slave chips of the processing chip.
  • the BMC 104 can also use other methods to reconfigure the chip number stored in the first storage unit 102.
  • the BMC 104 can read the second flag field from the first storage unit 102, modify it by the above method, and then write the modified second flag field into the first storage unit 102 to overwrite the previously stored chip numbers.
  • the embodiment of the application does not limit this.
  • the BIOS can control each processing chip to read its chip number from the first storage unit 102 while restarting the processing chips, and establish the connections between the processing chips according to the chip number of each processing chip, that is, build the chain.
  • the first storage unit may also store the health status of each processing chip, and the health status is used to indicate whether the corresponding processing chip has a fault.
  • after the BMC 104 determines the chip to be isolated, it can also send a third configuration instruction to the first storage unit 102 according to the chip to be isolated, to configure the health status of each processing chip.
  • the first storage unit 102 can also use multiple flag bits included in a flag field to record the health status of each processing chip.
  • the BMC 104 can also refer to the aforementioned method for configuring the enabled state in the first storage unit 102 to configure the health status of each processing chip in the first storage unit 102.
  • the first configuration instruction, the second configuration instruction, and the third configuration instruction sent by the BMC 104 to the first storage unit 102 may also be combined into one instruction for transmission, that is, the BMC 104 may configure the activation state, chip number, and health state of each processing chip stored in the first storage unit 102 through a single configuration instruction.
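A combined configuration instruction could be modeled as one record carrying all three kinds of information. The sketch below is only illustrative: the class `CombinedConfig`, the function `apply_config`, and the dictionary keys are hypothetical names, and a dictionary stands in for the first storage unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CombinedConfig:
    """Hypothetical single instruction covering all three configurations."""
    isolated: tuple      # per chip: True if the chip is to be isolated
    chip_numbers: tuple  # per chip: reconfigured chip number
    healthy: tuple       # per chip: False if the chip has a fault


def apply_config(storage, config):
    """Overwrite the three fields of the (simulated) first storage unit at once."""
    storage["enabled_state"] = list(config.isolated)
    storage["chip_numbers"] = list(config.chip_numbers)
    storage["health_state"] = list(config.healthy)
    return storage


first_storage_unit = {}
config = CombinedConfig(
    isolated=(True, False, False, False),
    chip_numbers=(0, 0, 1, 2),
    healthy=(False, True, True, True),
)
print(apply_config(first_storage_unit, config))
```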
  • the activation state of each processing chip described above may be stored in the first storage unit 102, and the chip number and health state of each processing chip may be stored in other storage units, respectively.
  • the BMC may send corresponding configuration instructions to the corresponding storage unit to configure the information stored in the storage unit.
  • the above embodiment mainly introduces the process of configuring the state in the first storage unit through BMC.
  • other methods can also be used to configure the information in the first storage unit, so that the multiple processing chips can be restarted by reading the enabled state in the first storage unit.
  • the computer device includes a first storage unit.
  • the BMC can configure the activation state of each processing chip stored in the first storage unit to indicate which of the processing chips are to be isolated and which are to start normally.
  • the BIOS deployed on multiple processing chips can read the activation status of each processing chip stored in the first storage unit.
  • a processing chip that needs to be isolated is not started, and a processing chip that does not need to be isolated can be restarted through the startup file stored in its corresponding second storage unit.
  • this realizes the independent startup of each processing chip and avoids the waste of chip resources in the computer equipment.
  • because each processing chip can be started independently and does not depend on the startup of other processing chips, other processing chips can still be started normally even if the main processing chip fails, which improves the reliability of the computer equipment.
  • FIG. 3 is a flowchart of a chip startup method provided by an embodiment of the present application. This method can be applied to the BMC of the computer equipment shown in FIGS. 1 and 2. Referring to FIG. 3, the method includes:
  • Step 301 Determine the chip to be isolated among the multiple processing chips.
  • the chip to be isolated may refer to a faulty chip detected after the computer device is powered on that affects the startup of the computer device, or may refer to a chip that is in a sub-health state and needs to be isolated according to business processing conditions.
  • the BMC may determine the faulty chip according to the fault detection information and link establishment information of each processing chip reported by the BIOS deployed on the multiple processing chips after the computer device is powered on, and use the faulty chip as the chip to be isolated.
  • the BMC may determine the chip to be isolated according to the operating state information of each processing chip detected in real time by the BIOS during the operation of the computer.
  • the BMC can directly determine the chip to be isolated based on the abnormal information in the operating status information of each processing chip, or it can report the operating status information to the upper-level operation and maintenance system, and the upper-level operation and maintenance system can determine the chip to be isolated according to the operating status information and the current business processing situation.
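For the power-on case, the selection of chips to isolate can be sketched as below. The text does not state the exact decision rule, so this sketch assumes that a chip is isolated if its self-check reported a fault or its link establishment failed; the function name `chips_to_isolate` and the dictionary inputs are likewise assumptions.

```python
def chips_to_isolate(fault_detected, link_failed):
    """Return the slot IDs of the chips to isolate after power-on.

    fault_detected: {slot_id: True if the chip's self-check reported a fault}
    link_failed:    {slot_id: True if the chip failed to establish a link}
    Assumed rule: isolate a chip if either condition holds.
    """
    return sorted(
        slot_id
        for slot_id in fault_detected
        if fault_detected[slot_id] or link_failed.get(slot_id, False)
    )


fault_detected = {0: False, 1: True, 2: False, 3: False}
link_failed = {0: False, 1: False, 2: False, 3: True}
print(chips_to_isolate(fault_detected, link_failed))  # [1, 3]
```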
  • Step 302 Send a first configuration instruction to the first storage unit, where the first configuration instruction is used to instruct configuration of the enabled state of each processing chip stored in the first storage unit, and the enabled state is used to indicate whether to isolate the corresponding processing chip.
  • the BMC may send a first configuration instruction to the first storage unit according to the chip to be isolated, so as to configure the activation state of each processing chip stored in the first storage unit.
  • the BMC can also send a second configuration instruction to the first storage unit to reconfigure the chip numbers of the processing chips stored in the first storage unit, and send a third configuration instruction to the first storage unit to configure the health status of each processing chip stored in the first storage unit.
  • Step 303 Control the non-isolated chips among the multiple processing chips to restart according to the enabled state stored in the first storage unit and the corresponding startup file in the second storage unit.
  • the BMC can send a restart instruction to the BIOS, and the BIOS reads the enabled state of each processing chip in the first storage unit.
  • for a processing chip whose enabled state indicates isolation, the BIOS may not start the chip; for a processing chip whose enabled state indicates no isolation, the BIOS may reinitialize the chip to restart it.
  • the restart instruction can be generated by the BMC itself. If the chip to be isolated is determined during the operation of the computer device, the restart instruction can be the restart instruction sent by the upper-level operation and maintenance system.
  • when the BMC has also reconfigured the chip numbers in the first storage unit, the BIOS can, when restarting the processing chips, also read the reconfigured chip number of each processing chip from the first storage unit, and then establish the connections between the restarted processing chips according to the reconfigured chip numbers.
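Steps 302 and 303 on the BMC side can be put together in a small simulation, taking the result of step 301 as an input. Everything named here (`FirstStorageUnit`, `bmc_restart_flow`, `bios_restart`) is a hypothetical stand-in: the real BMC and BIOS communicate through hardware interfaces that the text does not detail, so the BIOS is modeled as a plain callback.

```python
class FirstStorageUnit:
    """Simulated first storage unit holding one enabled state per slot."""

    def __init__(self, slot_ids):
        self.isolated = {slot_id: False for slot_id in slot_ids}


def bios_restart(storage):
    """Simulated BIOS: restart only the chips whose state indicates no isolation."""
    return [slot for slot, isolated in sorted(storage.isolated.items()) if not isolated]


def bmc_restart_flow(storage, to_isolate, send_restart):
    """Run steps 302 and 303 for a given set of chips to isolate (step 301 result)."""
    # Step 302: configure the enabled state of every chip in the first storage unit.
    for slot_id in storage.isolated:
        storage.isolated[slot_id] = slot_id in to_isolate
    # Step 303: send the restart instruction; the BIOS acts on the stored states.
    return send_restart(storage)


storage = FirstStorageUnit(range(4))
print(bmc_restart_flow(storage, {2}, bios_restart))  # [0, 1, 3]
```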
  • the computer device includes a first storage unit.
  • the BMC can configure the activation state of each processing chip stored in the first storage unit to indicate which of the processing chips are to be isolated and which are to start normally.
  • the BIOS can read the enabled state in the first storage unit, not start the processing chips that need to be isolated, and restart the processing chips that do not need to be isolated through their corresponding startup files, which realizes the independent startup of each processing chip and avoids the waste of chip resources in the computer equipment.
  • because each processing chip can be started independently and does not depend on the startup of other processing chips, other processing chips can still be started normally even if the main processing chip fails, which improves the reliability of the computer equipment.
  • FIG. 4 is a flowchart of another chip startup method provided by an embodiment of the present application. Referring to FIG. 4, the method includes the following steps:
  • Step 401 When the computer equipment is powered on, the BIOS deployed on the multiple processing chips controls the multiple processing chips to perform chip self-checks to obtain fault detection information and link establishment information of the multiple processing chips.
  • the BIOS is deployed on the CPU of the computer equipment.
  • the BIOS can control each processing chip to perform self-checking to obtain the fault detection information and link establishment information of each processing chip.
  • the fault detection information can be used to indicate whether the corresponding processing chip has a fault.
  • the link establishment information is used to indicate whether the corresponding processing chip has a failure when establishing a connection with other processing chips, that is, it indicates whether the corresponding processing chip has successfully established a link with other processing chips.
  • Step 402 The BIOS sends fault detection information and link establishment information of multiple processing chips to the BMC.
  • Step 403 The BMC determines the chip to be isolated according to the fault detection information and link establishment information of the multiple processing chips.
  • Step 404 The BMC sends a first configuration instruction to the first storage unit to configure the activation state of each processing chip stored in the first storage unit.
  • Step 405 The BMC sends a restart instruction to the BIOS.
  • the restart instruction may be generated after the BMC determines that there is a faulty processing chip based on the fault detection information and the link establishment information.
  • Step 406 The BIOS reads the activation status of each processing chip from the first storage unit, and restarts according to the activation status of each processing chip and the startup file in the second storage unit corresponding to each processing chip.
  • Step 407 During the normal operation of the computer device, the BIOS detects multiple processing chips in real time to obtain operating status information of the multiple processing chips.
  • the BIOS may obtain the operating status information of the multiple processing chips at predetermined time intervals. The operating status information is used to indicate whether the operating status of the corresponding processing chip is abnormal.
  • Step 408 The BIOS feeds back the operating status information to the upper-level operation and maintenance system through the BMC.
  • the BIOS may report the operating state information only when it detects that there is abnormal information in the operating state information that is used to indicate the abnormal operating state of the processing chip.
  • the BIOS may also report every time the operating status information is acquired, regardless of whether there is abnormal information.
  • Step 409 The upper-level operation and maintenance system determines the chip to be isolated according to the operating status information and the current business processing situation.
  • after the upper-level operation and maintenance system receives the operating status information fed back by the BIOS, it can determine the chip to be isolated according to the implementation manner described in the foregoing embodiment, and then generate a restart instruction.
  • the restart instruction carries indication information that indicates the chip to be isolated.
  • the indication information may be the slot ID of the chip to be isolated.
  • Step 410 The upper-level operation and maintenance system sends the restart instruction to the BMC.
  • Step 411 The BMC determines the chip to be isolated according to the restart instruction.
  • Step 412 The BMC sends a first configuration instruction to the first storage unit to configure the activation state of each processing chip stored in the first storage unit.
  • Step 413 The BMC sends a restart instruction to the BIOS.
  • the restart instruction may be a restart instruction sent by the upper-layer operation and maintenance system received by the BMC.
  • Step 414 The BIOS reads the activation status of each processing chip from the first storage unit, and restarts according to the activation status of each processing chip and the startup file in the second storage unit corresponding to each processing chip.
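Steps 411 and 412 of the runtime path can be sketched as follows. The text says only that the indication information may be the slot ID of the chip to be isolated; the class `RestartInstruction`, its field name, and the function `enabled_state_from_instruction` are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestartInstruction:
    """Hypothetical restart instruction from the upper-level O&M system."""
    isolate_slot_ids: frozenset  # indication information: slot IDs to isolate


def enabled_state_from_instruction(instruction, slot_ids):
    """Step 411 and 412: derive the enabled state to write to the first storage unit.

    Returns {slot_id: True if the chip is to be isolated}.
    """
    return {slot_id: slot_id in instruction.isolate_slot_ids for slot_id in slot_ids}


instruction = RestartInstruction(isolate_slot_ids=frozenset({3}))
print(enabled_state_from_instruction(instruction, range(4)))
# {0: False, 1: False, 2: False, 3: True}
```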
  • the computer device includes a first storage unit.
  • the BMC can configure the activation state of each processing chip stored in the first storage unit to indicate which of the processing chips are to be isolated and which are to start normally.
  • the BIOS deployed on multiple processing chips can read the enabled state in the first storage unit, not start the processing chips that need to be isolated, and restart the processing chips that do not need to be isolated through their corresponding startup files, which realizes the independent startup of each processing chip and avoids the waste of chip resources in the computer equipment.
  • because each processing chip can be started independently and does not depend on the startup of other processing chips, other processing chips can still be started normally even if the main processing chip fails, which improves the reliability of the computer equipment.
  • processing chips can also be isolated according to the operating status information and business processing conditions of the processing chips, which allows flexible deployment of the processing chips so that the computer equipment can better provide services.
  • an embodiment of the present application provides a chip activation device 500.
  • the chip activation device 500 can be applied to the BMC of the computer equipment introduced in the foregoing embodiments.
  • the device 500 includes:
  • the determining module 501 is executed by the BMC to implement step 301 in the foregoing embodiment
  • the configuration module 502 is executed by the BMC to implement step 302 in the foregoing embodiment
  • the control module 503 is executed by the BMC to implement step 303 in the foregoing embodiment.
  • the configuration module 502 is also used to:
  • send a second configuration instruction to the first storage unit according to the chip to be isolated.
  • the second configuration instruction is used to instruct reconfiguration of the chip numbers of the multiple processing chips stored in the first storage unit.
  • after the reconfiguration, the chip numbers of the processing chips other than the chip to be isolated are continuous.
  • the configuration module 502 is also used to:
  • send a third configuration instruction to the first storage unit according to the chip to be isolated.
  • the third configuration instruction is used to instruct configuration of the health status of each processing chip stored in the first storage unit, and the health status is used to indicate whether the processing chip has a fault.
  • the determining module 501 is specifically configured to:
  • after the computer device is powered on, receive the fault detection information and link establishment information of each of the multiple processing chips.
  • the fault detection information is used to indicate whether the corresponding processing chip has a fault, and the link establishment information is used to indicate whether the corresponding processing chip has a fault when establishing a connection with other processing chips;
  • according to the fault detection information and link establishment information of the multiple processing chips, determine the chip to be isolated from the multiple processing chips.
  • the determining module 501 is specifically configured to:
  • during operation of the computer device, receive the operating status information reported by the multiple processing chips; if abnormal information exists in the operating status information, feed back the abnormal information to the upper-level operation and maintenance system;
  • receive a restart instruction issued by the upper-level operation and maintenance system based on the abnormal information and business processing conditions, the restart instruction carrying indication information that indicates the chip to be isolated;
  • according to the indication information, determine the chip to be isolated.
  • the computer device includes a first storage unit.
  • the BMC can configure the activation state of each processing chip stored in the first storage unit to indicate which of the processing chips are to be isolated and which are to start normally.
  • the BIOS deployed on multiple processing chips can read the enabled state in the first storage unit, not start the processing chips that need to be isolated, and restart the processing chips that do not need to be isolated through their corresponding startup files, which realizes the independent startup of each processing chip and avoids the waste of chip resources in the computer equipment.
  • because each processing chip can be started independently and does not depend on the startup of other processing chips, other processing chips can still be started normally even if the main processing chip fails, which improves the reliability of the computer equipment.
  • when the chip startup device provided in the above embodiment performs a chip restart, the division into the above-mentioned functional modules is used only as an example for illustration.
  • in practical applications, the above-mentioned functions can be allocated to different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the chip activation device provided in the foregoing embodiment and the chip activation method embodiment belong to the same concept, and the specific implementation process is detailed in the method embodiment, and will not be repeated here.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions may be transmitted from a website, computer, server, or data center.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center integrated with one or more available media.
  • the usable medium may be a magnetic medium (for example: floppy disk, hard disk, tape), an optical medium (for example: Digital Versatile Disc (DVD)), or a semiconductor medium (for example: Solid State Disk (SSD)), and so on.
  • the program can be stored in a computer-readable storage medium.
  • the storage medium mentioned can be a read-only memory, a magnetic disk or an optical disk, etc.

Abstract

Provided are a chip starting method and apparatus and a computer device. After determining a chip to be isolated, a BMC configures the enabling state of each processing chip stored in a first storage unit (102) to indicate which of the processing chips (101) need to be isolated and which of the processing chips need to be normally started. By reading the enabling state in the first storage unit (102), a BIOS deployed on multiple processing chips does not start the processing chips needing to be isolated and restarts, by means of corresponding starting files, the processing chips not needing to be isolated, so that separate starting of all the processing chips is realized and waste of chip resources in the computer device is avoided. Since each processing chip can be separately started without depending on other processing chips, even if the main processing chip breaks down, other processing chips can still be normally started, thereby improving the reliability of a computer device.

Description

Chip startup method, apparatus, and computer device
This application claims priority to Chinese Patent Application No. 202010218853.8, filed with the China Patent Office on March 25, 2020 and entitled "Chip startup method, apparatus, and computer device", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of computer technologies, and in particular to a chip startup method, apparatus, and computer device.
Background
Currently, to improve the processing capability of a computer device, the computer device may include multiple processing chips, which may form a chip topology to process services cooperatively.
In the related art, multiple processing chips can be numbered consecutively when a board is designed. When the computer device starts, the processing chips can be started one by one, beginning with the processing chip with the smallest number. If a processing chip fails, the processing chips numbered after it can no longer be started, which wastes processing resources.
Summary of the Invention
This application provides a chip startup method, apparatus, and computer device, which can realize the independent startup of the multiple processing chips that make up a chip topology and save processing resources. The technical solution is as follows:
In a first aspect, a chip startup method is provided, applied to a baseboard management controller (BMC) included in a computer device. The computer device further includes multiple processing chips, a first storage unit, and a second storage unit corresponding to each processing chip, and the second storage unit corresponding to each processing chip stores the startup file of that processing chip. The method includes: determining a chip to be isolated among the multiple processing chips; sending, according to the chip to be isolated, a first configuration instruction to the first storage unit, where the first configuration instruction is used to instruct configuration of the enabled state of each processing chip stored in the first storage unit, and the enabled state is used to indicate whether to isolate the corresponding processing chip; and controlling the multiple processing chips to restart according to the enabled states of the multiple processing chips in the first storage unit and the startup file in the second storage unit corresponding to each processing chip.
In the embodiments of this application, the computer device includes a first storage unit. After determining the chip to be isolated, the BMC can configure the enabled state of each processing chip stored in the first storage unit to indicate which processing chips are to be isolated and which are to start normally. In this way, the multiple processing chips can read the enabled state in the first storage unit: if the enabled state indicates isolation, the processing chip does not start; if the enabled state indicates that isolation is not required, the processing chip can restart through the startup file stored in its corresponding second storage unit. This realizes the independent startup of each processing chip and avoids wasting chip resources in the computer device.
Optionally, the method further includes: sending, according to the chip to be isolated, a second configuration instruction to the first storage unit, where the second configuration instruction is used to instruct reconfiguration of the chip numbers of the multiple processing chips stored in the first storage unit, and after the reconfiguration, the chip numbers of the processing chips other than the chip to be isolated are continuous.
In the embodiments of this application, the BMC can also reconfigure the chip numbers stored in the first storage unit, so that the chain can be re-established according to these chip numbers when the chips restart.
Optionally, the method further includes: sending, according to the chip to be isolated, a third configuration instruction to the first storage unit, where the third configuration instruction is used to instruct configuration of the health status of each processing chip stored in the first storage unit, and the health status is used to indicate whether the processing chip has a fault.
Optionally, determining the chip to be isolated among the multiple processing chips may be implemented as follows: after the computer device is powered on, receiving the fault detection information and link establishment information of each of the multiple processing chips, where the fault detection information is used to indicate whether the corresponding processing chip has a fault, and the link establishment information is used to indicate whether the corresponding processing chip has a fault when establishing a connection with other processing chips; and determining the chip to be isolated from the multiple processing chips according to the fault detection information and link establishment information of the multiple processing chips.
In the embodiments of this application, after the computer device is powered on, the BMC can determine the chip to be isolated according to the information obtained by the self-check of the processing chips and the link establishment information. That is, the embodiments of this application can be applied to scenarios in which, when the computer device is powered on, the multiple processing chips include a faulty chip that affects system startup or a chip that fails to establish a link, so as to reduce the topology of the multiple processing chips.
Optionally, determining the chip to be isolated among the multiple processing chips may be implemented as follows: during operation of the computer device, receiving the operating status information reported by the multiple processing chips; if abnormal information exists in the operating status information, feeding back the abnormal information to the upper-level operation and maintenance system; receiving a restart instruction issued by the upper-level operation and maintenance system according to the abnormal information and business processing conditions, where the restart instruction carries indication information that indicates the chip to be isolated; and determining the chip to be isolated according to the indication information.
In the embodiments of this application, while the computer device is running after startup, the BMC can obtain the operating status information reported in real time by the multiple processing chips. If abnormal information exists in the operating status information, the BMC can report the operating status information to the upper-level operation and maintenance system, which determines the chip to be isolated according to the abnormal information and business processing conditions. That is, the embodiments of this application can reduce the topology of the multiple processing chips according to business processing conditions while the computer device is running.
第二方面,提供了一种计算机设备,所述计算机设备包括基板管理控制器BMC、多个处理芯片、第一存储单元以及每个处理芯片对应的第二存储单元,每个处理芯片与对应的闪存连接,且每个处理芯片对应的闪存中存储有相应处理芯片的启动文件;所述BMC用于确定所述多个处理芯片中的待隔离芯片,向所述第一存储单元发送第一配置指令,所述第一配置指令用于指示对所述第一存储单元中存储的所述多个处理芯片的启用状态进行配置,所述启用状态用于指示是否对相应处理芯片进行隔离;所述BMC还用于向所述多个处理芯片发送重启指令;所述多个处理芯片中的每个处理芯片用于在接收到所述重启指令时,从所述第一存储单元中读取自身的启用状态,如果自身的启用状态用于指示不进行隔离,则从对应的第二存储单元中读取启动文件,根据读取到的启动文件进行重新启动。In a second aspect, a computer device is provided. The computer device includes a substrate management controller BMC, a plurality of processing chips, a first storage unit, and a second storage unit corresponding to each processing chip. The flash memory is connected, and the flash memory corresponding to each processing chip stores the startup file of the corresponding processing chip; the BMC is used to determine the chip to be isolated among the multiple processing chips, and send the first configuration to the first storage unit Instruction, the first configuration instruction is used to instruct to configure the enabled states of the multiple processing chips stored in the first storage unit, and the enabled state is used to indicate whether to isolate the corresponding processing chips; The BMC is also used to send a restart instruction to the multiple processing chips; each processing chip of the multiple processing chips is used to read its own data from the first storage unit when the restart instruction is received. In the enabled state, if its own enabled state is used to indicate that no isolation is performed, the startup file is read from the corresponding second storage unit, and the startup file is restarted according to the read startup file.
In the embodiments of this application, the computer device includes a first storage unit. After determining the chip to be isolated, the BMC can configure the enabled state of each processing chip stored in the first storage unit, indicating which processing chips are to be isolated and which are to start normally. Each processing chip can then read its enabled state from the first storage unit: a processing chip that needs to be isolated does not start, and a processing chip that does not need to be isolated restarts using its corresponding startup file. This allows each processing chip to start independently and avoids wasting chip resources in the computer device.
Optionally, the BMC is further configured to send a second configuration instruction to the first storage unit according to the chip to be isolated. The second configuration instruction instructs reconfiguration of the chip numbers of the multiple processing chips stored in the first storage unit, such that after reconfiguration the chip numbers of the processing chips other than the chip to be isolated are consecutive. A restarted processing chip among the multiple processing chips is further configured to read its corresponding chip number from the first storage unit and to establish links according to the chip number that was read.
In the embodiments of this application, the BMC can also reconfigure the chip numbers stored in the first storage unit, so that a processing chip can read its chip number from the first storage unit when it restarts and re-establish links according to that chip number.
Optionally, the computer device further includes a first storage unit. Accordingly, the BMC is further configured to send a third configuration instruction to the first storage unit according to the chip to be isolated. The third configuration instruction instructs configuration of the health status of each processing chip stored in the first storage unit, where the health status indicates whether the processing chip is faulty.
Optionally, the BMC is configured to receive, after the computer device is powered on, fault detection information and link establishment information for each of the multiple processing chips. The fault detection information indicates whether the corresponding processing chip is faulty, and the link establishment information indicates whether the corresponding processing chip encountered a fault when establishing connections with other processing chips. The BMC determines the chip to be isolated from the multiple processing chips according to the fault detection information and link establishment information of the multiple processing chips.
In the embodiments of this application, after the computer device is powered on, the BMC can determine the chip to be isolated according to the fault detection information and link establishment information obtained from the self-check of the processing chips. In other words, the embodiments of this application apply to scenarios in which, at power-on, the multiple processing chips include a faulty chip that affects system startup or a chip that fails to establish a link, and the topology of the multiple processing chips is reduced accordingly.
Optionally, the BMC is configured to receive, while the computer device is running, the running status information reported by the multiple processing chips and, if the running status information contains abnormal information, to feed the abnormal information back to the upper-layer O&M system. The upper-layer O&M system is configured to issue a restart instruction according to the abnormal information and the service processing situation, where the restart instruction carries indication information that identifies the chip to be isolated. The BMC is configured to determine the chip to be isolated according to the indication information.
In the embodiments of this application, the computer device includes a first storage unit. After determining the chip to be isolated, the BMC can configure the enabled state of each processing chip stored in the first storage unit, indicating which processing chips are to be isolated and which are to start normally. Each processing chip can then read its enabled state from the first storage unit: a processing chip that needs to be isolated does not start, and a processing chip that does not need to be isolated restarts using its corresponding startup file. This allows each processing chip to start independently and avoids wasting chip resources in the computer device.
In a third aspect, a computer device is provided. The computer device includes multiple processing chips, a first storage unit, and a second storage unit corresponding to each processing chip. Each of the multiple processing chips is connected to its corresponding second storage unit, and the second storage unit corresponding to each processing chip stores the startup file of that processing chip. The first storage unit is configured to store the enabled states of the multiple processing chips, where an enabled state indicates whether the corresponding processing chip is to be isolated. The multiple processing chips are connected to the first storage unit, and the chips among them that are not isolated are configured to restart according to the enabled state stored in the first storage unit and the startup file in the corresponding second storage unit.
In the embodiments of this application, each of the multiple processing chips in the computer device is connected to a second storage unit, and the second storage unit connected to each processing chip stores the startup file of that processing chip. In addition, the multiple processing chips are all connected to the first storage unit, which stores the enabled state of each processing chip. Each processing chip can therefore read its own enabled state from the first storage unit to determine whether to start, and a processing chip that needs to start can restart using the startup file in its corresponding second storage unit. This allows each processing chip to start independently and avoids wasting chip resources in the computer device.
The first storage unit is a complex programmable logic device (CPLD), and the second storage unit is a flash memory.
In a fourth aspect, a chip startup apparatus is provided. The chip startup apparatus has the function of implementing the behavior of the chip startup method in the first aspect. The chip startup apparatus includes at least one module, and the at least one module is configured to implement the chip startup method provided in the first aspect.
In a fifth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores instructions that, when run on a computer device, enable the BMC of the computer device to execute the chip startup method described in the first aspect.
In a sixth aspect, a computer program product containing instructions is provided. When the computer program product runs on a computer device, it causes the BMC of the computer device to execute the chip startup method described in the first aspect.
The technical effects obtained by the second, third, fourth, and fifth aspects above are similar to those obtained by the corresponding technical means in the first aspect, and are not repeated here.
The beneficial effects of the technical solutions provided by this application include at least the following:
In the embodiments of this application, the computer device includes a first storage unit. After determining the chip to be isolated, the BMC can configure the enabled state of each processing chip stored in the first storage unit, indicating which processing chips are to be isolated and which are to start normally. The multiple processing chips can then read the enabled states in the first storage unit through the BIOS deployed on them: a processing chip that needs to be isolated does not start, and a processing chip that does not need to be isolated restarts using its corresponding startup file. This allows each processing chip to start independently and avoids wasting chip resources in the computer device.
Description of the drawings
FIG. 1 is a schematic structural diagram of a computer device provided by an embodiment of this application;
FIG. 2 is a schematic structural diagram of another computer device provided by an embodiment of this application;
FIG. 3 is a flowchart of a chip startup method provided by an embodiment of this application;
FIG. 4 is a flowchart of another chip startup method provided by an embodiment of this application;
FIG. 5 is a schematic structural diagram of a chip startup apparatus provided by an embodiment of this application.
Detailed description
To make the objectives, technical solutions, and advantages of this application clearer, the implementations of this application are described in further detail below with reference to the accompanying drawings.
Before the embodiments of this application are explained in detail, the application scenarios involved are introduced first.
With the development of computer and network technologies, server development shows two trends. One trend is that the processing capability of a single processing chip is gradually increasing. For example, the frequency of a single central processing unit (CPU) increases and its number of processing cores grows. The other trend is to combine multiple processing chips into a chip topology in which the chips work together to improve the processing capability of the system, for example, by using 2-way, 4-way, or 8-way CPU configurations.
For the second trend, multiple processing chips can be numbered consecutively when the board is designed. When the computer device starts, the processing chips can be started one after another, beginning with the processing chip that has the smallest number. However, forming a chip topology from multiple processing chips increases the complexity of the board, and the board failure rate increases accordingly. For example, one of the processing chips may fail. If the faulty chip has a low number, the processing chips numbered after it cannot start when the multiple processing chips are started. In particular, when the faulty chip has the smallest number of all processing chips, none of the other, healthy processing chips can start. The system then goes down and becomes unavailable, which seriously affects services and customer satisfaction. Therefore, for a chip topology composed of multiple processing chips, it is necessary to reduce the chip topology by using the chip startup method provided in the embodiments of this application, so as to improve system reliability and avoid wasting processing resources.
Next, the computer device provided by the embodiments of this application is introduced.
FIG. 1 is a schematic structural diagram of a computer device 100 provided by an embodiment of this application. As shown in FIG. 1, the computer device 100 includes multiple processing chips 101, a first storage unit 102, and a second storage unit 103 corresponding to each processing chip. FIG. 1 uses four processing chips 101 as an example, but this does not limit the number of processing chips.
As shown in FIG. 1, the multiple processing chips 101 can each be connected to the first storage unit 102, and each processing chip 101 can be connected to its own corresponding second storage unit 103. Every two processing chips 101 among the multiple processing chips 101 can be connected to each other, thereby forming a chip topology.
The first storage unit 102 stores the enabled state of each processing chip 101, and the enabled state can be used to indicate whether the corresponding processing chip is to be isolated. In addition, the second storage unit 103 corresponding to each processing chip 101 can store the startup file of that processing chip. Each processing chip 101 can therefore determine whether it needs to start by reading its own enabled state stored in the first storage unit 102. A processing chip 101 that needs to start can read the startup file stored in its corresponding second storage unit 103 and restart. In this way, each processing chip in the computer device can start independently, which avoids wasting processing resources.
It should be noted that the multiple processing chips 101 may be CPUs, AI chips, or other processing chips. The first storage unit 102 may be a complex programmable logic device (CPLD) or another type of register. The second storage unit 103 may be a storage device such as a flash memory. The embodiments of this application impose no limitation in this respect.
Optionally, in a possible implementation, there may be multiple first storage units 102, where each first storage unit 102 is connected to one processing chip 101 and stores the enabled state of that processing chip 101.
Optionally, referring to FIG. 2, the computer device 100 may further include a BMC 104. The BMC 104 can be connected to each of the multiple processing chips 101, and can also be connected to the first storage unit 102.
It should be noted that, in the embodiments of this application, the BMC 104 can determine the chip to be isolated among the multiple processing chips 101. The chip to be isolated may be a faulty chip that is detected after the computer device is powered on and that affects the startup of the computer device, or it may be a chip that is in a sub-healthy state and needs to be isolated according to the service processing situation.
In a possible implementation, after the computer device 100 is powered on, the BMC 104 can receive fault detection information and link establishment information for each processing chip 101. The fault detection information can indicate whether the corresponding processing chip is faulty. The link establishment information can indicate whether the corresponding processing chip encountered a fault when establishing links with other processing chips, that is, with which processing chips it established a link successfully and with which it failed. The BMC 104 can determine the chip to be isolated from the multiple processing chips according to the received fault detection information and link establishment information.
A basic input/output system (BIOS) may be deployed on the multiple processing chips. In this case, after the computer device is powered on, the BIOS can run a self-check on the multiple processing chips 101 to obtain their fault detection information and link establishment information, and can then send the fault detection information and link establishment information of the multiple processing chips 101 to the BMC 104. In a possible implementation scenario, the BIOS program may be stored in a separate BIOS chip, or it may be stored directly in the second storage unit (for example, a flash chip) connected to the processing chip. When the computer device 100 starts, the BIOS program is loaded onto the processing chip and runs. The BIOS then performs the self-check and sends the self-check result to the BMC 104.
From the fault detection information and link establishment information reported by the BIOS, the BMC 104 can identify the chips among the multiple processing chips 101 that failed to start because of their own faults, as well as the chips that failed to establish links with other processing chips. The BMC 104 can treat these chips as the chips to be isolated.
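The selection step can be sketched as follows. This is only an illustration under stated assumptions: the text does not define the format of the fault detection information or the link establishment information, so the dictionaries below, the function name `select_chips_to_isolate`, and the rule of ignoring link failures toward a chip already known to be faulty are all hypothetical.

```python
def select_chips_to_isolate(fault_info, link_info):
    """Pick the slot IDs of chips to isolate (hypothetical data layout).

    fault_info: dict slot_id -> bool, True if the self-check reported a fault.
    link_info:  dict slot_id -> dict peer_slot_id -> bool,
                True if the link to that peer was established.
    """
    # Chips whose own self-check failed are always isolated.
    faulty = {slot for slot, is_faulty in fault_info.items() if is_faulty}
    to_isolate = set(faulty)

    for slot, peers in link_info.items():
        if slot in faulty:
            continue
        # Assumption: a failed link toward an already-faulty peer is blamed
        # on that peer, so it does not count against this chip.
        if any(not ok for peer, ok in peers.items() if peer not in faulty):
            to_isolate.add(slot)

    return sorted(to_isolate)
```

With four chips where only the chip in slot 2 reports a fault, the healthy chips' failed links to slot 2 are ignored and only slot 2 is selected. A real BMC policy may weigh link failures differently.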
In another possible implementation, while the computer device 100 is running normally, the BIOS of the computer device 100 can also monitor the running status of each processing chip 101 in real time and report the running status information of each processing chip 101 to the BMC 104. In this case, the BMC 104 can check whether the received running status information contains abnormal information. If it does, the BMC 104 can determine the processing chip corresponding to the abnormal information as the chip to be isolated.
In another possible implementation, an upper-layer O&M system may be deployed on the multiple processing chips 101, or on the management server corresponding to the computer device. The upper-layer O&M system is a service system used for fault information management. In this case, while the computer device 100 is running, after the BMC 104 receives the running status information of each processing chip 101 that the BIOS reports in real time, it can also send the received running status information to the upper-layer O&M system. The upper-layer O&M system can check whether the running status information contains abnormal information. If it does, the upper-layer O&M system can decide, in light of the current service processing situation, whether to isolate the processing chip corresponding to the abnormal information. For example, if the processing chip corresponding to the abnormal information has a large impact on the processing of the current service, the upper-layer O&M system can isolate it. When it decides to isolate the processing chip corresponding to the abnormal information, the upper-layer O&M system can send a restart instruction to the BMC 104, and the restart instruction can carry indication information that identifies the chip to be isolated. After receiving the restart instruction, the BMC 104 can determine the chip to be isolated according to the indication information carried in the restart instruction. The indication information may be the slot identifier (slot ID) of the chip to be isolated.
Optionally, in some possible implementations, only one chip to be isolated is determined. In this case, following the principle of pairwise isolation, the other processing chip paired with that chip can also be determined as a chip to be isolated. For example, if the determined chip to be isolated is the processing chip whose slot ID is 1, both processing chips of the pair with slot IDs 0 and 1 can be determined as chips to be isolated.
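A minimal sketch of the pairwise isolation rule, assuming that chips are paired by slot ID as (0, 1), (2, 3), and so on. The text only gives the (0, 1) example, so the general pairing rule and the function name `expand_to_pairs` are assumptions made for illustration.

```python
def expand_to_pairs(to_isolate, pair_size=2):
    """Extend a list of slot IDs so that whole pairs are isolated together.

    Assumption: pairs are formed from consecutive slot IDs starting at 0,
    i.e. (0, 1), (2, 3), ...
    """
    expanded = set()
    for slot_id in to_isolate:
        # First slot ID of the pair that contains slot_id.
        base = (slot_id // pair_size) * pair_size
        expanded.update(range(base, base + pair_size))
    return sorted(expanded)


print(expand_to_pairs([1]))  # [0, 1]
```

The example call reproduces the case described above: isolating the chip in slot 1 also isolates its partner in slot 0.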
After determining the chip to be isolated, the BMC 104 can send a first configuration instruction to the first storage unit 102 according to the chip to be isolated, so as to configure the enabled state of each processing chip stored in the first storage unit 102.
Because the first storage unit 102 can have communication connections with the multiple processing chips 101, the multiple processing chips 101 can read their corresponding enabled states from the first storage unit 102. The enabled state can be used to indicate whether the corresponding processing chip is to be isolated.
For example, the first storage unit 102 may store a first flag field that includes multiple flag bits. Each flag bit corresponds to one processing chip, the flag in each flag bit indicates the enabled state of the corresponding processing chip, and the flag bits can be arranged in ascending order of the slot IDs of the processing chips. After determining the chip to be isolated, the BMC 104 can generate a flag field according to the slot ID of the chip to be isolated. This flag field also includes multiple flag bits arranged in ascending order of the slot IDs of the processing chips. In this flag field, the flag bit corresponding to the chip to be isolated holds a first flag, which indicates that the chip is to be isolated, and the flag bits corresponding to the chips that do not need to be isolated hold a second flag, which indicates that these chips are to start normally. After generating this flag field, the BMC 104 can send it to the first storage unit 102 as the configuration instruction, overwriting the first flag field stored in the first storage unit and thereby configuring the enabled states of the multiple processing chips.
For example, assume that the third of the current four processing chips (that is, the processing chip whose slot ID ranks third) is the chip to be isolated. The BMC 104 can then generate the flag field 1101, where 1 indicates no isolation and 0 indicates isolation. The BMC 104 can send this flag field to the first storage unit 102 to overwrite the first flag field stored there.
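The flag field in this example can be generated as sketched below. The function name `build_enable_field` and the string representation are illustrative assumptions. Slot IDs are assumed to start at 0, so the chip whose slot ID ranks third has slot ID 2.

```python
def build_enable_field(num_chips, isolated_slots):
    """Build the enabled-state flag field as a string of flag bits.

    Flag bits are ordered by ascending slot ID; '1' means "do not isolate"
    and '0' means "isolate", matching the 1101 example in the text.
    """
    isolated = set(isolated_slots)
    return "".join("0" if slot in isolated else "1" for slot in range(num_chips))


print(build_enable_field(4, [2]))  # 1101
```

Writing the whole field at once mirrors the description above, in which the BMC overwrites the first flag field rather than toggling individual bits.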
Optionally, as described above, in a possible implementation each processing chip 101 may correspond to one first storage unit 102, that is, each processing chip 101 may have its own independent first storage unit 102. In this case, each processing chip 101 is connected to its corresponding first storage unit 102, and each first storage unit 102 can store the enabled state of its corresponding processing chip 101. After the BMC 104 determines the chip to be isolated, it can configure the enabled state stored in the first storage unit 102 of the chip to be isolated as isolated, and configure the enabled state in the first storage units 102 of the other processing chips as not isolated. The enabled state in a first storage unit 102 can likewise be implemented as a flag bit. For example, for a chip to be isolated, a first flag can be stored in its first storage unit 102 to indicate that the chip is to be isolated. For a chip that does not need to be isolated, a second flag can be stored in its first storage unit 102 to indicate that the chip is not to be isolated.
After configuring the enabled states in the first storage unit 102, the BMC 104 can control the multiple processing chips 101 to restart according to the enabled states in the first storage unit 102 and the startup file in the second storage unit 103 corresponding to each processing chip 101.
The BMC 104 can send a restart instruction to the BIOS. After the BIOS receives the restart instruction, if the multiple processing chips 101 share one first storage unit 102, the BIOS can control the processing chips to read their own enabled states from that first storage unit 102. If each processing chip 101 has its own first storage unit 102, the BIOS can control each processing chip to read its enabled state from its own first storage unit 102. A processing chip whose enabled state indicates isolation, that is, a chip to be isolated, is simply not started. For a processing chip whose enabled state indicates no isolation, the startup file stored in the second storage unit 103 corresponding to that processing chip can be read and used to reinitialize the chip, thereby restarting it.
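The restart flow just described can be sketched as a small simulation. It assumes a shared first storage unit modeled as a flag string such as "1101" (1 means no isolation, 0 means isolation) and models each second storage unit as a list entry holding that chip's startup file. The names `restart_chips` and `startup_files` are hypothetical and do not correspond to any real BIOS interface.

```python
def restart_chips(enable_field, startup_files):
    """Simulate the per-chip restart decision.

    enable_field:  string of flag bits ordered by ascending slot ID,
                   '1' = not isolated, '0' = isolated.
    startup_files: list indexed by slot ID; entry i stands for the startup
                   file held in the second storage unit of chip i.
    Returns a dict slot_id -> startup file used for the chips that restart.
    """
    started = {}
    for slot_id, flag in enumerate(enable_field):
        if flag == "0":
            # Isolated chip: it is not started and its startup file is not read.
            continue
        # Non-isolated chip: read its own startup file and reinitialize.
        started[slot_id] = startup_files[slot_id]
    return started
```

For the flag field 1101, the chips in slots 0, 1, and 3 restart from their own startup files, and the chip in slot 2 stays off. Because each chip has its own startup file, a fault in one chip does not block the others.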
Optionally, in the embodiments of this application, the first storage unit 102 can also be used to store the chip number of each processing chip 101. In this case, after determining the chip to be isolated, the BMC 104 can also send a second configuration instruction to the first storage unit 102. The second configuration instruction instructs reconfiguration of the chip numbers of the multiple processing chips 101 stored in the first storage unit 102, such that after reconfiguration the chip numbers of the processing chips other than the chip to be isolated are consecutive.
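The renumbering can be sketched as follows. The text only requires that the remaining chips end up with consecutive chip numbers. Numbering from 0 in ascending slot ID order and recording `None` for an isolated chip are assumptions made for this illustration, as is the function name `renumber_chips`.

```python
def renumber_chips(num_chips, isolated_slots):
    """Assign consecutive chip numbers to the chips that are not isolated.

    Returns a list indexed by slot ID. Isolated chips get None
    (an illustrative choice; the text does not say what they hold).
    """
    isolated = set(isolated_slots)
    numbers = []
    next_number = 0
    for slot_id in range(num_chips):
        if slot_id in isolated:
            numbers.append(None)
        else:
            numbers.append(next_number)
            next_number += 1
    return numbers


print(renumber_chips(4, [0]))  # [None, 0, 1, 2]
```

In the example, isolating the chip in slot 0 lets the chip in slot 1 take chip number 0, so the remaining chips can build links from a consecutive numbering even though the originally lowest-numbered chip is absent.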
需要说明的是,在计算机设备100上电之后未对第一存储单元102中的芯片编号进行配置前,第一存储单元102中可以存储有一个第二标记字段,该第二标记字段中同样可以包括多个标记位,每个标记位用于记录一个处理芯片101的原始芯片编号。各个标记位可以按照 各个处理芯片的slot ID从小到大或从大到小的顺序依次排列,且多个标记位上的原始芯片编号连续且依次增大。例如,以4个处理芯片为例,该第二标记字段可以为0123,其中,如果多个标记位按照处理芯片的slot ID从小到大的顺序排列,则该第二标记字段用于指示slot ID为0的处理芯片的原始芯片编号可以为0,slot ID为1的处理芯片的原始芯片编号可以为1,slot ID为2的处理芯片的原始芯片编号可以为2,slot ID为3的处理芯片的原始芯片编号可以为3。如果按照处理芯片的slot ID从大到小的顺序,则该标记字段表示slot ID为3的处理芯片的原始芯片编号可以为0,slot ID为2的处理芯片的原始芯片编号可以为1,slot ID为1的处理芯片的原始芯片编号可以为2,slot ID为0的处理芯片的原始芯片编号可以为3。It should be noted that before the chip number in the first storage unit 102 is configured after the computer device 100 is powered on, the first storage unit 102 may store a second flag field, and the second flag field can also be stored in the second flag field. It includes a plurality of mark bits, and each mark bit is used to record the original chip number of a processing chip 101. Each mark bit can be arranged in sequence from small to large or from large to small according to the slot ID of each processing chip, and the original chip numbers on the multiple mark bits are continuous and increase sequentially. For example, taking 4 processing chips as an example, the second flag field can be 0123, where if multiple flag bits are arranged in the order of the slot ID of the processing chip from small to large, the second flag field is used to indicate the slot ID The original chip number of the processing chip with 0 can be 0, the original chip number of the processing chip with slot ID 1 can be 1, the original chip number of the processing chip with slot ID 2 can be 2, and the processing chip with slot ID 3 The original chip number can be 3. 
If the slot ID of the processing chip is in descending order, the tag field indicates that the original chip number of the processing chip with slot ID 3 can be 0, and the original chip number of the processing chip with slot ID 2 can be 1, slot The original chip number of the processing chip with ID 1 can be 2, and the original chip number of the processing chip with slot ID 0 can be 3.
BMC104在确定待隔离芯片之后,可以按照上述规则生成一个包括多个标记位的标记字段,之后,对于待隔离芯片对应的标记位上的芯片编号,保持不变,而对于其他标记位上的芯片编号,BMC104可以从0开始对其进行连续编号,从而得到更改后的标记字段,将该标记字段作为配置指令发送至第一存储单元102,以对第一存储单元102中存储的第二标记字段进行覆盖,从而完成对处理芯片的芯片编号的重新配置。After BMC104 determines the chip to be isolated, it can generate a tag field including multiple tag bits according to the above rules. After that, the chip number on the tag bit corresponding to the chip to be isolated remains unchanged, while for chips on other tag bits The BMC104 can serially number it from 0 to obtain the modified tag field, and send the tag field as a configuration instruction to the first storage unit 102 to compare the second tag field stored in the first storage unit 102. Cover, thereby completing the reconfiguration of the chip number of the processing chip.
例如,假设按照slot ID排在第一位的处理芯片为待隔离芯片,其对应标记字段0123中的第一个标记位,则可以保持第一个标记位上的原始芯片编号0不变,之后,对于标记字段中剩余的标记位,从0开始进行连续编号,从而得到更改后的标记字段为0012。For example, assuming that the processing chip ranked first according to the slot ID is the chip to be isolated, and it corresponds to the first tag bit in the tag field 0123, the original chip number 0 on the first tag bit can be kept unchanged, and then , For the remaining flag bits in the flag field, serial numbering starts from 0, so that the changed flag field is 0012.
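The renumbering rule in the example above can be written as a short sketch. The function name `renumber_flag_field` and the representation of the flag field as a string of single digits (which limits the sketch to at most ten chips) are assumptions made for illustration only.

```python
def renumber_flag_field(original, isolated_positions):
    """Renumber the non-isolated chips consecutively from 0.

    original:           flag field as a string of digits, one per flag bit,
                        for example "0123".
    isolated_positions: set of 0-based flag-bit positions that belong to
                        to-be-isolated chips.
    """
    result = []
    next_number = 0
    for position, digit in enumerate(original):
        if position in isolated_positions:
            # The number of a to-be-isolated chip is kept unchanged.
            result.append(digit)
        else:
            # Every other chip gets the next consecutive number.
            result.append(str(next_number))
            next_number += 1
    return "".join(result)


print(renumber_flag_field("0123", {0}))  # prints 0012
```

With the first flag bit isolated, the call reproduces the field 0012 from the example in the text.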
In some examples, when the processing chips that do not need to be isolated are renumbered, the chip number of the processing chip whose HCCS port is M0 may be changed to 0. Here, an HCCS port is the port through which a processing chip connects to and communicates with other processing chips during link establishment, and there are three kinds of ports: M0, M1, and M2. A chip number of 0 may indicate that the processing chip is the master chip and that the remaining processing chips are its slave chips.
Optionally, the BMC 104 may also reconfigure the chip numbers stored in the first storage unit 102 in other ways. For example, the BMC 104 may read the second flag field from the first storage unit 102, modify it according to the above method, and then write the modified second flag field back to the first storage unit 102 to overwrite the previously stored numbers. This is not limited in the embodiments of the present application.
After the chip numbers of the processing chips 101 in the first storage unit 102 are modified, the BIOS, while restarting the processing chips, may control each processing chip to read its chip number from the first storage unit 102 and establish the connections between the processing chips according to their chip numbers, that is, perform link establishment.
Optionally, in this embodiment of the present application, the first storage unit may also store the health state of each processing chip. The health state indicates whether the corresponding processing chip is faulty. In this case, after determining the to-be-isolated chip, the BMC 104 may also send, according to the to-be-isolated chip, a third configuration instruction to the first storage unit 102. The third configuration instruction instructs configuration of the health state of each processing chip stored in the first storage unit 102. The first storage unit 102 may likewise use the multiple flag bits of a flag field to record the health state of each processing chip; for the specific implementation, refer to the way the first storage unit 102 stores the enable state in the foregoing embodiment, which is not repeated here. Correspondingly, the BMC 104 may configure the health state of each processing chip in the first storage unit 102 by following the method described above for configuring the enable state in the first storage unit 102.
Optionally, in some possible implementations, the first configuration instruction, the second configuration instruction, and the third configuration instruction sent by the BMC 104 to the first storage unit 102 may be combined into one instruction. That is, the BMC 104 may configure the enable state, chip number, and health state of each processing chip stored in the first storage unit 102 through a single configuration instruction.
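One way to picture such a combined instruction is a single record that carries all three flag fields. The sketch below is hypothetical: the class `CombinedConfig`, the function `build_combined_config`, the bit encodings ("1" for isolate, "1" for faulty), and the assumption that the original chip number equals the slot position are not taken from the application. Isolation and health are kept as separate inputs because a to-be-isolated chip may be sub-healthy rather than faulty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CombinedConfig:
    isolate_flags: str  # one character per chip, "1" = isolate (assumed)
    chip_numbers: str   # one digit per chip (assumes at most ten chips)
    health_flags: str   # one character per chip, "1" = faulty (assumed)


def build_combined_config(chip_count, isolated, faulty):
    """Build the three flag fields in one pass over the slot positions."""
    isolate_flags, chip_numbers, health_flags = [], [], []
    next_number = 0
    for slot in range(chip_count):
        isolate_flags.append("1" if slot in isolated else "0")
        health_flags.append("1" if slot in faulty else "0")
        if slot in isolated:
            # Keep the original number (assumed equal to the position).
            chip_numbers.append(str(slot))
        else:
            chip_numbers.append(str(next_number))
            next_number += 1
    return CombinedConfig(
        "".join(isolate_flags), "".join(chip_numbers), "".join(health_flags)
    )


config = build_combined_config(4, isolated={0}, faulty={0})
```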
Optionally, in some possible implementations, the enable state of each processing chip may be stored in the first storage unit 102, while the chip number and the health state of each processing chip may be stored in other storage units. In this case, the BMC may send the corresponding configuration instruction to each of these storage units to configure the information stored there.
It should also be noted that the foregoing embodiment mainly describes configuring the state in the first storage unit through the BMC. In some possible cases, the information in the first storage unit may be configured in other ways, so that the multiple processing chips can still be restarted by reading the enable state in the first storage unit.
In the embodiments of the present application, the computer device includes the first storage unit. After determining the to-be-isolated chip, the BMC may configure the enable state of each processing chip stored in the first storage unit to indicate which processing chips are to be isolated and which are to start normally. In this way, the BIOS deployed on the multiple processing chips can read the enable state of each processing chip stored in the first storage unit, leave the processing chips that need to be isolated unstarted, and restart the processing chips that do not need to be isolated by using the startup files stored in their corresponding second storage units. Each processing chip is thus started individually, which avoids wasting chip resources in the computer device. In addition, because each processing chip can be started on its own without depending on the startup of other processing chips, the other processing chips can still start normally even if the master processing chip fails, which improves the reliability of the computer device.
The chip starting method provided by the embodiments of the present application is explained in detail below.
FIG. 3 is a flowchart of a chip starting method provided by an embodiment of the present application. The method may be applied to the BMC of the computer device shown in FIG. 1 and FIG. 2. Referring to FIG. 3, the method includes:
Step 301: Determine the to-be-isolated chip among the multiple processing chips.
The to-be-isolated chip may be a faulty chip that is detected after the computer device is powered on and that affects the startup of the computer device, or it may be a chip that is in a sub-healthy state and needs to be isolated in view of the service processing situation.
In this embodiment of the present application, after the computer device is powered on, the BMC may determine the faulty chip according to the fault detection information and link establishment information of each processing chip reported by the BIOS deployed on the multiple processing chips, and use the faulty chip as the to-be-isolated chip.
Alternatively, while the computer is running, the BMC may determine the to-be-isolated chip according to the running state information of each processing chip detected by the BIOS in real time. In this case, the BMC may determine the to-be-isolated chip directly from the abnormal information in the running state information of the processing chips, or it may report the running state information to an upper-layer operation and maintenance (O&M) system, which determines the to-be-isolated chip according to the running state information and the current service processing situation.
For the specific implementations of determining the to-be-isolated chip, refer to the related implementations in the foregoing embodiments; details are not repeated here.
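As a rough illustration of the power-on case in step 301, the sketch below marks a chip for isolation when either its fault detection information or its link establishment information reports a problem. The exact decision rule is described in the foregoing embodiments and is not reproduced here, so this "either one" rule, the function name `determine_chips_to_isolate`, and the dictionary inputs keyed by slot ID are all assumptions.

```python
def determine_chips_to_isolate(fault_detected, link_failed):
    """Return the sorted slot IDs of chips that should be isolated.

    fault_detected: dict mapping slot ID -> True if self-check found a fault.
    link_failed:    dict mapping slot ID -> True if the chip had a fault
                    while establishing links with other chips.
    Assumed rule: isolate a chip if either report indicates a fault.
    """
    return sorted(
        slot
        for slot in fault_detected
        if fault_detected[slot] or link_failed.get(slot, False)
    )


# Hypothetical reports from the BIOS for four chips.
to_isolate = determine_chips_to_isolate(
    fault_detected={0: False, 1: True, 2: False, 3: False},
    link_failed={0: False, 1: False, 2: False, 3: True},
)
```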
Step 302: Send a first configuration instruction to the first storage unit. The first configuration instruction instructs configuration of the enable state of each processing chip stored in the first storage unit, and the enable state indicates whether the corresponding processing chip is to be isolated.
After determining the to-be-isolated chip, the BMC may send the first configuration instruction to the first storage unit according to the to-be-isolated chip, so as to configure the enable state of each processing chip stored in the first storage unit. For the specific configuration manner, refer to the way the BMC configures the enable state stored in the first storage unit in the foregoing embodiment; details are not repeated here.
Optionally, while sending the first configuration instruction to the first storage unit, the BMC may also send a second configuration instruction to the first storage unit to reconfigure the chip numbers of the processing chips stored there, and a third configuration instruction to configure the health states of the processing chips stored there. For the specific configuration manners, refer to the description in the foregoing embodiment; details are not repeated here.
Step 303: Control the non-isolated chips among the multiple processing chips to restart according to the enable state stored in the first storage unit and the startup file in the corresponding second storage unit.
After the enable states of the multiple processing chips in the first storage unit are configured, the BMC may send a restart instruction to the BIOS, and the BIOS reads the enable state of each processing chip from the first storage unit. The BIOS does not start a processing chip whose enable state indicates isolation, and it reinitializes a processing chip whose enable state indicates no isolation so as to restart it.
If the to-be-isolated chip is determined when the computer device is powered on, the restart instruction may be generated by the BMC itself. If the to-be-isolated chip is determined while the computer device is running, the restart instruction may be one sent by the upper-layer O&M system.
In addition, when the BMC has also reconfigured the chip numbers in the first storage unit, the BIOS, when restarting the processing chips, may also read the reconfigured chip number of each processing chip from the first storage unit and establish the connections between the restarted processing chips according to the reconfigured chip numbers.
In the embodiments of the present application, the computer device includes the first storage unit. After determining the to-be-isolated chip, the BMC may configure the enable state of each processing chip stored in the first storage unit to indicate which processing chips are to be isolated and which are to start normally. In this way, by reading the enable state in the first storage unit, the BIOS can leave the processing chips that need to be isolated unstarted and restart the processing chips that do not need to be isolated by using their corresponding startup files. Each processing chip is thus started individually, which avoids wasting chip resources in the computer device. In addition, because each processing chip can be started on its own without depending on the startup of other processing chips, the other processing chips can still start normally even if the master processing chip fails, which improves the reliability of the computer device.
FIG. 4 is a flowchart of another chip starting method provided by an embodiment of the present application. Referring to FIG. 4, the method includes the following steps:
Step 401: When the computer device is powered on, the BIOS deployed on the multiple processing chips controls the multiple processing chips to perform chip self-check, so as to obtain the fault detection information and link establishment information of the multiple processing chips.
The BIOS is deployed on the CPU of the computer device. At power-on, the BIOS may control each processing chip to perform self-check to obtain the fault detection information and link establishment information of each processing chip. The fault detection information indicates whether the corresponding processing chip is faulty. The link establishment information indicates whether a fault occurs when the corresponding processing chip establishes connections with other processing chips, that is, whether link establishment between the corresponding processing chip and the other processing chips succeeds.
Step 402: The BIOS sends the fault detection information and link establishment information of the multiple processing chips to the BMC.
Step 403: The BMC determines the to-be-isolated chip according to the fault detection information and link establishment information of the multiple processing chips.
For the implementation of this step, refer to the related implementations described in the foregoing embodiments; details are not repeated here.
Step 404: The BMC sends a first configuration instruction to the first storage unit to configure the enable state of each processing chip stored in the first storage unit.
For the implementation of this step, refer to the related implementations described in the foregoing embodiments; details are not repeated here.
Step 405: The BMC sends a restart instruction to the BIOS.
The restart instruction may be generated by the BMC after it determines, from the fault detection information and link establishment information, that a faulty processing chip exists.
Step 406: The BIOS reads the enable state of each processing chip from the first storage unit, and performs the restart according to the enable state of each processing chip and the startup file in the second storage unit corresponding to each processing chip.
For the implementation of this step, refer to the related implementations in the foregoing embodiments; details are not repeated here.
Step 407: While the computer device is running normally, the BIOS monitors the multiple processing chips in real time to obtain the running state information of the multiple processing chips.
In this embodiment of the present application, the BIOS may obtain the running state information of the multiple processing chips at a preset time interval. The running state information indicates whether the running state of the corresponding processing chip is abnormal.
Step 408: The BIOS feeds the running state information back to the upper-layer O&M system through the BMC.
In one possible implementation, the BIOS reports the running state information only when it detects that the running state information contains abnormal information indicating that the running state of a processing chip is abnormal. Optionally, the BIOS may instead report the running state information each time it is obtained, regardless of whether abnormal information is present.
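The two reporting policies for step 408 can be contrasted in a few lines. The function name `should_report`, the `report_always` switch, and the per-slot boolean representation of the running state information are hypothetical and serve only to illustrate the choice.

```python
def should_report(abnormal_by_slot, report_always=False):
    """Decide whether one sample of running state information is reported.

    abnormal_by_slot: dict mapping slot ID -> True if that chip's running
                      state is abnormal in this sample.
    report_always:    False -> report only when some chip is abnormal;
                      True  -> report every sample that is obtained.
    """
    has_abnormal = any(abnormal_by_slot.values())
    return report_always or has_abnormal


# Hypothetical samples taken at two successive intervals.
healthy_sample = {0: False, 1: False, 2: False, 3: False}
abnormal_sample = {0: False, 1: True, 2: False, 3: False}
```

Reporting only on abnormal samples reduces traffic through the BMC, while reporting every sample gives the upper-layer system a continuous view; the text leaves the choice open.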
Step 409: The upper-layer O&M system determines the to-be-isolated chip according to the running state information and the current service processing situation.
After receiving the running state information fed back by the BIOS, the upper-layer O&M system may determine the to-be-isolated chip according to the implementation described in the foregoing embodiments and then generate a restart instruction. The restart instruction carries indication information that identifies the to-be-isolated chip. For example, the indication information may be the slot ID of the to-be-isolated chip.
Step 410: The upper-layer O&M system sends the restart instruction to the BMC.
Step 411: The BMC determines the to-be-isolated chip according to the restart instruction.
For the implementation of this step, refer to the related implementations in the foregoing embodiments; details are not repeated here.
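Step 411 can be pictured as the BMC unpacking the slot IDs carried by the restart instruction and turning them into a per-chip isolation field. The sketch below is an assumption-laden illustration: the `RestartInstruction` class, its `isolate_slot_ids` attribute, the function name, and the "1 means isolate" encoding are hypothetical and do not describe the actual message format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestartInstruction:
    # Indication information: slot IDs of the to-be-isolated chips.
    isolate_slot_ids: frozenset


def isolate_flags_from_instruction(instruction, chip_count):
    """Build a flag field with one character per slot, in ascending
    slot-ID order; "1" marks a to-be-isolated chip (assumed encoding)."""
    return "".join(
        "1" if slot in instruction.isolate_slot_ids else "0"
        for slot in range(chip_count)
    )


# Hypothetical instruction asking for the chip in slot 2 to be isolated.
instruction = RestartInstruction(isolate_slot_ids=frozenset({2}))
field = isolate_flags_from_instruction(instruction, chip_count=4)
```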
Step 412: The BMC sends a first configuration instruction to the first storage unit to configure the enable state of each processing chip stored in the first storage unit.
For the implementation of this step, refer to the related implementations in the foregoing embodiments; details are not repeated here.
Step 413: The BMC sends a restart instruction to the BIOS.
In this case, the restart instruction may be the restart instruction that the BMC received from the upper-layer O&M system.
Step 414: The BIOS reads the enable state of each processing chip from the first storage unit, and performs the restart according to the enable state of each processing chip and the startup file in the second storage unit corresponding to each processing chip.
For the implementation of this step, refer to the related implementations in the foregoing embodiments; details are not repeated here.
In the embodiments of the present application, the computer device includes the first storage unit. After determining the to-be-isolated chip, the BMC may configure the enable state of each processing chip stored in the first storage unit to indicate which processing chips are to be isolated and which are to start normally. In this way, by reading the enable state in the first storage unit, the BIOS deployed on the multiple processing chips can leave the processing chips that need to be isolated unstarted and restart the processing chips that do not need to be isolated by using their corresponding startup files. Each processing chip is thus started individually, which avoids wasting chip resources in the computer device. In addition, because each processing chip can be started on its own without depending on the startup of other processing chips, the other processing chips can still start normally even if the master processing chip fails, which improves the reliability of the computer device.
Furthermore, in the embodiments of the present application, while the computer device is running normally, certain processing chips may also be isolated according to the running state information of the processing chips and the service processing situation. This allows the processing chips to be allocated flexibly, so that the computer device can provide better service.
Referring to FIG. 5, an embodiment of the present application provides a chip starting apparatus 500. The chip starting apparatus 500 may be applied to the BMC of the computer device described in the foregoing embodiments. The apparatus 500 includes:
a determining module 501, executed by the BMC to implement step 301 in the foregoing embodiment;
a configuration module 502, executed by the BMC to implement step 302 in the foregoing embodiment; and
a control module 503, executed by the BMC to implement step 303 in the foregoing embodiment.
Optionally, the configuration module 502 is further configured to:
send, according to the to-be-isolated chip, a second configuration instruction to the first storage unit, where the second configuration instruction instructs reconfiguration of the chip numbers of the multiple processing chips stored in the first storage unit, and after the reconfiguration, the chip numbers of the processing chips other than the to-be-isolated chip are consecutive.
Optionally, the configuration module 502 is further configured to:
send, according to the to-be-isolated chip, a third configuration instruction to the first storage unit, where the third configuration instruction instructs configuration of the health state of each processing chip stored in the first storage unit, and the health state indicates whether the processing chip is faulty.
Optionally, the determining module 501 is specifically configured to:
after the computer device is powered on, receive the fault detection information and link establishment information of each of the multiple processing chips, where the fault detection information indicates whether the corresponding processing chip is faulty, and the link establishment information indicates whether a fault occurs when the corresponding processing chip establishes connections with other processing chips; and
determine the to-be-isolated chip from the multiple processing chips according to the fault detection information and link establishment information of the multiple processing chips.
Optionally, the determining module 501 is specifically configured to:
while the computer device is running, receive the running state information reported by the multiple processing chips;
if the running state information contains abnormal information, feed the abnormal information back to the upper-layer O&M system;
receive a restart instruction issued by the upper-layer O&M system according to the abnormal information and the service processing situation, where the restart instruction carries indication information that identifies the to-be-isolated chip; and
determine the to-be-isolated chip according to the indication information.
综上所述,在本申请实施例中,计算机设备包括第一存储单元,BMC在确定待隔离芯片之后,可以对第一存储单元中存储的各个处理芯片的启用状态进行配置,以指示对各个处理芯片中的哪些处理芯片进行隔离,哪些正常启动。这样,部署在多个处理芯片上的BIOS就可以通过读取第一存储单元中的启用状态,对需要进行隔离的处理芯片不启动,对不需要进行隔离的处理芯片通过其对应的启动文件进行重启,实现了各个处理芯片的单独启动,避免了计算机设备中芯片资源的浪费。另外,在本申请实施例中,由于各个处理芯片均可以单独启动,不依赖于其他处理芯片的启动,所以,即使主处理芯片发生故障,其他处理芯片也仍然可以正常启动,提高了计算机设备的可靠性。In summary, in the embodiment of the present application, the computer device includes a first storage unit. After determining the chip to be isolated, the BMC can configure the activation state of each processing chip stored in the first storage unit to indicate Which of the processing chips are isolated and which ones are normally started. In this way, the BIOS deployed on multiple processing chips can read the enabled state in the first storage unit to disable the startup of the processing chips that need to be isolated, and use the corresponding startup files for the processing chips that do not need to be isolated. Restart realizes the independent startup of each processing chip, avoiding the waste of chip resources in the computer equipment. In addition, in the embodiments of the present application, since each processing chip can be started independently and does not depend on the startup of other processing chips, even if the main processing chip fails, other processing chips can still be started normally, which improves the performance of the computer equipment. reliability.
需要说明的是:上述实施例提供的芯片启动装置在进行芯片重启时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的芯片启动装置与芯片启动方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。It should be noted that when the chip startup device provided in the above embodiment performs chip restart, only the division of the above-mentioned functional modules is used as an example for illustration. In actual applications, the above-mentioned function allocation can be completed by different functional modules according to needs. That is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above. In addition, the chip activation device provided in the foregoing embodiment and the chip activation method embodiment belong to the same concept, and the specific implementation process is detailed in the method embodiment, and will not be repeated here.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or Digital Subscriber Line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a Digital Versatile Disc (DVD)), or a semiconductor medium (for example, a Solid State Disk (SSD)).
A person of ordinary skill in the art will understand that all or some of the steps of the foregoing embodiments may be completed by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing describes the embodiments provided in this application and is not intended to limit this application. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within the protection scope of this application.

Claims (17)

  1. A chip starting method, wherein a computer device comprises a plurality of processing chips, a first storage unit, and a second storage unit corresponding to each processing chip, the second storage unit corresponding to each processing chip stores a startup file of the corresponding processing chip, and the method comprises:
    determining a chip to be isolated among the plurality of processing chips;
    sending a first configuration instruction to the first storage unit, wherein the first configuration instruction instructs configuration of the enable state of each processing chip stored in the first storage unit, and the enable state indicates whether the corresponding processing chip is to be isolated; and
    controlling the processing chips that are not isolated among the plurality of processing chips to restart according to the enable state stored in the first storage unit and the startup file in the corresponding second storage unit.
  2. The method according to claim 1, wherein the method further comprises:
    sending, according to the chip to be isolated, a second configuration instruction to the first storage unit, wherein the second configuration instruction instructs reconfiguration of the chip numbers of the plurality of processing chips stored in the first storage unit, and after the reconfiguration, the chip numbers of the processing chips other than the chip to be isolated are consecutive.
  3. The method according to claim 1, wherein the method further comprises:
    sending, according to the chip to be isolated, a third configuration instruction to the first storage unit, wherein the third configuration instruction instructs configuration of the health state of each processing chip stored in the first storage unit, and the health state indicates whether the processing chip is faulty.
  4. The method according to any one of claims 1 to 3, wherein the determining a chip to be isolated among the plurality of processing chips comprises:
    after the computer device is powered on, receiving fault detection information and link establishment information of each of the plurality of processing chips, wherein the fault detection information indicates whether the corresponding processing chip is faulty, and the link establishment information indicates whether the corresponding processing chip encounters a fault when establishing connections with other processing chips; and
    determining the chip to be isolated from the plurality of processing chips according to the fault detection information and the link establishment information of the plurality of processing chips.
  5. The method according to any one of claims 1 to 3, wherein the determining a chip to be isolated among the plurality of processing chips comprises:
    during operation of the computer device, receiving running state information reported by the plurality of processing chips;
    if the running state information contains abnormality information, feeding the abnormality information back to an upper-layer operation and maintenance system;
    receiving a restart instruction issued by the upper-layer operation and maintenance system according to the abnormality information and the service processing status, wherein the restart instruction carries indication information that identifies the chip to be isolated; and
    determining the chip to be isolated according to the indication information.
  6. A computer device, wherein the computer device comprises a baseboard management controller (BMC), a plurality of processing chips, a first storage unit, and a second storage unit corresponding to each processing chip, each processing chip is connected to its corresponding second storage unit, and the second storage unit corresponding to each processing chip stores a startup file of the corresponding processing chip;
    the BMC is configured to determine a chip to be isolated among the plurality of processing chips and send a first configuration instruction to the first storage unit, wherein the first configuration instruction instructs configuration of the enable states of the plurality of processing chips stored in the first storage unit, and the enable state indicates whether the corresponding processing chip is to be isolated;
    the BMC is further configured to send a restart instruction to the plurality of processing chips; and
    each of the plurality of processing chips is configured to: on receiving the restart instruction, read its own enable state from the first storage unit, and if its own enable state indicates that it is not to be isolated, read the startup file from the corresponding second storage unit and restart according to the startup file that was read.
  7. The computer device according to claim 6, wherein the BMC is further configured to send, according to the chip to be isolated, a second configuration instruction to the first storage unit, the second configuration instruction instructs reconfiguration of the chip numbers of the plurality of processing chips stored in the first storage unit, and after the reconfiguration, the chip numbers of the processing chips other than the chip to be isolated are consecutive; and
    the restarted processing chips among the plurality of processing chips are further configured to read their corresponding chip numbers from the first storage unit and establish links according to the chip numbers that were read.
  8. The computer device according to claim 6, wherein the BMC is further configured to send, according to the chip to be isolated, a third configuration instruction to the first storage unit, the third configuration instruction instructs configuration of the health state of each processing chip stored in the first storage unit, and the health state indicates whether the processing chip is faulty.
  9. The computer device according to any one of claims 6 to 8, wherein
    the BMC is configured to: after the computer device is powered on, receive fault detection information and link establishment information of each of the plurality of processing chips, wherein the fault detection information indicates whether the corresponding processing chip is faulty, and the link establishment information indicates whether the corresponding processing chip encounters a fault when establishing connections with other processing chips; and determine the chip to be isolated from the plurality of processing chips according to the fault detection information and the link establishment information of the plurality of processing chips.
  10. The computer device according to any one of claims 6 to 8, wherein
    the BMC is configured to: during operation of the computer device, receive running state information reported by the plurality of processing chips, and if the running state information contains abnormality information, feed the abnormality information back to an upper-layer operation and maintenance system;
    the upper-layer operation and maintenance system is configured to issue a restart instruction according to the abnormality information and the service processing status, wherein the restart instruction carries indication information that identifies the chip to be isolated; and
    the BMC is configured to determine the chip to be isolated according to the indication information.
  11. A chip starting apparatus, applied to a computer device, wherein the computer device further comprises a plurality of processing chips, a first storage unit, and a second storage unit corresponding to each processing chip, the second storage unit corresponding to each processing chip stores a startup file of the corresponding processing chip, and the apparatus comprises:
    a determining module, configured to determine a chip to be isolated among the plurality of processing chips;
    a configuration module, configured to send, according to the chip to be isolated, a first configuration instruction to the first storage unit, wherein the first configuration instruction instructs configuration of the enable state of each processing chip stored in the first storage unit, and the enable state indicates whether the corresponding processing chip is to be isolated; and
    a control module, configured to control the processing chips that are not isolated among the plurality of processing chips to restart according to the enable state stored in the first storage unit and the startup file in the corresponding second storage unit.
  12. The apparatus according to claim 11, wherein the configuration module is further configured to:
    send, according to the chip to be isolated, a second configuration instruction to the first storage unit, wherein the second configuration instruction instructs reconfiguration of the chip numbers of the plurality of processing chips stored in the first storage unit, and after the reconfiguration, the chip numbers of the processing chips other than the chip to be isolated are consecutive.
  13. The apparatus according to claim 11, wherein the configuration module is further configured to:
    send, according to the chip to be isolated, a third configuration instruction to the first storage unit, wherein the third configuration instruction instructs configuration of the health state of each processing chip stored in the first storage unit, and the health state indicates whether the processing chip is faulty.
  14. The apparatus according to any one of claims 11 to 13, wherein the determining module is specifically configured to:
    after the computer device is powered on, receive fault detection information and link establishment information of each of the plurality of processing chips, wherein the fault detection information indicates whether the corresponding processing chip is faulty, and the link establishment information indicates whether the corresponding processing chip encounters a fault when establishing connections with other processing chips; and
    determine the chip to be isolated from the plurality of processing chips according to the fault detection information and the link establishment information of the plurality of processing chips.
  15. The apparatus according to any one of claims 11 to 13, wherein the determining module is specifically configured to:
    during operation of the computer device, receive running state information reported by the plurality of processing chips;
    if the running state information contains abnormality information, feed the abnormality information back to an upper-layer operation and maintenance system;
    receive a restart instruction issued by the upper-layer operation and maintenance system according to the abnormality information and the service processing status, wherein the restart instruction carries indication information that identifies the chip to be isolated; and
    determine the chip to be isolated according to the indication information.
  16. A computer device, wherein the computer device comprises a plurality of processing chips, a first storage unit, and a second storage unit corresponding to each processing chip;
    each of the plurality of processing chips is connected to its corresponding second storage unit, and the second storage unit corresponding to each processing chip stores a startup file of the corresponding processing chip;
    the first storage unit is configured to store the enable states of the plurality of processing chips, wherein the enable state indicates whether the corresponding processing chip is to be isolated; and
    the plurality of processing chips are connected to the first storage unit, and the chips that are not isolated among the plurality of processing chips are configured to restart according to the enable state stored in the first storage unit and the startup file in the corresponding second storage unit.
  17. The computer device according to claim 16, wherein the first storage unit is a complex programmable logic device (CPLD), and the second storage unit is a flash memory.
PCT/CN2021/078549 2020-03-25 2021-03-01 Chip starting method and apparatus and computer device WO2021190252A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010218853.8 2020-03-25
CN202010218853.8A CN113515312A (en) 2020-03-25 2020-03-25 Chip starting method and device and computer equipment

Publications (1)

Publication Number Publication Date
WO2021190252A1 true WO2021190252A1 (en) 2021-09-30

Family

ID=77890922

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/078549 WO2021190252A1 (en) 2020-03-25 2021-03-01 Chip starting method and apparatus and computer device

Country Status (2)

Country Link
CN (1) CN113515312A (en)
WO (1) WO2021190252A1 (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5491788A (en) * 1993-09-10 1996-02-13 Compaq Computer Corp. Method of booting a multiprocessor computer where execution is transferring from a first processor to a second processor based on the first processor having had a critical error
CN1916858A (en) * 2006-09-19 2007-02-21 杭州华为三康技术有限公司 Monitoring methd, monitoring equipment in system with multiple cores, and multiple cores system
US20090240981A1 (en) * 2008-03-24 2009-09-24 Advanced Micro Devices, Inc. Bootstrap device and methods thereof
CN104750510A (en) * 2013-12-30 2015-07-01 深圳市中兴微电子技术有限公司 Chip start method and multi-core processor chip
US20160210255A1 (en) * 2015-01-16 2016-07-21 Oracle International Corporation Inter-processor bus link and switch chip failure recovery
CN107870662A (en) * 2016-09-23 2018-04-03 华为技术有限公司 The method of cpu reset and PCIe interface card in a kind of multi-CPU system
CN110187923A (en) * 2019-05-10 2019-08-30 杭州迪普科技股份有限公司 A kind of CPU starting method and apparatus applied to multi -CPU board
CN110347534A (en) * 2018-04-02 2019-10-18 英特尔公司 Selfreparing is carried out in computing systems using embedded non-volatile memory

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609327B (en) * 2012-01-17 2015-07-22 北京华为数字技术有限公司 Method and device for improving reliability of multi-core processor
CN103870353A (en) * 2014-03-18 2014-06-18 北京控制工程研究所 Multicore-oriented reconfigurable fault tolerance system and multicore-oriented reconfigurable fault tolerance method
CN109947586A (en) * 2019-03-20 2019-06-28 浪潮商用机器有限公司 A kind of method, apparatus and medium of isolated fault equipment

Also Published As

Publication number Publication date
CN113515312A (en) 2021-10-19

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21775041

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21775041

Country of ref document: EP

Kind code of ref document: A1