CN111611111B - Method and system for quickly recovering fault of multiprocessor signal processing equipment - Google Patents

Method and system for quickly recovering fault of multiprocessor signal processing equipment

Info

Publication number
CN111611111B
Authority
CN
China
Prior art keywords
processor
data
communication
processor core
communication group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010441610.0A
Other languages
Chinese (zh)
Other versions
CN111611111A (en)
Inventor
袁华进
李莉
张�杰
李红兵
周萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Haixun Digital Technology Co ltd
Original Assignee
Beijing Zhongke Haixun Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Haixun Digital Technology Co ltd
Priority to CN202010441610.0A
Publication of CN111611111A
Application granted
Publication of CN111611111B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/1608 Error detection by comparing the output signals of redundant hardware
    • G06F11/1625 Error detection by comparing the output signals of redundant hardware in communications, e.g. transmission, interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Hardware Redundancy (AREA)

Abstract

A method and system for fast failure recovery of a multiprocessor signal processing apparatus are provided. The method comprises: in response to a failure of a first processor, saving a communication middleware state and a failure recovery flag of the first processor in a non-volatile memory; protecting, in memory, the code segment of the system firmware run by the first processor, and soft-resetting the first processor; after the reset, the first processor restarts and runs boot firmware, the boot firmware accesses the non-volatile memory and, in response to the failure recovery flag stored there, obtains the entry address of the code segment in memory and starts the system firmware by executing the instruction at the entry address; the first processor executes the system firmware to obtain the saved communication middleware state, so that the first processor rejoins the communication group to which it belonged when it failed; and the first processor executes the system firmware to perform a task processing cycle.

Description

Method and system for quickly recovering fault of multiprocessor signal processing equipment
Technical Field
The present application relates to reliability and fault tolerance, and more particularly, to a method for fast failure recovery after a node in a multiprocessor signal processing apparatus fails and a system thereof.
Background
High-performance signal processing devices generally adopt an architecture based on cooperative computing across multiple cores and multiple processors, in which the computing tasks that implement signal processing are shared among the processors or processor cores. In addition to computing, the processor cores or processors frequently communicate with one another to distribute and/or collect computing tasks and to synchronize the progress of the computing tasks. Software running on each processor core or processor controls the signal processing. This software is large in scale and complex in flow, implements a number of complex algorithms, and also includes communication middleware that manages communication between the processors or processor cores.
Signal processing devices need to provide high reliability and high availability to reduce system downtime. Software and hardware inevitably contain defects, so sudden failures can occur while the signal processing device is operating; these cause system anomalies, and some are fatal to the whole system. Recovery techniques are therefore needed to restore system operation promptly after an error or fault occurs.
In the prior art, system availability is improved by adopting a master/standby structure. Normally, the master device runs. When the master device fails, the standby device takes over its work. Both the master and standby devices are, for example, processors. However, the master/standby structure requires backup devices to be deployed in the system, which increases cost. The master device also has to synchronize its operating information to the standby device continuously while running, which increases communication overhead.
The prior art also employs rollback techniques, which restore system operation by rebooting the processor. However, rebooting a processor takes a long time, and in a multiprocessor system, even if a single processor returns to the point of failure after rebooting and continues to operate, the operating states of the nodes (processors or processor cores) of the whole system are difficult to synchronize because the other processors have not been rebooted.
Disclosure of Invention
There is a need for a fault recovery technique for a multiprocessor signal processing apparatus that allows the apparatus to resume operation in as short a time as possible when a single processor in it fails. After resuming operation, the failed processor should be able to resume communication with the other processors of the signal processing apparatus and continue to process the computing task cooperatively, so that the computing task in progress continues after recovery and the parts of the task that were already completed are reused instead of the entire task being reprocessed. It is also desirable that the recovery of the failed processor affects the other processors as little as possible, to avoid introducing new defects or failures through the added complexity of failure recovery. Because a fault may occur at any moment during task processing, the recovery process must not depend on when the fault occurs and must be effective whenever it occurs.
According to a first aspect of the present application, there is provided a first method of fast failure recovery for a multiprocessor signal processing apparatus, comprising: in response to a failure of a first processor, saving a communication middleware state and a failure recovery flag of the first processor in a non-volatile memory; protecting, in memory, the code segment of the system firmware run by the first processor, and soft-resetting the first processor; after the reset, the first processor restarts and runs boot firmware, the boot firmware accesses the non-volatile memory and, in response to the failure recovery flag stored there, obtains the entry address of the code segment in memory and starts the system firmware by executing the instruction at the entry address; the first processor executes the system firmware to obtain the saved communication middleware state, so that the first processor rejoins the communication group to which it belonged when it failed; and the first processor executes the system firmware to perform a task processing cycle.
According to the first method of the first aspect of the present application, there is provided a second method of the first aspect, further comprising: in response to no failure recovery flag being stored in the non-volatile memory, the boot firmware obtains a firmware image from an image server, decompresses the obtained firmware image to obtain the system firmware, loads the system firmware into memory, obtains the entry address in memory, and starts the system firmware by executing instructions from the entry address.
According to the first aspect of the present application, there is provided a third method of the first aspect, wherein the communication middleware state further records the position of the failed processor in the task processing cycle when the failure occurred; the position indicates whether the interface of the communication middleware for receiving data had been called, whether the interface of the communication middleware for sending data had been called, and/or whether an acknowledgement message had been successfully given for the call to the interface for receiving data.
According to the third method of the first aspect of the present application, there is provided a fourth method of the first aspect, further comprising: in response to the interface for receiving data being called, skipping the current receive operation if the communication middleware recognizes that the failure recovery flag exists; and performing the current receive operation if the communication middleware recognizes that the failure recovery flag does not exist.
According to the fourth method of the first aspect of the present application, there is provided a fifth method of the first aspect, further comprising: in response to the interface for sending data being called, if the communication middleware recognizes that the failure recovery flag exists and the position of the failed processor in the task processing cycle when the failure occurred was before an acknowledgement message had been successfully given for the call to the interface for receiving data, skipping the current send operation, indicating to the receiver of the send operation that sending the data failed, and clearing the failure recovery flag.
According to the fifth method of the first aspect of the present application, there is provided a sixth method of the first aspect, further comprising: in response to the interface for sending data being called, if the communication middleware recognizes that the failure recovery flag exists and the position of the failed processor in the task processing cycle when the failure occurred was after an acknowledgement message had been successfully given for the call to the interface for receiving data, skipping the current send operation without indicating to the receiver of the send operation that sending the data failed, and clearing the failure recovery flag; and in response to the interface for sending data being called, performing the current send operation if the communication middleware recognizes that the failure recovery flag does not exist.
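The receive and send rules of the fourth to sixth methods can be sketched as a small decision table. This is only an illustrative sketch: the `Middleware` class, the `FaultPosition` enum, and the returned action strings are hypothetical names introduced here, not interfaces defined by the application.

```python
from dataclasses import dataclass
from enum import Enum


class FaultPosition(Enum):
    """Hypothetical encoding of where in the task cycle the fault occurred."""
    BEFORE_ACK = "before_ack"  # receive not yet successfully acknowledged
    AFTER_ACK = "after_ack"    # receive already successfully acknowledged


@dataclass
class Middleware:
    recovery_flag: bool        # failure recovery flag restored after reset
    position: FaultPosition    # position saved in the middleware state

    def receive(self) -> str:
        # Fourth method: skip the receive while recovering, else receive.
        return "skip" if self.recovery_flag else "receive"

    def send(self) -> str:
        # Sixth method, last clause: no flag means a normal send.
        if not self.recovery_flag:
            return "send"
        # Fifth and sixth methods: the flag is cleared on the skipped send.
        self.recovery_flag = False
        if self.position is FaultPosition.BEFORE_ACK:
            # Fifth method: skip and tell the receiver the send failed.
            return "skip+report_failure"
        # Sixth method: skip without reporting a failure.
        return "skip"
```

In this sketch the first send after recovery clears the flag, so the next task processing cycle receives and sends normally.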
According to the sixth method of the first aspect of the present application, there is provided a seventh method of the first aspect, further comprising: before the task processing cycle ends, if the failure recovery flag is recognized and the position of the failed processor in the task processing cycle when the failure occurred was before an acknowledgement message had been successfully given for the call to the interface for receiving data, giving an acknowledgement message for the current receive operation to the corresponding data sender, the acknowledgement message indicating that the receive operation failed.
According to the seventh method of the first aspect of the present application, there is provided an eighth method of the first aspect, wherein the first processor is a slave member of a first communication group; the multiprocessor signal processing apparatus further includes a second processor that is the master member of the first communication group and sends data to the slave members of the first communication group; in response to the sending interface being called by the second processor as the master member, the communication middleware sends the data to all slave members of the first communication group, and returns from the call to the sending interface once every slave member of the first communication group that receives the data has given an acknowledgement message indicating successful reception; and in response to any slave member of the first communication group giving an acknowledgement message indicating that reception failed, the communication middleware resends the data to that slave member.
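A minimal sketch of the master-side send described in the eighth method follows. The function name `master_send`, the `deliver` transport hook, and the `max_rounds` bound are assumptions made for illustration; the application does not specify a retry limit.

```python
from typing import Callable, Dict, Iterable


def master_send(data: bytes, slaves: Iterable[str],
                deliver: Callable[[str, bytes], bool],
                max_rounds: int = 3) -> Dict[str, int]:
    """Send data to every slave; resend only to slaves that reported failure.

    deliver(slave, data) is a hypothetical transport hook that returns True
    when the slave acknowledges success and False when it acknowledges a
    failed reception. Returns the number of transmissions made per slave.
    """
    attempts = {slave: 0 for slave in slaves}
    pending = list(attempts)
    for _ in range(max_rounds):  # max_rounds is an assumption of this sketch
        failed = []
        for slave in pending:
            attempts[slave] += 1
            if not deliver(slave, data):
                failed.append(slave)
        if not failed:
            # Every slave acknowledged success, so the send call returns.
            return attempts
        pending = failed
    raise TimeoutError(f"no successful acknowledgement from {pending}")
```

A slave that is recovering from a fault would answer with a failure acknowledgement in this model, and only that slave is sent the data again.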
According to the eighth method of the first aspect of the present application, there is provided a ninth method of the first aspect, wherein the first processor is a slave member of a second communication group; the multiprocessor signal processing apparatus further includes a third processor that is the master member of the second communication group and receives data from the slave members of the second communication group; in response to the receiving interface being called by the third processor as the master member, the communication middleware receives data from all slave members of the second communication group, and returns from the call to the receiving interface once data has been received from every slave member of the second communication group that sends data; and in response to a failure to receive data from any slave member of the second communication group, the communication middleware receives the data from that slave member again.
According to the ninth method of the first aspect of the present application, there is provided a tenth method of the first aspect, wherein the multiprocessor signal processing apparatus further comprises a fourth processor that is a slave member of the first communication group and also a slave member of the second communication group; the method further comprises: the communication middleware of the third processor gives an acknowledgement message to all slave members of the second communication group that sent data, in response to receiving data from all of them; in response to the sending interface being called by the fourth processor as a slave member, the communication middleware sends data to the third processor, the master member of the second communication group, and, in response to receiving the acknowledgement message given by the third processor, returns from the fourth processor's call to the sending interface; and in response to the receiving interface being called by the fourth processor as a slave member, the communication middleware receives data from the second processor, the master member of the first communication group, and, in response to an acknowledgement message indicating successful reception being given to the second processor, returns from the fourth processor's call to the receiving interface.
According to a second aspect of the present application, there is provided a first information processing apparatus of the second aspect, comprising a memory, a plurality of processors, and a program stored in the memory and executable on each processor, characterized in that each of the plurality of processors, when executing the program, implements one of the first to tenth methods of fast failure recovery for a multiprocessor signal processing apparatus according to the first aspect of the present application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art according to the drawings.
FIG. 1 illustrates a block diagram of an information processing apparatus according to an embodiment of the present application;
FIG. 2A illustrates a block diagram of communication between processor cores of an information processing apparatus according to an embodiment of the present application;
FIG. 2B illustrates a schematic diagram of inter-process communication according to an embodiment of the present application;
FIGS. 3A-3D illustrate flow diagrams of communication among the members of a communication group;
FIG. 4A illustrates a flow diagram of fault handling according to an embodiment of the present application;
FIG. 4B illustrates a flow diagram of fast failure recovery implemented by the boot firmware according to an embodiment of the present application;
FIG. 5A illustrates a block diagram of system firmware, according to an embodiment of the present application;
FIG. 5B illustrates a block diagram of a task processing unit according to an embodiment of the present application;
FIGS. 6A and 6B illustrate a flow diagram of a communication of a slave member within a communication group according to an embodiment of the present application;
FIGS. 7A-7E illustrate diagrams of an information processing apparatus implementing fast failure recovery according to an embodiment of the present application; and
FIGS. 8A-8D illustrate diagrams of an information processing apparatus implementing fast failure recovery according to still another embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 shows a block diagram of an information processing apparatus according to an embodiment of the present application.
The information processing apparatus includes a processor, memories (NVM, DRAM and ROM), and an image server. The processor is, for example, a multi-core DSP (Digital Signal Processor). The processor in FIG. 1 includes 8 cores, shown as core 0 through core 7. Alternatively, the processor may include a different number of processor cores. The processor cores communicate with each other. The Dynamic Random Access Memory (DRAM), non-volatile memory (NVM), and Read Only Memory (ROM) are each coupled to the processor. The image server is remotely coupled to the processor through a network.
The non-volatile memory (NVM) acts as a boot memory in which boot firmware is recorded. Boot firmware is embedded software that is executed first after the processor is powered on. By executing the boot firmware, the processor loads the firmware into the DRAM. The image server stores the firmware. Firmware is embedded software that is loaded by the processor cores of the processor and executed during normal operation. To distinguish it from boot firmware, this firmware is also referred to as system firmware. Typically, when the processor starts, it executes the boot firmware to obtain the latest version of the system firmware from the image server and run it. Alternatively, for example when the remote image server is unavailable, the processor executes the boot firmware to load the system firmware recorded in the Read Only Memory (ROM) and run it. Optionally, the system firmware stored in the NVM or in the image server is compressed; by executing the boot firmware, the processor decompresses the system firmware obtained from the image server or the NVM and writes it into the DRAM, and the processor cores then execute it. The decompressed system firmware occupies different parts of the DRAM, such as code segments and data segments.
According to an embodiment of the present application, an information processing apparatus includes a plurality of processors (not shown). Each processor includes a plurality of processor cores. The processor cores of the processors are capable of communicating with each other. The processor cores of the processors constitute the nodes of the information processing apparatus and cooperatively process the signal processing task. Optionally, each processor is coupled to its own dedicated Dynamic Random Access Memory (DRAM), non-volatile memory (NVM) and Read Only Memory (ROM), or several processors share a DRAM, an NVM and a ROM.
According to an embodiment of the present application, the non-volatile memory (NVM) also stores temporary data. Temporary data is data recorded in the NVM before or when a failure occurs in order to implement failure recovery, and includes, for example, the communication middleware state, static data, and the failure recovery flag. The manner in which the temporary data is generated and used is described later. While the information processing apparatus operates, static data, runtime data, and the communication middleware state are recorded in the DRAM. When the processor is reset or restarted, the static data, runtime data, and communication middleware state recorded in the DRAM are cleared or overwritten. Optionally, the DRAM further includes a protected area in which an interrupt vector table and the code segments of the system firmware are recorded. The interrupt vector table and the code segments typically occupy DRAM space that the processor does not modify while running the firmware. Therefore, the protected area of the DRAM is not cleared or overwritten when the processor is reset or restarted, so its contents can be used directly after the restart without being regenerated and rewritten into the DRAM. Optionally, the original version of the system firmware is also recorded in the protected area of the DRAM.
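The interplay between the NVM temporary data, the protected DRAM area, and the boot firmware can be sketched as follows. This is a simplified model under stated assumptions: the `Nvm` and `Dram` classes, the `boot` function, and the `load_image` callback (standing in for fetching and decompressing the image from the image server) are hypothetical and only illustrate the choice between the fast path and the full load.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Tuple


@dataclass
class Nvm:
    """Hypothetical model of the temporary data kept in non-volatile memory."""
    recovery_flag: bool = False
    middleware_state: Optional[dict] = None


@dataclass
class Dram:
    """Hypothetical model of DRAM: a protected area plus ordinary data."""
    protected_entry: Optional[int] = None  # entry address of the kept code segment
    working: dict = field(default_factory=dict)

    def soft_reset(self) -> None:
        # Ordinary contents are cleared; the protected area is left intact.
        self.working.clear()


def boot(nvm: Nvm, dram: Dram, load_image: Callable[[], int]) -> Tuple[str, int]:
    """Return the boot path taken and the entry address to jump to."""
    if nvm.recovery_flag and dram.protected_entry is not None:
        # Fast path: reuse the code segment that survived the soft reset.
        return "fast", dram.protected_entry
    # Normal path: load the system firmware and remember its entry address.
    entry = load_image()
    dram.protected_entry = entry
    return "full", entry
```

The fast path skips fetching, decompressing, and writing the firmware, which is where a normal start spends its time in this model.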
FIG. 2A shows a block diagram of communication between processor cores of an information processing apparatus according to an embodiment of the present application.
By way of example, the information processing apparatus includes 4 processors (processor 1, processor 2, processor 3, and processor 4), each of which includes 8 processor cores (denoted core 0 to core 7), and each processor core independently executes a program. The 4 processors are coupled to each other via a network and/or an SRIO (Serial RapidIO) interface, so that processor cores located in different processors can communicate with each other, while processor cores located in the same processor can communicate through a mechanism provided by the processor, such as shared memory or a shared cache. Each processor core that needs to communicate runs communication middleware as its underlying software. The communication middleware has, for example, a plurality of instances, and the processor cores communicate via their respective communication middleware instances. The communication middleware provides communication support for the programs run by each processor core and hides the details of the physical communication mode. The communication middleware provides, for example, a standard interprocess communication interface; one or more processes running on a processor core use this interface to communicate with processes running on the same processor core or on other processor cores. The interprocess communication interface provides communication operations including, for example, sending and receiving.
FIG. 2B shows a schematic diagram of communication between processor cores according to an embodiment of the application.
According to the information processing apparatus of the embodiment of the present application, a plurality of processor cores of a plurality of processors are divided into a plurality of communication groups. Interprocess communication occurs between processor cores within a communication group. A processor core may belong to two or more communication groups simultaneously.
Each processor core in a communication group is a member of the communication group. The members of a communication group include a master member and slave members. There is one and only one master member in each communication group. Communication modes in a communication group are divided into a 1-to-1 mode, a 1-to-N mode, and an N-to-1 mode according to the number of members that transmit or receive data. In the 1-to-1 mode, the communication group includes 2 members, one of which transmits data and the other of which receives data; either of them may be the master member of the communication group. In the 1-to-N mode, the communication group includes 1+N members: 1 member transmits data and N members receive data; the member transmitting data is the master member of the communication group and the other members are slave members. In the N-to-1 mode, the communication group includes N+1 members: N members transmit data and 1 member receives data; the member receiving data is the master member of the communication group and the other members are slave members.
Communication in the communication group only occurs between the master member and the slave members, and no communication occurs between the slave members of the communication group. If two slave members of a communication group need to communicate, another communication group needs to be created for the two members, and a master-slave relationship is set for the two members in the created communication group.
Fig. 2B shows 3 communication groups, denoted IPC group 1, IPC group 2, and IPC group 3. Communication group IPC group 1 comprises the 8 processor cores of processor 1, of which processor core 0 is the master member and processor cores 1-7 are slave members. Communication group IPC group 2 includes processor core 7 of processor 1, processor core 0 of processor 2 and processor core 0 of processor 3, wherein processor core 7 of processor 1 is the master member of IPC group 2, and processor core 0 of processor 2 and processor core 0 of processor 3 are both slave members of IPC group 2. In IPC group 2, processor core 7 of processor 1 transmits data, and processor core 0 of processor 2 and processor core 0 of processor 3 receive data. Communication group IPC group 3 includes processor core 0 of processor 2, processor core 0 of processor 3, and processor core 0 of processor 4, wherein processor core 0 of processor 4 is the master member of IPC group 3, and processor core 0 of processor 2 and processor core 0 of processor 3 are both slave members of IPC group 3. In IPC group 3, processor core 0 of processor 2 and processor core 0 of processor 3 transmit data, and processor core 0 of processor 4 receives data.
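The three groups of Fig. 2B can be written down as a small configuration table. This is an illustrative sketch only: the `IpcGroup` type, the `(processor, core)` tuple encoding, and the `roles_of` helper are hypothetical, and the transmit direction of IPC group 1 is taken from the signal-processing example in which core 0 of processor 1 distributes data to cores 1 to 7.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Core = Tuple[int, int]  # (processor number, core number), a hypothetical encoding


@dataclass(frozen=True)
class IpcGroup:
    name: str
    master: Core               # exactly one master member per group
    slaves: Tuple[Core, ...]
    master_sends: bool         # True: master transmits; False: master receives

    def mode(self) -> str:
        if len(self.slaves) == 1:
            return "1-to-1"
        return "1-to-N" if self.master_sends else "N-to-1"


GROUPS = [
    IpcGroup("IPC group 1", (1, 0), tuple((1, c) for c in range(1, 8)), True),
    IpcGroup("IPC group 2", (1, 7), ((2, 0), (3, 0)), True),
    IpcGroup("IPC group 3", (4, 0), ((2, 0), (3, 0)), False),
]


def roles_of(core: Core) -> Dict[str, str]:
    """Return the role a core holds in every group it belongs to."""
    roles = {}
    for group in GROUPS:
        if core == group.master:
            roles[group.name] = "master"
        elif core in group.slaves:
            roles[group.name] = "slave"
    return roles
```

For example, `roles_of((1, 7))` shows that core 7 of processor 1 is a slave member of IPC group 1 and the master member of IPC group 2, which illustrates a core belonging to two groups at once.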
According to an embodiment of the application, configuration information of the communication groups is provided to each processor core. The configuration information of a communication group includes the members of the communication group and their master-slave relationship. Each processor core thus creates its communication groups from the configuration information after startup. By way of example, processor core 0 of processor 2 belongs to communication group IPC group 2 and communication group IPC group 3. After processor core 0 of processor 2 starts, it broadcasts that it belongs to IPC group 2 and IPC group 3 and that it is a slave member in both groups. Every processor core that is online can receive the broadcast message of processor core 0 of processor 2. From the received broadcast message, processor core 7 of processor 1 learns that it belongs to IPC group 2 together with processor core 0 of processor 2. Processor core 7 of processor 1 therefore records that processor core 0 of processor 2, a slave member of IPC group 2, is online, and also records its address, so that processor core 7 of processor 1, as the master member of IPC group 2, can send data to processor core 0 of processor 2. In response, processor core 0 of processor 2 also records the address of processor core 7 of processor 1, the master member of IPC group 2.
Similarly, from the received broadcast message, processor core 0 of processor 4 learns that it belongs to communication group IPC group 3 together with processor core 0 of processor 2. Processor core 0 of processor 4 therefore records that processor core 0 of processor 2, a slave member of IPC group 3, is online, and also records its address, so that processor core 0 of processor 4, as the master member of IPC group 3, can receive data from processor core 0 of processor 2. In response, processor core 0 of processor 2 also records the address of processor core 0 of processor 4, the master member of IPC group 3. It will be appreciated that addressing may be implemented in the information processing apparatus in other ways than by addresses.
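The broadcast-based discovery described above can be sketched as follows. The `CoreNode` class, the message layout, the address strings, and the `bring_online` helper are hypothetical; the sketch only illustrates that a peer is recorded when the two cores share a group as master and slave, since slave members of a group do not communicate with each other.

```python
from typing import Dict


class CoreNode:
    """Hypothetical model of one processor core running the middleware."""

    def __init__(self, core_id: str, address: str, memberships: Dict[str, str]):
        self.core_id = core_id
        self.address = address           # hypothetical address format
        self.memberships = memberships   # group name -> "master" or "slave"
        self.peers: Dict[str, Dict[str, str]] = {}  # group -> {core_id: address}

    def announce(self) -> dict:
        # Message broadcast by a core after it starts.
        return {"core": self.core_id, "address": self.address,
                "memberships": dict(self.memberships)}

    def on_broadcast(self, msg: dict) -> bool:
        """Record the sender if it is our master or slave in a shared group."""
        recorded = False
        for group, their_role in msg["memberships"].items():
            my_role = self.memberships.get(group)
            if {my_role, their_role} == {"master", "slave"}:
                self.peers.setdefault(group, {})[msg["core"]] = msg["address"]
                recorded = True
        return recorded


def bring_online(new_core: CoreNode, online: list) -> None:
    """Broadcast the new core and let it record every core that responds."""
    msg = new_core.announce()
    for node in online:
        if node.on_broadcast(msg):
            new_core.on_broadcast(node.announce())
```

A core that restarts after a fault could rebuild the same peer table from its saved communication middleware state instead of repeating this exchange; that shortcut is not modeled here.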
As an example, for signal processing, processor core 0 of processor 1 collects data to be processed and distributes it to processor cores 1 to 7 of communication group IPC group 1. Processor core 7 of processor 1 supplies the processing results of the processor cores of communication group IPC group 1 to processor core 0 of processor 2 and processor core 0 of processor 3. Processor core 0 of processor 2 distributes the computation tasks represented by the received data to processor cores 1 to 7 of processor 2 for parallel processing, collects the computation results, and sends the results to processor core 0 of processor 4. Processor core 0 of processor 3 distributes the computation tasks represented by the received data to processor cores 1 to 7 of processor 3 for parallel processing, collects the computation results, and sends the results to processor core 0 of processor 4. Processor core 0 of processor 4 distributes the computation tasks represented by the received data to processor cores 1 to 7 of processor 4 for parallel processing and collects the computation results. Thus, the respective processor cores of the signal processing apparatus cooperatively perform signal processing.
Figs. 3A-3D illustrate flow diagrams of communications of members within a communication group.
Fig. 3A is a flow diagram of a slave member of a communication group receiving data.
A slave member of a communication group that is to receive data receives the data from the master member of the communication group. Optionally, the slave member also sends an acknowledgement message to the master member. The acknowledgement message indicates that the operation of receiving data from the master member is complete, so that the master member no longer retransmits the data for the acknowledged receive operation. Optionally, the acknowledgement message also indicates that the current step of communication of the communication group has been completed. According to an embodiment of the present application, a "step" of a communication represents a phase of the communication within a communication group. During task processing, the members of the communication group communicate with one another repeatedly; for example, the master member sends data to the slave member multiple times, and each such communication is called a "step". In yet another example, a member's task processing loop includes one operation of receiving data and one operation of sending data, and such a task processing loop is referred to as a "step". Members distinguish or track the phase in which the communication is proceeding in terms of "steps".
Optionally, if the master member does not receive an acknowledgement message from the slave member, or does not receive it in time, the master member considers the communication of the current "step" to have failed and retransmits the data of the current step to the slave member. Similarly, while receiving data, the slave member also detects a timeout or another event indicating a transmission failure; in that case it discards the data received in the current step and re-initiates the operation of receiving data from the master member.
Still alternatively, the slave member indicates a communication failure of the current step through the confirmation message, and in response, the master member retransmits the data of the current step to the slave member.
Fig. 3B is a flow chart of a slave member of a communication group sending data.
A slave member of a communication group that is to send data sends the data to the master member of the communication group. Optionally, the slave member also receives an acknowledgement message from the master member. The acknowledgement message indicates that the current step of communication of the communication group has been completed. Optionally, if the slave member does not receive the acknowledgement message from the master member, or does not receive it in time, the slave member considers the communication of the current "step" to have failed and retransmits the data of the current "step" to the master member.
Still optionally, the slave member also monitors whether the transmission data times out. And in response to the time-out, the slave member ignores the data sent in the current step and reinitiates the operation of sending the data to the master member.
Fig. 3C is a flow chart of a master member of a communication group sending data.
A master member of a communication group that is to send data sends the data to all slave members of the communication group in each "step" of the communication (340). The master member also identifies whether all slave members of the communication group successfully received the data it sent (342). By way of example, the communication middleware provides an indication of whether data was successfully transmitted or received. As yet another example, a slave member receiving data sends an acknowledgement message to the master member sending the data to indicate that it successfully received the data (see also the "send acknowledgement message" step of Fig. 3A).
In response to all slave members of the communication group successfully receiving the data transmitted by the master member (342), the master member determines that transmitting the data is complete and may proceed to the next "step" (346). Optionally, in response to all slave members of the communication group successfully receiving the data transmitted by the master member (342), the master member also transmits a message to each slave member of the communication group confirming that its current step of transmitting data is complete.
Still optionally, in response to one or more slave members of the communication group not successfully receiving data transmitted by the master member (342) (failure or error to transmit data), the master member retransmits data for the slave members that failed to successfully receive data (348).
Optionally, in response to one or more slave members of the communication group not successfully receiving the data transmitted by the master member (342), the master member transmits a message to the slave member that failed to receive the data to indicate that its current step of transmitting data has failed, and retransmits the current step of data only to the one or more slave members.
By way of example, the master member sets a timeout time for transmitting data to each slave member, and in response to occurrence of a timeout, a failure to transmit data to the corresponding slave member is recognized.
According to an embodiment of the application, the master member of a communication group manages the stage or progress of the communication by steps. It advances to the next step only after sending data to all slave members within the current step has completed. Steps thus provide a finer granularity for failure recovery. When a failure occurs, the step in which the communication is located is identified, and the failed step is handled by retransmission or similar means; for example, the data transmitted in that step is discarded and the data transmission of that step is executed again. The failure is thereby recovered, and the work that must be redone is limited to a single step, which shortens the failure recovery time.
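As a minimal sketch of the step-wise sending of Fig. 3C, assuming a hypothetical transport callback `send(slave, step, payload)` that returns `True` on acknowledgement and `False` on timeout or error, the master may retransmit only to the slaves that failed and advance only when the step is complete. The retry limit `max_retries` is an assumption of this sketch; the application does not specify one.

```python
def master_send_step(step, payload, slaves, send, max_retries=3):
    """Send one step's payload to every slave; retransmit only to failed slaves.

    `send(slave, step, payload)` is a hypothetical transport callback that
    returns True when the slave acknowledges and False on timeout or error.
    """
    pending = list(slaves)
    attempts = 0
    while pending and attempts <= max_retries:
        # Keep only the slaves whose transmission failed in this attempt.
        pending = [s for s in pending if not send(s, step, payload)]
        attempts += 1
    return not pending  # True: the master may advance to the next step


def run_master(payloads, slaves, send):
    """Advance step by step; stop at the first step that cannot be completed."""
    step = 0
    for payload in payloads:
        if not master_send_step(step, payload, slaves, send):
            break
        step += 1
    return step  # number of completed steps


# Demonstration with a transport that fails once for one slave at step 1.
log = []
failed_once = set()


def flaky_send(slave, step, payload):
    log.append((slave, step))
    if (slave, step) == ("p3c0", 1) and (slave, step) not in failed_once:
        failed_once.add((slave, step))
        return False
    return True


completed = run_master(["d0", "d1"], ["p2c0", "p3c0"], flaky_send)
```

In the demonstration, only the slave that failed at step 1 is sent the data again, and the work redone is confined to that step.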
Fig. 3D is a flow chart of a master member of a communication group receiving data.
A master member of a communication group that is to receive data receives data from all slave members that are affiliated with the communication group in each "step" of the communication (360). The master member also identifies whether the data was successfully received from all slave members of the communication group (362).
Optionally, in response to successfully receiving data from all slave members of the communication group (362), the master member also sends a message to each slave member of the communication group to confirm that the current step of receiving data is complete (366). The acknowledgement message sent at step 366 corresponds to the acknowledgement message received by the slave member in Fig. 3B. Still optionally, in response to the master member not successfully receiving data from one or more slave members of the communication group (362), the master member sends a message to each slave member of the communication group to indicate that the current step of receiving data has failed (368). Based on the message from the master member indicating success or failure of receiving data at the current step, each slave member determines whether to resend the data of the current step to the master member. A slave member that successfully sent its data to the master member still resends the data when the master member indicates that data reception failed.
Optionally, in response to not successfully receiving data from one or more slave members of the communication group (362), the master member sends a message only to those slave members whose data it failed to receive, to indicate that the current step of receiving data has failed, and receives the data of the current step again only from those slave members. Once data has been successfully received from all slave members of the communication group (362), the master member sends a message to each slave member of the communication group to confirm that the current step of receiving data is complete (366) and proceeds to the next step.
By way of example, the master member sets a timeout time for receiving data from each slave member, and in response to the occurrence of a timeout, identifies a failure to receive data from the corresponding slave member.
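The receiving flow of Fig. 3D, in the variant where only the failed slave members are notified, may be sketched as below. The callbacks `receive(slave, step)` (returning `None` on timeout) and `notify(slave, step, ok)`, as well as the retry limit, are assumptions made for this sketch.

```python
def master_receive_step(step, slaves, receive, notify, max_retries=3):
    """Collect one step's data from every slave of the communication group.

    `receive(slave, step)` is a hypothetical callback returning the data, or
    None when a timeout or other failure occurs. `notify(slave, step, ok)`
    is a hypothetical callback that reports success or failure to a slave.
    """
    results = {}
    pending = list(slaves)
    for _ in range(max_retries + 1):
        still_pending = []
        for slave in pending:
            data = receive(slave, step)
            if data is None:
                notify(slave, step, False)   # ask only this slave to resend
                still_pending.append(slave)
            else:
                results[slave] = data
        pending = still_pending
        if not pending:
            for slave in slaves:
                notify(slave, step, True)    # confirm the step is complete
            return results
    return None  # the step could not be completed


# Demonstration: the first receive from one slave times out.
notes = []
calls = {"p3c0": 0}


def receive(slave, step):
    if slave == "p3c0":
        calls["p3c0"] += 1
        if calls["p3c0"] == 1:
            return None
    return f"{slave}-result-{step}"


def notify(slave, step, ok):
    notes.append((slave, ok))


collected = master_receive_step(5, ["p2c0", "p3c0"], receive, notify)
```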
FIG. 4A shows a flow diagram of fault handling according to an embodiment of the application.
While the information processing apparatus is operating normally, the firmware captures the occurrence of a failure and, for a failure type that permits fast recovery, carries out the processing flow shown in Fig. 4A to prepare for fast failure recovery.
When a fault occurs, an interrupt is typically triggered. An interrupt handling unit of the firmware identifies the type of the interrupt and, optionally, the cause of the interrupt, to determine whether fast failure recovery can be carried out. For example, transient faults such as memory access out of bounds, division by zero, error check failure, instruction fetch exception, or code exception, and interrupt types such as software-generated exceptions or internal exceptions, are usually resolved by restarting. With the fast failure recovery method and apparatus of the present application, the restart can be completed quickly and the whole information processing apparatus can resume operation.
Referring to fig. 4A, firmware captures the occurrence of a failure (410) and identifies whether the captured failure is of a type that can be repaired by the fast failure recovery means of embodiments of the present application (412). For non-repairable failures, the usual failure handling is used, such as shutting down and waiting for further maintenance. For a repairable failure (412), in the interrupt handling unit, the current state of the communication middleware is saved (414). The current state of the communication middleware includes one or more members to which the failing processor or processor core belongs, the communication groups to which the members belong, the master-slave identities of the members in each communication group, the communication step in each communication group in which the members are located, whether the communication step is confirmed to be completed, and the like.
In the interrupt handling unit, the code segments in the DRAM and an optional interrupt vector table are also protected (416). The code segments and the interrupt vector table are not typically modified during firmware execution. This portion of the contents in the DRAM is protected so that after the processor core reboots, the protected contents in the DRAM are not cleared, but can continue to be used directly. Therefore, the processes of loading the firmware image, decompressing the firmware image and creating the interrupt vector table are reduced, and the restarting time is shortened. It will be appreciated that the interrupt handlers associated with the interrupt vector table are also located in the protected DRAM space. Optionally, static data in the DRAM is also protected.
A fast failure recovery flag (418) is also recorded in non-volatile memory (NVM) (see also fig. 1) to mark that a fast failure recovery process can be implemented after a reboot. The location of the fast failure recovery flag in the NVM is a designated location to be accessed during boot firmware execution.
Finally, the failing processor core or the processor to which the processor core belongs is soft reset (420), so that the processor core or the processor is restarted and the fast failure recovery flow is implemented by executing the boot firmware in the NVM. During the soft reset, the internal state of the processor core or the processor is reset and executed from the designated position (the entry address of the boot firmware stored in the NVM) without powering down the DRAM or the like coupled with the processor, so that the protected contents are preserved.
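The sequence of steps 410 to 420 may be sketched as follows, modeling the NVM and the DRAM as plain dictionaries. The fault labels, dictionary keys, and return strings are hypothetical; real firmware would perform these actions inside an interrupt handler rather than in Python.

```python
# Hypothetical labels for the fault types the description lists as recoverable.
RECOVERABLE_FAULTS = {
    "memory_access_out_of_bounds",
    "division_by_zero",
    "error_check_failure",
    "instruction_fetch_exception",
    "code_exception",
    "software_exception",
}


def handle_fault(fault_type, middleware_state, nvm, dram):
    """Prepare fast failure recovery; return the action taken."""
    if fault_type not in RECOVERABLE_FAULTS:                 # step 412
        return "usual fault handling"
    # Step 414: save the middleware state (read back from the NVM later).
    nvm["middleware_state"] = dict(middleware_state)
    # Step 416: mark DRAM regions that must survive the restart.
    dram["protected"] = ["code_segment", "interrupt_vector_table"]
    nvm["fast_recovery_flag"] = True                         # step 418
    return "soft reset"                                      # step 420


nvm = {}
dram = {"code_segment": b"code", "interrupt_vector_table": b"ivt"}
state = {
    "member": (2, 0),
    "groups": {"IPC group 2": {"role": "slave", "step": 7, "acknowledged": False}},
}
action = handle_fault("division_by_zero", state, nvm, dram)
```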
FIG. 4B illustrates a flow diagram of boot firmware implemented fast failover in accordance with an embodiment of the present application.
Execution of the boot firmware begins in response to a reboot (450) following a reset of the processor core. The boot firmware first identifies whether a fast failure recovery flag is recorded in the NVM (452). If the fast failure recovery flag is not found (452), the boot firmware performs the usual boot process, including loading the system firmware image from the image server (454), decompressing the system firmware image obtained from the image server and storing it in DRAM (456), thereby obtaining the program needed for firmware execution, such as a code segment of the firmware (also referred to as "system firmware" to distinguish from "boot firmware") in DRAM. At the end of the boot firmware execution, an entry address of a code segment of the system firmware is obtained (458) and execution of the processor core is initiated from the entry address to begin execution of the system firmware (462), thereby handing over control of the processor core from the boot firmware to the system firmware. Optionally, the boot firmware also clears the contents of the DRAM before loading the firmware image to avoid legacy data in the DRAM from adversely affecting the execution of the system firmware.
If the fast failure recovery flag is found (452), meaning that the fast failure recovery process can currently be carried out, the code segments and/or the interrupt vector table are already stored in the DRAM and the firmware image loading process need not be performed. The boot firmware thus skips the step of loading the firmware image, directly obtains the entry address of the code segment of the system firmware (460), and causes the processor core to execute from that entry address to begin executing the system firmware (462). The restart time of the processor core is shortened by omitting the processes of loading the firmware image, clearing the entire DRAM, decompressing the firmware image, creating the interrupt vector table, and the like. The entry address in step 460 is, for example, the same as the entry address in step 458. Fast failure recovery therefore does not require recording the address of the code being executed at the time of the failure, nor adjusting the boot firmware or the system firmware to locate an entry address dedicated to fast failure recovery.
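The two boot paths of Fig. 4B may be compared with a short sketch. The constant `ENTRY_ADDRESS`, the dictionary keys, and the callbacks `fetch_image` and `decompress` are hypothetical; `zlib` merely stands in for whatever image compression an implementation uses.

```python
import zlib

ENTRY_ADDRESS = 0x1000  # hypothetical entry address of the system firmware


def boot_firmware(nvm, dram, fetch_image, decompress):
    """Return (entry address, trace of the boot steps performed)."""
    trace = []
    if nvm.get("fast_recovery_flag"):                 # step 452
        trace.append("skip image load")               # protected DRAM is reused
    else:
        dram.clear()                                  # optional DRAM clearing
        trace.append("clear DRAM")
        compressed = fetch_image()                    # step 454
        trace.append("load image")
        dram["code"] = decompress(compressed)         # step 456
        trace.append("decompress image")
    if "code" not in dram:
        raise RuntimeError("no system firmware code segment in DRAM")
    trace.append("jump to entry")                     # steps 458/460 and 462
    return ENTRY_ADDRESS, trace


image = zlib.compress(b"system firmware code")

cold_dram = {}
cold = boot_firmware({}, cold_dram, lambda: image, zlib.decompress)

warm_dram = {"code": b"system firmware code"}
warm = boot_firmware({"fast_recovery_flag": True}, warm_dram,
                     lambda: image, zlib.decompress)
```

Both paths end at the same entry address in this sketch, which reflects the example in which the entry address of step 460 equals that of step 458.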
After the processor core or the processor is restarted, the system firmware needs to recover to the position where the fault occurs and continue the task processing. During this time, other processor cores of the signal processing apparatus are still operating without waiting for the restarting processor core to rejoin at the location where the fault occurred.
FIG. 5A shows a block diagram of system firmware according to an embodiment of the present application.
The system firmware comprises at least two parts: a communication group initialization unit and a task processing unit. After control of the processor core is handed to the system firmware, the processor core joins its communication groups through the communication group initialization unit. The communication group initialization unit acquires the communication middleware state. For example, the current state of the communication middleware saved when the fault occurred (see also Fig. 4A) is acquired from the NVM and used as the configuration information of the communication groups, from which each communication group to which the processor core belongs is obtained, and the processor core rejoins each of those communication groups. In the case of a normal startup, the communication middleware state used for initialization is likewise acquired. Optionally, the boot firmware moves the current state of the communication middleware obtained from the NVM to a specified location accessed by the communication group initialization unit, so that the communication group initialization unit obtains the state of the communication middleware from the specified location regardless of whether a normal boot process or a fast failure recovery process is under way.
After the communication group is initialized, the task processing unit starts to process the computing task. The task processing unit uses communication middleware to transmit data in each communication group.
FIG. 5B illustrates a block diagram of a task processing unit according to an embodiment of the present application.
The main body of the task processing unit is a task processing loop including at least three parts: receiving data, processing data, and sending data. The loop receives data from the outside, processes the received data, and sends the processing result to other members to carry out task processing. In the task processing loop, an acknowledgement message is also given to the sender of the received data to indicate that the data was successfully received (see also the "send acknowledgement message" step of Fig. 3A). As an example, the acknowledgement message for the received data is given at the end of the loop, so that even if the processor on which the task processing unit runs restarts, the acknowledgement message determines whether the data sent to the task processing unit needs to be retransmitted.
Referring also to Fig. 2B, the task processing units of the four processors cooperatively process a task (denoted as task P). Processor 1 processes task P1, processor 2 processes task P2.1, processor 3 processes task P2.2, and processor 4 processes task P3. Let P = {P1, P2.1, P2.2, P3}; that is, task P comprises subtasks P1, P2.1, P2.2, and P3, and task P is complete when all of these subtasks have been processed.
One cycle of the task processing unit of fig. 5B represents, for example, one of the subtasks (subtask P1, P2.1, P2.2, or P3). After the task processing unit completes the calculation of one task processing cycle, a new cycle is started and data is received again. In receiving data and/or transmitting data, communication middleware may be used to exchange data with other members of the communication group. It will be appreciated that some task processing units (e.g., members that process the beginning phase of a computing task) need not receive data from members of other communication groups, and some task processing units (e.g., members that process the last phase of a computing task) need not send data to members of other communication groups.
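A minimal sketch of the task processing loop of Fig. 5B, with the acknowledgement given at the end of each cycle, is shown below. The function name and the four callbacks are hypothetical stand-ins for the communication middleware interfaces and the subtask computation.

```python
def task_processing_loop(receive, process, send, acknowledge, cycles):
    """Run `cycles` iterations of receive -> process -> send -> acknowledge."""
    for _ in range(cycles):
        data = receive()          # receive data via the communication middleware
        result = process(data)    # compute one subtask on the received data
        send(result)              # send the result to other members
        acknowledge()             # acknowledgement given at the end of the cycle


# Demonstration that records the order of operations over two cycles.
events = []
inputs = iter([1, 2])
task_processing_loop(
    receive=lambda: events.append("recv") or next(inputs),
    process=lambda x: x * 10,
    send=lambda r: events.append(("send", r)),
    acknowledge=lambda: events.append("ack"),
    cycles=2,
)
```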
According to the embodiment of the application, after the processor core is restarted, the boot firmware acquires the entry of the system firmware and starts to execute the system firmware. Starting from the entrance of the system firmware, the processor is added into the communication group through the communication group initialization unit, and then a task processing loop is started through the task processing unit. The processing flow is consistent with the processing flow when the system firmware is normally started, so that different processing flows do not need to be provided in the system firmware for various situations with faults, and the process of quickly recovering the faults has better adaptability. And the communication group initialization unit and the task processing unit do not need to be adjusted for the quick failure recovery process. According to the embodiment of the application, the state of the communication middleware when the fault occurs is identified in the communication middleware, and the fault is quickly recovered under the condition that the task processing unit does not change the processing flow.
Fig. 6A and 6B illustrate a flow chart of communication of a slave member within a communication group according to an embodiment of the present application.
Fig. 6A is a flow chart of a slave member of a communication group receiving data.
Referring also to Fig. 5B, after the processor core that is a slave member of the communication group restarts, the processing flow of its task processing unit begins with receiving data, regardless of the task processing stage it was in when the failure occurred. The task processing unit receives data, for example, by calling a receive interface provided by the communication middleware (610). In response, the communication middleware identifies whether a failure recovery flag is present (612), either from the state of the communication middleware saved when the failure occurred or by reading the fast failure recovery flag recorded in the NVM. The failure recovery flag indicates that the current communication belongs to the first task processing cycle after fast failure recovery. If the failure recovery flag is present (612), the communication middleware skips the operation of receiving data (618). Although the receive interface has been called, no receive operation is performed, so that fast failure recovery does not receive erroneous data. The call to the receive interface then returns, and the task processing unit continues with subsequent processing. Optionally, so as not to change the flow of the task processing loop, the communication middleware at step 618 indicates to the task processing unit that the data was received successfully and provides dummy received data.
If the failure recovery flag is not present (612), the communication middleware performs the normal receive operation, receiving data from the master member of the designated communication group (614), and the call to the receive interface returns. Optionally, in response to completion of the receive operation, an acknowledgement message is also sent to the sender of the data (616). As an example, the acknowledgement message is sent by the task processing unit to the sender of the data after processing of the current task processing cycle is completed. Optionally, if sending the acknowledgement message fails or the other party does not receive it, the acknowledgement message is sent again.
Thus, during fast failure recovery after a failure, the first receive operation of the task processing loop is ignored by the communication middleware. When the failure occurs, the master member sending data to the failed processor may be in the middle of sending data, may have completed sending data, may have received an acknowledgement message, or may not yet have received one. Whatever stage the master member was in when the failure occurred, the first receive operation of the failed slave member after fast recovery is ignored according to the process flow of Fig. 6A. If, when the failure occurred, the master member was sending data, had completed sending data, or had not received an acknowledgement message, the master member recognizes the failure of the slave member and attempts retransmission without advancing to the next step of the communication (see also Fig. 3C). If the master member had already received the acknowledgement message when the failure occurred, it advances to the next step of the communication (see also Fig. 3C). When the slave member, after restarting, calls the receive interface again to receive data, the receive operation may also indicate a receive failure to the master member so that the master member attempts retransmission again.
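The receive interface of Fig. 6A may be sketched as follows. The function name, the `state` dictionary with its `recovery_flag` key, and the two callbacks are assumptions of this sketch; the flag is not cleared here because, in the described flow, it is cleared during the send phase.

```python
def middleware_receive(state, receive_from_master, send_ack):
    """Receive interface of the communication middleware (call at step 610).

    Returns (success, data). `state["recovery_flag"]` is a hypothetical key
    marking the first task processing cycle after fast failure recovery.
    """
    if state.get("recovery_flag"):       # step 612
        # Step 618: skip the receive, report success with placeholder data.
        return True, None
    data = receive_from_master()         # step 614
    send_ack()                           # step 616 (optional in the description)
    return True, data


calls = []
recovering = middleware_receive(
    {"recovery_flag": True},
    lambda: calls.append("recv") or "data",
    lambda: calls.append("ack"),
)
normal = middleware_receive(
    {"recovery_flag": False},
    lambda: calls.append("recv") or "data",
    lambda: calls.append("ack"),
)
```

In the recovering case no transport callback is invoked, so the task processing loop proceeds unchanged while no possibly erroneous data is received.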
Fig. 6B is a flow chart of a slave member of a communication group sending data.
Referring also to fig. 5B, after the processor core as the slave member of the communication group is restarted, the processing flow of its task processing unit starts from receiving data, and after the data processing phase, enters the data sending phase regardless of the task processing phase in which the fault occurred.
The task processing unit sends data, for example, by calling a send interface provided by the communication middleware (660). In response, the communication middleware identifies, from the state of the communication middleware saved when the fault occurred, whether a failure recovery flag is present (662). The presence of the failure recovery flag indicates that the current communication belongs to the first task processing cycle after fast failure recovery (and that the phase of receiving data has already taken place). If the failure recovery flag is present (662), the communication middleware further identifies (670), from the state of the communication middleware saved when the fault occurred, whether the fault occurred before or after the acknowledgement message was given to the master member that sent the data (see also Fig. 5B). The communication middleware state records whether, at the time of the fault, the current task processing cycle had already given an acknowledgement message to the master member of the data receiving phase. If the acknowledgement message had been given, the current task processing cycle can complete normally without the master member retransmitting the data; if it had not been given, the current task processing cycle should be skipped and the master member requested to retransmit the data.
With continued reference to FIG. 6B, if the communication middleware identifies the location of the failure prior to giving an acknowledgement message to the master member sending the data (670), the communication middleware skips the operation of sending the data and optionally also indicates a failure to send the data to the master member of the communication group receiving the data at the stage of sending the data (672). Although currently in response to a process calling the send interface, a send data operation is not performed in order to perform fast failure recovery to avoid sending erroneous data. And the call to the sending interface is returned, and the task processing unit continues to perform subsequent processing. Optionally, the communication middleware or task processing unit also gives an acknowledgement message to the primary member of the receive data phase of the current task processing cycle to indicate a failure to receive the data (674), such that the primary member of the receive data phase that sent the data retransmits the data.
The failure recovery flag is also cleared (676). Therefore, when the send interface or the receive interface of the communication middleware is called again, the communication is no longer treated as belonging to the first task processing cycle after failure recovery, but is processed as normal communication.
If the communication middleware identifies that the failure occurred after the acknowledgement message was given to the master member that sent the data (670), the communication middleware knows that no further processing is needed for the current task processing cycle. Optionally, the communication middleware skips the operation of sending data (680) and returns the call to the send interface. The failure recovery flag is also cleared (684). Therefore, when the send interface or the receive interface of the communication middleware is called again, the communication is no longer treated as belonging to the first task processing cycle after failure recovery, but is processed as normal communication.
If the failure recovery flag is not present (662), the communication middleware performs the normal send operation, sending data to the master member of the designated communication group (664), and the call to the send interface returns. Optionally, in response to completion of the send operation, an acknowledgement message is also received from the master member that received the data (666).
Thus, during fast failure recovery after a failure, the communication middleware also skips the first send operation of the task processing loop. Whether an acknowledgement message is sent again, requiring the master member that sent the data in the data receiving phase of the current task processing cycle to retransmit, is determined by whether the failure occurred before or after the acknowledgement message was given for the data receiving phase.
When the failure occurs, the failed processor, as a slave member sending data, may not yet have started sending data, may be in the middle of sending data, or may have completed sending data. Regardless of the stage of sending data the slave member was in when the failure occurred, the handling depends on whether the acknowledgement message had already been given in the task processing cycle in which the failure occurred (see also Fig. 5B). If the acknowledgement message had been given, then in the first task processing cycle after failure recovery the data receiving and sending operations are ignored, the next task processing cycle is entered, and normal operation resumes. If no acknowledgement message had been given when the failure occurred, then in the first task processing cycle after failure recovery: for the data sending phase, it is ensured that the receiver of the data recognizes the send failure and attempts to receive the data again; for the data receiving phase of the current task processing cycle, an acknowledgement message is sent so that the sender of the data sends the data again; and the recovered task processing unit receives the data through normal operation in the next task processing cycle. Therefore, whichever stage of the task processing cycle the failure occurred in, the task being processed at the time of the failure can be properly handled after failure recovery.
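The decision made by the send interface of Fig. 6B may be sketched as follows. The function name, the `state` keys `recovery_flag` and `ack_given_before_fault`, the callbacks, and the message strings are hypothetical; `report` stands in for the optional messages to the master members at steps 672 and 674.

```python
def middleware_send(state, data, send_to_master, report):
    """Send interface of the communication middleware (call at step 660)."""
    if not state.get("recovery_flag"):               # step 662
        send_to_master(data)                         # step 664: normal send
        return "sent"
    state["recovery_flag"] = False                   # steps 676 / 684
    if not state.get("ack_given_before_fault"):      # step 670
        report("send failed")                        # step 672
        report("receive failed, retransmit")         # step 674
        return "skipped, retransmission requested"
    return "skipped"                                 # step 680


sent, messages = [], []

# Fault occurred before the acknowledgement was given.
before = {"recovery_flag": True, "ack_given_before_fault": False}
r1 = middleware_send(before, "x", sent.append, messages.append)

# Fault occurred after the acknowledgement was given.
after = {"recovery_flag": True, "ack_given_before_fault": True}
r2 = middleware_send(after, "y", sent.append, messages.append)

# The flag has been cleared, so the next call is a normal send.
r3 = middleware_send(after, "z", sent.append, messages.append)
```

The first call after recovery never transmits data in this sketch; only the saved acknowledgement state decides whether the master members are asked to retransmit.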
According to an embodiment of the present application, the master member of the communication group communicates with the slave member of the communication group implementing the flow of fig. 6A or 6B according to the flow illustrated in fig. 3C or 3D to assist the slave member in completing the failure recovery process.
Fig. 7A to 7E are diagrams illustrating implementation of fast failure recovery of an information processing apparatus according to an embodiment of the present application.
Referring to fig. 7A, the information processing apparatus includes, by way of example, 4 processors (processor 1, processor 2, processor 3, and processor 4), each of which includes 8 processor cores (denoted as core 0 to core 7), each of which independently executes a program. The 4 processors are coupled to each other via a network and/or SRIO to enable communication between processor cores located in different processors.
Fig. 7A includes two communication groups, denoted as IPC group 2 and IPC group 3. Communication group IPC group 2 includes processor core 7 of processor 1, processor core 0 of processor 2, and processor core 0 of processor 3, where processor core 7 of processor 1 is the master member of IPC group 2, and processor core 0 of processor 2 and processor core 0 of processor 3 are both slave members of IPC group 2. In communication group IPC group 2, processor core 7 of processor 1 transmits data, and processor core 0 of processor 2 and processor core 0 of processor 3 receive data. Communication group IPC group 3 includes processor core 0 of processor 2, processor core 0 of processor 3, and processor core 0 of processor 4, where processor core 0 of processor 4 is the master member of IPC group 3, and processor core 0 of processor 2 and processor core 0 of processor 3 are both slave members of IPC group 3. In communication group IPC group 3, processor core 0 of processor 2 and processor core 0 of processor 3 transmit data, and processor core 0 of processor 4 receives data.
In fig. 7A, n denotes a communication step. In communication group IPC group 2, the step in which processor core 7 of processor 1 transmits data to processor core 0 of processor 2 (denoted as receive Rn, from the perspective of processor core 0 of processor 2) is completed, and the step in which processor core 7 of processor 1 transmits data to processor core 0 of processor 3 (denoted as receive Rn, from the perspective of processor core 0 of processor 3) is completed. In communication group IPC group 3, the step in which processor core 0 of processor 2 transmits data to processor core 0 of processor 4 (denoted as transmit Sn, from the perspective of processor core 0 of processor 2) is completed, and the step in which processor core 0 of processor 3 transmits data to processor core 0 of processor 4 (denoted as transmit Sn, from the perspective of processor core 0 of processor 3) is completed. The processor cores therefore each enter the next task processing cycle.
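The group layout of fig. 7A can be written down as a small configuration table. This is an illustrative sketch only: the dictionary layout, the `direction` field, and the helper `roles_of` are assumptions made for the example and are not taken from the application.

```python
# A processor core is identified here by a (processor, core) pair.
# The membership below follows the description of fig. 7A.
IPC_GROUPS = {
    2: {"master": (1, 7), "slaves": [(2, 0), (3, 0)], "direction": "master_sends"},
    3: {"master": (4, 0), "slaves": [(2, 0), (3, 0)], "direction": "master_receives"},
}


def roles_of(core):
    """Return (group id, role) for every communication group the core belongs to."""
    roles = []
    for group_id, group in sorted(IPC_GROUPS.items()):
        if core == group["master"]:
            roles.append((group_id, "master"))
        elif core in group["slaves"]:
            roles.append((group_id, "slave"))
    return roles


if __name__ == "__main__":
    print(roles_of((2, 0)))  # processor core 0 of processor 2
```

A lookup of this kind is what allows a core to find out, from configuration alone, which groups it belongs to and in which role.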
Referring to FIG. 7B, next, processor core 0 of processor 2 fails (indicated by the vertical hatching). Through the processes shown in fig. 4A and 4B, processor 2 causes processor core 0 to implement a fast fault recovery process.
When a failure of processor core 0 of processor 2 occurs, in communication group IPC group 2, the step of sending data to processor core 0 of processor 2 by processor core 7 of processor 1 (denoted as receiving Rn +1 (from the perspective of processor core 0 of processor 2)) is completed, but processor core 0 of processor 2 has not yet given an acknowledgement message to processor core 7 of processor 1. The step of the processor core 7 of processor 1 sending data to the processor core 0 of processor 3, denoted as receiving Rn +1 (from the perspective of the core 0 of processor 3), is completed and the processor core 0 of processor 3 has given an acknowledgement message to the processor core 7 of processor 1. In the communication group IPC group 3, the step of transmitting data from the processor core 0 of the processor 2 to the processor core 0 of the processor 4 (denoted as transmission Sn +1 (from the perspective of the processor core 0 of the processor 2)) is in progress, and this transmission fails due to a failure of the processor core 0 of the processor 2, and the step of transmitting data from the processor core 0 of the processor 3 to the processor core 0 of the processor 4 (denoted as transmission Sn +1 (from the perspective of the processor core 0 of the processor 3)) is completed.
Processor core 0 of processor 4 is a master member waiting to receive data from processor core 0 of processor 2. Since it has not received the data from all slave members, its communication for the current step is not complete, so processor core 0 of processor 4 does not advance and waits to receive the data from processor core 0 of processor 2. Optionally, because processor core 0 of processor 4, as the master member, has not received all the data transmitted in communication group IPC group 3, it also blocks processor core 0 of processor 3 of this communication group in the data transmitting phase of step n+1.
Processor core 7 of processor 1, as a master member, waits for the receipt of an acknowledgement message from processor core 0 of processor 2. Processor core 7 of processor 1 also does not advance to step n +2 until all acknowledgement messages are received, thereby also potentially blocking processor core 0 of processor 3 from receiving data at step n + 2.
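The blocking behaviour of both master members comes down to the same gating rule: a master advances only when every slave member has finished the current step. A minimal sketch, assuming a hypothetical helper `can_advance` and string labels such as "P2C0" for processor core 0 of processor 2:

```python
def can_advance(expected_members, completed_members):
    """A master member may advance to the next step only when every slave
    member has completed the current step (delivered its data, or given
    its acknowledgement message). Illustrative sketch only."""
    return set(expected_members) <= set(completed_members)


if __name__ == "__main__":
    slaves = ["P2C0", "P3C0"]
    # State of fig. 7B as described: only processor core 0 of processor 3
    # has acknowledged (IPC group 2) or delivered its data (IPC group 3).
    print(can_advance(slaves, ["P3C0"]))
    # After processor core 0 of processor 2 recovers and completes the step:
    print(can_advance(slaves, ["P3C0", "P2C0"]))
```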
Referring to fig. 7C, next, processor core 0 of processor 2 implements fast fault recovery (indicated by cross-hatching) according to an embodiment of the present application (flow shown by fig. 4A and 4B).
At step 462 of FIG. 4B, processor core 0 of processor 2 executes the system firmware starting from the entry address and performs communication group initialization. During communication group initialization, processor core 0 of processor 2 learns from the communication group configuration information that it belongs to communication group IPC group 2 as a slave member and to communication group IPC group 3 as a slave member, and then joins these communication groups. Accordingly, processor core 7 of processor 1 recognizes that its slave member has joined communication group IPC group 2, and processor core 0 of processor 4 recognizes that its slave member has joined communication group IPC group 3.
Referring to fig. 7D, next, processor core 0 of processor 2 enters the first task processing cycle after fast failure recovery (see fig. 5B). When receiving data, processor core 0 of processor 2 implements the flow shown in fig. 6A, and when sending data, processor core 0 of processor 2 implements the flow shown in fig. 6B.
When receiving data, processor core 0 of processor 2 calls the receive interface of the communication middleware. The communication middleware recognizes that the fault recovery flag exists and that the task processing cycle was at step n+1 when the failure occurred, and therefore skips the operation of receiving data (indicated as "receive Rn+1 (skip)" in fig. 7D) or returns dummy data as the result of receiving data. When sending data, processor core 0 of processor 2 calls the send interface of the communication middleware. The communication middleware recognizes that the fault recovery flag still exists and that the failure occurred before the acknowledgement message was transmitted, and thus skips the operation of transmitting data (indicated as "transmit Sn+1 (skip)" in fig. 7D). In this way, the communication middleware makes processor core 0 of processor 4 aware that its step n+1 reception of data from processor core 0 of processor 2 has failed, so that processor core 0 of processor 4 re-initiates the step n+1 operation of receiving data from processor core 0 of processor 2.
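The restart path that leads to this step, as set out in claims 1 and 2, is a choice between reusing the firmware code segment preserved in memory and loading a fresh image. The sketch below models that choice with plain dictionaries. All names (`boot`, `nvm`, `memory`, `fetch_image`, `load_address`) and the default load address are hypothetical, and decompression of the image is omitted.

```python
def boot(nvm, memory, fetch_image, load_address=0x80000000):
    """Return (entry_address, path) chosen by the boot firmware (sketch).

    nvm    -- dict standing in for the non-volatile memory
    memory -- dict standing in for RAM that survives a soft reset
    """
    if nvm.get("recovery_flag"):
        # Fast path: the firmware code segment was protected across the
        # soft reset, so execution resumes from its saved entry address.
        return memory["entry_address"], "fast"
    # Cold path: obtain the firmware image and load it before executing.
    memory["firmware"] = fetch_image()
    memory["entry_address"] = load_address
    return load_address, "cold"


if __name__ == "__main__":
    ram = {"entry_address": 0x1000}
    print(boot({"recovery_flag": True}, ram, lambda: b"image"))
```

The fast path never contacts the image server, which is where the time saving described in the application comes from.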
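The behaviour of the two middleware interfaces during the first cycle after recovery can be sketched as follows. This is a simplified model under stated assumptions: the class `CommMiddleware`, its attributes, and the callback parameters are hypothetical, and the real flows are those of fig. 6A and fig. 6B.

```python
class CommMiddleware:
    """Illustrative model of the receive/send interfaces (hypothetical names)."""

    def __init__(self, recovery_flag=False, ack_given_before_failure=False):
        self.recovery_flag = recovery_flag
        self.ack_given_before_failure = ack_given_before_failure
        self.events = []  # what the middleware did, in order

    def receive(self, normal_receive):
        if self.recovery_flag:
            self.events.append("receive skipped")
            return None  # dummy data could be returned instead
        return normal_receive()

    def send(self, payload, normal_send):
        if self.recovery_flag:
            if not self.ack_given_before_failure:
                # The receiver must learn that this step's data never arrived.
                self.events.append("send failure reported to receiver")
            self.events.append("send skipped")
            return False
        normal_send(payload)
        return True


if __name__ == "__main__":
    mw = CommMiddleware(recovery_flag=True, ack_given_before_failure=False)
    mw.receive(lambda: b"data")
    mw.send(b"result", lambda p: None)
    print(mw.events)
```

The task code calls the same interfaces in every cycle; only the middleware knows that the current cycle follows a recovery.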
Optionally, processor core 0 of processor 4 also indicates to processor core 0 of processor 3, a slave member of its communication group IPC group 3, that its step n+1 data sending failed, and in response, processor core 0 of processor 3 re-initiates the step n+1 data sending operation to processor core 0 of processor 4. In this embodiment, after processor core 0 of processor 3, as a slave member of communication group IPC group 3, completes the step n+1 data sending to processor core 0 of processor 4, it does not wait for an acknowledgement message from processor core 0 of processor 4 and enters the next task processing cycle. In that cycle, it initiates the round n+2 operation of receiving data from processor core 7 of processor 1 in communication group IPC group 2. However, processor core 7 of processor 1 has not received the acknowledgement message for the round n+1 data receiving operation (receive Rn+1 in fig. 7B and 7D) from processor core 0 of processor 2 of communication group IPC group 2. It therefore keeps waiting without entering the next round of task processing and does not transmit round n+2 data to processor core 0 of processor 3 of communication group IPC group 2, so that processor core 0 of processor 3 also waits for processor 2 to recover from the failure.
As a further alternative, processor core 0 of processor 4 does not indicate to processor core 0 of processor 3, a slave member of its communication group IPC group 3, that its step n+1 data transmission failed. Thus, after completing the step n+1 data transmission to processor core 0 of processor 4, processor core 0 of processor 3 also transmits an acknowledgement message to processor core 7 of processor 1, enters the next task processing cycle, and initiates the round n+2 operation of receiving data from processor core 7 of processor 1. Because processor core 7 of processor 1 has not received the acknowledgement message of processor core 0 of processor 2 for the round n+1 data reception, it continues to wait for that acknowledgement message without entering the next round of task processing and without initiating the step n+2 data transmission to processor core 0 of processor 3. Thus, both processor core 7 of processor 1 and processor core 0 of processor 3, which belong to communication group IPC group 2, wait for processor core 0 of processor 2, which belongs to the same communication group, to recover from the failure.
With continued reference to fig. 7D, after communication middleware of processor core 0 of processor 2 skips the operation of transmitting data (transmission Sn +1 of fig. 7D) to processor core 0 of processor 4 in turn n +1, processor core 0 of processor 2 also transmits an acknowledgement message to processor core 7 of processor 1 (see also step 616 of fig. 6A) to indicate that the operation of receiving data at step n +1 failed, requesting processor core 7 of processor 1 to retransmit the data.
Referring to fig. 7E, after processor core 0 of processor 2 sends the acknowledgement message to processor core 7 of processor 1, it enters the next task processing cycle and again tries to receive data from the master member of communication group IPC group 2 (processor core 7 of processor 1) (denoted as redo receive Rn+1 in fig. 7E). It is still the step n+1 data reception that is performed here, because processor core 0 of processor 2 knows that the previous step n+1 data receiving operation failed and therefore performs it again.
Processor core 7 of processor 1 starts retransmitting the data to be transmitted to processor core 0 of processor 2 at step n +1 (denoted as redo receive Rn +1 in fig. 7E) in response to receiving the acknowledgement message transmitted by processor core 0 of processor 2.
Since processor core 0 of processor 2 has completed the fault recovery and the fault recovery flag is cleared, the communication middleware performs a normal flow of received data (see also step 614 of fig. 6A) in response to the receive interface being called, so that the operation of receiving data at step n +1 succeeds.
Next, processor core 0 of processor 2 processes the received data in the task processing cycle and sends data to the master member of communication group IPC group 3 (processor core 0 of processor 4). In response to the send interface being called, the communication middleware performs the normal data sending flow (see also step 664 of fig. 6B), so that the step n+1 data sending operation (redo Sn+1 of fig. 7E) succeeds.
Processor core 0 of processor 2 also sends an acknowledgement message to processor core 7 of processor 1 to indicate that the redone step n+1 data receiving operation (Rn+1) succeeded. At this point, processor core 0 of processor 2 enters the next task processing cycle (step n+2). Having received the acknowledgement messages of all slave members of communication group IPC group 2 (processor core 0 of processor 2 and processor core 0 of processor 3), processor core 7 of processor 1 has completed the step n+1 data sending operation and advances to step n+2 to send data to all slave members of communication group IPC group 2. Having received the step n+1 data sent by all slave members of communication group IPC group 3 (processor core 0 of processor 2 and processor core 0 of processor 3), processor core 0 of processor 4 has completed the step n+1 data reception and advances to step n+2 to receive data from all slave members of communication group IPC group 3. Accordingly, processor core 0 of processor 3 advances to step n+2 and initiates reception of data from processor core 7 of processor 1.
So far, the processor core 0 of the processor 2 which has failed in the (n +1) th step completes the communication of the (n +1) th step with all other members which have communication relations with the processor core, and all the members advance to normal task processing.
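From the point of view of the sending master member (processor core 7 of processor 1), the walkthrough of fig. 7B to 7E amounts to tracking acknowledgements per step and retransmitting to any slave member that reports a failed reception. A minimal sketch, in which the class `MasterSendStep`, its fields, and labels such as "P2C0" are hypothetical:

```python
class MasterSendStep:
    """One send step of a master member (illustrative sketch)."""

    def __init__(self, step, slaves):
        self.step = step
        self.pending = set(slaves)  # slaves that have not acknowledged success
        self.resent_to = []         # slaves the data was retransmitted to

    def on_ack(self, slave, success):
        """Handle an acknowledgement; return True once the step is complete."""
        if success:
            self.pending.discard(slave)
        else:
            # A failure acknowledgement requests retransmission of this step.
            self.resent_to.append(slave)
        return not self.pending


if __name__ == "__main__":
    step = MasterSendStep("n+1", ["P2C0", "P3C0"])
    print(step.on_ack("P3C0", True))   # P3C0 acknowledged before the failure
    print(step.on_ack("P2C0", False))  # recovered P2C0 reports failed receive
    print(step.on_ack("P2C0", True))   # redo receive Rn+1 succeeded
```

Only the final call completes the step, which matches the point in the description at which processor core 7 of processor 1 advances to step n+2.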
Referring back to fig. 7B, if the step of the processor core 0 of the processor 2 receiving data from the processor core 7 of the processor 1 of the communication group IPC group 2 (receiving Rn +1 (from the perspective of the core 0 of the processor 2)) is not completed when the failure of the processor core 0 of the processor 2 occurs, the process flow after the failure recovery also applies to the process flows of the embodiments shown in fig. 7A to 7E.
Fig. 8A to 8D are diagrams illustrating implementation of fast failure recovery of an information processing apparatus according to still another embodiment of the present application.
Referring to fig. 8A, as an example, the hardware configuration (the number of processors, the coupling manner, and the like) and the communication group configuration of the information processing apparatus are the same as those of the embodiment shown in fig. 7A.
Referring to FIG. 8A, processor core 0 of processor 2 fails (indicated by the vertical hatching). When a fault of processor core 0 of processor 2 occurs, in communication group IPC group 2, the step of processor core 7 of processor 1 sending data to processor core 0 of processor 2 (denoted as receiving Rn +1 (from the perspective of processor core 0 of processor 2)) is completed, and processor core 0 of processor 2 has sent an acknowledgement message to processor core 7 of processor 1, and processor core 7 of processor 1 has also received the acknowledgement message. The step of the processor core 7 of processor 1 sending data to the processor core 0 of processor 3, denoted as receiving Rn +1 (from the perspective of the processor core 0 of processor 3), is completed and the processor core 0 of processor 3 has given an acknowledgement message to the processor core 7 of processor 1. In communication group 3, the step of processor core 0 of processor 2 transmitting data to processor core 0 of processor 4 (denoted as transmission Sn +1 (from the perspective of processor core 0 of processor 2)) is completed, and the step of processor core 0 of processor 3 transmitting data to processor core 0 of processor 4 (denoted as transmission Sn +1 (from the perspective of processor core 0 of processor 3)) is also completed.
Processor core 0 of processor 4 is the master member of communication group IPC group 3. Because processor core 0 of processor 2 and processor core 0 of processor 3 have both completed the step n+1 communication, processor core 0 of processor 4 advances, and it waits in the data receiving communication of the next step (step n+2).
Processor core 7 of processor 1, the master member of communication group IPC group 2, has likewise completed the step n+1 communication because it received the acknowledgement messages sent by processor core 0 of processor 2 and processor core 0 of processor 3; it advances and initiates the data sending communication of the next step (step n+2).
Alternatively, processor core 0 of processor 3 is a slave member of communication group IPC group 3, and since the data transmission to processor core 0 of processor 4 is successful, the communication of step n +1 is completed and advances.
Next, processor core 0 of processor 2 implements fast failure recovery according to embodiments of the application (via the flows shown in fig. 4A and 4B). Processor core 0 of processor 2 executes the system firmware starting from the entry address and performs communication group initialization.
Referring to FIG. 8B, processor core 0 of processor 2 rejoins communication group IPC group 2 and communication group IPC group 3 through communication group initialization.
At this time, processor core 0 of processor 3 has not stopped operating, and it completes the step n+2 operation of receiving data from processor core 7 of processor 1. After processing the received data, it also initiates the step n+2 data sending operation to processor core 0 of processor 4. Processor core 0 of processor 4 receives the step n+2 data from processor core 0 of processor 3 but does not receive data from processor core 0 of processor 2; it therefore continues to wait for the data of processor core 0 of processor 2 and, by giving no acknowledgement to processor core 0 of processor 3, blocks processor core 0 of processor 3 in the step n+2 data sending phase.
Referring to fig. 8C, next, processor core 0 of processor 2 enters the first task processing cycle after fast failure recovery (see fig. 5B). When receiving data, processor core 0 of processor 2 implements the flow shown in fig. 6A, and when sending data, processor core 0 of processor 2 implements the flow shown in fig. 6B.
When receiving data, processor core 0 of processor 2 calls the receive interface of the communication middleware. The communication middleware recognizes that the fault recovery flag exists and that the task processing cycle was at step n+1 when the failure occurred, so it skips the operation of receiving data (indicated as "receive Rn+1 (skip)" in fig. 8B) or returns dummy data as the result of receiving data. When sending data, the communication middleware recognizes that the fault recovery flag still exists and that the failure occurred after the acknowledgement message was transmitted, and thus skips the operation of transmitting data (indicated as "transmit Sn+1 (skip)" in fig. 8C).
Having advanced to the next communication step, processor core 0 of processor 4 is waiting for processor core 0 of processor 2 to transmit data at step n+2, and processor core 7 of processor 1 is waiting for processor core 0 of processor 2 to receive data at step n+2. Thus, in fig. 8C, the skipping of the data receiving and data transmitting operations by processor core 0 of processor 2 has no effect on the other processor cores that belong to the same communication groups as it.
Next, processor core 0 of processor 2 also clears the fault recovery flag and proceeds to the next task processing cycle. At this time, processor core 0 of processor 4 still blocks processor core 0 of processor 3 in the step n+2 data sending stage.
In the next cycle, because the fault recovery flag has been cleared, processor core 0 of processor 2 performs task processing in the normal manner and carries out the step n+2 stages of receiving data, sending data, and so on.
Referring to fig. 8D, processor core 0 of processor 2 has obtained the received data of step n +2 from processor core 7 of processor 1 and completed the transmission data of step n +2 to processor core 0 of processor 4.
In response, processor core 0 of processor 4 has completed step n +2 receiving data and has advanced, also causing processor core 0 of processor 3, which was previously blocked, to advance. Processor core 0 of processor 3 proceeds to the process of receiving data from processor core 7 of processor 1 at step n + 3.
Processor core 0 of processor 2 also gives an acknowledgement message to processor core 7 of processor 1 for the data received in step n+2, which causes processor core 7 of processor 1 to complete the step n+2 data transmission and advance into the step n+3 data transmission phase.
Each member then performs task processing in the normal manner and carries out the data receiving and transmitting stages of step n+3 and the subsequent steps.
Thus, according to the fast failure recovery method of the embodiments of the present application, when a slave member of a communication group fails and recovers, the influence of the failure is limited to one task processing cycle. After the failed member recovers, the task processing interrupted by the failure continues through the task processing cycles, that is, normal task processing is restored, and the other members belonging to the same communication group can again communicate normally and process tasks cooperatively with it.
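The difference between the two walkthroughs (fig. 7A to 7E and fig. 8A to 8D) is which step the recovered slave member works on once its first, skipped cycle is over. A one-function sketch, where `step_after_recovery_cycle` and its parameters are hypothetical names and the step is modelled as an integer:

```python
def step_after_recovery_cycle(failed_step: int, ack_given_before_failure: bool) -> int:
    """Step the recovered slave member performs in the cycle that follows
    its first (skipped) cycle after fault recovery. Illustrative sketch."""
    if ack_given_before_failure:
        # Fig. 8 case: the peers already advanced, so resume at the next step.
        return failed_step + 1
    # Fig. 7 case: the interrupted step is redone (redo receive Rn+1).
    return failed_step


if __name__ == "__main__":
    n_plus_1 = 11  # arbitrary example value standing in for step n+1
    print(step_after_recovery_cycle(n_plus_1, True))
    print(step_after_recovery_cycle(n_plus_1, False))
```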
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application. It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method of fast failure recovery for a multiprocessor signal processing apparatus, comprising:
in response to a failure of a first processor, saving a communication middleware state of the first processor and a fault recovery flag in a non-volatile memory; protecting, in a memory, a code segment of system firmware run by the first processor, and causing the first processor to undergo a soft reset;
running boot firmware when restarting after the first processor is reset, wherein the boot firmware accesses the nonvolatile memory, acquires an entry address of the code segment in the memory in response to the nonvolatile memory storing a fault recovery flag, and executes the system firmware by executing an instruction from the entry address;
the first processor executes the system firmware to acquire the saved communication middleware state so as to enable the first processor to rejoin the communication group to which the first processor belongs when the first processor fails;
the first processor executes the system firmware to communicate among the communication groups and to perform a task processing cycle.
2. The method of claim 1, further comprising:
the boot firmware, in response to the fault recovery flag not being stored in the nonvolatile memory, acquires a firmware image from an image server, decompresses the acquired firmware image to obtain the system firmware, loads the system firmware into the memory, acquires the entry address in the memory, and executes the system firmware by executing instructions from the entry address.
3. The method of claim 1 or 2, wherein
the communication middleware state also records the position, in the task processing cycle, of the failed processor when the failure occurred; the position indicates whether the interface of the communication middleware was called to receive data, whether the interface of the communication middleware was called to send data, and/or whether an acknowledgement message was successfully given for the call to the interface of the communication middleware to receive data.
4. The method of claim 3, further comprising:
in response to the interface for receiving data being called, skipping the current data receiving operation if the communication middleware recognizes that the fault recovery flag exists; and executing the current data receiving operation if the communication middleware recognizes that the fault recovery flag does not exist.
5. The method of claim 4, further comprising:
in response to the interface for sending data being called, if the communication middleware recognizes that the fault recovery flag exists and the position of the failed processor in the task processing cycle when the failure occurred is before the acknowledgement message was successfully given for the call to the interface of the communication middleware to receive data, skipping the data sending operation, indicating the data sending failure to the receiver of the data sending operation, and clearing the fault recovery flag.
6. The method of claim 5, further comprising:
in response to the interface for sending data being called, if the communication middleware recognizes that the fault recovery flag exists and the position of the failed processor in the task processing cycle when the failure occurred is after the acknowledgement message was successfully given for the call to the interface of the communication middleware to receive data, skipping the data sending operation without indicating a data sending failure to the receiver of the data sending operation, and clearing the fault recovery flag; and
in response to the interface for sending data being called, if the communication middleware recognizes that the fault recovery flag does not exist, executing the data sending operation.
7. The method of claim 6, further comprising:
before the task processing cycle finishes, if the fault recovery flag is recognized and the position of the failed processor in the task processing cycle when the failure occurred is before the acknowledgement message was successfully given for the call to the interface of the communication middleware to receive data, giving an acknowledgement message for the current data receiving operation to the corresponding data sender, the acknowledgement message indicating that the data receiving operation failed.
8. The method of claim 7, wherein the first processor is a slave member of a first communication group;
the multiprocessor signal processing apparatus further includes a second processor that is a master member of the first communication group,
transmitting data in the first communication group to the slave members of the first communication group;
in response to the sending interface being invoked by the second processor as a master member, the communication middleware of the second processor sends data to all slave members of the first communication group, and returns an invocation of the sending interface in response to all slave members of the first communication group receiving data successfully giving an acknowledgement message for the received data; and resending the data to any slave member of the first communication group that receives the data in response to the slave member giving an acknowledgement message for a failure to receive the data.
9. The method of claim 8, wherein the first processor is a slave member of a second communication group;
the multi-processor signal processing apparatus further comprises a third processor, the third processor being a primary member of a second communication group,
receiving data from a slave member of the second communication group;
in response to the receiving interface being invoked by a third processor that is a master member, the communication middleware of the third processor receiving data from all slave members of the second communication group and returning the invocation of the receiving interface in response to receiving data from all slave members of the second communication group that sent the data; and re-receiving data from any slave member of the second communication group in response to a failure to receive data from the slave member.
10. An information processing apparatus comprising a memory, a plurality of processors, and a program stored on the memory and executable on each processor, wherein each processor of the plurality of processors implements the method according to one of claims 1 to 9 when executing the program.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010441610.0A CN111611111B (en) 2020-05-22 2020-05-22 Method and system for quickly recovering fault of multiprocessor signal processing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010441610.0A CN111611111B (en) 2020-05-22 2020-05-22 Method and system for quickly recovering fault of multiprocessor signal processing equipment

Publications (2)

Publication Number Publication Date
CN111611111A CN111611111A (en) 2020-09-01
CN111611111B true CN111611111B (en) 2020-12-22

Family

ID=72199549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010441610.0A Active CN111611111B (en) 2020-05-22 2020-05-22 Method and system for quickly recovering fault of multiprocessor signal processing equipment

Country Status (1)

Country Link
CN (1) CN111611111B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101216793A (en) * 2008-01-18 2008-07-09 华为技术有限公司 Multiprocessor system fault restoration method and device
CN104657229A (en) * 2015-03-19 2015-05-27 哈尔滨工业大学 Multi-core processor rollback recovering system and method based on high-availability hardware checking point
US10275302B2 (en) * 2015-12-18 2019-04-30 Microsoft Technology Licensing, Llc System reliability by prioritizing recovery of objects
US11243782B2 (en) * 2016-12-14 2022-02-08 Microsoft Technology Licensing, Llc Kernel soft reset using non-volatile RAM
CN109240840B (en) * 2017-07-11 2022-04-19 阿里巴巴集团控股有限公司 Disaster recovery method and device for cluster system and machine readable medium

Also Published As

Publication number Publication date
CN111611111A (en) 2020-09-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant