WO2014161373A1 - System fault detection and processing method, apparatus, and computer-readable storage medium - Google Patents

System fault detection and processing method, apparatus, and computer-readable storage medium

Info

Publication number
WO2014161373A1
WO2014161373A1 PCT/CN2014/070187 CN2014070187W WO2014161373A1 WO 2014161373 A1 WO2014161373 A1 WO 2014161373A1 CN 2014070187 W CN2014070187 W CN 2014070187W WO 2014161373 A1 WO2014161373 A1 WO 2014161373A1
Authority
WO
WIPO (PCT)
Prior art keywords
task
detection
infinite loop
dog
interrupt service
Prior art date
Application number
PCT/CN2014/070187
Other languages
English (en)
French (fr)
Inventor
于光波
朱怀云
邱静
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Priority to US14/781,403 (US9720761B2)
Priority to EP14779970.4A (EP2983086A4)
Publication of WO2014161373A1 (zh)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0715 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a system implementing multitasking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits
    • G06F11/0757 Error or fault detection not based on redundancy by exceeding limits by exceeding a time limit, i.e. time-out, e.g. watchdogs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1417 Boot up procedures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/85 Active fault masking without idle spares

Definitions

  • The present invention relates to the field of software system fault detection and processing, and in particular to a system fault detection and processing method, apparatus, and computer-readable storage medium.
  • A hardware watchdog is a simple timed reset device that requires software to generate a feed pulse for it periodically. If no feed pulse arrives before the timing threshold (usually 1 to 2 seconds) expires, it automatically generates a hardware reset signal and triggers a system reset.
  • Software watchdog technology addresses the problem that the hardware watchdog timeout is too short: it uses simple heartbeat messages or a synchronous monitoring mechanism to extend the reset time of the hardware watchdog.
  • To solve the existing technical problems, the embodiments of the present invention provide a system fault detection and processing method, apparatus, and computer-readable storage medium.
  • In one aspect, an embodiment of the present invention provides a system fault detection and processing method, including: an interrupt service routine sends a first-level watchdog feed signal and receives a second-level watchdog feed signal from a system detection task;
  • when a task infinite loop or a task abnormality is detected, system exception processing is performed according to a preset processing policy; wherein, when the interrupt service routine does not receive the second-level watchdog feed signal within a set time, the interrupt service routine stops sending the first-level watchdog feed signal and the system restarts.
  • When an operating system crash or a hardware abnormality occurs, the system automatically restarts and recovers.
  • When the interrupt rate exceeds a set threshold, a task with a higher priority than the system detection task is busy, a system abnormality occurs during system startup, or the system detection task itself is abnormally suspended, the interrupt service routine does not receive the second-level watchdog feed signal.
  • Task infinite-loop detection includes: the system detection task periodically performs the second-level software watchdog feed, and a low-priority infinite-loop auxiliary task periodically maintains its keep-alive flag.
  • In another aspect, an embodiment of the present invention further provides a system fault detection and processing apparatus, including: a signal processing module, configured to enable an interrupt service routine to send a first-level watchdog feed signal and receive a second-level watchdog feed signal from the system detection task;
  • an exception processing module, configured to perform system exception processing according to a preset processing policy when a task infinite loop or a task abnormality is detected; wherein, when the interrupt service routine does not receive the second-level watchdog feed signal within a set time, the module makes the interrupt service routine stop sending the first-level watchdog feed signal, and the system restarts.
  • The apparatus further includes:
  • a self-restart module, configured to automatically restart and recover the system when an operating system crash or a hardware abnormality occurs.
  • When the interrupt rate exceeds a set threshold, a task with a higher priority than the system detection task is busy, a system abnormality occurs during system startup, or the system detection task itself is abnormally suspended, the interrupt service routine does not receive the second-level watchdog feed signal.
  • The apparatus further includes:
  • a CPU occupancy statistics module, configured to periodically collect the CPU occupancy while the system detection task periodically performs the second-level software watchdog feed and the low-priority infinite-loop auxiliary task periodically maintains its keep-alive flag;
  • a task infinite-loop detection module, configured to determine whether the CPU occupancy collected by the system detection task is higher than the CPU infinite-loop judgment threshold. If not, it determines that the task is not in an infinite loop; if so, it determines whether the keep-alive flag of the low-priority infinite-loop auxiliary task is set. If the flag is set, it determines that no infinite loop has occurred; if not, it raises an alarm to notify maintenance personnel for analysis. The module is further configured to determine whether the system detection task processed only one message during the sampling detection period: if not, it raises an alarm to notify maintenance personnel for analysis; if so, it determines that the task is in an infinite loop.
  • The apparatus further includes:
  • a task working-state detection module, configured to periodically detect the working state of all tasks;
  • a task abnormality detection module, configured to perform task abnormality detection according to the detected task working states and the pre-configured task abnormality judgment policy.
  • An embodiment of the present invention further provides a computer-readable storage medium comprising a set of computer-executable instructions for performing the system fault detection and processing method according to an embodiment of the present invention.
  • The embodiments of the invention can automatically detect software system faults and automatically recover the system according to user policy; can detect system abnormalities during both startup and operation and recover automatically; can classify the abnormality types that occur during operation and perform abnormality judgment and self-recovery according to user policy; and allow the abnormality detection and self-recovery policies to be configured by the user, with abnormality causes recorded and available for query.
  • FIG. 1 is a flowchart of a system fault detection and processing method according to an embodiment of the present invention;
  • FIG. 2 is a schematic structural diagram of a system fault detection and processing apparatus according to an embodiment of the present invention.
  • As shown in FIG. 1, an embodiment of the present invention relates to a system fault detection and processing method, including: Step S101: an interrupt service routine sends a first-level watchdog feed signal and receives a second-level watchdog feed signal from a system detection task;
  • In this step, the interrupt service routine normally performs the first-level hardware watchdog feed (sends the first-level watchdog feed signal).
  • When an operating system crash or a hardware abnormality occurs, the interrupt service routine cannot run, and the hardware watchdog generates an automatic reset.
  • At system startup, the interrupt service routine begins the first-level hardware watchdog feed. Once the high-priority system detection task has started, it begins the second-level software watchdog feed (sends the second-level watchdog feed signal). If a system abnormality occurs during startup, the second-level software feed cannot be completed in time, so the first-level hardware feed stops; the system records the log entry as a startup abnormality and resets automatically.
  • After startup, the high-priority system detection task runs normally. If the interrupt rate exceeds the set threshold (interrupt over-frequency), or a task with a higher priority than the system detection task is busy, the second-level software feed cannot be completed in time, so the first-level hardware feed stops and the system logs the event and resets automatically. In addition, if the (high-priority) system detection task is suspended because of its own abnormality, the second-level software feed also fails, the first-level hardware feed stops, and the system logs the event and resets automatically. Here, a busy higher-priority task means that the CPU (Central Processing Unit) occupancy of a task with a higher priority than the system detection task exceeds a predetermined threshold.
  • CPU: Central Processing Unit.
  • Step S102: when a task infinite loop or a task abnormality is detected, system exception processing is performed according to a preset processing policy; wherein, when the interrupt service routine does not receive the second-level watchdog feed signal within a set time, the interrupt service routine stops sending the first-level watchdog feed signal and the system restarts.
  • In this step, task infinite-loop detection includes: periodically collecting the CPU occupancy of each task, and judging whether a task is in an infinite loop according to the pre-configured CPU infinite-loop judgment threshold and infinite-loop judgment policy.
  • The infinite-loop judgment policy is pre-configured by the user according to factors such as task characteristics and the operating environment. Normally, a task whose CPU occupancy exceeds the CPU infinite-loop judgment threshold is considered to be in an infinite loop, although exceptions can be configured.
  • For example, a low-priority infinite-loop auxiliary task may exist whose presence allows the CPU occupancy of other tasks to exceed the CPU infinite-loop judgment threshold (an embedded system tolerates some low-priority tasks, such as the idle task, being busy all the time without affecting normal system functions).
  • Tasks in the special busy-task list are also allowed to exceed the CPU infinite-loop judgment threshold (some critical tasks may legitimately be busy for a period while running certain functions and should not be treated as abnormally busy).
  • In addition, an infinite-loop confirmation step is required: a task is deemed to be in an infinite loop only if it is judged so in at least two sampling periods.
  • Task abnormality detection includes: periodically detecting the working state of all tasks, and judging task abnormality according to the task abnormality judgment policy.
  • The task abnormality judgment policy is pre-configured by the user and can be configured differently according to actual conditions. For example, the self-recovery (restart) operation may be performed only when a critical task is judged abnormal (a critical task is one whose abnormality affects basic system functions and that must be recovered immediately; critical tasks can be configured dynamically);
  • alternatively, self-recovery may be performed whenever any ordinary task is judged abnormal, or no task abnormality may trigger self-recovery at all.
  • Task abnormality detection also includes a confirmation step: a task is finally deemed abnormal only if it is judged abnormal in at least two sampling periods.
  • System self-recovery processing includes: deciding whether to reset immediately after a system abnormality (task infinite loop or task abnormality). If so, the system resets immediately; if not, the reset is governed by the system self-recovery waiting time, which can be pre-configured. When the waiting time expires, the reset condition is evaluated: if it is met, the system resets immediately; if not, the system resets after waiting a further default time. If the system is configured not to reset on an abnormality, an alarm is raised or a log entry is recorded.
  • System abnormality logging includes logging to memory or logging to the file system.
  • The process for guarding against faults during system startup or normal operation in the embodiment of the present invention includes the following steps:
  • Step S201: The system starts, the interrupt service routine begins working, and the default interrupt count is set.
  • Step S202: Each time an interrupt arrives, the interrupt count is decremented by 1 and the interrupt service routine performs the first-level hardware watchdog feed. If a hardware abnormality, an operating system crash, or a similar fault prevents the interrupt service routine from running, the first-level hardware feed stops and the system restarts.
  • Because the hardware watchdog feed threshold is generally 1 to 2 seconds, other tasks must take particular care when disabling interrupts during system startup so that the system keeps working normally. If interrupts stay disabled for a long time (longer than the feed threshold), a feed point must be added to the code, that is, a first-level feed performed while interrupts are disabled, so that a legitimate interrupts-disabled section does not cause a system restart.
  • In addition, each time an interrupt arrives, the interrupt count is checked to see whether it is greater than 0. If so, the routine waits for the next interrupt until the system detection task starts, and then proceeds to step S203. If not, the interrupt count equals 0, which indicates that the high-priority system detection task did not start normally, that is, an abnormality occurred during system startup. This is equivalent to a failure of the second-level software feed: the cause is recorded as a startup abnormality, the first-level hardware feed stops, and the system restarts.
  • Step S203: The high-priority system detection task starts, begins the periodic second-level software feed, and resets the interrupt count.
  • The second-level software feed interval of the high-priority system detection task can be set to an empirical value based on the interrupt count; for example, the second-level software feed can be performed once every 30 seconds.
  • Step S204: When a task with a higher priority than the high-priority system detection task is busy, the interrupt rate is excessive, or the high-priority system detection task is abnormally suspended, so that no second-level software feed occurs within 3 minutes and the interrupt count reaches 0, the system considers the high-priority task busy; it records the cause, stops the first-level hardware feed, and restarts.
  • Step S301: The high-priority system detection task and the low-priority infinite-loop auxiliary task start; the system detection task periodically performs the second-level software feed, and the auxiliary task periodically maintains its keep-alive flag.
  • The high priority and low priority in this step are relative: the priority of the system detection task is higher than that of the infinite-loop auxiliary task.
  • Step S302: The high-priority system detection task collects the CPU occupancy of each task every 1 minute (counting the CPU occupancy of tasks in the running state).
  • Step S303: The high-priority system detection task checks whether the collected task CPU occupancy is higher than the CPU infinite-loop judgment threshold (which the user can configure manually according to system conditions). If not, it determines that no task is in an infinite loop; if so, it proceeds to step S304.
  • Step S304: When the task CPU occupancy is higher than the CPU infinite-loop judgment threshold, it is further determined whether the keep-alive flag of the low-priority infinite-loop auxiliary task is set. If so, the set flag indicates that the auxiliary task can still be scheduled normally by the system, and the system remains within the infinite-loop statistical range; if not, proceed to step S305.
  • Step S305: The absence of the auxiliary task's keep-alive flag does not by itself indicate a task infinite loop, because some tasks in the system run continuously throughout the high-priority task's detection period. These special tasks must therefore be excluded: their normal busy state cannot be treated as an infinite loop, and an alarm is raised instead so that maintenance personnel can analyze it.
  • The special tasks above are manually configured by the user in advance.
  • Step S306: Once the preceding checks indicate that the system contains a task in an infinite loop, it must further be determined whether the task processed only one message during the sampling detection period of the high-priority system detection task. If the task processed multiple messages during the sampling period, it is still being scheduled and the system has no infinite loop, but an alarm is raised for maintenance personnel to analyze. If the task processed only one message during the sampling period, it is determined to be in an infinite loop.
  • Step S307: When a task infinite loop exists, wait one more sampling period to confirm it; after confirmation, log the event and prepare to restart for recovery. Before restarting, determine whether the system is performing more important work (such as a file system operation). If so, the restart cannot occur immediately; a delay is allowed for the important work, and the system restarts after the delay.
  • The abnormality detection and self-recovery process during normal system operation in the embodiment of the present invention includes:
  • Step S401: The high-priority system detection task starts and performs the second-level software feed.
  • Step S402: The high-priority system detection task checks the working state of all tasks in the system every 1 minute (the detection period).
  • Step S403: When the high-priority system detection task finds that a task is abnormally suspended, it identifies whether the task is a critical task or an ordinary task, and the system performs a self-recovery operation according to the abnormality detection processing policy configured by the user.
  • The abnormality detection processing policy is one of: restart on critical task abnormalities; restart on ordinary task abnormalities; or do not restart on any task abnormality.
  • Critical tasks are preset by the user; if these tasks stop working, important system functions are affected.
  • Step S404: When a task abnormality is determined to exist, the system records the stack information of the abnormal task, logs the event, and restarts for recovery. Before restarting, it determines whether the system is performing more important work (such as a file system operation). If so, the restart cannot occur immediately; a delay is allowed for the important work, and the system restarts after the delay.
  • As shown in FIG. 2, an embodiment of the present invention further provides a system fault detection and processing apparatus for implementing the foregoing method, including:
  • the signal processing module 201, configured to enable the interrupt service routine to send the first-level watchdog feed signal and receive the second-level watchdog feed signal of the system detection task; when the interrupt rate exceeds the set threshold, a task with a higher priority than the system detection task is busy, a system abnormality occurs during system startup, or the system detection task itself is abnormally suspended, the interrupt service routine does not receive the second-level watchdog feed signal;
  • the exception handling module 202, configured to perform system exception processing according to a preset processing policy when a task infinite loop or a task abnormality is detected; wherein, when the interrupt service routine does not receive the second-level watchdog feed signal within a set time, the interrupt service routine stops sending the first-level watchdog feed signal and the system restarts.
  • The above apparatus further includes:
  • a self-restart module, configured to automatically restart and recover the system when the operating system crashes or the hardware is abnormal;
  • a CPU occupancy statistics module, configured to periodically collect the CPU occupancy while the system detection task periodically performs the second-level software feed and the low-priority infinite-loop auxiliary task periodically maintains its keep-alive flag;
  • a task infinite-loop detection module, configured to determine whether the CPU occupancy collected by the system detection task is higher than the CPU infinite-loop judgment threshold. If not, it determines that the task is not in an infinite loop; if so, it determines whether the keep-alive flag of the low-priority infinite-loop auxiliary task is set. If the flag is set, it determines that no infinite loop has occurred; if not, it raises an alarm to notify maintenance personnel for analysis. It further determines whether the system detection task processed only one message during the sampling detection period: if not, it raises an alarm to notify maintenance personnel for analysis; if so, it determines that the task is in an infinite loop;
  • a task working-state detection module, configured to periodically detect the working state of all tasks; and a task abnormality detection module, configured to perform task abnormality detection according to the detected task working states and the pre-configured task abnormality judgment policy.
  • The above modules can all be implemented in the system fault detection and processing apparatus by a central processing unit (CPU), a microprocessor unit (MPU), a digital signal processor (DSP), or a field-programmable gate array (FPGA).
  • The embodiment of the present invention combines periodic active scanning with interrupt-driven watchdog feeding to detect abnormal conditions such as task infinite loops and task abnormalities. It can also identify interrupt over-frequency and software or hardware hangs, record the cause by abnormality type, and perform automatic delayed recovery. It takes into account both the system's special-operation tasks and the dynamic configuration needs of different systems, and it can also monitor software operation during system startup, meeting most abnormality detection and self-recovery requirements of software systems.
  • Embodiments of the present invention can be provided as a method, a system, or a computer program product. Accordingly, the present invention can take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware. Moreover, the invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.
  • The computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction apparatus. The instruction apparatus implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • An embodiment of the present invention further provides a computer-readable storage medium comprising a set of computer-executable instructions for performing the system fault detection and processing method according to an embodiment of the present invention.
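The infinite-loop decision flow of steps S303 to S306 can be sketched in Python as follows. This is a minimal illustration under one reading of those steps, not the patent's implementation: the names `Verdict` and `classify_task`, the parameter names, and the numeric values in the usage note are all hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    NO_LOOP = "no infinite loop"
    ALARM = "alarm for maintenance analysis"
    INFINITE_LOOP = "infinite loop"


def classify_task(cpu_occupancy: float,
                  threshold: float,
                  aux_keepalive_set: bool,
                  is_special_busy_task: bool,
                  messages_in_period: int) -> Verdict:
    """One pass of the S303-S306 checks for a single task (hypothetical helper)."""
    # S303: occupancy not higher than the threshold means no infinite loop.
    if cpu_occupancy <= threshold:
        return Verdict.NO_LOOP
    # S304: the low-priority auxiliary task is still being scheduled.
    if aux_keepalive_set:
        return Verdict.NO_LOOP
    # S305: user-configured special tasks are reported, not treated as loops.
    if is_special_busy_task:
        return Verdict.ALARM
    # S306: exactly one message in the sampling period marks a stuck task.
    if messages_in_period == 1:
        return Verdict.INFINITE_LOOP
    return Verdict.ALARM
```

A caller would invoke `classify_task` once per task per sampling period and, per step S307, act only after the same verdict is confirmed in a further sampling period.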

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

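Disclosed are a system fault detection and processing method, apparatus, and computer-readable storage medium. The method includes: an interrupt service routine sends a first-level watchdog feed signal and receives a second-level watchdog feed signal from a system detection task (S101); when a task infinite loop or a task abnormality is detected, system exception processing is performed according to a preset processing policy; wherein, when the interrupt service routine does not receive the second-level watchdog feed signal within a set time, the interrupt service routine stops sending the first-level watchdog feed signal and the system restarts (S102).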

Description

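System fault detection and processing method, apparatus, and computer-readable storage medium

Technical field

The present invention relates to the field of software system fault detection and processing, and in particular to a system fault detection and processing method, apparatus, and computer-readable storage medium.

Background

During software system startup and operation, faults often occur that prevent the system from working, such as hardware hangs, operating system crashes, task abnormalities, task infinite loops, and excessive interrupt frequency. For communication system software, the ability to automatically identify abnormal task states when a fault occurs and, according to the user's configured policy, raise alarms, record the fault, and recover the system is indispensable. In particular, for systems that support voice services and have demanding real-time requirements, fully accurate abnormality identification, abnormality information recording, and self-recovery processing are required at any stage of operation and for any fault.

Existing software system fault detection and self-recovery methods generally use a hardware watchdog or software watchdog technology. A hardware watchdog is a simple timed reset device that requires software to periodically generate a feed pulse for it; if no feed pulse is generated before the timing threshold (generally 1 to 2 seconds) expires, it automatically generates a hardware reset signal and triggers a system reset. Software watchdog technology addresses the problem that the hardware watchdog timeout is too short by using simple heartbeat messages or a synchronous monitoring mechanism to extend the reset time of the hardware watchdog. Although these methods are simple and fairly reliable, they have their own shortcomings: they cannot detect all abnormal conditions that arise in the system; they cannot monitor special application scenarios in the system; and they cannot log system faults by type.

Summary of the invention

To solve the existing technical problems, the embodiments of the present invention provide a system fault detection and processing method, apparatus, and computer-readable storage medium.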
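In one aspect, an embodiment of the present invention provides a system fault detection and processing method, including:
an interrupt service routine sends a first-level watchdog feed signal and receives a second-level watchdog feed signal from a system detection task;
when a task infinite loop or a task abnormality is detected, system exception processing is performed according to a preset processing policy; wherein, when the interrupt service routine does not receive the second-level watchdog feed signal within a set time, the interrupt service routine stops sending the first-level watchdog feed signal and the system restarts.
When an operating system crash or a hardware abnormality occurs, the system automatically restarts and recovers.
When the interrupt rate exceeds a set threshold, a task with a higher priority than the system detection task is busy, a system abnormality occurs during system startup, or the system detection task itself is abnormally suspended, the interrupt service routine does not receive the second-level watchdog feed signal.
Task infinite-loop detection includes:
the system detection task periodically performs the second-level software watchdog feed, and a low-priority infinite-loop auxiliary task periodically maintains its keep-alive flag;
the central processing unit (CPU) occupancy is collected periodically;
it is determined whether the collected CPU occupancy is higher than the CPU infinite-loop judgment threshold; if not, the task is determined not to be in an infinite loop; if so, it is determined whether the keep-alive flag of the low-priority infinite-loop auxiliary task is set; if the flag is set, it is determined that no infinite loop has occurred; if not, an alarm is raised to notify maintenance personnel for analysis;
it is determined whether the system detection task processed only one message during the sampling detection period; if not, an alarm is raised to notify maintenance personnel for analysis; if so, the task is determined to be in an infinite loop.
Task abnormality detection includes:
periodically detecting the working state of all tasks;
performing task abnormality detection according to the detected task working states and the pre-configured task abnormality judgment policy.
In another aspect, an embodiment of the present invention further provides a system fault detection and processing apparatus, including:
a signal processing module, configured to enable an interrupt service routine to send a first-level watchdog feed signal and receive a second-level watchdog feed signal from the system detection task;
an exception processing module, configured to perform system exception processing according to a preset processing policy when a task infinite loop or a task abnormality is detected; wherein, when the interrupt service routine does not receive the second-level watchdog feed signal within a set time, the module makes the interrupt service routine stop sending the first-level watchdog feed signal, and the system restarts.
The apparatus further includes:
a self-restart module, configured to automatically restart and recover the system when an operating system crash or a hardware abnormality occurs.
When the interrupt rate exceeds a set threshold, a task with a higher priority than the system detection task is busy, a system abnormality occurs during system startup, or the system detection task itself is abnormally suspended, the interrupt service routine does not receive the second-level watchdog feed signal.
The apparatus further includes:
a CPU occupancy statistics module, configured to periodically collect the CPU occupancy while the system detection task periodically performs the second-level software watchdog feed and the low-priority infinite-loop auxiliary task periodically maintains its keep-alive flag;
a task infinite-loop detection module, configured so that the system detection task determines whether the collected CPU occupancy is higher than the CPU infinite-loop judgment threshold; if not, the task is determined not to be in an infinite loop; if so, it is determined whether the keep-alive flag of the low-priority infinite-loop auxiliary task is set; if the flag is set, it is determined that no infinite loop has occurred; if not, an alarm is raised to notify maintenance personnel for analysis. The task infinite-loop detection module is further configured to determine whether the system detection task processed only one message during the sampling detection period; if not, an alarm is raised to notify maintenance personnel for analysis; if so, the task is determined to be in an infinite loop.
The apparatus further includes:
a task working-state detection module, configured to periodically detect the working state of all tasks;
a task abnormality detection module, configured to perform task abnormality detection according to the detected task working states and the pre-configured task abnormality judgment policy.
An embodiment of the present invention further provides a computer-readable storage medium comprising a set of computer-executable instructions for performing the system fault detection and processing method according to the embodiment of the present invention.
The beneficial effects of the embodiments of the present invention are as follows:
The embodiments can automatically detect software system faults and automatically recover the system according to user policy; can detect system abnormalities during both startup and operation and recover automatically; can classify the abnormality types that occur during operation and perform abnormality judgment and self-recovery according to user policy; and allow the abnormality detection and self-recovery policies to be configured by the user, with abnormality causes recorded and available for query.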
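Brief description of the drawings

FIG. 1 is a flowchart of a system fault detection and processing method in an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a system fault detection and processing apparatus in an embodiment of the present invention.

Detailed description

The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here merely explain the invention and do not limit it.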
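As shown in FIG. 1, an embodiment of the present invention relates to a system fault detection and processing method, including:
Step S101: the interrupt service routine sends a first-level watchdog-feeding signal and receives a second-level watchdog-feeding signal from the system detection task.
In this step, the interrupt service routine normally performs first-level hardware watchdog feeding (sends the first-level watchdog-feeding signal). When the operating system crashes or a hardware fault occurs, the interrupt service routine cannot run, and the hardware watchdog automatically resets the system.
When the system starts, the interrupt service routine begins first-level hardware watchdog feeding. Once the high-priority system detection task has started, it begins second-level software watchdog feeding (sends the second-level watchdog-feeding signal). If a system exception occurs during this startup period, second-level software feeding cannot be completed in time, so first-level hardware feeding stops; the system records the cause in the log as a startup exception and automatically resets.
After startup, the high-priority system detection task runs normally. If interrupts exceed the set threshold (interrupts are too frequent), or a task with a higher priority than the system detection task is busy, second-level software feeding cannot be completed in time, so first-level hardware feeding stops; the system writes a log entry and automatically resets. In addition, if the (high-priority) system detection task hangs because of its own fault, second-level software feeding also becomes impossible, first-level hardware feeding stops, and the system writes a log entry and automatically resets. Here, "a task with a higher priority than the system detection task is busy" means that the CPU (Central Processing Unit) usage of such a task exceeds a predetermined threshold.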
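Step S102: when a task infinite loop or a task anomaly is detected, system exception handling is performed according to a preset handling policy; wherein, when the interrupt service routine does not receive the second-level watchdog-feeding signal within a set time, the interrupt service routine stops sending the first-level watchdog-feeding signal and the system is restarted.
In this step, task infinite-loop detection includes: periodically collecting the CPU usage of each task, and judging task infinite loops according to the preconfigured CPU infinite-loop threshold and infinite-loop judgment policy. The judgment policy is configured in advance by the user according to factors such as task characteristics and the operating environment. Normally, a task whose CPU usage exceeds the CPU infinite-loop threshold is considered to be in an infinite loop, but exceptions can be configured. For example: a low-priority infinite-loop auxiliary task, whose presence allows the CPU usage of other tasks to exceed the threshold (embedded systems tolerate some low-priority tasks, such as the idle task, being busy all the time without affecting the normal functions of the system); and tasks on a special busy-task list, which are allowed to exceed the threshold (some key tasks are allowed to be fairly busy for a period while running certain functions and should not be regarded as abnormally busy). In addition, an infinite-loop confirmation step is required: a task is confirmed as being in an infinite loop only if it is judged to be in one in at least two sampling periods.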
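The confirmation step and the special busy-task list can be illustrated with a short Python sketch. It is only one possible reading: the class name `LoopConfirmer`, its parameters, and the choice to require *consecutive* over-threshold samples are assumptions, since the text says only "at least two sampling periods".

```python
from typing import Dict, Iterable, List


class LoopConfirmer:
    """Confirm an infinite loop only after repeated over-threshold samples."""

    def __init__(self, loop_threshold: float,
                 special_busy_tasks: Iterable[str] = (),
                 required_hits: int = 2) -> None:
        self.loop_threshold = loop_threshold
        # Tasks the user has declared as legitimately busy (exempt).
        self.special_busy_tasks = set(special_busy_tasks)
        self.required_hits = required_hits
        self._hits: Dict[str, int] = {}

    def sample(self, usage_by_task: Dict[str, float]) -> List[str]:
        """Feed one sampling period; return the tasks confirmed as looping."""
        confirmed = []
        for name, usage in usage_by_task.items():
            suspected = (usage > self.loop_threshold
                         and name not in self.special_busy_tasks)
            if suspected:
                self._hits[name] = self._hits.get(name, 0) + 1
            else:
                # Assumption: a normal sample clears earlier suspicion.
                self._hits.pop(name, None)
            if self._hits.get(name, 0) >= self.required_hits:
                confirmed.append(name)
        return confirmed
```

With a threshold of 90 and `"idle"` on the exemption list, a task named `"worker"` that reports 95 and then 96 percent is confirmed only on the second call to `sample`, and `"idle"` is never reported however busy it is.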
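Task anomaly detection includes: periodically checking the working state of all tasks, and judging task anomalies according to the task anomaly judgment policy. This policy is configured in advance by the user and can be configured differently to suit the actual situation. For example: a self-recovery (restart) operation is performed only when the abnormal task is a key task (a key task is one whose failure affects the basic functions of the system and which must be recovered immediately; key tasks can be configured dynamically); or a self-recovery operation may be performed whenever any ordinary task is judged abnormal; or no self-recovery operation is performed for any task anomaly. Task anomaly detection also includes a confirmation step: a task is finally judged abnormal only if it is judged abnormal in at least two sampling periods.
System self-recovery handling includes: determining whether to reset immediately after a system exception (task infinite loop or task anomaly); if so, resetting immediately; if not, acting according to the system self-recovery wait time, which can be preconfigured. When the wait time expires, the reset condition is evaluated: if it is met, the system resets immediately; if it is not met, the system resets after waiting a default time. If the system is not to be reset on an exception, an alarm is raised or a log entry is written. System exception logging includes writing the log to memory or to the file system.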
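The self-recovery decision can be sketched as a pure function that returns an action and the total delay before reset. This is an illustrative Python sketch under stated assumptions: `RecoveryPolicy`, `plan_recovery`, the field names and the default durations are all hypothetical, and the reset condition is modelled as a boolean already evaluated when the configured wait expires.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RecoveryPolicy:
    reset_on_exception: bool = True   # False: only raise an alarm / write a log
    reset_immediately: bool = False   # True: reset as soon as the exception is seen
    recovery_wait_s: float = 60.0     # configurable self-recovery wait time
    default_wait_s: float = 30.0      # extra wait if the reset condition is not met


def plan_recovery(policy: RecoveryPolicy,
                  reset_condition_met: bool) -> Tuple[str, Optional[float]]:
    """Return (action, seconds until reset); the delay is None when no reset occurs."""
    if not policy.reset_on_exception:
        return ("log_only", None)
    if policy.reset_immediately:
        return ("reset", 0.0)
    if reset_condition_met:
        # Reset as soon as the configured wait has elapsed.
        return ("reset", policy.recovery_wait_s)
    # Condition not met: wait the default time on top, then reset anyway.
    return ("reset", policy.recovery_wait_s + policy.default_wait_s)
```

Keeping the decision free of side effects makes the policy easy to unit-test; the caller would be responsible for the actual logging, timers and reset.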
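Specific embodiments are given below for further detail.
First, the flow of the method of the embodiment of the present invention during system startup or normal operation includes the following steps:
Step S201: the system starts, the interrupt service routine begins working, and a default interrupt count is set. The default interrupt count depends on the normal system startup time. For example, if the longest normal startup time is 5 minutes and the interrupt period is 10 milliseconds, the interrupt count is 5 × 60 × 1000 / 10 = 30000.
Step S202: each time an interrupt arrives, the interrupt count is decremented by 1 and the interrupt service routine performs first-level hardware watchdog feeding. If a hardware fault, an operating system crash or a similar failure prevents the interrupt service routine from running, first-level hardware feeding stops and the system restarts.
Because the feeding timeout of a hardware watchdog is typically 1 to 2 seconds, other tasks in the startup process must take particular care when disabling interrupts in order to keep the system working normally. If interrupts are disabled for a long time (longer than the feeding timeout), a feeding point must be added to the code, that is, first-level feeding must be performed while interrupts are disabled, so that a legitimate interrupt-disabled section does not cause a system restart.
In addition, each time an interrupt arrives, the interrupt service routine also checks whether the interrupt count is greater than 0. If it is, the routine waits for the next interrupt, until the system detection task starts, and then proceeds to step S203. If it is not, that is, the interrupt count equals 0, the high-priority system detection task has not started working normally, meaning that an exception occurred during system startup. This is equivalent to a failure of second-level software feeding: the cause is recorded as a startup exception, first-level hardware feeding stops, and the system restarts.
Step S203: the high-priority system detection task starts, begins periodic second-level software watchdog feeding, and resets the interrupt count. The interrupt count depends on how promptly task infinite loops must be detected during normal operation: if prompt detection is required, the value can be set smaller; otherwise it can be larger. For example, if the system requires second-level software feeding within 3 minutes and the interrupt period is 10 milliseconds, the interrupt count is 3 × 60 × 1000 / 10 = 18000.
The timer period for second-level software feeding by the high-priority system detection task can be an empirical value derived from the interrupt count; for example, second-level software feeding can be set to occur once every 30 seconds.
Step S204: when a task with a higher priority than the high-priority system detection task is busy, interrupts are too frequent, or the high-priority task hangs abnormally, that is, when no second-level software feeding occurs within 3 minutes, the interrupt count reaches 0 and the system considers the high-priority task busy. The cause is then recorded, first-level hardware feeding stops, and the system restarts.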
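Steps S201 to S204 amount to a countdown that the timer interrupt decrements and the detection task reloads. The following Python simulation is a sketch only: `ticks_for`, `TwoLevelWatchdog` and the reason strings are hypothetical names, the defaults reuse the 5-minute, 3-minute and 10-millisecond figures from the examples, and a real system would run this logic in an interrupt handler that strobes a hardware watchdog.

```python
from typing import Optional


def ticks_for(timeout_s: int, tick_ms: int) -> int:
    """Number of timer interrupts that fit in the given timeout."""
    return timeout_s * 1000 // tick_ms


class TwoLevelWatchdog:
    """Simulates the interrupt-side countdown of steps S201 to S204."""

    def __init__(self, startup_timeout_s: int = 300,
                 run_timeout_s: int = 180, tick_ms: int = 10) -> None:
        self.run_ticks = ticks_for(run_timeout_s, tick_ms)
        self.counter = ticks_for(startup_timeout_s, tick_ms)  # S201
        self.detector_started = False
        self.feeding_hw = True
        self.reset_reason: Optional[str] = None

    def feed_level2(self) -> None:
        """Called by the detection task: second-level software feed (S203)."""
        self.detector_started = True
        self.counter = self.run_ticks

    def on_tick(self) -> bool:
        """Called per timer interrupt; True means the hardware dog was fed."""
        if not self.feeding_hw:
            return False  # feeding already stopped, hardware reset pending
        self.counter -= 1  # S202
        if self.counter > 0:
            return True    # first-level hardware feed would happen here
        # Counter reached 0: record the cause and stop first-level feeding.
        self.feeding_hw = False
        self.reset_reason = ("second-level feed timeout"
                             if self.detector_started else "startup exception")
        return False
```

With the defaults, `ticks_for(300, 10)` gives the startup budget of 30000 interrupts and `ticks_for(180, 10)` the run-time budget of 18000, matching the worked figures in steps S201 and S203.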
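The flow of task infinite-loop detection and self-recovery during normal system operation, according to the method of the embodiment of the present invention, is as follows:
Step S301: the high-priority system detection task and the low-priority infinite-loop auxiliary task start; the high-priority system detection task periodically performs second-level software watchdog feeding, and the low-priority infinite-loop auxiliary task periodically maintains its infinite-loop keep-alive flag. "High priority" and "low priority" in this step are relative terms: the priority of the system detection task is higher than that of the infinite-loop auxiliary task.
Step S302: the high-priority system detection task collects the CPU usage of each task once every minute (the CPU usage of tasks whose state is "running").
Step S303: the high-priority system detection task checks whether the measured task CPU usage is higher than the CPU infinite-loop threshold (which the user can configure manually in advance to suit the system). If not, it determines that no task infinite loop has occurred; if so, it proceeds to step S304.
Step S304: when the CPU usage of a task is higher than the CPU infinite-loop threshold, the detection task further checks whether the low-priority infinite-loop auxiliary task has set its keep-alive flag. If it has, the auxiliary task is still being scheduled normally by the system, so no infinite loop is deemed to have occurred; if it has not, the flow proceeds to step S305.
Step S305: if the low-priority infinite-loop auxiliary task has not set its keep-alive flag, this still does not prove that the task is in an infinite loop, because some tasks in the system legitimately run throughout the periodic detection interval of the high-priority task. These special tasks must therefore be excluded so that their normal busy state is not mistaken for an infinite loop, although an alarm is still raised to notify maintenance personnel for analysis. The special tasks are configured manually by the user in advance.
Step S306: once the checks above have identified a suspected infinite-loop task in the system, it must further be determined whether that task processed only one message within the periodic sampling detection period of the high-priority system detection task. If the task processed several messages within the sampling period, it is being scheduled by the system and no infinite loop has occurred, although an alarm is still raised to notify maintenance personnel for analysis. If the task processed only one message within the sampling period, it is determined to be in an infinite-loop state.
Step S307: when a task in the system is in an infinite loop, the detection task waits one more sampling cycle (sampling period) to confirm the infinite loop. After confirmation, a log entry is written and a restart is prepared. Before restarting, however, it must be determined whether the system is performing important work (such as a file system operation). If such work is in progress and the system cannot restart immediately, a delay is allowed, after which the important work is forcibly stopped and the system restarts.
The flow of task anomaly detection and self-recovery during normal system operation, according to the above method of the embodiment of the present invention, includes: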
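Step S401: the high-priority system detection task starts and periodically performs second-level software watchdog feeding.
Step S402: the high-priority system detection task checks the working state of all tasks in the system once every minute (the detection cycle).
Step S403: if the high-priority system detection task finds that a task has hung abnormally, it identifies whether that task is a key task or an ordinary task, and the system performs the self-recovery operation according to the user-configured exception detection and handling policy. For example, the policy may be: restart on a key-task anomaly; or restart on an ordinary-task anomaly; or never restart on any task anomaly. Key tasks are set in advance by the user; if they cannot work, important system functions are affected.
Step S404: when a task anomaly is confirmed, the system records the stack information of the abnormal task, writes a log entry, and restarts to recover. Before restarting, however, it must be determined whether the system is performing important work (such as a file system operation). If such work is in progress and the system cannot restart immediately, a delay is allowed, after which the important work is forcibly stopped and the system restarts.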
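The policy choice in step S403 and the deferred restart in step S404 can be sketched in Python as follows. All names here (`AnomalyPolicy`, `should_restart`, `restart_delay_s`, the 10-second grace period) are hypothetical, and reading the "ordinary task" policy as covering every task, key tasks included, is an assumption.

```python
from enum import Enum
from typing import Iterable


class AnomalyPolicy(Enum):
    RESTART_ON_KEY_TASK = 1   # restart only if a key task is abnormal
    RESTART_ON_ANY_TASK = 2   # restart on any abnormal task (assumed reading)
    NEVER_RESTART = 3         # log or alarm only


def should_restart(task: str, key_tasks: Iterable[str],
                   policy: AnomalyPolicy) -> bool:
    """Decide whether a confirmed task anomaly triggers a system restart."""
    if policy is AnomalyPolicy.NEVER_RESTART:
        return False
    if policy is AnomalyPolicy.RESTART_ON_ANY_TASK:
        return True
    return task in set(key_tasks)


def restart_delay_s(important_work_running: bool,
                    grace_s: float = 10.0) -> float:
    """Seconds to wait before forcing the restart (step S404).

    If important work such as a file system operation is in progress, the
    restart is deferred by a grace period and then forced.
    """
    return grace_s if important_work_running else 0.0
```

A caller would first record the stack and log entry for the abnormal task, then apply `should_restart` and, if it returns true, schedule the reset after `restart_delay_s(...)` seconds.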
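In addition, as shown in FIG. 2, an embodiment of the present invention further provides a system fault detection and processing device that implements the above method, including:
a signal processing module 201, configured to cause the interrupt service routine to send a first-level watchdog-feeding signal and to receive a second-level watchdog-feeding signal from the system detection task; the interrupt service routine fails to receive the second-level watchdog-feeding signal when interrupts exceed a set threshold, when a task with a higher priority than the system detection task is busy, when a system exception occurs during system startup, or when the system detection task itself hangs abnormally;
an exception handling module 202, configured to perform system exception handling according to a preset handling policy when a task infinite loop or a task anomaly is detected; wherein, when the interrupt service routine does not receive the second-level watchdog-feeding signal within a set time, the interrupt service routine is made to stop sending the first-level watchdog-feeding signal and the system is restarted.
In one implementation, the device further includes:
a self-restart module, configured so that the system automatically restarts and recovers when the operating system crashes or a hardware fault occurs;
a CPU usage statistics module, configured to periodically collect CPU usage statistics while the system detection task periodically performs second-level software watchdog feeding and the low-priority infinite-loop auxiliary task periodically maintains its infinite-loop keep-alive flag;
a task infinite-loop detection module, configured so that the system detection task determines whether the measured CPU usage is higher than the CPU infinite-loop threshold; if not, it determines that no task infinite loop has occurred; if so, it determines whether the low-priority infinite-loop auxiliary task has set its keep-alive flag; if it has, it determines that no infinite loop has occurred; if it has not, it raises an alarm and notifies maintenance personnel for analysis; it then determines whether only one message was processed within the sampling detection period of the system detection task; if not, it raises an alarm and notifies maintenance personnel for analysis; if so, it determines that the task is in an infinite-loop state;
a task working state detection module, configured to periodically check the working state of all tasks;
a task anomaly detection module, configured to perform task anomaly detection according to the detected task working states in combination with a preconfigured task anomaly judgment policy.
In one implementation, all of the above modules can be implemented by a CPU, a micro processing unit (MPU), a digital signal processor (DSP) or a field-programmable gate array (FPGA) in the system fault detection and processing device.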
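In summary, by combining periodic active scanning with interrupts and watchdog feeding, the embodiments of the present invention can identify exceptions such as task infinite loops and task anomalies, and can also identify excessively frequent interrupts and cases where hardware or software hangs the system. The causes can be recorded by exception type, and recovery can be performed automatically after a delay. The approach takes into account both tasks with special running behavior and the dynamic configuration requirements of different systems, and can also monitor how software runs during system startup, thereby meeting most of a software system's needs for exception detection and self-recovery.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
To this end, an embodiment of the present invention further provides a computer-readable storage medium comprising a set of computer-executable instructions for performing the system fault detection and processing method described in the embodiments of the present invention.
Although preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will appreciate that various improvements, additions and substitutions are also possible; the scope of the present invention should therefore not be limited to the above embodiments.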

Claims

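1. A system fault detection and processing method, comprising:
an interrupt service routine sending a first-level watchdog-feeding signal and receiving a second-level watchdog-feeding signal from a system detection task;
when a task infinite loop or a task anomaly is detected, performing system exception handling according to a preset handling policy; wherein, when the interrupt service routine does not receive the second-level watchdog-feeding signal within a set time, the interrupt service routine stops sending the first-level watchdog-feeding signal and the system is restarted.
2. The system fault detection and processing method according to claim 1, wherein, when the operating system crashes or a hardware fault occurs, the system automatically restarts and recovers.
3. The system fault detection and processing method according to claim 1 or 2, wherein the interrupt service routine fails to receive the second-level watchdog-feeding signal when interrupts exceed a set threshold, when a task with a higher priority than the system detection task is busy, when a system exception occurs during system startup, or when the system detection task itself hangs abnormally.
4. The system fault detection and processing method according to claim 3, wherein task infinite-loop detection comprises:
the system detection task periodically performing second-level software watchdog feeding, and a low-priority infinite-loop auxiliary task periodically maintaining its infinite-loop keep-alive flag;
periodically collecting central processing unit (CPU) usage statistics;
determining whether the measured CPU usage is higher than a CPU infinite-loop threshold; if not, determining that no task infinite loop has occurred; if so, determining whether the low-priority infinite-loop auxiliary task has set its keep-alive flag; if it has, determining that no infinite loop has occurred; if it has not, raising an alarm and notifying maintenance personnel for analysis;
determining whether only one message was processed within the sampling detection period of the system detection task; if not, raising an alarm and notifying maintenance personnel for analysis; if so, determining that the task is in an infinite-loop state.
5. The system fault detection and processing method according to claim 1, 2 or 4, wherein task anomaly detection comprises:
periodically checking the working state of all tasks;
performing task anomaly detection according to the detected task working states in combination with a preconfigured task anomaly judgment policy.
6. A system fault detection and processing device, comprising:
a signal processing module, configured to cause an interrupt service routine to send a first-level watchdog-feeding signal and to receive a second-level watchdog-feeding signal from a system detection task;
an exception handling module, configured to perform system exception handling according to a preset handling policy when a task infinite loop or a task anomaly is detected; wherein, when the interrupt service routine does not receive the second-level watchdog-feeding signal within a set time, the interrupt service routine is made to stop sending the first-level watchdog-feeding signal and the system is restarted.
7. The system fault detection and processing device according to claim 6, wherein the device further comprises:
a self-restart module, configured so that the system automatically restarts and recovers when the operating system crashes or a hardware fault occurs.
8. The system fault detection and processing device according to claim 6 or 7, wherein the interrupt service routine fails to receive the second-level watchdog-feeding signal when interrupts exceed a set threshold, when a task with a higher priority than the system detection task is busy, when a system exception occurs during system startup, or when the system detection task itself hangs abnormally.
9. The system fault detection and processing device according to claim 8, wherein the device further comprises:
a CPU usage statistics module, configured to periodically collect CPU usage statistics while the system detection task periodically performs second-level software watchdog feeding and the low-priority infinite-loop auxiliary task periodically maintains its infinite-loop keep-alive flag;
a task infinite-loop detection module, configured so that the system detection task determines whether the measured CPU usage is higher than the CPU infinite-loop threshold; if not, it determines that no task infinite loop has occurred; if so, it determines whether the low-priority infinite-loop auxiliary task has set its keep-alive flag; if it has, it determines that no infinite loop has occurred; if it has not, it raises an alarm and notifies maintenance personnel for analysis; the task infinite-loop detection module is further configured to determine whether only one message was processed within the sampling detection period of the system detection task; if not, it raises an alarm and notifies maintenance personnel for analysis; if so, it determines that the task is in an infinite-loop state.
10. The system fault detection and processing device according to claim 6, 7 or 9, wherein the device further comprises:
a task working state detection module, configured to periodically check the working state of all tasks;
a task anomaly detection module, configured to perform task anomaly detection according to the detected task working states in combination with a preconfigured task anomaly judgment policy.
11. A computer-readable storage medium comprising a set of computer-executable instructions for performing the method according to any one of claims 1 to 5.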
PCT/CN2014/070187 2013-04-01 2014-01-06 System fault detection and processing method, device, and computer-readable storage medium WO2014161373A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US14/781,403 US9720761B2 (en) 2013-04-01 2014-01-06 System fault detection and processing method, device, and computer readable storage medium
EP14779970.4A EP2983086A4 (en) 2013-04-01 2014-01-06 SYSTEM FOR ERROR IDENTIFICATION AND PROCESSING, DEVICE AND COMPUTER READABLE STORAGE MEDIUM

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310111375.0A CN104102572A (zh) 2013-04-01 2013-04-01 System fault detection and processing method and device
CN201310111375.0 2013-04-01

Publications (1)

Publication Number Publication Date
WO2014161373A1 true WO2014161373A1 (zh) 2014-10-09

Family

ID=51657546

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/070187 WO2014161373A1 (zh) 2013-04-01 2014-01-06 System fault detection and processing method, device, and computer-readable storage medium

Country Status (4)

Country Link
US (1) US9720761B2 (zh)
EP (1) EP2983086A4 (zh)
CN (1) CN104102572A (zh)
WO (1) WO2014161373A1 (zh)

Also Published As

Publication number Publication date
CN104102572A (zh) 2014-10-15
EP2983086A1 (en) 2016-02-10
US20160055046A1 (en) 2016-02-25
EP2983086A4 (en) 2016-05-04
US9720761B2 (en) 2017-08-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14779970

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 14781403

Country of ref document: US

Ref document number: 2014779970

Country of ref document: EP