CN108108259A - Kernel fault locating method and device - Google Patents

Info

Publication number
CN108108259A
Authority
CN
China
Prior art keywords
failure
kernels
hardware
fault
deadlock
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810026869.1A
Other languages
Chinese (zh)
Inventor
常现超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201810026869.1A
Publication of CN108108259A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0721 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment within a central processing unit [CPU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3024 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a central processing unit [CPU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present invention provides a kernel fault locating method and device. The server's system and hardware are monitored for faults. When the system on the server or the hardware fails, the memory information of the faulty system is collected through the BMC and the memory data is analyzed, so that the cause of the fault can be analyzed quickly, the fault located, and a way of resolving it found. The invention allows services on the server to recover quickly and reduces losses.

Description

Kernel fault locating method and device
Technical field
The present invention relates to the technical field of servers, and in particular to a kernel fault locating method and device.
Background
As customer service demands keep growing, server performance must keep increasing, and server hardware configurations keep being upgraded: a server may have more than a thousand CPU cores and more than a terabyte of memory. More hardware also raises the failure rate. Operating systems are becoming more complex, and as hardware is added the number of drivers grows with it, so more bugs are introduced. When a server fails, the cause must be analyzed quickly and a solution found, which requires saving or obtaining the relevant data for analysis. Especially when critical services are deployed on the server, handling the problem quickly reduces the customer's economic loss and ensures fast service recovery.
In the prior art, a common fault locating method is as follows. The K-UX operating system is installed and run on the server, and under normal conditions the system runs in the K-UX kernel. When a serious fault occurs, the K-UX kernel hangs and a crash kernel is started (crash kernel: a small Linux kernel used mainly to save the K-UX kernel's memory data to disk). The crash kernel saves the memory data used by the K-UX kernel to disk so that the problem can be analyzed and located after the next restart. After the crash kernel has collected the K-UX kernel's memory information, the system restarts and enters the BIOS, which performs hardware initialization and other operations; in its final stage, the BIOS loads the K-UX kernel to start the system. Once the K-UX system is running, the memory data that the crash kernel saved to disk is analyzed (as shown in Fig. 2). The prior art has the following drawbacks: 1. The user must configure the crash kernel and allocate memory to it, which wastes some memory space. 2. Saving the memory data requires a large amount of disk space. 3. Many users do not configure the crash kernel when installing K-UX, which makes later problem locating very difficult.
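As an illustration of the third drawback: on a Linux-like system, one way to tell whether a crash kernel has been configured is to look for a crashkernel= parameter on the kernel command line. The following is a minimal sketch that assumes K-UX, being Linux-like, exposes its boot parameters at /proc/cmdline. That path and both function names are assumptions made for illustration and are not taken from the patent.

```python
from typing import Optional


def crashkernel_reservation(cmdline: str) -> Optional[str]:
    """Return the value of the crashkernel= boot parameter, or None if absent."""
    for token in cmdline.split():
        if token.startswith("crashkernel="):
            return token.split("=", 1)[1]
    return None


def read_cmdline(path: str = "/proc/cmdline") -> str:
    # Assumption: the system exposes its boot parameters at this Linux path.
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

A result of None would correspond to the case in which no memory has been reserved for a crash kernel, so no memory data can be saved by it after a serious fault.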
Summary of the invention
In view of the above problems, the present invention proposes a kernel fault locating method and device that collect the memory information of the faulty system through the BMC, so that the cause of a fault can be analyzed quickly and the fault located.
The present invention provides the following technical solution:
In one aspect, the present invention provides a kernel fault locating method, comprising:
Step 101: monitor whether the K-UX kernel and/or the hardware has failed;
Step 102: if the K-UX kernel and/or the hardware has failed, enter the BMC system and obtain the memory information of the faulty system;
Step 103: analyze the memory information of the faulty system and locate the fault.
After the fault is located, the method further comprises resolving the fault and restoring normal operation of the server.
The faulty system is the K-UX system or the hardware system.
The K-UX kernel fault includes at least one of a null pointer, an array out-of-bounds access, a soft deadlock, and a hard deadlock; the hardware fault includes at least one of a disk sector that cannot be read or written and a CPU core that cannot work normally.
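The fault categories listed above, and the system each one belongs to, can be written down as a small data model. This is only a sketch: the enum and function names are hypothetical, and the patent does not prescribe any particular representation.

```python
from enum import Enum


class FaultySystem(Enum):
    K_UX = "K-UX system"
    HARDWARE = "hardware system"


class Fault(Enum):
    NULL_POINTER = "null pointer"
    ARRAY_OUT_OF_BOUNDS = "array out-of-bounds access"
    SOFT_DEADLOCK = "soft deadlock"
    HARD_DEADLOCK = "hard deadlock"
    DISK_SECTOR_UNREADABLE = "disk sector cannot be read or written"
    CPU_CORE_FAILED = "CPU core cannot work normally"


# The two hardware faults named in the text; every other fault is a kernel fault.
_HARDWARE_FAULTS = {Fault.DISK_SECTOR_UNREADABLE, Fault.CPU_CORE_FAILED}


def faulty_system(fault: Fault) -> FaultySystem:
    """Map a fault to the system whose memory information is to be collected."""
    if fault in _HARDWARE_FAULTS:
        return FaultySystem.HARDWARE
    return FaultySystem.K_UX
```

For example, faulty_system(Fault.SOFT_DEADLOCK) yields the K-UX system, while faulty_system(Fault.CPU_CORE_FAILED) yields the hardware system.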
In addition, the present invention provides a kernel fault locating device, comprising:
a monitoring module, configured to monitor whether the K-UX kernel and/or the hardware has failed;
an acquisition module, configured to enter the BMC system when the K-UX kernel and/or the hardware has failed and to obtain the memory information of the faulty system;
a locating module, configured to analyze the memory information of the faulty system and locate the fault.
After the fault is located, the fault is further resolved and normal operation of the server is restored.
The faulty system is the K-UX system or the hardware system.
The K-UX kernel fault includes at least one of a null pointer, an array out-of-bounds access, a soft deadlock, and a hard deadlock; the hardware fault includes at least one of a disk sector that cannot be read or written and a CPU core that cannot work normally.
The present invention provides a kernel fault locating method and device. The server's system and hardware are monitored for faults. When the system on the server or the hardware fails, the memory information of the faulty system is collected through the BMC and the memory data is analyzed, so that the cause of the fault can be analyzed quickly, the fault located, and a way of resolving it found. The invention allows services on the server to recover quickly and reduces losses.
Brief description of the drawings
Fig. 1 is the flow chart of the present invention;
Fig. 2 is the flow chart of the prior art.
Detailed description
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Based on the above, in one aspect, an embodiment of the present invention provides a kernel fault locating method. Fig. 1 is the flow chart of the present invention. The method comprises:
Step 101: monitor whether the K-UX kernel and/or the hardware has failed;
K-UX: Inspur's operating system, which is Linux-like. The K-UX operating system is installed on the server and runs normally, and faults in the K-UX kernel or in the hardware are monitored;
Step 102: if the K-UX kernel and/or the hardware has failed, enter the BMC system and obtain the memory information of the faulty system;
When the K-UX kernel fails, log in to the BMC system and obtain the K-UX memory information. BMC: Baseboard Management Controller, which runs a small operating system independent of the server system; its purpose is to facilitate remote management, monitoring, installation, restart, and similar operations on the server. Serious K-UX kernel faults: faults such as a null pointer, an array out-of-bounds access, a soft deadlock, or a hard deadlock that prevent the K-UX system from continuing to work. Hardware faults: faults that make hardware unusable, for example some disk sectors that cannot be read or written, or some CPU cores that cannot work normally.
Step 103: analyze the memory information of the faulty system and locate the fault.
Analyze the obtained K-UX memory information and locate the cause of the fault; then resolve the fault and restore normal operation of the server.
The K-UX kernel fault includes at least one of a null pointer, an array out-of-bounds access, a soft deadlock, and a hard deadlock; the hardware fault includes at least one of a disk sector that cannot be read or written and a CPU core that cannot work normally.
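The patent does not describe how the memory information is analyzed. One plausible approach, sketched here under stated assumptions, is to extract the kernel log text from the collected memory information and search it for known fault signatures. The patterns below are modeled on messages printed by mainline Linux kernels; whether K-UX prints the same messages is an assumption, and the function name is hypothetical.

```python
import re
from typing import List

# Assumed signatures, modeled on mainline Linux kernel messages (not from the patent).
SIGNATURES = [
    ("null pointer", re.compile(r"NULL pointer dereference", re.IGNORECASE)),
    ("array out of bounds", re.compile(r"array-index-out-of-bounds", re.IGNORECASE)),
    ("soft deadlock", re.compile(r"soft lockup", re.IGNORECASE)),
    ("hard deadlock", re.compile(r"hard lockup", re.IGNORECASE)),
]


def locate_faults(log_text: str) -> List[str]:
    """Return the fault categories whose signature appears in the log text."""
    return [name for name, pattern in SIGNATURES if pattern.search(log_text)]
```

Categories are reported in the order of the signature table, not in the order they occur in the log, and an empty list means that none of the assumed signatures matched.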
The present invention provides a kernel fault locating method. The server's system and hardware are monitored for faults. When the system on the server or the hardware fails, the memory information of the faulty system is collected through the BMC and the memory data is analyzed, so that the cause of the fault can be analyzed quickly, the fault located, and a way of resolving it found. The invention allows services on the server to recover quickly and reduces losses.
In another aspect, an embodiment of the present invention provides a kernel fault locating device, comprising:
a monitoring module 201, configured to monitor whether the K-UX kernel and/or the hardware has failed;
K-UX: Inspur's operating system, which is Linux-like. The K-UX operating system is installed on the server and runs normally, and faults in the K-UX kernel or in the hardware are monitored;
an acquisition module 202, configured to enter the BMC system when the K-UX kernel and/or the hardware has failed and to obtain the memory information of the faulty system;
When the K-UX kernel fails, log in to the BMC system and obtain the K-UX memory information. BMC: Baseboard Management Controller, which runs a small operating system independent of the server system; its purpose is to facilitate remote management, monitoring, installation, restart, and similar operations on the server. Serious K-UX kernel faults: faults such as a null pointer, an array out-of-bounds access, a soft deadlock, or a hard deadlock that prevent the K-UX system from continuing to work. Hardware faults: faults that make hardware unusable, for example some disk sectors that cannot be read or written, or some CPU cores that cannot work normally.
a locating module 203, configured to analyze the memory information of the faulty system and locate the fault.
Analyze the obtained K-UX memory information and locate the cause of the fault; then resolve the fault and restore normal operation of the server.
The K-UX kernel fault includes at least one of a null pointer, an array out-of-bounds access, a soft deadlock, and a hard deadlock; the hardware fault includes at least one of a disk sector that cannot be read or written and a CPU core that cannot work normally.
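The cooperation of the three modules can be sketched as a small composition in which each module is supplied as a callable. The class name, field names, and signatures are hypothetical; in particular, how the acquisition module talks to the BMC is not specified in the patent and is left to the supplied callable.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class KernelFaultLocator:
    """Wires the monitoring, acquisition, and locating modules together."""

    monitor: Callable[[], bool]     # module 201: returns True when a fault is detected
    acquire: Callable[[], bytes]    # module 202: memory information obtained via the BMC
    locate: Callable[[bytes], str]  # module 203: analysis result for the memory information

    def run_once(self) -> Optional[str]:
        if not self.monitor():
            return None  # no fault detected, so nothing is collected
        memory_info = self.acquire()
        return self.locate(memory_info)
```

Keeping the modules as injected callables means the acquisition step is invoked only after the monitoring step reports a fault, which matches the order of modules 201 to 203.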
The present invention provides a kernel fault locating device. The server's system and hardware are monitored for faults. When the system on the server or the hardware fails, the memory information of the faulty system is collected through the BMC and the memory data is analyzed, so that the cause of the fault can be analyzed quickly, the fault located, and a way of resolving it found. The invention allows services on the server to recover quickly and reduces losses.
The foregoing description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A kernel fault locating method, characterized by comprising:
Step 101: monitoring whether the K-UX kernel and/or the hardware has failed;
Step 102: if the K-UX kernel and/or the hardware has failed, entering the BMC system and obtaining the memory information of the faulty system;
Step 103: analyzing the memory information of the faulty system and locating the fault.
2. The method according to claim 1, characterized in that, after the fault is located, the method further comprises resolving the fault and restoring normal operation of the server.
3. The method according to claim 1, characterized in that the faulty system is the K-UX system or the hardware system.
4. The method according to claim 1, characterized in that the K-UX kernel fault includes at least one of a null pointer, an array out-of-bounds access, a soft deadlock, and a hard deadlock; and the hardware fault includes at least one of a disk sector that cannot be read or written and a CPU core that cannot work normally.
5. A kernel fault locating device, characterized in that the device comprises:
a monitoring module, configured to monitor whether the K-UX kernel and/or the hardware has failed;
an acquisition module, configured to enter the BMC system when the K-UX kernel and/or the hardware has failed and to obtain the memory information of the faulty system;
a locating module, configured to analyze the memory information of the faulty system and locate the fault.
6. The device according to claim 5, characterized in that, after the fault is located, the fault is further resolved and normal operation of the server is restored.
7. The device according to claim 5, characterized in that the faulty system is the K-UX system or the hardware system.
8. The device according to claim 5, characterized in that the K-UX kernel fault includes at least one of a null pointer, an array out-of-bounds access, a soft deadlock, and a hard deadlock; and the hardware fault includes at least one of a disk sector that cannot be read or written and a CPU core that cannot work normally.
CN201810026869.1A 2018-01-11 2018-01-11 Kernel fault locating method and device Pending CN108108259A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810026869.1A CN108108259A (en) 2018-01-11 2018-01-11 Kernel fault locating method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810026869.1A CN108108259A (en) 2018-01-11 2018-01-11 Kernel fault locating method and device

Publications (1)

Publication Number Publication Date
CN108108259A (en) 2018-06-01

Family

ID=62219541

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810026869.1A Pending CN108108259A (en) Kernel fault locating method and device

Country Status (1)

Country Link
CN (1) CN108108259A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105659215A (en) * 2014-06-24 2016-06-08 华为技术有限公司 Fault processing method, related device and computer
CN104598346A (en) * 2015-02-15 2015-05-06 浪潮电子信息产业股份有限公司 Monitoring and management device and method for quick fault positioning in server system
CN105183575A (en) * 2015-08-24 2015-12-23 浪潮(北京)电子信息产业有限公司 Processor fault diagnosis method, device and system
CN106293984A (en) * 2016-08-11 2017-01-04 浪潮(北京)电子信息产业有限公司 Method and device for automatically handling computer faults
CN107357684A (en) * 2017-07-07 2017-11-17 郑州云海信息技术有限公司 Kernel failure restart method and device
CN107368385A (en) * 2017-07-26 2017-11-21 郑州云海信息技术有限公司 BMC-controlled, scalable method and system for fast locating of multiple memory faults

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021056912A1 (en) * 2019-09-29 2021-04-01 苏州浪潮智能科技有限公司 Method and device for detecting memory downgrade error
US11853150B2 (en) 2019-09-29 2023-12-26 Inspur Suzhou Intelligent Technology Co., Ltd. Method and device for detecting memory downgrade error
CN112799917A (en) * 2021-02-08 2021-05-14 联想(北京)有限公司 Data processing method, device and equipment
CN112799917B (en) * 2021-02-08 2024-01-23 联想(北京)有限公司 Data processing method, device and equipment
CN114706708A (en) * 2022-05-24 2022-07-05 北京拓林思软件有限公司 Fault analysis method and system for Linux operating system
CN114706708B (en) * 2022-05-24 2022-08-30 北京拓林思软件有限公司 Fault analysis method and system for Linux operating system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180601