CN113852502A - Fault diagnosis method, device and equipment of intelligent network card and readable medium - Google Patents

Publication number
CN113852502A
Authority
CN
China
Prior art keywords
diagnosis
ncsi
network card
intelligent network
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202111112126.4A
Other languages
Chinese (zh)
Inventor
孙崇雨
高磊
刘齐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202111112126.4A priority Critical patent/CN113852502A/en
Publication of CN113852502A publication Critical patent/CN113852502A/en
Withdrawn legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0811 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking connectivity

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

The invention discloses a fault diagnosis method for an intelligent network card, comprising the following steps: in response to receiving a fault diagnosis instruction, a host end diagnoses a PCIE link and an NCSI channel and judges whether either is abnormal; if neither the PCIE link nor the NCSI channel is abnormal, the intelligent network card end performs self-diagnosis; and if abnormal information is identified through self-diagnosis, the abnormal information is recorded and compared with a local fault library, and a fault removal suggestion is provided. The invention also discloses a fault diagnosis device for the intelligent network card, computer equipment and a readable storage medium. The invention allows the intelligent network card to be diagnosed while in its working state, achieving online diagnosis without powering down or disassembling the machine, which improves operation and maintenance efficiency; it can also distinguish whether a fault is a hardware fault or a software fault, thereby decoupling software from hardware.

Description

Fault diagnosis method, device and equipment of intelligent network card and readable medium
Technical Field
The invention relates to the technical field of fault diagnosis, in particular to a fault diagnosis method, a fault diagnosis device, fault diagnosis equipment and a readable medium for an intelligent network card.
Background
In a cloud computing environment, the intelligent network card offloads network packet processing from the server host CPU to the card itself, fully freeing the computing resources of the host CPU. The intelligent network card is highly integrated and its devices are tightly coupled; it must not only run customer-specific programs but also support software and firmware upgrades.
Once an intelligent network card has gone into online operation, if a fault occurs during operation, quickly locating the fault type without disassembling the machine or powering it down is important for operation and maintenance work. The intelligent network card is a new kind of device that must operate together with a server, and front-line operation and maintenance engineers have not yet established a complete set of fault diagnosis rules for it.
Traditional troubleshooting of an intelligent network card usually collects the logs generated by the card in an in-band or out-of-band manner, determines the fault from the error information in the logs, and gives a suggested solution. Some hardware faults are investigated by powering off and disassembling the server and removing the intelligent network card for fault analysis. The traditional log analysis method cannot locate fault points comprehensively, quickly and accurately; powering off and disassembling the server for analysis affects the operating efficiency of the data center; moreover, because the intelligent network card operates together with a server, the log information it generates cannot cover the fault information of the whole server system.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a method, an apparatus, a device, and a readable medium for diagnosing a fault of an intelligent network card, so that the intelligent network card can be diagnosed in its working state, achieving online diagnosis without powering down or disassembling the machine and improving operation and maintenance efficiency; meanwhile, whether a fault is a hardware fault or a software fault can be distinguished, decoupling software from hardware; finally, fault points can be accurately located and fault removal suggestions are given based on a local fault library, which reduces the demands placed on operation and maintenance personnel.
Based on the above object, an aspect of the embodiments of the present invention provides a method for diagnosing a fault of an intelligent network card, including the following steps: in response to receiving a fault diagnosis instruction, a host end diagnoses a PCIE link and an NCSI channel and judges whether the PCIE link and the NCSI channel are abnormal or not; if the PCIE link and the NCSI channel are not abnormal, the intelligent network card end carries out self diagnosis; and if the abnormal information is identified through self-diagnosis, recording the abnormal information, comparing the abnormal information with a local fault library, and providing a fault removal suggestion.
In some embodiments, the self-diagnosis performed by the intelligent network card end includes: receiving the diagnosis instruction at the intelligent network card end and carrying out rapid detection; and, if the rapid detection identifies an abnormality, carrying out detailed diagnosis and recording the specific abnormality information of the error points.
In some embodiments, the self-diagnosis performed by the intelligent network card end includes: health diagnosis and stress testing of the attached devices, link connectivity diagnosis, and firmware version check.
In some embodiments, diagnosing the PCIE link and the NCSI channel and determining whether the PCIE link and the NCSI channel are abnormal includes: diagnosing the PCIE link and judging whether the PCIE link is abnormal; and, if the PCIE link is not abnormal, further diagnosing the NCSI channel and judging whether the NCSI channel is abnormal.
In some embodiments, diagnosing the PCIE link includes: acquiring PCIE slot information, checking the intelligent network card ID, link rate, link bandwidth and UE/CE errors, and performing a DMA stress test; further diagnosing the NCSI channel includes: checking the network communication state, NCSI-USB and NCSI-UART, and performing a stress test on the NCSI network channel.
In some embodiments, the method further comprises: if the PCIE link and/or the NCSI channel are abnormal, reporting abnormal information to operation and maintenance personnel to determine whether the intelligent network card end needs to carry out self diagnosis.
In some embodiments, the method further comprises: and if the abnormal information is not identified through self-diagnosis, generating a diagnosis report based on the self-diagnosis data, and uploading the diagnosis report to operation and maintenance personnel.
In another aspect of the embodiments of the present invention, a fault diagnosis apparatus for an intelligent network card is further provided, including: the first module is configured to respond to a received fault diagnosis instruction, diagnose a PCIE link and an NCSI channel by a host end, and judge whether the PCIE link and the NCSI channel are abnormal or not; the second module is configured to perform self-diagnosis by the intelligent network card end if the PCIE link and the NCSI channel are not abnormal; and a third module configured to record and compare the abnormal information with a local fault library and provide a fault removal suggestion if the abnormal information is identified through self-diagnosis.
In some embodiments, the second module is further configured to: receive the diagnosis instruction at the intelligent network card end and carry out rapid detection; and, if the rapid detection identifies an abnormality, carry out detailed diagnosis and record the specific abnormality information of the error points.
In some embodiments, the second module is further configured to perform: health diagnosis and stress testing of the attached devices, link connectivity diagnosis, and firmware version check.
In some embodiments, the first module is further configured to: diagnose the PCIE link and judge whether the PCIE link is abnormal; and, if the PCIE link is not abnormal, further diagnose the NCSI channel and judge whether the NCSI channel is abnormal.
In some embodiments, the first module is further configured to: acquire PCIE slot information, check the intelligent network card ID, link rate, link bandwidth and UE/CE errors, and perform a DMA stress test; and check the network communication state, NCSI-USB and NCSI-UART, and perform a stress test on the NCSI network channel.
In some embodiments, the second module is further configured to: if the PCIE link and/or the NCSI channel are abnormal, reporting abnormal information to operation and maintenance personnel to determine whether the intelligent network card end needs to carry out self diagnosis.
In some embodiments, the third module is further configured to: and if the abnormal information is not identified through self-diagnosis, generating a diagnosis report based on the self-diagnosis data, and uploading the diagnosis report to operation and maintenance personnel.
In another aspect of the embodiments of the present invention, there is also provided a computer device, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing steps of the method comprising: in response to receiving a fault diagnosis instruction, a host end diagnoses a PCIE link and an NCSI channel and judges whether the PCIE link and the NCSI channel are abnormal or not; if the PCIE link and the NCSI channel are not abnormal, the intelligent network card end carries out self diagnosis; and if the abnormal information is identified through self-diagnosis, recording the abnormal information, comparing the abnormal information with a local fault library, and providing a fault removal suggestion.
In some embodiments, the self-diagnosis performed by the intelligent network card end includes: receiving the diagnosis instruction at the intelligent network card end and carrying out rapid detection; and, if the rapid detection identifies an abnormality, carrying out detailed diagnosis and recording the specific abnormality information of the error points.
In some embodiments, the self-diagnosis performed by the intelligent network card end includes: health diagnosis and stress testing of the attached devices, link connectivity diagnosis, and firmware version check.
In some embodiments, diagnosing the PCIE link and the NCSI channel and determining whether the PCIE link and the NCSI channel are abnormal includes: diagnosing the PCIE link and judging whether the PCIE link is abnormal; and, if the PCIE link is not abnormal, further diagnosing the NCSI channel and judging whether the NCSI channel is abnormal.
In some embodiments, diagnosing the PCIE link includes: acquiring PCIE slot information, checking the intelligent network card ID, link rate, link bandwidth and UE/CE errors, and performing a DMA stress test; further diagnosing the NCSI channel includes: checking the network communication state, NCSI-USB and NCSI-UART, and performing a stress test on the NCSI network channel.
In some embodiments, the steps further comprise: if the PCIE link and/or the NCSI channel are abnormal, reporting abnormal information to operation and maintenance personnel to determine whether the intelligent network card end needs to carry out self diagnosis.
In some embodiments, the steps further comprise: and if the abnormal information is not identified through self-diagnosis, generating a diagnosis report based on the self-diagnosis data, and uploading the diagnosis report to operation and maintenance personnel.
In a further aspect of the embodiments of the present invention, a computer-readable storage medium is also provided, storing a computer program that, when executed by a processor, implements the above method steps.
The invention has at least the following beneficial technical effects: the intelligent network card is subjected to fault diagnosis in a working state, the purpose of online diagnosis is achieved, power failure disassembly is not needed, and operation and maintenance efficiency is improved; meanwhile, whether the fault type belongs to a hardware fault or a software fault can be distinguished, and the purpose of decoupling software and hardware is achieved; finally, fault points can be accurately positioned, fault removal suggestions are given based on a local fault library, and requirements on operation and maintenance personnel are reduced.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other embodiments from these drawings without creative effort.
Fig. 1 is a schematic diagram of an embodiment of a fault diagnosis method for an intelligent network card provided by the present invention;
Fig. 2 is a schematic diagram of an embodiment of a fault diagnosis device of an intelligent network card provided by the present invention;
Fig. 3 is a schematic diagram of an embodiment of a computer device provided by the present invention;
Fig. 4 is a schematic diagram of an embodiment of a computer-readable storage medium provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all uses of "first" and "second" in the embodiments of the present invention serve to distinguish two different entities or parameters that share the same name; "first" and "second" are used merely for convenience of description, should not be construed as limiting the embodiments of the present invention, and are not explained again in the following embodiments.
If the intelligent network card fails during operation, quickly determining on site whether the equipment has a software fault or a hardware fault is a problem that urgently needs to be solved.
Based on the above purpose, the first aspect of the embodiments of the present invention provides an embodiment of a method for diagnosing a fault of an intelligent network card. Fig. 1 is a schematic diagram illustrating an embodiment of a fault diagnosis method for an intelligent network card provided by the present invention. As shown in fig. 1, the method for diagnosing a fault of an intelligent network card according to an embodiment of the present invention includes the following steps:
001. in response to receiving the fault diagnosis instruction, the host end diagnoses the PCIE link and the NCSI channel and judges whether the PCIE link and the NCSI channel are abnormal or not;
002. if the PCIE link and the NCSI channel are not abnormal, the intelligent network card end carries out self diagnosis; and
003. and if the abnormal information is identified through self-diagnosis, recording the abnormal information, comparing the abnormal information with a local fault library, and providing a fault removal suggestion.
In this embodiment, when operation and maintenance personnel notice that the network card is behaving abnormally, they issue a diagnosis instruction at the host end. After receiving the instruction, the host end actively checks the functional components related to the intelligent network card: it first executes the network card's PCIe link diagnostic program and NCSI channel diagnostic program, and if neither check finds an error, the host end issues a diagnosis instruction to the attached intelligent network card. On receiving the diagnosis instruction during operation, the intelligent network card starts its self-diagnosis program, and the recorded fault information is compared with a local fault knowledge base to give a fault removal suggestion.
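The three-step flow of this embodiment can be sketched as follows. This is only a minimal illustration under stated assumptions: the names `run_fault_diagnosis`, `DiagnosisOutcome`, the callable parameters and the example fault-library entry are hypothetical and do not appear in the original disclosure, and the actual checks are passed in as stubs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DiagnosisOutcome:
    stage: str                                    # "host", "card" or "clean" (hypothetical labels)
    anomalies: List[str] = field(default_factory=list)
    suggestions: List[str] = field(default_factory=list)


def run_fault_diagnosis(
    host_check: Callable[[], List[str]],          # PCIe link and NCSI channel checks
    card_self_diagnosis: Callable[[], List[str]], # self-diagnosis on the card
    fault_library: Dict[str, str],                # anomaly key -> removal suggestion
) -> DiagnosisOutcome:
    # Step 001: the host end diagnoses the PCIe link and the NCSI channel.
    host_errors = host_check()
    if host_errors:
        return DiagnosisOutcome("host", host_errors)

    # Step 002: no host-side abnormality, so the card performs self-diagnosis.
    anomalies = card_self_diagnosis()
    if not anomalies:
        return DiagnosisOutcome("clean")

    # Step 003: compare recorded anomalies with the local fault library.
    suggestions = [
        fault_library.get(a, "no matching entry in local fault library")
        for a in anomalies
    ]
    return DiagnosisOutcome("card", anomalies, suggestions)


# Usage with stubbed checks (all values are illustrative assumptions).
outcome = run_fault_diagnosis(
    host_check=lambda: [],
    card_self_diagnosis=lambda: ["dimm_ecc_error"],
    fault_library={"dimm_ecc_error": "replace the DIMM attached to the SOC"},
)
print(outcome.stage, outcome.suggestions)
```

Passing the checks in as callables keeps the control flow separate from the hardware-specific probes, which the disclosure does not specify at code level.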
In this embodiment, taking the analysis of a failed intelligent network card running online as an example:
first, operation and maintenance personnel find that the intelligent network card of a server may be faulty, for example low efficiency in sending and receiving network packets, abnormal voltage or current values on the intelligent network card, or an abnormal temperature value on the intelligent network card;
second, operation and maintenance personnel issue a diagnosis instruction;
third, after receiving the diagnosis instruction, the host executes the network card diagnostic program and checks whether the PCIe link and the NCSI channel have problems;
fourth, if neither check finds an error, the host end issues a diagnosis instruction to the attached intelligent network card;
fifth, if the checks find errors, the errors are recorded and reported to operation and maintenance personnel, who decide whether to issue a diagnosis instruction to the attached intelligent network card through the host end;
sixth, the intelligent network card starts a self-diagnosis program when it receives a diagnosis instruction during operation;
seventh, the intelligent network card sequentially performs health diagnosis and stress testing of the devices attached to the SOC, link connectivity diagnosis, a firmware version check, stress testing of the devices attached to the FPGA, and FPGA program self-diagnosis;
eighth, the above diagnosis items are first run as rapid detection; if the rapid detection program identifies abnormal information, it automatically switches to the detailed diagnosis program, which reports the error points more specifically and records the errors; after the fault information is recorded, it is compared with the local fault knowledge base to give a fault removal suggestion;
ninth, if all of the above items pass, a diagnosis report is submitted to operation and maintenance personnel, who then analyze faults caused by the running of the network card software.
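The comparison against the local fault knowledge base in the eighth step can be illustrated with a simple lookup. The sketch below is hedged: the disclosure does not describe the structure of the knowledge base, so the `(component, symptom)` key, the `FaultRecord` type and every entry in `LOCAL_FAULT_LIBRARY` are assumptions made purely for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple


class FaultRecord(NamedTuple):
    component: str   # e.g. "FPGA-DDR" (hypothetical identifier)
    symptom: str     # e.g. "stress_test_failed" (hypothetical identifier)


# Hypothetical knowledge base: (component, symptom) -> (fault type, suggestion).
LOCAL_FAULT_LIBRARY: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("FPGA-DDR", "stress_test_failed"): ("hardware", "replace the intelligent network card"),
    ("BMC", "version_mismatch"): ("software", "upgrade the BMC firmware"),
}


def suggest_fixes(records: List[FaultRecord]) -> List[str]:
    """Map each recorded fault to a fault type and a removal suggestion."""
    lines = []
    for rec in records:
        entry = LOCAL_FAULT_LIBRARY.get((rec.component, rec.symptom))
        if entry is None:
            # Unknown faults are handed back to the operators.
            lines.append(f"{rec.component}/{rec.symptom}: not in fault library, escalate to operators")
        else:
            fault_type, suggestion = entry
            lines.append(f"{rec.component}/{rec.symptom}: {fault_type} fault, {suggestion}")
    return lines


for line in suggest_fixes([
    FaultRecord("BMC", "version_mismatch"),
    FaultRecord("UART", "no_response"),
]):
    print(line)
```

Storing the fault type alongside the suggestion is one way to reflect the stated goal of distinguishing hardware faults from software faults.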
In some embodiments of the present invention, the self-diagnosis performed by the intelligent network card end includes: receiving the diagnosis instruction at the intelligent network card end and carrying out rapid detection; and, if the rapid detection identifies an abnormality, carrying out detailed diagnosis and recording the specific abnormality information of the error points.
In this embodiment, the intelligent network card performs rapid detection after receiving the diagnosis instruction; if the rapid detection program identifies abnormal information, it automatically switches to the detailed diagnosis program, which reports the error points more specifically and records the errors.
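The escalation from rapid detection to detailed diagnosis can be sketched as below. This is an illustrative assumption about how the two levels might be wired together; `diagnose_item`, the callable signatures and the example error string are hypothetical names, not part of the disclosure.

```python
from typing import Callable, List

QuickCheck = Callable[[], bool]           # returns True when the item passes
DetailedCheck = Callable[[], List[str]]   # returns the specific error points


def diagnose_item(name: str, quick: QuickCheck, detailed: DetailedCheck,
                  error_log: List[str]) -> bool:
    """Run the rapid check; switch to the detailed check only if it fails."""
    if quick():
        return True
    # Rapid detection found an abnormality: locate and record the error points.
    for point in detailed():
        error_log.append(f"{name}: {point}")
    return False


# Usage with stubbed checks (the error text is an illustrative assumption).
error_log: List[str] = []
ok = diagnose_item(
    "link-connectivity",
    quick=lambda: False,
    detailed=lambda: ["I2C bus 0: no ACK from device"],
    error_log=error_log,
)
print(ok, error_log)
```

Because the detailed check runs only after a rapid check fails, a healthy card pays only the cost of the rapid pass.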
In some embodiments of the present invention, the self-diagnosis performed by the intelligent network card end includes: health diagnosis and stress testing of the attached devices, link connectivity diagnosis, and firmware version check.
In this embodiment, when the intelligent network card receives a diagnosis instruction during operation, it starts the self-diagnosis program and runs the following diagnosis items in sequence:
health diagnosis of the devices attached to the intelligent network card SOC, including CPU information (core count, thread count, running frequency, arithmetic test, register read/write test and PCIe information), DIMM inspection and disk bad-track inspection;
stress testing of the devices attached to the intelligent network card SOC, including CPU stress, memory stress, disk stress and network channel stress;
link connectivity diagnosis of the intelligent network card, where the channels checked include but are not limited to USB, UART, I2C, SPI, LPC and GPIO;
firmware version check, where the items checked include but are not limited to the intelligent network card operating system, software development kit, test software, BIOS, BMC, FPGA, CPLD and other hardware versions;
stress testing of the devices attached to the intelligent network card FPGA, where the test content includes but is not limited to a stress test of the DDR attached to the FPGA, a stress test of the ROM attached to the FPGA, a stress test of the SPI FLASH attached to the FPGA, and a traffic test of the FPGA's external network port; to support this test, the FPGA needs to expose an interface to PCIe user space when the IP core is instantiated, and the SOC initiates the test;
self-diagnosis of the intelligent network card FPGA program; to support it, the self-diagnosis program needs to be designed during FPGA development and built into the FPGA program.
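The sequential execution of the six diagnosis items can be sketched as follows, with the firmware version check filled in as one concrete item. The item identifiers, the function names and the version strings are hypothetical, and comparing versions by exact string equality is an assumption; the disclosure does not say how versions are compared.

```python
from typing import Callable, Dict, List

# Order taken from the description above; the identifiers are hypothetical.
SELF_DIAGNOSIS_ORDER = [
    "soc_device_health",
    "soc_device_stress",
    "link_connectivity",
    "firmware_version",
    "fpga_device_stress",
    "fpga_program_self_diagnosis",
]


def check_firmware_versions(installed: Dict[str, str],
                            expected: Dict[str, str]) -> List[str]:
    """Return one finding per component whose version is missing or differs."""
    findings = []
    for component, want in expected.items():
        have = installed.get(component)
        if have != want:  # assumption: exact string match
            findings.append(f"{component}: expected {want}, found {have}")
    return findings


def run_self_diagnosis(probes: Dict[str, Callable[[], List[str]]]) -> Dict[str, List[str]]:
    """Run every probe in the documented order and collect its findings."""
    return {item: probes[item]() for item in SELF_DIAGNOSIS_ORDER}


# Usage: every probe is a stub that passes, except the firmware version check.
probes: Dict[str, Callable[[], List[str]]] = {item: (lambda: []) for item in SELF_DIAGNOSIS_ORDER}
probes["firmware_version"] = lambda: check_firmware_versions(
    installed={"BIOS": "1.0.2", "BMC": "2.1.0"},
    expected={"BIOS": "1.0.2", "BMC": "2.2.0", "CPLD": "0.9"},
)
results = run_self_diagnosis(probes)
print(results["firmware_version"])
```

An empty list of findings means the item passed, so the same result dictionary can feed either the fault-library comparison or the diagnosis report.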
In some embodiments of the present invention, diagnosing the PCIE link and the NCSI channel and determining whether the PCIE link and the NCSI channel are abnormal includes: diagnosing the PCIE link and judging whether the PCIE link is abnormal; and, if the PCIE link is not abnormal, further diagnosing the NCSI channel and judging whether the NCSI channel is abnormal.
In this embodiment, the network card's PCIe link diagnostic program is executed first; if the PCIe link check passes, NCSI channel diagnosis is executed; and if both checks pass, the host end issues a diagnosis instruction to the attached intelligent network card.
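The ordering described in this embodiment, in which the NCSI channel is diagnosed only after the PCIe link passes, can be sketched as a simple gate. The names `host_side_diagnosis` and `HostCheckResult` and the stubbed checks are illustrative assumptions.

```python
from typing import Callable, List, NamedTuple, Optional


class HostCheckResult(NamedTuple):
    pcie_errors: List[str]
    ncsi_errors: Optional[List[str]]   # None when the NCSI check was not run
    issue_to_card: bool                # True when the card should self-diagnose


def host_side_diagnosis(check_pcie: Callable[[], List[str]],
                        check_ncsi: Callable[[], List[str]]) -> HostCheckResult:
    # The PCIe link is diagnosed first.
    pcie_errors = check_pcie()
    if pcie_errors:
        return HostCheckResult(pcie_errors, None, False)
    # Only a healthy PCIe link leads on to NCSI channel diagnosis.
    ncsi_errors = check_ncsi()
    # The instruction is issued to the card only when both checks pass.
    return HostCheckResult(pcie_errors, ncsi_errors, not ncsi_errors)


# Usage with stubbed checks (the error text is an illustrative assumption).
print(host_side_diagnosis(lambda: [], lambda: []))
print(host_side_diagnosis(lambda: ["link width degraded"], lambda: []))
```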
In some embodiments of the present invention, diagnosing the PCIE link includes: acquiring PCIE slot information, checking the intelligent network card ID, link rate, link bandwidth and UE/CE errors, and performing a DMA stress test; further diagnosing the NCSI channel includes: checking the network communication state, NCSI-USB and NCSI-UART, and performing a stress test on the NCSI network channel.
In this embodiment, the network card PCIe link diagnosis covers PCIe slot information acquisition, a network card device ID check, a Link Speed check, a Link Width check, a UE/CE Error check, and a PCIe DMA stress test; the NCSI channel diagnosis covers network communication state detection, an NCSI network channel stress test, NCSI-USB detection and NCSI-UART detection.
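The static part of the PCIe link diagnosis (device ID, Link Speed, Link Width and UE/CE error checks) can be sketched as a comparison between an observed reading and the expected values. This is a hedged illustration: `PcieLinkReading`, `evaluate_pcie_link`, the `ce_threshold` parameter and the example device ID are hypothetical, how the reading is obtained from the hardware is left out, and the slot information acquisition and DMA stress test are not covered.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PcieLinkReading:
    device_id: str
    link_speed_gts: float        # negotiated speed in GT/s
    link_width: int              # negotiated number of lanes
    uncorrectable_errors: int    # UE count
    correctable_errors: int      # CE count


def evaluate_pcie_link(reading: PcieLinkReading, expected_device_id: str,
                       expected_speed_gts: float, expected_width: int,
                       ce_threshold: int = 0) -> List[str]:
    """Return a list of findings; an empty list means the link looks healthy."""
    findings = []
    if reading.device_id != expected_device_id:
        findings.append(f"unexpected device ID {reading.device_id}")
    if reading.link_speed_gts < expected_speed_gts:
        findings.append(
            f"link speed degraded: {reading.link_speed_gts} GT/s < {expected_speed_gts} GT/s")
    if reading.link_width < expected_width:
        findings.append(f"link width degraded: x{reading.link_width} < x{expected_width}")
    if reading.uncorrectable_errors > 0:
        findings.append(f"{reading.uncorrectable_errors} uncorrectable error(s)")
    if reading.correctable_errors > ce_threshold:  # threshold is an assumption
        findings.append(f"{reading.correctable_errors} correctable error(s)")
    return findings


# Usage: a card expected at 16 GT/s x16 that trained at 8 GT/s x8 (illustrative values).
reading = PcieLinkReading("0x1234", 8.0, 8, 0, 3)
for finding in evaluate_pcie_link(reading, "0x1234", 16.0, 16):
    print(finding)
```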
In some embodiments of the invention, the method further comprises: if the PCIE link and/or the NCSI channel are abnormal, reporting the abnormal information to operation and maintenance personnel to determine whether the intelligent network card end needs to carry out self-diagnosis.
In this embodiment, the network card's PCIe link diagnostic program is executed first; if the PCIe link check passes, NCSI channel diagnosis is executed; if the checks find errors, the errors are recorded and reported to operation and maintenance personnel, who decide whether to issue a diagnosis instruction to the attached intelligent network card through the host end.
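The abnormal branch, in which errors are recorded, reported, and the decision is left to the operation and maintenance personnel, can be sketched as follows. The function name and the two callbacks (`notify_operator`, `operator_approves`) are hypothetical placeholders for whatever reporting channel and approval mechanism a real deployment uses.

```python
from typing import Callable, List


def handle_host_errors(errors: List[str], error_log: List[str],
                       notify_operator: Callable[[List[str]], None],
                       operator_approves: Callable[[List[str]], bool]) -> bool:
    """Record and report host-side errors.

    Returns True if the operators decide that the diagnosis instruction
    should still be issued to the attached intelligent network card.
    """
    error_log.extend(errors)          # error record
    notify_operator(errors)           # report to operation and maintenance personnel
    return operator_approves(errors)  # the operators make the decision


# Usage with stubbed callbacks (the policy shown is an illustrative assumption).
log: List[str] = []
sent: List[List[str]] = []
proceed = handle_host_errors(
    ["NCSI-UART: no response"],
    error_log=log,
    notify_operator=sent.append,
    operator_approves=lambda errs: all("PCIe" not in e for e in errs),
)
print(proceed, log)
```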
In some embodiments of the invention, the method further comprises: if no abnormal information is identified through self-diagnosis, generating a diagnosis report based on the self-diagnosis data and uploading the diagnosis report to the operation and maintenance personnel.
In this embodiment, if all of the above items pass, a diagnosis report is submitted to the operation and maintenance personnel, who then analyze faults caused by the running of the network card software.
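Generating the diagnosis report for the case where self-diagnosis finds no abnormality can be sketched as below. The report format is not specified in the disclosure, so the JSON layout, the field names and the `build_diagnosis_report` function are assumptions chosen only for illustration.

```python
import json
from typing import Dict, List


def build_diagnosis_report(card_id: str, results: Dict[str, List[str]],
                           generated_at: str) -> str:
    """Build a JSON report when every self-diagnosis item passed.

    `results` maps each diagnosis item to its list of findings.
    """
    if any(results.values()):
        # This path is only for the no-abnormality case.
        raise ValueError("anomalies present; use the fault-library path instead")
    report = {
        "card_id": card_id,
        "generated_at": generated_at,
        "items": {name: "pass" for name in results},
        "conclusion": "no abnormality identified by self-diagnosis; "
                      "analyze the network card software",
    }
    return json.dumps(report, indent=2, sort_keys=True)


# Usage (identifier and timestamp are illustrative assumptions).
print(build_diagnosis_report(
    "nic0",
    {"link_connectivity": [], "firmware_version": []},
    generated_at="2021-09-18T10:00:00Z",
))
```

Taking the timestamp as a parameter keeps the function deterministic, which makes the report easy to test and to reproduce.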
It should be particularly noted that the steps in the foregoing embodiments of the fault diagnosis method for the intelligent network card may be interleaved, replaced, added, and deleted; fault diagnosis methods for the intelligent network card obtained through such reasonable permutations, combinations and transformations therefore also fall within the protection scope of the present invention, and the protection scope of the present invention should not be limited to the embodiments.
In view of the above, a second aspect of the embodiments of the present invention provides a fault diagnosis device for an intelligent network card. Fig. 2 is a schematic diagram illustrating an embodiment of a fault diagnosis apparatus for an intelligent network card provided by the present invention. As shown in fig. 2, the fault diagnosis apparatus for an intelligent network card according to the embodiment of the present invention includes the following modules: the first module 011 is configured to respond to the received fault diagnosis instruction, diagnose the PCIE link and the NCSI channel by the host end, and determine whether the PCIE link and the NCSI channel are abnormal; a second module 012, configured to perform self-diagnosis by the intelligent network card end if neither the PCIE link nor the NCSI channel is abnormal; and a third module 013 configured to, if the anomaly information is identified by self-diagnosis, record the anomaly information and compare it with the local fault library, and provide a troubleshooting recommendation.
In some embodiments of the invention, the second module 012 is further configured to: receive the diagnosis instruction at the intelligent network card end and carry out rapid detection; and, if the rapid detection identifies an abnormality, carry out detailed diagnosis and record the specific abnormality information of the error points.
In some embodiments of the invention, the second module 012 is further configured to perform: health diagnosis and stress testing of the attached devices, link connectivity diagnosis, and firmware version check.
In some embodiments of the invention, the first module 011 is further configured to: diagnose the PCIE link and judge whether the PCIE link is abnormal; and, if the PCIE link is not abnormal, further diagnose the NCSI channel and judge whether the NCSI channel is abnormal.
In some embodiments of the invention, the first module 011 is further configured to: acquire PCIE slot information, check the intelligent network card ID, link rate, link bandwidth and UE/CE errors, and perform a DMA stress test; and check the network communication state, NCSI-USB and NCSI-UART, and perform a stress test on the NCSI network channel.
In some embodiments of the invention, the second module 012 is further configured to: if the PCIE link and/or the NCSI channel are abnormal, reporting the abnormal information to operation and maintenance personnel to determine whether the intelligent network card end needs to carry out self diagnosis.
In some embodiments of the invention, the third module 013 is further configured to: if the abnormality information is not recognized by self-diagnosis, a diagnosis report is generated based on the self-diagnosis data, and the diagnosis report is uploaded to the operation and maintenance personnel.
In view of the above object, a third aspect of the embodiments of the present invention provides a computer device. Fig. 3 is a schematic diagram of an embodiment of a computer device provided by the present invention. As shown in Fig. 3, the computer device of the embodiment of the present invention includes the following components: at least one processor 021; and a memory 022, the memory 022 storing computer instructions 023 executable on the processor, the instructions, when executed by the processor, implementing the steps of a method comprising: in response to receiving a fault diagnosis instruction, the host end diagnoses the PCIE link and the NCSI channel and determines whether the PCIE link and the NCSI channel are abnormal; if neither the PCIE link nor the NCSI channel is abnormal, the intelligent network card end performs self-diagnosis; and if abnormal information is identified through self-diagnosis, the abnormal information is recorded and compared with a local fault library, and a troubleshooting suggestion is provided.
In some embodiments of the present invention, the self-diagnosis performed by the intelligent network card end includes: receiving the diagnosis instruction at the intelligent network card end and performing rapid detection; and, if the rapid detection identifies an abnormality, performing detailed diagnosis and recording the specific abnormal information at the point where the error was reported.
In some embodiments of the present invention, the self-diagnosis performed by the intelligent network card end includes: health diagnosis and stress testing of the downstream (attached) devices, link connectivity diagnosis, and a firmware version check.
In some embodiments of the present invention, diagnosing the PCIE link and the NCSI channel and determining whether the PCIE link and the NCSI channel are abnormal includes: diagnosing the PCIE link and determining whether the PCIE link is abnormal; and, if the PCIE link is not abnormal, further diagnosing the NCSI channel and determining whether the NCSI channel is abnormal.
In some embodiments of the present invention, diagnosing the PCIE link includes: acquiring PCIE slot information, checking the intelligent network card ID, the link rate, the link bandwidth and UE/CE errors, and performing a DMA stress test; and further diagnosing the NCSI channel includes: checking the network connectivity status, the NCSI universal serial bus (USB) and the NCSI universal asynchronous receiver-transmitter (UART), and performing a stress test on the NCSI network channel.
In some embodiments of the invention, the steps further comprise: if the PCIE link and/or the NCSI channel are abnormal, reporting the abnormal information to operation and maintenance personnel to determine whether the intelligent network card end needs to carry out self diagnosis.
In some embodiments of the invention, the steps further comprise: if no abnormal information is identified through self-diagnosis, generating a diagnosis report based on the self-diagnosis data and uploading the diagnosis report to operation and maintenance personnel.
The invention also provides a computer readable storage medium. FIG. 4 is a schematic diagram illustrating an embodiment of a computer-readable storage medium provided by the present invention. As shown in fig. 4, the computer readable storage medium 031 stores a computer program 032 which, when executed by a processor, performs the method as described above.
Finally, it should be noted that, as those skilled in the art can understand, all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware. The program of the fault diagnosis method for an intelligent network card may be stored in a computer readable storage medium and, when executed, may include the processes of the method embodiments described above. The storage medium of the program may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like. The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.
Furthermore, the methods disclosed according to embodiments of the present invention may also be implemented as a computer program executed by a processor, which may be stored in a computer-readable storage medium. When executed by the processor, the computer program performs the above-described functions defined in the methods disclosed in embodiments of the invention.
Further, the above method steps and system elements may also be implemented using a controller and a computer readable storage medium for storing a computer program for causing the controller to implement the functions of the above steps or elements.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are all included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The serial numbers of the embodiments disclosed in the embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples. Within the idea of the embodiments of the invention, technical features in the above embodiment or in different embodiments may also be combined, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. A fault diagnosis method of an intelligent network card is characterized by comprising the following steps:
in response to receiving a fault diagnosis instruction, a host end diagnoses a PCIE link and an NCSI channel and judges whether the PCIE link and the NCSI channel are abnormal or not;
if the PCIE link and the NCSI channel are not abnormal, the intelligent network card end carries out self diagnosis; and
if abnormal information is identified through self-diagnosis, recording the abnormal information, comparing the abnormal information with a local fault library, and providing a troubleshooting suggestion.
2. The method as claimed in claim 1, wherein the self-diagnosing performed by the intelligent network card end includes:
receiving the diagnosis instruction at the intelligent network card end and performing rapid detection;
if the rapid detection identifies an abnormality, performing detailed diagnosis and recording the specific abnormal information at the point where the error was reported.
3. The method as claimed in claim 1, wherein the self-diagnosing performed by the intelligent network card end includes:
performing health diagnosis and stress testing of the downstream (attached) devices, link connectivity diagnosis, and a firmware version check.
4. The method according to claim 1, wherein diagnosing the PCIE link and the NCSI channel and determining whether the PCIE link and the NCSI channel are abnormal comprises:
diagnosing a PCIE link, and judging whether the PCIE link is abnormal or not;
if the PCIE link is not abnormal, the NCSI channel is further diagnosed, and whether the NCSI channel is abnormal or not is judged.
5. The method according to claim 4, wherein diagnosing the PCIE link includes: acquiring PCIE slot information, checking the intelligent network card ID, the link rate, the link bandwidth and UE/CE errors, and performing a DMA stress test;
further diagnosing the NCSI channel includes: checking the network connectivity status, the NCSI universal serial bus (USB) and the NCSI universal asynchronous receiver-transmitter (UART), and performing a stress test on the NCSI network channel.
6. The fault diagnosis method of the intelligent network card according to claim 1, further comprising:
if the PCIE link and/or the NCSI channel are abnormal, reporting abnormal information to operation and maintenance personnel to determine whether the intelligent network card end needs to carry out self diagnosis.
7. The fault diagnosis method of the intelligent network card according to claim 1, further comprising:
if no abnormal information is identified through self-diagnosis, generating a diagnosis report based on the self-diagnosis data, and uploading the diagnosis report to operation and maintenance personnel.
8. A fault diagnosis apparatus of an intelligent network card, characterized by comprising:
the first module is configured to respond to a received fault diagnosis instruction, diagnose a PCIE link and an NCSI channel by a host end, and judge whether the PCIE link and the NCSI channel are abnormal or not;
the second module is configured to perform self-diagnosis by the intelligent network card end if the PCIE link and the NCSI channel are not abnormal; and
the third module is configured to, if abnormal information is identified through self-diagnosis, record the abnormal information, compare the abnormal information with a local fault library, and provide a troubleshooting suggestion.
9. A computer device, comprising:
at least one processor; and
a memory storing computer instructions executable on the processor, the instructions when executed by the processor implementing the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202111112126.4A 2021-09-18 2021-09-18 Fault diagnosis method, device and equipment of intelligent network card and readable medium Withdrawn CN113852502A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111112126.4A CN113852502A (en) 2021-09-18 2021-09-18 Fault diagnosis method, device and equipment of intelligent network card and readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111112126.4A CN113852502A (en) 2021-09-18 2021-09-18 Fault diagnosis method, device and equipment of intelligent network card and readable medium

Publications (1)

Publication Number Publication Date
CN113852502A true CN113852502A (en) 2021-12-28

Family

ID=78979116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111112126.4A Withdrawn CN113852502A (en) 2021-09-18 2021-09-18 Fault diagnosis method, device and equipment of intelligent network card and readable medium

Country Status (1)

Country Link
CN (1) CN113852502A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114979061A (en) * 2022-03-25 2022-08-30 苏州浪潮智能科技有限公司 Method, device, equipment and medium for intelligent network card to respond ARP

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114979061A (en) * 2022-03-25 2022-08-30 苏州浪潮智能科技有限公司 Method, device, equipment and medium for intelligent network card to respond ARP
CN114979061B (en) * 2022-03-25 2023-08-04 苏州浪潮智能科技有限公司 Method, device, equipment and medium for responding ARP (address resolution protocol) of intelligent network card

Similar Documents

Publication Publication Date Title
US10929260B2 (en) Traffic capture and debugging tools for identifying root causes of device failure during automated testing
CN112732477B (en) Method for fault isolation by out-of-band self-checking
CN107526647A (en) A kind of fault handling method, system and computer program product
CN113806127B (en) Server log collection method, device and readable storage medium
US20210111967A1 (en) Graphical user interface for traffic capture and debugging tool
CN112527582A (en) Detection method, detection device, detection equipment and storage medium of server cable
CN111722690B (en) Server power module monitoring method and device, server and storage medium
CN113852502A (en) Fault diagnosis method, device and equipment of intelligent network card and readable medium
CN111048138A (en) Hard disk fault detection method and related device
CN116737471B (en) BIOS automatic switching method and device, electronic equipment and storage medium
CN112311574A (en) Method, device and equipment for checking network topology connection
CN114443381A (en) Method, device, equipment and medium for checking server configuration
CN113992501A (en) Fault positioning system, method and computing device
CN116204361A (en) Asset management method, system, device and storage medium
CN111309553A (en) Method, system, equipment and medium for monitoring storage Jbod
CN112084097B (en) Disk alarm method and device
CN114064401A (en) Method and device for positioning hard disk fault, electronic equipment and storage medium
CN109491846B (en) Method and system for capturing SATA hard disk trace by server
CN111831511A (en) Detection processing method, device and medium for service host of cloud service
CN116719712B (en) Processor serial port log output method and device, electronic equipment and storage medium
CN110781042A (en) Method, device and medium for detecting UBM (Universal boot Module) backboard based on BMC (baseboard management controller)
CN115913895A (en) Server fault diagnosis alarm method, device, equipment and medium
CN113672498B (en) Automatic diagnosis test method, device and equipment
CN114490217B (en) Fault testing method, device and equipment for ME and BIOS interaction and readable medium
CN115955416A (en) Method, device, equipment and storage medium for testing UPI bandwidth reduction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20211228