CN111488233A

CN111488233A - Method and system for processing bandwidth loss problem of PCIe device

Info

Publication number: CN111488233A
Application number: CN202010254405.3A
Authority: CN
Inventors: 孙一心
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2020-04-02
Filing date: 2020-04-02
Publication date: 2020-08-04

Abstract

The system comprises a negotiation result acquisition module, a judgment module, a POST module, a restart count storage module, a stop module, and a fault locating module. The system further comprises a BIOS, a PCH, an EEPROM, a BMC, and a CPLD, wherein the EEPROM is arranged in the server, the PCH is communicatively connected to the BMC, and the PCH is connected to the CPLD through a GPIO signal. The system can effectively improve the fault handling efficiency and the stability of the server.

Description

Method and system for processing bandwidth loss problem of PCIe device

Technical Field

The present application relates to the technical field of server information transmission, and in particular to a method and system for handling the bandwidth-drop problem of a PCIe (Peripheral Component Interconnect Express, a high-speed serial computer expansion bus standard) device.

Background

The PCIe protocol is an important peripheral protocol of the server and is widely used on X86, ARM, PowerPC, and other platforms to meet different functional requirements of the server. However, the high rate of PCIe devices makes them prone to a common type of failure, namely the bandwidth-drop fault. A bandwidth-drop fault generally covers two cases: lane drop, i.e. the link width falls from X16 to X8, or from X8 to X4, etc.; and rate drop, i.e. the PCIe rate falls from Gen3 to Gen2, or from Gen3 to Gen1, and so on. How to handle the bandwidth-drop problem of a PCIe device, and thereby ensure the operating stability of both the PCIe device and the server, is therefore an important problem.
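To make the cost of a lane drop or a rate drop concrete, the following minimal sketch computes the theoretical one-direction link throughput for Gen1 to Gen3. The per-lane transfer rates and line encodings are the published PCIe values; the function name and table layout are illustrative only and are not part of the application.

```python
# Illustrative only: theoretical one-direction PCIe throughput for Gen1-Gen3.
# generation -> (per-lane transfer rate in GT/s, line-encoding efficiency)
GEN_PARAMS = {
    1: (2.5, 8 / 10),      # Gen1: 2.5 GT/s, 8b/10b encoding
    2: (5.0, 8 / 10),      # Gen2: 5.0 GT/s, 8b/10b encoding
    3: (8.0, 128 / 130),   # Gen3: 8.0 GT/s, 128b/130b encoding
}


def link_throughput_gb_per_s(gen: int, lanes: int) -> float:
    """Return the theoretical throughput of one link direction in GB/s."""
    rate_gt, efficiency = GEN_PARAMS[gen]
    # GT/s * efficiency gives usable Gbit/s per lane; divide by 8 for GB/s.
    return rate_gt * efficiency / 8 * lanes


if __name__ == "__main__":
    for gen, lanes in [(3, 16), (3, 8), (1, 16)]:
        print(f"Gen{gen} X{lanes}: {link_throughput_gb_per_s(gen, lanes):.3f} GB/s")
```

Under these figures, a lane drop from X16 to X8 at Gen3 halves the available throughput, and a rate drop from Gen3 to Gen1 at X16 reduces it to roughly a quarter.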

At present, the usual way of handling the bandwidth-drop problem of a PCIe device is as follows: after a bandwidth drop occurs, possible causes are inferred from experience in combination with the bandwidth-drop information, the hardware is then verified item by item, and finally the cause of the bandwidth drop is determined and the fault is handled.

However, in the current method, the fault-handling procedure is started whenever a bandwidth drop occurs on a PCIe device, so the hardware is troubleshot very frequently. Moreover, the hardware has to be checked item by item for every bandwidth drop, and the actual cause cannot be determined in a short time, so fault handling efficiency is low.

Disclosure of Invention

The application provides a method and a system for handling the bandwidth-drop problem of a PCIe device, which aim to solve the problem of low efficiency in handling PCIe device bandwidth-drop faults in the prior art.

In order to solve the technical problem, the embodiment of the application discloses the following technical scheme:

A method for handling a PCIe device bandwidth-drop problem, the method comprising:

S1: obtaining a PCIe port rate negotiation result;

S2: judging whether the bandwidth of the current PCIe device is normal according to the PCIe port rate negotiation result and PCIe configuration information stored in the server, wherein the PCIe configuration information comprises: PCIe ports, PCIe devices, and the mapping relationship between the PCIe ports and the PCIe devices, each PCIe device having a specific PCIe bandwidth and PCIe rate;

S3: if the bandwidth of the current PCIe device is normal, executing the normal boot process;

S4: if the bandwidth of the current PCIe device is abnormal, restarting the power-on sequence of the server and recording the restart;

S5: returning to steps S2-S4 and counting the number of restarts;

S6: when the number of restarts is greater than or equal to the set number of restarts, judging the PCIe device bandwidth drop to be a hard fault and stopping the restarts;

S7: executing the normal boot process and recording the fault location of the PCIe device.

Optionally, the method further comprises:

when the number of restarts is less than the set number of restarts and the bandwidth of the current PCIe device is normal, clearing the restart count;

and when the number of restarts is less than the set number of restarts and the bandwidth of the current PCIe device is abnormal, returning to step S4.
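The control flow of steps S1-S7, together with the optional clearing of the restart count, can be sketched as a small simulation. This is a minimal sketch under stated assumptions, not the implementation of the application: `run_boot_flow` and its parameters are hypothetical names, each element of `negotiation_ok` stands in for the outcome of S1 and S2 on one power-on, and the restart count is a local variable here, whereas the application stores it in the EEPROM of the server.

```python
from typing import Iterable, List, Tuple


def run_boot_flow(
    negotiation_ok: Iterable[bool], set_restart_count: int = 3
) -> Tuple[str, int, List[str]]:
    """Simulate S1-S7. Each item of negotiation_ok is one power-on's S1+S2 result."""
    restart_count = 0  # hypothetical stand-in for the count kept in EEPROM
    events: List[str] = []
    for bandwidth_normal in negotiation_ok:
        if bandwidth_normal:
            restart_count = 0  # optional step: clear the restart count
            events.append("S3: normal boot")
            return "normal", restart_count, events
        if restart_count >= set_restart_count:
            events.append("S6: hard fault, stop restarting")
            events.append("S7: normal boot, record fault location")
            return "hard_fault", restart_count, events
        restart_count += 1  # S4: record the restart before power-cycling
        events.append(f"S4: restart {restart_count}")
        # S5: the next loop iteration models the next power-on.
    raise ValueError("ran out of simulated power-on results")


if __name__ == "__main__":
    print(run_boot_flow([False, True]))   # transient drop cured by one restart
    print(run_boot_flow([False] * 4))     # persistent drop judged a hard fault
```

With the set number of restarts equal to 3, a drop that persists through three restarts is classified as a hard fault on the fourth power-on, while a drop that disappears after a restart leads to a normal boot with the count cleared.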

Optionally, restarting the power-on sequence of the server and recording the restart comprises:

using the BIOS (Basic Input Output System) to pull down a GPIO (General-Purpose Input/Output) signal of the PCH (Platform Controller Hub, Intel's integrated south bridge), so as to generate a low-level GPIO signal;

sending the low-level GPIO signal to the CPLD (Complex Programmable Logic Device), and recording the number of restarts into the EEPROM (Electrically Erasable Programmable Read-Only Memory) of the server;

and the CPLD terminating the current power-on process according to the low-level GPIO signal and powering on the server again.

Optionally, executing the normal boot process and recording the fault location of the PCIe device comprises:

the BIOS executing the normal boot process until the operating system is entered; and

recording the BDF (Bus/Device/Function, the identifier of each function on the PCIe bus) position of the fault slot of the failed PCIe device into the BMC (Baseboard Management Controller).
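As an illustration of what a BDF position looks like, the sketch below packs and formats a Bus/Device/Function triple. The field widths (8-bit bus, 5-bit device, 3-bit function) are the standard PCI layout; the helper names are hypothetical, and the application does not specify how the BDF is encoded or transferred to the BMC.

```python
def pack_bdf(bus: int, device: int, function: int) -> int:
    """Pack a BDF triple into the conventional 16-bit value (8/5/3 bits)."""
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 0x7):
        raise ValueError("bus, device, or function out of range")
    return (bus << 8) | (device << 3) | function


def format_bdf(bus: int, device: int, function: int) -> str:
    """Format a BDF triple in the common bb:dd.f hexadecimal notation."""
    pack_bdf(bus, device, function)  # reuse the range check
    return f"{bus:02x}:{device:02x}.{function:x}"


if __name__ == "__main__":
    # Hypothetical failed slot at bus 0x3b, device 0, function 0.
    print(format_bdf(0x3B, 0, 0), hex(pack_bdf(0x3B, 0, 0)))
```

A record of this form gives operation and maintenance personnel an unambiguous pointer to the slot whose link failed to train to its configured width or rate.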

Optionally, the set restart number is 3.

A system for handling a PCIe device bandwidth-drop problem, the system comprising:

a negotiation result obtaining module, configured to obtain a PCIe port rate negotiation result;

a judging module, configured to judge whether the bandwidth of the current PCIe device is normal according to the PCIe port rate negotiation result and PCIe configuration information stored in the server, wherein the PCIe configuration information comprises: PCIe ports, PCIe devices, and the mapping relationship between the PCIe ports and the PCIe devices, each PCIe device having a specific PCIe bandwidth and PCIe rate;

a POST module, configured to execute the normal boot process when the bandwidth of the current PCIe device is normal;

a restarting module, configured to restart the power-on sequence of the server when the bandwidth of the current PCIe device is abnormal;

a restart count storage module, configured to record and count the restarts;

a stopping module, configured to judge the PCIe device bandwidth drop to be a hard fault and stop restarting when the number of restarts is greater than or equal to the set number of restarts;

the POST module being further configured to execute the normal boot process when the PCIe device bandwidth drop has been judged to be a hard fault and restarting has stopped;

and a fault locating module, configured to record the fault location of the PCIe device.

Optionally, the system further includes a reset module, configured to clear the restart times when the restart times are smaller than the set restart times and a bandwidth of the current PCIe device is normal.

Optionally, the set restart number is 3.

A system for handling a PCIe device bandwidth-drop problem, the system comprising a BIOS, a PCH, an EEPROM, a BMC, and a CPLD, wherein the EEPROM is arranged in the server, the PCH is communicatively connected to the BMC, and the PCH is connected to the CPLD through a GPIO signal;

the BIOS is configured to obtain a PCIe port rate negotiation result and to judge whether the bandwidth of the current PCIe device is normal according to the PCIe port rate negotiation result and PCIe configuration information stored in the server, wherein the PCIe configuration information comprises: PCIe ports, PCIe device bandwidths, PCIe device rates, and the mapping relationships between the PCIe ports and the PCIe device bandwidths and between the PCIe ports and the PCIe device rates, each PCIe port being matched with one PCIe device bandwidth and one PCIe device rate;

the BIOS is further configured to execute the normal boot process when the bandwidth of the current PCIe device is normal, and to trigger the CPLD through the PCH when the bandwidth of the current PCIe device is abnormal;

the CPLD is configured to restart the power-on sequence of the server when the bandwidth of the current PCIe device is abnormal;

the EEPROM is configured to record and count the restarts;

the BIOS is further configured to judge the PCIe device bandwidth drop to be a hard fault and stop restarting when the number of restarts is greater than or equal to the set number of restarts;

the BMC is configured to record the fault location of the PCIe device.

The technical scheme provided by the embodiment of the application can have the following beneficial effects:

The method first obtains a PCIe port rate negotiation result and then judges whether the bandwidth of the current PCIe device is normal according to the negotiation result and the PCIe configuration information. If the bandwidth is abnormal, the power-on sequence is restarted and the restart is recorded; the judgment is then repeated and the restarts are counted. When the number of restarts is greater than or equal to the set number of restarts, the PCIe device bandwidth drop is judged to be a hard fault, restarting stops, the normal boot process continues, and the fault location of the PCIe device is recorded. In this embodiment, restarting the power-on sequence can resolve PCIe device bandwidth drops caused by non-hard faults, because a stable link can be re-established by powering on again; this avoids high-frequency hardware troubleshooting and improves fault handling efficiency. In addition, this embodiment uses a set number of restarts, i.e. a target value for the restart count: when the restart count is not less than the set number, the normal boot process continues. The set number of restarts effectively distinguishes PCIe device bandwidth drops caused by hard faults from those caused by non-hard faults, so the quick restart-based remedy is fully exploited while a large number of repeated restarts is avoided, which improves fault handling efficiency.

The present application further provides a system for handling the bandwidth-drop problem of a PCIe device, the system mainly including: a negotiation result obtaining module, a judging module, a POST module, a restarting module, a restart count storage module, a stopping module, and a fault locating module. With the restarting module, PCIe device bandwidth drops caused by some non-hardware faults can be resolved by restarting, which improves the efficiency of handling the bandwidth-drop problem. With the judging module and the restart count storage module, the set number of restarts is used to limit the number of restarts, so that repeated ineffective restarts are avoided and bandwidth drops caused by hardware faults can be distinguished relatively accurately from those caused by non-hardware faults; different handling can then be adopted for different causes, which improves fault handling efficiency.

The application also provides another system for handling the bandwidth-drop problem of a PCIe device, which comprises a BIOS, a PCH, an EEPROM, a BMC, and a CPLD, wherein the EEPROM is arranged in the server, the PCH is communicatively connected to the BMC, and the PCH is connected to the CPLD through a GPIO signal.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a flowchart illustrating a method for handling a bandwidth drop problem of a PCIe device according to an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a system for handling a bandwidth drop problem of a PCIe device according to an embodiment of the present application;

Fig. 3 is a schematic structural diagram of another system for handling a bandwidth drop problem of a PCIe device according to an embodiment of the present application.

Detailed Description

In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

For a better understanding of the present application, embodiments of the present application are explained in detail below with reference to the accompanying drawings.

Example one

Referring to fig. 1, fig. 1 is a flowchart illustrating a method for handling a bandwidth drop problem of a PCIe device according to an embodiment of the present application. As shown in fig. 1, the method for processing the bandwidth drop problem of the PCIe device in this embodiment mainly includes the following steps:

S1: Obtain a PCIe port rate negotiation result.

In this embodiment, PCIe port rate negotiation refers to PCIe port training (link training). Each time the CPU is powered on, the BIOS controls the PCIe port of the CPU to perform rate negotiation with the PCIe device as specified by the PCIe protocol. After the PCIe port rate negotiation process finishes, the PCIe port rate negotiation result is obtained.
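The application does not state where the BIOS reads the negotiation result from. One standard place is the Link Status register of the PCI Express Capability structure, in which bits 3:0 hold the current link speed and bits 9:4 hold the negotiated link width. The sketch below decodes such a 16-bit value; `decode_link_status` is a hypothetical helper, the sample values are invented, and for Gen1 to Gen3 the speed field value is assumed to equal the generation number.

```python
from typing import Tuple


def decode_link_status(link_status: int) -> Tuple[int, int]:
    """Decode a 16-bit PCIe Link Status value into (speed_code, lane_count).

    Bits 3:0 are the Current Link Speed field (1, 2, 3 for Gen1, Gen2, Gen3).
    Bits 9:4 are the Negotiated Link Width field (number of lanes).
    """
    speed_code = link_status & 0xF
    lane_count = (link_status >> 4) & 0x3F
    return speed_code, lane_count


if __name__ == "__main__":
    # Invented sample values: a Gen3 X16 link and a link that trained to Gen2 X8.
    for raw in (0x0103, 0x0082):
        gen, lanes = decode_link_status(raw)
        print(f"0x{raw:04x} -> Gen{gen} X{lanes}")
```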

The PCIe devices in this embodiment mainly include: RAID cards, SAS cards, network cards, GPU cards, FPGA cards, etc.

S2: Judge whether the bandwidth of the current PCIe device is normal according to the PCIe port rate negotiation result and the PCIe configuration information stored in the server.

When the PCIe port rate negotiation result is consistent with the PCIe configuration information stored in the server, the bandwidth of the current PCIe device is judged to be normal; otherwise, it is judged to be abnormal. In this embodiment, normal bandwidth means that the negotiation result shows both that the rate of the current PCIe device is consistent with the rate in the PCIe configuration information and that the bandwidth of the current PCIe device is consistent with the bandwidth in the PCIe configuration information. Abnormal bandwidth means that the negotiation result shows that the rate of the current PCIe device is inconsistent with the rate in the PCIe configuration information, and/or that the bandwidth of the current PCIe device is inconsistent with the bandwidth in the PCIe configuration information.
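The judgment in S2 amounts to comparing the negotiated rate and lane count with the configured ones. The following minimal sketch returns which kind of mismatch, if any, was found; the `LinkState` and `Drop` types and the `judge` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Flag


class Drop(Flag):
    """Kinds of mismatch between negotiated and configured link state."""
    NONE = 0
    LANE = 1   # negotiated lane count differs from the configured one
    RATE = 2   # negotiated rate differs from the configured one


@dataclass(frozen=True)
class LinkState:
    gen: int    # PCIe rate as a generation number, e.g. 3 for Gen3
    lanes: int  # PCIe bandwidth as a lane count, e.g. 16 for X16


def judge(expected: LinkState, negotiated: LinkState) -> Drop:
    """Return Drop.NONE when bandwidth is normal, else the mismatches found."""
    result = Drop.NONE
    if negotiated.lanes != expected.lanes:
        result |= Drop.LANE
    if negotiated.gen != expected.gen:
        result |= Drop.RATE
    return result


if __name__ == "__main__":
    configured = LinkState(gen=3, lanes=16)
    print(judge(configured, LinkState(3, 16)))  # normal
    print(judge(configured, LinkState(3, 8)))   # lane mismatch only
    print(judge(configured, LinkState(1, 8)))   # both rate and lane mismatch
```

Because `Drop.NONE` is falsy, a caller can treat any truthy result as abnormal bandwidth and proceed to step S4.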

The server platform is provided with an EEPROM for storing FRU (Field Replaceable Unit) information, which includes: manufacturer, model, SN (serial number), PCIe configuration name, etc. The PCIe configuration of the server is uniquely specified by the configuration name, i.e. the configuration name determines which PCIe device, with which bandwidth and rate, is connected to a given PCIe port. The PCIe configuration information in this embodiment includes: PCIe ports, PCIe devices, and the mapping relationship between the PCIe ports and the PCIe devices, each PCIe device having a specific PCIe bandwidth and PCIe rate. Here, the mapping relationship specifies the bandwidth and rate of the PCIe device connected to each PCIe port. PCIe bandwidth here means the PCIe lane count, i.e. PCIe X16, PCIe X8, PCIe X4, etc.
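The lookup from configuration name to expected link parameters can be pictured as a two-level table. The sketch below is an assumption-laden illustration: the configuration names, port names, and FRU field key are all invented, and the application does not describe the actual storage format.

```python
from typing import Dict, Tuple

# configuration name -> {PCIe port -> (generation, lane count)}
# All names and values below are invented for illustration.
PCIE_CONFIGS: Dict[str, Dict[str, Tuple[int, int]]] = {
    "CONFIG_A": {"CPU0_PORT1": (3, 16), "CPU0_PORT2": (3, 8)},
    "CONFIG_B": {"CPU0_PORT1": (3, 8), "CPU0_PORT2": (3, 8)},
}


def expected_link(fru: Dict[str, str], port: str) -> Tuple[int, int]:
    """Return the (generation, lane count) configured for a port.

    fru models the FRU information read from EEPROM; the key
    'pcie_config_name' is a hypothetical field name.
    """
    config = PCIE_CONFIGS[fru["pcie_config_name"]]
    return config[port]


if __name__ == "__main__":
    fru_info = {"model": "EXAMPLE-MODEL", "pcie_config_name": "CONFIG_A"}
    print(expected_link(fru_info, "CPU0_PORT1"))
```

Keying the table by configuration name means the same firmware image can serve several server configurations, with the EEPROM content selecting the expected values for each port.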

With continued reference to Fig. 1, if the bandwidth of the current PCIe device is normal, step S3 is executed: the normal boot process is performed. The normal boot process in this embodiment generally refers to the POST (power-on self-test) process.

If the bandwidth of the current PCIe device is abnormal, step S4 is executed: the power-on sequence of the server is restarted and the restart is recorded.

Specifically, step S4 includes:

S41: The BIOS pulls down the GPIO signal of the PCH to generate a low-level GPIO signal.

S42: The low-level GPIO signal is sent to the CPLD, and at the same time the number of restarts is recorded into the EEPROM of the server.

S43: The CPLD terminates the current power-on process according to the low-level GPIO signal and powers on the server again.

As steps S41-S43 show, when the bandwidth of the PCIe device is abnormal, i.e. a bandwidth drop has occurred, the BIOS pulls down the GPIO signal of the PCH and the signal is sent to the CPLD, while the number of restarts is recorded in the EEPROM. When the CPLD detects the low-level GPIO signal, it terminates the current power-on process and powers the server on again.
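The interaction in steps S41-S43 can be modelled with three small objects. This is a software simulation under stated assumptions, not firmware: the class and function names are hypothetical, and the sketch updates the EEPROM before driving the GPIO low so that the count is stored before power is cut, which is an ordering choice the application does not prescribe (it says the two happen at the same time).

```python
class Eeprom:
    """Stand-in for the server EEPROM that keeps the restart count."""

    def __init__(self) -> None:
        self.restart_count = 0


class Cpld:
    """Stand-in for the CPLD that controls the power-on sequence."""

    def __init__(self) -> None:
        self.power_cycles = 0

    def on_gpio(self, level: int) -> None:
        if level == 0:
            # S43: terminate the current power-on process and power on again.
            self.power_cycles += 1


class PchGpio:
    """Stand-in for the PCH GPIO pin wired to the CPLD; idles high."""

    def __init__(self, cpld: Cpld) -> None:
        self.level = 1
        self._cpld = cpld

    def drive(self, level: int) -> None:
        self.level = level
        self._cpld.on_gpio(level)


def bios_request_restart(gpio: PchGpio, eeprom: Eeprom) -> None:
    """BIOS side of S41 and S42 in this simulation."""
    eeprom.restart_count += 1  # S42: record the restart (done first here)
    gpio.drive(0)              # S41 + S42: pull the GPIO low toward the CPLD
    gpio.level = 1             # model the pin returning high after re-power


if __name__ == "__main__":
    cpld, eeprom = Cpld(), Eeprom()
    gpio = PchGpio(cpld)
    bios_request_restart(gpio, eeprom)
    print(eeprom.restart_count, cpld.power_cycles)
```

Keeping the count in non-volatile storage is what lets the BIOS see, on the next power-on, how many restarts have already been attempted.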

S5: Return to steps S2-S4 and count the number of restarts.

Steps S2-S4 form a loop. If the bandwidth of the current PCIe device is normal, the program that follows power-on continues to execute and the loop is exited. If the bandwidth of the current PCIe device is abnormal, the server is restarted, the restart is recorded, the loop is entered, steps S2-S4 are executed again, and the restarts are counted according to step S5.

When the restart count is greater than or equal to the set number of restarts, step S6 is executed: the PCIe device bandwidth drop is judged to be a hard fault and restarting stops.

A hard fault in this embodiment refers to a class of faults caused by aging, failure, or damage of hardware devices. It mainly includes mechanical faults, hardware faults, software faults, and the like.

Specifically, step S7 includes:

S71: The BIOS executes the normal boot process until the operating system is entered; meanwhile,

S72: The BDF position of the fault slot of the failed PCIe device is recorded into the BMC.

Operation and maintenance personnel can subsequently troubleshoot according to the BDF position of the fault slot.

As steps S6 and S7 show, when the number of restarts reaches the set number of restarts, the current PCIe device bandwidth drop is judged to be a problem that restarting cannot solve, i.e. a hard fault. Restarting then stops, the normal boot process continues, and the fault location of the PCIe device is recorded. Step S6 prevents steps S2-S4 from being executed indefinitely, so that subsequent fault handling can proceed according to the determined cause, which helps improve fault handling efficiency. Setting a number of restarts makes it possible to classify the cause of the bandwidth drop as a hard fault or a non-hard fault, which avoids repeated ineffective restarts and improves the efficiency of fault judgment.

Further, the set number of restarts in this embodiment is 3: when the number of restarts is greater than or equal to 3, the PCIe device bandwidth drop is judged to be a hard fault and restarting stops. In other words, if the PCIe device bandwidth is still abnormal after 3 restarts, the fault is generally judged to be a hard fault and restarting stops.

Accordingly, when the number of restarts is less than the set number of restarts and the bandwidth of the current PCIe device is normal, step S8 is executed: the restart count is cleared. This saves storage space and prevents counting errors when a later PCIe device bandwidth drop is handled, which improves the accuracy of fault handling.

When the number of restarts is less than the set number of restarts and the bandwidth of the current PCIe device is abnormal, the method returns to step S4, restarts the power-on sequence of the server, and records the restart.

Example two

Referring to Fig. 2 on the basis of the embodiment shown in Fig. 1, Fig. 2 is a schematic structural diagram of a system for handling a bandwidth drop problem of a PCIe device according to an embodiment of the present application. As can be seen from Fig. 2, the system in this embodiment mainly includes: a negotiation result obtaining module, a judging module, a POST module, a restarting module, a restart count storage module, a stopping module, and a fault locating module.

The negotiation result obtaining module is configured to obtain a PCIe port rate negotiation result. The judging module is configured to judge whether the bandwidth of the current PCIe device is normal according to the PCIe port rate negotiation result and PCIe configuration information stored in the server, wherein the PCIe configuration information includes: PCIe ports, PCIe devices, and the mapping relationship between the PCIe ports and the PCIe devices, each PCIe device having a specific PCIe bandwidth and PCIe rate. The POST module is configured to execute the normal boot process when the bandwidth of the current PCIe device is normal. The restarting module is configured to restart the power-on sequence of the server when the bandwidth of the current PCIe device is abnormal. The restart count storage module is configured to record and count the restarts. The stopping module is configured to judge the PCIe device bandwidth drop to be a hard fault and stop restarting when the number of restarts is greater than or equal to the set number of restarts. The POST module is further configured to execute the normal boot process when the PCIe device bandwidth drop has been judged to be a hard fault and restarting has stopped. The fault locating module is configured to record the fault location of the PCIe device. The set number of restarts in this embodiment is 3.

Further, the system includes a reset module, configured to clear the restart count when the number of restarts is less than the set number of restarts and the bandwidth of the current PCIe device is normal. The reset module helps save storage space in the server. It also prevents the restart count from being confused when the system handles a later PCIe device bandwidth drop, which helps improve the accuracy of fault handling.

The working principle and working method of the system for handling the bandwidth-dropping problem of the PCIe device in this embodiment have been described in detail in the embodiment shown in fig. 1, and are not described herein again.

Example three

Referring to Fig. 3 on the basis of the embodiments shown in Fig. 1 and Fig. 2, Fig. 3 is a schematic structural diagram of another system for handling a bandwidth drop problem of a PCIe device according to an embodiment of the present application. As can be seen from Fig. 3, the system mainly includes a BIOS, a PCH, an EEPROM, a BMC, and a CPLD, wherein the EEPROM is arranged in the server, the PCH is communicatively connected to the BMC, and the PCH is connected to the CPLD through a GPIO signal.

The BIOS is used for obtaining a PCIe port rate negotiation result, and judging whether the bandwidth of the current PCIe equipment is normal or not according to the PCIe port rate negotiation result and PCIe configuration information stored in the server, wherein the PCIe configuration information comprises a PCIe port, a PCIe equipment bandwidth, a PCIe equipment rate, a mapping relation between the PCIe port and the PCIe equipment bandwidth and between the PCIe port and the PCIe equipment rate, and any PCIe port is matched with one PCIe equipment bandwidth and one PCIe equipment rate.

Further, the BIOS in this embodiment mainly includes: a negotiation result obtaining module, a judging module, a POST module, and a stopping module. The negotiation result obtaining module is configured to obtain a PCIe port rate negotiation result. The judging module is configured to judge whether the bandwidth of the current PCIe device is normal according to the PCIe port rate negotiation result and PCIe configuration information stored in the server, wherein the PCIe configuration information includes: PCIe ports, PCIe devices, and the mapping relationship between the PCIe ports and the PCIe devices, each PCIe device having a specific PCIe bandwidth and PCIe rate. The POST module is configured to execute the normal boot process when the bandwidth of the current PCIe device is normal. The stopping module is configured to judge the PCIe device bandwidth drop to be a hard fault and stop restarting when the number of restarts is greater than or equal to the set number of restarts.

The parts not described in detail in this embodiment can be referred to the embodiments shown in fig. 1-2, and the three embodiments can be referred to each other, and are not described again here.

The above description is merely exemplary of the present application and is presented to enable those skilled in the art to understand and practice the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method of handling a PCIe device bandwidth drop problem, the method comprising:

S1: obtaining a PCIe port rate negotiation result;

S5: returning to steps S2-S4 and counting the number of restarts;

2. The method of handling a PCIe device dropped bandwidth problem according to claim 1, said method further comprising:

3. The method of claim 1, wherein restarting the server power-on sequence and recording the number of restarts comprises:

pulling down the GPIO signal of the PCH by using the BIOS to generate a low-level GPIO signal;

sending the low-level GPIO signal to a CPLD, and simultaneously recording the number of restarts into an EEPROM of the server;

4. The method of claim 1, wherein the performing a normal boot process and recording the location of the PCIe device failure comprises:

and recording the position of the fault slot BDF of the PCIe equipment with the fault into the BMC.

5. The method for handling the problem of dropped bandwidth for PCIe devices according to any of claims 1-4, wherein the set number of reboots is 3.

6. A system for handling a PCIe device drop bandwidth problem, the system comprising:

7. The system for handling the bandwidth drop problem of the PCIe device according to claim 6, further comprising a reset module, configured to clear the restart times when the restart times are less than the set restart times and the bandwidth of the current PCIe device is normal.

8. The system for handling the problem of dropped bandwidth for PCIe devices according to claim 6 or 7, wherein the set number of reboots is 3.

9. A system for handling a PCIe device bandwidth-drop problem, characterized by comprising a BIOS, a PCH, an EEPROM, a BMC, and a CPLD, wherein the EEPROM is arranged in a server, the PCH is communicatively connected to the BMC, and the PCH is connected to the CPLD through a GPIO signal;

the EEPROM is used for recording and counting the restart times;

the BMC is used for recording fault positions of PCIe devices.