WO2023115999A1 - Device status monitoring method, apparatus, device, and computer-readable storage medium - Google Patents

Info

Publication number
WO2023115999A1
WO2023115999A1 PCT/CN2022/113519 CN2022113519W WO2023115999A1 WO 2023115999 A1 WO2023115999 A1 WO 2023115999A1 CN 2022113519 W CN2022113519 W CN 2022113519W WO 2023115999 A1 WO2023115999 A1 WO 2023115999A1
Authority
WO
WIPO (PCT)
Prior art keywords
state parameters
real-time
historical state
parameters
Prior art date
Application number
PCT/CN2022/113519
Other languages
English (en)
French (fr)
Inventor
孙永博
林楷智
李道童
芦飞
Original Assignee
浪潮(北京)电子信息产业有限公司
Application filed by 浪潮(北京)电子信息产业有限公司
Publication of WO2023115999A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring

Definitions

  • The present application relates to the technical field of server monitoring, and in particular to a device status monitoring method, apparatus, device, and computer-readable storage medium.
  • The inventors recognize that existing server device status monitoring usually monitors the device state in real time and compares it with a pre-stored standard state to determine whether the device is abnormal at the current moment.
  • However, device behavior during operation, especially across the time span of machine startups and restarts, is not effectively monitored, so some abnormal device states that real-time monitoring cannot observe go undetected.
  • BIOS Basic Input Output System
  • BMC Baseboard Management Controller
  • the present application provides a device status monitoring method, including:
  • The present application also provides a device status monitoring apparatus, including:
  • a storage unit configured to save the historical state parameters previously collected from the monitored device;
  • an acquisition unit configured to acquire real-time state parameters of the monitored device;
  • a comparison unit configured to compare the real-time state parameters with the historical state parameters; and
  • an exception processing unit configured to execute the processing mechanism corresponding to an abnormal real-time state parameter when a real-time state parameter deviates from the historical state parameters by more than a preset range.
  • The present application also provides a device status monitoring device including a memory and one or more processors, where the memory stores computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the steps of the above device status monitoring method.
  • The present application also provides one or more non-volatile computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the above device status monitoring method.
  • FIG. 1 is a flowchart of a device status monitoring method provided by the present application according to one or more embodiments;
  • FIG. 2 is a schematic structural diagram of a device status monitoring apparatus provided by the present application according to one or more embodiments;
  • FIG. 3 is a schematic structural diagram of a device status monitoring device provided by the present application according to one or more embodiments.
  • The core of this application is to provide a device status monitoring method, apparatus, device, and computer-readable storage medium that implement non-real-time monitoring of device state, closing the monitoring gap left by prior-art real-time monitoring that compares the device's operating state with a standard state. This improves the ability to monitor device operating status, the maintainability of abnormal devices, and the machine's error-reporting function, and saves maintenance manpower.
  • the device status monitoring method provided in the embodiment of the present application is described by taking computer devices as an example.
  • the method includes:
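  • S101: Save the historical state parameters previously collected from the monitored device.
  • S102: Acquire the real-time state parameters of the monitored device.
  • S103: Compare the real-time state parameters with the historical state parameters.
  • S104: When an abnormal real-time state parameter exists whose deviation from the historical state parameters exceeds a preset range, execute the processing mechanism corresponding to that abnormal real-time state parameter.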
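  • The flow of steps S101 to S104 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described in the application: the names `monitor_once`, `read_state`, `on_anomaly`, and `tolerance` are hypothetical, state parameters are assumed to be numeric, and the most recent snapshot is used as the comparison baseline.

```python
from typing import Callable, Dict, List

State = Dict[str, float]  # parameter name -> numeric value (assumed representation)


def monitor_once(
    history: List[State],
    read_state: Callable[[], State],
    on_anomaly: Callable[[str, float, float], None],
    tolerance: float = 0.0,
) -> List[str]:
    """Run S102-S104 once, then append the new snapshot to history (S101)."""
    current = read_state()                      # S102: acquire real-time parameters
    abnormal: List[str] = []
    if history:                                 # S103: compare with saved history
        baseline = history[-1]
        for name, value in current.items():
            if name in baseline and abs(value - baseline[name]) > tolerance:
                abnormal.append(name)
                on_anomaly(name, baseline[name], value)  # S104: handle the deviation
    history.append(current)                     # S101: save for later comparisons
    return abnormal
```

  • Because the history list outlives a single call, it could equally be loaded from persistent storage at boot, which is what lets the comparison span a restart.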
  • The monitored device targeted by the embodiments of the present application may include, but is not limited to, a PCIe device, a central processing unit, a memory device, a hard disk drive, and the like.
  • The execution subject of the embodiments may be the basic input output system (Basic Input Output System, BIOS), the baseboard management controller (Baseboard Management Controller, BMC), or the device on which the operating system (OS) runs, or several execution subjects may complete the steps together.
  • Automatic monitoring of the monitored device is achieved by developing a monitoring script, or by writing a monitoring program and compiling it together with the original program of the execution subject.
  • Each step of the device state monitoring method provided in the embodiment of the present application can also be executed continuously or separately at different stages such as BIOS startup, UEFI shell, and after entering the operating system.
  • The real-time state parameters and the historical state parameters of the monitored device may be collected by the same execution subject through the same path, or the historical state parameters may have been collected earlier by a different execution subject through a different path.
  • The BIOS and BMC already have real-time monitoring mechanisms for monitored devices such as PCIe devices, central processing units, memory devices, and hard disk drives.
  • The state parameters collected by these real-time monitoring mechanisms can be reused, or monitoring functions can be developed separately.
  • The real-time state parameters of the monitored device may be acquired at a preset time point, periodically, when triggered by a preset event (such as powering the device on or off), at random times, and so on. Different acquisition and storage strategies can also be adopted for different types of state parameters of different monitored devices.
  • the real-time status parameters can be obtained.
  • the same or different paths can be used to obtain the status parameters of the monitored device.
  • The BIOS can access the relevant registers of each device through the protocols provided by the UEFI specification to obtain the information and error status of interest; the operating system (OS) provides corresponding functions for this kind of basic hardware register access as well.
  • The device status monitoring method provided by the embodiments of the present application is suited to long-term non-real-time monitoring, so that fault information that is hard to locate with real-time monitoring can be obtained effectively. State parameters that could already be monitored in real time (such as bandwidth) can also be placed under this long-term non-real-time monitoring, so that abnormal changes in them over long-term operation become known.
  • The real-time state parameters may include, but are not limited to: device presence status, vendor ID (Vendor ID), physical slot ID (Physical slot number), maximum transmission rate (Max Link Speed), maximum link width (Max Link Width), real-time transmission rate (Current Link Speed), real-time link width (Current Link Width), logical identification (Bus/Device/Function number), and topology data of the PCIe link on which the device sits (the Bus/Device/Function numbers of the upstream bridges at all levels).
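  • One way to hold these parameters as a single snapshot is sketched below. The class and field names are illustrative assumptions rather than anything defined by the application; the helper simply reports which fields differ between two snapshots.

```python
from dataclasses import asdict, dataclass
from typing import Tuple

Bdf = Tuple[int, int, int]  # (bus, device, function)


@dataclass(frozen=True)
class PcieState:
    """One snapshot of the PCIe parameters listed above (field names are illustrative)."""
    present: bool
    vendor_id: int
    physical_slot: int
    max_link_speed: int      # encoded speed, e.g. 3 for 8.0 GT/s
    max_link_width: int      # number of lanes
    current_link_speed: int
    current_link_width: int
    bdf: Bdf
    upstream_bridges: Tuple[Bdf, ...] = ()


def changed_fields(old: PcieState, new: PcieState) -> dict:
    """Return {field: (old_value, new_value)} for every field that differs."""
    a, b = asdict(old), asdict(new)
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}
```

  • A frozen dataclass is used here so that a stored historical snapshot cannot be modified accidentally after it has been saved.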
  • the acquired real-time state parameters can be stored as historical state parameters entirely, or only part of the real-time state parameters can be stored or real-time state parameters after calculation and conversion can be stored.
  • the historical state parameters can be saved locally on the device used to collect the historical state parameters, or can be sent to another device.
  • In step S101, the historical state parameters previously collected from the monitored device may be saved in a preset memory chip or in a pre-allocated memory area, or sent to the baseboard management controller for storage through an Intelligent Platform Management Interface command (IPMI Command), Redfish, or shared memory.
  • The information may also be passed to the baseboard management controller through a shared memory chip.
  • a device to be monitored can correspond to one storage area or to multiple storage areas; when storing between multiple storage areas, it can implement a load balancing strategy or store according to storage priority.
  • the storage structure of historical state parameters can be reasonably designed according to needs, for example, it can be designed as a linked list structure, and labels can be set for each stored information to facilitate search and access.
  • A single storage strategy can be adopted, or several can be combined. With a first-in-first-out strategy, once the storage capacity, or a set fraction of it, is exceeded, the earliest stored historical state parameters are overwritten by the newest ones.
  • With the strategy of storing the historical state parameters at preset time points, the parameters corresponding to a fixed time can be stored, or they can be stored periodically. With the strategy of storing historical state parameters when preset events occur, the trigger events can be predefined, for example storing the state parameters when the device is powered on or before it is powered off. With the strategy of storing the mean of the historical state parameters, the mean can be computed from the values at each historical time point, or, for attribute-type parameters, the attribute value that occurs most often can be taken. This can be combined with first-in-first-out storage: the historical state parameters that exceed the storage capacity are replaced by their computed mean.
  • With the strategy of storing preset historical state parameters, the preset parameters are extracted from the real-time monitoring parameters of each monitored device for storage. This can be combined with mean storage: historical state parameters of high importance are stored in full, and only the mean is stored for those of lower importance.
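  • The combination of first-in-first-out storage with a stored mean can be sketched as follows. This is an assumption-laden illustration: the class `FifoHistory` and its attributes are hypothetical, and it handles a single numeric parameter only.

```python
from collections import deque
from typing import Deque, Optional


class FifoHistory:
    """Keep at most `capacity` samples; fold evicted samples into a running mean."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.samples: Deque[float] = deque()
        self.evicted_mean: Optional[float] = None
        self.evicted_count = 0

    def store(self, value: float) -> None:
        if len(self.samples) == self.capacity:
            oldest = self.samples.popleft()          # first in, first out
            total = (self.evicted_mean or 0.0) * self.evicted_count + oldest
            self.evicted_count += 1
            self.evicted_mean = total / self.evicted_count
        self.samples.append(value)
```

  • Keeping only a mean and a count for evicted samples bounds the storage needed, at the cost of losing the individual older values.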
  • the real-time state parameters are compared with the historical state parameters, that is, the current state of the monitored device is compared with its historical state, and if there is any inconsistency with the historical state, the corresponding processing mechanism is executed.
  • When no historical state parameters have been saved yet, step S103 may be skipped, or the first comparison may be made against a standard parameter list and later comparisons against the actual historical state parameters.
  • Step S103, comparing the real-time state parameters with the historical state parameters, includes but is not limited to: comparing with the historical state parameters that have the earliest storage time, with those that have the latest storage time, with all of the historical state parameters, or with preset historical state parameters selected from them.
  • Different comparison strategies may also be adopted for different types of state parameters of different monitoring target devices. For a certain status parameter of the monitoring object device, a fixed comparison strategy can be adopted, or the comparison strategy can be flexibly switched according to the computing resources of the current execution subject.
  • When computing resources are sufficient, all types of real-time state parameters can be compared with all of the historical state parameters.
  • Otherwise, all or some types of real-time state parameters can be compared with the historical state parameters that have the earliest storage time, with those that have the latest storage time, with preset historical state parameters extracted at fixed intervals or at random, or with the mean of the historical state parameters.
  • Likewise, when computing resources are sufficient, a comparison can be made every time the real-time state parameters are acquired; when they are insufficient, or to save computing resources, one comparison can be made per several acquisitions, or the real-time state parameters to be compared can be stored first and compared once computing resources are sufficient.
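  • The selectable comparison strategies can be sketched as a lookup table of functions. The strategy names, the `deviates` helper, and the use of an absolute difference against `preset_range` are assumptions made for illustration; the application does not prescribe a particular deviation measure.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

# Each strategy maps the saved history of one numeric parameter to reference values.
STRATEGIES: Dict[str, Callable[[Sequence[float]], List[float]]] = {
    "earliest": lambda h: [h[0]],
    "latest": lambda h: [h[-1]],
    "all": lambda h: list(h),
    "mean": lambda h: [mean(h)],
}


def deviates(current: float, history: Sequence[float], strategy: str, preset_range: float) -> bool:
    """True if `current` differs from any selected reference by more than `preset_range`."""
    if not history:
        return False  # nothing to compare against yet
    references = STRATEGIES[strategy](history)
    return any(abs(current - ref) > preset_range for ref in references)
```

  • Switching strategy is then a matter of passing a different key, which fits the idea of choosing a cheaper comparison when computing resources are tight.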
  • Step S104 is a step performed only when there is an abnormal real-time state parameter whose deviation from the historical state parameter exceeds a preset range.
  • the corresponding processing mechanism specifically corresponds to the type of the status parameter of the monitored device.
  • the different types of status parameters of each monitoring object device can adopt the same corresponding processing mechanism, or can adopt different corresponding processing mechanisms, and can accept the corresponding processing mechanism set by the user.
  • the corresponding processing mechanism may include but is not limited to sending error information, recording error logs, pushing maintenance suggestions, implementing error correction strategies, etc.
  • For example, a preset interface can be called to send error information to the user, and an Intelligent Platform Management Interface command can be used to notify the baseboard management controller to record an error log for the abnormal real-time state parameter. The user is informed of the abnormal monitored device, its location, and the error type corresponding to the abnormal real-time state parameter; for example, a PCIe device that was detected last time cannot be detected this time, so the device is suspected to have been lost.
  • The corresponding maintenance suggestion, such as replacing the device or checking for configuration changes, is looked up in a pre-generated fault handling list and pushed to the user.
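  • Selecting a processing mechanism by parameter type can be sketched as a dispatch table. Everything named here (`MECHANISMS`, `log_error`, `suggest_maintenance`, `handle_anomaly`, and the message formats) is hypothetical; a real system would send IPMI commands or call its own notification interface instead of returning strings.

```python
from typing import Callable, Dict, List

Handler = Callable[[str, str], str]  # (device, parameter) -> message


def log_error(device: str, parameter: str) -> str:
    return f"LOG: {device}: {parameter} deviates from history"


def suggest_maintenance(device: str, parameter: str) -> str:
    return f"SUGGEST: check or replace {device}; verify configuration changes"


# Hypothetical mapping from parameter type to the mechanisms to run.
MECHANISMS: Dict[str, List[Handler]] = {
    "presence": [log_error, suggest_maintenance],
    "current_link_width": [log_error],
}


def handle_anomaly(device: str, parameter: str) -> List[str]:
    """Run every mechanism registered for the parameter type; default to logging."""
    handlers = MECHANISMS.get(parameter, [log_error])
    return [handler(device, parameter) for handler in handlers]
```

  • A user-defined mechanism could be supported by letting the user add or replace entries in the table.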
  • The device status monitoring method saves the historical state parameters previously collected from the monitored device and compares the acquired real-time state parameters of the monitored device with them. When an abnormal real-time state parameter is found whose deviation from the historical state parameters exceeds the preset range, the corresponding processing mechanism is executed. This closes the gap in monitoring device behavior across machine startups and restarts, implements non-real-time monitoring of device state, and improves the ability to monitor device operating status, the maintainability of device anomalies, and the machine's error-reporting function, while saving maintenance manpower.
  • the execution subject of the embodiments of the present application may be a basic input output system, a baseboard management controller, or an operating system. Then, on the basis of the above-mentioned embodiments, the device status monitoring method provided by the embodiments of the present application is described in the case of being applied to a device where a basic input/output system (hereinafter referred to as BIOS) is located.
  • step S102 the real-time status parameters of the monitoring object equipment are obtained, which may specifically be:
  • a system management interrupt is triggered to obtain real-time status parameters.
  • When the BIOS enumerates and processes PCIe devices, it assigns each PCIe device a set of Bus, Device, and Function numbers (the bus, device, and function values of the PCIe device, referred to as the BDF value).
  • This combination of Bus, Device, and Function numbers is unique, so the three values together uniquely locate a PCIe logical device; in addition, the properties, status, and other information of the device can be obtained by reading the relevant registers in the device's configuration space.
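  • As background on how a BDF value locates a device's registers: under the PCI Express Enhanced Configuration Access Mechanism (ECAM), each function's 4 KB configuration space sits at a byte offset built from the bus, device, function, and register offset. The helper name `ecam_offset` is illustrative, and the application itself does not state which access mechanism is used.

```python
def ecam_offset(bus: int, device: int, function: int, register: int = 0) -> int:
    """Byte offset of a configuration register from the ECAM base address."""
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("BDF out of range")
    if not 0 <= register < 4096:
        raise ValueError("register offset must be within the 4 KB space")
    return (bus << 20) | (device << 15) | (function << 12) | register
```

  • For instance, `ecam_offset(1, 2, 3, 0x10)` addresses register 0x10 of function 3 on device 2 of bus 1.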
  • The BIOS sends the relevant information of the device, such as its PCIe Bus, Device, and Function numbers, the attribute types that are inconsistent (speed, link width, etc.), and the physical location of the device, to the baseboard management controller (BMC) through means including but not limited to IPMI Command, and an error log is recorded.
  • The log includes the erroneous devices and their status information.
  • The BIOS implements an asset information function for PCIe devices. Specifically, after the BIOS initializes the PCIe devices and before it boots the operating system (OS), it collects a series of information about all PCIe devices, including but not limited to the BDF values of the device and of the PCIe bridge it sits under, presence status, physical slot ID (Physical slot number), and location information.
  • The BIOS passes this information to the BMC through means including but not limited to IPMI Command or Redfish; after receiving it, the BMC displays the device properties on its web interface, device by device, for users to view.
  • The physical slot ID (Physical slot number) of each device is set through the register of the bridge above the device, so that every device has a different, unique physical slot ID. Each unique physical slot ID is then matched to a location information string: the physical slot ID uniquely locates a device, and the location string reflects the device's physical position in the machine. The same design and settings can be used in the embodiments of the present application to achieve the same result, so the data produced by the asset information function can be reused.
  • After the BIOS initializes the PCIe devices and before it boots the operating system (OS), it collects the attribute information of all PCIe devices one or more times. The amount and kind of information collected can be increased or decreased as needed; the device information collected by the asset information function can be reused, or the collection can be implemented separately. For example, only the BDF value of the device, and/or its vendor ID (Vendor ID), device ID (Device ID), and physical slot ID (Physical slot number), may be collected.
  • the BIOS obtains the real-time status parameters of the PCIe device by reading the registers of the PCIe device.
  • The BIOS determines whether a PCIe device is present by reading its Vendor ID and Device ID registers; it reads the Slot Capabilities register to obtain the physical slot ID (Physical slot number), which corresponds to the slot holding the physical device; it reads the Link Capabilities register to obtain the maximum transmission rate (Max Link Speed) and maximum link width (Max Link Width); it reads the Link Status register to obtain the real-time transmission rate (Current Link Speed) and real-time link width (Current Link Width); it reads the BDF value of the PCIe device to obtain its logical ID; and it obtains the topology data of the PCIe link on which the device sits by reading the BDF values of the upstream bridges at all levels.
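  • Once the raw register values have been read, decoding them is a matter of bit fields. The sketch below follows the field positions defined by the PCI Express specification (speed in bits 3:0 and width in bits 9:4 of both Link Capabilities and Link Status; Physical Slot Number in bits 31:19 of Slot Capabilities). The function names are illustrative, and how the raw values are obtained is left out.

```python
def device_present(vendor_id: int) -> bool:
    """A Vendor ID read of 0xFFFF means no device responded."""
    return vendor_id != 0xFFFF


def decode_link_register(value: int) -> tuple:
    """Return (speed, width) from Link Capabilities or Link Status.

    Both registers carry the speed code in bits 3:0 and the width in bits 9:4.
    """
    return value & 0xF, (value >> 4) & 0x3F


def physical_slot_number(slot_capabilities: int) -> int:
    """Physical Slot Number occupies bits 31:19 of Slot Capabilities."""
    return (slot_capabilities >> 19) & 0x1FFF
```

  • Comparing the decoded current speed and width against saved values is what would reveal, for example, a link that has retrained at a narrower width since the last boot.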
  • The reading methods supported in the BIOS and UEFI Shell environments include but are not limited to: member functions of EFI_PCI_IO_PROTOCOL, such as the EFI_PCI_IO_PROTOCOL_ACCESS functions (of two types, Memory and IO) and the EFI_PCI_IO_PROTOCOL_GET_LOCATION function; and the IO and Memory access instructions available in assembly language. The same registers can be read under the Windows and Linux operating systems, each of which has its own IO or Memory access functions.
  • the BIOS saves the information of all monitored devices collected each time.
  • The historical state parameters can be stored in a preset memory chip, or sent to the baseboard management controller for storage through an Intelligent Platform Management Interface command (IPMI Command) or Redfish; a suitable storage structure is designed as needed, and corresponding storage strategies are formulated.
  • the BIOS compares the acquired real-time state parameters of the monitored device with the previously saved historical state parameters. For details, reference may be made to the comparison methods described in the foregoing embodiments.
  • The BIOS compares the collected real-time state parameters of the PCIe devices with their previously saved historical state parameters, including but not limited to comparing with the earliest saved data, with the most recently saved data, with all saved data, or with a random or regular selection of the saved data; the number of data collections and comparisons can also be increased as needed. When the comparison finds that the number of PCIe devices has changed, that is, increased or decreased, the number of devices added or removed and their key attributes, including but not limited to Vendor ID, Device ID, physical slot ID (Physical slot number), and location information, are sent to the BMC through Intelligent Platform Management Interface commands so that it records a log of the device change, allowing targeted maintenance.
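  • Detecting added or removed devices between a saved snapshot and the current one can be sketched as a set difference keyed by BDF value. The snapshot layout and the name `diff_devices` are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Bdf = Tuple[int, int, int]
Snapshot = Dict[Bdf, dict]  # BDF -> attributes such as vendor_id, device_id, slot


def diff_devices(saved: Snapshot, current: Snapshot) -> Dict[str, List[Bdf]]:
    """List devices that appeared or disappeared between two snapshots."""
    return {
        "added": sorted(set(current) - set(saved)),
        "removed": sorted(set(saved) - set(current)),
    }
```

  • The attributes stored under each removed BDF are what would be forwarded to the BMC log, so that a device that was present before the restart can be told apart from one that was never installed.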
  • The BIOS can also use the periodic system management interrupt (SMI) function provided by the platform chipset, including but not limited to the periodic SMI trigger provided by Intel chips, and select a suitable supported trigger interval, such as 64 s or 32 s. The data comparison and analysis described above, and the sending of the relevant data to notify the BMC to record logs, are implemented in the SMI handler function, so that after the operating system has started the SMI is still triggered at each configured interval and performs these functions.
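  • The effect of a fixed trigger interval can be illustrated with a small software simulation. This is not SMI code and does not touch hardware: `PeriodicTrigger` and `advance_to` are hypothetical names, and time is passed in explicitly so the behavior is deterministic.

```python
from typing import Callable


class PeriodicTrigger:
    """Call `handler` once for every `interval_s` boundary that has been reached."""

    def __init__(self, interval_s: float, handler: Callable[[float], None]) -> None:
        self.interval_s = interval_s
        self.handler = handler
        self.next_due = interval_s

    def advance_to(self, now_s: float) -> int:
        """Fire the handler for every interval boundary up to `now_s`; return the count."""
        fired = 0
        while now_s >= self.next_due:
            self.handler(self.next_due)
            self.next_due += self.interval_s
            fired += 1
        return fired
```

  • In this sketch the handler would be the place to run the comparison and logging steps, mirroring the role of the SMI handler function.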
  • the behavior of comparing and analyzing data can also be completed by the BMC or the operating system according to actual needs, based on the real-time state parameters and historical state parameters of the monitored equipment collected by the same BIOS.
  • In the existing monitoring mechanism for PCIe devices, each PCIe device has a 4K configuration space, and the information reflected in it can be used to detect other types of device errors. Errors that cannot be monitored in real time can also be detected through the non-real-time scheme described above; once an error is detected, it is reported and logged in any of the possible forms.
  • The device status monitoring apparatus provided in the embodiments of the present application includes:
  • a storage unit 201, configured to save the historical state parameters previously collected from the monitored device;
  • an acquisition unit 202, configured to acquire the real-time state parameters of the monitored device;
  • a comparison unit 203, configured to compare the real-time state parameters with the historical state parameters; and
  • an exception processing unit 204, configured to execute the processing mechanism corresponding to an abnormal real-time state parameter when a real-time state parameter deviates from the historical state parameters by more than a preset range.
  • the device status monitoring device provided by the embodiment of the present application may be a computer device, and the computer device may be a terminal or a server, including:
  • memory 310 for storing computer readable instructions 311;
  • the processor 320 is configured to execute computer-readable instructions 311, and when the computer-readable instructions 311 are executed by the processor 320, implement the steps of the device status monitoring method in any one of the above embodiments.
  • the processor 320 may include one or more processing cores, such as a 3-core processor, an 8-core processor, and the like.
  • the processor 320 can be realized by at least one hardware form of DSP (Digital Signal Processing), Field-Programmable Gate Array (FPGA) and Programmable Logic Array (PLA).
  • The processor 320 may also include a main processor and a coprocessor. The main processor, also called the central processing unit (CPU), processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state.
  • the processor 320 may be integrated with an image processor GPU (Graphics Processing Unit), and the GPU is used for rendering and drawing the content that needs to be displayed on the display screen.
  • the processor 320 may also include an artificial intelligence AI (Artificial Intelligence) processor, which is used to process computing operations related to machine learning.
  • Memory 310 may include one or more computer-readable storage media, which may be non-transitory.
  • the memory 310 may also include high-speed random access memory, and non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices.
  • The memory 310 is used at least to store the computer-readable instructions 311, which, after being loaded and executed by the processor 320, implement the relevant steps of the device status monitoring method disclosed in any of the preceding embodiments.
  • the resources stored in the memory 310 may also include an operating system 312 and data 313, etc., and the storage method may be temporary storage or permanent storage.
  • the operating system 312 may be Windows.
  • the data 313 may include but not limited to the data involved in the above method.
  • the device status monitoring device may further include a display screen 330 , a power supply 340 , a communication interface 350 , an input/output interface 360 , a sensor 370 and a communication bus 380 .
  • The structure shown in FIG. 3 does not limit the device status monitoring device, which may include more or fewer components than illustrated.
  • The device status monitoring device provided in the embodiments of the present application includes a memory and a processor.
  • When the processor executes the program stored in the memory, it can implement any of the device status monitoring methods above, with the same effect as described above.
  • the above-described device and device embodiments are only illustrative.
  • the division of modules is only a logical function division.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or modules may be in electrical, mechanical or other forms.
  • a module described as a separate component may or may not be physically separated, and a component shown as a module may or may not be a physical module, that is, it may be located in one place, or may also be distributed to multiple network modules. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing module, each module may exist separately physically, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or in the form of software function modules.
  • If the integrated modules are implemented as software function modules and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • On this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, or all or part of the technical solution, can be embodied as a software product; the computer software product is stored in a storage medium and performs all or part of the steps of the methods in the embodiments of the present application.
  • The embodiments of the present application also provide a non-volatile computer-readable storage medium storing computer-readable instructions that, when executed by one or more processors, implement the steps of the device status monitoring method in any one of the above embodiments.
  • the computer-readable instructions contained in the computer-readable storage medium provided in this embodiment can realize the steps of any one of the above device status monitoring methods when executed by the processor, and the effect is the same as above.
  • Nonvolatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

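The present application discloses a device status monitoring method, apparatus, device, and computer-readable storage medium. The method includes: saving historical state parameters of a monitored device collected in the past; acquiring real-time state parameters of the monitored device; comparing the real-time state parameters with the historical state parameters; and, when there is an abnormal real-time state parameter whose deviation from the historical state parameters exceeds a preset range, executing the handling mechanism corresponding to the abnormal real-time state parameter.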

Description

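Device status monitoring method, apparatus, device, and computer-readable storage medium
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202111602701.9, entitled "Device status monitoring method, apparatus, device, and computer-readable storage medium", filed with the China Patent Office on December 24, 2021, the entire contents of which are incorporated herein by reference.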
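Technical field
The present application relates to the technical field of server monitoring, and in particular to a device status monitoring method, apparatus, device, and computer-readable storage medium.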
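Background
The inventors have realized that existing server device status monitoring usually monitors the device state in real time and compares it with a pre-stored standard state to determine whether the device is abnormal at the current moment. However, there is no effective monitoring of how a device behaves over the course of its operation, especially across the time spans between machine boots and reboots, so some abnormal device states that real-time monitoring cannot catch go undetected. For example, after a device drops off the bus (a card drop), the Basic Input Output System (BIOS) or the Baseboard Management Controller (BMC) can no longer detect the device, but cannot tell whether the card dropped because of an error during operation or whether the device was never installed in the machine in the first place. Because of this monitoring gap, users cannot discover some latent device operation risks and, when a corresponding anomaly (such as a card drop) occurs, cannot promptly determine its cause.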
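Summary
The present application provides a device status monitoring method, including:
saving historical state parameters of a monitored device collected in the past;
acquiring real-time state parameters of the monitored device;
comparing the real-time state parameters with the historical state parameters; and
when there is an abnormal real-time state parameter whose deviation from the historical state parameters exceeds a preset range, executing the handling mechanism corresponding to the abnormal real-time state parameter.
The present application further provides a device status monitoring apparatus, including:
a storage unit, configured to save historical state parameters of a monitored device collected in the past;
an acquisition unit, configured to acquire real-time state parameters of the monitored device;
a comparison unit, configured to compare the real-time state parameters with the historical state parameters; and
an exception handling unit, configured to execute, when there is an abnormal real-time state parameter whose deviation from the historical state parameters exceeds a preset range, the handling mechanism corresponding to the abnormal real-time state parameter.
The present application further provides a device status monitoring device, including a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the steps of the above device status monitoring method.
The present application further provides one or more non-volatile computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the above device status monitoring method.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the description, the drawings, and the claims.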
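Brief description of the drawings
To describe the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a device status monitoring method provided by the present application according to one or more embodiments;
FIG. 2 is a schematic structural diagram of a device status monitoring apparatus provided by the present application according to one or more embodiments;
FIG. 3 is a schematic structural diagram of a device status monitoring device provided by the present application according to one or more embodiments.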
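Detailed description
The core of the present application is to provide a device status monitoring method, apparatus, device, and computer-readable storage medium that implement non-real-time monitoring of device status, close the monitoring gap left by the prior-art approach of real-time monitoring that compares the device operating state with a standard state, improve the ability to monitor device operating status, improve the maintainability of device anomalies, improve machine error reporting, and save maintenance labor.
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the scope of protection of the present application.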
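In one embodiment, as shown in FIG. 1, the device status monitoring method provided by the embodiment of the present application is described using its application to a computer device as an example. The method includes:
S101: Save historical state parameters of the monitored device collected in the past.
S102: Acquire real-time state parameters of the monitored device.
S103: Compare the real-time state parameters with the historical state parameters.
S104: If there is an abnormal real-time state parameter whose deviation from the historical state parameters exceeds a preset range, execute the handling mechanism corresponding to the abnormal real-time state parameter.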
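The four steps S101 to S104 can be illustrated with a minimal Python sketch. It is not taken from the application: the names `find_abnormal` and `monitor_once`, the dictionary representation of state parameters, and the choice of the most recent record as the baseline are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

State = Dict[str, float]  # hypothetical representation: parameter name -> value


def find_abnormal(realtime: State, history: List[State], tolerance: float) -> State:
    """S103: return the parameters whose deviation from the latest record exceeds tolerance."""
    if not history:
        return {}  # nothing to compare against on the first run
    baseline = history[-1]
    return {
        name: value
        for name, value in realtime.items()
        if name in baseline and abs(value - baseline[name]) > tolerance
    }


def monitor_once(
    read_state: Callable[[], State],
    history: List[State],
    tolerance: float,
    handle: Callable[[State], None],
) -> State:
    """Run one monitoring pass over a single monitored device."""
    realtime = read_state()                                  # S102: acquire
    abnormal = find_abnormal(realtime, history, tolerance)   # S103: compare
    if abnormal:                                             # S104: handle
        handle(abnormal)
    history.append(realtime)                                 # S101: save for later passes
    return abnormal
```

Here a link width that falls from 16 to 8 between two passes would be returned as abnormal and passed to `handle`, while a second pass that again reads 8 would not.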
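In a specific implementation, the monitored devices targeted by the embodiments of the present application may include, but are not limited to, PCIe devices, central processing units, memory devices, and hard disk drives. The method may be executed by the device hosting the Basic Input Output System (BIOS), the Baseboard Management Controller (BMC), or the operating system (OS), or several such entities may cooperate to complete the steps. Automated monitoring of the monitored device is achieved by developing a monitoring script, or by writing a monitoring program and compiling it together with the executing entity's existing program. The steps of the device status monitoring method provided by the embodiments may also be executed continuously or separately at different stages, such as during BIOS boot, in the UEFI shell, and after entering the operating system.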
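For steps S101 and S102, the real-time state parameters and the historical state parameters of the monitored device may be state parameters collected by the same executing entity over the same path, or the historical state parameters may be received from a previous executing entity that collected them over a different path. The BIOS and the BMC already have real-time monitoring mechanisms for monitored devices such as PCIe devices, central processing units, memory devices, and hard disk drives; the state parameters collected by these mechanisms can be reused, or separate monitoring functions can be developed. The real-time state parameters may be acquired at preset time points, periodically, when triggered by a preset event (such as device power-on or power-off), or at random. Different acquisition and storage strategies may also be used for different types of state parameters of different monitored devices.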
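Real-time state parameters can be acquired by calling the interface function corresponding to the monitored device to read the device's registers. Depending on the executing entity, the state parameters of the monitored device may be obtained over the same or different paths. For example, for central processing units, memory devices, hard disk drives, and the like, the BIOS can access their respective registers through the Protocols provided by the UEFI specification to obtain the information of interest, error states, and so on; the operating system also provides corresponding support functions for this kind of basic hardware register access.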
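For these monitored devices, the device status monitoring method provided by the embodiments of the present application is suited to long-term, non-real-time monitoring, so that fault information that real-time monitoring has difficulty locating can be obtained effectively. The same long-term, non-real-time monitoring can also be applied to parameters that already have real-time monitoring mechanisms (such as bandwidth), to learn of abnormal changes in these state parameters over long-term operation.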
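Taking a PCIe device as the monitored device as an example, the real-time state parameters may include, but are not limited to, the device presence state parameter, vendor identifier (Vendor ID), physical slot identifier (Physical slot number), maximum transmission rate (Max Link Speed), maximum bandwidth (Max Link Width), real-time transmission rate (Current Link Speed), real-time bandwidth (Current Link Width), logical identifier (Bus/Device/Function number), and topology data of the PCIe link on which the device resides (the Bus/Device/Function number of each level of upstream bridge).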
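All acquired real-time state parameters may be stored as historical state parameters, or only some of them, or values computed or converted from them, may be stored.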
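When saving the historical state parameters of the monitored device collected in the past, hardware or software storage space is allocated in advance for the historical state parameters of the monitored device, and corresponding storage rules are defined to prevent the stored data from exceeding the capacity of the storage space. The historical state parameters may be saved locally on the device that collects them, or sent to another device.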
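Accordingly, in step S101 (saving historical state parameters of the monitored device collected in the past), the historical state parameters may be stored in a preset storage chip, stored in a pre-allocated memory region, or sent to the baseboard management controller for storage via an Intelligent Platform Management Interface command (IPMI Command), Redfish, or shared memory. If the BIOS collects the real-time state parameters, the historical state parameters may be stored in the storage chip corresponding to the BIOS or in a BIOS Variable. If the BIOS or another device collects the real-time state parameters, the historical state parameters may also be sent to the baseboard management controller for storage, either in software, for example via IPMI commands, Redfish, or software shared memory, or through a shared storage chip.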
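One monitored device may correspond to one storage region or to several; when storing across several storage regions, a load-balancing strategy may be applied, or storage may follow a storage priority.
The storage structure for the historical state parameters is designed as needed. For example, it may be a linked list, and each stored record may be given a label to make lookup and access easier.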
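Also, to prevent stored data from exceeding the capacity of the storage space, a corresponding storage strategy is defined when saving the historical state parameters collected from the monitored device, including but not limited to: storing a preset amount of historical state parameters on a first-in, first-out basis; storing the historical state parameters at preset time points; storing the historical state parameters when a preset event occurs; storing the mean of the historical state parameters; and storing historical state parameters of preset types. One of these strategies may be chosen, or several may be combined. With the first-in, first-out strategy, when the storage capacity, or a certain proportion of it, is exceeded, the newest historical state parameters overwrite the oldest. With the preset-time-point strategy, the historical state parameters corresponding to fixed moments may be stored, or storage may be periodic. With the preset-event strategy, trigger events may be defined in advance, for example storing the state parameters when the device powers on or before it powers off. With the mean strategy, the mean may be computed from the historical state parameter values at the historical time points, and for attribute-type parameters the most frequently occurring value may be computed instead; this can be combined with first-in, first-out storage, for example by computing the mean only of the historical state parameters that would otherwise exceed the storage capacity and overwriting that portion with it. With the preset-type strategy, the state parameters that need long-term, non-real-time monitoring may be extracted from each monitored device's real-time monitoring parameters and stored; or, combined with the mean strategy, historical state parameters of higher importance may be stored in full while those of lesser importance are stored as means.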
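As a rough illustration of the first-in, first-out and mean strategies, the sketch below keeps a bounded history in a `collections.deque`. The class name `HistoryStore` and its methods are hypothetical and only show one way such a store could behave; the application does not prescribe a data structure beyond mentioning that a linked list may be used.

```python
from collections import deque


class HistoryStore:
    """Bounded history for one numeric state parameter (hypothetical helper)."""

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) discards the oldest record once capacity is reached,
        # which matches "the newest record overwrites the oldest".
        self._records = deque(maxlen=capacity)

    def save(self, value: float) -> None:
        self._records.append(value)

    def oldest(self) -> float:
        return self._records[0]

    def latest(self) -> float:
        return self._records[-1]

    def mean(self) -> float:
        # Mean over the retained records only.
        return sum(self._records) / len(self._records)

    def __len__(self) -> int:
        return len(self._records)
```

A store created with `HistoryStore(3)` that receives the values 16, 16, 8, 8 retains only the last three, so the earliest remaining record is the second 16.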
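For steps S103 and S104, comparing the real-time state parameters with the historical state parameters means comparing the current state of the monitored device with its historical state; if the current state is inconsistent with the historical state, the corresponding handling mechanism is executed. For a monitored device whose real-time state parameters are observed for the first time and which has no historical state parameters, step S103 may be skipped, or the first comparison may be made against a standard parameter list, with subsequent comparisons made against the actual historical state parameters.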
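Step S103 (comparing the real-time state parameters with the historical state parameters) includes, but is not limited to: comparing the real-time state parameters with the earliest stored historical state parameters, with the most recently stored historical state parameters, with all historical state parameters, or with preset historical state parameters selected from the historical state parameters. Different comparison strategies may be used for different types of state parameters of different monitored devices. For a given state parameter of a monitored device, a fixed comparison strategy may be used, or the strategy may be switched flexibly according to the computing resources of the current executing entity. For example, when computing resources are plentiful, all types of real-time state parameters may be compared against the full set of historical state parameters; when computing resources are scarce, or to save computing resources, all or some types of real-time state parameters may be compared against the earliest stored historical state parameters, the most recently stored historical state parameters, preset historical state parameters drawn from the history by a fixed rule or at random, or the mean of the historical state parameters. Likewise, when computing resources are plentiful, a comparison may be made every time real-time state parameters are acquired; when computing resources are scarce, or to save them, only one of several acquisitions may be selected for comparison, or the real-time state parameters to be compared may be stored first and compared once computing resources are sufficient.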
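The comparison strategies above can be sketched as a small selector. Everything here is illustrative: the `Strategy` enum, the function names, and the rule in `pick_strategy` (full comparison when resources are plentiful, latest record otherwise) are assumptions, and the application leaves the actual policy open.

```python
from enum import Enum
from typing import List, Sequence


class Strategy(Enum):
    EARLIEST = "earliest"  # compare with the earliest stored record
    LATEST = "latest"      # compare with the most recently stored record
    ALL = "all"            # compare with every stored record
    MEAN = "mean"          # compare with the mean of the stored records


def baselines(history: Sequence[float], strategy: Strategy) -> List[float]:
    """Select the historical values to compare against."""
    if not history:
        return []
    if strategy is Strategy.EARLIEST:
        return [history[0]]
    if strategy is Strategy.LATEST:
        return [history[-1]]
    if strategy is Strategy.MEAN:
        return [sum(history) / len(history)]
    return list(history)


def deviates(realtime: float, history: Sequence[float],
             strategy: Strategy, tolerance: float) -> bool:
    """True if the real-time value is outside the tolerance of any selected baseline."""
    return any(abs(realtime - b) > tolerance for b in baselines(history, strategy))


def pick_strategy(resources_plentiful: bool) -> Strategy:
    """Assumed policy: full comparison when resources allow, cheapest otherwise."""
    return Strategy.ALL if resources_plentiful else Strategy.LATEST
```

The choice of strategy changes the outcome: with a history of 16, 16, 8 and a current value of 8, the latest-record comparison reports no deviation, while the earliest-record and full comparisons do.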
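Step S104 is executed only when there is an abnormal real-time state parameter whose deviation from the historical state parameters exceeds the preset range. The handling mechanism corresponds to the type of state parameter of the monitored device. Different types of state parameters of the monitored devices may share the same handling mechanism or use different ones, and user-defined handling mechanisms may be accepted. Handling mechanisms may include, but are not limited to, sending an error message, recording an error log, pushing maintenance suggestions, and executing an error-correction strategy. For example, a preset interface may be called to send an error message to the user; an IPMI command may be used to notify the baseboard management controller to record an error log for the abnormal real-time state parameter; the user may be told which monitored device is abnormal, where it is located, and the error type corresponding to the abnormal real-time state parameter (for example, a PCIe device that was detected last time can no longer be detected, suggesting a card drop); and the corresponding maintenance suggestion may be retrieved from a pre-generated fault-handling list and pushed to the user, such as replacing the device or checking whether the configuration has changed. For anomalies that can be handled automatically, a preset error-correction strategy may be invoked to correct the error automatically.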
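One way to tie handling mechanisms to parameter types is a dispatch table, sketched below. The handler functions, the message formats, and the parameter names `presence` and `link_width` are hypothetical; a real implementation would call into the platform's own logging or notification interfaces, which are not modeled here.

```python
from typing import Callable, Dict, List

Handler = Callable[[str, str], str]  # (device, parameter) -> message produced


def log_error(device: str, parameter: str) -> str:
    """Stand-in for recording an error log entry."""
    return f"LOG: {device} {parameter} deviates from history"


def suggest_maintenance(device: str, parameter: str) -> str:
    """Stand-in for pushing a maintenance suggestion to the user."""
    return f"ADVICE: check or replace {device} ({parameter})"


DEFAULT_HANDLERS: List[Handler] = [log_error]

# Hypothetical mapping from parameter type to its handling mechanisms.
HANDLERS: Dict[str, List[Handler]] = {
    "presence": [log_error, suggest_maintenance],
    "link_width": [log_error],
}


def handle_abnormal(device: str, parameter: str) -> List[str]:
    """Run every handler registered for the parameter type, in order."""
    return [h(device, parameter) for h in HANDLERS.get(parameter, DEFAULT_HANDLERS)]
```

Keeping the mapping in data rather than in branching code makes it easy to accept user-defined handling mechanisms, which the text explicitly allows.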
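The device status monitoring method provided by the embodiments of the present application saves the historical state parameters of the monitored device collected in the past, compares the acquired real-time state parameters of the monitored device with those historical state parameters, and, upon finding an abnormal real-time state parameter whose deviation from the historical state parameters exceeds the preset range, executes the corresponding handling mechanism. This closes the gap in monitoring device behavior across the time spans between machine boots and reboots, implements non-real-time monitoring of device status, improves the ability to monitor device operating status, improves the maintainability of device anomalies, improves machine error reporting, and saves maintenance labor.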
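As mentioned in the above embodiment, the executing entity of the embodiments of the present application may be the basic input output system, the baseboard management controller, or the operating system. On the basis of the above embodiment, the device status monitoring method is described below as applied to the device hosting the basic input output system (hereinafter BIOS).
In step S102, acquiring the real-time state parameters of the monitored device may specifically be:
acquiring the real-time state parameters at least once after initialization of the monitored device is complete and before the operating system is started;
and/or,
after the operating system is started, triggering a system management interrupt to acquire the real-time state parameters.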
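Take a PCIe device as the monitored device. When the BIOS enumerates and processes PCIe devices, it assigns each PCIe device a set of Bus, Device, and Function numbers (the BDF value). This combination is unique, so the three values together uniquely locate one PCIe logical device. The BIOS also reads the relevant registers in the device's configuration space to obtain the device's attributes and state, such as the speed and bandwidth the PCIe device can support and the speed and bandwidth at which it is currently running. When a supported attribute and the current running state do not match, the BIOS sends the relevant device information, such as the device's Bus, Device, and Function numbers, the type of the mismatched attribute (speed, bandwidth, and so on), and the device's physical location, to the baseboard management controller (BMC) by means including but not limited to an IPMI command, and an error log entry is recorded containing the information about the faulty device and its state.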
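The BIOS implements an asset information function for PCIe devices. Specifically, after the BIOS has initialized the PCIe devices and before it boots the operating system, it collects a set of information for every PCIe device, including but not limited to the BDF values of the device and of the PCIe bridge it sits behind, its presence state, physical slot identifier (Physical slot number), and location information, and passes this information to the BMC by means including but not limited to IPMI commands or Redfish. On receiving it, the BMC displays these attributes per device on its web interface for the user to view. To distinguish where each device is located, the physical slot identifier (Physical slot number) of each device is set through a register of the bridge the device sits behind, so that every device has a unique physical slot identifier. Each device's unique physical slot identifier is then matched to its location string: the physical slot identifier uniquely identifies a device, and the location string indicates the device's physical position in the machine. The embodiments of the present application may use the same design and settings to achieve the same effect, and may reuse the data produced by the asset information function.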
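In a specific implementation, taking a PCIe device as the monitored device, after the BIOS finishes initializing the PCIe devices and before it boots the operating system, the attribute information of all PCIe devices is collected one or more times. How much information and which items are collected can be adjusted as needed; the device information collected by the asset information function may be reused, or collection may be implemented separately. For example, only the device's BDF value, and/or its vendor identifier (Vendor ID), device identifier (Device ID), and physical slot identifier (Physical slot number), may be collected. The BIOS obtains the real-time state parameters of a PCIe device by reading the device's registers, including but not limited to the following: the BIOS reads the Vendor ID and Device ID registers to determine whether the device is present; reads the Slot Capabilities register to obtain the Physical slot number, which can be mapped to the slot holding the physical device; reads the Link Capabilities register to obtain the Max Link Speed and Max Link Width; reads the Link Status register to obtain the Current Link Speed and Current Link Width; reads the device's BDF value to obtain its logical identifier; and reads the BDF value of each level of upstream bridge to obtain the topology data of the PCIe link on which the device resides. The read methods supported under the BIOS and in the UEFI Shell environment include, but are not limited to: member functions of EFI_PCI_IO_PROTOCOL, such as EFI_PCI_IO_PROTOCOL_ACCESS, of which there are Memory and IO variants; the EFI_PCI_IO_PROTOCOL_GET_LOCATION function; and the IO and Memory access instructions available in assembly language. Under Windows and Linux the same registers are read, using each system's own IO or Memory access functions.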
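The register reads described above reduce to extracting bit fields from raw register values. The Python sketch below decodes values that are assumed to have been read already; it does not access hardware. The field positions follow the PCI Express Capability structure as commonly documented (speed in bits 3:0 and width in bits 9:4 of Link Capabilities and Link Status, Physical Slot Number in bits 31:19 of Slot Capabilities), and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkState:
    speed_code: int  # encoded link speed (for example 1 = 2.5 GT/s, 3 = 8 GT/s)
    width: int       # number of lanes


def is_present(vendor_id: int) -> bool:
    """A Vendor ID that reads back as 0xFFFF conventionally means no device responded."""
    return vendor_id != 0xFFFF


def decode_link(register: int) -> LinkState:
    """Decode Link Capabilities or Link Status: bits 3:0 speed, bits 9:4 width."""
    return LinkState(speed_code=register & 0xF, width=(register >> 4) & 0x3F)


def decode_physical_slot(slot_capabilities: int) -> int:
    """Physical Slot Number occupies bits 31:19 of Slot Capabilities."""
    return (slot_capabilities >> 19) & 0x1FFF


def link_degraded(link_capabilities: int, link_status: int) -> bool:
    """True if the link runs below its supported speed or width."""
    supported = decode_link(link_capabilities)
    current = decode_link(link_status)
    return current.speed_code < supported.speed_code or current.width < supported.width
```

For instance, a capability value of `0x0103` decodes to speed code 3 at width 16; if the status register then reads `0x0083` (speed code 3, width 8), `link_degraded` reports that the link has trained at a reduced width.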
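The BIOS saves the information of all monitored devices from every collection. As in the above embodiment, the historical state parameters may be stored in a preset storage chip, or sent to the baseboard management controller for storage via an Intelligent Platform Management Interface command (IPMI Command) or Redfish, with a suitable storage structure designed as needed and a corresponding storage strategy defined.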
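The BIOS compares the acquired real-time state parameters of the monitored device with the previously saved historical state parameters; see the comparison methods described in the above embodiment. Taking a PCIe device as the monitored device, the BIOS compares the collected real-time state parameters of the PCIe devices with the previously saved historical state parameters, including but not limited to comparing against the earliest saved data, against the previous data, against all saved data, or against some saved data selected at random or by a rule. The number of collections and comparisons may also be increased as needed. When the comparison shows that the number of PCIe devices has changed, the change (whether devices were added or removed, how many, and the key attributes of the added or removed devices, including but not limited to the Vendor ID, Device ID, Physical slot number, and location information) is sent to the BMC, for example via an IPMI command, to record a log entry for the device change. This data helps determine which device changed and where in the machine it is located, so that maintenance can be targeted.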
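Detecting that devices were added or removed between two collections amounts to diffing two inventories. The sketch below keys each inventory by physical slot number, as the text suggests; the `DeviceInfo` type, the function names, the message format, and the sample IDs and location strings are illustrative assumptions, and sending the result to the BMC is not modeled.

```python
from typing import Dict, List, NamedTuple, Tuple


class DeviceInfo(NamedTuple):
    vendor_id: int
    device_id: int
    location: str  # location string matched to the physical slot number


Inventory = Dict[int, DeviceInfo]  # physical slot number -> device attributes


def diff_inventory(previous: Inventory, current: Inventory) -> Tuple[Inventory, Inventory]:
    """Return (added, removed) devices between two collections."""
    added = {slot: dev for slot, dev in current.items() if slot not in previous}
    removed = {slot: dev for slot, dev in previous.items() if slot not in current}
    return added, removed


def describe_change(previous: Inventory, current: Inventory) -> List[str]:
    """Build one log line per changed device, removed devices first."""
    added, removed = diff_inventory(previous, current)
    lines = []
    for slot, dev in sorted(removed.items()):
        lines.append(
            f"removed slot {slot}: {dev.vendor_id:04x}:{dev.device_id:04x} at {dev.location}"
        )
    for slot, dev in sorted(added.items()):
        lines.append(
            f"added slot {slot}: {dev.vendor_id:04x}:{dev.device_id:04x} at {dev.location}"
        )
    return lines
```

A device that was present in the saved inventory but is missing from the current one shows up as a "removed" line, which is the card-drop case the application is concerned with.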
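Further, as needed, the BIOS may also use the periodic system management interrupt (SMI) function provided by the platform chipset, including but not limited to the periodic SMI trigger provided by Intel chips, choosing a suitable supported trigger interval, such as 64 s or 32 s. The data comparison and analysis described above, and the sending of the relevant data to notify the BMC to record a log, are implemented in the SMI handler, so that after the operating system has been entered, the SMI is still triggered at each set interval and the above functions are performed.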
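The comparison and analysis of the data may, as needed, also be performed by the BMC or the operating system, based on the same real-time and historical state parameters of the monitored devices collected by the BIOS. In existing PCIe monitoring mechanisms, each PCIe device has a 4 KB configuration space. Other types of device errors that can be detected from the information in each PCIe device's 4 KB configuration space, but that cannot be monitored in real time, can also be detected with the non-real-time scheme described above, with errors reported and logged in whatever form is appropriate once they are detected.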
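The embodiments of the device status monitoring method are described in detail above. On this basis, the present application also discloses a device status monitoring apparatus, device, and computer-readable storage medium corresponding to the above method.
In one embodiment, as shown in FIG. 2, the device status monitoring apparatus provided by the embodiment of the present application includes:
a storage unit 201, configured to save historical state parameters of the monitored device collected in the past;
an acquisition unit 202, configured to acquire real-time state parameters of the monitored device;
a comparison unit 203, configured to compare the real-time state parameters with the historical state parameters;
an exception handling unit 204, configured to execute, if there is an abnormal real-time state parameter whose deviation from the historical state parameters exceeds a preset range, the handling mechanism corresponding to the abnormal real-time state parameter.
Since the apparatus embodiments correspond to the method embodiments, refer to the description of the method embodiments for the apparatus embodiments; details are not repeated here.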
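In one embodiment, as shown in FIG. 3, the device status monitoring device provided by the embodiment of the present application may be a computer device, which may be a terminal or a server, and includes:
a memory 310, configured to store computer-readable instructions 311;
a processor 320, configured to execute the computer-readable instructions 311, which, when executed by the processor 320, implement the steps of the device status monitoring method of any one of the above embodiments.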
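The processor 320 may include one or more processing cores, such as a 3-core processor or an 8-core processor. The processor 320 may be implemented in at least one hardware form among digital signal processing (DSP), a field-programmable gate array (FPGA), and a programmable logic array (PLA). The processor 320 may also include a main processor and a coprocessor: the main processor, also called the central processing unit (CPU), processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 320 may integrate a graphics processing unit (GPU), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 320 may also include an artificial intelligence (AI) processor, which handles computing operations related to machine learning.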
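The memory 310 may include one or more computer-readable storage media, which may be non-transitory. The memory 310 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash storage devices. In this embodiment, the memory 310 is at least used to store the computer-readable instructions 311, which, after being loaded and executed by the processor 320, can implement the relevant steps of the device status monitoring method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 310 may include an operating system 312 and data 313, and the storage may be temporary or permanent. The operating system 312 may be Windows. The data 313 may include, but is not limited to, the data involved in the above method.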
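In some embodiments, the device status monitoring device may also include a display screen 330, a power supply 340, a communication interface 350, an input/output interface 360, a sensor 370, and a communication bus 380.
A person skilled in the art will understand that the structure shown in FIG. 3 does not limit the device status monitoring device, which may include more or fewer components than shown.
The device status monitoring device provided by the embodiment of the present application includes a memory and a processor; when the processor executes the program stored in the memory, it can implement any of the device status monitoring methods above, with the same effects.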
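It should be noted that the apparatus and device embodiments described above are merely illustrative. For example, the division into modules is only a division by logical function; other divisions are possible in an actual implementation, for example several modules or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, apparatuses, or modules, and may be electrical, mechanical, or of another form. A module described as a separate component may or may not be physically separate, and a component shown as a module may or may not be a physical module; that is, it may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.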
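In addition, the functional modules in the embodiments of the present application may be integrated into one processing module, each module may exist physically on its own, or two or more modules may be integrated into one module. The integrated module may be implemented in hardware or as a software functional module.
If the integrated module is implemented as a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present application, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and executes all or part of the steps of the methods of the embodiments of the present application.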
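To this end, the embodiments of the present application further provide a non-volatile computer-readable storage medium storing computer-readable instructions that, when executed by one or more processors, can implement the steps of the device status monitoring method of any one of the above embodiments.
The computer-readable instructions contained in the computer-readable storage medium provided in this embodiment can, when executed by a processor, implement the steps of any of the device status monitoring methods above, with the same effects.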
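A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by computer-readable instructions directing the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided by the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).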
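The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present application, all of which fall within its scope of protection. Therefore, the scope of protection of this patent shall be subject to the appended claims.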

Claims (20)

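  1. A device status monitoring method, comprising:
    saving historical state parameters of a monitored device collected in the past;
    acquiring real-time state parameters of the monitored device;
    comparing the real-time state parameters with the historical state parameters; and
    when there is an abnormal real-time state parameter whose deviation from the historical state parameters exceeds a preset range, executing the handling mechanism corresponding to the abnormal real-time state parameter.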
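  2. The device status monitoring method according to claim 1, wherein the monitored device comprises at least one of a PCIe device, a central processing unit, a memory device, and a hard disk drive.
  3. The device status monitoring method according to claim 1, wherein the monitored device is a PCIe device.
  4. The device status monitoring method according to claim 3, wherein the real-time state parameters comprise at least one of: a device presence state parameter, a vendor identifier, a physical slot identifier, a maximum transmission rate, a maximum bandwidth, a real-time transmission rate, a real-time bandwidth, a logical identifier, and topology data of the PCIe link on which the device resides.
  5. The device status monitoring method according to claim 1, wherein the device status monitoring method is applied to the device hosting a basic input output system.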
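  6. The device status monitoring method according to claim 1, wherein acquiring the real-time state parameters of the monitored device comprises:
    acquiring the real-time state parameters at least once after initialization of the monitored device is complete and before an operating system is started.
  7. The device status monitoring method according to claim 6, wherein, after the operating system is started, a system management interrupt is triggered to acquire the real-time state parameters.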
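  8. The device status monitoring method according to claim 1, wherein saving the historical state parameters of the monitored device collected in the past comprises:
    at least one of: storing a preset amount of the historical state parameters on a first-in, first-out basis, storing the historical state parameters at preset time points, storing the historical state parameters when a preset event occurs, storing the mean of the historical state parameters, and storing the historical state parameters of preset types.
  9. The device status monitoring method according to claim 8, wherein, when storing a preset amount of the historical state parameters on a first-in, first-out basis, and the storage capacity or a certain proportion of the storage capacity is exceeded, the newest stored historical state parameters overwrite the oldest stored historical state parameters.
  10. The device status monitoring method according to claim 8, wherein, when storing the historical state parameters at preset time points, the historical state parameters corresponding to fixed moments are selected for storage, or storage is performed periodically.
  11. The device status monitoring method according to claim 8, wherein, when storing the historical state parameters when a preset event occurs, a trigger event is defined in advance and storage is performed according to the trigger event.
  12. The device status monitoring method according to claim 8, wherein, when storing the historical state parameters of preset types, the state parameters that need long-term, non-real-time monitoring are extracted from the real-time monitoring parameters of each monitored device and stored.
  13. The device status monitoring method according to claim 8, wherein, when storing the historical state parameters of preset types in combination with the strategy of storing the mean of the historical state parameters, historical state parameters of higher importance are stored in full, while historical state parameters of lesser importance are stored as means.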
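  14. The device status monitoring method according to claim 1, wherein saving the historical state parameters of the monitored device collected in the past comprises:
    at least one of: storing the historical state parameters in a preset storage chip, storing the historical state parameters in a pre-allocated memory region, and sending the historical state parameters to a baseboard management controller for storage via an Intelligent Platform Management Interface command, Redfish, or shared memory.
  15. The device status monitoring method according to claim 1, wherein comparing the real-time state parameters with the historical state parameters comprises:
    performing at least one of: comparing the real-time state parameters with the earliest stored historical state parameters, comparing the real-time state parameters with the most recently stored historical state parameters, comparing the real-time state parameters with all of the historical state parameters, and comparing the real-time state parameters with preset historical state parameters among the historical state parameters.
  16. The device status monitoring method according to claim 1, wherein the handling mechanism is at least one of sending an error message, recording an error log, pushing maintenance suggestions, and executing an error-correction strategy.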
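  17. A device status monitoring apparatus, comprising:
    a storage unit, configured to save historical state parameters of a monitored device collected in the past;
    an acquisition unit, configured to acquire real-time state parameters of the monitored device;
    a comparison unit, configured to compare the real-time state parameters with the historical state parameters; and
    an exception handling unit, configured to execute, when there is an abnormal real-time state parameter whose deviation from the historical state parameters exceeds a preset range, the handling mechanism corresponding to the abnormal real-time state parameter.
  18. The device status monitoring apparatus according to claim 17, wherein the storage unit is further configured to perform at least one of: storing a preset amount of the historical state parameters on a first-in, first-out basis, storing the historical state parameters at preset time points, storing the historical state parameters when a preset event occurs, storing the mean of the historical state parameters, and storing the historical state parameters of preset types.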
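  19. A device status monitoring device, comprising a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the steps of the device status monitoring method according to any one of claims 1 to 16.
  20. One or more non-volatile computer-readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the steps of the device status monitoring method according to any one of claims 1 to 16.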
PCT/CN2022/113519 2021-12-24 2022-08-19 Device status monitoring method, apparatus, device, and computer-readable storage medium WO2023115999A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111602701.9A CN114328102B (zh) 2021-12-24 2021-12-24 Device status monitoring method, apparatus, device, and computer-readable storage medium
CN202111602701.9 2021-12-24

Publications (1)

Publication Number Publication Date
WO2023115999A1 true WO2023115999A1 (zh) 2023-06-29

Family

ID=81012119

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/113519 WO2023115999A1 (zh) 2021-12-24 2022-08-19 Device status monitoring method, apparatus, device, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN114328102B (zh)
WO (1) WO2023115999A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
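CN116521378A (zh) * 2023-07-03 2023-08-01 苏州浪潮智能科技有限公司 Sensor access method and apparatus for a server, and baseboard management controller
CN117271610A (zh) * 2023-11-17 2023-12-22 深圳曼顿科技有限公司 Device status management method, apparatus, terminal device, and storage medium
CN117527870A (zh) * 2023-12-07 2024-02-06 东莞信易电热机械有限公司 Control method and system for plastic molding
CN117554681A (zh) * 2024-01-08 2024-02-13 银河航天(西安)科技有限公司 Power monitoring method and apparatus applied to satellites, and storage medium
CN117970104A (zh) * 2024-02-28 2024-05-03 威海天拓合创电子工程有限公司 Servo-motor-based working state monitoring method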

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
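CN114328102B (zh) * 2021-12-24 2024-02-09 浪潮(北京)电子信息产业有限公司 Device status monitoring method, apparatus, device, and computer-readable storage medium
CN116795650B (zh) * 2023-06-29 2024-05-03 浙江海得智慧能源有限公司 Method, system, and device for monitoring the operating state of an energy storage system
CN118226800A (zh) * 2024-05-27 2024-06-21 成都飞机工业(集团)有限责任公司 Method, apparatus, medium, and device for monitoring the machining state of a numerically controlled production line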

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6738811B1 (en) * 2000-03-31 2004-05-18 Supermicro Computer, Inc. Method and architecture for monitoring the health of servers across data networks
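CN108254643A (zh) * 2018-01-17 2018-07-06 中科创能实业有限公司 Monitoring method and monitoring apparatus
CN112463541A (zh) * 2020-12-14 2021-03-09 上海金仕达软件科技有限公司 Data monitoring method and system
CN114328102A (zh) * 2021-12-24 2022-04-12 浪潮(北京)电子信息产业有限公司 Device status monitoring method, apparatus, device, and computer-readable storage medium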

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2372490A1 (en) * 2010-03-31 2011-10-05 Robert Bosch GmbH Circuit arrangement for a data processing system and method for data processing
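CN103353851A (zh) * 2013-07-01 2013-10-16 华为技术有限公司 Method and device for managing tasks
CN106444662A (zh) * 2016-09-23 2017-02-22 东莞团诚自动化设备有限公司 Data acquisition apparatus and method for the Internet of Things
CN110442402A (zh) * 2019-08-08 2019-11-12 中国建设银行股份有限公司 Data processing method, apparatus, device, and storage medium
CN112748847B (zh) * 2019-10-29 2024-04-19 伊姆西Ip控股有限责任公司 Method, device, and program product for managing storage space in a storage system
CN113192233A (zh) * 2021-04-29 2021-07-30 北京车和家信息技术有限公司 Data acquisition method, apparatus, device, and medium
CN113703917B (zh) * 2021-08-26 2022-10-14 上海道客网络科技有限公司 Multi-cluster resource data processing system and method, and non-transitory storage medium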

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
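CN116521378A (zh) * 2023-07-03 2023-08-01 苏州浪潮智能科技有限公司 Sensor access method and apparatus for a server, and baseboard management controller
CN116521378B (zh) * 2023-07-03 2023-09-19 苏州浪潮智能科技有限公司 Sensor access method and apparatus for a server, and baseboard management controller
CN117271610A (zh) * 2023-11-17 2023-12-22 深圳曼顿科技有限公司 Device status management method, apparatus, terminal device, and storage medium
CN117271610B (zh) * 2023-11-17 2024-03-12 深圳曼顿科技有限公司 Device status management method, apparatus, terminal device, and storage medium
CN117527870A (zh) * 2023-12-07 2024-02-06 东莞信易电热机械有限公司 Control method and system for plastic molding
CN117527870B (zh) * 2023-12-07 2024-05-03 东莞信易电热机械有限公司 Control method and system for plastic molding
CN117554681A (zh) * 2024-01-08 2024-02-13 银河航天(西安)科技有限公司 Power monitoring method and apparatus applied to satellites, and storage medium
CN117554681B (zh) * 2024-01-08 2024-03-22 银河航天(西安)科技有限公司 Power monitoring method and apparatus applied to satellites, and storage medium
CN117970104A (zh) * 2024-02-28 2024-05-03 威海天拓合创电子工程有限公司 Servo-motor-based working state monitoring method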

Also Published As

Publication number Publication date
CN114328102B (zh) 2024-02-09
CN114328102A (zh) 2022-04-12

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22909327

Country of ref document: EP

Kind code of ref document: A1