CN110597681A - Server hardware monitoring system - Google Patents


Info

Publication number
CN110597681A
Authority
CN
China
Prior art keywords
server
hardware
acquisition
module
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910341627.6A
Other languages
Chinese (zh)
Inventor
李飞
史峻丞
姚华
冉玄
李环荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GUIZHOU GUANGSI INFORMATION NETWORK CO Ltd
Original Assignee
GUIZHOU GUANGSI INFORMATION NETWORK CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-04-26
Filing date
2019-04-26
Publication date
2019-12-20
Application filed by GUIZHOU GUANGSI INFORMATION NETWORK CO Ltd
Priority to CN201910341627.6A
Publication of CN110597681A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention provides a server hardware monitoring system that combines the SNMP protocol and the IPMI protocol. During acquisition, each server model is configured with its own SNMP acquisition template and IPMI acquisition template, and hardware acquisition for all servers of that model reuses the same template, which greatly improves extensibility. The system also provides a unified Web Service data interface that gives external systems access to server hardware performance data and fault data.

Description

Server hardware monitoring system
Technical Field
The invention relates to the technical field of server hardware monitoring, in particular to a server hardware monitoring system.
Background
Server hardware in an information machine room runs application services such as operating systems, databases, application systems and cloud computing. When a hardware fault occurs on a server carrying these services and is not found and handled in time, it introduces many uncertainties into the stable operation of those services. It is therefore important to discover a server hardware fault as soon as it occurs, so as to shorten the fault response time, limit the scope of its impact, gain valuable time for handling it, and promptly eliminate the uncertainty it brings to service operation.
At present, however, staff must enter the machine room in person and inspect the server hardware indicator lights and operating state (such as temperature, voltage, fan operating state and power state) at fixed times and fixed points to determine whether the server hardware has a fault. This manual inspection is time-consuming, labor-intensive and inefficient; the running health of the server hardware cannot be tracked in real time, and hardware faults cannot be found in time.
The purpose of the invention is to provide a server hardware monitoring system that automatically acquires performance and operating data of server hardware and monitors its operating health in real time, and that, when a server hardware fault occurs, captures the fault immediately and pushes an alarm to the alarm system, thereby solving the problems of the existing manual inspection of server hardware.
The invention is realized as follows: the server hardware monitoring system comprises a hardware performance data acquisition module, a hardware fault alarm collection module, a hardware bottom layer operation module and a data pushing interface module;
the hardware performance data acquisition module comprises an SNMP protocol acquisition submodule and an IPMI protocol acquisition submodule, which are combined to acquire complete server hardware monitoring data; a timer automatically starts the acquisition task at a set frequency, and after the task completes the module waits for the next acquisition cycle before executing it again;
the hardware fault alarm collection module collects the original hardware fault alarms of the server in the form of SNMP trap protocol data units, extracts the OID of each original alarm, resolves it against the OIDs defined in the server's MIB library, converts the original OID fault alarm into a standard alarm, and pushes the standard alarm to the alarm system for delivery. Both the SNMP protocol acquisition submodule and the IPMI protocol acquisition submodule use configurable acquisition templates, with an independent template for each server model, which makes the system highly scalable when more servers are added horizontally. Acquisition tasks are executed in batches by a multithreaded thread pool; see the Detailed Description for details.
The hardware bottom layer operation module sends control commands to the server management port as standard IPMI commands and provides, at the hardware layer, remote power-on, power-off and BMC (baseboard management controller) restart functions for the server. When the server operating system hangs and it is impossible to log in to restart the server, this module allows the server to be restarted remotely from the hardware layer.
The data pushing interface module exposes a data access interface to external systems in the form of a Web Service data interface. The Web Service performs service description, service request and result feedback based on XML documents, which can be transmitted over HTTP on the Internet, so the service is easy to call and its results are easy to return. Moreover, because the relevant Web Service standards are open W3C protocols that are independent of platform and operating system, Web Services on different platforms and operating systems can largely interoperate, which makes application integration and data exchange across heterogeneous platforms easier.
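As a rough illustration of the XML-based request and result feedback described for the data pushing interface module, the following Python sketch builds and parses a small alarm document. The element names (`HardwareAlarm`, `ServerIp`, `Metric`, `Value`, `Status`) are hypothetical and are not defined by the invention; a real Web Service would wrap such a payload in whatever envelope its service description specifies.

```python
import xml.etree.ElementTree as ET


def build_alarm_xml(server_ip, metric, value, status):
    """Serialize one alarm as an XML document (element names are illustrative only)."""
    root = ET.Element("HardwareAlarm")
    ET.SubElement(root, "ServerIp").text = server_ip
    ET.SubElement(root, "Metric").text = metric
    ET.SubElement(root, "Value").text = str(value)
    ET.SubElement(root, "Status").text = status
    return ET.tostring(root, encoding="unicode")


def parse_alarm_xml(payload):
    """Parse the XML document back into a flat dict of tag -> text."""
    root = ET.fromstring(payload)
    return {child.tag: child.text for child in root}
```

Because the payload is plain XML, a consumer on any platform can read it with its own XML parser, which is the interoperability property the paragraph above relies on.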
The acquisition task of the hardware performance data acquisition module covers the following indicators: server fan speed, server fan state, server air inlet temperature, CPU temperature, memory temperature, hard disk state, RAID controller card temperature, power supply state, power supply power, physical memory state, power supply modulation module temperature, BMC hardware information, chassis power related policies and Watchdog information.
When an indicator of the monitored server hardware becomes abnormal, the data pushing interface module pushes an alarm to the alarm system so that an administrator can find the abnormal server hardware in time.
Because server brands differ in how fully they support the SNMP and IPMI protocols, complete server hardware performance data is difficult to acquire with either protocol alone; the invention therefore acquires complete server hardware performance and operating data by combining the two. It also provides a unified Web Service data interface that gives external systems access to server hardware performance data and fault data.
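The combination of the two protocols and the timer-driven acquisition cycle can be sketched as follows. This is a minimal Python sketch under stated assumptions: the text does not say which protocol takes precedence when both return a value, so preferring SNMP and filling gaps from IPMI is an assumption, and `collect_snmp` and `collect_ipmi` are hypothetical callables standing in for the two acquisition submodules.

```python
import time


def merge_readings(snmp, ipmi):
    """Combine two partial readings; SNMP wins, IPMI fills the gaps (assumed rule)."""
    merged = {}
    for metric in sorted(set(snmp) | set(ipmi)):
        value = snmp.get(metric)
        if value is None:
            value = ipmi.get(metric)
        merged[metric] = value
    return merged


def run_cycles(collect_snmp, collect_ipmi, cycles, interval_s, sleep=time.sleep):
    """Run a fixed number of acquisition cycles, waiting interval_s between them."""
    results = []
    for i in range(cycles):
        results.append(merge_readings(collect_snmp(), collect_ipmi()))
        if i < cycles - 1:
            sleep(interval_s)  # wait for the next acquisition cycle
    return results
```

Passing `sleep` as a parameter keeps the loop testable without real delays; a production timer would more likely be a scheduler than a blocking loop.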
Drawings
FIG. 1 is a block diagram of the present invention;
FIG. 2 is a schematic diagram of the SNMP protocol acquisition of the present invention;
FIG. 3 is a schematic diagram of IPMI protocol collection in accordance with the present invention;
FIG. 4 is a diagram illustrating hardware fault alarm collection in accordance with the present invention.
Detailed Description
Example 1 of the invention: a server hardware monitoring system. The SNMP protocol acquisition module (FIG. 2 is its acquisition schematic) acquires server hardware performance data based on the SNMP protocol and is developed with the SNMP4J component. Although SNMP is a standard simple network management protocol, different server models use different OIDs for the same acquisition index. The module therefore provides SNMP acquisition templates, with an independent SNMP hardware acquisition template for each server model. When a new server model is brought under monitoring, its SNMP acquisition template is defined once; afterwards, further servers of that model can be added simply by filling in each server's IP address and SNMP parameters.
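One way to picture the per-model SNMP acquisition template is as a mapping from acquisition index to OID, keyed by server model. The Python sketch below is illustrative only: the model names are hypothetical, the OIDs are placeholders rather than values from any real MIB, and the embodiment itself is built on SNMP4J rather than Python.

```python
from dataclasses import dataclass


@dataclass
class SnmpTemplate:
    model: str
    oids: dict  # acquisition index name -> OID string


@dataclass
class Server:
    ip: str
    model: str
    community: str  # SNMP parameter filled in per server


# Hypothetical models with placeholder OIDs (not taken from any vendor MIB).
TEMPLATES = {
    "vendor-a-model-1": SnmpTemplate("vendor-a-model-1", {
        "cpu_temperature": "1.3.6.1.4.1.99999.1.1.0",
        "fan_speed": "1.3.6.1.4.1.99999.1.2.0",
    }),
    "vendor-b-model-2": SnmpTemplate("vendor-b-model-2", {
        "cpu_temperature": "1.3.6.1.4.1.88888.5.1.0",
        "fan_speed": "1.3.6.1.4.1.88888.5.2.0",
    }),
}


def oids_for(server, templates=TEMPLATES):
    """Return the index -> OID mapping for a server, looked up by its model."""
    return dict(templates[server.model].oids)
```

Adding another server of a known model then needs only a new `Server` record; the template is defined once per model.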
When the SNMP protocol acquisition module starts, it reads the server hardware acquisition tasks from the server hardware acquisition task table of the database system. At the same time, it reads the SNMP acquisition OID resolution templates of the various server models from the SNMP acquisition template table of the database system. Because SNMP acquisition is ultimately performed through OIDs, the module automatically matches each server hardware acquisition task with its SNMP acquisition template to generate an SNMP protocol acquisition task queue, which contains the mapping between every server acquisition index and its OID. Finally, the tasks in the queue are executed in batches with high concurrency in a thread pool, and the acquired data is written to the database with high concurrency. When the module acquires abnormal server hardware operating state data, an alarm is pushed to the alarm system through the data pushing interface as a Web Service call.
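The matching of acquisition tasks with templates and the batched thread-pool execution described above might be sketched as follows. This is a simplified Python sketch, not the SNMP4J-based implementation: the task and template shapes are assumptions, and `fetch` is a hypothetical callable that stands in for the actual SNMP GET so that the example runs without a network.

```python
from concurrent.futures import ThreadPoolExecutor


def build_task_queue(tasks, templates):
    """Expand each task into (ip, index, oid) entries using its model's template.

    tasks: list of {"ip": str, "model": str, "metrics": [str, ...]}
    templates: {model: {index name: OID}}
    """
    queue = []
    for task in tasks:
        oid_map = templates[task["model"]]
        for metric in task["metrics"]:
            queue.append((task["ip"], metric, oid_map[metric]))
    return queue


def run_queue(queue, fetch, max_workers=8):
    """Execute every queue entry in a thread pool; results keep queue order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        values = list(pool.map(lambda item: fetch(item[0], item[2]), queue))
    return [(ip, metric, value) for (ip, metric, _), value in zip(queue, values)]


def fake_fetch(ip, oid):
    """Stand-in for an SNMP GET; returns a string instead of querying a device."""
    return f"{ip}:{oid}"
```

The pool size of 8 is arbitrary; in practice it would be tuned to the number of monitored servers and the acquisition interval.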
The IPMI protocol acquisition module (FIG. 3 is its acquisition schematic) acquires server hardware performance data based on the IPMI protocol. It is deployed in a Linux operating system environment and issues IPMI commands through the ipmitool toolkit. IPMI is an open standard hardware management interface protocol, but different server models return different data sets for the same IPMI command. The module therefore designs an independent IPMI hardware acquisition template for each server model. When a new server model is brought under monitoring, its IPMI acquisition template is defined once; afterwards, further servers of that model can be added simply by filling in each server's IP address and IPMI parameters.
The IPMI protocol high-concurrency acquisition component automatically matches server hardware acquisition tasks with IPMI acquisition templates to generate an IPMI protocol acquisition task queue, which contains the mapping between every server acquisition index and the IPMI return data set. The tasks in the queue are then executed in batches with high concurrency in a thread pool, and the acquired data is written to the database with high concurrency.
When the IPMI protocol acquisition module starts, it reads the server hardware acquisition tasks from the server hardware acquisition task table of the database system. At the same time, it reads the parsing templates for the IPMI return data sets of the various server models from the IPMI acquisition template table of the database system. IPMI acquisition ultimately depends on parsing the returned data with the parsing template for that return data set, so that the acquisition index values can be extracted from it. The module therefore automatically matches each server hardware acquisition task with its IPMI acquisition template to generate an IPMI protocol acquisition task queue, which contains the parsing relationship between every server acquisition index and the IPMI return data set. Finally, the tasks in the IPMI protocol acquisition task queue are executed in batches with high concurrency in a thread pool, and the acquired data is written to the database with high concurrency. When the module acquires abnormal server hardware operating state data, an alarm is pushed to the alarm system through the data pushing interface as a Web Service call.
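A parsing template for an IPMI return data set can be pictured as a set of per-model patterns that pull each acquisition index out of the raw command output. The Python sketch below assumes, for illustration only, that the template is a regular expression per index and that the output is pipe-delimited text; the model names, sensor names and sample lines are hypothetical, and real output varies by server model, which is exactly why a template per model is needed.

```python
import re

# Hypothetical per-model parsing templates: index name -> regex with one capture group.
IPMI_TEMPLATES = {
    "vendor-a-model-1": {
        "cpu_temperature": r"^CPU1 Temp\s*\|\s*(\d+(?:\.\d+)?) degrees C",
        "fan_speed": r"^FAN1\s*\|\s*(\d+(?:\.\d+)?) RPM",
    },
    "vendor-b-model-2": {
        "cpu_temperature": r"^Temp_CPU0\s*\|\s*(\d+(?:\.\d+)?) degrees C",
    },
}


def extract_metrics(model, raw_output, templates=IPMI_TEMPLATES):
    """Apply the model's template to raw output; missing indexes map to None."""
    result = {}
    for metric, pattern in templates[model].items():
        match = re.search(pattern, raw_output, flags=re.MULTILINE)
        result[metric] = float(match.group(1)) if match else None
    return result


# Made-up sample text in a pipe-delimited style, used only to exercise the parser.
SAMPLE_A = (
    "CPU1 Temp        | 45 degrees C      | ok\n"
    "FAN1             | 5400 RPM          | ok\n"
)
```

Returning `None` for an index that the template cannot find lets a caller distinguish "not reported by this model" from an abnormal value.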
The hardware fault alarm collection module (FIG. 4 is its collection schematic). Study of the SNMP Trap protocol data unit, together with the way SNMP Trap data is transmitted in practice, shows that when a specific event occurs at the monitored end (for example a performance problem, a network device interface going down, or a server hardware problem), the agent sends an alarm event to the management station. The hardware fault alarm collection module is designed on this principle.
When a hardware fault occurs, the server pushes an SNMP trap alarm message to the hardware fault alarm listening port of the hardware monitoring system. The SNMP trap message is an original fault alarm described by OIDs. Although the original alarm carries the alarm content in OID form, the specific faulty hardware can be identified only after the alarm analysis module resolves the original OID alarm and converts it into a standard alarm. The hardware fault alarm collection module listens on the network port; on receiving an original alarm from a server, it immediately stores the original alarm in the database, extracts the original alarm OID value, and submits the alarm information and the OID value to the alarm analysis module. The alarm analysis module resolves the alarm OID against the MIB, converts the original OID-based alarm information into standard alarm information, stores it in the database, and immediately pushes the server hardware fault alarm to external systems through a standard SNMP trap alarm pushing interface.
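The conversion from an original OID alarm to a standard alarm amounts to a lookup of the trap OID in the definitions taken from the server's MIB. The Python sketch below models that lookup with a plain dictionary; the OIDs, component names, severities and the `StandardAlarm` fields are all hypothetical, and receiving the trap from the network is left out.

```python
from dataclasses import dataclass

# Hypothetical MIB-derived table: trap OID -> (component, severity, message).
MIB_ALARMS = {
    "1.3.6.1.4.1.99999.2.1": ("fan", "critical", "Fan failure"),
    "1.3.6.1.4.1.99999.2.2": ("power_supply", "critical", "Power supply failure"),
    "1.3.6.1.4.1.99999.2.3": ("cpu", "warning", "CPU temperature above threshold"),
}


@dataclass
class StandardAlarm:
    source_ip: str
    component: str
    severity: str
    message: str
    raw_oid: str  # original OID kept for traceability


def to_standard_alarm(source_ip, trap_oid, mib=MIB_ALARMS):
    """Resolve a trap OID against the MIB table and build a standard alarm."""
    entry = mib.get(trap_oid)
    if entry is None:
        # Unknown OIDs are still reported rather than dropped (an assumed policy).
        return StandardAlarm(source_ip, "unknown", "warning", "Unrecognized trap OID", trap_oid)
    component, severity, message = entry
    return StandardAlarm(source_ip, component, severity, message, trap_oid)
```

Keeping the raw OID on the standard alarm mirrors the text's approach of storing both the original alarm and the converted one.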
The hardware bottom layer operation module provides the basic server operations: hardware power-on, hardware power-off and BMC restart. The IPMI bottom layer control commands are implemented with the ipmitool toolkit under a Linux operating system. When a server needs to be operated, its IPMI configuration parameters (IPMI user name, password, port and so on) are read from the database and assembled into a standard IPMI command, which is sent through ipmitool to the remote server's hardware management port. On receiving the command, the management port executes the corresponding operation, completing the hardware-level control of the server.
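Assembling the ipmitool command from the stored IPMI parameters can be sketched as below. This Python sketch makes several assumptions that the text does not state: it uses the `lanplus` interface, maps BMC restart to `mc reset cold`, and defaults the port to 623. The function and operation names are hypothetical. Passing the password with `-P` exposes it in the process list, so a real deployment may prefer another way of supplying it.

```python
import subprocess

# Operation name (hypothetical) -> ipmitool subcommand.
OPERATIONS = {
    "power_on": ["chassis", "power", "on"],
    "power_off": ["chassis", "power", "off"],
    "bmc_restart": ["mc", "reset", "cold"],
}


def build_ipmi_command(host, user, password, operation, port=623):
    """Assemble the ipmitool argument list for one hardware-level operation."""
    if operation not in OPERATIONS:
        raise ValueError(f"unsupported operation: {operation}")
    return [
        "ipmitool", "-I", "lanplus",  # lanplus is an assumption, not from the text
        "-H", host, "-p", str(port),
        "-U", user, "-P", password,
        *OPERATIONS[operation],
    ]


def run_operation(host, user, password, operation, port=623, timeout=30):
    """Send the command to the server's management port via ipmitool."""
    cmd = build_ipmi_command(host, user, password, operation, port)
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
```

Building the command as an argument list rather than a shell string avoids quoting problems with user names or passwords that contain special characters.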
The invention is not limited to the embodiments described in the specific embodiments, and those skilled in the art can derive other embodiments according to the technical solutions of the invention, and the embodiments also belong to the technical innovation scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (3)

1. A server hardware monitoring system, characterized in that: the system comprises a hardware performance data acquisition module, a hardware fault alarm collection module, a hardware bottom layer operation module and a data pushing interface module;
the hardware performance data acquisition module comprises an SNMP protocol acquisition submodule and an IPMI protocol acquisition submodule, which are combined to acquire complete server hardware monitoring data; a timer automatically starts the acquisition task at a set frequency, and after the task completes the module waits for the next acquisition cycle before executing it again;
the hardware fault alarm collection module collects the original hardware fault alarms of the server in the form of SNMP trap protocol data units, extracts the OID of each original alarm, resolves it against the OIDs defined in the server's MIB library, converts the original OID fault alarm into a standard alarm, and pushes the standard alarm to an alarm system for delivery;
the hardware bottom layer operation module sends control commands to the server management port as standard IPMI commands and provides, at the hardware layer, remote power-on, power-off and BMC (baseboard management controller) restart functions for the server;
the data pushing interface module provides a data access interface to external systems in the form of a Web Service data interface; and the Web Service performs service description, service request and result feedback based on XML documents.
2. The server hardware monitoring system according to claim 1, wherein the acquisition task of the hardware performance data acquisition module covers: server fan speed, server fan state, server air inlet temperature, CPU temperature, memory temperature, hard disk state, RAID controller card temperature, power supply state, power supply power, physical memory state, power supply modulation module temperature, BMC hardware information, chassis power related policies and Watchdog information.
3. The server hardware monitoring system according to claim 1, wherein, when an indicator of the monitored server hardware becomes abnormal, the data pushing interface module pushes an alarm to the alarm system so that an administrator can find the abnormal server hardware in time.
CN201910341627.6A 2019-04-26 2019-04-26 Server hardware monitoring system Pending CN110597681A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910341627.6A CN110597681A (en) 2019-04-26 2019-04-26 Server hardware monitoring system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910341627.6A CN110597681A (en) 2019-04-26 2019-04-26 Server hardware monitoring system

Publications (1)

Publication Number Publication Date
CN110597681A true CN110597681A (en) 2019-12-20

Family

ID=68852515

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910341627.6A Pending CN110597681A (en) 2019-04-26 2019-04-26 Server hardware monitoring system

Country Status (1)

Country Link
CN (1) CN110597681A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112527605A (en) * 2020-12-23 2021-03-19 中盈优创资讯科技有限公司 Server management method and device based on IPMI

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104639380A (en) * 2013-11-07 2015-05-20 英业达科技有限公司 Server monitoring method
CN104780059A (en) * 2014-10-15 2015-07-15 贵州电网公司信息通信分公司 Server performance management method based on WEB page and underlying system service
CN106656632A (en) * 2017-02-03 2017-05-10 上海中信信息发展股份有限公司 Machine room monitoring system fusing Ethernet protocol with Internet of Things protocol, and information processing and control method
CN108777637A (en) * 2018-05-30 2018-11-09 郑州云海信息技术有限公司 A kind of data center's total management system and method for supporting server isomery
CN109165250A (en) * 2018-08-29 2019-01-08 中远海运科技股份有限公司 Intelligent integrated plateform system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
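杨战胜 (Yang Zhansheng) et al.: "Research on implementing an integrated IT monitoring *** based on mobile operation and maintenance" (基于移动运维IT综合监控***实现研究), 《信息通信》 *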

Similar Documents

Publication Publication Date Title
CN111447109B (en) Monitoring management apparatus and method, computer readable storage medium
CN109857613B (en) Automatic operation and maintenance system based on collection cluster
CN112162821B (en) Container cluster resource monitoring method, device and system
CN106407076A (en) A monitoring method for the operation information of software and hardware based on a domestic CPU and operating system environment
CN109254922B (en) Automatic testing method and device for BMC Redfish function of server
CN100514962C (en) Host performance collection proxy in large-scale network
CN106655502B (en) Method and device for acquiring running state data of power distribution network equipment
CN109240851A (en) A kind of autonomous type realization self-healing method and system of batch BMC
CN108199901B (en) Hardware repair reporting method, system, device, hardware management server and storage medium
WO2018010176A1 (en) Method and device for acquiring fault information
CN112506969A (en) BMC address query method, system, equipment and readable storage medium
CN111488258A (en) System for analyzing and early warning software and hardware running state
CN112529223A (en) Equipment fault repair method and device, server and storage medium
CN111694707A (en) Small server cluster management system and method
CN110597681A (en) Server hardware monitoring system
CN110569140A (en) operation and maintenance method and device
CN116089212A (en) Database operation monitoring method, system, device and storage medium
US11237892B1 (en) Obtaining data for fault identification
CN115525392A (en) Container monitoring method and device, electronic equipment and storage medium
Narayanan et al. Towards' integrated'monitoring and management of DataCenters using complex event processing techniques
TWI685740B (en) Method for remotely clearing abnormal status of racks applied in data center
US8930369B2 (en) Information processing apparatus, message classifying method and non-transitory medium for associating series of transactions
CN113553243A (en) Remote error detection method
CN116484373B (en) Abnormal process checking and killing method, system, device, computer equipment and storage medium
CN114281615B (en) Automatic testing system and method for consistency of stored data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20231208