WO2017168484A1 - Management computer and performance deterioration sign detection method - Google Patents
- Publication number
- WO2017168484A1 (PCT/JP2016/059801)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- reference value
- operation information
- group
- virtual
- autoscale
- Prior art date
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/3409—Recording or statistical evaluation of computer activity for performance assessment
- G06F11/1438—Restarting or rejuvenating (saving, restoring, recovering or retrying at system level)
- G06F11/2023—Failover techniques
- G06F11/2025—Failover techniques using centralised failover control functionality
- G06F11/3006—Monitoring arrangements where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
- G06F11/301—Monitoring arrangements where the computing system is a virtual computing platform, e.g. logically partitioned systems
- G06F11/3055—Monitoring arrangements for monitoring the status of the computing system or component, e.g. monitoring if the computing system is on, off, available, not available
- G06F11/3404—Recording or statistical evaluation of computer activity for parallel or distributed programming
- G06F11/3433—Recording or statistical evaluation of computer activity for performance assessment for load management
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/4557—Distribution of virtual machine instances; Migration and load balancing
- G06F2009/45591—Monitoring or debugging support
- G06F2201/815—Virtual (indexing scheme relating to error detection, error correction and monitoring)
Definitions
- The present invention relates to a management computer and a performance deterioration sign detection method.
- A technique has been proposed for detecting a sign of performance degradation using a baseline learned from the normal state of an information system (Patent Document 1).
- In Patent Document 1, because it is difficult to set a fixed threshold for performance monitoring, a baseline is generated by statistically processing the normal behavior of the information system.
- The present invention has been made in view of the above problem, and its object is to provide a management computer and a performance deterioration sign detection method that can detect a sign of performance deterioration even when virtual operation units are repeatedly created and destroyed over short periods.
- A management computer detects signs of performance deterioration of an information system that includes one or more computers and one or more virtual operation units virtually installed on those computers.
- The management computer includes an operation information acquisition unit that acquires operation information from all virtual operation units belonging to an autoscale group, a management unit in which the number of virtual operation units is adjusted automatically;
- a reference value generation unit that generates, from the acquired operation information, a reference value for detecting a sign of performance degradation; and
- a detection unit that detects a sign of performance deterioration of each virtual operation unit by comparing the reference value generated by the reference value generation unit with the operation information acquired by the operation information acquisition unit.
- According to the present invention, a reference value for detecting a sign of performance deterioration can be generated from the operation information of all virtual operation units in the autoscale group, and a sign of performance deterioration can be detected by comparing the reference value with the operation information. As a result, the reliability of the information system can be improved.
- FIG. 20 is a diagram illustrating an overall configuration of a plurality of information systems in a failover relationship according to the third embodiment.
- This embodiment can detect signs of performance degradation even in an environment in which monitored instances disappear before a baseline can be generated because scale-in and scale-out are repeated frequently.
- The virtual operation unit is not limited to an instance (container); it may be a virtual machine. The technique can also be applied to physical computers instead of virtual operation units.
- All monitored instances belonging to the same autoscale group are regarded as pseudo-identical instances.
- A baseline (a total amount baseline and an average baseline) serving as a "reference value" is created from the operation information of all instances in the same autoscale group.
- The total amount of operation information (total operation information) of the instances belonging to the autoscale group is compared with the total amount baseline; if the total operation information falls outside the total amount baseline, it is judged that a sign of performance deterioration has been detected. In this embodiment, when a total amount baseline violation is found in the information system, scale-out is instructed. This increases the number of instances belonging to the autoscale group that violated the total amount baseline, thereby improving performance.
- The operation information of each instance in the autoscale group is compared with the average baseline; when the operation information of an instance deviates from the average baseline, it is likewise judged that a sign of performance deterioration has been detected. In this case, the instance in which the average baseline violation was detected is discarded and a similar instance is regenerated. As a result, the performance of the information system is restored.
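- As a rough illustration of how the two baselines might be computed, the following Python sketch derives a total amount baseline and an average baseline from operation-information samples collected during normal operation. The patent does not specify the statistical processing, so the mean ± 2σ width, the function names, and the data layout are all assumptions made for this sketch.

```python
import statistics

def make_baseline(samples, k=2.0):
    """Return (lower, upper) bounds derived from normal-period samples.

    The width is mean +/- k standard deviations; the patent does not
    specify the statistical method, so this is one plausible choice.
    """
    mean = statistics.mean(samples)
    stdev = statistics.pstdev(samples)
    return (mean - k * stdev, mean + k * stdev)

def group_baselines(history, k=2.0):
    """Compute the total amount and average baselines for one autoscale group.

    `history` is a list of snapshots; each snapshot maps a container ID to
    an operation-information value (e.g. CPU utilisation) sampled at one
    time. Every container in the group is treated as a pseudo-identical
    instance, so all per-container samples feed a single average baseline.
    """
    totals = [sum(snap.values()) for snap in history]
    per_instance = [v for snap in history for v in snap.values()]
    return {
        "total": make_baseline(totals, k),
        "average": make_baseline(per_instance, k),
    }
```

A snapshot history such as `[{"c1": 10, "c2": 10}, {"c1": 12, "c2": 8}]` would thus yield one pair of bounds for the group total and one pair applied to each instance individually.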
- FIG. 1 is an explanatory diagram showing an overall outline of the present embodiment.
- the configuration shown in FIG. 1 shows an outline of the present embodiment to the extent necessary for understanding and implementing the present invention, and the scope of the present invention is not limited to the illustrated configuration.
- the management server 1 as a “management computer” monitors for signs of performance degradation of the information system and implements countermeasures when it detects signs of performance degradation.
- the information system includes, for example, one or more computers 2, one or more virtual operation units 4 provided in the computer 2, and a replication control device 3 that controls generation and destruction of the virtual operation units 4.
- the virtual operation unit 4 is configured, for example, as an instance, a container, or a virtual machine, and performs arithmetic processing using the physical computer resources of the computer 2.
- the virtual operation unit 4 includes, for example, an application program, middleware, a library (or operating system), and the like.
- The virtual operation unit 4 may operate on the operating system of the computer 2, as an instance or container does, or may operate on an operating system different from that of the computer 2, as a virtual machine managed by a hypervisor does.
- the virtual calculation unit 4 may be called a virtual server.
- a container is given as an example of the virtual operation unit 4.
- parenthesized numbers are added to the reference numerals so that a plurality of existing elements such as the computer 2 and the virtual operation unit 4 can be distinguished. However, when there is no need to distinguish a plurality of elements, the parenthesized numbers are omitted.
- the virtual operation units 4 (1) to 4 (4) are referred to as the virtual operation unit 4 when it is not necessary to distinguish them.
- The replication control device (Replication Controller) 3 controls generation and destruction of the virtual operation units 4 in the information system.
- The replication control apparatus 3 holds one or more images 40 as "startup management information", generates a plurality of virtual operation units 4 from the same image 40, and discards any one or more of the virtual operation units 4 generated from the same image 40.
- the image 40 is management information used to generate (activate) the virtual calculation unit 4 and is a template that defines the configuration of the virtual calculation unit 4.
- the replication control device 3 controls the number of virtual operation units 4 by using the scale management unit P31.
- the replication control device 3 manages the generation and destruction of the virtual operation unit 4 for each autoscale group 5.
- the auto scale group 5 is a management unit that executes auto scale.
- the auto scale is a process for automatically adjusting the number of virtual operation units 4 according to an instruction.
- FIG. 1 shows a state in which a plurality of autoscale groups 5 are formed from virtual operation units 4 provided on different computers 2 respectively. Each virtual operation unit 4 in the autoscale group 5 is generated from the same image 40.
- FIG. 1 shows a plurality of autoscale groups 5 (1) and 5 (2).
- The first autoscale group 5 (1) includes the virtual operation unit 4 (1) provided in the computer 2 (1) and the virtual operation unit 4 (3) provided in the other computer 2 (2).
- The second autoscale group 5 (2) includes the virtual operation unit 4 (2) provided in the computer 2 (1) and the virtual operation unit 4 (4) provided in the other computer 2 (2).
- the autoscale group 5 can be composed of virtual operation units 4 provided in different computers 2.
- the management server 1 detects a sign of performance deterioration in the information system in which the virtual operation unit 4 operates. When the management server 1 detects a sign of performance degradation, it can also notify a system administrator or the like. Further, when the management server 1 detects a sign of performance degradation, it can also deal with the performance degradation by giving a predetermined instruction to the replication control device 3.
- the management server 1 can include, for example, an operation information acquisition unit P10, a baseline generation unit P11, a performance deterioration sign detection unit P12, and a handling unit P13. These functions P10 to P13 are realized by a computer program stored in the management server 1, as will be described later. In FIG. 1, the same reference numerals are assigned to the corresponding computer programs and functions in order to clarify an example of the correspondence between the computer programs and the functions. Each function P10 to P13 may be realized by using a hardware circuit instead of or together with the computer program.
- the operation information acquisition unit P10 acquires the operation information of each virtual operation unit 4 operating on the computer 2 from each computer 2.
- The operation information acquisition unit P10 acquires information on the configuration of the autoscale groups 5 from the replication control device 3, and can manage the operation information of the virtual operation units 4 acquired from each computer 2 by classifying it by autoscale group.
- Since the replication control device 3 can collect the operation information of each virtual operation unit 4 from each computer 2, the operation information acquisition unit P10 may acquire the operation information of each virtual operation unit 4 via the replication control device 3.
- the baseline generation unit P11 is an example of a “reference value generation unit”.
- the baseline generation unit P11 generates a baseline for each autoscale group based on the operation information acquired by the operation information acquisition unit P10.
- the baseline is a value serving as a reference for detecting a sign of performance deterioration of the virtual computing unit 4 (a sign of performance deterioration of the information system).
- the baseline has a predetermined width (upper limit value, lower limit value), and when the operation information does not fall within the predetermined width, it can be determined as a sign of performance degradation.
- the total amount baseline is a reference value calculated from the total amount (total value) of operation information of all the virtual operation units 4 in the auto scale group 5, and is calculated for each auto scale group.
- the total amount baseline is compared with the total amount of operation information of the virtual operation unit 4 in the auto scale group 5.
- the average baseline is a reference value calculated from the average of the operation information of each virtual operation unit 4 in the autoscale group 5, and is calculated for each autoscale group.
- the average baseline is compared with each of the operation information of each virtual operation unit 4 in the autoscale group 5.
- The performance deterioration sign detection unit P12 is an example of a “detection unit”. Hereinafter, it may be called the detection unit P12 or the sign detection unit P12.
- the performance deterioration sign detection unit P12 determines whether or not there is a sign of performance deterioration in the target virtual calculation unit 4 by comparing the operation information of the virtual calculation unit 4 and the baseline.
- The sign detection unit P12 compares the total amount baseline calculated for the autoscale group 5 with the total amount of operation information of all virtual operation units 4 in the autoscale group 5. When the total amount of operation information falls within the total amount baseline, the sign detection unit P12 determines that no sign of performance deterioration has been detected; when the total amount of operation information deviates from the total amount baseline, it determines that a sign of performance deterioration has been detected.
- The sign detection unit P12 compares the average baseline calculated for the autoscale group 5 with the operation information of each virtual operation unit 4 in the autoscale group 5. It determines that no sign of performance deterioration has been detected when the operation information of a virtual operation unit 4 falls within the average baseline, and that a sign of performance deterioration has been detected when the operation information deviates from the average baseline.
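- The two comparisons above can be sketched as follows. Only the total-versus-total-baseline check and the per-instance-versus-average-baseline check come from the description; the function name, return shape, and field names are hypothetical.

```python
def detect_signs(snapshot, baselines):
    """Check one operation-information snapshot of an autoscale group.

    `snapshot` maps container IDs to operation-information values;
    `baselines` holds (lower, upper) bounds for the "total" and "average"
    baselines. Returns a group-level verdict and per-container verdicts.
    """
    t_lo, t_hi = baselines["total"]
    a_lo, a_hi = baselines["average"]
    total = sum(snapshot.values())
    return {
        # total amount of operation information vs. total amount baseline
        "group_violation": not (t_lo <= total <= t_hi),
        # each instance's operation information vs. the average baseline
        "instance_violations": [
            cid for cid, value in snapshot.items()
            if not (a_lo <= value <= a_hi)
        ],
    }
```

A group can thus violate the total amount baseline while every individual container stays inside the average baseline, and vice versa, which is why the embodiment keeps both checks.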
- When the sign detection unit P12 detects a sign of performance deterioration, it transmits an alert to the terminal 6 used by a user such as a system administrator.
- the handling unit P13 performs a predetermined measure to deal with the detected sign of performance deterioration.
- the handling unit P13 instructs the replication control device 3 to perform scale-out when the total amount of operation information of each virtual computing unit 4 in the autoscale group 5 is out of the total amount baseline.
- the handling unit P13 instructs the replication control device 3 to add a predetermined number of virtual computing units 4 to the autoscale group 5 that has insufficient processing capability.
- The replication control device 3 generates a predetermined number of virtual operation units 4 using the image 40 corresponding to the scale-out target autoscale group 5, and adds them to that autoscale group 5.
- the handling unit P13 instructs the computer 2 provided with the virtual calculation unit 4 in which the sign is detected to redeploy.
- the instructed computer 2 discards the virtual operation unit 4 in which a sign of performance degradation has been detected, newly generates the virtual operation unit 4 from the same image 40 as the discarded virtual operation unit 4 and activates it.
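- A minimal sketch of the countermeasure dispatch described above, assuming hypothetical `scale_out` and `redeploy` callbacks standing in for the instructions the handling unit P13 sends to the replication control device 3 and to the computer 2:

```python
def handle_signs(result, scale_out, redeploy):
    """Dispatch the two countermeasures for a detection result.

    `result` has the shape produced by the detection step (a group-level
    flag and a list of violating container IDs). `scale_out(n_extra)`
    stands in for asking the replication controller to add containers;
    `redeploy(container_id)` for discarding one container and recreating
    it from the same image. Both callbacks are placeholders.
    """
    actions = []
    if result["group_violation"]:
        scale_out(1)                      # total amount baseline violated
        actions.append("scale-out")
    for cid in result["instance_violations"]:
        redeploy(cid)                     # average baseline violated
        actions.append(f"redeploy:{cid}")
    return actions
```

The choice to add one container per violation is arbitrary here; the patent only says a predetermined number is added.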
- a baseline can be generated from the operation information of each virtual operation unit 4 configuring the autoscale group.
- The management server 1 regards the virtual operation units 4 in the autoscale group 5, which is the unit of autoscale management, as a single pseudo virtual operation unit, and can thereby acquire the operation information necessary for generating a baseline. Since the autoscale group 5 consists of virtual operation units 4 generated from a common image 40, there is no problem in treating the virtual operation units 4 in the autoscale group 5 as one virtual operation unit.
- The management server 1 can generate a total amount baseline and an average baseline by regarding all the virtual operation units 4 constituting the autoscale group 5 as one virtual operation unit 4. Then, by comparing the total amount baseline with the total amount of operation information of the virtual operation units 4 in the autoscale group 5, the management server 1 can detect in advance whether an overload state or a processing-capacity shortage is occurring in the autoscale group 5.
- By comparing the average baseline with the operation information of each virtual operation unit 4 in the autoscale group 5, the management server 1 can individually detect whether a virtual operation unit 4 in the autoscale group 5 has stopped or has low processing capacity.
- By comparing the total amount baseline with the total operation information, the management server 1 can determine a sign of performance deterioration for each autoscale group, which is the management unit of the containers 4 generated from the same image 40. Furthermore, the management server 1 of the present embodiment can individually determine a sign of performance deterioration of each virtual operation unit 4 in the autoscale group 5 by comparing the average baseline with the operation information.
- Since the management server 1 instructs scale-out for an autoscale group 5 that violates the total amount baseline, the occurrence of performance degradation can be suppressed. Furthermore, since the management server 1 recreates a virtual operation unit 4 that violates the average baseline, this too suppresses performance degradation. Performance monitoring based on the total amount baseline and its countermeasure, and performance monitoring based on the average baseline and its countermeasure, may be performed individually, simultaneously, or at different times.
- FIG. 2 is a configuration diagram of the entire system including the information system and the management server 1 that manages the performance of the information system.
- The entire system includes, for example, at least one management server 1, at least one computer 2, at least one replication control device 3, a plurality of containers 4, and at least one autoscale group 5. The system can also include a terminal 6 used by a user such as a system administrator and a storage system 7 such as NAS (Network Attached Storage). Of the configurations shown in FIG. 2, at least the computer 2 and the replication control device 3 constitute the information system subject to performance management by the management server 1.
- the devices 1 to 3, 6, and 7 are connected to be capable of bidirectional communication via a communication network CN1 such as a LAN (Local Area Network) or the Internet.
- The container 4 is an example of the virtual operation unit 4 described in FIG. 1. To clarify the correspondence, the same reference numeral “4” is assigned to both the container and the virtual operation unit.
- the container 4 is a logical container created using container technology. In the following description, the container 4 may be referred to as a container instance 4.
- FIG. 3 is a diagram showing the configuration of the computer 2.
- the computer 2 includes, for example, a CPU (Central Processing Unit) 21, a memory 22, a storage device 23, a communication port 24, an input device 25, and an output device 26.
- a CPU Central Processing Unit
- the storage device 23 is formed from, for example, a hard disk drive or a flash memory, and stores an operating system, a library, an application program, and the like.
- The CPU 21 operates the container 4 by executing the computer program transferred from the storage device 23 to the memory 22, and manages the deployment and destruction of the container 4.
- the communication port 24 is for communicating with the management server 1 and the replication control device 3 via the communication network CN1.
- the input device 25 includes an information input device such as a keyboard and a touch panel, for example.
- the output device 26 includes an information output device such as a display, for example.
- the input device 25 may include a circuit that receives a signal from a device other than the information input device.
- the output device 26 may include a circuit that outputs a signal to a device other than the information output device.
- the container 4 operates as one of the processes.
- When the computer 2 receives an instruction from the replication control device 3 or the management server 1, it deploys or discards the container 4 based on the instruction. Furthermore, when the management server 1 instructs the computer 2 to acquire the operation information of a container 4, the computer 2 acquires the operation information and responds to the management server 1.
- FIG. 4 is a diagram showing the configuration of the duplication control device 3.
- the replication control device 3 can include, for example, a CPU 31, a memory 32, a storage device 33, a communication port 34, an input device 35, and an output device 36.
- A computer program and management information are stored in the storage device 33, which includes a hard disk drive and flash memory.
- Examples of the computer program include an alive monitoring program P30 and a scale management program P31.
- An example of the management information is an autoscale group table T30 for managing autoscale groups.
- the CPU 31 implements the function as the replication control device 3 by reading the computer program stored in the storage device 33 into the memory 32 and executing it.
- the communication port 34 is for communicating with each computer 2 and the management server 1 via the communication network CN1.
- the input device 35 is a device that receives input from a user or the like, and the output device 36 is a device that provides information to the user or the like.
- the autoscale group table T30 will be described with reference to FIG.
- the auto scale group table T30 is a table for managing the auto scale group 5 in the information system.
- Each table described below including this table T30 is a management table, but is simply referred to as a table.
- the autoscale group table T30 manages, for example, an autoscale group ID C301, a container ID C302, computer information C303, and a deployment argument C304 in association with each other.
- Auto scale group ID C301 is a column of identification information for uniquely identifying each auto scale group 5.
- the container ID C302 is a column of identification information for uniquely identifying each container 4.
- the computer information C303 is a column of identification information that uniquely identifies each computer 2.
- The deployment argument C304 is a column that holds the arguments used when the container 4 (container instance) is deployed.
- a record is created for each container.
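- For illustration, the table might be represented as follows; the field names and values are hypothetical, chosen only to mirror columns C301 to C304 with one record per container:

```python
# Hypothetical rows of the autoscale group table T30; one record per
# container, as described above. Fields mirror columns C301-C304.
autoscale_group_table = [
    {"group_id": "asg-01", "container_id": "c-001",
     "computer": "host-a", "deploy_args": "--image web:1 --mem 512m"},
    {"group_id": "asg-01", "container_id": "c-002",
     "computer": "host-b", "deploy_args": "--image web:1 --mem 512m"},
    {"group_id": "asg-02", "container_id": "c-003",
     "computer": "host-a", "deploy_args": "--image db:1 --mem 1g"},
]

def containers_in_group(table, group_id):
    """List the container IDs currently recorded for one autoscale group."""
    return [r["container_id"] for r in table if r["group_id"] == group_id]
```

Because containers in the same group are deployed from the same image, rows of one group share the same deployment arguments here, which is what allows a dead container to be redeployed from its recorded arguments.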
- FIG. 6 is a flowchart showing the processing of the alive monitoring program P30.
- the alive monitoring program P30 periodically checks the alive monitoring results for all of the containers 4 held in the autoscale group table T30.
- although the operation subject is described here as the alive monitoring program P30, the alive monitoring unit P30 or the replication control device 3 may be described as the operation subject instead.
- the alive monitoring program P30 confirms whether there is a container 4 whose aliveness has not yet been confirmed among the containers 4 held in the autoscale group table T30 (S300).
- when the alive monitoring program P30 determines that there is a container 4 whose aliveness has not been confirmed (S300: YES), it inquires of the computer 2 about the aliveness of that container 4 (S301). Specifically, the alive monitoring program P30 identifies the computer 2 to which the aliveness inquiry should be sent by referring to the container ID C302 column and the computer information C303 column of the autoscale group table T30.
- the alive monitoring program P30 inquires about the aliveness of the container 4 by polling the identified computer 2, explicitly specifying the container ID (S301).
- the alive monitoring program P30 determines whether there is a dead container 4, that is, whether there is a stopped container 4 (S302).
- when the alive monitoring program P30 finds a dead container 4 (S302: YES), it refers to the deployment-time argument C304 column of the autoscale group table T30 and redeploys the container using the argument set in that column (S303).
- the alive monitoring program P30 then returns to step S300 and determines whether there is a container 4 that has not yet been checked (S300). When the alive monitoring program P30 finishes the alive monitoring for all the containers 4 (S300: NO), this process ends.
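The monitoring loop described above (S300 to S303) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the table layout, the `poll_alive` callback, and the `deploy` callback are hypothetical stand-ins for the autoscale group table T30, the aliveness inquiry to the computer 2, and the redeployment instruction.

```python
# Minimal sketch of the alive monitoring loop (S300-S303), assuming the
# autoscale group table is held as a list of records and that hypothetical
# callbacks perform the aliveness poll and the redeployment.

def monitor_containers(autoscale_table, poll_alive, deploy):
    """Check every container once; redeploy any container found dead.

    autoscale_table: records with 'container_id', 'computer', 'deploy_arg'
    poll_alive(computer, container_id) -> bool   # aliveness inquiry (S301)
    deploy(computer, deploy_arg)                 # redeployment (S303)
    """
    redeployed = []
    for record in autoscale_table:                                  # S300 loop
        alive = poll_alive(record["computer"], record["container_id"])  # S301
        if not alive:                                               # S302: dead
            deploy(record["computer"], record["deploy_arg"])        # S303
            redeployed.append(record["container_id"])
    return redeployed
```

A dead container is thus recreated with the same deployment-time argument that produced it, which is why table T30 retains the argument C304 for every container.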
- FIG. 7 is a flowchart showing the processing of the scale management program P31.
- the scale management program P31 controls the configuration of the autoscale group 5 in accordance with instructions input from the management server 1 and the input device 35.
- although the scale management program P31 is described as the operation subject, the scale management unit P31 or the replication control device 3 may be described as the operation subject instead.
- the scale management program P31 receives a scale change instruction including an autoscale group ID and the number of scales (number of containers) (S310).
- the scale management program P31 compares the scale number N1 of the designated autoscale group 5 with the designated scale number N2 (S311). Specifically, the scale management program P31 refers to the autoscale group table T30, grasps the number of containers 4 operating in the designated autoscale group 5 as the current scale number N1, and compares it with the received scale number N2.
- the scale management program P31 determines whether the current scale number N1 differs from the received scale number N2 (S312). When the current scale number N1 matches the received scale number N2 (S312: NO), the scale management program P31 ends this process because the scale number does not need to be changed.
- when the scale numbers differ (S312: YES), the scale management program P31 determines whether the current scale number N1 is larger than the received scale number N2 (S313).
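The comparison in S310 to S313 can be sketched as a small decision function. This is an illustration only; the steps that follow S313 (the actual addition or removal of containers) are not shown in the text, so the action names below are assumptions.

```python
# Sketch of the scale-change decision (S310-S313), assuming that a larger
# current scale number means containers must be removed ("scale-in") and a
# smaller one means containers must be added ("scale-out").

def decide_scale_action(current_n1, requested_n2):
    """Compare the current scale number N1 with the requested scale number N2."""
    if current_n1 == requested_n2:   # S312: NO - no change needed
        return "none"
    if current_n1 > requested_n2:    # S313: YES - too many containers running
        return "scale-in"
    return "scale-out"               # S313: NO - too few containers running
```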
- FIG. 8 is a diagram showing the configuration of the management server 1.
- the management server 1 includes, for example, a CPU 11, a memory 12, a storage device 13, a communication port 14, an input device 15, and an output device 16.
- the communication port 14 is for communicating with each computer 2 and the replication control device 3 via the communication network CN1.
- the input device 15 is a device that receives input from the user, such as a keyboard and a touch panel.
- the output device 16 is a device that outputs information to be presented to the user, such as a display.
- the storage device 13 stores computer programs P11 to P13 and management tables T10 to T14.
- the computer programs include an operation information acquisition program P10, a baseline generation program P11, a performance deterioration sign detection program P12, and a countermeasure program P13.
- the management table includes a container operation information table T10, a total amount operation information table T11, an average operation information table T12, a total amount baseline table T13, and an average baseline table T14.
- the CPU 11 implements a predetermined function for performance management by reading the computer program stored in the storage device 13 into the memory 12 and executing it.
- FIG. 9 shows a container operation information table T10.
- the container operation information table T10 is a table for managing the operation information of each container 4.
- the container operation information table T10 manages, for example, time C101, autoscale group ID C102, container ID C103, CPU usage C104, memory usage C105, network usage C106, and IO usage C107 in association with each other.
- a record is created for each container.
- Time C101 is a column for storing the date and time when the operation information (CPU usage, memory usage, network usage, IO usage) is measured.
- the auto scale group ID C102 is a column for storing identification information for specifying the auto scale group 5 to which the measurement target container 4 belongs. In the drawing, the autoscale group may be referred to as “AS group”.
- the container ID C103 is a column for storing identification information for specifying the container 4 to be measured.
- the CPU usage amount C104 is a type of container operation information, and is a column for storing the amount (GHz) that the container 4 uses the CPU 21 of the computer 2.
- the memory usage amount C105 is an example of container operation information, and is a column for storing the amount (MB) in which the container 4 uses the memory 22 of the computer 2.
- the network usage amount C106 is a type of container operation information, and is a column for storing the amount (Mbps) that the container 4 communicates using the communication network CN1 (or another communication network not shown). In the figure, the network may be displayed as NW.
- the IO usage amount C107 is a type of container operation information, and is a column for storing the number of times (IOPS) that information is input to and output from the container 4.
- the container operation information C104 to C107 shown in FIG. 9 is an example, and the present embodiment is not limited to the illustrated container operation information. A part of the illustrated container operation information may be used, or operation information (not shown) may be newly added.
- the total amount operation information table T11 will be described with reference to FIG.
- the total amount operation information table T11 is a table that manages the total amount of operation information of all containers 4 in the autoscale group 5.
- the total amount operation information table T11 manages, for example, time C111, autoscale group ID C112, CPU usage C113, memory usage C114, network usage C115, and IO usage C116 in association with each other. In the total amount operation information table T11, a record is created for each measurement time and for each autoscale group.
- Time C111 is a column for storing the measurement date and time of operation information (CPU usage, memory usage, network usage, IO usage).
- the auto scale group ID C112 is a column for storing identification information for specifying the auto scale group 5 to be measured.
- the CPU usage amount C113 is a column for storing the total amount (GHz) that each container 4 in the autoscale group 5 uses the CPU 21 of the computer 2.
- the memory usage amount C114 is a column for storing the total amount (MB) in which each container 4 in the autoscale group 5 uses the memory 22 of the computer 2.
- the network usage amount C115 is a column for storing the total amount (Mbps) in which each container 4 in the autoscale group 5 communicates using the communication network CN1 (or another communication network not shown).
- the IO usage amount C116 is a column for storing the number of times of input information and output information (IOPS) of each container 4 in the autoscale group 5.
- the average operation information table T12 will be described with reference to FIG.
- the average operation information table T12 is a table that manages the average of the operation information of each container 4 in the autoscale group 5.
- a record is created for each measurement time and for each autoscale group.
- the average operation information table T12 manages, for example, time C121, autoscale group ID C122, CPU usage C123, memory usage C124, network usage C125, and IO usage C126 in association with each other.
- Time C121 is a column for storing measurement date and time of operation information (CPU usage, memory usage, network usage, IO usage).
- the auto scale group ID C122 is a column for storing identification information for specifying the auto scale group 5 to be measured.
- the CPU usage amount C123 is a column for storing an average value (GHz) in which each container 4 in the autoscale group 5 uses the CPU 21 of the computer 2.
- the memory usage C124 is a column for storing an average value (MB) in which each container 4 in the autoscale group 5 uses the memory 22 of the computer 2.
- the network usage amount C125 is a column for storing an average amount (Mbps) in which each container 4 in the autoscale group 5 communicates using the communication network CN1 (or another communication network not shown).
- the IO usage amount C126 is a column for storing the average number (IOPS) of input information and output information of each container 4 in the autoscale group 5.
- the total amount baseline table T13 will be described with reference to FIG.
- the total amount baseline table T13 is a table for managing the total amount baseline generated based on the total amount operation information.
- the total amount baseline table T13 manages, for example, the weekly cycle C131, autoscale group ID C132, CPU usage C133, memory usage C134, network usage C135, and IO usage C136 in association with each other.
- a record is created for each cycle and for each autoscale group.
- the weekly cycle C131 is a column for holding the weekly cycle of the baseline.
- for example, a total amount baseline is created for every Monday and for each autoscale group.
- Auto scale group ID C132 is a column for storing identification information for identifying the auto scale group 5 that is the subject of the baseline.
- the CPU usage amount C133 is a column for storing a baseline (GHz) of the total amount that each container 4 in the autoscale group 5 uses the CPU 21 of the computer 2.
- the memory usage amount C134 is a column for storing the baseline (MB) of the total amount that each container 4 in the autoscale group 5 uses the memory 22 of the computer 2.
- the network usage amount C135 is a column for storing the baseline (Mbps) of the total amount that each container 4 in the autoscale group 5 communicates using the communication network CN1 (or another communication network not shown).
- the IO usage amount C136 is a column for storing a baseline (IOPS) of the number of times of input information and output information of each container 4 in the autoscale group 5.
- the average baseline table T14 will be described with reference to FIG.
- the average baseline table T14 is a table that manages an average baseline generated based on the average of the operation information.
- a record is created for each cycle and for each autoscale group.
- the average baseline table T14 manages, for example, a weekly cycle C141, an autoscale group ID C142, a CPU usage C143, a memory usage C144, a network usage C145, and an IO usage C146 in association with each other.
- the weekly cycle C141 is a column that holds the weekly cycle of the average baseline.
- the autoscale group ID C142 is a column for storing identification information for specifying the autoscale group 5 that is the subject of the baseline.
- the CPU usage amount C143 is a column for storing an average baseline (GHz) in which each container 4 in the autoscale group 5 uses the CPU 21 of the computer 2.
- the memory usage C144 is a column for storing an average baseline (MB) in which each container 4 in the autoscale group 5 uses the memory 22 of the computer 2.
- the network usage C145 is a column for storing an average baseline (Mbps) in which each container 4 in the autoscale group 5 communicates using the communication network CN1 (or another communication network not shown).
- the IO usage amount C146 is a column for storing an average baseline (IOPS) of input information and output information of each container 4 in the autoscale group 5.
- FIG. 14 is a flowchart showing the process of the operation information acquisition program P10.
- the operation information acquisition program P10 acquires the operation information of the container 4 from the computer 2 periodically such as at a fixed time every week.
- although the operation subject is described as the operation information acquisition program P10, the operation information acquisition unit P10 or the management server 1 may be described as the operation subject instead.
- the operation information acquisition program P10 acquires information of the autoscale group table T30 from the replication control device 3 (S100).
- the operation information acquisition program P10 confirms whether there is a container from which operation information is not acquired among the containers 4 described in the autoscale group table T30 (S101).
- when there is a container 4 whose operation information has not been acquired (S101: YES), the operation information acquisition program P10 acquires the operation information of the container 4 from the computer 2, stores it in the container operation information table T10 (S102), and returns to step S100.
- when the operation information acquisition program P10 has acquired the operation information from all the containers 4 (S101: NO), it confirms whether there is an autoscale group 5 for which the predetermined statistical processing has not been performed (S103).
- the predetermined statistical processing is, for example, processing for calculating the total amount of each piece of operation information and processing for calculating the average of each piece of operation information.
- when there is an unprocessed autoscale group 5 (S103: YES), the operation information acquisition program P10 calculates the sum of the operation information of each container 4 included in that autoscale group 5 and saves it to the total amount operation information table T11 (S104). Furthermore, the operation information acquisition program P10 calculates the average of the operation information of each container 4 included in that autoscale group 5 and stores it in the average operation information table T12 (S105). Thereafter, the operation information acquisition program P10 returns to step S103.
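The statistical processing of S104 and S105 can be sketched as follows. The field names are illustrative shorthand for the table columns (CPU usage, memory usage, network usage, IO usage); the actual record layout is given by tables T10 to T12.

```python
# Sketch of S104-S105: summing and averaging the per-container operation
# information of one autoscale group, producing one record for the total
# amount operation information table T11 and one for the average operation
# information table T12.

def aggregate_group(records):
    """records: per-container operation info dicts for one autoscale group."""
    metrics = ["cpu", "memory", "network", "io"]
    total = {m: sum(r[m] for r in records) for m in metrics}    # -> table T11
    average = {m: total[m] / len(records) for m in metrics}     # -> table T12
    return total, average
```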
- FIG. 15 is a flowchart showing the processing of the baseline generation program P11.
- the baseline generation program P11 periodically generates a total amount baseline and an average baseline for each autoscale group.
- although the main operation subject is described as the baseline generation program P11, the baseline generation unit P11 or the management server 1 may be described as the operation subject instead.
- the baseline generation program P11 acquires information of the autoscale group table T30 from the replication control device 3 (S110). The baseline generation program P11 checks whether there is an autoscale group 5 that has not updated the baseline among the autoscale groups 5 (S111).
- when there is an autoscale group 5 whose baseline has not been updated (S111: YES), the baseline generation program P11 generates a total amount baseline using the operation information recorded in the total amount operation information table T11 and saves it to the total amount baseline table T13 (S112).
- the baseline generation program P11 also generates an average baseline using the operation information in the average operation information table T12, stores it in the average baseline table T14 (S113), and returns to step S111.
- when the baselines of all the autoscale groups 5 have been updated (S111: NO), the baseline generation program P11 ends this processing.
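Baseline generation (S112, S113) can be sketched as below. Following the description of FIGS. 12 and 13 later in the text, a baseline is taken here as the median of past samples for the same weekly cycle, with a width of plus or minus three standard deviations; the exact statistics used by the patent may differ, so treat this as one plausible reading.

```python
# Sketch of baseline generation: median +/- 3 sigma over past weekly samples
# of one metric, matching the description of the baseline width in FIGS. 12
# and 13 (a value below median - 3 sigma or above median + 3 sigma violates
# the baseline).
import statistics

def make_baseline(samples):
    """samples: past values of one metric for the same weekly cycle."""
    median = statistics.median(samples)
    sigma = statistics.pstdev(samples)          # population standard deviation
    return {"lower": median - 3 * sigma,
            "median": median,
            "upper": median + 3 * sigma}
```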
- FIG. 16 is a flowchart showing the processing of the performance deterioration sign detection program P12.
- the performance deterioration sign detection program P12 checks whether a sign of performance deterioration (performance failure) has occurred.
- the main subject of the operation will be described as the performance deterioration sign detection program P12, but instead, the performance deterioration sign detection unit P12 or the management server 1 may be described as the operation subject.
- the performance deterioration sign detection program P12 may be referred to as a sign detection program P12.
- the performance deterioration sign detection program P12 acquires information of the autoscale group table T30 from the replication control device 3 (S120). The sign detection program P12 checks whether or not there is an autoscale group 5 that has not determined a sign of performance deterioration among the autoscale groups 5 (S121).
- when there is an autoscale group 5 for which the sign of performance deterioration has not been determined (S121: YES), the sign detection program P12 compares the total amount baseline held in the total amount baseline table T13 with the total amount operation information held in the total amount operation information table T11 (S122).
- the total amount operation information may be abbreviated as “DT” and the median of the total amount baseline may be abbreviated as “BLT”.
- the sign detection program P12 confirms whether the value of the total amount operation information of the auto scale group 5 is within the range of the total amount baseline (S123). As shown in FIG. 12, the total amount baseline has a width of ⁇ 3 ⁇ with respect to its median value, for example. A value obtained by subtracting 3 ⁇ from the median is the lower limit, and a value obtained by adding 3 ⁇ to the median is the upper limit.
- the sign detection program P12 returns to step S121 when the value of the total amount operation information is within the range of the total amount baseline (S123: YES). When the value of the total amount operation information does not fall within the range of the total amount baseline (S123: NO), the sign detection program P12 issues a total amount baseline violation alert indicating that a sign of performance deterioration has been detected (S124), and returns to step S121.
- in other words, the sign detection program P12 monitors whether the value of the total amount operation information is outside the range of the total amount baseline (S123), and outputs an alert when it is outside that range (S124).
- the sign detection program P12 then compares the average baseline held in the average baseline table T14 with the operation information held in the container operation information table T10 (S126).
- the average operation information may be abbreviated as “DA” and the average baseline may be abbreviated as “BLA”.
- the sign detection program P12 confirms whether the value of the operation information of the container 4 is within the range of the average baseline (S127). As shown in FIG. 13, the average baseline has a width of ⁇ 3 ⁇ with respect to its median, for example. A value obtained by subtracting 3 ⁇ from the median is the lower limit, and a value obtained by adding 3 ⁇ to the median is the upper limit.
- the sign detection program P12 returns to step S125 when the value of the operation information is within the range of the average baseline (S127: YES). When the value of the operation information does not fall within the range of the average baseline (S127: NO), the sign detection program P12 issues an average baseline violation alert indicating that a sign of performance deterioration has been detected (S128), and returns to step S125.
- in other words, the sign detection program P12 monitors whether the value of the operation information is outside the range of the average baseline (S127), and outputs an alert when it is outside that range (S128).
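The two range checks above (S123 for the total amount, S127 for the per-container average) reduce to the same test: whether a value falls outside the baseline's lower and upper bounds. A minimal sketch, assuming a baseline is held as a dict with `lower` and `upper` keys and that the alert names are shorthand for the two alert types:

```python
# Sketch of the baseline checks in S123/S127: a value violates a baseline
# when it falls outside [median - 3 sigma, median + 3 sigma].

def violates_baseline(value, baseline):
    """baseline: dict with 'lower' and 'upper' bounds."""
    return not (baseline["lower"] <= value <= baseline["upper"])

def detect_signs(total_value, total_baseline, avg_value, avg_baseline):
    """Run both checks; the returned names stand in for the two alert types."""
    alerts = []
    if violates_baseline(total_value, total_baseline):
        alerts.append("total")     # total amount baseline violation alert (S124)
    if violates_baseline(avg_value, avg_baseline):
        alerts.append("average")   # average baseline violation alert (S128)
    return alerts
```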
- FIG. 17 is a flowchart showing the processing of the countermeasure program P13.
- the countermeasure program P13 receives an alert issued by the performance deterioration sign detection program P12 and implements a countermeasure that matches the alert.
- although the operation subject is described as the countermeasure program P13, the countermeasure unit P13 or the management server 1 may be described as the operation subject instead.
- the countermeasure program P13 receives the alert issued by the performance deterioration sign detection program P12 (S130).
- an alert for a total amount baseline violation may also be referred to as a total amount alert.
- an alert for an average baseline violation may also be referred to as an average alert.
- the countermeasure program P13 determines whether the received alerts include both a total amount baseline violation alert and an average baseline violation alert (S131). When the countermeasure program P13 receives both alerts at the same time (S131: YES), it implements predetermined measures to deal with each alert.
- the countermeasure program P13 issues a scale-out instruction to the replication control device 3 in order to respond to the total amount baseline violation alert (S132).
- when the replication control device 3 executes scale-out for the autoscale group 5 for which the total amount baseline violation alert was issued, a container 4 is newly added to the autoscale group 5, so that the processing capacity of the autoscale group is improved.
- the countermeasure program P13 instructs the computer 2 provided with the container 4 for which the alert was issued to recreate the container 4 in order to deal with the average baseline violation alert (S133).
- specifically, the countermeasure program P13 causes the computer 2 to newly generate a container 4 with the same argument (same image 40) as the container 4 that caused the alert, and then discards the container 4 that caused the alert.
- when the total amount baseline violation alert and the average baseline violation alert are not received at the same time (S131: NO), the countermeasure program P13 checks whether the alert received in step S130 is a total amount baseline violation alert (S134).
- when the alert received in step S130 is a total amount baseline violation alert (S134: YES), the countermeasure program P13 instructs the replication control device 3 to execute scale-out (S135).
- otherwise (S134: NO), the countermeasure program P13 checks whether the alert is an average baseline violation alert (S136).
- when the alert received in step S130 is an average baseline violation alert (S136: YES), the countermeasure program P13 requests the computer 2 to recreate the container 4. That is, as described in step S133, the countermeasure program P13 instructs the computer 2 to deploy a container with the same argument as the container that caused the average baseline violation alert, and further instructs the computer 2 to discard the container that caused the alert.
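The dispatch logic of S131 to S136 can be sketched as below. The two callbacks are hypothetical stand-ins for the scale-out instruction to the replication control device 3 and the container re-creation instruction to the computer 2; the alert names are the shorthand used above.

```python
# Sketch of the countermeasure dispatch (S131-S136): scale-out for a total
# amount alert (S132/S135), container re-creation for an average alert
# (S133), and both actions when both alerts arrive at once.

def handle_alerts(alerts, scale_out, recreate_container):
    """alerts: list containing "total" and/or "average"."""
    actions = []
    if "total" in alerts:            # total amount baseline violation
        scale_out()                  # instruction to replication control device 3
        actions.append("scale-out")
    if "average" in alerts:          # average baseline violation
        recreate_container()         # instruction to computer 2
        actions.append("recreate")
    return actions
```

Note that both branches run when both alert types are received simultaneously, matching the S131: YES path.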
- according to this embodiment, a baseline can be generated even in an information system in which the lifetime of the monitored container 4 (instance) is shorter than the baseline generation period, and using that baseline, a sign of performance deterioration can be detected and dealt with in advance.
- this is because, in creating the baseline, each container 4 belonging to the same autoscale group 5 is treated as substantially the same container 4. A baseline for predicting performance degradation can therefore be obtained even though individual containers are short-lived. Since signs of performance degradation of the information system can thus be detected, reliability is improved.
- each container 4 in the same autoscale group 5 can be regarded as the same container from the viewpoint of creating a baseline.
- a sign of performance degradation can be detected in at least one of or both of the auto scale group unit and the container unit.
- a measure suitable for the sign can be automatically implemented, so that the deterioration of performance can be suppressed in advance and the reliability is improved.
- in the present embodiment, the replication control device 3 and the management server 1 are configured as separate computers. Instead, a configuration in which the processing of the replication control device and the processing of the management server are executed in the same computer may be adopted.
- the container 4 that is a logical existence is the monitoring target, but the monitoring target is not limited to the container 4 and may be a virtual server or a physical server (bare metal).
- the deployment in the physical server is started using an OS image on the image management server using a network boot mechanism such as PXE (Preboot Execution Environment).
- in the present embodiment, the operation information to be monitored is CPU usage, memory usage, network usage, and IO usage, but the type of operation information is not limited to these; any other type of information that can be acquired as operation information may be used.
- the second embodiment will be described with reference to FIGS.
- Each of the following embodiments including the present embodiment corresponds to a modification of the first embodiment, and therefore, differences from the first embodiment will be mainly described.
- a group for creating a baseline is managed in consideration of the performance difference between the computers 2 in which the containers 4 are provided.
- FIG. 18 shows a configuration example of the management server 1A of the present embodiment.
- the management server 1A of the present embodiment has substantially the same configuration as the management server 1 described in FIG. 8, but the computer programs P10A, P11A, and P12A stored in the storage device 13 differ from the computer programs P10, P11, and P12. Furthermore, the management server 1A of the present embodiment holds the group generation program P14, the computer table T15, and the grade-specific group table T16 in the storage device 13.
- FIG. 19 shows the configuration of a computer table T15 that manages the grade of each computer 2 in the information system.
- the computer table T15 is configured, for example, by associating a column C151 that stores computer information that uniquely identifies the computer 2 and a column C152 that stores a grade representing the performance of the computer 2.
- a record is created for each computer.
- FIG. 20 shows a configuration of a group table by grade T16 for managing the computers 2 in the same autoscale group 5 by dividing them according to their grades.
- the group by grade is a virtual autoscale group formed by classifying the computers 2 belonging to the same autoscale group 5 by grade.
- the grade-specific group table T16 manages, for example, a group ID C161, an autoscale group ID C162, a container ID C163, computer information C164, and an argument C165 at the time of deployment.
- the group ID C161 is identification information that uniquely identifies a group by grade existing in the autoscale group 5.
- the autoscale group ID C162 is identification information that uniquely identifies the autoscale group 5.
- the container ID C163 is identification information that uniquely identifies the container 4.
- the computer information C164 is information for specifying the computer 2 in which the container 4 is provided.
- the argument C165 at the time of deployment is management information used when the container 4 specified by the container ID C163 is created again. In the grade group table T16, a record is created for each container.
- FIG. 21 is a flowchart showing the processing of the group generation program P14.
- although the operation subject is described as the group generation program P14, the group generation unit P14 or the management server 1A may be described as the operation subject instead.
- the group generation program P14 acquires information of the autoscale group table T30 from the replication control device 3 (S140). The group generation program P14 checks whether there is an autoscale group 5 that has not generated a grade-specific group among the autoscale groups 5 (S141).
- when there is an autoscale group 5 for which grade-specific group generation has not been performed (S141: YES), the group generation program P14 determines whether containers 4 provided on computers 2 of different grades exist in that autoscale group 5 (S142). Specifically, the group generation program P14 determines whether containers using computers of different grades exist in the same autoscale group by collating the computer information column C303 of the autoscale group table T30 with the computer information column C151 of the computer table T15 (S142).
- when there is no container 4 using a computer 2 of another grade in the same autoscale group (S142: NO), the group generation program P14 generates a grade-specific group whose grouping matches the autoscale group (S144). In step S144, a grade-specific group is generated formally, but its substance is the same as the autoscale group. Conversely, when containers 4 provided on computers 2 of different grades exist in the same autoscale group (S142: YES), the group generation program P14 generates grade-specific groups by dividing the autoscale group according to grade.
- the group generation program P14 then returns to step S141 and checks whether there is any autoscale group 5 for which grade-specific group generation has not been performed. When the group generation program P14 has performed grade-specific group generation for all the autoscale groups 5 (S141: NO), the processing ends.
- for example, the containers 4 having the container IDs “Cont001” and “Cont002” have the same autoscale group ID “AS01” and the same computer 2 grade “Gold”. Accordingly, the two containers 4 “Cont001” and “Cont002” both belong to the same grade-specific group “AS01a”.
- the two containers (Cont003 and Cont004) included in the autoscale group “AS02” have different grades of the computer 2 respectively.
- the grade of the computer (C1) provided with one container (Cont003) is “Gold”, but the grade of the computer (C3) provided with the other container (Cont004) is “Silver”.
- the autoscale group “AS02” is therefore virtually divided into the groups “AS02a” and “AS02b” according to grade. Baseline generation and detection of signs of performance degradation are performed in units of these grade-divided autoscale groups.
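The grouping illustrated above can be sketched as follows. This is a minimal illustration under stated assumptions: the suffixing scheme (`a`, `b`, …) mirrors the example names AS02a and AS02b, and grades are ordered alphabetically here only so that the output is deterministic — the patent does not specify how suffixes are assigned.

```python
# Sketch of grade-specific group generation (FIG. 21): containers in one
# autoscale group are split into virtual groups by the grade of the
# computer on which they run, e.g. "AS02" -> "AS02a" (Gold), "AS02b" (Silver).
from itertools import groupby

def split_by_grade(as_group_id, containers, computer_grade):
    """containers: list of (container_id, computer); computer_grade: dict
    mapping computer -> grade (cf. computer table T15)."""
    keyed = sorted(containers, key=lambda c: computer_grade[c[1]])
    groups = {}
    for i, (grade, members) in enumerate(
            groupby(keyed, key=lambda c: computer_grade[c[1]])):
        # suffix a, b, ... is an assumption based on the AS02a/AS02b example
        groups[f"{as_group_id}{chr(ord('a') + i)}"] = [m[0] for m in members]
    return groups
```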
- this embodiment, configured in this way, also has the same functions and effects as the first embodiment.
- a group for each computer grade is virtually generated in the same autoscale group, and a baseline or the like is generated for each autoscale group for each grade.
- a total amount baseline and an average baseline can be generated from a group of containers operating on a computer having uniform performance.
- as a result, even in an information system composed of computers with non-uniform performance, and even in an environment in which the lifetime of the monitored container is shorter than the baseline generation period, a baseline can be generated, a sign of performance deterioration can be detected, and the sign can be dealt with in advance.
- a third embodiment will be described with reference to FIG. In the present embodiment, a case where operation information and the like are taken over between sites will be described.
- FIG. 22 is an overall view of a failover system in which a plurality of information systems are connected in a switchable manner.
- the primary site ST1 used during normal operation and the secondary site ST2 used during an abnormality are connected via the inter-site network CN2. Since the configuration in each site is basically the same, description thereof is omitted.
- the secondary site ST2 can also be provided with the same container group as that operated at the primary site ST1 from the normal time (hot standby). Alternatively, the secondary site ST2 can also activate the same container group that was operating at the primary site ST1 when a failure occurs (cold standby).
- the container operation information table T10 and the like are transmitted from the management server 1 of the primary site ST1 to the management server 1 of the secondary site ST2.
- using the received information, the management server 1 of the secondary site ST2 can generate baselines and detect signs of performance deterioration without first having to accumulate operation information itself.
- if, in addition to the container operation information table T10, the total amount operation information table T11, the average operation information table T12, the total amount baseline table T13, and the average baseline table T14 are also transmitted from the primary site ST1 to the secondary site ST2, the processing load on the management server 1 of the secondary site ST2 can be reduced.
- this embodiment, configured in this way, also has the same functions and effects as the first embodiment. Furthermore, by applying this embodiment to a failover system, monitoring for signs of performance degradation can be started quickly at the time of failover, and reliability is improved.
- the container operation information table T10 of the secondary site ST2 is transferred from the management server 1 of the secondary site ST2 to the management server 1 of the primary site ST1. Etc. can also be transmitted. Thereby, even when switching to the primary site ST1, it is possible to start detecting signs of performance deterioration at an early stage.
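The inter-site takeover described above can be sketched as follows. This is a minimal illustration assuming a simple in-memory representation of tables T10–T14; the `ManagementServer` class and `send_tables` method are hypothetical names introduced for the sketch, not from the patent.

```python
import copy

class ManagementServer:
    """Hypothetical stand-in for the management server 1 at one site."""

    def __init__(self, site):
        self.site = site
        # T10: per-container operation info; T11/T12: total/average
        # operation info; T13/T14: total/average baselines.
        self.tables = {"T10": {}, "T11": {}, "T12": {}, "T13": {}, "T14": {}}

    def send_tables(self, peer, table_ids):
        """Copy the named tables to the peer site's management server.

        Sending T11-T14 along with T10 spares the standby server from
        recomputing totals, averages, and baselines, reducing its load.
        """
        for tid in table_ids:
            peer.tables[tid] = copy.deepcopy(self.tables[tid])

primary = ManagementServer("ST1")
secondary = ManagementServer("ST2")
primary.tables["T10"] = {"web-01": {"cpu": [40, 42, 41]}}
primary.tables["T13"] = {"web-group": {"cpu_upper": 85.0}}

# Push both raw operation info and precomputed baselines before failover,
# so ST2 can begin sign detection immediately after switching over.
primary.send_tables(secondary, ["T10", "T11", "T12", "T13", "T14"])
print(secondary.tables["T13"]["web-group"]["cpu_upper"])  # 85.0
```

The reverse direction (ST2 to ST1 before switching back) is the same call with the roles swapped.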
- Each of the above-described embodiments describes the present invention in an easy-to-understand manner, and the present invention need not include all of the configurations described in the embodiments. At least part of the configuration described in an embodiment can be changed to another configuration or deleted, and a new configuration can be added to an embodiment.
- Some or all of the functions and processes described in the embodiments may be realized as a hardware circuit or as software.
- the computer program and various data may be stored not only in the storage device in the computer but also in a storage device outside the computer.
- 1, 1A: Management server (management computer), 2: Computer, 3: Replication control device, 4: Container (virtual operation unit), 5: Autoscale group, 40: Image, P10: Operation information acquisition unit, P11: Baseline generation unit, P12: Performance degradation sign detection unit, P13: Countermeasure unit
Claims (15)
- A management computer that detects and manages signs of performance degradation of an information system including one or more computers and one or more virtual operation units virtually provided on the computers, the management computer comprising: an operation information acquisition unit that acquires operation information from all virtual operation units belonging to an autoscale group, the autoscale group being a management unit of autoscaling for automatically adjusting the number of virtual operation units; a reference value generation unit that generates, for each autoscale group, a reference value for detecting a sign of performance degradation from each piece of operation information acquired by the operation information acquisition unit; and a detection unit that detects a sign of performance degradation of each virtual operation unit from the reference value generated by the reference value generation unit and the operation information of the virtual operation unit acquired by the operation information acquisition unit.
- The management computer according to claim 1, wherein the reference value generation unit generates, for each autoscale group, an average reference value as the reference value, based on the average of the operation information of all virtual operation units belonging to the autoscale group.
- The management computer according to claim 2, wherein the detection unit detects a sign of performance degradation by comparing, for each autoscale group, the operation information of each virtual operation unit belonging to the autoscale group with the average reference value.
- The management computer according to claim 3, further comprising a countermeasure unit that handles performance degradation whose sign has been detected, wherein, when the detection unit determines that a sign of performance degradation has been detected for a virtual operation unit whose operation information deviates from the average reference value among all virtual operation units in the autoscale group, that virtual operation unit is restarted.
- The management computer according to claim 4, wherein the reference value generation unit generates, for each autoscale group, a total reference value as the reference value, based on the total amount of the operation information of all virtual operation units belonging to the autoscale group.
- The management computer according to claim 5, wherein the detection unit detects a sign of performance degradation by comparing, for each autoscale group, the total amount of the operation information of all virtual operation units belonging to the autoscale group with the total reference value.
- The management computer according to claim 6, further comprising a countermeasure unit that handles performance degradation whose sign has been detected, wherein, when the detection unit detects a sign of performance degradation because the total amount of the operation information deviates from the total reference value, the countermeasure unit instructs execution of scale-out.
- The management computer according to claim 1, wherein the reference value generation unit generates, for each autoscale group, either a total reference value as the reference value based on the total amount of the operation information of all virtual operation units belonging to the autoscale group, or an average reference value as the reference value based on the average of the operation information of all virtual operation units belonging to the autoscale group; the detection unit detects a sign of performance degradation by comparing, for each autoscale group, either the total amount of the operation information of all virtual operation units belonging to the autoscale group with the total reference value, or the operation information of each virtual operation unit belonging to the autoscale group with the average reference value; the management computer further comprises a countermeasure unit that handles performance degradation whose sign has been detected; and the countermeasure unit instructs execution of scale-out when the detection unit detects a sign of performance degradation because the total amount of the operation information deviates from the total reference value, and restarts a virtual operation unit when the detection unit determines that a sign of performance degradation has been detected for that virtual operation unit because its operation information deviates from the average reference value among all virtual operation units in the autoscale group.
- The management computer according to any one of claims 1 to 8, wherein the virtual operation units in the autoscale group are generated from the same startup management information.
- The management computer according to any one of claims 1 to 8, wherein, when computers of different performance are included in the autoscale group, the reference value generation unit generates a reference value for detecting a sign of performance degradation for each group of computers of the same performance within the autoscale group.
- The management computer according to claim 10, wherein at least the reference value is transmitted to a management computer of another site before failover starts.
- A performance degradation sign detection method in which a management computer detects and manages signs of performance degradation of an information system including one or more computers and one or more virtual operation units virtually provided on the computers, the management computer executing the steps of: acquiring operation information from all virtual operation units belonging to an autoscale group, the autoscale group being a management unit of autoscaling for automatically adjusting the number of virtual operation units; generating, for each autoscale group, a reference value for detecting a sign of performance degradation from each piece of the acquired operation information; and detecting a sign of performance degradation of each virtual operation unit from the generated reference value and the acquired operation information of the virtual operation unit.
- The method according to claim 12, further comprising a step of handling performance degradation whose sign has been detected.
- The method according to claim 13, wherein the step of generating the reference value generates, for each autoscale group, a total reference value as the reference value based on the total amount of the operation information of all virtual operation units belonging to the autoscale group; the step of detecting a sign of performance degradation compares, for each autoscale group, the total amount of the operation information of all virtual operation units belonging to the autoscale group with the total reference value; and the step of handling the performance degradation instructs execution of scale-out when the total amount of the operation information deviates from the total reference value and a sign of performance degradation is detected.
- The method according to claim 13, wherein the step of generating the reference value generates, for each autoscale group, an average reference value as the reference value based on the average of the operation information of all virtual operation units belonging to the autoscale group; the step of detecting a sign of performance degradation compares, for each autoscale group, the operation information of each virtual operation unit belonging to the autoscale group with the average reference value; and the step of handling the performance degradation restarts a virtual operation unit when a sign of performance degradation is detected for that virtual operation unit because its operation information deviates from the average reference value among all virtual operation units in the autoscale group.
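The detection and countermeasure scheme recited in the claims can be sketched as follows: per autoscale group, the total of the operation information is compared with a total reference value (deviation triggers scale-out), and each unit's operation information is compared with an average reference value (deviation triggers a restart of that unit). This is an illustrative sketch with assumed names (`detect`, `tolerance`) and a fixed tolerance band; the patent derives the reference values as baselines from historical operation information rather than taking them as given.

```python
def detect(group_metrics, total_ref, avg_ref, tolerance=0.25):
    """Detect signs of performance degradation for one autoscale group.

    group_metrics maps virtual operation unit (container) name -> current
    metric value (e.g. response time). Returns countermeasures:
    ("scale_out", None) when the group total deviates from the total
    reference value, and ("restart", unit) for each unit whose own value
    deviates from the average reference value.
    """
    actions = []
    total = sum(group_metrics.values())
    if total > total_ref * (1 + tolerance):
        # Group-wide overload: instruct execution of scale-out (cf. claim 7).
        actions.append(("scale_out", None))
    for unit, value in group_metrics.items():
        if value > avg_ref * (1 + tolerance):
            # One unit deviates from the group average: restart just that
            # unit (cf. claim 4).
            actions.append(("restart", unit))
    return actions

metrics = {"c1": 10.0, "c2": 11.0, "c3": 30.0}
print(detect(metrics, total_ref=33.0, avg_ref=11.0))
# [('scale_out', None), ('restart', 'c3')]
```

Note the two comparisons are independent: a group can trigger scale-out without any single unit being restarted, and vice versa.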
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2016/059801 WO2017168484A1 (ja) | 2016-03-28 | 2016-03-28 | Management computer and performance degradation sign detection method |
US15/743,516 US20180203784A1 (en) | 2016-03-28 | 2016-03-28 | Management computer and performance degradation sign detection method |
JP2018507814A JP6578055B2 (ja) | 2016-03-28 | 2016-03-28 | Management computer and performance degradation sign detection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2016/059801 WO2017168484A1 (ja) | 2016-03-28 | 2016-03-28 | Management computer and performance degradation sign detection method |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2017168484A1 true WO2017168484A1 (ja) | 2017-10-05 |
Family
ID=59963587
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2016/059801 WO2017168484A1 (ja) | 2016-03-28 | 2016-03-28 | 管理計算機および性能劣化予兆検知方法 |
Country Status (3)
Country | Link |
---|---|
US (1) | US20180203784A1 (ja) |
JP (1) | JP6578055B2 (ja) |
WO (1) | WO2017168484A1 (ja) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- JP2011243162A (ja) * | 2010-05-21 | 2011-12-01 | Mitsubishi Electric Corp | Unit-count control device, unit-count control method, and unit-count control program |
- JP2012208781A (ja) * | 2011-03-30 | 2012-10-25 | Internatl Business Mach Corp <Ibm> | Information processing system, information processing apparatus, scaling method, program, and recording medium |
- JP2014078166A (ja) * | 2012-10-11 | 2014-05-01 | Fujitsu Frontech Ltd | Information processing apparatus, log output control method, and log output control program |
- JP2014219859A (ja) * | 2013-05-09 | 2014-11-20 | Nippon Telegraph and Telephone Corp | Distributed processing system and distributed processing method |
- JP2014229253A (ja) * | 2013-05-27 | 2014-12-08 | NTT Data Corp | Machine management system, management server, machine management method, and program |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- KR20120071205A (ko) * | 2010-12-22 | 2012-07-02 | Electronics and Telecommunications Research Institute | Method for operating a virtual machine server and nodes, and apparatus therefor |
- JP6248560B2 (ja) * | 2013-11-13 | 2017-12-20 | Fujitsu Ltd | Management program, management method, and management device |
- JP6440203B2 (ja) * | 2015-09-02 | 2018-12-19 | KDDI Corp | Network monitoring system, network monitoring method, and program |
US10521315B2 (en) * | 2016-02-23 | 2019-12-31 | Vmware, Inc. | High availability handling network segmentation in a cluster |
- 2016
- 2016-03-28 US US15/743,516 patent/US20180203784A1/en not_active Abandoned
- 2016-03-28 WO PCT/JP2016/059801 patent/WO2017168484A1/ja active Application Filing
- 2016-03-28 JP JP2018507814A patent/JP6578055B2/ja active Active
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- JP2021504801A (ja) | 2017-11-24 | 2021-02-15 | Amazon Technologies, Inc. | Auto-scaling hosted machine learning models for production inference |
- US11126927B2 (en) | 2017-11-24 | 2021-09-21 | Amazon Technologies, Inc. | Auto-scaling hosted machine learning models for production inference |
- JP7024157B2 (ja) | 2017-11-24 | 2022-02-24 | Amazon Technologies, Inc. | Auto-scaling hosted machine learning models for production inference |
- JP2020135336A (ja) | 2019-02-19 | 2020-08-31 | NEC Corp | Monitoring system, monitoring method, and monitoring program |
- JP7286995B2 (ja) | 2019-02-19 | 2023-06-06 | NEC Corp | Monitoring system, monitoring method, and monitoring program |
- JP7457435B2 (ja) | 2019-09-09 | 2024-03-28 | International Business Machines Corporation | Deployment of distributed systems |
- JP2021051452A (ja) | 2019-09-24 | 2021-04-01 | NEC Corp | Monitoring device, monitoring method, and program |
- JP7331581B2 (ja) | 2019-09-24 | 2023-08-23 | NEC Corp | Monitoring device, monitoring method, and program |
- WO2023084777A1 (ja) | 2021-11-15 | 2023-05-19 | Nippon Telegraph and Telephone Corp | Scaling management device, scaling management method, and program |
Also Published As
Publication number | Publication date |
---|---|
JPWO2017168484A1 (ja) | 2018-07-12 |
JP6578055B2 (ja) | 2019-09-18 |
US20180203784A1 (en) | 2018-07-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
- JP6578055B2 (ja) | Management computer and performance degradation sign detection method | |
- CN109815049B (zh) | Node downtime recovery method and apparatus, electronic device, and storage medium | |
- JP5834939B2 (ja) | Program, virtual machine control method, information processing apparatus, and information processing system | |
- US9652332B2 (en) | Information processing apparatus and virtual machine migration method | |
- JP5305040B2 (ja) | Server computer switching method, management computer, and program | |
- US9229840B2 (en) | Managing traces to capture data for memory regions in a memory | |
- US11157373B2 (en) | Prioritized transfer of failure event log data | |
- JP2017201470A (ja) | Setting support program, setting support method, and setting support device | |
- CN114564284B (zh) | Virtual machine data backup method, computer device, and storage medium | |
- Huang et al. | Metastable failures in the wild | |
- CN108292342A (zh) | Notification of intrusion into firmware | |
- JP6124644B2 (ja) | Information processing apparatus and information processing system | |
- CN108964992B (zh) | Node failure detection method and apparatus, and computer-readable storage medium | |
- TWI652622B (zh) | Electronic computing device, method for adjusting the trigger mechanism of a memory reclamation function, and computer program product thereof | |
- TWI469573B (zh) | System error handling method and server system using the same | |
- CN114327662A (zh) | Operating system processing method and apparatus, storage medium, and processor | |
- JP7275922B2 (ja) | Information processing device, anomaly detection method, and program | |
- CN111090491A (zh) | Virtual machine task state recovery method and apparatus, and electronic device | |
- CN117971564B (zh) | Data recovery method and apparatus, computer device, and storage medium | |
- JP5791524B2 (ja) | OS operation device and OS operation program | |
- TWI715005B (zh) | Method for monitoring resident processes of a baseboard management controller | |
- CN106951306B (zh) | STW detection method, apparatus, and device | |
- Ha et al. | {RL-Watchdog}: A Fast and Predictable {SSD} Liveness Watchdog on Storage Systems | |
- JP2011141786A (ja) | CPU monitoring device and CPU monitoring program | |
- CN113608825A (zh) | Virtual machine high-availability migration control method, system, terminal, and storage medium | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 15743516 Country of ref document: US |
|
ENP | Entry into the national phase |
Ref document number: 2018507814 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 16896704 Country of ref document: EP Kind code of ref document: A1 |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 16896704 Country of ref document: EP Kind code of ref document: A1 |