WO2021184588A1 - Cluster optimization method, device, server, and medium - Google Patents

Cluster optimization method, device, server, and medium

Info

Publication number
WO2021184588A1
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
abnormal
stability
utilization rate
usage rate
Prior art date
Application number
PCT/CN2020/099299
Other languages
English (en)
French (fr)
Inventor
王成成
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021184588A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3089 Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 Performance evaluation by tracing or monitoring

Definitions

  • This application relates to the field of cloud computing technology, and in particular to a cluster optimization method, device, server, and medium.
  • The inventor has realized that existing technical solutions cannot respond appropriately to failures of clusters in a cloud environment. In addition, when clusters in different cloud environments suffer the same failure, existing solutions cannot handle it in a unified way.
  • the first aspect of the present application provides a cluster optimization method, and the cluster optimization method includes:
  • the abnormal cluster is processed according to the abnormal type.
  • a second aspect of the present application provides a server including a processor and a memory, and the processor is configured to execute computer-readable instructions stored in the memory to implement the following steps:
  • the abnormal cluster is processed according to the abnormal type.
  • a third aspect of the present application provides a computer-readable storage medium having at least one computer-readable instruction stored thereon, and the at least one computer-readable instruction is executed by a processor to implement the following steps:
  • the abnormal cluster is processed according to the abnormal type.
  • a fourth aspect of the present application provides a cluster optimization device, and the cluster optimization device includes:
  • the collection unit is used to collect monitoring data of all clusters in at least one cloud environment within a preset time
  • the processing unit is used to normalize the monitoring data of each cluster to obtain at least one indicator item of each cluster;
  • the calculation unit is configured to calculate the stability of each cluster and the utilization rate of each cluster according to the at least one indicator item;
  • the determining unit is used to determine the abnormal cluster and the abnormal type of the abnormal cluster according to the stability of each cluster and the usage rate of each cluster;
  • the processing unit is further configured to process the abnormal cluster according to the abnormal type.
  • The present application can not only handle failures of clusters in a cloud environment, but can also handle a failure uniformly when clusters in different cloud environments suffer the same failure.
  • Fig. 1 is a flowchart of a preferred embodiment of a cluster optimization method disclosed in the present application.
  • Fig. 2 is a functional module diagram of a preferred embodiment of a cluster optimization device disclosed in the present application.
  • FIG. 3 is a schematic diagram of the structure of a server in a preferred embodiment of the cluster optimization method according to the present application.
  • FIG. 1 is a flowchart of a preferred embodiment of the cluster optimization method of the present application. Depending on requirements, the order of the steps in the flowchart may be changed, and some steps may be omitted.
  • the cluster optimization method is applied to one or more servers.
  • The server is a device that can automatically perform numerical calculation and/or information processing in accordance with preset or stored instructions. Its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), embedded devices, etc.
  • The server can be any electronic product capable of human-computer interaction with a user, for example, a personal computer, a tablet computer, a smartphone, a personal digital assistant (PDA), a game console, an Internet Protocol Television (IPTV) device, a smart wearable device, etc.
  • the server may also include network equipment and/or user equipment.
  • the network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing.
  • The network where the server is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (VPN), etc.
  • S10 Collect monitoring data of all clusters in at least one cloud environment within a preset time.
  • The monitoring data is data extracted from alarm information, and includes, but is not limited to: the alarm time, the alarm level, the number of alarms corresponding to the alarm level, the risk factor corresponding to the alarm level, the total number of cluster instances, the utilization rate of each instance, the root cause of the alarm, etc.
  • the method before collecting monitoring data of all clusters in at least one cloud environment within a preset time, the method further includes:
  • The server obtains the alarm information of all the clusters within the preset time and performs word segmentation on the alarm information to obtain multiple pieces of first information. Further, the server cleans the first information to obtain multiple pieces of second information, uses the TF-IDF algorithm to calculate the probability of each piece of second information, and determines the monitoring data according to those probabilities.
  • In this way, the occurrence time of the monitoring data can be controlled. This avoids too long an interval between the time at which the cluster stability or cluster utilization rate is calculated and the time at which the data occurred, and therefore avoids an inaccurate result for the current cluster stability or current cluster utilization rate.
  • the server cleans the configuration information in the first information to obtain the plurality of second information.
  • the configuration information includes, but is not limited to: function words, stop words, etc.
  • By cleaning the configuration information, the server does not need to calculate the probability of the configuration information in the multiple pieces of first information, which shortens the calculation time and allows the monitoring data to be determined quickly.
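  • The segmentation, cleaning, and TF-IDF scoring steps described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the stop-word list, the sample alarm strings, and the function names `segment`, `clean`, and `tf_idf` are all assumptions, and the score is the textbook TF × log(N/DF) form rather than a probability defined by the application.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list standing in for the "configuration information".
STOP_WORDS = {"the", "an", "of", "on", "is", "in", "at", "to"}


def segment(alarm: str) -> list[str]:
    """Word segmentation: lowercase and split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", alarm.lower()) if t]


def clean(tokens: list[str]) -> list[str]:
    """Cleaning: drop stop words so that they are never scored."""
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Score every term of every document with TF * log(N / DF)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Made-up alarm messages, used only to exercise the functions.
alarms = [
    "Level 1 alarm on cluster A: disk usage is high",
    "Level 2 alarm on cluster B: memory usage is high",
    "Level 1 alarm on cluster A: node is unreachable",
]
second_info = [clean(segment(a)) for a in alarms]
scores = tf_idf(second_info)
```

  • Terms that occur in every alarm (such as "alarm") score zero, while terms specific to one alarm (such as "disk") score highest, which is one plausible way to pick out the fields worth keeping as monitoring data.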
  • S11 Perform normalization processing on the monitoring data of each cluster to obtain at least one indicator item of each cluster.
  • The at least one indicator item is the basic information for calculating the stability or utilization rate of the cluster, and includes, but is not limited to: the alarm level, the number of alarms corresponding to the alarm level, the risk factor corresponding to the alarm level, the total number of cluster instances, the utilization rate of each instance, etc.
  • the server performs normalization processing on the monitoring data of each cluster to obtain at least one index item of each cluster including:
  • The server uses a de-redundancy algorithm to remove redundant content from the monitoring data to obtain target data. Further, the server uses a shallow semantic analysis method to identify the target data, and processes the identified results that have similar meanings to obtain the at least one index item.
  • In this way, de-redundancy of the monitoring data reduces the memory occupied on the server. At the same time, processing the similar results that remain after de-redundancy gives the monitoring data of every cluster consistent names, which makes the subsequent unified calculation of cluster stability and utilization rate easier.
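  • A rough sketch of the two normalization steps follows, assuming that the shallow semantic analysis can be approximated by a fixed alias table. The table `ALIASES`, the field names, and the sample records are hypothetical; a real system would need an actual semantic-similarity method to discover which names mean the same thing.

```python
# Hypothetical alias table: maps field names seen in different cloud
# environments to one canonical indicator name.
ALIASES = {
    "severity": "alarm_level",
    "alert_level": "alarm_level",
    "alarm_level": "alarm_level",
    "num_instances": "total_instances",
    "instance_count": "total_instances",
    "total_instances": "total_instances",
}


def remove_redundant(records: list[dict]) -> list[dict]:
    """De-redundancy: keep only the first occurrence of identical records."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def unify_names(records: list[dict]) -> list[dict]:
    """Rename fields with similar meaning to one canonical indicator name."""
    return [{ALIASES.get(k, k): v for k, v in r.items()} for r in records]


# Made-up monitoring records from two differently named sources.
raw = [
    {"severity": 1, "num_instances": 2},
    {"severity": 1, "num_instances": 2},      # exact duplicate, dropped
    {"alert_level": 2, "total_instances": 2},
]
indicators = unify_names(remove_redundant(raw))
```

  • After both steps every record uses the same two indicator names, so the later stability and utilization calculations can treat all clusters alike.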
  • S12 Calculate the stability of each cluster and the utilization rate of each cluster according to the at least one indicator item.
  • the calculation of the stability of each cluster and the utilization rate of each cluster by the server according to the at least one indicator item includes:
  • the server calculates the stability of each cluster according to Formula One, and Formula One is:
  • S represents the stability of the cluster
  • a represents the risk factor of level 1 alarms
  • x represents the number of level 1 alarms
  • b represents the risk factor of level 2 alarms
  • y represents the number of level 2 alarms
  • c represents the risk factor of level 3 alarms
  • z represents the number of level 3 alarms
  • m represents the total number of instances in the cluster
  • the server calculates the utilization rate of each cluster according to Formula 2, and Formula 2 is:
  • U represents the utilization rate of the cluster
  • n_i represents the utilization rate of the i-th instance, i ∈ {1, 2, 3, ..., m} (m ∈ N*).
  • For example, for cluster A, the number of level 1 alarms is 10 and the risk factor of level 1 alarms is 0.8, the number of level 2 alarms is 8 and the risk factor of level 2 alarms is 0.6, and the number of level 3 alarms is 6 and the risk factor of level 3 alarms is 0.4.
  • The total number of instances is 2, namely instance A and instance B.
  • The utilization rate of instance A is 0.8 and the utilization rate of instance B is 0.6.
  • According to Formula One, the server calculates the stability of cluster A as 15.2%. According to Formula 2, the server calculates the utilization rate of cluster A as 70%.
  • the stability of each cluster and the utilization rate of each cluster can be obtained, which provides a data basis for the subsequent determination of abnormal clusters.
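  • The body of Formula 2 is not reproduced in this text. The sketch below assumes it is the arithmetic mean of the per-instance utilization rates n_i over the m instances, which is consistent with the worked example (0.8 and 0.6 giving 70%) but remains an assumption. The function name `cluster_utilization` is hypothetical. Formula One is not sketched because its form cannot be recovered from the text.

```python
def cluster_utilization(instance_rates: list[float]) -> float:
    """Assumed Formula 2: mean of the per-instance utilization rates."""
    if not instance_rates:
        raise ValueError("a cluster needs at least one instance")
    return sum(instance_rates) / len(instance_rates)


# Worked example from the text: instance A at 0.8, instance B at 0.6.
u = cluster_utilization([0.8, 0.6])
print(f"utilization of cluster A: {u:.0%}")
```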
  • S13 Determine the abnormal cluster and the abnormal type of the abnormal cluster according to the stability of each cluster and the usage rate of each cluster.
  • the above abnormal clusters and the above abnormal types can also be stored in a node of a blockchain.
  • The abnormal cluster refers to a cluster whose stability is less than a first value, or a cluster whose usage rate is less than a second value or greater than a third value.
  • The abnormal type is divided according to the stability of the cluster or the usage rate of the cluster, and specifically covers a first cluster and a second cluster.
  • The stability of the first cluster is less than the first value.
  • The usage rate of the second cluster is less than the second value or greater than the third value.
  • The second value is smaller than the third value.
  • the server determining the abnormal cluster and the abnormal type of the abnormal cluster according to the stability of each cluster and the usage rate of each cluster includes one or a combination of the following methods:
  • (1) The server obtains the stability of a cluster and the average stability of the remaining clusters other than that cluster. Further, the server multiplies the average stability by a first preset ratio to obtain a first value. When the stability of the cluster is less than the first value, the server determines the cluster to be a first cluster, which is an abnormal cluster of the abnormal-stability type.
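  • For example, the stability of cluster B is 75%, the stability of cluster C is 60%, the stability of cluster D is 30%, and the first preset ratio is 0.8.
  • The server obtains the stability of cluster B as 75%; the average stability of the remaining clusters corresponding to cluster B is 45%. Multiplying this average by 0.8 gives a first value of 36%, so the stability of cluster B is greater than the first value.
  • The server obtains the stability of cluster C as 60%; the average stability of the remaining clusters corresponding to cluster C is 52.5%. Multiplying this average by 0.8 gives a first value of 42%, so the stability of cluster C is greater than the first value.
  • The server obtains the stability of cluster D as 30%; the average stability of the remaining clusters corresponding to cluster D is 67.5%. Multiplying this average by 0.8 gives a first value of 54%, so the stability of cluster D is less than the first value. Therefore, the server determines cluster D as the first cluster.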
  • To obtain the average stability of the remaining clusters, the server takes each cluster in turn from all the clusters, obtains the stability of the remaining clusters other than that cluster, and determines the average stability of the remaining clusters from those values. The number of average stability values is therefore equal to the number of clusters.
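  • The leave-one-out comparison above can be sketched as follows, using the cluster B, C, and D figures from the example. The function name `first_clusters` is hypothetical, and the sketch assumes at least two clusters so that an average of the remaining clusters exists.

```python
def first_clusters(stability: dict[str, float], ratio: float) -> list[str]:
    """Return clusters whose stability is below ratio * mean of the others."""
    if len(stability) < 2:
        raise ValueError("need at least two clusters to compare")
    total = sum(stability.values())
    flagged = []
    for name, value in stability.items():
        # Average stability of every cluster except the current one.
        rest_average = (total - value) / (len(stability) - 1)
        first_value = rest_average * ratio
        if value < first_value:
            flagged.append(name)
    return flagged


# Figures from the example: first preset ratio 0.8.
abnormal = first_clusters({"B": 0.75, "C": 0.60, "D": 0.30}, ratio=0.8)
print(abnormal)  # ['D']
```

  • Excluding the cluster under test from the average keeps a very unstable cluster from dragging down its own threshold, which is the accuracy benefit the text attributes to this approach.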
  • (2) The server obtains the usage rate of a cluster and the average usage rate of the remaining clusters other than that cluster. Further, the server multiplies the average usage rate by a second preset ratio to obtain a second value, and multiplies the average usage rate by a third preset ratio to obtain a third value. When the usage rate of the cluster is less than the second value or greater than the third value, the server determines the cluster to be a second cluster, which is an abnormal cluster of the abnormal-usage-rate type.
  • Because the first value is obtained by multiplying the average stability of the remaining clusters by the first preset ratio, rather than by multiplying the average stability of all clusters by the first preset ratio, the first cluster can be determined more accurately.
  • The method for determining the second cluster is similar to the method for determining the first cluster, so the second cluster can also be determined accurately.
  • A cluster usage rate that is too high causes cluster congestion, and a cluster usage rate that is too low wastes cluster instances. Therefore, clusters whose usage rate is too high and clusters whose usage rate is too low are both determined as second clusters, which facilitates the subsequent optimization of the second cluster.
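  • The two-sided usage-rate check can be sketched in the same leave-one-out style. The cluster names, usage figures, and the second and third preset ratios (0.5 and 1.5) below are made up for illustration; the text does not give values for them. The function name `second_clusters` is likewise hypothetical.

```python
def second_clusters(usage: dict[str, float],
                    second_ratio: float,
                    third_ratio: float) -> dict[str, str]:
    """Label clusters whose usage is too low or too high versus the others."""
    if len(usage) < 2:
        raise ValueError("need at least two clusters to compare")
    total = sum(usage.values())
    labels = {}
    for name, value in usage.items():
        rest_average = (total - value) / (len(usage) - 1)
        second_value = rest_average * second_ratio   # lower bound
        third_value = rest_average * third_ratio     # upper bound
        if value < second_value:
            labels[name] = "too_low"
        elif value > third_value:
            labels[name] = "too_high"
    return labels


# Hypothetical usage rates and ratios, for illustration only.
labels = second_clusters({"E": 0.9, "F": 0.5, "G": 0.1},
                         second_ratio=0.5, third_ratio=1.5)
print(labels)  # {'E': 'too_high', 'G': 'too_low'}
```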
  • S14 Process the abnormal cluster according to the abnormal type.
  • the server processing the abnormal cluster according to the abnormal type includes one or a combination of the following methods:
  • (1) The server extracts the abnormal log in the first cluster. Further, the server obtains, from the configured solutions, a target solution that matches the abnormal log, and uses the target solution to process the first cluster.
  • At least one target solution is stored in the configuration solution.
  • extracting, by the server, the abnormal log in the first cluster includes:
  • The server extracts the target alarm information of the first cluster from the alarm information. Further, the server extracts the abnormal log from the log of the first cluster according to the target alarm information.
  • The server obtains the alarm information corresponding to the abnormal log. Further, the server uses a symmetric encryption algorithm to encrypt the alarm information to obtain a first ciphertext, and sends the first ciphertext to the terminal device of the person in charge.
  • the server determines the amount of change in the number of instances in the second cluster according to the usage rate of the second cluster, and further, the server processes the second cluster according to the amount of change.
  • the server determining the amount of change in the number of instances in the second cluster according to the usage rate of the second cluster includes:
  • The server obtains the first number of instances in the second cluster. Further, the server multiplies the usage rate of the second cluster by the first number and divides the product by the average usage rate to obtain the second number of instances in the second cluster. Further, the server subtracts the second number from the first number to obtain the amount of change.
  • For example, the utilization rate of the second cluster is 90%, the first number of instances in the second cluster obtained by the server is 2, and the average utilization rate is 60%.
  • The second number of instances in the second cluster is then 90% × 2 / 60% = 3, and the difference between the second number and the first number gives an amount of change of 1.
  • the server processing the second cluster according to the change amount includes:
  • When it is detected that the usage rate of the second cluster is less than the second value, the server removes instances from the second cluster according to the amount of change; or, when it is detected that the usage rate of the second cluster is greater than the third value, the server adds instances to the second cluster according to the amount of change.
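  • The instance adjustment can be sketched as below. The sketch makes two assumptions not stated in the text: the second number is rounded to the nearest whole instance, and the amount of change is taken as the absolute difference between the two numbers, which matches the example result of 1. The function name `plan_scaling` and the threshold values passed in are hypothetical.

```python
def plan_scaling(usage: float, first_number: int, average_usage: float,
                 second_value: float, third_value: float) -> tuple[int, int, str]:
    """Return (second_number, change, action) for a second cluster."""
    # Target instance count; rounding to a whole instance is an assumption.
    second_number = round(usage * first_number / average_usage)
    # Magnitude of the adjustment; the direction is decided by the thresholds.
    change = abs(second_number - first_number)
    if usage < second_value:
        action = "remove"      # usage too low: release instances
    elif usage > third_value:
        action = "add"         # usage too high: add instances
    else:
        action = "none"
    return second_number, change, action


# Example from the text: usage 90%, 2 instances, average usage 60%.
# The second and third values (0.3 and 0.8) are made up for illustration.
plan = plan_scaling(0.9, 2, 0.6, second_value=0.3, third_value=0.8)
print(plan)  # (3, 1, 'add')
```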
  • In the above implementation, the cluster whose stability is less than the first value among all clusters is determined as the first cluster, and the cluster whose usage rate is less than the second value or greater than the third value is determined as the second cluster. Because the monitoring data of all clusters in the at least one cloud environment is processed and the first cluster or second cluster among all the clusters is determined, the server can handle a failure uniformly when several first clusters or second clusters suffer the same failure.
  • the method further includes:
  • the server tests the abnormal cluster to obtain a test result.
  • the server generates target information according to the test result.
  • The server uses the Advanced Encryption Standard (AES) algorithm to encrypt the target information to obtain a target ciphertext, and further, the server sends the target ciphertext to the terminal device of the designated contact.
  • the target information includes abnormal clusters that failed the test, the root cause of the failure of the test result, and the like.
  • the designated contact person may be the person in charge of cluster optimization, which is not limited in this application.
  • The server testing the abnormal cluster to obtain a test result includes, but is not limited to, one or a combination of the following:
  • the server performs a CPU performance test on the abnormal cluster, and obtains the CPU performance test result
  • the server performs a memory performance test on the abnormal cluster, and obtains the memory performance test result
  • the server performs a disk performance test on the abnormal cluster, and obtains the disk performance test result
  • the server performs a function test on the abnormal cluster, and obtains the result of the function test.
  • the server performing a CPU performance test on the abnormal cluster includes:
  • The server obtains the test script file and runs the CPU performance test tool according to the test script file. Further, the server uses the CPU performance test tool to test the abnormal cluster in an overclocked or fully loaded state, obtaining the CPU performance test result of the abnormal cluster when the CPU is overclocked or fully loaded.
  • the server can also obtain other test results of the abnormal cluster, and process all the test results together to make the test results of the abnormal cluster more accurate.
  • corresponding processing can be performed on the failure of the cluster in the cloud environment, and when the same failure occurs in the cluster in different cloud environments, the failure can also be processed in a unified manner.
  • The cluster optimization device 11 includes a collection unit 110, a processing unit 111, a calculation unit 112, a determining unit 113, an acquiring unit 114, a cleaning unit 115, a testing unit 116, a generation unit 117, an encryption unit 118, and a sending unit 119.
  • the module/unit referred to in this application refers to a series of computer program segments that can be executed by the processor 13 and can complete fixed functions, and are stored in the memory 12. In this embodiment, the functions of each module/unit will be described in detail in subsequent embodiments.
  • the collection unit 110 collects monitoring data of all clusters in at least one cloud environment within a preset time.
  • The monitoring data is data extracted from alarm information, and includes, but is not limited to: the alarm time, the alarm level, the number of alarms corresponding to the alarm level, the risk factor corresponding to the alarm level, the total number of cluster instances, the utilization rate of each instance, the root cause of the alarm, etc.
  • The acquiring unit 114 acquires the alarm information of all clusters within the preset time, and the processing unit 111 performs word segmentation on the alarm information to obtain multiple pieces of first information. Further, the cleaning unit 115 cleans the first information to obtain multiple pieces of second information, and the calculation unit 112 uses the TF-IDF algorithm to calculate the probability of each piece of second information. Further, the determining unit 113 determines the monitoring data according to those probabilities.
  • In this way, the occurrence time of the monitoring data can be controlled. This avoids too long an interval between the time at which the cluster stability or cluster utilization rate is calculated and the time at which the data occurred, and therefore avoids an inaccurate result for the current cluster stability or current cluster utilization rate.
  • the cleaning unit 115 cleans the configuration information in the first information to obtain the plurality of second information.
  • the configuration information includes, but is not limited to: function words, stop words, etc.
  • the calculation unit 112 does not need to calculate the probability of the configuration information in the plurality of first information, which can shorten the calculation time and enable the monitoring data to be quickly determined.
  • the processing unit 111 performs normalization processing on the monitoring data of each cluster to obtain at least one index item of each cluster.
  • The at least one indicator item is the basic information for calculating the stability or utilization rate of the cluster, and includes, but is not limited to: the alarm level, the number of alarms corresponding to the alarm level, the risk factor corresponding to the alarm level, the total number of cluster instances, the utilization rate of each instance, etc.
  • the processing unit 111 performs normalization processing on the monitoring data of each cluster to obtain at least one index item for each cluster including:
  • The processing unit 111 uses a de-redundancy algorithm to remove redundant content from the monitoring data to obtain target data. Further, the processing unit 111 uses a shallow semantic analysis method to identify the target data, and processes the identified results that have similar meanings to obtain the at least one indicator item.
  • the monitoring data can be de-redundantly processed to reduce the memory occupied by the server, and at the same time, similar results after de-redundancy can be processed, so that the monitoring data on each cluster can have a consistent name. It is convenient for subsequent unified calculation of the stability and utilization rate of the cluster.
  • the calculation unit 112 calculates the stability of each cluster and the usage rate of each cluster according to the at least one index item.
  • the calculation unit 112 calculating the stability of each cluster and the utilization rate of each cluster according to the at least one index item includes:
  • the calculation unit 112 calculates the stability of each cluster according to Formula 1, and Formula 1 is:
  • S represents the stability of the cluster
  • a represents the risk factor of level 1 alarms
  • x represents the number of level 1 alarms
  • b represents the risk factor of level 2 alarms
  • y represents the number of level 2 alarms
  • c represents the risk factor of level 3 alarms
  • z represents the number of level 3 alarms
  • m represents the total number of instances in the cluster
  • the calculation unit 112 calculates the usage rate of each cluster according to Formula 2, and Formula 2 is:
  • U represents the utilization rate of the cluster
  • n_i represents the utilization rate of the i-th instance, i ∈ {1, 2, 3, ..., m} (m ∈ N*).
  • For example, for cluster A, the number of level 1 alarms is 10 and the risk factor of level 1 alarms is 0.8, the number of level 2 alarms is 8 and the risk factor of level 2 alarms is 0.6, and the number of level 3 alarms is 6 and the risk factor of level 3 alarms is 0.4.
  • The total number of instances is two, namely instance A and instance B.
  • The utilization rate of instance A is 0.8 and the utilization rate of instance B is 0.6.
  • According to Formula 1, the calculation unit 112 calculates the stability of cluster A as 15.2%. According to Formula 2, the calculation unit 112 calculates the utilization rate of cluster A as 70%.
  • the stability of each cluster and the utilization rate of each cluster can be obtained, which provides a data basis for the subsequent determination of abnormal clusters.
  • the determining unit 113 determines the abnormal cluster and the abnormal type of the abnormal cluster according to the stability of each cluster and the usage rate of each cluster.
  • the abnormal cluster and the abnormal type may also be stored in a node of a blockchain.
  • The abnormal cluster refers to a cluster whose stability is less than a first value, or a cluster whose usage rate is less than a second value or greater than a third value.
  • The abnormal type is divided according to the stability of the cluster or the usage rate of the cluster, and specifically covers a first cluster and a second cluster.
  • The stability of the first cluster is less than the first value.
  • The usage rate of the second cluster is less than the second value or greater than the third value.
  • The second value is smaller than the third value.
  • The determining unit 113 determining the abnormal cluster and the abnormal type of the abnormal cluster according to the stability of each cluster and the usage rate of each cluster includes one or a combination of the following methods:
  • (1) The determining unit 113 obtains the stability of a cluster and the average stability of the remaining clusters other than that cluster. Further, the determining unit 113 multiplies the average stability by a first preset ratio to obtain a first value. When the stability of the cluster is less than the first value, the determining unit 113 determines the cluster to be a first cluster, which is an abnormal cluster of the abnormal-stability type.
  • For example, the stability of cluster B is 75%, the stability of cluster C is 60%, the stability of cluster D is 30%, and the first preset ratio is 0.8.
  • The determining unit 113 obtains the stability of cluster B as 75%; the average stability of the remaining clusters corresponding to cluster B is 45%. Multiplying this average by 0.8 gives a first value of 36%, so the stability of cluster B is greater than the first value.
  • The determining unit 113 obtains the stability of cluster C as 60%; the average stability of the remaining clusters corresponding to cluster C is 52.5%. Multiplying this average by 0.8 gives a first value of 42%, so the stability of cluster C is greater than the first value.
  • The determining unit 113 obtains the stability of cluster D as 30%; the average stability of the remaining clusters corresponding to cluster D is 67.5%. Multiplying this average by 0.8 gives a first value of 54%, so the stability of cluster D is less than the first value. Therefore, the determining unit 113 determines cluster D as the first cluster.
  • To obtain the average stability of the remaining clusters, the determining unit 113 takes each cluster in turn from all the clusters, acquires the stability of the remaining clusters other than that cluster, and determines the average stability of the remaining clusters from those values. The number of average stability values is therefore equal to the number of clusters.
  • (2) The determining unit 113 obtains the usage rate of a cluster and the average usage rate of the remaining clusters other than that cluster. Further, the determining unit 113 multiplies the average usage rate by a second preset ratio to obtain a second value, and multiplies the average usage rate by a third preset ratio to obtain a third value. When the usage rate of the cluster is less than the second value or greater than the third value, the determining unit 113 determines the cluster to be a second cluster, which is an abnormal cluster of the abnormal-usage-rate type.
  • Because the first value is obtained by multiplying the average stability of the remaining clusters by the first preset ratio, rather than by multiplying the average stability of all clusters by the first preset ratio, the first cluster can be determined more accurately.
  • The method for determining the second cluster is similar to the method for determining the first cluster, so the second cluster can also be determined accurately.
  • A cluster usage rate that is too high causes cluster congestion, and a cluster usage rate that is too low wastes cluster instances. Therefore, clusters whose usage rate is too high and clusters whose usage rate is too low are both determined as second clusters, which facilitates the subsequent optimization of the second cluster.
  • The processing unit 111 processes the abnormal cluster according to the abnormal type.
  • In at least one embodiment of the present application, the processing unit 111 processes the abnormal cluster according to the abnormal type in one or a combination of the following ways:
  • (1) The processing unit 111 extracts the abnormal log of the first cluster, obtains from the configuration solution a target solution that matches the abnormal log, and processes the first cluster using the target solution.
  • At least one target solution is stored in the configuration solution.
  • Specifically, the processing unit 111 extracts the abnormal log of the first cluster as follows:
  • The processing unit 111 extracts the target alarm information of the first cluster from the alarm information and then, according to the target alarm information, extracts the abnormal log from the log of the first cluster.
  • When no target solution matching the abnormal log is found in the configuration solution, the obtaining unit 114 obtains the alarm information corresponding to the abnormal log, the encryption unit 118 encrypts the alarm information with a symmetric encryption algorithm to obtain a first ciphertext, and the sending unit 119 sends the first ciphertext to the terminal device of the person in charge.
  • (2) The processing unit 111 determines the amount of change in the number of instances in the second cluster according to the usage rate of the second cluster, and then processes the second cluster according to the amount of change.
  • Specifically, the processing unit 111 determines the amount of change in the number of instances in the second cluster according to the usage rate of the second cluster as follows:
  • The processing unit 111 obtains the first number of instances in the second cluster, multiplies the usage rate of the second cluster by the first number, and divides the product by the average usage rate to obtain the second number of instances in the second cluster. The processing unit 111 then subtracts the first number from the second number to obtain the amount of change.
  • For example, the usage rate of the second cluster is 90%, the processing unit 111 obtains a first number of 2 instances in the second cluster, and the average usage rate is 60%. The calculation 90% × 2 / 60% gives a second number of 3 instances, and subtracting the first number from the second number gives an amount of change of 1.
  • Specifically, the processing unit 111 processes the second cluster according to the amount of change as follows:
  • When it detects that the usage rate of the second cluster is less than the second value, the processing unit 111 reduces the instances of the second cluster according to the amount of change; when it detects that the usage rate of the second cluster is greater than the third value, the processing unit 111 increases the instances of the second cluster according to the amount of change.
  • In at least one embodiment of the present application, the clusters whose stability is less than the first value are determined as first clusters, and the clusters whose usage rate is less than the second value or greater than the third value are determined as second clusters. Because the monitoring data of all clusters in the at least one cloud environment is processed to determine the first clusters or second clusters among all the clusters, when there are multiple first clusters or second clusters the server can handle a fault uniformly when the same fault occurs in several clusters.
  • In at least one embodiment of the present application, after the abnormal cluster is processed according to the abnormal type, the testing unit 116 tests the abnormal cluster to obtain a test result. When the test result is a failure, the generating unit 117 generates target information according to the test result, the encryption unit 118 encrypts the target information with the Advanced Encryption Standard (AES) algorithm to obtain a target ciphertext, and the sending unit 119 sends the target ciphertext to the terminal device of the designated contact person.
  • The target information includes the abnormal clusters that failed the test, the root cause of the failure, and the like.
  • The designated contact person may be the person in charge of cluster optimization, which is not limited in this application.
  • Specifically, the testing unit 116 tests the abnormal cluster and obtains the test result in ways that include, but are not limited to, one or a combination of the following:
  • (1) The testing unit 116 performs a CPU performance test on the abnormal cluster to obtain a CPU performance test result;
  • (2) The testing unit 116 performs a memory performance test on the abnormal cluster to obtain a memory performance test result;
  • (3) The testing unit 116 performs a disk performance test on the abnormal cluster to obtain a disk performance test result;
  • (4) The testing unit 116 performs a function test on the abnormal cluster to obtain a function test result.
  • Specifically, the testing unit 116 performs the CPU performance test on the abnormal cluster as follows:
  • The testing unit 116 obtains a test script file and runs a CPU performance test tool according to the test script file. The testing unit 116 then uses the CPU performance test tool to test the abnormal cluster while it is overclocked or fully loaded, obtaining the CPU performance test result of the abnormal cluster when the CPU is overclocked or fully loaded.
  • In other embodiments, the testing unit 116 can also obtain other test results of the abnormal cluster and process all the test results together, making the test result of the abnormal cluster more accurate.
  • In the cluster optimization apparatus described in FIG. 2, a fault that occurs in a cluster in a cloud environment can be handled accordingly, and when the same fault occurs in clusters in different cloud environments, the fault can also be handled uniformly.
  • FIG. 3 is a schematic structural diagram of a server implementing the preferred embodiment of the cluster optimization method of the present application.
  • In one embodiment of the present application, the server 1 includes, but is not limited to, a memory 12, a processor 13, and a computer program stored in the memory 12 and executable on the processor 13, such as a cluster optimization program.
  • Those skilled in the art will understand that the schematic diagram is only an example of the server 1 and does not limit it. The server 1 may include more or fewer components than shown, combine certain components, or use different components; for example, the server 1 may also include input and output devices, network access devices, buses, and so on.
  • The processor 13 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • A general-purpose processor may be a microprocessor or any conventional processor.
  • The processor 13 is the computing core and control center of the server 1. It connects the various parts of the entire server 1 through various interfaces and lines, and executes the operating system of the server 1 and the installed applications, program code, and so on.
  • The processor 13 executes the operating system of the server 1 and the installed applications.
  • The processor 13 executes the application program to implement the steps in the foregoing embodiments of the cluster optimization method, for example, the steps shown in FIG. 1.
  • For example, the computer program may be divided into one or more modules/units, which are stored in the memory 12 and executed by the processor 13 to carry out the present application.
  • The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions; the instruction segments describe the execution process of the computer program in the server 1.
  • For example, the computer program may be divided into a collection unit 110, a processing unit 111, a calculation unit 112, a determining unit 113, an obtaining unit 114, a cleaning unit 115, a testing unit 116, a generating unit 117, an encryption unit 118, and a sending unit 119.
  • The memory 12 may be used to store the computer program and/or modules. The processor 13 implements the various functions of the server 1 by running or executing the computer program and/or modules stored in the memory 12 and by calling data stored in the memory 12.
  • The memory 12 may mainly include a program storage area and a data storage area.
  • The program storage area may store an operating system, an application program required by at least one function (such as a sound playback function or an image playback function), and so on; the data storage area may store data created through the use of the server.
  • In addition, the memory 12 may include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • The memory 12 may be an external memory and/or an internal memory of the server 1. Further, the memory 12 may be a memory in physical form, such as a memory stick or a TF card (Trans-flash Card).
  • If the integrated modules/units of the server 1 are implemented as software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium, which may be non-volatile or volatile.
  • Based on this understanding, all or part of the processes in the methods of the above embodiments may also be completed by a computer program instructing the relevant hardware.
  • The computer program may be stored in a computer-readable storage medium and, when executed by the processor, implements the steps of the foregoing method embodiments.
  • The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form.
  • The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, and read-only memory (ROM).
  • With reference to FIG. 1, the memory 12 in the server 1 stores multiple instructions to implement a cluster optimization method, and the processor 13 can execute the instructions to: collect monitoring data of all clusters in at least one cloud environment within a preset time; normalize the monitoring data of each cluster to obtain at least one indicator item for each cluster; calculate the stability and the usage rate of each cluster according to the at least one indicator item; determine the abnormal cluster and its abnormal type according to the stability and usage rate of each cluster; and process the abnormal cluster according to the abnormal type.
  • The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms.
  • A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of that information (anti-counterfeiting) and to generate the next block.
  • A blockchain may include an underlying blockchain platform, a platform product service layer, and an application service layer.
  • Modules described as separate components may or may not be physically separate, and components displayed as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • In addition, the functional modules in the various embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
  • The above-mentioned integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

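This application relates to cloud computing technology and provides a cluster optimization method. The method collects monitoring data of all clusters in at least one cloud environment within a preset time, normalizes the monitoring data of each cluster to obtain at least one indicator item for each cluster, calculates the stability and the usage rate of each cluster from the at least one indicator item, determines abnormal clusters and the abnormal type of each abnormal cluster from the stability and usage rate of each cluster, and processes the abnormal clusters according to the abnormal type. This application also provides a cluster optimization apparatus, a server, and a medium. Through cluster computing, this application can not only handle faults that occur in clusters in a cloud environment, but also handle a fault uniformly when the same fault occurs in clusters in different cloud environments. In addition, the abnormal clusters and their corresponding abnormal types may be stored in a blockchain.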

Description

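Cluster optimization method, apparatus, server, and medium
This application claims priority to Chinese patent application No. 202010192804.1, filed with the China Patent Office on March 18, 2020 and entitled "Cluster optimization method, apparatus, server, and medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of cloud computing technology, and in particular to a cluster optimization method, apparatus, server, and medium.
Background
With the rapid development of cloud computing, its fields of application have broadened and the types of applications have multiplied; as needed, different applications are deployed on clusters in different cloud environments.
The inventor realized that existing technical solutions cannot handle faults that occur in clusters in a cloud environment and, moreover, cannot handle a fault uniformly when the same fault occurs in clusters in different cloud environments.
Summary
In view of the above, it is necessary to provide a cluster optimization method, apparatus, server, and medium that can not only handle faults that occur in clusters in a cloud environment, but also handle a fault uniformly when the same fault occurs in clusters in different cloud environments.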
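A first aspect of this application provides a cluster optimization method, which includes:
collecting monitoring data of all clusters in at least one cloud environment within a preset time;
normalizing the monitoring data of each cluster to obtain at least one indicator item for each cluster;
calculating the stability and the usage rate of each cluster according to the at least one indicator item;
determining abnormal clusters and the abnormal type of each abnormal cluster according to the stability and usage rate of each cluster; and
processing the abnormal clusters according to the abnormal type.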
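A second aspect of this application provides a server that includes a processor and a memory, the processor being configured to execute computer-readable instructions stored in the memory to implement the following steps:
collecting monitoring data of all clusters in at least one cloud environment within a preset time;
normalizing the monitoring data of each cluster to obtain at least one indicator item for each cluster;
calculating the stability and the usage rate of each cluster according to the at least one indicator item;
determining abnormal clusters and the abnormal type of each abnormal cluster according to the stability and usage rate of each cluster; and
processing the abnormal clusters according to the abnormal type.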
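A third aspect of this application provides a computer-readable storage medium storing at least one computer-readable instruction which, when executed by a processor, implements the following steps:
collecting monitoring data of all clusters in at least one cloud environment within a preset time;
normalizing the monitoring data of each cluster to obtain at least one indicator item for each cluster;
calculating the stability and the usage rate of each cluster according to the at least one indicator item;
determining abnormal clusters and the abnormal type of each abnormal cluster according to the stability and usage rate of each cluster; and
processing the abnormal clusters according to the abnormal type.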
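A fourth aspect of this application provides a cluster optimization apparatus, which includes:
a collection unit, configured to collect monitoring data of all clusters in at least one cloud environment within a preset time;
a processing unit, configured to normalize the monitoring data of each cluster to obtain at least one indicator item for each cluster;
a calculation unit, configured to calculate the stability and the usage rate of each cluster according to the at least one indicator item; and
a determining unit, configured to determine abnormal clusters and the abnormal type of each abnormal cluster according to the stability and usage rate of each cluster;
the processing unit being further configured to process the abnormal clusters according to the abnormal type.
As the above technical solutions show, this application can not only handle faults that occur in clusters in a cloud environment, but also handle a fault uniformly when the same fault occurs in clusters in different cloud environments.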
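Brief Description of the Drawings
FIG. 1 is a flowchart of a preferred embodiment of the cluster optimization method disclosed in this application.
FIG. 2 is a functional module diagram of a preferred embodiment of the cluster optimization apparatus disclosed in this application.
FIG. 3 is a schematic structural diagram of a server implementing the preferred embodiment of the cluster optimization method of this application.
Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, this application is described in detail below with reference to the drawings and specific embodiments.
FIG. 1 is a flowchart of a preferred embodiment of the cluster optimization method of this application. Depending on requirements, the order of the steps in the flowchart may be changed and some steps may be omitted.
The cluster optimization method is applied to one or more servers. A server is a device that can automatically perform numerical computation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The server may be any electronic product capable of human-computer interaction with a user, for example a personal computer, a tablet computer, a smartphone, a personal digital assistant (PDA), a game console, an Internet Protocol television (IPTV), a smart wearable device, and the like.
The server may also include a network device and/or user equipment. The network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud based on cloud computing and composed of a large number of hosts or network servers.
The network in which the server is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (VPN), and the like.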
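S10: Collect monitoring data of all clusters in at least one cloud environment within a preset time.
In at least one embodiment of this application, the monitoring data is data extracted from alarm information. The monitoring data includes, but is not limited to: alarm time, alarm level, the number of alarms at each alarm level, the risk coefficient of each alarm level, the total number of cluster instances, the utilization rate of each instance, the alarm root cause, and the like.
In at least one embodiment of this application, before collecting the monitoring data of all clusters in at least one cloud environment within the preset time, the method further includes:
The server obtains the alarm information of all the clusters within the preset time and performs word segmentation on the alarm information to obtain multiple pieces of first information. The server then cleans the first information to obtain multiple pieces of second information and uses the TF-IDF algorithm to calculate the probability of each piece of second information. Finally, the server determines the monitoring data according to those probabilities.
Obtaining only the alarm information within the preset time bounds when the monitoring data occurred. This avoids an overly long interval between that occurrence time and the time at which cluster stability or cluster usage rate is calculated, and thus avoids an inaccurate result for the current cluster stability or current cluster usage rate.
Specifically, the server cleans the configuration information out of the first information to obtain the multiple pieces of second information.
The configuration information includes, but is not limited to, function words, stop words, and the like.
Cleaning the first information prevents the configuration information from skewing the probabilities of the second information, so the monitoring data can be determined accurately. In addition, the server does not need to calculate probabilities for the configuration information in the first information, which shortens the calculation time so that the monitoring data can be determined quickly.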
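The preprocessing described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the application's implementation: whitespace splitting stands in for word segmentation (Chinese alarm text would need a real segmenter), the `STOP_WORDS` set and all function names are hypothetical, and the TF-IDF weight is taken as the per-term score that the text calls a "probability".

```python
import math
from collections import Counter

# Hypothetical stop-word list standing in for the "configuration information".
STOP_WORDS = {"the", "a", "is", "on", "of"}


def tokenize(alert: str) -> list[str]:
    """Split an alert into lowercase tokens (stand-in for word segmentation)."""
    return alert.lower().split()


def clean(tokens: list[str]) -> list[str]:
    """Drop stop words so that they are never scored."""
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: tf-idf weight} mapping per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


alerts = [
    "CPU usage is high on node1",
    "Disk usage is high on node2",
    "The CPU temperature is critical",
]
docs = [clean(tokenize(a)) for a in alerts]
weights = tf_idf(docs)
```

Terms that occur in few alerts (such as `node1`) receive a higher weight than terms shared by several alerts (such as `usage`), which is one plausible way to pick out the fields that matter as monitoring data.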
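S11: Normalize the monitoring data of each cluster to obtain at least one indicator item for each cluster.
In at least one embodiment of this application, the at least one indicator item is the basic information for calculating the stability or usage rate of a cluster. It includes, but is not limited to: alarm level, the number of alarms at each alarm level, the risk coefficient of each alarm level, the total number of cluster instances, the utilization rate of each instance, and the like.
In at least one embodiment of this application, the server normalizes the monitoring data of each cluster to obtain at least one indicator item for each cluster as follows:
The server uses a de-redundancy algorithm to remove redundant content from the monitoring data and obtain target data. The server then uses shallow semantic analysis to identify the target data and consolidates the identified results that have similar meanings, obtaining the at least one indicator item.
This implementation removes redundancy from the monitoring data, reducing the server's memory footprint. Consolidating the similar results that remain also gives the monitoring data on every cluster consistent names, which makes it easier to calculate cluster stability and usage rate uniformly later.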
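A rough sketch of this normalization step is given below. It makes two simplifying assumptions that the application does not state: a hand-written alias table stands in for shallow semantic analysis, and dropping records that become identical stands in for the de-redundancy algorithm. The alias table and all field names are hypothetical.

```python
# Hypothetical alias table: differently named fields that mean the same thing
# are mapped onto one canonical indicator name.
ALIASES = {
    "alert_level": "alarm_level",
    "severity": "alarm_level",
    "instance_total": "instance_count",
    "num_instances": "instance_count",
}


def normalize(records: list[dict]) -> list[dict]:
    """Rename aliased fields to canonical names, then drop duplicate records."""
    seen = set()
    result = []
    for record in records:
        canonical = {ALIASES.get(key, key): value for key, value in record.items()}
        fingerprint = tuple(sorted(canonical.items()))
        if fingerprint in seen:
            continue  # redundant record, already kept once
        seen.add(fingerprint)
        result.append(canonical)
    return result


raw = [
    {"severity": 1, "num_instances": 2},
    {"alarm_level": 1, "instance_count": 2},  # same content, different names
    {"alert_level": 2, "instance_total": 2},
]
indicators = normalize(raw)
```

After this step every record uses the same indicator names, so later calculations can read `alarm_level` and `instance_count` without caring which cloud environment produced the record.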
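S12: Calculate the stability and the usage rate of each cluster according to the at least one indicator item.
In at least one embodiment of this application, the server calculates the stability and the usage rate of each cluster according to the at least one indicator item as follows:
The server calculates the stability of each cluster according to Formula 1:
Figure PCTCN2020099299-appb-000001
where S denotes the stability of the cluster, a the risk coefficient of level-1 alarms, x the number of level-1 alarms, b the risk coefficient of level-2 alarms, y the number of level-2 alarms, c the risk coefficient of level-3 alarms, z the number of level-3 alarms, and m the total number of instances in the cluster.
The server calculates the usage rate of each cluster according to Formula 2:
Figure PCTCN2020099299-appb-000002
where U denotes the usage rate of the cluster and n_i denotes the utilization rate of the i-th instance, i ∈ {1, 2, 3, …, m} (m ∈ N*).
For example, cluster A has 10 level-1 alarms with a risk coefficient of 0.8, 8 level-2 alarms with a risk coefficient of 0.6, and 6 level-3 alarms with a risk coefficient of 0.4. It has 2 instances in total, instance 1 and instance 2, with utilization rates of 0.8 and 0.6 respectively. The server calculates the stability of cluster A as 15.2% and the usage rate of cluster A as 70%.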
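Formula 2 appears only as an image reference here, so its exact form cannot be confirmed from the text. The worked example (utilization rates 0.8 and 0.6 giving a usage rate of 70%) is at least consistent with U being the arithmetic mean of the per-instance utilization rates n_i. The sketch below encodes that assumption only; the function name is hypothetical, and no attempt is made to reconstruct Formula 1.

```python
def cluster_usage(instance_utilizations: list[float]) -> float:
    """Mean utilization over the instances of one cluster.

    Assumed form of Formula 2, inferred from the worked example only.
    """
    if not instance_utilizations:
        raise ValueError("a cluster needs at least one instance")
    return sum(instance_utilizations) / len(instance_utilizations)


# The two instances of cluster A from the example.
usage_a = cluster_usage([0.8, 0.6])
```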
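This implementation yields the stability and usage rate of each cluster, providing the data basis for subsequently determining abnormal clusters.
S13: Determine abnormal clusters and the abnormal type of each abnormal cluster according to the stability and usage rate of each cluster.
It should be emphasized that, to further ensure the privacy and security of the abnormal clusters and abnormal types, they may also be stored in a node of a blockchain.
In at least one embodiment of this application, an abnormal cluster is a cluster whose stability is less than a first value, or whose usage rate is less than a second value or greater than a third value.
Further, the abnormal types are divided according to cluster stability or cluster usage rate into first clusters and second clusters. The stability of a first cluster is less than the first value; the usage rate of a second cluster is less than the second value or greater than the third value; and the second value is smaller than the third value.
In at least one embodiment of this application, the server determines abnormal clusters and their abnormal types according to the stability and usage rate of each cluster in one or a combination of the following ways:
(1) For any one of the clusters, the server obtains the stability of that cluster and the average stability of the remaining clusters other than that cluster. The server then multiplies the average stability by a first preset ratio to obtain the first value. When the stability of the cluster is less than the first value, the server determines the cluster as a first cluster; a first cluster is an abnormal cluster of the stability-abnormal type.
For example, the stability of cluster B is 75%, that of cluster C is 60%, and that of cluster D is 30%, and the first preset ratio is 0.8. For cluster B, the server obtains a stability of 75% and an average stability of the remaining clusters of 45%; multiplying the average by 0.8 gives a first value of 36%, so the stability of cluster B is greater than the first value. For cluster C, the server obtains a stability of 60% and an average stability of the remaining clusters of 52.5%; multiplying the average by 0.8 gives a first value of 42%, so the stability of cluster C is greater than the first value. For cluster D, the server obtains a stability of 30% and an average stability of the remaining clusters of 67.5%; multiplying the average by 0.8 gives a first value of 54%, so the stability of cluster D is less than the first value. The server therefore determines cluster D as a first cluster.
Specifically, before obtaining the stability of each cluster and the average stability of the remaining clusters, the server extracts each cluster in turn from all the clusters, obtains the stability of the remaining clusters other than the extracted cluster, and determines their average stability from those values. The number of average stability values equals the number of clusters.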
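The leave-one-out comparison described above can be sketched as follows, using the numbers from the example (clusters B, C, and D with a first preset ratio of 0.8). The function and parameter names are hypothetical; only the arithmetic comes from the text.

```python
def first_clusters(stability: dict[str, float], first_ratio: float) -> dict[str, float]:
    """Return {cluster: first value} for every cluster whose stability is below
    first_ratio times the mean stability of all *other* clusters."""
    if len(stability) < 2:
        return {}  # no "remaining clusters" to compare against
    total = sum(stability.values())
    flagged = {}
    for name, value in stability.items():
        # Mean over the remaining clusters, excluding the one under test.
        rest_mean = (total - value) / (len(stability) - 1)
        first_value = rest_mean * first_ratio
        if value < first_value:
            flagged[name] = first_value
    return flagged


# Stability values in percent, as in the example.
abnormal = first_clusters({"B": 75.0, "C": 60.0, "D": 30.0}, first_ratio=0.8)
```

Excluding the cluster under test from the average keeps a badly unstable cluster from dragging down its own threshold, which is the reason the text gives for the comparison being more precise.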
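(2) For any one of the clusters, the server obtains the usage rate of that cluster and the average usage rate of the remaining clusters other than that cluster. The server then multiplies the average usage rate by a second preset ratio to obtain the second value, and multiplies the average usage rate by a third preset ratio to obtain the third value. When the usage rate of the cluster is less than the second value or greater than the third value, the server determines the cluster as a second cluster; a second cluster is an abnormal cluster of the usage-rate-abnormal type.
Each cluster's stability is compared with the first value, which is derived from the average stability of the remaining clusters multiplied by the first preset ratio rather than from the average stability of all clusters; this makes the determination of first clusters more precise. Second clusters are determined in a similar way, so they can also be determined accurately.
A usage rate that is too high causes cluster congestion, and a usage rate that is too low wastes cluster instances. Determining both over-used and under-used clusters as second clusters therefore benefits the subsequent optimization of the second clusters.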
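A comparable sketch for the usage-rate check is shown below. The text gives no concrete values for the second and third preset ratios, so the ratios (0.5 and 1.4) and the sample usage rates here are purely illustrative, as are the function name and the `too_low` / `too_high` labels.

```python
def second_clusters(usage: dict[str, float], second_ratio: float,
                    third_ratio: float) -> dict[str, str]:
    """Label clusters whose usage is below second_ratio or above third_ratio
    times the mean usage of all other clusters."""
    if second_ratio >= third_ratio:
        raise ValueError("the second value must be smaller than the third value")
    if len(usage) < 2:
        return {}
    total = sum(usage.values())
    labels = {}
    for name, value in usage.items():
        rest_mean = (total - value) / (len(usage) - 1)
        if value < rest_mean * second_ratio:
            labels[name] = "too_low"   # instances are being wasted
        elif value > rest_mean * third_ratio:
            labels[name] = "too_high"  # cluster risks congestion
    return labels


# Illustrative data and ratios; the application does not give concrete values.
labels = second_clusters({"E": 0.9, "F": 0.6, "G": 0.6, "H": 0.1},
                         second_ratio=0.5, third_ratio=1.4)
```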
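S14: Process the abnormal clusters according to the abnormal type.
In at least one embodiment of this application, the server processes the abnormal clusters according to the abnormal type in one or a combination of the following ways:
(1) The server extracts the abnormal log of the first cluster, obtains from the configuration solution a target solution that matches the abnormal log, and processes the first cluster using the target solution.
At least one target solution is stored in the configuration solution.
Specifically, the server extracts the abnormal log of the first cluster as follows:
The server extracts the target alarm information of the first cluster from the alarm information and then, according to the target alarm information, extracts the abnormal log from the log of the first cluster.
When no target solution matching the abnormal log is found in the configuration solution, the server obtains the alarm information corresponding to the abnormal log, encrypts the alarm information with a symmetric encryption algorithm to obtain a first ciphertext, and sends the first ciphertext to the terminal device of the person in charge.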
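The lookup-with-fallback flow for a first cluster might look roughly like the sketch below. The application does not say how an abnormal log is matched against the configuration solution, so regular-expression matching is an assumption, and the patterns, solution names, and function names are all hypothetical. The encryption and sending steps are left out; the sketch only decides which branch applies.

```python
import re
from typing import Optional

# Hypothetical configuration: regex pattern -> name of the remediation to apply.
CONFIGURED_SOLUTIONS = {
    r"out of memory": "restart_instance",
    r"connection refused": "restart_network_service",
}


def find_target_solution(abnormal_log: str) -> Optional[str]:
    """Return the first configured solution whose pattern matches the log."""
    for pattern, solution in CONFIGURED_SOLUTIONS.items():
        if re.search(pattern, abnormal_log, flags=re.IGNORECASE):
            return solution
    return None  # nothing configured for this kind of log


def handle_first_cluster(abnormal_log: str) -> str:
    """Apply a matching solution, or fall back to notifying the person in charge."""
    solution = find_target_solution(abnormal_log)
    return solution if solution is not None else "notify_person_in_charge"
```

For instance, `handle_first_cluster("kernel: Out of memory: killed process 42")` selects `restart_instance`, while a log with no configured match falls through to the notification branch, where the alarm information would be encrypted and sent.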
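(2) The server determines the amount of change in the number of instances in the second cluster according to the usage rate of the second cluster, and then processes the second cluster according to the amount of change.
Specifically, the server determines the amount of change in the number of instances in the second cluster according to the usage rate of the second cluster as follows:
The server obtains the first number of instances in the second cluster, multiplies the usage rate of the second cluster by the first number, and divides the product by the average usage rate to obtain the second number of instances in the second cluster. The server then subtracts the first number from the second number to obtain the amount of change.
For example, the usage rate of the second cluster is 90%, the server obtains a first number of 2 instances in the second cluster, and the average usage rate is 60%. The calculation 90% × 2 / 60% gives a second number of 3 instances, and subtracting the first number from the second number gives an amount of change of 1.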
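The arithmetic in this example can be checked with a few lines of code. The function name is hypothetical, and rounding to the nearest integer is an assumption: the text does not say how a fractional second number should be handled.

```python
def instance_change(usage: float, first_count: int, average_usage: float) -> tuple[int, int]:
    """Return (second_count, change), where
    second_count = usage * first_count / average_usage and
    change = second_count - first_count."""
    if average_usage <= 0:
        raise ValueError("average usage must be positive")
    # Rounding to the nearest integer is an assumption; the text does not
    # say how a fractional result is handled.
    second_count = round(usage * first_count / average_usage)
    return second_count, second_count - first_count


second_count, change = instance_change(usage=0.9, first_count=2, average_usage=0.6)
```

With these inputs the second number is 3 and the change is 1, matching the example. For an under-used cluster the same calculation yields a negative change, which corresponds to removing instances.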
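Specifically, the server processes the second cluster according to the amount of change as follows:
When it detects that the usage rate of the second cluster is less than the second value, the server reduces the instances of the second cluster according to the amount of change; when it detects that the usage rate of the second cluster is greater than the third value, the server increases the instances of the second cluster according to the amount of change.
This implementation resolves both cluster congestion and low cluster usage.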
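The decision between removing and adding instances can be sketched as a small dispatcher. The action names and the function name are hypothetical, and the threshold values in the usage note are illustrative rather than taken from the text.

```python
def plan_scaling(usage: float, second_value: float, third_value: float,
                 change: int) -> tuple[str, int]:
    """Pick a scaling action for a second cluster from its usage rate and a
    precomputed amount of change."""
    if usage < second_value:
        return "remove_instances", abs(change)  # under-used: shrink the cluster
    if usage > third_value:
        return "add_instances", abs(change)     # over-used: grow the cluster
    return "no_action", 0                       # within bounds, not a second cluster
```

For example, `plan_scaling(0.9, 0.3, 0.84, 1)` returns `("add_instances", 1)`, and `plan_scaling(0.2, 0.3, 0.84, -3)` returns `("remove_instances", 3)`.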
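In at least one embodiment of this application, the clusters whose stability is less than the first value are determined as first clusters, and the clusters whose usage rate is less than the second value or greater than the third value are determined as second clusters. Because the monitoring data of all clusters in the at least one cloud environment is processed to determine the first clusters or second clusters among all the clusters, when there are multiple first clusters or second clusters the server can handle a fault uniformly when the same fault occurs in several clusters.
In at least one embodiment of this application, after processing the abnormal cluster according to the abnormal type, the method further includes:
The server tests the abnormal cluster to obtain a test result. When the test result is a failure, the server generates target information according to the test result, encrypts the target information with the Advanced Encryption Standard (AES) algorithm to obtain a target ciphertext, and sends the target ciphertext to the terminal device of the designated contact person.
The target information includes the abnormal clusters that failed the test, the root cause of the failure, and the like.
The designated contact person may be the person in charge of cluster optimization, which is not limited in this application.
This implementation not only prevents the alarm information from being tampered with arbitrarily, improving its security, but also notifies the designated contact person promptly when the abnormal cluster fails the test.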
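Specifically, the server tests the abnormal cluster and obtains the test result in ways that include, but are not limited to, one or a combination of the following:
(1) The server performs a CPU performance test on the abnormal cluster to obtain a CPU performance test result;
(2) The server performs a memory performance test on the abnormal cluster to obtain a memory performance test result;
(3) The server performs a disk performance test on the abnormal cluster to obtain a disk performance test result;
(4) The server performs a function test on the abnormal cluster to obtain a function test result.
Specifically, the server performs the CPU performance test on the abnormal cluster as follows:
The server obtains a test script file and runs a CPU performance test tool according to the test script file. The server then uses the CPU performance test tool to test the abnormal cluster while it is overclocked or fully loaded, obtaining the CPU performance test result of the abnormal cluster when the CPU is overclocked or fully loaded.
In other embodiments, the server can also obtain other test results of the abnormal cluster and process all the test results together, making the test result of the abnormal cluster more accurate.
In the method flow described in FIG. 1, a fault that occurs in a cluster in a cloud environment can be handled accordingly, and when the same fault occurs in clusters in different cloud environments, the fault can also be handled uniformly.
The above are only specific embodiments of this application, but the scope of protection of this application is not limited thereto. A person of ordinary skill in the art may make improvements without departing from the inventive concept of this application, and all such improvements fall within the scope of protection of this application.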
如图2所示,是本申请集群优化装置的较佳实施例的功能模块图。所述集群优化装置11包括采集单元110、处理单元111、计算单元112、确定单元113、获取单元114、清洗单元115、测试单元116、生成单元117、加密单元118及发送单元119。本申请所称的模块/单元是指一种能够被处理器13所执行,并且能够完成固定功能的一系列计算机程序段,其存储在存储器12中。在本实施例中,关于各模块/单元的功能将在后续的实施例中详述。
采集单元110采集预设时间内至少一个云环境中所有集群的监控数据。
在本申请的至少一个实施例中,所述监控数据是从告警信息中提取的数据,所述监控数据包括,但不限于:告警时间、告警等级、所述告警等级对应的数量、所述告警等级对应的危险系数、集群实例的总数量、每个实例的利用率、告警根因等。
在本申请的至少一个实施例中,在采集预设时间内至少一个云环境中所有集群的监控数据之前,获取单元114获取所述预设时间内所述所有集群的告警信息,处理单元111对所述告警信息进行分词处理,得到多个第一信息,进一步地,清洗单元115清洗所述第一信息,得到多个第二信息,计算单元112采用TF-IDF算法计算所述多个第二信息的概率,更进一步地,确定单元113根据所述多个第二信息的概率确定所述监控数据。
通过获取所述预设时间内的告警信息,实现对所述监控数据的发生时间的控制,能够避免集群稳定性或者集群使用率的计算时间与所述发生时间的间隔过长,进而避免当前集群稳定性或者当前集群使用率的计算结果不准确。
具体地,所述清洗单元115清洗所述第一信息中的配置信息,得到所述多个第二信息。
其中,所述配置信息包括,但不限于:虚词、停用词等。
通过对所述多个第一信息的清洗,能够避免因所述配置信息的存在而影响所述多个第二信息的概率,使所述监控数据能够准确地被确定,此外,所述计算单元112无需对所述多个第一信息中配置信息的概率进行计算,能够缩短计算时间,使所述监控数据能够快速地被确定。
所述处理单元111对每个集群的监控数据进行归一化处理,得到每个集群的至少一种指标项。
在本申请的至少一个实施例中,所述至少一种指标项为计算集群的稳定性或者使用率的基础信息,所述至少一种指标项包括,但不限于:告警等级、所述告警等级对应的告警数量、所述告警等级对应的危险系数、集群实例的总数量、每个实例的利用率等。
在本申请的至少一个实施例中,所述处理单元111对每个集群的监控数据进行归一化处理,得到每个集群的至少一种指标项包括:
所述处理单元111采用去冗余算法去除所述监控数据中的冗余内容,得到目标数据,进一步地,所述处理单元111采用浅层式语义分析方法识别所述目标数据,并将识别出的含义相似的结果进行处理,得到所述至少一种指标项。
通过上述实施方式,能够对所述监控数据进行去冗余处理,减少服务器的占用内存,同时,对去冗余后的相似结果进行处理,能够使每个集群上的监控数据具有一致的名称,便于后续统一计算集群的稳定性及使用率。
所述计算单元112根据所述至少一种指标项,计算每个集群的稳定性及每个集群的使用率。
在本申请的至少一个实施例中,所述计算单元112根据所述至少一种指标项,计算每个集群的稳定性及每个集群的使用率包括:
所述计算单元112根据公式一计算每个集群的稳定性,所述公式一为:
Figure PCTCN2020099299-appb-000003
其中,S表示集群的稳定性,a表示1级告警的危险系数,x表示1级告警的数量,b表示2级告警的危险系数,y表示2级告警的数量,c表示3级告警的危险系数,z表示3级告警的数量,m表示集群中实例的总数量;
所述计算单元112根据公式二计算每个集群的使用率,所述公式二为:
Figure PCTCN2020099299-appb-000004
其中,U表示集群的使用率,n i表示第i个实例的利用率,i∈{1,2,3,…,m}(m∈N*)。
例如:集群A中1级告警的数量为10条,1级告警的危险系数为0.8,2级告警的数量为8条,2级告警的危险系数为0.6,3级告警的数量为6条,3级告警的危险系数为0.4,实例的总数量为2个,分别为实例甲及实例乙,实例甲的利用率为0.8及实例乙的利用率为0.6,所述计算单元112计算得出集群A的稳定性为:15.2%,经所述服务器计算,得到集群A的使用率为:70%。
通过上述实施方式,能够得到每个集群的稳定性及每个集群的使用率,为后续确定异常集群提供了数据基础。
所述确定单元113根据每个集群的稳定性及每个集群的使用率,确定异常集群以及所述异常集群的异常类型。
需要强调的是,为进一步保证上述异常集群及上述异常类型的私密和安全性,上述异常集群及上述异常类型还可以存储于一区块链的节点中。
在本申请的至少一个实施例中,所述异常集群是指稳定性小于第一数值的集群,及使用率小于第二数值或者使用率大于第三数值的集群。
进一步地,所述异常类型是根据集群的稳定性或者集群的使用率进行划分的,具体划分为第一集群及第二集群,所述第一集群的稳定性小于第一数值,所述第二集群的使用率小于第二数值或者所述第二集群的使用率大于第三数值,所述第二数值的取值小于所述第三数值的取值。
在本申请的至少一个实施例中,所述确定单元113根据每个集群的稳定性及每个集群的使用率,确定异常集群以及所述异常集群的异常类型包括以下一种或者多种方式的组合:
(1)对于每个集群中的任意集群,所述确定单元113获取该集群的稳定性及除该集群以外的其余集群的平均稳定性,进一步地,所述确定单元113将所述平均稳定性乘以第一预设比例,得到第一数值,当该集群的稳定性小于所述第一数值时,所述确定单元113将该集群确定为第一集群,所述第一集群属于稳定性异常类型的异常集群。
例如:集群B的稳定性为75%、集群C的稳定性为60%、集群D的稳定性为30%,第一预设比例为0.8,所述确定单元113获取所述集群B的稳定性为75%、所述集群B对应的其余集群的平均稳定性为45%,进一步地,所述确定单元113将所述平均稳定性乘以0.8,得到第一数值为36%,则所述集群B的稳定性大于所述第一数值;所述确定单元113获取所述集群C的稳定性为60%、所述集群C对应的其余集群的平均稳定性为52.5%,进 一步地,所述确定单元113将所述平均稳定性乘以0.8,得到第一数值为42%,则所述集群C的稳定性大于所述第一数值;所述确定单元113获取所述集群D的稳定性为30%、所述集群D对应的其余集群的平均稳定性为67.5%,则所述集群D的稳定性小于所述第一数值,因此,所述确定单元113将所述集群D确定为第一集群。
具体地,在获取每个集群的稳定性及其余集群的平均稳定性之前,所述确定单元113依次从所述所有集群中提取任一集群,进一步地,所述确定单元113获取除所述任意集群外的其余集群的稳定性,根据所述其余集群的稳定性,所述确定单元113确定所述其余集群的平均稳定性。所述平均稳定性的个数与所有集群的个数一致。
(2)对于每个集群中的任意集群,所述确定单元113获取该集群的使用率及除该集群以外的其余集群的平均使用率,进一步地,所述确定单元113将所述平均使用率乘以第二预设比例,得到第二数值,以及将所述平均使用率乘以第三预设比例,得到第三数值,当该集群的使用率小于所述第二数值或者大于所述第三数值时,所述确定单元113将该集群确定为第二集群,所述第二集群属于使用率异常类型的异常集群。
通过将每个集群的稳定性与所述第一数值进行比较,其中,所述第一数值是根据其余集群的平均稳定性乘以所述第一预设比例得来的,而不是根据所有集群的平均稳定性乘以所述第一预设比例得来的,能够使所述第一集群的确定更为精确,此外,所述第二集群的确定方法与所述第一集群的确定方法类似,因此也能准确确定所述第二集群。
由于集群的使用率过高会导致集群阻塞,集群的使用率过低会导致集群实例的浪费,因此,将使用率过高的集群及使用率过低的集群确定为所述第二集群,有利于后续对所述第二集群的优化。
所述处理单元111根据所述异常类型处理所述异常集群。
在本申请的至少一个实施例中,所述处理单元111根据所述异常类型处理所述异常集群包括以下一种或者多种方式的组合:
(1)所述处理单元111提取所述第一集群中的异常日志,进一步地,所述处理单元111从配置方案中获取与所述异常日志匹配的目标方案,更进一步地,所述处理单元111以所述目标方案处理所述第一集群。
其中,所述配置方案中存储着至少一个目标方案。
具体地,所述处理单元111提取所述第一集群中的异常日志包括:
所述处理单元111从所述告警信息中提取所述第一集群的目标告警信息,进一步地,所述处理单元111根据所述目标告警信息,从所述第一集群的日志中提取所述异常日志。
当在所述配置方案中未获取到与所述异常日志匹配的目标方案时,所述获取单元114获取与所述异常日志对应的告警信息,进一步地,加密单元118采用对称加密算法加密所述告警信息,得到第一密文,更进一步地,发送单元119将所述第一密文发送至负责人的终端设备。
(2)所述处理单元111根据所述第二集群的使用率,确定所述第二集群中实例数的变化量,进一步地,所述处理单元111根据所述变化量处理所述第二集群。
具体地,所述处理单元111根据所述第二集群的使用率,确定所述第二集群中实例数的变化量包括:
所述处理单元111获取所述第二集群中实例的第一数量,进一步地,所述处理单元111将所述第二集群的使用率乘以所述第一数量后,除以所述平均使用率,得到所述第二集群中实例的第二数量,进一步地,所述处理单元111将所述第二数量与所述第一数量进行相减运算,得到所述变化量。
例如:所述第二集群的使用率为90%,所述处理单元111获取到所述第二集群中实例的第一数量为2个,所述平均使用率为60%,经计算,得到所述第二集群中实例的第二数量为 3个,将所述第二数量与所述第一数量进行相减运算,得到所述变化量为1个。
具体地,所述处理单元111根据所述变化量处理所述第二集群包括:
当检测到所述第二集群的使用率小于所述第二数值时,所述处理单元111根据所述变化量减少所述第二集群的实例;或者当检测到所述第二集群的使用率大于所述第三数值时,所述处理单元111根据所述变化量增加所述第二集群的实例。
通过上述实施方式,能够解决集群阻塞或者集群使用率低的问题。
在本申请的至少一个实施例中,将所有集群中稳定性小于所述第一数值的集群确定为所述第一集群,及将所有集群中使用率小于所述第二数值、大于所述第三数值的集群确定为所述第二集群后,由于对所述至少一个云环境下所有集群的监控数据进行处理,进而确定所述所有集群中的所述第一集群或者所述第二集群,当所述第一集群或者所述第二集群有多个时,所述服务器能够在集群出现相同故障时,对该故障进行统一处理。
在本申请的至少一个实施例中,在根据所述异常类型处理所述异常集群之后,测试单元116测试所述异常集群,得到测试结果,当所述测试结果为测试不通过时,生成单元117根据所述测试结果生成目标信息,所述加密单元118采用高级加密标准算法加密所述目标信息,得到目标密文,进一步地,所述发送单元119将所述目标密文发送至指定联系人的终端设备。
其中,所述目标信息包括测试不通过的异常集群、测试结果不通过的根因等。
所述指定联系人可以是集群优化的负责人,本申请不作限制。
通过上述实施方式,不仅能够避免所述告警信息随意被篡改,提高所述告警信息的安全性,还能够在所述异常集群未通过测试时,及时通知所述指定联系人。
具体地,所述测试单元116测试所述异常集群,得到测试结果包括,但不限于以下一种或者多种方式的组合:
(1)所述测试单元116对所述异常集群进行CPU性能测试,得到所述CPU性能测试结果;
(2)所述测试单元116对所述异常集群进行内存性能测试,得到所述内存性能测试结果;
(3)所述测试单元116对所述异常集群进行磁盘性能测试,得到所述磁盘性能测试结果;
(4)所述测试单元116对所述异常集群进行功能测试,得到所功能测试结果。
具体地,所述测试单元116对所述异常集群进行CPU性能测试包括:
所述测试单元116获取测试脚本文件,根据所述测试脚本文件运行CPU性能测试工具,进一步地,所述测试单元116采用所述CPU性能测试工具对处于超频或者满载状态的所述异常集群进行测试,得到所述异常集群在CPU处于超频或者满载时的CPU性能测试结果。
在其他实施例中,所述测试单元116还能得到所述异常集群的其他测试结果,综合所有测试结果进行处理,使所述异常集群的测试结果更加精确。
在图2所描述的集群优化装置中,能够针对云环境下集群出现的故障进行相应的处理,而且当不同云环境下的集群出现相同故障时,还能够对该故障进行统一处理。
如图3所示,是本申请实现集群优化方法的较佳实施例的服务器的结构示意图。
在本申请的一个实施例中,所述服务器1包括,但不限于,存储器12、处理器13,以及存储在所述存储器12中并可在所述处理器13上运行的计算机程序,例如集群优化程序。
本领域技术人员可以理解,所述示意图仅仅是服务器1的示例,并不构成对服务器1的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件,例如所述服务器1还可以包括输入输出设备、网络接入设备、总线等。
所述处理器13可以是中央处理单元(Central Processing Unit,CPU),还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微 处理器或者该处理器也可以是任何常规的处理器等,所述处理器13是所述服务器1的运算核心和控制中心,利用各种接口和线路连接整个服务器1的各个部分,及执行所述服务器1的操作***以及安装的各类应用程序、程序代码等。
所述处理器13执行所述服务器1的操作***以及安装的各类应用程序。所述处理器13执行所述应用程序以实现上述各个集群优化方法实施例中的步骤,例如图1所示的步骤。
示例性的,所述计算机程序可以被分割成一个或多个模块/单元,所述一个或者多个模块/单元被存储在所述存储器12中,并由所述处理器13执行,以完成本申请。所述一个或多个模块/单元可以是能够完成特定功能的一系列计算机程序指令段,该指令段用于描述所述计算机程序在所述服务器1中的执行过程。例如,所述计算机程序可以被分割成采集单元110、处理单元111、计算单元112、确定单元113、获取单元114、清洗单元115、测试单元116、生成单元117、加密单元118及发送单元119。
所述存储器12可用于存储所述计算机程序和/或模块,所述处理器13通过运行或执行存储在所述存储器12内的计算机程序和/或模块,以及调用存储在存储器12内的数据,实现所述服务器1的各种功能。所述存储器12可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作***、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据服务器的使用所创建的数据等。此外,存储器12可以包括非易失性存储器,例如硬盘、内存、插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)、至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件。
所述存储器12可以是服务器1的外部存储器和/或内部存储器。进一步地,所述存储器12可以是具有实物形式的存储器,如内存条、TF卡(Trans-flash Card)等等。
所述服务器1集成的模块/单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中,所述计算机可读存储介质可以是非易失性的存储介质,也可以是易失性的存储介质。基于这样的理解,本申请实现上述实施例方法中的全部或部分流程,也可以通过计算机程序来指令相关的硬件来完成,所述的计算机程序可存储于一计算机可读存储介质中,该计算机程序在被处理器执行时,可实现上述各个方法实施例的步骤。
其中,所述计算机程序包括计算机程序代码,所述计算机程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。所述计算机可读介质可以包括:能够携带所述计算机程序代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(ROM,Read-Only Memory)。
结合图1,所述服务器1中的所述存储器12存储多个指令以实现一种集群优化方法,所述处理器13可执行所述多个指令从而实现:采集预设时间内至少一个云环境中所有集群的监控数据;对每个集群的监控数据进行归一化处理,得到每个集群的至少一种指标项;根据所述至少一种指标项,计算每个集群的稳定性及每个集群的使用率;根据每个集群的稳定性及每个集群的使用率,确定异常集群以及所述异常集群的异常类型;根据所述异常类型处理所述异常集群。
具体地,所述处理器13对上述指令的具体实现方法可参考图1对应实施例中相关步骤的描述,在此不赘述。
本申请所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。
在本申请所提供的几个实施例中,应该理解到,所揭露的***,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。
所述作为分离部件说明的模块可以是或者也可以不是物理上分开的,作为模块显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能模块可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能模块的形式实现。
因此,无论从哪一点来看,均应将实施例看作是示范性的,而且是非限制性的,本申请的范围由所附权利要求而不是上述说明限定,因此旨在将落在权利要求的等同要件的含义和范围内的所有变化涵括在本申请内。不应将权利要求中的任何附关联图标记视为限制所涉及的权利要求。
此外,显然“包括”一词不排除其他单元或步骤,单数不排除复数。***权利要求中陈述的多个单元或装置也可以由一个单元或装置通过软件或者硬件来实现。第二等词语用来表示名称,而并不表示任何特定的顺序。
最后应说明的是,以上实施例仅用以说明本申请的技术方案而非限制,尽管参照较佳实施例对本申请进行了详细说明,本领域的普通技术人员应当理解,可以对本申请的技术方案进行修改或等同替换,而不脱离本申请技术方案的精神和范围。

Claims (20)

  1. 一种集群优化方法,其中,所述集群优化方法包括:
    采集预设时间内至少一个云环境中所有集群的监控数据;
    对每个集群的监控数据进行归一化处理,得到每个集群的至少一种指标项;
    根据所述至少一种指标项,计算每个集群的稳定性及每个集群的使用率;
    根据每个集群的稳定性及每个集群的使用率,确定异常集群以及所述异常集群的异常类型;
    根据所述异常类型处理所述异常集群。
  2. 根据权利要求1所述的集群优化方法,其中,在所述采集预设时间内至少一个云环境中所有集群的监控数据之前,所述集群优化方法还包括:
    获取所述预设时间内所述所有集群的告警信息;
    对所述告警信息进行分词处理,得到多个第一信息;
    清洗所述第一信息,得到多个第二信息;
    采用TF-IDF算法计算所述多个第二信息的概率;
    根据所述多个第二信息的概率确定所述监控数据。
  3. 根据权利要求1所述的集群优化方法,其中,所述根据所述至少一种指标项,计算每个集群的稳定性及每个集群的使用率包括:
    根据公式一计算每个集群的稳定性,所述公式一为:
    Figure PCTCN2020099299-appb-100001
    其中,S表示集群的稳定性,a表示1级告警的危险系数,x表示1级告警的数量,b表示2级告警的危险系数,y表示2级告警的数量,c表示3级告警的危险系数,z表示3级告警的数量,m表示集群中实例的总数量;
    根据公式二计算每个集群的使用率,所述公式二为:
    Figure PCTCN2020099299-appb-100002
    其中,U表示集群的使用率,n i表示第i个实例的利用率,i∈{1,2,3,…,m}(m∈N*)。
  4. 根据权利要求1所述的集群优化方法,其中,所述根据每个集群的稳定性及每个集群的使用率,确定异常集群以及所述异常集群的异常类型包括以下一种或者多种方式的组合:
    对于每个集群中的任意集群,获取该集群的稳定性及除该集群以外的其余集群的平均稳定性,将所述平均稳定性乘以第一预设比例,得到第一数值,当该集群的稳定性小于所述第一数值时,将该集群确定为第一集群,所述第一集群属于稳定性异常类型的异常集群;及/或对于每个集群中的任意集群,获取该集群的使用率及除该集群以外的其余集群的平均使用率,将所述平均使用率乘以第二预设比例,得到第二数值,以及将所述平均使用率乘以第三预设比例,得到第三数值,当该集群的使用率小于所述第二数值或者大于所述第三数值时,将该集群确定为第二集群,所述第二集群属于使用率异常类型的异常集群。
  5. 根据权利要求4所述的集群优化方法,其中,所述根据所述异常类型处理所述异常集群包括以下一种或者多种方式的组合:
    提取所述第一集群中的异常日志,从配置方案中获取与所述异常日志匹配的目标方案,以所述目标方案处理所述第一集群;及/或
    根据所述第二集群的使用率,确定所述第二集群中实例数的变化量,根据所述变化量处理所述第二集群。
  6. 根据权利要求5所述的集群优化方法,其中,所述根据所述第二集群的使用率,确定所述第二集群中实例数的变化量包括:
    获取所述第二集群中实例的第一数量;
    将所述第二集群的使用率乘以所述第一数量后,除以所述平均使用率,得到所述第二集群中实例的第二数量;
    将所述第二数量与所述第一数量进行相减运算,得到所述变化量。
  7. 根据权利要求1所述的集群优化方法,其中,在根据所述异常类型处理所述异常集群之后,所述集群优化方法还包括:
    测试所述异常集群,得到测试结果;
    当所述测试结果为测试不通过时,根据所述测试结果生成目标信息;
    采用高级加密标准算法加密所述目标信息,得到目标密文;
    将所述目标密文发送至指定联系人的终端设备。
  8. 一种服务器,其中,所述服务器包括处理器和存储器,所述处理器用于执行存储器中存储的至少一个计算机可读指令以实现以下步骤:
    采集预设时间内至少一个云环境中所有集群的监控数据;
    对每个集群的监控数据进行归一化处理,得到每个集群的至少一种指标项;
    根据所述至少一种指标项,计算每个集群的稳定性及每个集群的使用率;
    根据每个集群的稳定性及每个集群的使用率,确定异常集群以及所述异常集群的异常类型;
    根据所述异常类型处理所述异常集群。
  9. 根据权利要求8所述的服务器,其中,在所述采集预设时间内至少一个云环境中所有集群的监控数据之前,所述处理器执行所述至少一个计算机可读指令以实现以下步骤:
    获取所述预设时间内所述所有集群的告警信息;
    对所述告警信息进行分词处理,得到多个第一信息;
    清洗所述第一信息,得到多个第二信息;
    采用TF-IDF算法计算所述多个第二信息的概率;
    根据所述多个第二信息的概率确定所述监控数据。
  10. 根据权利要求8所述的服务器,其中,在所述根据所述至少一种指标项,计算每个集群的稳定性及每个集群的使用率时,所述处理器执行所述至少一个计算机可读指令以实现以下步骤:
    根据公式一计算每个集群的稳定性,所述公式一为:
    Figure PCTCN2020099299-appb-100003
    其中,S表示集群的稳定性,a表示1级告警的危险系数,x表示1级告警的数量,b表示2级告警的危险系数,y表示2级告警的数量,c表示3级告警的危险系数,z表示3级告警的数量,m表示集群中实例的总数量;
    根据公式二计算每个集群的使用率,所述公式二为:
    Figure PCTCN2020099299-appb-100004
    其中,U表示集群的使用率,n i表示第i个实例的利用率,i∈{1,2,3,…,m}(m∈N*)。
  11. 根据权利要求8所述的服务器,其中,在所述根据每个集群的稳定性及每个集群的使用率,确定异常集群以及所述异常集群的异常类型时,所述处理器执行所述至少一个计算机可读指令还用以实现以下步骤:
    对于每个集群中的任意集群,获取该集群的稳定性及除该集群以外的其余集群的平均稳定性,将所述平均稳定性乘以第一预设比例,得到第一数值,当该集群的稳定性小于所述第一数值时,将该集群确定为第一集群,所述第一集群属于稳定性异常类型的异常集群;及/或 对于每个集群中的任意集群,获取该集群的使用率及除该集群以外的其余集群的平均使用率,将所述平均使用率乘以第二预设比例,得到第二数值,以及将所述平均使用率乘以第三预设比例,得到第三数值,当该集群的使用率小于所述第二数值或者大于所述第三数值时,将该集群确定为第二集群,所述第二集群属于使用率异常类型的异常集群。
  12. 根据权利要求11所述的服务器,其中,在所述根据所述异常类型处理所述异常集群时,所述处理器执行所述至少一个计算机可读指令以实现以下步骤:
    提取所述第一集群中的异常日志,从配置方案中获取与所述异常日志匹配的目标方案,以所述目标方案处理所述第一集群;及/或
    根据所述第二集群的使用率,确定所述第二集群中实例数的变化量,根据所述变化量处理所述第二集群。
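  13. The server according to claim 12, wherein, when the change in the number of instances in the second cluster is determined according to the utilization rate of the second cluster, the processor executes the at least one computer-readable instruction to implement the following steps:
    obtaining a first number of instances in the second cluster;
    multiplying the utilization rate of the second cluster by the first number and dividing the product by the average utilization rate to obtain a second number of instances in the second cluster;
    subtracting the first number from the second number to obtain the change.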
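  14. A computer-readable storage medium storing at least one computer-readable instruction which, when executed by a processor, implements the following steps:
    collecting monitoring data of all clusters in at least one cloud environment within a preset time;
    normalizing the monitoring data of each cluster to obtain at least one indicator item for each cluster;
    calculating the stability and the utilization rate of each cluster according to the at least one indicator item;
    determining an abnormal cluster and an abnormality type of the abnormal cluster according to the stability and the utilization rate of each cluster;
    processing the abnormal cluster according to the abnormality type.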
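  15. The storage medium according to claim 14, wherein, before the monitoring data of all clusters in at least one cloud environment within the preset time is collected, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    obtaining alarm information of all the clusters within the preset time;
    performing word segmentation on the alarm information to obtain a plurality of pieces of first information;
    cleaning the first information to obtain a plurality of pieces of second information;
    calculating the probability of each of the plurality of pieces of second information using the TF-IDF algorithm;
    determining the monitoring data according to the probabilities of the plurality of pieces of second information.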
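  16. The storage medium according to claim 14, wherein, when the stability and the utilization rate of each cluster are calculated according to the at least one indicator item, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    calculating the stability of each cluster according to formula one, formula one being:
    Figure PCTCN2020099299-appb-100005
    where S denotes the stability of the cluster, a denotes the risk coefficient of level-1 alarms, x denotes the number of level-1 alarms, b denotes the risk coefficient of level-2 alarms, y denotes the number of level-2 alarms, c denotes the risk coefficient of level-3 alarms, z denotes the number of level-3 alarms, and m denotes the total number of instances in the cluster;
    calculating the utilization rate of each cluster according to formula two, formula two being:
    Figure PCTCN2020099299-appb-100006
    where U denotes the utilization rate of the cluster, n_i denotes the utilization of the i-th instance, and i∈{1,2,3,…,m} (m∈N*).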
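  17. The storage medium according to claim 14, wherein, when the abnormal cluster and the abnormality type of the abnormal cluster are determined according to the stability and the utilization rate of each cluster, the at least one computer-readable instruction is executed by the processor to further implement the following steps:
    for any one of the clusters, obtaining the stability of that cluster and the average stability of the remaining clusters other than that cluster, multiplying the average stability by a first preset ratio to obtain a first value, and, when the stability of that cluster is less than the first value, determining that cluster to be a first cluster, the first cluster being an abnormal cluster of the stability-abnormal type; and/or
    for any one of the clusters, obtaining the utilization rate of that cluster and the average utilization rate of the remaining clusters other than that cluster, multiplying the average utilization rate by a second preset ratio to obtain a second value, and multiplying the average utilization rate by a third preset ratio to obtain a third value, and, when the utilization rate of that cluster is less than the second value or greater than the third value, determining that cluster to be a second cluster, the second cluster being an abnormal cluster of the utilization-abnormal type.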
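  18. The storage medium according to claim 17, wherein, when the abnormal cluster is processed according to the abnormality type, the at least one computer-readable instruction is executed by the processor to further implement the following steps:
    extracting an abnormal log from the first cluster, obtaining from configured solutions a target solution that matches the abnormal log, and processing the first cluster with the target solution; and/or
    determining, according to the utilization rate of the second cluster, a change in the number of instances in the second cluster, and processing the second cluster according to the change.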
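  19. The storage medium according to claim 18, wherein, when the change in the number of instances in the second cluster is determined according to the utilization rate of the second cluster, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    obtaining a first number of instances in the second cluster;
    multiplying the utilization rate of the second cluster by the first number and dividing the product by the average utilization rate to obtain a second number of instances in the second cluster;
    subtracting the first number from the second number to obtain the change.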
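  20. A cluster optimization apparatus, comprising:
    a collection unit, configured to collect monitoring data of all clusters in at least one cloud environment within a preset time;
    a processing unit, configured to normalize the monitoring data of each cluster to obtain at least one indicator item for each cluster;
    a calculation unit, configured to calculate the stability and the utilization rate of each cluster according to the at least one indicator item;
    a determination unit, configured to determine an abnormal cluster and an abnormality type of the abnormal cluster according to the stability and the utilization rate of each cluster;
    the processing unit being further configured to process the abnormal cluster according to the abnormality type.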
PCT/CN2020/099299 2020-03-18 2020-06-30 Cluster optimization method, apparatus, server, and medium WO2021184588A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010192804.1 2020-03-18
CN202010192804.1A CN111581044A (zh) 2020-03-18 2020-03-18 Cluster optimization method, apparatus, server, and medium

Publications (1)

Publication Number Publication Date
WO2021184588A1 true WO2021184588A1 (zh) 2021-09-23

Family

ID=72126062

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/099299 WO2021184588A1 (zh) 2020-03-18 2020-06-30 Cluster optimization method, apparatus, server, and medium

Country Status (2)

Country Link
CN (1) CN111581044A (zh)
WO (1) WO2021184588A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115499299A (zh) * 2022-09-13 2022-12-20 航天信息股份有限公司 Cluster device monitoring method and apparatus

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103763143A (zh) * 2014-01-23 2014-04-30 北京华胜天成科技股份有限公司 Method and system for device abnormality alarm based on a storage server
US20170046191A1 (en) * 2014-06-30 2017-02-16 Bmc Software, Inc. Capacity risk management for virtual machines
CN106789257A (zh) * 2016-12-23 2017-05-31 航天星图科技(北京)有限公司 Visual management method for server status in a cloud system
CN107391633A (zh) * 2017-06-30 2017-11-24 北京奇虎科技有限公司 Database cluster automatic optimization processing method, apparatus, and server

Also Published As

Publication number Publication date
CN111581044A (zh) 2020-08-25

Similar Documents

Publication Publication Date Title
US8850263B1 (en) Streaming and sampling in real-time log analysis
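CN111538642B (zh) Abnormal behavior detection method and apparatus, electronic device, and storage medium
US10263833B2 (en) Root cause investigation of site speed performance anomalies
CN110083475B (zh) Method and apparatus for detecting abnormal data
GB2478066A (en) Identifying errors in a computer system using the relationships between the sources of log messages
CN110647447B (zh) Abnormal instance detection method, apparatus, device, and medium for distributed systems
Di et al. Exploring properties and correlations of fatal events in a large-scale hpc system
US20150113337A1 (en) Failure symptom report device and method for detecting failure symptom
WO2022142013A1 (zh) Artificial-intelligence-based A/B testing method and apparatus, computer device, and medium
US9116804B2 (en) Transient detection for predictive health management of data processing systems
WO2021184588A1 (zh) Cluster optimization method, apparatus, server, and medium
CN114595765A (zh) Data processing method and apparatus, electronic device, and storage medium
CN108595685B (zh) Data processing method and apparatus
CN114297037A (zh) Alarm clustering method and apparatus
US11394629B1 (en) Generating recommendations for network incident resolution
CN112819305A (zh) Business indicator analysis method, apparatus, device, and storage medium
CN114360732B (zh) Medical data analysis method and apparatus, electronic device, and storage medium
CN115562934A (zh) Artificial-intelligence-based service traffic switching method and related device
JP2017211806A (ja) Communication monitoring method, security management system, and program
CN115237721A (zh) Method, apparatus, and storage medium for predicting faults based on windowed frequent sequences
CN115509853A (zh) Cluster data anomaly detection method and electronic device
CN112306831B (zh) Computing cluster error prediction method and related device
WO2018122889A1 (ja) Anomaly detection method, system, and program
JPWO2013114911A1 (ja) Risk assessment system, risk assessment method, and program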
US11138512B2 (en) Management of building energy systems through quantification of reliability

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20925125; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20925125; Country of ref document: EP; Kind code of ref document: A1)