WO2022016808A1

WO2022016808A1 - Kubernetes cluster resource dynamic adjustment method and electronic device

Info

Publication number: WO2022016808A1
Application number: PCT/CN2020/140019
Authority: WO
Inventors: 杨磊; 王洋; 须成忠
Original assignee: 中国科学院深圳先进技术研究院
Priority date: 2020-07-22
Filing date: 2020-12-28
Publication date: 2022-01-27
Also published as: CN113971066A

Abstract

The present application relates to a Kubernetes cluster resource dynamic adjustment method and an electronic device. The method comprises: collecting historical monitoring index data in an application and, on the basis of the monitoring index data, obtaining monitoring resource consumption timing data; constructing an LSTM-Kalman filter, and inputting the resource consumption timing data into the LSTM-Kalman for pre-training to obtain a prediction model; collecting new resource consumption timing data, inputting the new resource consumption timing data into the prediction model to predict a resource consumption peak value in a future preset time period, and using the resource consumption peak value as a resource limit value for resource updating. The present application ensures the precision of the data and effectively increases the rate of utilisation of system resources, solving the problem of resource fragmentation, and also has a certain capacity for self-adaption.

Description

一种Kubernetes集群资源动态调整方法及电子设备A Kubernetes cluster resource dynamic adjustment method and electronic device

技术领域technical field

本申请属于计算机领域，特别涉及一种Kubernetes集群资源动态调整方法及电子设备。The present application belongs to the field of computers, and in particular relates to a method for dynamically adjusting Kubernetes cluster resources and an electronic device.

背景技术Background technique

随着以Docker为支撑的容器即服务技术兴起，Kubernetes(一种容器集群管理技术，可以实现容器的编排管理)因其强大的容器编排技术受到业界追捧，已然成为容器集群管理的“事实标准”，更是被誉为“云时代的Linux”。为避免应用过多消耗资源，从而影响其他应用及节点性能，需要通过Kubernetes的资源限制机制以及Docker的Cgroup技术实现应用资源控制隔离。通常在部署应用的时候，只能通过配置文件进行静态部署，需要人为的估算Cpu、Memory等资源需求量。显然当预留资源不足会导致容器发生Cputhrottle或者Out of memory kill，影响服务质量，需要新建Pod(是Kubernetes最基本操作单元，包含一个或多个密切相关的业务逻辑容器，是集群调度的最小单位)再次部署。当出现冗余配置时，会使得集群资源利用率低下，影响节点可分配资源，导致其他Pod无法被调度到该节点上。With the rise of container-as-a-service technology supported by Docker, Kubernetes (a container cluster management technology that can realize container orchestration and management) is sought after by the industry due to its powerful container orchestration technology, and has become the "de facto standard" for container cluster management. , is also known as "Linux in the cloud era". In order to prevent the application from consuming too many resources and thus affecting the performance of other applications and nodes, it is necessary to implement the application resource control and isolation through the resource limitation mechanism of Kubernetes and the Cgroup technology of Docker. Usually, when deploying an application, it can only be statically deployed through the configuration file, and it is necessary to manually estimate the resource requirements such as Cpu and Memory. Obviously, when the reserved resources are insufficient, Cputhrottle or Out of memory kill will occur in the container, which will affect the quality of service. It is necessary to create a new Pod (the most basic operation unit of Kubernetes, which contains one or more closely related business logic containers and is the smallest unit of cluster scheduling). ) to deploy again. When redundant configuration occurs, the utilization of cluster resources will be low, which will affect the resources that can be allocated by the node, and prevent other Pods from being scheduled to the node.

通常，如果不对Pod进行资源限制，Pod将会无限制的使用完节点上的资源，出现资源竞争，对服务的稳定性产生巨大影响。在实际生产环境中，流量会发生规律性变化，因此资源使用率随之发生变化，静态部署的方式会带来资源不足或冗余。为了在资源不断波动情况下，能够动态调整资源，开源社区提供HPA(Pod横向自动扩容，https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)、VPA(垂直自动扩容，https://github.com/kubernetes/community/blob/master/contributors/design-proposals/autoscaling/vertical-pod-autoscaler.md)。HPA的原理是通过Kubernetes的Controller Manager组件运行一个Controller，周期性监听Pod资源使用情况，当高于设定的阈值时，会自动增加Pod的数量，当低于某个阈值时，会自动减少Pod数量，通过调整集群中Pod的数量来动态改变资源。VPA与HPA相似，但是调整的是单个Pod的资源请求值，通过监控组件的数据，计算Pod的资源推荐值，允许在节点上进行适当调度，为每个Pod提供适当资源，当被VPA接管的Pod资源实际消耗值与推荐值相差较大时，会对Pod进行驱逐并重新建立Pod再次部署到集群中，从而达到动态资源改变的目的。Usually, if the Pod is not limited in resources, the Pod will use up the resources on the node without limit, and there will be resource competition, which will have a huge impact on the stability of the service. In the actual production environment, the traffic will change regularly, so the resource usage will change accordingly, and the static deployment method will lead to insufficient or redundant resources. In order to dynamically adjust resources when resources are constantly fluctuating, the open source community provides HPA (horizontal automatic expansion of pods, https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/), VPA (vertical automatic expansion of pods) Automatic scaling, https://github.com/kubernetes/community/blob/master/contributors/design-proposals/autoscaling/vertical-pod-autoscaler.md). The principle of HPA is to run a Controller through the Controller Manager component of Kubernetes to periodically monitor the usage of Pod resources. When it is higher than the set threshold, the number of Pods will be automatically increased, and when it is lower than a certain threshold, the number of Pods will be automatically reduced Quantity, dynamically changing resources by adjusting the number of Pods in the cluster. VPA is similar to HPA, but it adjusts the resource request value of a single Pod. By monitoring the data of components, calculating the resource recommendation value of Pod, allowing appropriate scheduling on the node, providing appropriate resources for each Pod, when the VPA takes over When the actual consumption value of Pod resources differs greatly from the recommended value, the Pod will be expelled and re-established and deployed to the cluster again, so as to achieve the purpose of dynamic resource change.

为了服务的稳定性，工业界也会进行全链路的压力测试，利用集成在Kubernetes中的cAdvisor监控***，提前预判Pod需要的副本数和资源量，在无法准确预算资源时，采用冗余配置的方式。除此之外，学术界一般通过实验得到性能随资源变化的曲线，确定一个合适的资源预留值，提前预留好相应的资源。如文献[Xu,G.,Xu,C.-Z MEER:Online Estimation of Optimal Reservations for Long Lived Containers in In-Memory Cluster Computing.ICDCS 2019]提出一种面向长生命周期的容器敏感内存资源预留机制，允许容器可以在比理想内存少很多的情况下运行，只需付出一定的性能损失代价即可。文献采用离线训练，得到模型后，在线预测获得Pod资源限额。For service stability, the industry will also conduct stress tests on the entire link, using the cAdvisor monitoring system integrated in Kubernetes to predict the number of replicas and resources required by the Pod in advance, and use redundancy when the resources cannot be accurately budgeted. way of configuration. In addition, the academic community generally obtains the curve of performance changing with resources through experiments, determines an appropriate resource reservation value, and reserves the corresponding resources in advance. For example, the literature [Xu, G., Xu, C.-Z MEER: Online Estimation of Optimal Reservations for Long Lived Containers in In-Memory Cluster Computing. ICDCS 2019] proposes a long-life cycle-oriented container-sensitive memory resource reservation mechanism , allowing containers to run with much less memory than ideal, at the cost of a performance penalty. The literature adopts offline training. After obtaining the model, online prediction is used to obtain the Pod resource limit.

如上所述，若采用开源社区的HPA或VPA解决方案，对时间,Cpu,Memory,网络资源等敏感的在线或实时计算应用来说，实时性不够。通常部署的应用都是重量级Java框架或者相关技术栈的Web应用，利用HPA进行水平扩容来调整Pod数量，可以在较短时间能增加Pod数量，但是从触发扩容到经过Kube-Scheduler(Kubernetes的集群调度器，负责将Pod调度到合适的节点上)调度到合适的节点再通过Docker client拉取镜像，开启容器开启服务需要较长时间。至于VPA提供的垂直扩容，需要从节点驱逐旧Pod，重新建立满足资源需求的Pod，再通过API Server(Kubernetes资源对象的唯一操作入口，其他组件都必须通过它提供的API来操作资源数据)，Kube-Scheduler重新部署，无法做到原地升级，显然并不满足实时性需求。并且VPN还处于alpha版，并没有合并到官方的Kubernetes release中，而且HPA和VPA目前不兼容，只能选择一个使用，否则两者之间会产生干扰。As mentioned above, if the HPA or VPA solution of the open source community is adopted, the real-time performance is not enough for time-sensitive online or real-time computing applications such as Cpu, Memory, and network resources. Usually deployed applications are heavyweight Java frameworks or web applications of related technology stacks. Using HPA to perform horizontal expansion to adjust the number of Pods can increase the number of Pods in a short period of time. The cluster scheduler is responsible for scheduling the Pod to the appropriate node) to the appropriate node and then pulling the image through the Docker client. It takes a long time to open the container and start the service. As for the vertical expansion provided by VPA, it is necessary to evict old Pods from nodes, re-establish Pods that meet resource requirements, and then pass API Server (the only operation entry for Kubernetes resource objects, other components must operate resource data through the API provided by it), Kube-Scheduler is redeployed and cannot be upgraded in place, which obviously does not meet the real-time requirements. And the VPN is still in the alpha version and has not been merged into the official Kubernetes release, and HPA and VPA are currently incompatible, so you can only choose one to use, otherwise there will be interference between the two.

对于全链路的离线、在线压测提前预判的方式，或者例如文献[谢文舟,孙艳霞.基于Kubernetes负载特征的资源预测模型研究[J].网络安全技术与应用,2018(04):27-28.]通过在某几种应用类型上大量实验，不断改变资源预留值，测量应用的性能指标，从而建立资源和性能变化曲线，最后折中得到一个资源预留值。然而在生产环境的异构集群中存在成千上万的应用，各种应用版本发布更新频繁，应用链路复杂，各节点物理配置不尽相同。穷举所有应用进行压测找到最优或次优的资源调整方案，开销过大，耗时耗力，并不现实。即便可以，往往旧版本的测试数据可能并不适合新版本，或者相同的数据在不同的机器上，由于配置不同，表现出的性能往往也存在差异性，并且没有做到动态调整的效果。For the offline and online pressure testing of the entire link, the method of pre-judging in advance, or for example literature [Xie Wenzhou, Sun Yanxia. Research on resource prediction model based on Kubernetes load characteristics [J]. Network Security Technology and Application, 2018(04):27- 28.] Through a large number of experiments on certain application types, constantly changing the resource reservation value, measuring the performance index of the application, thereby establishing a resource and performance change curve, and finally obtaining a resource reservation value by compromise. However, there are thousands of applications in heterogeneous clusters in the production environment. Various application versions are frequently released and updated, application links are complex, and the physical configuration of each node is different. It is unrealistic to exhaust all applications and perform stress testing to find the optimal or sub-optimal resource adjustment solution, which is too expensive, time-consuming and labor-intensive. Even if it is possible, the test data of the old version may not be suitable for the new version, or the same data on different machines, due to different configurations, often show differences in performance, and the effect of dynamic adjustment is not achieved.

发明内容SUMMARY OF THE INVENTION

本申请提供了一种Kubernetes集群资源动态调整方法及电子设备，旨在至少在一定程度上解决现有技术中的上述技术问题之一。The present application provides a dynamic adjustment method and electronic device for Kubernetes cluster resources, aiming to solve one of the above technical problems in the prior art at least to a certain extent.

为了解决上述问题，本申请提供了如下技术方案：In order to solve the above problems, the application provides the following technical solutions:

一种Kubernetes集群资源动态调整方法，包括以下步骤：A method for dynamic adjustment of Kubernetes cluster resources, comprising the following steps:

步骤a：收集应用中的历史监控指标数据，根据所述监控指标数据得到资源消耗时序数据；Step a: collect historical monitoring indicator data in the application, and obtain resource consumption time series data according to the monitoring indicator data;

步骤b：构建LSTM-Kalman滤波器，将所述资源消耗时序数据输入所述LSTM-Kalman进行预训练，得到预测模型；其中，所述LSTM-Kalman滤波器为利用LSTM改造后的Kalman滤波器；Step b: constructing an LSTM-Kalman filter, and inputting the resource consumption time series data into the LSTM-Kalman for pre-training to obtain a prediction model; wherein, the LSTM-Kalman filter is a Kalman filter transformed by LSTM;

步骤c：收集新的资源消耗时序数据，将所述新的资源消耗时序数据输入所述预测模型中预测未来预设时间段内的资源消耗峰值，将所述资源消耗峰值作为资源限制值进行资源更新。Step c: Collect new resource consumption time series data, input the new resource consumption time series data into the prediction model to predict the resource consumption peak value within a preset time period in the future, and use the resource consumption peak value as a resource limit value for resource consumption. renew.

本申请实施例采取的技术方案还包括：所述步骤a中，所述收集应用中的历史监控指标数据包括：The technical solution adopted in the embodiment of the present application further includes: in the step a, the collection of historical monitoring indicator data in the application includes:

对运行在Pod中的容器进行监控并收集各种监控指标数据；收集的监控指标数据包括cpu利用率及限额、文件***读/写利用率及限额、网络报文发送/接收/丢弃率数据中的至少一种或一种以上的组合；Monitor the containers running in the Pod and collect various monitoring indicator data; the collected monitoring indicator data includes CPU utilization and quota, file system read/write utilization and quota, and network packet sending/receiving/discarding rate data. at least one or a combination of more than one;

将收集到的监控指标数据存入时序数据库。Store the collected monitoring indicator data in the time series database.

本申请实施例采取的技术方案还包括：在所述步骤a中，所述根据所述监控指标数据得到资源消耗时序数据包括：The technical solution adopted in the embodiment of the present application further includes: in the step a, the obtaining the resource consumption time series data according to the monitoring indicator data includes:

定时查询所述时序数据库得到感兴趣的指标数据，并对所述感兴趣的指标数据进行缓存及整理，形成可以直接用于预测的资源消耗时序数据。The time series database is regularly queried to obtain the index data of interest, and the interested index data is cached and sorted to form time series data of resource consumption that can be directly used for prediction.

本申请实施例采取的技术方案还包括：所述对所述感兴趣的指标数据进行缓存及整理包括：The technical solutions adopted in the embodiments of the present application further include: the caching and sorting of the indicator data of interest includes:

使用滑动窗口依次滑动每一条指标数据，并选择窗口中的最大值作为新的数据集。Use the sliding window to slide each indicator data in turn, and select the maximum value in the window as the new data set.

本申请实施例采取的技术方案还包括：在所述步骤b中，所述构建LSTM-Kalman滤波器包括：The technical solution adopted in the embodiment of the present application further includes: in the step b, the constructing the LSTM-Kalman filter includes:

假设观测值是***状态值的有噪估计，令观测矩阵H＝I：Assuming that the observations are noisy estimates of the state values of the system, let the observation matrix H = I:

X(k)＝f(X(k-1))+W(k)，W～N(0,Q)X(k)=f(X(k-1))+W(k), W～N(0,Q)

Z(k)＝X(k)+V(k)，V～N(0,R)Z(k)=X(k)+V(k), V～N(0,R)

上式中，X(k)为k时刻的***状态，Z(k)为k时刻的观测值，W(k)和V(k)分别表示过程和测量噪声，它们的协方差分别为Q，R；f为由LSTM _f模型生成的函数模型； In the above formula, X(k) is the system state at time k, Z(k) is the observed value at time k, W(k) and V(k) represent the process and measurement noise, respectively, and their covariances are Q, respectively, R; f is the function model generated _{by the LSTM f model;}

预测步骤为：The prediction steps are:

X(k|k-1)＝f(X(k-1|k-1))X(k|k-1)=f(X(k-1|k-1))

P(k|k-1)＝F P(k-1|k-1)FT+Q(k)P(k|k-1)=F P(k-1|k-1)FT+Q(k)

上式中，F代表f关于X(k-1|k-1)的雅可比矩阵，Q(k)由LSTM _Q给出； In the above formula, F represents the Jacobian matrix of f with respect to X(k-1|k-1), and Q(k) is given _{by LSTM Q;}

更新步骤为：The update steps are:

Kg(k)＝P(k|k-1)((P(k|k-1)+R(k))-1Kg(k)=P(k|k-1)((P(k|k-1)+R(k))-1

X(k|k)＝X(k|k-1)+Kg(k)(Z(k)–X(k|k-1)X(k|k)=X(k|k-1)+Kg(k)(Z(k)–X(k|k-1)

P(k|k)＝(I-Kg(k))P(k|k-1)P(k|k)=(I-Kg(k))P(k|k-1)

上式中，R(k)由LSTM _r得到，Z(k)代表k时刻的观测值； In the above formula, R(k) is _{obtained from LSTM r} , and Z(k) represents the observation value at time k;

将配置文件中的request值作为X(k-1|k-1)，当前时刻k采集到的指标数据作为观测值Z(k|k)，分别经过所述LSTM _f和LSTM _r网络得到k时刻的预测值X(k|k-1)和观测噪声R(k)；X(k|k-1)经过LSTM _Q得到k时刻的状态噪声Q(k)；利用R(k)、Q(k)、X(k-1|k-1)对所述Kalman滤波器进行递推更新。 The request value in the configuration file is taken as X(k-1|k-1), the index data collected at the current moment k is taken as the observation value Z(k|k), and the time k is obtained _{through the LSTM f} _{and LSTM r networks respectively} The predicted value X(k|k-1) and observation noise R(k) of ; X(k|k-1) obtains the state noise Q(k) at time k _{through LSTM Q; using R(k), Q(k} ), X(k-1|k-1) to recursively update the Kalman filter.

本申请实施例采取的技术方案还包括：在所述步骤c中，所述将所述资源消耗时序数据输入所述LSTM-Kalman进行预训练，得到预测模型还包括：The technical solution adopted in the embodiment of the present application further includes: in the step c, the inputting the resource consumption time series data into the LSTM-Kalman for pre-training, and obtaining a prediction model further includes:

模型评估：利用新的监控指标数据对所述预测模型进行迭代训练，直到所述预测模型损失收敛或只在预设范围内震荡时停止迭代训练。Model evaluation: Iterative training is performed on the prediction model using new monitoring index data, and the iterative training is stopped until the loss of the prediction model converges or only oscillates within a preset range.

本申请实施例采取的技术方案还包括：所述模型评估还包括：The technical solutions adopted in the embodiments of the present application further include: the model evaluation further includes:

获取预设时间段内的新的监控指标数据，将所述预测模型的预测结果与所述新的监控指标数据进行比较，判断是否需要再次进行模型训练：如果大于设定阈值的时间消耗值在所述预测模型的预测结果以下则不再更新模型，反之则使用所述新的监控指标数据更新所述预测模型的参数，对所述预测模型进行修正。Acquire new monitoring indicator data within a preset time period, compare the prediction result of the prediction model with the new monitoring indicator data, and determine whether model training needs to be performed again: if the time consumption value greater than the set threshold is in If the prediction result of the prediction model is below, the model will not be updated. Otherwise, the parameters of the prediction model will be updated by using the new monitoring index data to correct the prediction model.

本申请实施例采取的技术方案还包括：在所述步骤c中，所述预测未来预设时间段内的资源消耗峰值还包括：The technical solution adopted in the embodiment of the present application further includes: in the step c, the predicting the peak resource consumption within a preset time period in the future further includes:

待所述LSTM _f、LSTM _r、LSTM _Q三个子循环网络损失收敛或者降低到预设阈值，开始执行递推估计，通过度量收集器上报数据给预测模型，所述预测模型对未来预设时间段内的资源消耗峰值进行最优估计，使用最优估计值对Pod中容器资源进行更新操作。 After the _{three sub-loop network losses of the LSTM f} , LSTM _r and LSTM _Q converge or decrease to a preset threshold, start to perform recursive estimation, and report data to the prediction model through the metric collector, and the prediction model will be used for the future preset time period. The resource consumption peak in the Pod is optimally estimated, and the optimal estimated value is used to update the container resources in the Pod.

本申请实施例采取的又一技术方案为：一种电子设备，包括：Another technical solution adopted in the embodiment of the present application is: an electronic device, comprising:

至少一个处理器；以及at least one processor; and

与所述至少一个处理器通信连接的存储器；其中，a memory communicatively coupled to the at least one processor; wherein,

所述存储器存储有可被所述一个处理器执行的指令，所述指令被所述至少一个处理器执行，以使所述至少一个处理器能够执行上述的Kubernetes集群资源动态调整方法的以下操作：The memory stores instructions executable by the one processor, and the instructions are executed by the at least one processor, so that the at least one processor can perform the following operations of the above-mentioned method for dynamically adjusting Kubernetes cluster resources:

相对于现有技术，本申请实施例产生的有益效果在于：本申请实施例的Kubernetes集群资源动态调整方法及电子设备首先对应用进行静态冗余配置部署，收集真实负载环境下应用的历史监控指标数据，在线进行预训练得到较为可靠的预测模型；然后对预测模型进行迭代训练，直到模型损失收敛或只在预设范围内震荡后，不断采集新的资源消耗时序数据输入预测模型中预测未来一小段时间内资源消耗的峰值，将该值作为资源限制值进行更新，实现动态的Pod资源限制调整。相对于现有技术，本申请至少具有以下有益效果：Compared with the prior art, the beneficial effects of the embodiments of the present application are as follows: the method for dynamic adjustment of Kubernetes cluster resources and the electronic device of the embodiments of the present application first perform static redundant configuration deployment on the application, and collect historical monitoring indicators of the application under a real load environment. Data, pre-train online to obtain a more reliable prediction model; then iteratively train the prediction model until the loss of the model converges or only oscillates within a preset range, and continuously collect new resource consumption time series data and input it into the prediction model to predict the future. The peak value of resource consumption in a short period of time is updated as the resource limit value to realize dynamic Pod resource limit adjustment. Compared with the prior art, the present application at least has the following beneficial effects:

1、面对流量“洪峰”短时间内可以起到良好的缓冲作用，保证服务的实时性，稳定性；面对流量“低谷”，又可将空闲的资源归还内核，有效的提高了***资源利用率，解决了资源碎片问题，同时具备一定的自适应的能力。1. In the face of traffic "peaks", it can play a good buffering role in a short period of time to ensure the real-time and stability of services; in the face of traffic "troughs", idle resources can be returned to the kernel, which effectively improves system resources. Utilization rate, solves the problem of resource fragmentation, and has a certain self-adaptive ability.

2、在预测的同时进行滤噪处理，保证数据的精确性。2. Perform noise filtering processing at the same time of prediction to ensure the accuracy of the data.

3、规避了应用版本差异、集群异构性导致模型不适用的问题，无需进行大量的压测试验，建立性能和资源关系曲线。3. It avoids the problem of inapplicability of models caused by differences in application versions and cluster heterogeneity. There is no need to conduct a large number of stress tests to establish a performance and resource relationship curve.

4、通过预测未来一段时间资源消耗峰值作为资源限制值，对资源进行动态限制，采用局部最大值代替全局最大值的配置思想，提高了资源利用率，起到了自适应的效果。4. By predicting the peak value of resource consumption in the future as the resource limit value, the resource is dynamically limited, and the configuration idea of using the local maximum value instead of the global maximum value improves the resource utilization rate and has an adaptive effect.

附图说明Description of drawings

图1是本申请实施例的Kubernetes集群资源动态调整方法的流程图；1 is a flowchart of a method for dynamically adjusting Kubernetes cluster resources according to an embodiment of the present application;

图2为本申请实施例的Kubernetes集群资源动态调整***结构示意图；2 is a schematic structural diagram of a Kubernetes cluster resource dynamic adjustment system according to an embodiment of the application;

图3为经LSTM改造后的Kalman滤波器示意图；Figure 3 is a schematic diagram of the Kalman filter transformed by LSTM;

图4是本申请实施例提供的Kubernetes集群资源动态调整方法的硬件设备结构示意图。FIG. 4 is a schematic diagram of a hardware device structure of a method for dynamically adjusting Kubernetes cluster resources provided by an embodiment of the present application.

具体实施方式detailed description

为了使本申请的目的、技术方案及优点更加清楚明白，以下结合附图及实施例，对本申请进行进一步详细说明。应当理解，此处所描述的具体实施例仅用以解释本申请，并不用于限定本申请。In order to make the purpose, technical solutions and advantages of the present application more clearly understood, the present application will be described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application, but not to limit the present application.

针对现有技术的不足，本申请实施例的Kubernetes集群资源动态调整方法提出一种Kubernetes Pod资源在线预测，动态调整解决方案。整个方案包括数据收集整理、在线同步训练、在线预测执行三个阶段。首先对应用进行静态冗余配置部署，收集真实负载环境下应用的Cpu、Memory等历史数据，在线进行预训练得到较为可靠的预测模型；并对预测模型进行迭代训练，直到模型损失收敛或只在预设范围内震荡，则停止训练，并进入在线预测执行阶段，不断采集新的资源消耗时序数据，并输入预测模型中预测未来一小段时间内资源消耗的峰值，将该值作为资源限制值进行更新，实现动态的Pod资源限制调整。In view of the deficiencies of the prior art, the method for dynamic adjustment of Kubernetes cluster resources in the embodiment of the present application proposes a solution for online prediction and dynamic adjustment of Kubernetes Pod resources. The whole scheme includes three stages: data collection and arrangement, online synchronous training, and online prediction execution. First, deploy the application with static redundant configuration, collect historical data such as Cpu and Memory of the application under the real load environment, and perform online pre-training to obtain a more reliable prediction model; and iteratively train the prediction model until the model loss converges or only in the If it oscillates within the preset range, the training will be stopped, and the online prediction execution stage will be entered, and new resource consumption time series data will be continuously collected and input into the prediction model to predict the peak value of resource consumption in a short period of time in the future, and this value will be used as the resource limit value. Updated to implement dynamic Pod resource limit adjustment.

具体的，请参阅图1，是本申请实施例的Kubernetes集群资源动态调整方法的流程图。本申请实施例的Kubernetes集群资源动态调整方法包括以下步骤：Specifically, please refer to FIG. 1 , which is a flowchart of a method for dynamically adjusting Kubernetes cluster resources according to an embodiment of the present application. The method for dynamically adjusting Kubernetes cluster resources in this embodiment of the present application includes the following steps:

步骤100：数据收集整理：首先对应用进行静态冗余配置部署，收集真实负载环境下应用的Cpu、Memory等历史监控指标数据；Step 100: Data collection and arrangement: firstly, static redundant configuration deployment is performed on the application, and historical monitoring indicator data such as Cpu and Memory of the application under the real load environment are collected;

步骤100中，数据收集整理具体包括：In step 100, the data collection and arrangement specifically includes:

步骤101：通过集成在Kubelet中的cAdvisor对运行在Pod中的容器进行监控并收集各种监控指标数据；Step 101: Monitor the container running in the Pod through the cAdvisor integrated in the Kubelet and collect various monitoring indicator data;

本步骤中，Kubelet为运行在集群每个节点上的进程，用于处理Master节点下发到本节点的任务并负责管理Pod及Pod中的容器。cAdvisor集成于Kubelet组件内，Kubelet从cAdvisor获取单独的容器使用统计信息，然后通过REST API暴露这些聚合后的Pod资源使用的统计信息。cAdvisor是一个开源的分析容器资源使用率以及性能特性的代理工具，它会自动查找所在节点上的容器，收集每个容器的相关指标数据。本申请实施例中，cAdvisor收集的指标数据包括cpu利用率及限额、文件***读/写利用率及限额、网络报文发送/接收/丢弃率等数据中的至少一种或一种以上的组合。具体请一并参阅图2，是本申请实施例的Kubernetes集群资源动态调整***结构图。In this step, Kubelet is a process running on each node of the cluster, which is used to process the tasks sent by the Master node to this node and is responsible for managing the Pod and the containers in the Pod. cAdvisor is integrated into the Kubelet component. Kubelet obtains individual container usage statistics from cAdvisor, and then exposes these aggregated Pod resource usage statistics through the REST API. cAdvisor is an open source agent tool that analyzes container resource usage and performance characteristics. It automatically finds containers on the node where it is located and collects relevant indicator data for each container. In this embodiment of the present application, the indicator data collected by cAdvisor includes at least one or a combination of at least one or more of data such as CPU utilization and quota, file system read/write utilization and quota, and network packet sending/receiving/discarding rates. . For details, please refer to FIG. 2 , which is a structural diagram of a system for dynamic adjustment of Kubernetes cluster resources according to an embodiment of the present application.

步骤102：通过Prometheus将收集到的各种监控指标数据存入TSDB(时序数据库)；Step 102: Store the collected data of various monitoring indicators into TSDB (Time Series Database) through Prometheus;

本步骤中，Prometheus是一个开源的服务监控***和时序数据库，提供了通用的数据模型和便捷的数据采集、存储和查询接口。Prometheus的核心组件Prometheus Server(普罗米修斯服务器)定期从静态配置的监控目标或基于服务发现机制自动配置的目标中拉取监控指标数据。在本申请实施例中，Prometheus Server通过cAdvisor提供的metrics接口周期性的进行数据抓取操作，然后将抓取的数据以时间序列的方式存储到服务端内存缓存区或持久化到存储设备中。In this step, Prometheus is an open source service monitoring system and time series database, which provides a common data model and convenient data collection, storage and query interfaces. Prometheus Server, the core component of Prometheus, regularly pulls monitoring indicator data from statically configured monitoring targets or targets automatically configured based on the service discovery mechanism. In the embodiment of the present application, the Prometheus Server periodically performs data capture operations through the metrics interface provided by cAdvisor, and then stores the captured data in a server-side memory cache or persists to a storage device in a time-series manner.

步骤103：通过自定义组件Metrics Collector(度量收集器)定时从Prometheus中收集感兴趣的监控指标数据做二级缓存，并对监控指标数据进行整理，形成可以直接用于预测的资源消耗时序数据；Step 103: Collect the monitoring indicator data of interest from Prometheus periodically through the custom component Metrics Collector for secondary cache, and organize the monitoring indicator data to form resource consumption time series data that can be directly used for prediction;

本步骤中，自定义组件Metrics Collector通过查询语言PromQL定时查询Prometheus Server的时序数据库得到感兴趣的指标数据，并对指标进行缓存及整理，形成可以直接用于预测的资源消耗时序数据，减轻TSDB的压力，同时也保证时序数据的实时性，便于Kalman滤波器进行模型训练和估计。In this step, the custom component Metrics Collector regularly queries the time series database of Prometheus Server through the query language PromQL to obtain the indicator data of interest, and caches and organizes the indicators to form resource consumption time series data that can be directly used for prediction, reducing the burden of TSDB. It also ensures the real-time performance of time series data, which is convenient for Kalman filter to perform model training and estimation.

本申请实施例中，在对时序数据进行处理时，使用滑动窗口依次滑动每一条数据，并选择窗口中的最大值作为新的数据集，使得滤波器预测出的值在未来一段时间内，资源消耗都不会超过预测值。In the embodiment of the present application, when processing time series data, a sliding window is used to slide each piece of data in turn, and the maximum value in the window is selected as a new data set, so that the value predicted by the filter will be more efficient in a future period of time. consumption will not exceed the forecast value.

步骤200：在线同步训练：构建LSTM-Kalman滤波器，将Metrics Collector收集到的资源消耗时序数据输入LSTM-Kalman滤波器中进行预训练，得到预测模型LSTM-KF；Step 200: Online synchronous training: construct an LSTM-Kalman filter, input the resource consumption time series data collected by the Metrics Collector into the LSTM-Kalman filter for pre-training, and obtain a prediction model LSTM-KF;

步骤200中，Kalman滤波器是一个最优化自回归数据处理算法，Kalman利用目标的动态信息，设法去掉噪声的影响，得到一个关于目标最优估计。考虑到***的振荡，应用间的相互干扰等因素造成采集的数据不准，因此利用Kalman滤波器规避掉这些不稳定因素，得到较好的估计值。同时模型仅仅通过5步递推就可以实现对新值的最优估计，但是现有的Kalman滤波器在实际应用中需要依赖动力学模型。In step 200, the Kalman filter is an optimal autoregressive data processing algorithm. Kalman uses the dynamic information of the target to try to remove the influence of noise, and obtain an optimal estimate about the target. Considering the system oscillation, mutual interference between applications and other factors, the collected data is inaccurate. Therefore, the Kalman filter is used to avoid these unstable factors and obtain a better estimated value. At the same time, the model can achieve the optimal estimation of the new value through only 5 steps of recursion, but the existing Kalman filter needs to rely on the dynamic model in practical application.

Kalman的空间状态方程可以使用一个线性随机差分方程描述：Kalman's spatial state equation can be described using a linear stochastic difference equation:

X(k)＝AX(k-1)+W(k)，W～N(0,Q)(1)X(k)=AX(k-1)+W(k), W～N(0,Q)(1)

Z(k)＝HX(k)+V(k)，V～N(0,R)(2)Z(k)=HX(k)+V(k), V～N(0,R)(2)

式(1)代表状态方程，式(2)代表测量方程。其中X(k)为k时刻的***状态，Z(k)为k时刻的观测值，A为状态转移矩阵，H为观测矩阵，W(k)和V(k)分别表示过程和测量噪声，假设服从高斯白噪声，它们的协方差分别为Q，R。Equation (1) represents the state equation, and equation (2) represents the measurement equation. where X(k) is the system state at time k, Z(k) is the observed value at time k, A is the state transition matrix, H is the observation matrix, W(k) and V(k) represent the process and measurement noise, respectively, Assuming that they obey white Gaussian noise, their covariances are Q and R, respectively.

现有的Kalman滤波算法如下：The existing Kalman filtering algorithm is as follows:

利用***的当前状态模型预测***的下一状态；假设***的当前状态k，基于***的上一状态预测出当前状态：Use the current state model of the system to predict the next state of the system; assuming the current state k of the system, predict the current state based on the previous state of the system:

X(k|k-1)＝A X(k-1|k-1)(3)X(k|k-1)=A X(k-1|k-1)(3)

P(k|k-1)＝A P(k-1|k-1)AT+Q(4)P(k|k-1)=A P(k-1|k-1)AT+Q(4)

式(3)、(4)对应了***预测过程，得到k状态下的预测值，然后结合测量值Z(t)，得到k状态下的最优估计值X(k|k)：Equations (3) and (4) correspond to the system prediction process, and the predicted value under k state is obtained, and then combined with the measured value Z(t), the optimal estimated value X(k|k) under k state is obtained:

X(k|k)＝X(k|k-1)+Kg(k)(Z(k)–HX(k|k-1)(5)X(k|k)=X(k|k-1)+Kg(k)(Z(k)–HX(k|k-1)(5)

式(5)中，Kg为Kalman Gain(增益)：In formula (5), Kg is Kalman Gain (gain):

Kg(k)＝P(k|k-1)HT/((H*P(k|k-1)H’+R)(6)Kg(k)=P(k|k-1)HT/((H*P(k|k-1)H’+R)(6)

为了使得滤波器不断迭代递推下去，还需要更新X(k|k)状态下的协方差：In order to make the filter iterative and recursive, it is also necessary to update the covariance in the state of X(k|k):

P(k|k)＝(I-Kg(k)H)P(k|k-1)(7)P(k|k)=(I-Kg(k)H)P(k|k-1)(7)

式(7)中，I是全1的矩阵。当进入k+1状态时，P(k|k)即为式(4)中的P(k|k-1)。In Equation (7), I is a matrix of all 1s. When entering the k+1 state, P(k|k) is P(k|k-1) in formula (4).

上述算法非常依赖动力学模型，然而对于实际问题，动力学特性并不知道，无法先验的给出状态转移矩阵A和观测矩阵H，以及噪声分也是常常假设服从高斯分布。这些参数的假设存在很多人为主观因素，使得模型和实际数据不能很好的匹配，导致较大的预测误差。The above algorithm is very dependent on the dynamic model, but for practical problems, the dynamic characteristics are not known, the state transition matrix A and the observation matrix H cannot be given a priori, and the noise score is often assumed to obey the Gaussian distribution. There are many human subjective factors in the assumptions of these parameters, which make the model and actual data not well matched, resulting in large prediction errors.

因此，本申请通过使用LSTM对Kalman滤波器进行改造，使其摆脱对动力学模型的过分依赖，使得模型参数不需要先验的给出。具体改造过程如下：Therefore, this application transforms the Kalman filter by using LSTM to get rid of the excessive dependence on the dynamic model, so that the model parameters do not need to be given a priori. The specific transformation process is as follows:

给出新的状态空间方程：Give the new state space equation:

X(K)＝f(X(k-1))+W(k)，W～N(0,Q)(8)X(K)=f(X(k-1))+W(k), W～N(0,Q)(8)

Z(k)＝X(k)+V(k)，V～N(0,R)(9)Z(k)=X(k)+V(k), V～N(0,R)(9)

式(9)中，f为由LSTM _f模型生成的函数模型。 In formula (9), f is the function model generated _{by the LSTM f model.}

新的预测步骤为：The new prediction steps are:

X(k|k-1)＝f(X(k-1|k-1))(10)X(k|k-1)=f(X(k-1|k-1))(10)

P(k|k-1)＝F P(k-1|k-1)FT+Q(k)(11)P(k|k-1)=F P(k-1|k-1)FT+Q(k)(11)

式(11)中，F代表f关于X(k-1|k-1)的雅可比矩阵，Q(k)由LSTM _Q给出。 In Equation (11), F represents the Jacobian matrix of f with respect to X(k-1|k-1), and Q(k) is given _{by LSTM Q.}

新的更新步骤为：The new update steps are:

Kg(k)＝P(k|k-1)((P(k|k-1)+R(k))-1(12)Kg(k)=P(k|k-1)((P(k|k-1)+R(k))-1(12)

X(k|k)＝X(k|k-1)+Kg(k)(Z(k)–X(k|k-1)(13)X(k|k)=X(k|k-1)+Kg(k)(Z(k)–X(k|k-1)(13)

P(k|k)＝(I-Kg(k))P(k|k-1)(14)P(k|k)=(I-Kg(k))P(k|k-1)(14)

式(12)、(13)中，R(k)由LSTM _r得到，Z(k)代表k时刻的观测值。 In equations (12) and (13), R(k) is _{obtained from LSTM r} , and Z(k) represents the observation value at time k.

如图3所示，为经LSTM改造后的Kalman滤波器示意图。初始化时，可以将配置文件中的request值作为X(k-1|k-1)，当前时刻k采集到的指标数据作为观测值Z(k|k)，分别经过LSTM _f和LSTM _r网络得到k时刻的预测值X(k|k-1)和观测噪声R(k)。X(k|k-1)经过LSTM _Q得到k时刻的状态噪声Q(k)。利用R(k)、Q(k)、X(k-1|k-1)，通过式(12)、(13)、(14)对Kalman滤波器进行递推更新。 As shown in Figure 3, it is a schematic diagram of the Kalman filter transformed by LSTM. During initialization, the request value in the configuration file can be taken as X(k-1|k-1), and the indicator data collected at the current moment k can be taken as the observation value Z(k|k), which are obtained through the LSTM _f and LSTM _r networks respectively. The predicted value X(k|k-1) and the observation noise R(k) at time k. X(k|k-1) obtains the state noise Q(k) at time k _{through LSTM Q.} Using R(k), Q(k), and X(k-1|k-1), the Kalman filter is recursively updated through equations (12), (13), and (14).

基于上述，经过LSTM改造后的Kalman滤波器兼顾预测和滤噪的同时，摆脱了对动力学模型的过分依赖，模型参数不用先验的给出，可以从数据中学习出来，充分发挥时间序列前后相关性；并且增强了长程预测能力，而不仅仅依赖于上一个时刻值，预测精度可以显著提高。Based on the above, the Kalman filter transformed by LSTM takes into account both prediction and noise filtering, and at the same time gets rid of the excessive dependence on the dynamic model. The model parameters do not need to be given a priori, but can be learned from the data, giving full play to the time series before and after Correlation; and enhanced long-range prediction ability, instead of relying only on the last moment value, the prediction accuracy can be significantly improved.

步骤300：模型评估：利用新的监控指标数据对预测模型进行迭代训练，以对预测模型进行更新修正，直到模型损失收敛或只在预设范围内震荡，得到训练好的预测模型LSTM-KF；Step 300: Model evaluation: use the new monitoring index data to iteratively train the prediction model to update and correct the prediction model, until the model loss converges or only oscillates within a preset range, and the trained prediction model LSTM-KF is obtained;

步骤300中，模型评估过程具体为：通过Controller拉取Metrics Collector设定时间段内的监控指标数据，并对预测模型的预测结果和从Metrics Collector上报的监控指标数据进行比较，判断是否需要再次进行模型训练：如果设定阈值的时间消耗值在预测模型的最优估计值以下则不再更新模型，反之则再次使用新的监控指标数据更新预测模型参数，对模型进行修正。Controller中还备份有一个历史数据峰值，以备滤波器再次训练时，临时作为容器资源限制使用。In step 300, the model evaluation process is specifically as follows: pulling the monitoring indicator data within the time period set by the Metrics Collector through the Controller, and comparing the prediction result of the prediction model with the monitoring indicator data reported from the Metrics Collector, and judging whether it needs to be performed again. Model training: If the time consumption value of the set threshold is below the optimal estimated value of the prediction model, the model will not be updated. Otherwise, the new monitoring index data will be used to update the parameters of the prediction model to correct the model. A historical data peak is also backed up in the Controller, which is temporarily used as a container resource limit when the filter is retrained.

步骤400：进入在线预测执行阶段：不断收集新的资源消耗时序数据，并输入训练好的预测模型中预测未来预设时间段内的资源消耗峰值，并将该值作为资源限制值进行更新；Step 400: Enter the online prediction execution stage: continuously collect new resource consumption time series data, and input the trained prediction model to predict the resource consumption peak value within a preset time period in the future, and update the value as a resource limit value;

步骤400中，待LSTM _f、LSTM _r、LSTM _Q三个子循环网络损失收敛或者降低到预设阈值，开始执行递推估计，Metrics Collector上报数据给预测模型LSTM-KF，预测模型LSTM-KF对接下来一小段时间的资源消耗峰值进行最优估计，Executor直接使用最优估计值对Pod中容器所在的Container Cgroup的限制值进行更新操作。本申请采用最优估计值代表未来一小段时间峰值，即使用局部峰值代替全局峰值，即可避免冗余配置，又可以保证服务的稳定性，提高了资源利用率，起到了自适应的效果。 In step 400, after the _{three sub-loop network losses of LSTM f} , LSTM _r and LSTM _Q converge or decrease to a preset threshold, the recursive estimation is started, and the Metrics Collector reports the data to the prediction model LSTM-KF, and the prediction model LSTM-KF is responsible for the next step. The peak resource consumption is estimated for a short period of time, and the Executor directly uses the optimal estimated value to update the limit value of the Container Cgroup where the container in the Pod is located. In this application, the optimal estimated value is used to represent the peak value in a short period of time in the future. Even if the local peak value is used instead of the global peak value, redundant configuration can be avoided, the stability of the service can be ensured, the resource utilization rate can be improved, and the effect of self-adaptation can be achieved.

Cgroup是Linux内核中用于实现资源使用和统计的模块，Docker利用Cgroup进行资源隔离及限制，Kubelet作为Kubernetes的节点代理，所有对Cgroup的操作都是由内部的Container Manager模块实现，Container Manager会通过Cgroup对资源使用层层限制。当通过yaml等配置文件对容器资源请求和限制量指定时，Docker会为容器设置运行时的cpu.share、cpu.quota、cpu.period、mem.limit等指标，Executor通过修改这些指标文件进行资源更新，从而实现动态的Pod资源限制调整。Cgroup is a module used to implement resource usage and statistics in the Linux kernel. Docker uses Cgroup to isolate and limit resources. Kubelet acts as a node agent of Kubernetes. All operations on Cgroup are implemented by the internal Container Manager module. Container Manager will pass Cgroups limit the use of resources layer by layer. When the container resource request and limit are specified through configuration files such as yaml, Docker will set the runtime cpu.share, cpu.quota, cpu.period, mem.limit and other indicators for the container, and Executor will modify these indicator files for resource utilization. Update to achieve dynamic Pod resource limit adjustment.

本申请实施例中，由于各组件之间存在频繁的调用关系，为了部署方便，可以借助Sidecar(一种将应用功能从应用本身剥离出来作为单独进程的方式。做到无侵入的向应用添加多种功能)思想将组件容器化，部署在同一个Pod内，便于运维管理。In the embodiment of the present application, due to the frequent invocation relationship between the components, in order to facilitate deployment, the application function can be separated from the application itself as a separate process with the help of Sidecar (a method of stripping the application function from the application itself. It is possible to add multiple The idea of containing components and deploying them in the same Pod is convenient for operation and maintenance management.

图4是本申请实施例提供的Kubernetes集群资源动态调整方法的硬件设备结构示意图。如图4所示，该设备包括一个或多个处理器以及存储器。以一个处理器为例，该设备还可以包括：输入***和输出***。FIG. 4 is a schematic diagram of a hardware device structure of a method for dynamically adjusting Kubernetes cluster resources provided by an embodiment of the present application. As shown in Figure 4, the device includes one or more processors and memory. Taking a processor as an example, the device may further include: an input system and an output system.

处理器、存储器、输入***和输出***可以通过总线或者其他方式连接，图4中以通过总线连接为例。The processor, the memory, the input system, and the output system may be connected by a bus or in other ways, and the connection by a bus is taken as an example in FIG. 4 .

存储器作为一种非暂态计算机可读存储介质，可用于存储非暂态软件程序、非暂态计算机可执行程序以及模块。处理器通过运行存储在存储器中的非暂态软件程序、指令以及模块，从而执行电子设备的各种功能应用以及数据处理，即实现上述方法实施例的处理方法。As a non-transitory computer-readable storage medium, the memory can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules. The processor executes various functional applications and data processing of the electronic device by running the non-transitory software programs, instructions and modules stored in the memory, that is, the processing method of the above method embodiment is implemented.

存储器可以包括存储程序区和存储数据区，其中，存储程序区可存储操作***、至少一个功能所需要的应用程序；存储数据区可存储数据等。此外，存储器可以包括高速随机存取存储器，还可以包括非暂态存储器，例如至少一个磁盘存储器件、闪存器件、或其他非暂态固态存储器件。在一些实施例中，存储器可选包括相对于处理器远程设置的存储器，这些远程存储器可以通过网络连接至处理***。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。The memory may include a stored program area and a stored data area, wherein the stored program area can store an operating system and an application program required by at least one function; the stored data area can store data and the like. Additionally, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory may optionally include memory located remotely from the processor, which may be connected to the processing system via a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

输入***可接收输入的数字或字符信息，以及产生信号输入。输出***可包括显示屏等显示设备。The input system can receive input numerical or character information and generate signal input. The output system may include a display device such as a display screen.

所述一个或者多个模块存储在所述存储器中，当被所述一个或者多个处理器执行时，执行上述任一方法实施例的以下操作：The one or more modules are stored in the memory, and when executed by the one or more processors, perform the following operations of any of the foregoing method embodiments:

上述产品可执行本申请实施例所提供的方法，具备执行方法相应的功能模块和有益效果。未在本实施例中详尽描述的技术细节，可参见本申请实施例提供的方法。The above product can execute the method provided by the embodiments of the present application, and has functional modules and beneficial effects corresponding to the execution method. For technical details not described in detail in this embodiment, reference may be made to the method provided in this embodiment of the present application.

本申请实施例提供了一种非暂态(非易失性)计算机存储介质，所述计算机存储介质存储有计算机可执行指令，该计算机可执行指令可执行以下操作：An embodiment of the present application provides a non-transitory (non-volatile) computer storage medium, where the computer storage medium stores computer-executable instructions, and the computer-executable instructions can perform the following operations:

本申请实施例提供了一种计算机程序产品，所述计算机程序产品包括存储在非暂态计算机可读存储介质上的计算机程序，所述计算机程序包括程序指令，当所述程序指令被计算机执行时，使所述计算机执行以下操作：An embodiment of the present application provides a computer program product, the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer , which causes the computer to do the following:

本申请实施例的Kubernetes集群资源动态调整方法及电子设备首先对应用进行静态冗余配置部署，收集真实负载环境下应用的历史监控指标数据，在线进行预训练得到较为可靠的预测模型；然后对预测模型进行迭代训练，直到模型损失收敛或只在预设范围内震荡后，不断采集新的资源消耗时序数据输入预测模型中预测未来一小段时间内资源消耗的峰值，将该值作为资源限制值进行更新，实现动态的Pod资源限制调整。相对于现有技术，本申请至少具有以下有益效果：The method for dynamic adjustment of Kubernetes cluster resources and the electronic device according to the embodiments of the present application first perform static redundant configuration deployment on the application, collect historical monitoring indicator data of the application under a real load environment, and perform online pre-training to obtain a more reliable prediction model; The model is iteratively trained until the loss of the model converges or only oscillates within a preset range, then continuously collect new resource consumption time series data and input it into the prediction model to predict the peak resource consumption in a short period of time in the future, and use this value as the resource limit value. Updated to implement dynamic Pod resource limit adjustment. Compared with the prior art, the present application at least has the following beneficial effects:

4、通过预测未来一段时间资源消耗峰值作为资源限制值，对资源进行动态限制，采用局部最大值代替全局最大值的配置思想，提高了资源利用率，起到了自适应的效果。4. By predicting the peak value of resource consumption in the future as the resource limit value, the resource is dynamically limited, and the local maximum value is used instead of the global maximum value, which improves the resource utilization rate and has an adaptive effect.

对所公开的实施例的上述说明，使本领域专业技术人员能够实现或使用本申请。对这些实施例的多种修改对本领域的专业技术人员来说将是显而易见的，本申请中所定义的一般原理可以在不脱离本申请的精神或范围的情况下，在其它实施例中实现。因此，本申请将不会被限制于本申请所示的这些实施例，而是要符合与本申请所公开的原理和新颖特点相一致的最宽的范围。The above description of the disclosed embodiments enables any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined in this application may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

一种Kubernetes集群资源动态调整方法，其特征在于，包括以下步骤：A method for dynamically adjusting Kubernetes cluster resources, comprising the following steps:

步骤a：收集应用中的历史监控指标数据，根据所述监控指标数据得到资源消耗时序数据；Step a: collect historical monitoring indicator data in the application, and obtain resource consumption time series data according to the monitoring indicator data;

步骤b：构建LSTM-Kalman滤波器，将所述资源消耗时序数据输入所述LSTM-Kalman进行预训练，得到预测模型；其中，所述LSTM-Kalman滤波器为利用LSTM改造后的Kalman滤波器；Step b: constructing an LSTM-Kalman filter, and inputting the resource consumption time series data into the LSTM-Kalman for pre-training to obtain a prediction model; wherein, the LSTM-Kalman filter is a Kalman filter transformed by LSTM;

步骤c：收集新的资源消耗时序数据，将所述新的资源消耗时序数据输入所述预测模型中预测未来预设时间段内的资源消耗峰值，将所述资源消耗峰值作为资源限制值进行资源更新。Step c: Collect new resource consumption time series data, input the new resource consumption time series data into the prediction model to predict the resource consumption peak value within a preset time period in the future, and use the resource consumption peak value as a resource limit value for resource consumption. renew.
根据权利要求1所述的Kubernetes集群资源动态调整方法，其特征在于，所述步骤a中，所述收集应用中的历史监控指标数据包括：The method for dynamically adjusting Kubernetes cluster resources according to claim 1, wherein in the step a, the collection of historical monitoring indicator data in the application comprises:

对运行在Pod中的容器进行监控并收集各种监控指标数据；收集的监控指标数据包括cpu利用率及限额、文件***读/写利用率及限额、网络报文发送/接收/丢弃率数据中的至少一种或一种以上的组合；Monitor the containers running in the Pod and collect various monitoring indicator data; the collected monitoring indicator data includes cpu utilization and quota, file system read/write utilization and quota, network packet sending/receiving/discarding rate data at least one or a combination of more than one;

将收集到的监控指标数据存入时序数据库。Store the collected monitoring indicator data in the time series database.
根据权利要求2所述的Kubernetes集群资源动态调整方法，其特征在于，在所述步骤a中，所述根据所述监控指标数据得到资源消耗时序数据包括：The method for dynamically adjusting Kubernetes cluster resources according to claim 2, wherein, in the step a, the obtaining the resource consumption time series data according to the monitoring indicator data comprises:

定时查询所述时序数据库得到感兴趣的指标数据，并对所述感兴趣的指标数据进行缓存及整理，形成可以直接用于预测的资源消耗时序数据。The time series database is regularly queried to obtain interesting indicator data, and the interesting indicator data is cached and sorted to form resource consumption time series data that can be directly used for prediction.
根据权利要求3所述的Kubernetes集群资源动态调整方法，其特征在于，所述对所述感兴趣的指标数据进行缓存及整理包括：The method for dynamically adjusting Kubernetes cluster resources according to claim 3, wherein the caching and sorting of the interested indicator data comprises:

使用滑动窗口依次滑动每一条指标数据，并选择窗口中的最大值作为新的数据集。Use the sliding window to slide each indicator data in turn, and select the maximum value in the window as the new data set.
根据权利要求1所述的Kubernetes集群资源动态调整方法，其特征在于，在所述步骤b中，所述构建LSTM-Kalman滤波器包括：The method for dynamically adjusting Kubernetes cluster resources according to claim 1, wherein in the step b, the building an LSTM-Kalman filter comprises:

假设观测值是***状态值的有噪估计，令观测矩阵H＝I：Assuming that the observations are noisy estimates of the state values of the system, let the observation matrix H = I:

X(k)＝f(X(k-1))+W(k)，W～N(0,Q)X(k)=f(X(k-1))+W(k), W～N(0,Q)

Z(k)＝X(k)+V(k)，V～N(0,R)Z(k)=X(k)+V(k), V～N(0,R)

上式中，X(k)为k时刻的***状态，Z(k)为k时刻的观测值，W(k)和V(k)分别表示过程和测量噪声，它们的协方差分别为Q，R；f为由LSTM _f模型生成的函数模型； In the above formula, X(k) is the system state at time k, Z(k) is the observed value at time k, W(k) and V(k) represent the process and measurement noise, respectively, and their covariances are Q, respectively, R; f is the function model generated _{by the LSTM f model;}

预测步骤为：The prediction steps are:

X(k|k-1)＝f(X(k-1|k-1))X(k|k-1)=f(X(k-1|k-1))

P(k|k-1)＝F P(k-1|k-1)FT+Q(k)P(k|k-1)=F P(k-1|k-1)FT+Q(k)

上式中，F代表f关于X(k-1|k-1)的雅可比矩阵，Q(k)由LSTM _Q给出； In the above formula, F represents the Jacobian matrix of f with respect to X(k-1|k-1), and Q(k) is given _{by LSTM Q;}

更新步骤为：The update steps are:

Kg(k)＝P(k|k-1)((P(k|k-1)+R(k))-1Kg(k)=P(k|k-1)((P(k|k-1)+R(k))-1

X(k|k)＝X(k|k-1)+Kg(k)(Z(k)–X(k|k-1)X(k|k)=X(k|k-1)+Kg(k)(Z(k)–X(k|k-1)

P(k|k)＝(I-Kg(k))P(k|k-1)P(k|k)=(I-Kg(k))P(k|k-1)

上式中，R(k)由LSTM _r得到，Z(k)代表k时刻的观测值； In the above formula, R(k) is _{obtained from LSTM r} , and Z(k) represents the observation value at time k;

将配置文件中的request值作为X(k-1|k-1)，当前时刻k采集到的指标数据作为观测值Z(k|k)，分别经过所述LSTM _f和LSTM _r网络得到k时刻的预测值X(k|k-1)和观测噪声R(k)；X(k|k-1)经过LSTM _Q得到k时刻的状态噪声Q(k)；利用R(k)、Q(k)、X(k-1|k-1)对所述Kalman滤波器进行递推更新。 The request value in the configuration file is taken as X(k-1|k-1), the index data collected at the current moment k is taken as the observation value Z(k|k), and the time k is obtained _{through the LSTM f} _{and LSTM r networks respectively} The predicted value X(k|k-1) and observation noise R(k) of ; X(k|k-1) obtains the state noise Q(k) at time k _{through LSTM Q; using R(k), Q(k} ), X(k-1|k-1) to recursively update the Kalman filter.
根据权利要求1至5任一项所述的Kubernetes集群资源动态调整方法，其特征在于，在所述步骤c中，所述预将所述资源消耗时序数据输入所述LSTM-Kalman进行预训练，得到预测模型还包括：The method for dynamically adjusting Kubernetes cluster resources according to any one of claims 1 to 5, wherein in the step c, the pre-inputting the resource consumption time series data into the LSTM-Kalman for pre-training, The resulting predictive model also includes:

模型评估：利用新的监控指标数据对所述预测模型进行迭代训练，直到所述预测模型损失收敛或只在预设范围内震荡时停止迭代训练。Model evaluation: Iterative training is performed on the prediction model using new monitoring index data, and the iterative training is stopped until the loss of the prediction model converges or only oscillates within a preset range.
根据权利要求6所述的Kubernetes集群资源动态调整方法，其特征在于，所述模型评估还包括：The method for dynamically adjusting Kubernetes cluster resources according to claim 6, wherein the model evaluation further comprises:

获取设定时间段内的新的监控指标数据，将所述预测模型的预测结果与所述新的监控指标数据进行比较，判断是否需要再次进行模型训练：如果大于设定阈值的时间消耗值在所述预测模型的预测结果以下则不再更新模型，反之则使用所述新的的监控指标数据更新所述预测模型的参数，对所述预测模型进行修正。Obtain the new monitoring indicator data within the set time period, compare the prediction result of the prediction model with the new monitoring indicator data, and judge whether it is necessary to perform model training again: if the time consumption value greater than the set threshold is in If the prediction result of the prediction model is below, the model will not be updated, otherwise, the parameters of the prediction model will be updated by using the new monitoring index data, and the prediction model will be revised.
根据权利要求5所述的Kubernetes集群资源动态调整方法，其特征在于，在所述步骤c中，所述预测未来预设时间段内的资源消耗峰值还包括：The method for dynamically adjusting Kubernetes cluster resources according to claim 5, wherein in the step c, the predicting the peak resource consumption within a preset time period in the future further comprises:

待所述LSTM _f、LSTM _r、LSTM _Q三个子循环网络损失收敛或者降低到预设阈值，开始执行递推估计，通过度量收集器上报数据给预测模型，所述预测模型对未来预设时间段内的资源消耗峰值进行最优估计，使用最优估计值对Pod中容器资源进行更新操作。 After the _{three sub-loop network losses of the LSTM f} , LSTM _r and LSTM _Q converge or decrease to a preset threshold, start to perform recursive estimation, and report data to the prediction model through the metric collector, and the prediction model will be used for the future preset time period. The resource consumption peak in the Pod is optimally estimated, and the optimal estimated value is used to update the container resources in the Pod.
一种电子设备，包括：An electronic device comprising:

至少一个处理器；以及at least one processor; and

与所述至少一个处理器通信连接的存储器；其中，a memory communicatively coupled to the at least one processor; wherein,

所述存储器存储有可被所述一个处理器执行的指令，所述指令被所述至少一个处理器执行，以使所述至少一个处理器能够执行上述1至8任一项所述的Kubernetes集群资源动态调整方法的以下操作：The memory stores instructions executable by the one processor, the instructions are executed by the at least one processor, so that the at least one processor can execute the Kubernetes cluster described in any one of the above 1 to 8 The following operations of the resource dynamic adjustment method:

步骤a：收集应用中的历史监控指标数据，根据所述监控指标数据得到资源消耗时序数据；Step a: collect historical monitoring indicator data in the application, and obtain resource consumption time series data according to the monitoring indicator data;

步骤b：构建LSTM-Kalman滤波器，将所述资源消耗时序数据输入所述LSTM-Kalman进行预训练，得到预测模型；其中，所述LSTM-Kalman滤波器为利用LSTM改造后的Kalman滤波器；Step b: constructing an LSTM-Kalman filter, and inputting the resource consumption time series data into the LSTM-Kalman for pre-training to obtain a prediction model; wherein, the LSTM-Kalman filter is a Kalman filter transformed by LSTM;

步骤c：收集新的资源消耗时序数据，将所述新的资源消耗时序数据输入所述预测模型中预测未来预设时间段内的资源消耗峰值，将所述资源消耗峰值作为资源限制值进行资源更新。Step c: Collect new resource consumption time series data, input the new resource consumption time series data into the prediction model to predict the resource consumption peak value within a preset time period in the future, and use the resource consumption peak value as a resource limit value for resource consumption. renew.