CN117873999A - Adaptive database optimization method based on deep reinforcement learning - Google Patents
- Publication number
- CN117873999A (application CN202311714897.XA)
- Authority
- CN
- China
- Prior art keywords
- database
- network
- reinforcement learning
- strategy
- deep reinforcement
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to an adaptive database optimization method based on deep reinforcement learning, belonging to the technical field of database management and comprising the following steps: performing data collection and preprocessing; performing state representation and action space definition; constructing a reinforcement learning model; designing a reward function; performing reinforcement learning training and strategy optimization; making real-time decisions and optimizations; and performing performance evaluation and optimization. The invention designs an adaptive database tuning method by combining a deep reinforcement learning algorithm with the characteristics of the database tuning problem. Through interactive learning between an agent and its environment, deep reinforcement learning tunes the database according to real-time feedback, adapts to continuously changing database environments, and improves the response speed and resource utilization efficiency of the database. The method can dynamically adjust database parameters and configuration according to the real-time state and performance indicators of the database, thereby improving database performance and query efficiency.
Description
Technical Field
The invention belongs to the technical field of database management, and particularly relates to an adaptive database optimization method based on deep reinforcement learning.
Background
In the current information age, database systems are widely used across application domains such as enterprise management, e-commerce, and data analysis. The performance and efficiency of the database are critical to ensuring smooth system operation and a good user experience. A conventional DBMS (database management system) typically relies on rule-based and experience-based optimization, which depends on manual intervention and the expertise of DBAs (database administrators); such methods are generally inefficient, poorly adaptive, and cannot be applied across diverse databases.
With the rise of machine learning and artificial intelligence, automatic database tuning has attracted much attention. However, solutions based on traditional machine learning require a high-quality training set, which is difficult to obtain in complex database systems; they also exhibit poor adaptivity and weak learning ability, leaving the performance and efficiency of database systems low.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide an adaptive database optimization method based on deep reinforcement learning, which can automatically adjust database parameters according to dynamically changing workloads and system requirements, so that the database system can adapt to different environments and load changes.
The invention provides an adaptive database tuning method based on deep reinforcement learning, comprising the following steps:
s1, collecting data in a database, and preprocessing the collected data;
s2, taking data in a database as a state identifier, and defining an action space of a tuning strategy;
s3, constructing a deep reinforcement learning model through a deep neural network;
s4, designing a reward function, evaluating the effect of each action through the reward function, and updating a deep reinforcement learning model according to feedback of the reward function;
s5, training the deep reinforcement learning model through data in a database, and optimizing a decision strategy of the model through environment interaction;
s6, when the database runs in real time, according to the current working load condition and the system state, decision making and optimization are carried out through a trained reinforcement learning model, and configuration parameters and optimization strategies of the database are automatically adjusted;
s7, evaluating and optimizing the performance of the database regularly, collecting new data and updating the model.
Further, in S1, the data includes historical workload data, system performance indicators, and configuration parameters.
Further, in the step S1, the preprocessing includes data cleaning, feature extraction and normalization processing.
Further, in S2, the defining the action space of the tuning policy includes adjusting parameters of the query optimizer and adjusting a cache size.
Further, the deep neural network includes two policy networks and two value function networks.
Further, the historical workload data includes query requests, data access patterns, number of concurrent users, and database load periods.
Further, the system performance indexes comprise response time, throughput, concurrent connection number, CPU utilization rate, memory utilization rate and disk IO speed.
Further, the configuration parameters include a query optimizer parameter, a cache size, a concurrent connection number limit, a memory allocation policy, and a disk storage parameter.
Further, in S4, a specific formula of the reward function is:
where O represents throughput, D represents delay, O_b and D_b represent the corresponding baseline values, α+β=1, α is the weight of database throughput in system performance, and β is the weight of database delay in system performance.
Further, the training of the deep reinforcement learning model through the data in the database and the decision strategy of the optimization model through the environment interaction specifically comprise the following steps:
s51: initializing the parameters of the main strategy network, the target strategy network, the main value function network and the target value function network, and creating an experience replay buffer for storing experience samples;
s52: interacting with the environment, and collecting sample data;
s53: randomly sampling a batch of experience samples from the experience replay buffer, and updating the main strategy network, the target strategy network, the main value function network and the target value function network;
s54: updating the main strategy network and the target strategy network, and calculating loss functions of the main strategy network and the target strategy network;
s55: updating the main value function network and the target value function network, and calculating the loss functions of the main value function network and the target value function network;
s56: soft-updating the target networks, so that the parameters of the target strategy network and the target value function network slowly approach the parameters of the main strategy network and the main value function network;
s57: S52-S56 are repeated until the network model converges.
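The S51–S57 loop mirrors a DDPG/TD3-style actor-critic procedure (two policy networks, two value function networks, a replay buffer, soft target updates), though the patent does not name a specific algorithm. A toy pure-Python skeleton of the control flow — the one-parameter "networks", the toy environment, and the update rules are illustrative stand-ins, not the patented implementation:

```python
import random

def make_net():
    # stand-in for a deep neural network: a single scalar parameter
    return {"w": random.uniform(-1.0, 1.0)}

def soft_update(target, main, tau=0.005):
    # S56: theta' <- tau * theta + (1 - tau) * theta'
    target["w"] = tau * main["w"] + (1 - tau) * target["w"]

def train(episodes=50, batch_size=8, seed=0):
    random.seed(seed)
    # S51: initialize main/target policy and value networks and a replay buffer
    actor, actor_tgt = make_net(), make_net()
    critic, critic_tgt = make_net(), make_net()
    buffer = []
    for _ in range(episodes):
        # S52: interact with the environment, collect (s, a, r, s')
        s = random.random()
        a = actor["w"] * s + random.gauss(0.0, 0.1)   # noisy action
        r, s2 = -abs(a - 0.5), random.random()        # toy reward signal
        buffer.append((s, a, r, s2))
        if len(buffer) < batch_size:
            continue
        # S53: sample a random mini-batch from the replay buffer
        batch = random.sample(buffer, batch_size)
        # S54/S55: toy stand-ins for the actor/critic gradient updates
        critic["w"] += 0.01 * sum(r for _, _, r, _ in batch) / batch_size
        actor["w"] += 0.01 * critic["w"]
        # S56: soft-update the target networks
        soft_update(actor_tgt, actor)
        soft_update(critic_tgt, critic)
    # S57 would repeat until convergence; here we stop after a fixed budget
    return actor, actor_tgt

actor, actor_tgt = train()
```

With real networks, the toy update lines would be replaced by the gradient computations described in the detailed embodiments below; the control flow is unchanged.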
The invention has the following beneficial effects:
1. The invention designs an adaptive database tuning method by combining a deep reinforcement learning algorithm with the characteristics of the database tuning problem. Through interactive learning between an agent and its environment, deep reinforcement learning tunes the database according to real-time feedback, adapts to continuously changing database environments, and improves the response speed and resource utilization efficiency of the database.
2. The invention can dynamically adjust database parameters and configuration according to the real-time state and performance indicators of the database, improving database performance and query efficiency; through deep reinforcement learning training, the agent learns the optimal parameter combinations and strategies, realizing adaptive database tuning and improving user experience and system throughput.
3. The invention trains the deep reinforcement learning model with experience replay; by storing and reusing previous experience samples, it improves sample utilization and reduces correlation between samples, thereby accelerating training, improving model convergence, and providing a high-quality database tuning solution.
4. By introducing noise-based strategy exploration, the invention increases the exploration of the tuning algorithm, making it possible to find better strategies, avoid falling into local optima, and improve the exploration capability and performance of the algorithm.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views. It is apparent that the drawings in the following description are only some of the embodiments described in the embodiments of the present invention, and that other drawings may be obtained from these drawings by those of ordinary skill in the art.
FIG. 1 is a reinforcement learning scene description diagram of an embodiment of the present invention;
FIG. 2 is a flowchart of a reinforcement learning optimization method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an automatic tuning system according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the embodiments of the present invention better understood by those skilled in the art, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. It should be understood that the description is only illustrative and is not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, shall fall within the scope of the invention.
In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of methods and systems that are consistent with aspects of the invention as detailed in the accompanying claims.
The invention provides an adaptive database optimization method based on deep reinforcement learning, which addresses the poor adaptivity, weak learning ability, and low performance and efficiency of database systems under manual and traditional machine-learning methods.
Method embodiment
To assist the reader in understanding the present invention, the terms used herein are explained below.
Reinforcement learning: reinforcement learning is a machine learning method that aims at allowing agents to learn optimal behavior strategies through interactions with the environment to maximize jackpot. Unlike traditional supervised and unsupervised learning, the reinforcement learning agent does not have labeled input-output pairs or predefined categories, but rather learns by attempting and evaluating different actions.
DBA (Database Administrator): a professional responsible for managing and maintaining database systems, handling database installation, configuration, backup and restore, performance monitoring, tuning, and so on. A database administrator needs technical knowledge of database management and optimization to ensure the stability, availability, and performance of the database system; DBAs are typically also responsible for database security, backup and restore policies, and user rights management, and work closely with developers and system administrators to meet business and performance requirements.
A DBMS (Database Management System) is software for managing and organizing databases. It provides functions for creating, accessing, and maintaining databases, including defining database structures, data manipulation, querying and retrieval, data integrity and security, data backup and restore, concurrency control and transaction management, and database performance optimization and tuning. With a DBMS, users can efficiently manage and operate databases while ensuring the security, consistency, and reliability of data.
Database tuning: database tuning refers to the process of improving the performance and efficiency of a database by optimizing aspects of configuration, structure, query, and the like of the database. The goal of database tuning is to enable the database to respond to query requests faster, to utilize resources more efficiently, and to handle larger amounts of data and concurrent accesses, increasing the throughput of the database.
A reinforcement learning algorithm interacts with the environment through the agent's actions and uses the environment's feedback to guide its own updates toward maximizing the expected reward. In database tuning, this self-driven learning ability can be used to adaptively optimize configuration parameters and enhance the performance and throughput of the database.
In addition, in the present invention, as shown in fig. 1 to 3, the following embodiments are provided:
embodiment 1
In this embodiment, an adaptive database tuning method based on deep reinforcement learning is provided, and the method includes the following steps:
s1: performing data collection and preprocessing;
s2: performing state representation and action space definition;
s3: constructing a reinforcement learning model;
s4: designing a reward function;
s5: performing reinforcement learning training and strategy optimization;
s6: making real-time decisions and optimizing;
s7: performing performance evaluation and optimization.
In addition, in the present invention, regarding the above-described method, for further explanation of the present invention, the following embodiments are additionally provided:
embodiment 2
A self-adaptive database optimization method based on deep reinforcement learning comprises the following steps:
s1, collecting data in a database, preprocessing the collected data, wherein the data comprises historical workload data, system performance indexes and configuration parameters, the preprocessing comprises data cleaning, feature extraction and standardization processing, the historical workload data comprises a query request, a data access mode, the number of concurrent users and a database load time period, the system performance indexes comprise response time, throughput, the number of concurrent connections, CPU utilization, memory utilization and disk IO speed, and the configuration parameters comprise a query optimizer parameter, a cache size, a concurrent connection limit, a memory allocation strategy and a disk storage parameter;
s2, taking data in a database as a state identifier, defining an action space of a tuning strategy, wherein the definition of the action space of the tuning strategy comprises the steps of adjusting parameters of a query optimizer and adjusting the size of a cache;
s3, constructing a deep reinforcement learning model through a deep neural network, wherein the deep neural network comprises two strategy networks and two value function networks;
s4, designing a reward function, evaluating the effect of each action through the reward function, and updating a deep reinforcement learning model according to feedback of the reward function, wherein the concrete formula of the reward function is as follows:
where O represents throughput, D represents delay, O_b and D_b represent the corresponding baseline values, α+β=1, α is the weight of database throughput in system performance, and β is the weight of database delay in system performance;
s5, training the deep reinforcement learning model through data in a database, and optimizing a decision strategy of the model through environment interaction, wherein the method specifically comprises the following steps:
s51: initializing the parameters of the main strategy network, the target strategy network, the main value function network and the target value function network, and creating an experience replay buffer for storing experience samples;
s52: interacting with the environment, and collecting sample data;
s53: randomly sampling a batch of experience samples from the experience replay buffer, and updating the main strategy network, the target strategy network, the main value function network and the target value function network;
s54: updating the main strategy network and the target strategy network, and calculating loss functions of the main strategy network and the target strategy network;
s55: updating the main value function network and the target value function network, and calculating the loss functions of the main value function network and the target value function network;
s56: soft-updating the target networks, so that the parameters of the target strategy network and the target value function network slowly approach the parameters of the main strategy network and the main value function network;
s57: repeating S52-S56 until the network model converges;
s6, when the database runs in real time, according to the current working load condition and the system state, decision making and optimization are carried out through a trained reinforcement learning model, and configuration parameters and optimization strategies of the database are automatically adjusted;
s7, evaluating and optimizing the performance of the database regularly, collecting new data and updating the model.
In the present invention, in order to explain the above in detail, the following embodiments are provided:
embodiment 3
The specific flow of data collection and preprocessing is as follows:
Database-related working data is defined and collected: workload data { query request, data access pattern, number of concurrent users, database load period … }; database performance indicators { response time, throughput, number of concurrent connections, CPU utilization, memory utilization, disk IO speed }; and related configuration parameters { query optimizer parameters, cache size, concurrent connection number limit, memory allocation policy, disk storage parameters }. The collected data is then cleaned: deduplication reduces unnecessary interference with decision making, missing-value processing guarantees data integrity, and abnormal values are handled with an outlier detection method.
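The cleaning steps above (deduplication, missing-value handling, outlier detection) can be sketched as follows. The field names and the z-score threshold of 3 are assumptions for illustration; the patent names only the categories of data:

```python
import math

def clean(samples):
    # 1) deduplicate while preserving order
    seen, deduped = set(), []
    for s in samples:
        key = tuple(sorted(s.items()))
        if key not in seen:
            seen.add(key)
            deduped.append(s)
    # 2) drop records with missing values
    complete = [s for s in deduped if all(v is not None for v in s.values())]
    # 3) z-score outlier detection on response_time (threshold assumed to be 3)
    vals = [s["response_time"] for s in complete]
    mean = sum(vals) / len(vals)
    std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals)) or 1.0
    return [s for s in complete if abs(s["response_time"] - mean) / std <= 3.0]

inliers = [{"response_time": float(v), "cpu": 0.5} for v in range(10, 20)]
samples = inliers + [inliers[0].copy(),                    # duplicate
                     {"response_time": None, "cpu": 0.5},  # missing value
                     {"response_time": 500.0, "cpu": 0.5}] # outlier
print(len(clean(samples)))  # -> 10
```

A production pipeline would apply the same passes per metric and per workload window rather than to a single field.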
In the present invention, the specific flow of the state representation and the action space definition is as follows:
The overall load state of the database is defined as S, with state parameters P = {p_1, p_2, …, p_n} for state S, where n is the number of key configuration parameters. The action space is defined as A = {a_1, a_2, …, a_n}, representing parameter-adjustment actions. To reduce instability and inefficiency in training, some important samples are given greater weight by combining a prioritized experience replay strategy.
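"Greater weight for important samples" is commonly realized by sampling transitions with probability proportional to priority^α; a sketch under that assumption (the exponent α = 0.6 is illustrative, not from the patent):

```python
import random

def sample_indices(priorities, k, alpha=0.6, rng=random):
    # probability of drawing sample i: p_i^alpha / sum_j p_j^alpha
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    probs = [s / total for s in scaled]
    # draw k indices proportionally to probability (with replacement)
    return rng.choices(range(len(priorities)), weights=probs, k=k), probs

random.seed(0)
idx, probs = sample_indices([0.1, 0.1, 5.0, 0.1], k=1000)
# the high-priority transition dominates, but low-priority ones still appear
```

Setting α below 1 flattens the distribution, which is one way to preserve the sample diversity mentioned later in the TD-error discussion.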
In addition, in the present invention, regarding the reinforcement learning model construction described above, the specific flow thereof is as follows:
A reinforcement learning model is built using deep neural networks: two policy networks (Actor) and two value function networks (Critic) are defined. The Actor network learns a deterministic database tuning strategy; its input is a state s and its output is an action a. The target Actor network is a duplicate of the main Actor network and synchronizes its parameters with the main Actor network at fixed time intervals. The Critic network estimates the tuning action value function Q(s, a); its inputs are a state s and an action a, and its output is the corresponding Q value. The main Critic network computes the Q value; the target Critic network is a duplicate of the main Critic network and synchronizes its parameters with the main value function network at fixed time intervals. Fixing the target network parameters for a period of time makes the target value more stable, thereby improving training stability and convergence. The Actor network takes the received environment state as input, extracts features with multiple fully connected layers, and finally outputs an action. The Critic network takes the state and action as inputs, likewise extracts features with multiple fully connected layers, and finally outputs the value of the current action.
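A minimal stand-in for the Actor/Critic pairing described above, with a single linear layer in place of the "multiple fully connected layers"; purely illustrative of the main/target duplication, not the patented architecture:

```python
import math, random

class Actor:
    def __init__(self, n_state, n_action, rng):
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(n_state)]
                  for _ in range(n_action)]
    def act(self, state):
        # deterministic policy: a = tanh(W s), one output per action dimension
        return [math.tanh(sum(wij * sj for wij, sj in zip(row, state)))
                for row in self.w]
    def clone(self):
        # target network starts as an exact parameter copy of the main network
        tgt = Actor.__new__(Actor)
        tgt.w = [row[:] for row in self.w]
        return tgt

class Critic:
    def __init__(self, n_in, rng):
        self.w = [rng.uniform(-0.1, 0.1) for _ in range(n_in)]
    def q(self, state, action):
        x = state + action  # concatenate state and action as the input
        return sum(wi * xi for wi, xi in zip(self.w, x))

rng = random.Random(1)
actor = Actor(n_state=3, n_action=2, rng=rng)
target_actor = actor.clone()            # target Actor duplicates the main Actor
critic = Critic(n_in=5, rng=rng)        # 3 state dims + 2 action dims
a = actor.act([0.2, -0.1, 0.4])
qv = critic.q([0.2, -0.1, 0.4], a)
```

The tanh output keeps each action component in [−1, 1], which would then be rescaled to the legal range of the corresponding configuration parameter.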
In addition, in the present invention, regarding the above-mentioned bonus function design, the specific flow is as follows:
for training the agent, a database tuning reward function is defined that has practical significance. The throughput and delay performance of the database are introduced into the reward function, and the specific formula is as follows:
where O represents throughput, D represents delay, O_b and D_b represent the corresponding baseline values, α+β=1, α is the weight of database throughput in system performance, and β is the weight of database delay in system performance; if the throughput and delay performance of the database improve, a positive reward is produced. The reward function evaluates the effect of each action, judging the increase or decrease of database performance through the change in performance indicators, and the reinforcement learning model is updated based on the feedback of the reward function.
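The exact reward formula appears only as an image in the original document; one plausible reading consistent with the description (relative throughput gain plus relative delay reduction, weighted so that α + β = 1) can be sketched as:

```python
def reward(O, D, O_b, D_b, alpha=0.6, beta=0.4):
    # Assumed form, not the patent's verbatim formula: positive when
    # throughput rises above baseline O_b and delay falls below baseline D_b.
    assert abs(alpha + beta - 1.0) < 1e-9
    return alpha * (O - O_b) / O_b + beta * (D_b - D) / D_b

# improvements in both throughput and delay yield a positive reward
print(reward(O=1200, D=40, O_b=1000, D_b=50))  # -> 0.2 (0.6*0.2 + 0.4*0.2)
```

Normalizing against the baselines makes the two terms dimensionless, so α and β trade off throughput against delay directly.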
In addition, in the present invention, regarding the reinforcement learning training and policy optimization described above, the specific flow is as follows:
training a deep reinforcement learning model by using historical data, optimizing a decision strategy of the model by interacting with the environment, and gradually improving the performance of a database, wherein the specific flow is as follows:
1) The parameters of the Actor networks (main policy network and target policy network) and Critic networks (main value function network and target value function network) are initialized, and an experience replay buffer is created for storing experience samples.
2) Interact with the environment and collect sample data. For each time step, an action a = μ(s|θ_Actor) + ε is selected using the main Actor network according to the current state s and is executed, where ε is a noise term and θ_Actor denotes the model parameters of the Actor network. The next state s′ and reward r are then observed, and (s, a, r, s′) is stored in the experience replay buffer.
3) Experience replay: a batch of experience samples is randomly sampled from the experience replay buffer and used to update the Actor and Critic networks. The correction amplitude is measured by the TD error: the larger the absolute value of the TD error, the stronger the corrective effect of the sample on the network. For each sample, a target Q value and the TD error are calculated. The TD error is defined as:
δ = r(s_t, a_t) + γ·Q′(s_{t+1}, a_{t+1}) − Q(s_t, a_t),
where Q and Q′ are the Q network and the target Q network respectively, r is the current reward, and γ is the discount factor. The probabilistic sampling mechanism ensures that samples with smaller TD errors can still be sampled, preserving sample diversity during algorithm training.
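The TD error and the priority assignment described in step 3) can be sketched as follows; the small epsilon added to |δ| is a common convention (assumed here) that keeps every sample samplable:

```python
def td_error(r, q_next, q_curr, gamma=0.99):
    # delta = r(s_t, a_t) + gamma * Q'(s_{t+1}, a_{t+1}) - Q(s_t, a_t)
    return r + gamma * q_next - q_curr

# transitions with larger |delta| receive larger replay priority
deltas = [td_error(1.0, 0.5, 0.4), td_error(0.0, 0.5, 2.0)]
priorities = [abs(d) + 1e-3 for d in deltas]  # epsilon keeps low-error samples alive
```

These priorities would feed the proportional sampling scheme sketched earlier for the replay buffer.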
4) Updating an Actor network, and calculating a loss function of the Actor network, wherein the formula is as follows:
where N is the number of samples in the batch, s_i denotes the current state, and θ_Actor and θ_Critic denote the model parameters of the Actor network and the Critic network respectively. The Q value corresponding to the action selected by the Actor network is maximized, and the network parameters are simultaneously updated by gradient descent, with the formula as follows:
where α_Actor denotes the learning rate of the Actor network, ∇θ_Actor denotes the gradient of the Actor network, and θ_Actor denotes the model parameters of the Actor network.
5) Updating the Critic network, and calculating a target Q value according to the following formula:
y = r + γ·min_{i=1,2} Q′_i(s′, μ′(s′) + ε′),
where r is the current reward, γ is the discount factor, s′ is the next state, μ′ is the output of the target Actor network, and ε′ is a noise term. The loss function of the Critic network is calculated as follows:
where N is the number of samples in the batch. The Critic network parameters are updated by gradient descent, with the formula as follows:
where α_Critic denotes the learning rate of the Critic network, ∇θ_Critic denotes the gradient of the Critic network, and θ_Critic denotes the model parameters of the Critic network.
6) Soft-update the target networks, gradually moving the parameters of the target Actor network and the target Critic network toward those of the main Actor network and the main Critic network. The formula is as follows:
θ′ = τθ + (1 − τ)θ′,
where θ′ is a parameter of the target network, θ is a parameter of the main network, and τ is a hyperparameter less than 1 that controls the update rate.
7) Repeating the steps 2-6 until the network model converges.
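The soft update of step 6), θ′ = τθ + (1 − τ)θ′, applied element-wise to a parameter vector:

```python
def soft_update(theta_target, theta_main, tau=0.005):
    # theta' <- tau * theta + (1 - tau) * theta', element-wise
    return [tau * m + (1 - tau) * t for t, m in zip(theta_target, theta_main)]

tgt = [0.0, 1.0]
for _ in range(3):
    tgt = soft_update(tgt, [1.0, 1.0], tau=0.5)
print(tgt)  # -> [0.875, 1.0]
```

With a small τ (0.005 is a common default, assumed here) the target parameters trail the main parameters slowly, which is what keeps the target Q values stable during training.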
In addition, in the present invention, regarding the above-mentioned real-time decision and tuning, the specific flow is as follows:
in the tuning process, according to the reinforcement learning decision model, the system can make real-time decisions and adjustments according to the current database state and performance indexes. For example, the system may automatically adjust the cache size of the database, index configuration, query optimization strategy, etc., based on the current load situation and characteristics of the query request. By constantly observing and evaluating the adjusted database performance, the system will update the learning model in time to accommodate the constantly changing database workload and environment.
In addition, in the present invention, regarding the above performance evaluation and optimization, the specific flow thereof is as follows:
By periodically evaluating and optimizing database performance, new workload data can be collected and used to update the reinforcement learning model, continuously improving the database's adaptive tuning capability. This process not only reveals the database's current performance situation, but also uncovers potential performance bottlenecks and optimization opportunities. Through continuous optimization measures, the response speed, throughput, and resource utilization of the database can be improved, thereby raising the efficiency and performance of the whole system.
Through the detailed steps above, the adaptive database optimization method based on deep reinforcement learning can adaptively optimize a database system, improving the performance and efficiency of the database.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus,
a memory for storing a computer program;
and the processor is used for realizing the adaptive database tuning method based on the deep reinforcement learning when executing the program stored in the memory.
The communication bus mentioned for the above terminal may be a peripheral component interconnect standard (Peripheral Component Interconnect, abbreviated as PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, abbreviated as EISA) bus, etc. The communication bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, the figures show only one bold line, but this does not mean there is only one bus or only one type of bus. The communication interface is used for communication between the terminal and other devices. The memory may include random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one storage system located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; it may also be a digital signal processor (Digital Signal Processing, DSP for short), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field-programmable gate array (Field-Programmable Gate Array, FPGA for short) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In addition, in order to achieve the above objective, an embodiment of the present invention further provides a computer readable storage medium storing a computer program, where the computer program when executed by a processor implements the adaptive database tuning method based on deep reinforcement learning according to the embodiment of the present invention.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, embodiments of the invention may take the form of a computer program product on one or more computer-usable storage media having computer-usable program code embodied therein, including but not limited to disk storage, CD-ROM, optical storage, and the like.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create a system for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should further be noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. "And/or" means that either or both of the connected items may be selected. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or terminal device comprising the element.
Finally, it should be noted that the above embodiments are merely for illustrating the technical solution of the embodiments of the present invention, and are not limiting. Although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the invention, and any changes and substitutions that would be apparent to one skilled in the art are intended to be included within the scope of the present invention.
Claims (10)
1. An adaptive database optimization method based on deep reinforcement learning, characterized by comprising the following steps:
s1, collecting data in a database, and preprocessing the collected data;
s2, taking data in a database as a state identifier, and defining an action space of a tuning strategy;
s3, constructing a deep reinforcement learning model through a deep neural network;
s4, designing a reward function, evaluating the effect of each action through the reward function, and updating a deep reinforcement learning model according to feedback of the reward function;
s5, training the deep reinforcement learning model through data in a database, and optimizing a decision strategy of the model through environment interaction;
s6, when the database runs in real time, according to the current working load condition and the system state, decision making and optimization are carried out through a trained reinforcement learning model, and configuration parameters and optimization strategies of the database are automatically adjusted;
s7, evaluating and optimizing the performance of the database regularly, collecting new data and updating the model.
2. The adaptive database tuning method based on deep reinforcement learning of claim 1, wherein in S1, the data includes historical workload data, system performance indicators and configuration parameters.
3. The adaptive database tuning method based on deep reinforcement learning according to claim 1, wherein in S1, the preprocessing includes data cleaning, feature extraction and normalization.
4. The adaptive database tuning method based on deep reinforcement learning of claim 1, wherein in S2, the action space defining the tuning strategy includes adjusting parameters of a query optimizer and adjusting a cache size.
5. The adaptive database tuning method based on deep reinforcement learning of claim 4, wherein the deep neural network comprises two strategy networks and two value function networks.
6. The method of adaptive database tuning based on deep reinforcement learning of claim 2, wherein the historical workload data comprises query requests, data access patterns, number of concurrent users, and database load time period.
7. The adaptive database tuning method based on deep reinforcement learning of claim 2, wherein the system performance index includes response time, throughput, concurrent connection number, CPU utilization, memory utilization and disk IO speed.
8. The method for optimizing an adaptive database based on deep reinforcement learning according to claim 2, wherein the configuration parameters include a query optimizer parameter, a cache size, a concurrent connection limit, a memory allocation policy, and a disk storage parameter.
9. The adaptive database tuning method based on deep reinforcement learning according to claim 1, wherein in S4, the specific formula of the reward function is:
where O represents throughput, D represents delay, O_b and D_b represent the baseline throughput and delay respectively, α + β = 1, α is the proportionality coefficient of database throughput in system performance, and β is the proportionality coefficient of database delay in system performance.
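The claim leaves the formula itself to the description. Purely as an illustrative assumption, one common form of such a reward in database tuning work weights relative throughput gain against relative delay growth over the baseline:

```python
def reward(O, D, O_b, D_b, alpha=0.5, beta=0.5):
    """Illustrative reward sketch, NOT the patent's elided formula:
    weighted relative throughput gain minus weighted relative delay
    growth over the baseline, with alpha + beta = 1."""
    assert abs(alpha + beta - 1.0) < 1e-9
    return alpha * (O - O_b) / O_b - beta * (D - D_b) / D_b
```

Under this form the reward is zero at baseline performance, positive when throughput rises or delay falls, and the α/β weights trade the two objectives off exactly as the claim's proportionality coefficients describe.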
10. The adaptive database optimization method based on deep reinforcement learning according to claim 1, wherein training the deep reinforcement learning model with data in the database and optimizing the decision strategy of the model through environment interaction specifically comprises the following steps:
s51: initializing parameters of a main strategy network, a target strategy network, a main value function network and a target value function network, and creating an experience playback buffer zone for storing experience samples;
s52: interacting with the environment, and collecting sample data;
s53: randomly sampling a batch of experience samples from an experience playback buffer zone, and updating a main strategy network, a target strategy network, a main value function network and a target value function network;
s54: updating the main strategy network and the target strategy network, and calculating loss functions of the main strategy network and the target strategy network;
s55: updating the main value function network and the target value function network, and calculating the loss functions of the main strategy network and the target strategy network;
s56: soft updating the target network, and slowly approaching the parameters of the target main strategy network, the target strategy network, the main value function network and the target value function network to the parameters of the main strategy network and the main Critic network;
s57: S52-S56 are repeated until the network model converges.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311714897.XA CN117873999A (en) | 2023-12-13 | 2023-12-13 | Adaptive database optimization method based on deep reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117873999A true CN117873999A (en) | 2024-04-12 |
Family
ID=90593757
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311714897.XA Pending CN117873999A (en) | 2023-12-13 | 2023-12-13 | Adaptive database optimization method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117873999A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||