CN115293227A - Model training method and related equipment - Google Patents

Model training method and related equipment

Info

Publication number
CN115293227A
CN115293227A (Application CN202210705971.0A)
Authority
CN
China
Prior art keywords
processing result
neural network
target
reinforcement learning
data
Prior art date
Legal status
Pending
Application number
CN202210705971.0A
Other languages
Chinese (zh)
Inventor
和煦
李栋
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202210705971.0A priority Critical patent/CN115293227A/en
Publication of CN115293227A publication Critical patent/CN115293227A/en
Priority to PCT/CN2023/101527 priority patent/WO2023246819A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

A model training method in the field of artificial intelligence includes the following steps: processing first data through a first reinforcement learning model to obtain a first processing result; processing the first data through a first target neural network selected from a plurality of first neural networks to obtain a second processing result, where each first neural network is an iteration result obtained during iterative training of a first initial neural network; and updating the first reinforcement learning model according to the first processing result and the second processing result. In this way, the historical training results of the adversarial agent (adversarial agents obtained in earlier iterations) are used to output disturbances against the target task, so that more effective disturbances against the target task in different scenarios can be obtained, improving the training effect and the generalization of the model.

Description

Model training method and related equipment
Technical Field
The application relates to the field of artificial intelligence, in particular to a model training method and related equipment.
Background
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision and reasoning, human-computer interaction, recommendation and search, AI basic theory, and the like.
Reinforcement Learning (RL) is an important machine learning method in the field of artificial intelligence, with many applications in autonomous driving, intelligent robot control, analysis and prediction, and other fields. The main problem reinforcement learning addresses is how to learn, by interacting directly with the environment, the skills needed to perform a specific task, so as to maximize the long-term reward for that task. When a reinforcement learning algorithm is applied, interaction with an online environment is usually required to collect data and train. The common practice is to model a real-world scenario and generate a virtual, simulated online environment. In this case, even a slight difference between the training environment and the real environment in which the algorithm is deployed is likely to cause the trained algorithm to fail, resulting in unexpected behavior in the real scenario.
The above problem can be alleviated by improving the robustness of the reinforcement learning algorithm. One method is to introduce hypothetical disturbances into the virtual environment and train the reinforcement learning algorithm under disturbance, which improves the algorithm's ability to cope with disturbances and enhances its robustness and generalization. That is, an adversarial agent can be set up for the reinforcement learning model to be trained; the data output by the adversarial agent and the data output by the reinforcement learning model jointly execute a task, with the adversarial agent's output acting as a disturbance on the execution of the target task. However, because the difference between the training environment and the deployment environment is unpredictable, in existing training methods the adversarial agent can only output one specific kind of disturbance (for example, for robot control, a force within a certain range applied to a certain joint). When the variation in the real environment is inconsistent with the hypothetical disturbance (i.e., the disturbance output by the adversarial agent), the algorithm's performance degrades and its robustness is poor.
Disclosure of Invention
The application provides a model training method which can improve the training effect and the generalization of a model.
In a first aspect, the present application provides a model training method, including: processing the first data through a first reinforcement learning model to obtain a first processing result; wherein the first data indicates a state of a target object, and the first processing result is used as control information when a target task is executed on the target object; processing the first data through a first target neural network to obtain a second processing result; the second processing result is used as interference information when the target task is executed, the first target neural network is selected from a plurality of first neural networks, and each first neural network is an iteration result obtained by an iterative training process of a first initial neural network; executing the target task according to the first processing result and the second processing result to obtain a third processing result; and updating the first reinforcement learning model according to the third processing result to obtain an updated first reinforcement learning model.
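Purely as an illustration of the first aspect (and not the claimed implementation), the following Python sketch shows how one such training iteration could be organized. The names `policy`, `adversary_pool`, `selection_probs`, and the environment interface `env.observe`/`env.step`/`env.reward` are all assumptions introduced for this sketch.

```python
import random

def train_iteration(policy, adversary_pool, selection_probs, env):
    """One hypothetical training iteration of the reinforcement learning model.

    policy          -- the first reinforcement learning model (protagonist)
    adversary_pool  -- historical iterates of the adversarial agent (the first neural networks)
    selection_probs -- first selection probability for each historical iterate
    env             -- simulated environment that executes the target task
    """
    # Sample a historical adversary: the "first target neural network".
    adversary = random.choices(adversary_pool, weights=selection_probs, k=1)[0]

    first_data = env.observe()               # state of the target object
    control = policy.act(first_data)         # first processing result: control information
    disturbance = adversary.act(first_data)  # second processing result: disturbance information

    # Execute the target task under the joint effect of control and disturbance.
    third_result = env.step(control, disturbance)
    reward = env.reward(third_result)

    # Update the reinforcement learning model from the outcome of the task.
    policy.update(first_data, control, reward)
    return reward
```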
In one possible implementation, the first reinforcement learning model may be an initialized model or an output of an iteration of a model training process. It should be understood that the reinforcement learning model in the embodiments of the present application includes, but is not limited to, a deep neural network, a bayesian neural network, and the like.
In one possible implementation, the first data may be processed by the first reinforcement learning model during a feedforward pass of model training to obtain the first processing result. The first processing result is used as control information when a target task is executed on the target object. For example, if the target task is attitude control of a robot, the first processing result is attitude control information of the robot; or, if the target task is automatic driving of a vehicle, the first processing result is driving control information of the vehicle.
In this embodiment of the application, on one hand, a plurality of adversarial agents for outputting disturbance information can be trained, and the disturbance information output by different adversarial agents can apply different types of disturbance to the target task. On the other hand, when the adversarial agents are trained, not only the adversarial agent obtained in the latest iteration is used to output disturbances against the target task, but also the historical training results of the adversarial agents (adversarial agents obtained in earlier iterations) can be used to output disturbances against the target task. In this way, more effective disturbances against the target task in different scenarios can be obtained, improving the training effect and the generalization of the model.
In one possible implementation, the first data is robot-related status information; the target task is attitude control of the robot, and the first processing result is attitude control information of the robot.
In one possible implementation, the robot-related state information may include, but is not limited to, the position and speed of the robot and information about the scene in which the robot is located (e.g., obstacle information); the position and speed of the robot may include the state (position, angle, speed, acceleration, etc.) of each joint.
In one possible implementation, the first reinforcement learning model may obtain posture control information of the robot according to the input data, the posture control information may include control information of each joint of the robot, and a posture control task of the robot may be performed based on the posture control information.
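As a purely illustrative sketch of how the robot-related state (the first data) and the posture control information (the first processing result) could be organized, the field names below are assumptions and are not specified by the application:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class JointState:
    position: float      # joint angle
    velocity: float
    acceleration: float

@dataclass
class RobotState:                         # "first data": robot-related state information
    joints: Dict[str, JointState]
    base_position: List[float]            # position of the robot in the scene
    base_velocity: List[float]
    obstacles: List[List[float]] = field(default_factory=list)  # scene information

@dataclass
class PostureControl:                     # "first processing result": posture control information
    joint_torques: Dict[str, float]       # control information for each joint
```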
In one possible implementation, the first data is vehicle-related status information; the target task is automatic driving of the vehicle, and the first processing result is driving control information of the vehicle.
In one possible implementation, the vehicle-related state information may include, but is not limited to, position, speed of the vehicle, and scene-related information (e.g., information of a driving surface, obstacle information, pedestrian information, information of surrounding vehicles) in which the vehicle is located.
In one possible implementation, the first reinforcement learning model may obtain driving control information of the vehicle according to the input data, and the driving control information may include information of speed, direction, driving track, and the like of the vehicle.
In one possible implementation, the method further comprises: selecting the first target neural network from the plurality of first neural networks.
In one possible implementation, the first target neural network is selected from a plurality of first neural networks based on a first selection probability corresponding to each of the plurality of first neural networks. That is, each first neural network may be configured with a probability (i.e., the first selection probability described above), and when the first target neural network is selected from the plurality of first neural networks, sampling may be performed based on probability distributions corresponding to the plurality of first neural networks and network selection may be performed based on a result of the sampling.
In a possible implementation, a processing result obtained by each first neural network processing data is used as a disturbance when the target task is executed, and the first selection probability is positively correlated with the degree of disturbance that the processing result output by the corresponding first neural network applies to the target task. The first selection probability can be a trainable parameter. When the reinforcement learning model and the model of the adversarial agent are updated, a reward value can be obtained; the reward value can represent how well the data output by the reinforcement learning model performs the target task, and can also represent the degree of disturbance that the disturbance information output by the adversarial agent applies to the target task. The probability distribution corresponding to the first neural networks can be updated based on the reward value, so that the first selection probability is positively correlated with the degree of disturbance applied by the corresponding first neural network. In this way, on one hand, an adversarial agent with a larger output disturbance range has a larger sampling probability, so it is sampled more easily and the degree of disturbance applied to the reinforcement learning model is increased; on the other hand, an adversarial agent with a smaller output disturbance range, although its sampling probability is smaller, may still be sampled, which enriches the disturbances applied to the reinforcement learning model and improves the generalization of the network.
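A minimal sketch of sampling the first target neural network according to the first selection probabilities is shown below; the NumPy-based interface and the renormalization step are assumptions for illustration only.

```python
import numpy as np

def select_first_target_network(networks, first_selection_probs, rng=None):
    """Sample one historical adversary from the pool according to its
    first selection probability (probabilities are assumed non-negative
    and are renormalized into a distribution here)."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(first_selection_probs, dtype=float)
    probs = probs / probs.sum()                  # normalize into a distribution
    index = rng.choice(len(networks), p=probs)   # probability-weighted sampling
    return networks[index]
```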
In one possible implementation, the probability distribution may be a Nash equilibrium distribution. The probability distribution may be obtained through a Nash equilibrium calculation based on the reward value obtained by executing the target task with the data and disturbance information produced in the feedforward pass of the reinforcement learning model, and the probability distribution may be updated during the iterative process.
In this embodiment of the application, the action space of the adversarial agent is controlled and its disturbance strength is varied, so that the reinforcement learning policy is robust to both strong and weak disturbances. In addition, a game-theoretic optimization framework is introduced, and historical policies are used to increase the diversity of the adversarial agents, so that the reinforcement learning policy is more robust to disturbances from different policies.
In one possible implementation, the updating the first reinforcement learning model according to the third processing result includes:
obtaining a reward value corresponding to the target task according to the third processing result;
updating the first reinforcement learning model according to the reward value;
the method further comprises the following steps:
and updating the first selection probability corresponding to the first target neural network according to the reward value.
In one possible implementation, after an adversarial agent is sampled for each adversarial task, the reinforcement learning policy and the updated adversarial agent policy can be added to the Nash equilibrium matrix, and the Nash equilibrium is computed to obtain the Nash equilibrium distribution over the reinforcement learning and adversarial agents. Specifically, updating the first reinforcement learning model according to the first processing result and the second processing result includes: obtaining a reward value corresponding to the target task according to the first processing result and the second processing result; and updating the first reinforcement learning model according to the reward value. Further, the first selection probability corresponding to the first target neural network may be updated according to the reward value.
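The application does not specify how the Nash equilibrium distribution is computed. One common way to obtain the equilibrium mixed strategy of a two-player zero-sum game from a payoff (reward) matrix is linear programming; the sketch below is an illustrative assumption, not the method prescribed by the application.

```python
import numpy as np
from scipy.optimize import linprog

def zero_sum_nash(payoff):
    """Mixed Nash equilibrium strategy of the row player in a zero-sum game.

    payoff[i, j] is the reward of row strategy i (e.g., a reinforcement
    learning policy) against column strategy j (e.g., an adversary policy).
    The column player's distribution can be obtained by solving the game
    with payoff matrix -payoff.T.
    """
    payoff = np.asarray(payoff, dtype=float)
    m, n = payoff.shape
    # Variables: x_1..x_m (mixed strategy) and v (game value). Maximize v.
    c = np.zeros(m + 1)
    c[-1] = -1.0                                    # linprog minimizes, so minimize -v
    # For every column j: sum_i payoff[i, j] * x_i >= v  <=>  -A^T x + v <= 0
    A_ub = np.hstack([-payoff.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])   # sum(x) == 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m]
```

Under this assumption, the returned distribution could serve as the selection probabilities over the historical policies of one side of the game.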
In one possible implementation, to enrich the disturbances applied to the reinforcement learning model, a plurality of adversarial agents may be trained, and, for each of the plurality of adversarial agents, the adversarial agent that applies the disturbance to the reinforcement learning model may be selected from the plurality of iteration results produced during its training.
In one possible implementation, the method further comprises:
processing the first data through a second target neural network to obtain a fourth processing result; the fourth processing result is used as interference information when the target task is executed, the second target neural network is selected from a plurality of second neural networks, and each second neural network is an iteration result obtained by an iterative training process of a second initial neural network; the first initial neural network and the second initial neural network are different;
executing the target task according to the first processing result and the second processing result to obtain a third processing result, including:
and executing the target task according to the first processing result, the fourth processing result and the second processing result to obtain a third processing result.
In one possible implementation, the second processing result and the fourth processing result are of different interference types.
For example, the disturbance type may be a category of disturbance applied when performing the target task, such as applying a force, applying a moment, adding an obstacle, changing a road condition, changing weather, and so forth.
In one possible implementation, the interfering objects of the second and fourth processing results are different.
For example, a robot may include multiple joints, and applying a force to different joints, or to different groups of joints, may be regarded as disturbances with different disturbance objects. That is, the second processing result and the fourth processing result are forces applied to different joints, or to different joint groups.
In one possible implementation, the first target neural network is configured to determine the second processing result from a first range of values based on the first data, and the second target neural network is configured to determine the fourth processing result from a second range of values based on the first data, the second range of values being different from the first range of values.
For example, the second processing result and the fourth processing result are forces applied to the robot joint, the maximum value of the magnitude of the force determined by the first target neural network is A1, the maximum value of the magnitude of the force determined by the second target neural network is A2, and A1 and A2 are different.
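For illustration, two adversarial agents with different disturbance ranges could be realized simply by clipping their outputs to different bounds; the wrapper class and the bound values below are assumptions, not part of the application.

```python
import numpy as np

class BoundedAdversary:
    """Adversarial agent whose output disturbance is limited to [-max_force, max_force]."""

    def __init__(self, policy_fn, max_force):
        self.policy_fn = policy_fn    # underlying neural network, assumed callable on a state
        self.max_force = max_force    # plays the role of A1 or A2 in the example above

    def act(self, state):
        raw = np.asarray(self.policy_fn(state))
        return np.clip(raw, -self.max_force, self.max_force)

# First target neural network: forces bounded by A1; second target neural network: bounded by A2.
# adversary_1 = BoundedAdversary(first_target_network, max_force=10.0)   # A1
# adversary_2 = BoundedAdversary(second_target_network, max_force=50.0)  # A2
```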
In one possible implementation, during the iterative training of the adversarial agent, the reinforcement learning model of the current round that participates in the training process may also be selected from the historical iteration results of the reinforcement learning model. For example, the selection may be based on probability sampling; for details, refer to the process of sampling the adversarial agents described in the above embodiment.
In one possible implementation, the second data may be processed by a second reinforcement learning model to obtain a fifth processing result; the second reinforcement learning model is selected from a plurality of reinforcement learning models including the updated first reinforcement learning model, and each reinforcement learning model is an iteration result obtained in the process of performing iterative training on the initial reinforcement learning model; the second data indicates a state of a target object, and the fifth processing result is used as control information when the target task is executed on the target object; processing the second data through a third target neural network to obtain a sixth processing result; the third target neural network belongs to the plurality of first neural networks; the sixth processing result is used as interference information when the target task is executed; executing the target task according to the fifth processing result and the sixth processing result to obtain a seventh processing result; and updating the third target neural network according to the seventh processing result to obtain an updated third target neural network.
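Mirroring the protagonist update sketched earlier, one adversary-training iteration might look as follows. This is again a hypothetical sketch: the sampled reinforcement learning policy comes from a pool of historical iterates, and the adversary is assumed to be updated against the negative of the task reward.

```python
import random

def train_adversary_iteration(rl_pool, rl_probs, adversary, env):
    """One hypothetical update step of the third target neural network (adversary).

    rl_pool   -- historical iterates of the reinforcement learning model
    rl_probs  -- second selection probability for each historical iterate
    adversary -- the adversarial agent being trained in this round
    env       -- simulated environment that executes the target task
    """
    # Sample the second reinforcement learning model from the historical pool.
    rl_model = random.choices(rl_pool, weights=rl_probs, k=1)[0]

    second_data = env.observe()
    control = rl_model.act(second_data)        # fifth processing result
    disturbance = adversary.act(second_data)   # sixth processing result

    seventh_result = env.step(control, disturbance)
    reward = env.reward(seventh_result)

    # The adversary is trained to degrade task performance, so it is rewarded with -reward.
    adversary.update(second_data, disturbance, -reward)
    return reward
```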
In one possible implementation, the second reinforcement learning model may be selected from the plurality of reinforcement learning models.
In one possible implementation, the selecting the second reinforcement learning model from the plurality of reinforcement learning models includes: selecting the second reinforcement learning model from a plurality of reinforcement learning models based on the second selection probability corresponding to each reinforcement learning model in the plurality of reinforcement learning models.
In one possible implementation, the second selection probability is positively correlated with how well the processing result output by the corresponding reinforcement learning model performs the target task. When the reinforcement learning model and the adversarial agent are updated, a reward value can be obtained; the reward value can represent how well the data output by the reinforcement learning model performs the target task, and the probability distribution corresponding to the reinforcement learning models can be updated based on the reward value, so that the second selection probability is positively correlated with the execution performance of the processing result output by the corresponding reinforcement learning model.
In one possible implementation, a historical policy of the reinforcement learning agent may be sampled from the set of historical policies of the reinforcement learning agent according to the Nash equilibrium distribution and used for the adversarial agent's policy update. In the training environment, the selected reinforcement learning policy and the current adversarial agent policy are deployed, and the required training samples are collected by sampling. The obtained training samples are then used to train the adversarial agent policy.
In a second aspect, the present application provides a model training apparatus, the apparatus comprising:
the data processing module is used for processing the first data through the first reinforcement learning model to obtain a first processing result; wherein the first data indicates a state of a target object, and the first processing result is used as control information when a target task is executed on the target object;
processing the first data through a first target neural network to obtain a second processing result; the second processing result is used as interference information when the target task is executed, the first target neural network is selected from a plurality of first neural networks, and each first neural network is an iteration result obtained by an iterative training process of a first initial neural network;
executing the target task according to the first processing result and the second processing result to obtain a third processing result;
and the model updating module is used for updating the first reinforcement learning model according to the third processing result so as to obtain the updated first reinforcement learning model.
In this embodiment of the application, on one hand, a plurality of adversarial agents for outputting disturbance information can be trained, and the disturbance information output by different adversarial agents can apply different types of disturbance to the target task. On the other hand, when the adversarial agents are trained, not only the adversarial agent obtained in the latest iteration is used to output disturbances against the target task, but also the historical training results of the adversarial agents (adversarial agents obtained in earlier iterations) are used to output disturbances against the target task, so that more effective disturbances against the target task in different scenarios can be obtained, improving the training effect and the generalization of the model.
In one possible implementation,
the target object is a robot; the target task is attitude control of the robot, and the first processing result is attitude control information of the robot; or,
the target object is a vehicle; the target task is automatic driving of the vehicle, and the first processing result is driving control information of the vehicle.
In one possible implementation, the apparatus further comprises:
a network selection module to select the first target neural network from the plurality of first neural networks.
In one possible implementation, the first target neural network is selected from a plurality of first neural networks based on a first selection probability corresponding to each of the plurality of first neural networks.
In a possible implementation, a processing result obtained by each first neural network processing data is used as a disturbance when the target task is executed, and the first selection probability is positively correlated with the degree of disturbance that the processing result output by the corresponding first neural network applies to the target task.
In a possible implementation, the model updating module is specifically configured to:
obtaining a reward value corresponding to the target task according to the third processing result;
updating the first reinforcement learning model according to the reward value;
the model update module is further configured to:
and updating the first selection probability corresponding to the first target neural network according to the reward value.
In one possible implementation, the data processing module is further configured to:
processing the first data through a second target neural network to obtain a fourth processing result; the fourth processing result is used as interference information when the target task is executed, the second target neural network is selected from a plurality of second neural networks, and each second neural network is an iteration result obtained by an iterative training process of a second initial neural network; the first initial neural network and the second initial neural network are different;
the data processing module is specifically configured to:
and executing the target task according to the first processing result, the fourth processing result and the second processing result to obtain a third processing result.
In one possible implementation,
the interference types of the second processing result and the fourth processing result are different; or,
the interference objects of the second processing result and the fourth processing result are different; or,
the first target neural network is configured to determine the second processing result from a first range of values based on the first data, and the second target neural network is configured to determine the fourth processing result from a second range of values based on the first data, the second range of values being different from the first range of values.
In one possible implementation, the data processing module is further configured to:
processing the second data through a second reinforcement learning model to obtain a fifth processing result; the second reinforcement learning model is selected from a plurality of reinforcement learning models including the updated first reinforcement learning model, and each reinforcement learning model is an iteration result obtained in the process of performing iterative training on the initial reinforcement learning model; the second data indicates a state of a target object, and the fifth processing result is used as control information when the target task is executed on the target object;
processing the second data through a third target neural network to obtain a sixth processing result; the third target neural network belongs to the plurality of first neural networks; the sixth processing result is used as interference information when the target task is executed;
executing the target task according to the fifth processing result and the sixth processing result to obtain a seventh processing result;
the model update module is further configured to:
and updating the third target neural network according to the seventh processing result to obtain an updated third target neural network.
In one possible implementation, the network selection module is further configured to:
selecting the second reinforcement learning model from the plurality of reinforcement learning models.
In a possible implementation, the network selection module is specifically configured to:
selecting the second reinforcement learning model from a plurality of reinforcement learning models based on the second selection probability corresponding to each reinforcement learning model in the plurality of reinforcement learning models.
In a third aspect, the present application provides a data processing method, including:
acquiring first data, the first data indicating a state of a target object;
processing the first data through a first reinforcement learning model to obtain a first processing result; the first processing result is used as control information of the target object; wherein,
the first reinforcement learning model is updated according to a reward value during one iteration of training, the reward value is obtained by executing the target task according to control information output in a feedforward pass of the first reinforcement learning model and disturbance information applied when the target task is executed, the disturbance information is obtained through a feedforward pass of a target neural network, the target neural network is selected from a plurality of neural networks, and each neural network is an iteration result obtained during iterative training of an initial neural network;
and executing a target task on the target object according to the first processing result.
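At deployment time (the third aspect), only the trained reinforcement learning model is involved; a minimal, assumed control loop might be as follows (the `target_object` interface and method names are hypothetical).

```python
def control_loop(trained_policy, target_object, max_steps=1000):
    """Hypothetical deployment loop: no adversary is applied at inference time.

    trained_policy -- the first reinforcement learning model after training
    target_object  -- interface to the real robot or vehicle (assumed API)
    """
    for _ in range(max_steps):
        first_data = target_object.read_state()     # state of the target object
        control = trained_policy.act(first_data)    # first processing result
        target_object.apply_control(control)        # execute the target task
```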
In one possible implementation,
the target object is a robot; the target task is attitude control of the robot, and the first processing result is attitude control information of the robot; or,
the target object is a vehicle; the target task is automatic driving of the vehicle, and the first processing result is driving control information of the vehicle.
In one possible implementation, the target neural network is selected from the plurality of neural networks based on a selection probability corresponding to each of the plurality of neural networks.
In a possible implementation, a processing result obtained by each neural network processing data is used as a disturbance when the target task is executed, and the selection probability is positively correlated with the degree of disturbance that the processing result output by the corresponding neural network applies to the target task.
In a fourth aspect, an embodiment of the present application provides a model training apparatus, which may include a memory, a processor, and a bus system, where the memory is used to store a program, and the processor is used to execute the program in the memory to perform the method as described in the first aspect and any optional method thereof.
In a fifth aspect, embodiments of the present application provide a data processing apparatus, which may include a memory, a processor, and a bus system, wherein the memory is used for storing a program, and the processor is used for executing the program in the memory to perform the method according to the third aspect and any optional method thereof.
In a sixth aspect, embodiments of the present application provide a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a computer, the computer program causes the computer to execute the first aspect and any optional method thereof, or the third aspect and any optional method thereof.
In a seventh aspect, this application embodiment provides a computer program product including instructions, which when run on a computer, cause the computer to perform the first aspect and any optional method thereof, or the third aspect and any optional method thereof.
In an eighth aspect, the present application provides a chip system. The chip system includes a processor, configured to support a model training apparatus in implementing some or all of the functions involved in the above aspects, for example, sending or processing the data or information involved in the above methods. In one possible design, the chip system further includes a memory, which stores the program instructions and data necessary for the execution device or the training device. The chip system may consist of a chip, or may include a chip and other discrete devices.
The embodiment of the application provides a model training method, which includes: processing first data through a first reinforcement learning model to obtain a first processing result, where the first data indicates a state of a target object and the first processing result is used as control information when a target task is executed on the target object; processing the first data through a first target neural network to obtain a second processing result, where the second processing result is used as disturbance information when the target task is executed, the first target neural network is selected from a plurality of first neural networks, and each first neural network is an iteration result obtained during iterative training of a first initial neural network; executing the target task according to the first processing result and the second processing result to obtain a third processing result; and updating the first reinforcement learning model according to the third processing result to obtain an updated first reinforcement learning model. In this way, when the adversarial agent is trained, not only the adversarial agent obtained in the latest iteration is used to output disturbances against the target task, but also the historical training results of the adversarial agent (adversarial agents obtained in earlier iterations) can be used to output disturbances against the target task, so that more effective disturbances against the target task in different scenarios can be obtained, improving the training effect and the generalization of the model.
Drawings
FIG. 1 is a schematic diagram of an application architecture;
FIG. 2 is a schematic diagram of an application architecture;
FIG. 3 is a schematic diagram of an application architecture;
FIG. 4 is a schematic diagram of an embodiment of a model training method provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a software architecture provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of an embodiment of a model training method provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of an embodiment of a model training apparatus provided in the embodiment of the present application;
fig. 8 is a schematic structural diagram of an execution device according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a server according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a chip according to an embodiment of the present disclosure.
Detailed Description
The embodiments of the present invention will be described below with reference to the drawings. The terminology used in the description of the embodiments of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Embodiments of the present application are described below with reference to the accompanying drawings. As can be known to those skilled in the art, with the development of technology and the emergence of new scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The terms "substantially", "about" and the like are used herein as terms of approximation and not as terms of degree, and are intended to take into account the inherent deviations in measured or calculated values that would be known to one of ordinary skill in the art. Furthermore, the use of "may" in describing an embodiment of the invention refers to "one or more embodiments possible". As used herein, the terms "use," "using," and "used" may be considered synonymous with the terms "utilizing," "utilizing," and "utilized," respectively. Additionally, the term "exemplary" is intended to refer to an instance or illustration.
The general workflow of an artificial intelligence system is described first. Referring to FIG. 1, which shows a schematic structural diagram of the main framework of artificial intelligence, the framework is explained below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects the general process from data acquisition to processing, for example the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making, and intelligent execution and output. In this process, the data undergoes a "data - information - knowledge - wisdom" refinement process. The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (technologies for providing and processing information) to the industrial ecology of the system.
(1) Infrastructure
The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and provides support through a base platform. The infrastructure communicates with the outside through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); the base platform includes related platform assurance and support such as a distributed computing framework and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to acquire data, and the data is provided to the intelligent chips in the distributed computing system provided by the base platform for computation.
(2) Data
Data at a level above the infrastructure is used to represent a source of data for the field of artificial intelligence. The data relates to graphs, images, voice and texts, and also relates to the data of the Internet of things of traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
The machine learning and the deep learning can be used for performing symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference refers to the process of simulating human intelligent inference in a computer or intelligent system, using formalized information to carry out machine-based thinking and problem solving according to an inference control strategy; typical functions are searching and matching.
Decision-making refers to a process of making a decision after reasoning intelligent information, and generally provides functions of classification, sorting, prediction and the like.
(4) General capabilities
After the above-mentioned data processing, further based on the result of the data processing, some general capabilities may be formed, such as algorithms or a general system, e.g. translation, analysis of text, computer vision processing, speech recognition, recognition of images, etc.
(5) Intelligent product and industrial application
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, turning intelligent information decision-making into products and enabling practical applications. The main application fields include smart terminals, intelligent transportation, smart healthcare, autonomous driving, smart cities, and the like.
With the development of artificial intelligence, many tasks that used to be completed by people are gradually being taken over by intelligent terminals. The skills used to complete these tasks, and the neural networks for these tasks, need to be configured on the intelligent terminal, so that the terminal can complete the specific task. Specifically, the method can be applied to mobile intelligent terminals. For example, in the field of autonomous driving, driving operations originally performed by a person can instead be performed by an intelligent vehicle, and a large number of driving skills and neural networks for those skills need to be configured in the intelligent vehicle. As another example, in the field of freight transportation, a transport operation originally performed by a person can instead be performed by a transport robot, and a large number of transport skills and neural networks for those skills need to be configured in the transport robot. For example, on a parts processing line, the part-grabbing operation originally performed by a person can be performed by an intelligent robotic arm, which then needs to be configured with grabbing skills and neural networks for those skills, where different grabbing skills may differ in grabbing angle, displacement of the robotic arm, and so on. As another example, in the field of automatic cooking, cooking operations originally performed by a person can be performed by an intelligent robotic arm, which then needs to be configured with cooking skills such as ingredient-grabbing and stir-frying skills and neural networks for those skills. Other application scenarios are not exhaustively listed here.
In order to better understand the solution of the embodiment of the present application, a brief description is first given below to possible implementation architectures of the embodiment of the present application with reference to fig. 2 and fig. 3.
FIG. 2 is a schematic diagram of a computing system that performs model training in an embodiment of the present application. The computing system includes a terminal device 102 (in some implementations the terminal device 102 may be omitted) and a server 130 (which may also be referred to as a central node) communicatively coupled via a network. The terminal device 102 may be any type of computing device, such as a personal computing device (e.g., a laptop or desktop computer), a mobile computing device (e.g., a smartphone or tablet computer), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.
Terminal device 102 may include a processor 112 and a memory 114. The processor 112 may be any suitable processing device (e.g., a processor core, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, etc.). The Memory 114 may include, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), or a portable Read-Only Memory (CD-ROM). The memory 114 may store data 116 and instructions 118 that are executed by the processor 112 to cause the terminal device 102 to perform operations.
In some implementations, the memory 114 may store one or more models 120. For example, the model 120 may be or may additionally include various machine learning models, such as neural networks (e.g., deep neural networks) or other types of machine learning models, including non-linear models and/or linear models. The neural network may include a feed-forward neural network, a recurrent neural network (e.g., a long-short term memory recurrent neural network), a convolutional neural network, or other form of neural network.
In some implementations, one or more models 120 may be received from server 130 over network 180, stored in memory 114, and then used or otherwise implemented by one or more processors 112.
Terminal device 102 may also include one or more user input components 122 that receive user input. For example, the user input component 122 may be a touch-sensitive component (e.g., a touch-sensitive display screen or a touchpad) that is sensitive to touch by a user input object (e.g., a finger or a stylus). The touch sensitive component may be used to implement a virtual keyboard. Other example user input components include a microphone, a conventional keyboard, or other device by which a user may provide user input.
The terminal device 102 may further include a communication interface 123, the terminal device 102 may be communicatively connected to the server 130 through the communication interface 123, the server 130 may include a communication interface 133, and the terminal device 102 may be communicatively connected to the communication interface 133 of the server 130 through the communication interface 123, so as to implement data interaction between the terminal device 102 and the server 130.
The server 130 may include a processor 132 and a memory 134. The processor 132 may be any suitable processing device (e.g., a processor core, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, etc.). The Memory 134 may include, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), or a portable Read-Only Memory (CD-ROM). The memory 134 may store data 136 and instructions 138 that are executed by the processor 132 to cause the server 130 to perform operations.
As described above, the memory 134 may store one or more machine learning models 140. For example, the model 140 may be or may additionally include various machine learning models. Example machine learning models include neural networks or other multi-layered nonlinear models. Example neural networks include feed-forward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks.
It should be understood that the model training method in the embodiment of the present application relates to AI-related operations, and when performing AI operations, the instruction execution architecture of the terminal device and the server is not limited to the architecture of the processor in combination with the memory shown in fig. 2. The system architecture provided by the embodiment of the present application is described in detail below with reference to fig. 3.
Fig. 3 is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in FIG. 3, the system architecture 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550, and a data collection system 560.
The execution device 510 includes a computation module 511, an I/O interface 512, a pre-processing module 513, and a pre-processing module 514. The target model/rule 501 may be included in the calculation module 511, with the pre-processing module 513 and the pre-processing module 514 being optional.
The data acquisition device 560 is used to acquire training samples. The training samples may be first data, second data, and the like, wherein the first data and the second data may be state information related to a target object (e.g., a robot, a vehicle, and the like), state information related to a vehicle, and the like. After the training samples are collected, the data collection device 560 stores the training samples in the database 530.
The training device 520 may obtain the target model/rule 501 based on the training samples maintained in the database 530 and the neural network to be trained (e.g., the reinforcement learning model in the embodiments of the present application, the target neural network used as the adversarial agent of the reinforcement learning model, etc.).
It should be noted that, in practical applications, the training samples maintained in the database 530 are not necessarily all collected from the data collection device 560, and may be received from other devices. It should be noted that, the training device 520 does not necessarily perform the training of the target model/rule 501 based on the training samples maintained by the database 530, and may also obtain the training samples from the cloud or other places for performing the model training, and the above description should not be taken as a limitation on the embodiment of the present application.
The target model/rule 501 obtained by training according to the training device 520 may be applied to different systems or devices, for example, the executing device 510 shown in fig. 3, where the executing device 510 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an Augmented Reality (AR)/Virtual Reality (VR) device, a vehicle-mounted terminal, or a server.
The target model/rule 501 may be used, among other things, to implement a target task, such as driving control in autonomous driving, attitude control on a robot, and so forth.
Specifically, the training device 520 may pass the trained model to the execution device 510. The performing device 510 may be the target object described above.
In fig. 3, the execution device 510 configures an input/output (I/O) interface 512 for data interaction with an external device, a user may input data to the I/O interface 512 through a client device 540, or the execution device 510 may automatically collect input data.
The pre-processing module 513 and the pre-processing module 514 are configured to perform pre-processing according to input data received by the I/O interface 512. It should be understood that there may be no pre-processing module 513 and pre-processing module 514 or only one pre-processing module. When the pre-processing module 513 and the pre-processing module 514 are not present, the input data may be processed directly using the calculation module 511.
During the process of preprocessing the input data by the execution device 510 or performing the calculation and other related processes by the calculation module 511 of the execution device 510, the execution device 510 may call the data, the code and the like in the data storage system 550 for corresponding processes, or store the data, the instruction and the like obtained by corresponding processes in the data storage system 550.
Finally, the I/O interface 512 provides the processing result to the client device 540, thereby providing it to the user, or performs a control operation based on the processing result.
In the case shown in fig. 3, the user can manually give input data, and this "manually give input data" can be operated through an interface provided by the I/O interface 512. Alternatively, the client device 540 may automatically send the input data to the I/O interface 512, and if the client device 540 is required to automatically send the input data to obtain authorization from the user, the user may set the corresponding permissions in the client device 540. The user can view the results output by the execution device 510 at the client device 540, and the specific presentation form can be display, sound, action, and the like. The client device 540 may also serve as a data collection terminal, collecting input data of the input I/O interface 512 and output results of the output I/O interface 512 as new sample data, as shown, and storing the new sample data in the database 530. Of course, the input data of the input I/O interface 512 and the output result of the output I/O interface 512 may be directly stored in the database 530 as new sample data by the I/O interface 512 without being collected by the client device 540.
It should be noted that fig. 3 is only a schematic diagram of a system architecture provided in the embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 3, the data storage system 550 is an external memory with respect to the execution device 510, and in other cases, the data storage system 550 may be disposed in the execution device 510. It is understood that the execution device 510 described above may be deployed in the client device 540.
From the training side of the model:
in this embodiment, the training device 520 may obtain a code stored in a memory (not shown in fig. 3, and may be integrated with the training device 520 or separately deployed from the training device 520) to implement steps related to model training in this embodiment.
In this embodiment, the training device 520 may include a hardware circuit (e.g., an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA), a general-purpose processor, a Digital Signal Processor (DSP), a microprocessor or a microcontroller, etc.), or a combination of these hardware circuits, for example, the training device 520 may be a hardware system with an instruction execution function, such as a CPU, a DSP, etc., or a hardware system without an instruction execution function, such as an ASIC, an FPGA, etc., or a combination of the above hardware systems without an instruction execution function and a hardware system with an instruction execution function.
It should be understood that the training device 520 may be a combination of a hardware system without a function of executing instructions and a hardware system with a function of executing instructions, and some steps related to model training provided in the embodiments of the present application may also be implemented by a hardware system without a function of executing instructions in the training device 520, which is not limited herein.
Since the embodiments of the present application relate to the application of a large number of neural networks, for the convenience of understanding, the related terms and related concepts such as neural networks related to the embodiments of the present application will be described below.
(1) Neural network
The neural network may be composed of neural units. A neural unit may refer to an operation unit that takes xs (i.e., input data) and an intercept of 1 as inputs, and the output of the operation unit may be:
$h_{W,b}(x) = f(W^{T}x) = f\left(\sum_{s=1}^{n} W_{s} x_{s} + b\right)$
where s = 1, 2, …, n, n is a natural number greater than 1, Ws is the weight of xs, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce a nonlinear characteristic into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract features of the local receptive field, and the local receptive field may be a region composed of several neural units.
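For ease of understanding, a minimal Python sketch of the neural-unit computation described above is given below; the function names and numerical values are illustrative assumptions and are not part of the embodiment.

```python
import numpy as np

def sigmoid(z):
    # activation function f, introducing the nonlinear characteristic
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(xs, ws, b):
    # output = f(sum_s ws[s] * xs[s] + b), as in the formula above
    return sigmoid(np.dot(ws, xs) + b)

# example with n = 3 inputs (arbitrary illustrative values)
xs = np.array([0.5, -1.2, 3.0])
ws = np.array([0.1, 0.4, -0.2])
b = 0.05
print(neural_unit(xs, ws, b))
```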
(2) Deep neural network
Deep Neural Networks (DNNs), also known as multi-layer neural networks, can be understood as neural networks having many hidden layers, where "many" has no particular metric. According to the positions of the different layers, the layers inside a DNN can be divided into three categories: input layer, hidden layers, and output layer. Typically, the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers. The layers are fully connected, that is, any neuron of the i-th layer is necessarily connected to any neuron of the (i+1)-th layer. Although a DNN appears complex, the work of each layer is not complex; it is simply the following linear relational expression:
$\vec{y} = \alpha(W\vec{x} + \vec{b})$

where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, W is the weight matrix (also called coefficients), and α() is the activation function. Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has many layers, there are also many coefficients W and offset vectors $\vec{b}$. These parameters are defined in the DNN as follows, taking the coefficient W as an example: suppose that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^{3}_{24}$, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer. In summary, the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as $W^{L}_{jk}$.
Note that the input layer has no W parameter. In a deep neural network, more hidden layers allow the network to better characterize complex situations in the real world. In theory, a model with more parameters has higher complexity and a larger "capacity", which means it can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and its final objective is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of many layers).
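A minimal sketch of the layer-by-layer computation $\vec{y} = \alpha(W\vec{x} + \vec{b})$ described above is shown below, assuming a small fully connected network; the layer sizes and the choice of ReLU as the activation are illustrative assumptions only.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dnn_forward(x, weights, biases, activation=relu):
    # each layer computes y = alpha(W x + b); weights[l] has shape
    # (n_out, n_in), so weights[l][j, k] plays the role of W^{l+1}_{jk}
    h = x
    for W, b in zip(weights, biases):
        h = activation(W @ h + b)
    return h

# a tiny example network: 4 -> 5 -> 2 (random illustrative parameters)
rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 4)), rng.standard_normal((2, 5))]
biases = [np.zeros(5), np.zeros(2)]
print(dnn_forward(rng.standard_normal(4), weights, biases))
```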
(3) Reinforcement Learning (RL), also known as evaluative learning or augmented learning, is one of the paradigms and methodologies of machine learning. It is used to describe and solve the problem of an agent learning a strategy to maximize its return or achieve a specific goal while interacting with the environment.
A common model for reinforcement learning is the standard Markov Decision Process (MDP). Depending on the given conditions, reinforcement learning can be classified into model-based reinforcement learning (model-based RL) and model-free reinforcement learning (model-free RL), as well as active reinforcement learning (active RL) and passive reinforcement learning (passive RL). Variants of reinforcement learning include inverse reinforcement learning, hierarchical reinforcement learning, and reinforcement learning for partially observable systems. Algorithms used for solving reinforcement learning problems can be classified into policy search algorithms and value function algorithms. Deep learning models can be used within reinforcement learning, forming deep reinforcement learning.
(4) Loss function
In the process of training a deep neural network, the output of the deep neural network is expected to be as close as possible to the value that is really desired to be predicted. Therefore, the predicted value of the current network can be compared with the really desired target value, and the weight vector of each layer of the neural network can then be updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the really desired target value or a value very close to it. Therefore, it is necessary to define in advance how to compare the difference between the predicted value and the target value; this is done by loss functions or objective functions, which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
(5) Back propagation algorithm
A convolutional neural network can use the back propagation (BP) algorithm to correct the values of the parameters in the initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, the input signal is propagated forward until the output produces an error loss, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a backward propagation movement dominated by the error loss, aiming at obtaining the optimal parameters of the super-resolution model, such as the weight matrices.
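The loss-function and back-propagation ideas above can be illustrated with a minimal sketch: a single linear unit with a squared-error loss whose gradients are computed by the chain rule (a one-layer instance of back propagation) and used to update the parameters. The data, learning rate, and number of steps are arbitrary assumptions.

```python
import numpy as np

# squared-error loss for a single linear unit y_hat = w . x + b
def loss(w, b, x, y):
    return 0.5 * (np.dot(w, x) + b - y) ** 2

def backprop_step(w, b, x, y, lr=0.01):
    # forward pass: prediction error
    err = np.dot(w, x) + b - y
    # backward pass: gradients of the loss w.r.t. the parameters (chain rule)
    grad_w = err * x
    grad_b = err
    # gradient-descent update so that the loss decreases
    return w - lr * grad_w, b - lr * grad_b

w, b = np.zeros(3), 0.0
x, y = np.array([1.0, 2.0, -1.0]), 0.7
for _ in range(200):
    w, b = backprop_step(w, b, x, y)
print(loss(w, b, x, y))   # close to 0 after training
```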
(6) Nash equilibrium (Nash equilibrium)

Also known as non-cooperative game equilibrium, it is an important term in game theory. In a game, a strategy that is optimal for a party regardless of which strategies the other parties choose is called a dominant strategy. If each participant's chosen strategy is optimal given the strategies chosen by all the other participants, then this combination of strategies is defined as a Nash equilibrium.

In other words, a combination of strategies is a Nash equilibrium when each player's equilibrium strategy maximizes his expected payoff, given that all the other players keep to their strategies in the combination.
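As a small illustration of the definition above (no player can gain by deviating unilaterally), the following sketch checks whether a given pair of mixed strategies is a Nash equilibrium of a two-player zero-sum game; the matching-pennies payoff matrix and strategies are illustrative assumptions.

```python
import numpy as np

def is_nash_zero_sum(A, p, q, tol=1e-9):
    """Check whether mixed strategies (p, q) form a Nash equilibrium of the
    zero-sum game with payoff matrix A for the row player (-A for the column
    player): neither player may gain by deviating to any pure strategy."""
    v = p @ A @ q                   # expected row-player payoff under (p, q)
    row_best = (A @ q).max()        # best pure-strategy deviation for the row player
    col_best = (p @ A).min()        # column player tries to minimise the row payoff
    return row_best <= v + tol and col_best >= v - tol

# matching pennies: the unique equilibrium is (0.5, 0.5) for both players
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
p = q = np.array([0.5, 0.5])
print(is_nash_zero_sum(A, p, q))    # True
```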
(7) Reinforcement learning model

Reinforcement Learning (RL), also known as evaluative learning or augmented learning, is one of the paradigms and methodologies of machine learning. It is used to describe and solve the problem of an agent learning a strategy to maximize its return or achieve a specific goal while interacting with the environment.

A common model for reinforcement learning is the standard Markov Decision Process (MDP). Depending on the given conditions, reinforcement learning can be classified into model-based reinforcement learning (model-based RL) and model-free reinforcement learning (model-free RL), as well as active reinforcement learning (active RL) and passive reinforcement learning (passive RL). Variants of reinforcement learning include inverse reinforcement learning, hierarchical reinforcement learning, and reinforcement learning for partially observable systems. Algorithms used for solving reinforcement learning problems can be classified into policy search algorithms and value function algorithms. Deep learning models can be used within reinforcement learning, forming deep reinforcement learning.
(8) Intelligent agent
An agent is a concept in the field of artificial intelligence: any independent entity that can think and interact with its environment can be abstracted as an agent. The basic property of an agent is that it can react to changes in the environment and then automatically adjust its own actions and state, and different agents can also interact with other agents according to their respective intentions.
In the application of reinforcement learning algorithms, it is often necessary to interact with an online environment to obtain data and perform training. The general practice is to model the real-world scene and generate a virtual, simulated online environment. In this case, even a slight difference between the training environment and the real environment in which the algorithm is deployed is likely to cause the trained algorithm to fail, resulting in unexpected behavior in the real scene.
The above problem can be alleviated by improving the robustness of the reinforcement learning algorithm. One method is to introduce imagined interference into the virtual environment and train the reinforcement learning algorithm under this interference, which improves the algorithm's ability to cope with interference and enhances its robustness and generalization. That is, for the reinforcement learning model to be trained, a countermeasure agent can be set up; the data output by the countermeasure agent and the data output by the reinforcement learning model jointly drive the execution of a task, and the data output by the countermeasure agent serves as interference with the execution of the target task. Because the difference between the training environment and the deployment environment is unpredictable, existing training methods mainly resist certain specific interference; however, when the changes in the real environment are inconsistent with the hypothesized interference, the effectiveness of the algorithm decreases.
In order to solve the above problem, referring to fig. 4, fig. 4 is a schematic flow chart of a model training method provided in the embodiment of the present application, and as shown in fig. 4, the model training method provided in the embodiment of the present application includes:
401. processing the first data through a first reinforcement learning model to obtain a first processing result; wherein the first data indicates a state of a target object, and the first processing result is used as control information when a target task is executed on the target object.
The execution subject of step 401 may be a training device (for example, the training device may be a terminal device or a server); for details, reference may be made to the description in the foregoing embodiments, which is not repeated here.
In one possible implementation, the training apparatus may acquire an object (first reinforcement learning model) for model training and a training sample (first data).
In one possible implementation, the first data is robot-related status information; the target task is attitude control of the robot, and the first processing result is attitude control information of the robot.
In one possible implementation, the robot-related state information may include, but is not limited to, information related to a position, a speed, a scene in which the robot is located (e.g., obstacle information), and the position, the speed of the robot may include information about states (position, angle, speed, acceleration, etc.) of respective joints.
In one possible implementation, the first reinforcement learning model may obtain posture control information of the robot according to the input data, the posture control information may include control information of each joint of the robot, and a posture control task of the robot may be performed based on the posture control information.
In one possible implementation, the first data is vehicle-related status information; the target task is automatic driving of the vehicle, and the first processing result is driving control information of the vehicle.
In one possible implementation, the vehicle-related state information may include, but is not limited to, position, speed of the vehicle, and scene-related information (e.g., information of a driving surface, obstacle information, pedestrian information, information of surrounding vehicles) in which the vehicle is located.
In one possible implementation, the first reinforcement learning model may obtain driving control information of the vehicle according to the input data, and the driving control information may include information of speed, direction, driving track, and the like of the vehicle.
In one possible implementation, the first reinforcement learning model may be an initialized model or the output of one iteration of a model training process.
In one possible implementation, the first data may be processed by a first reinforcement learning model during a feed-forward process of model training to obtain a first processing result. The first processing result is used as control information when a target task is executed on the target object, for example, the target task is a posture control of the robot, and the first processing result is posture control information of the robot; or, the target task is automatic driving of the vehicle, and the first processing result is driving control information of the vehicle.
Alternatively, in one possible implementation, the first processing result may be a hard constraint imposed on the target object when performing the target task.
It should be understood that the reinforcement learning model in the embodiments of the present application includes, but is not limited to, a deep neural network, a bayesian neural network, and the like.
402. Processing the first data through a first target neural network to obtain a second processing result; the first processing result is used for executing a target task, the second processing result is used as interference when the target task is executed, the first target neural network is selected from a plurality of first neural networks, and each first neural network is an iteration result obtained by a process of performing iterative training on a first initial neural network.
In one possible implementation, the training device may acquire a countering agent to the reinforcement learning model that may output interference information for the target task.
In the embodiment of the present application, on one hand, a plurality of countermeasure agents for outputting interference information can be trained, and the interference information output by different countermeasure agents can apply different types of interference to the target task. On the other hand, when training the countermeasure agents, not only the countermeasure agent obtained by the latest iteration is used to output interference for the target task; the historical training results of the countermeasure agents (the countermeasure agents obtained in the historical iteration process) can also be used to output interference for the target task. In this way, more effective interference for the target task in different scenarios can be obtained, and the training effect and the generalization of the model are improved.
It should be understood that the first target neural network in the embodiments of the present application includes, but is not limited to, a deep neural network, a bayesian neural network, and the like.
In one possible implementation, when determining the countermeasure agent used to output interference information for the first reinforcement learning model, the first target neural network may be selected from the plurality of first neural networks, where each first neural network is an iteration result obtained in the process of iteratively training a first initial neural network.
For example, the neural network 1, the neural network 2, the neural network 3, the neural network 4, the neural network 5, the neural network 6, the neural network 7, the neural network 8, and the neural network 9 may be obtained in the iterative training of the first initial neural network, and when determining the antagonistic agent for outputting the interference information as the first reinforcement learning model, one neural network may be selected from the set [ the neural network 1, the neural network 2, the neural network 3, the neural network 4, the neural network 5, the neural network 6, the neural network 7, the neural network 8, and the neural network 9 ].
In one possible implementation, the selecting the first target neural network from a plurality of first neural networks includes: selecting the first target neural network from the plurality of first neural networks based on the first selection probability corresponding to each of the plurality of first neural networks. That is, each first neural network may be configured with a probability (i.e., the first selection probability described above), and when the first target neural network is selected from the plurality of first neural networks, sampling may be performed based on probability distributions corresponding to the plurality of first neural networks and network selection may be performed based on a result of the sampling.
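A minimal sketch of this selection step is given below, assuming the historical iteration results are kept as a pool of checkpoints that is sampled according to the first selection probabilities; the checkpoint names and probability values are hypothetical.

```python
import numpy as np

def select_target_network(checkpoints, selection_probs, rng=np.random.default_rng()):
    """Sample one historical network (iteration result) from the pool according
    to its selection probability, e.g. a Nash-equilibrium distribution over
    historical countermeasure-agent checkpoints."""
    selection_probs = np.asarray(selection_probs, dtype=float)
    selection_probs = selection_probs / selection_probs.sum()   # normalise
    idx = rng.choice(len(checkpoints), p=selection_probs)
    return checkpoints[idx]

# hypothetical pool of 9 historical checkpoints (neural network 1 ... 9)
checkpoints = [f"neural_network_{i}" for i in range(1, 10)]
probs = [0.05, 0.05, 0.05, 0.10, 0.10, 0.10, 0.15, 0.15, 0.25]
print(select_target_network(checkpoints, probs))
```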
The first selection probability is presented next:
in a possible implementation, the processing result obtained by each first neural network processing data is used as interference when the target task is executed, and the first selection probability is positively correlated with the degree of interference that the processing result output by the corresponding first neural network imposes on the target task. When the reinforcement learning model and the model of the countermeasure agent are updated, a reward value can be obtained. The reward value can represent how well the data output by the reinforcement learning model performs the target task, and can also represent the degree to which the interference information output by the countermeasure agent disturbs the target task. The probability distribution corresponding to the first neural networks can be updated based on the reward value, so that the first selection probability is positively correlated with the degree of interference that the processing result output by the corresponding first neural network imposes on the target task. In this way, on the one hand, a countermeasure agent whose output causes greater interference has a larger probability of being sampled, so it is sampled more easily and the degree of interference imposed on the reinforcement learning model is increased; on the other hand, a countermeasure agent whose output causes less interference may still be sampled although its probability is smaller, which enriches the interference imposed on the reinforcement learning model and improves the generalization of the network.
In one possible implementation, the probability distribution may be a nash-balanced distribution. The probability distribution may be obtained by nash equilibrium calculation based on an incentive value obtained by performing a target task according to data and disturbance information obtained when feedforward is performed according to a reinforcement learning model, and the probability distribution may be updated in an iterative process.
According to the embodiment of the application, the action space of the countermeasure agent is controlled, and the interference strength of the countermeasure agent is changed, so that the reinforcement learning strategy is robust to strong and weak interference. In addition, by introducing a game theory optimization framework, the diversity of the confrontation agents is increased by using a historical strategy, so that the reinforcement learning strategy is more robust to the interference of different strategies.
In one possible implementation, during the feedforward process of model training, the first data may be processed by a first target neural network to obtain a second processing result, and the second processing result is used as interference information when the target task is executed.
For example, in the case of robot control, the second processing result may be a force or moment applied to at least one joint of the robot, and for example, in the case of automatic driving, the second processing result may be an obstacle applied to a road condition of the vehicle or other obstacle information that can affect a driving strategy.
In one possible implementation, to increase the richness of the interference imposed on the reinforcement learning model, a plurality of countermeasure agents may be trained, and for each of the plurality of countermeasure agents, the countermeasure agent that interferes with the reinforcement learning model may be selected from a plurality of iteration results obtained during training.
For example, for the first initial neural network, the neural network A1, the neural network A2, the neural network A3, the neural network A4, the neural network A5, the neural network A6, the neural network A7, the neural network A8, and the neural network A9 may be obtained in the iterative training process of the first initial neural network, and when determining the antagonistic agent for outputting the disturbance information as the first reinforcement learning model, one neural network may be selected from the set [ neural network A1, the neural network A2, the neural network A3, the neural network A4, the neural network A5, the neural network A6, the neural network A7, the neural network A8, and the neural network A9], that is the first target neural network in the above embodiment. For a second initial neural network different from the first initial neural network, the neural network B1, the neural network B2, the neural network B3, the neural network B4, the neural network B5, the neural network B6, the neural network B7, the neural network B8, and the neural network B9 may be obtained in an iterative training process of the second initial neural network, and when determining an antagonistic agent for outputting interference information as the first reinforcement learning model, one neural network, that is, a second target neural network, may be selected from the set [ the neural network B1, the neural network B2, the neural network B3, the neural network B4, the neural network B5, the neural network B6, the neural network B7, the neural network B8, and the neural network B9 ]. The data output by the first target neural network and the second target neural network may be used as the disturbance information applied to the first reinforcement learning model.
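A sketch of this idea is given below, assuming several countermeasure-agent families (e.g., the A1..A9 and B1..B9 pools above), each with its own pool of historical checkpoints and its own selection distribution, from which one checkpoint per family is sampled for a training round; all names and the uniform distributions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical countermeasure-agent families, each with its own pool of
# historical checkpoints and its own selection (e.g. Nash) distribution
adversary_pools = {
    "family_A": ([f"A{i}" for i in range(1, 10)], np.full(9, 1 / 9)),
    "family_B": ([f"B{i}" for i in range(1, 10)], np.full(9, 1 / 9)),
}

def sample_adversaries(pools):
    # pick one historical checkpoint per family; the outputs of all sampled
    # checkpoints are applied jointly as interference on the target task
    chosen = {}
    for name, (checkpoints, probs) in pools.items():
        idx = rng.choice(len(checkpoints), p=probs)
        chosen[name] = checkpoints[idx]
    return chosen

print(sample_adversaries(adversary_pools))
```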
Specifically, in a possible implementation, in a feed-forward process performed according to the second target neural network, the first data may be processed by the second target neural network to obtain a fourth processing result; wherein the fourth processing result is used as interference when the target task is executed, the second target neural network is selected from a plurality of second neural networks, and each second neural network is an iteration result obtained in the process of iteratively training a second initial neural network; the first initial neural network and the second initial neural network are different.
In one possible implementation, the second processing result and the fourth processing result are of different interference types.
For example, the disturbance type may be a category of disturbance applied when performing the target task, such as applying a force, applying a moment, adding an obstacle, changing a road condition, changing weather, and so on.
In one possible implementation, the interfering objects of the second processing result and the fourth processing result are different.
For example, a robot may include multiple joints, and applying forces to different joints, or different groups of joints, may be considered to interfere with differences in the object. That is, the second processing result and the fourth processing result are forces applied to different joints, or different joint groups.
In one possible implementation, the first target neural network is configured to determine the second processing result from a first range of values based on the first data, and the second target neural network is configured to determine the fourth processing result from a second range of values based on the first data, the second range of values being different from the first range of values.
For example, the second processing result and the fourth processing result are forces applied to the robot joint, the maximum value of the magnitude of the force determined by the first target neural network is A1, the maximum value of the magnitude of the force determined by the second target neural network is A2, and A1 and A2 are different.
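A sketch of how two countermeasure-agent families could be given different value ranges is shown below, assuming each network outputs a raw action in [-1, 1] that is scaled by a family-specific maximum magnitude (playing the role of A1 and A2 above); the numbers are illustrative.

```python
import numpy as np

def scale_disturbance(raw_action, max_magnitude):
    """Map a raw network output in [-1, 1] to a disturbance whose magnitude is
    bounded by the family-specific maximum value (the first or second value range)."""
    return np.clip(raw_action, -1.0, 1.0) * max_magnitude

raw = np.array([0.3, -0.9, 1.4])
print(scale_disturbance(raw, max_magnitude=5.0))    # first countermeasure family
print(scale_disturbance(raw, max_magnitude=20.0))   # second countermeasure family
```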
403. And executing the target task according to the first processing result and the second processing result to obtain a third processing result.
The first processing result may be a hard constraint when the target task is executed, that is, the first processing result may be control information that the target object needs to satisfy when the target task is executed, the second processing result may be interference imposed on the target object when the target task is executed, and the third processing result may be a state of the target object when (or after) the target task is executed, and the third processing result may be used to determine the reward value.
It should be understood that the first processing result and the second processing result may be partial data for determining the third processing result, and when other interference information (for example, the fourth processing result described in the above embodiment) may be included in addition to the second processing result, the target task may be executed based on the first processing result, the second processing result, and the other processing results to obtain the third processing result.
404. And updating the first reinforcement learning model according to the third processing result to obtain an updated first reinforcement learning model.
In a possible implementation, when performing model updating, the first reinforcement learning model may be updated according to the third processing result to obtain an updated first reinforcement learning model.
For example, when updating the first reinforcement learning model, the accumulated reward obtained may be maximized, and the update method may adopt a reinforcement learning algorithm for continuous action spaces, optionally the trust region policy optimization (TRPO) algorithm.
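TRPO itself is beyond a short sketch; as a minimal stand-in for "update the model so as to maximize the accumulated reward", the snippet below performs a REINFORCE-style policy-gradient step for a linear softmax policy over a discrete action space (unlike the continuous-action algorithm mentioned above). All shapes, data, and the learning rate are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, lr=0.01):
    """One policy-gradient step that increases the expected accumulated reward;
    episode is a list of (state, action, return_to_go) tuples and theta has
    shape (n_actions, state_dim) for a linear softmax policy."""
    grad = np.zeros_like(theta)
    for s, a, g in episode:
        probs = softmax(theta @ s)
        # grad of log pi(a|s) for a linear softmax policy
        grad_logp = -np.outer(probs, s)
        grad_logp[a] += s
        grad += g * grad_logp
    return theta + lr * grad        # gradient ascent on the return

# toy usage with a 3-dimensional state and 2 actions
theta = np.zeros((2, 3))
episode = [(np.array([1.0, 0.0, 0.5]), 1, 2.0),
           (np.array([0.0, 1.0, 0.2]), 0, 1.0)]
print(reinforce_update(theta, episode))
```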
In one possible implementation, the confrontation tasks required for the training may be selected from a plurality of confrontation tasks (for example, in sequence).
In one possible implementation, a historical countermeasure agent strategy can be selected by sampling from the historical strategy set of the countermeasure agents according to the Nash equilibrium distribution and used to counter the reinforcement learning strategy. In the training environment, the selected countermeasure agent strategy and the current reinforcement learning strategy are deployed, and the required training samples are obtained by sampling. The obtained training samples are used to train the reinforcement learning strategy. That is, the first reinforcement learning model may be updated according to the first processing result, the second processing result, and the fourth processing result to obtain an updated first reinforcement learning model (that is, the target task is executed according to the first processing result, the second processing result, and the fourth processing result to obtain a third processing result, and the first reinforcement learning model is updated according to the third processing result).
In one possible implementation, after one countermeasure agent has been sampled for each countermeasure task, the reinforcement learning strategy and the strategies updated by the countermeasure agents may be added to the Nash equilibrium matrix, and the Nash equilibrium may be calculated to obtain the Nash equilibrium distributions of the reinforcement learning agent and the countermeasure agents (i.e., the first selection probability and the second selection probability introduced later). Specifically, updating the first reinforcement learning model according to the first processing result and the second processing result includes: obtaining a reward value corresponding to the target task according to the first processing result and the second processing result, and updating the first reinforcement learning model according to the reward value. Further, the first selection probability corresponding to the first target neural network may be updated according to the reward value.
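The following is a simplified stand-in (not the Nash-equilibrium computation of the embodiment) for updating the first selection probabilities from reward values, assuming each historical countermeasure checkpoint is scored by the average reward the reinforcement learning model obtained against it, and that a lower reward (stronger interference) yields a higher selection probability.

```python
import numpy as np

def update_adversary_selection_probs(avg_protagonist_reward, temperature=1.0):
    """Give each historical countermeasure checkpoint a selection probability
    that grows as the reinforcement learning model's average reward against it
    shrinks, i.e. the more the checkpoint interferes with the target task, the
    more likely it is to be picked. Softmax scoring is an assumption here."""
    score = -np.asarray(avg_protagonist_reward, dtype=float) / temperature
    score -= score.max()               # numerical stability
    probs = np.exp(score)
    return probs / probs.sum()

# hypothetical average rewards obtained against 5 countermeasure checkpoints
avg_reward = [10.0, 7.5, 9.0, 4.0, 6.0]
print(update_adversary_selection_probs(avg_reward))
```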
In one possible implementation, during the iterative training of the countermeasure agent, the reinforcement learning model participating in the current round of training may also be selected from the historical iteration results of the reinforcement learning model. For example, the selection may be based on probability sampling; for details, reference may be made, by analogy, to the process of sampling the countermeasure agents in the above embodiment.
In one possible implementation, the second data may be processed by a second reinforcement learning model to obtain a fifth processing result; the second reinforcement learning model is selected from a plurality of reinforcement learning models including the updated first reinforcement learning model, and each reinforcement learning model is an iteration result obtained in the process of performing iterative training on the initial reinforcement learning model; the second data indicates a state of a target object, and the fifth processing result is used as control information when the target task is executed on the target object; processing the second data through a third target neural network to obtain a sixth processing result; the third target neural network belongs to the plurality of first neural networks; the sixth processing result is used as interference information when the target task is executed; executing the target task according to the fifth processing result and the sixth processing result to obtain a seventh processing result; and updating the third target neural network according to the seventh processing result to obtain an updated third target neural network.
In one possible implementation, the second reinforcement learning model may be selected from the plurality of reinforcement learning models.
In one possible implementation, the selecting the second reinforcement learning model from the plurality of reinforcement learning models includes: selecting the second reinforcement learning model from a plurality of reinforcement learning models based on the second selection probability corresponding to each reinforcement learning model in the plurality of reinforcement learning models.
In one possible implementation, the second selection probability is positively correlated with the forward execution effect, when the target task is executed, of the processing result output by the corresponding reinforcement learning model. When the reinforcement learning model and the countermeasure agent are updated, a reward value can be obtained. The reward value can represent how well the data output by the reinforcement learning model performs the target task, and the probability distribution corresponding to the reinforcement learning models can be updated based on the reward value, so that the second selection probability is positively correlated with the forward execution effect, when the target task is executed, of the processing result output by the corresponding reinforcement learning model.
In one possible implementation, the historical strategies of the reinforcement learning agent may be sampled from the historical strategy set of the reinforcement learning agent according to the Nash equilibrium distribution and used for updating the countermeasure agent strategy. In the training environment, the selected reinforcement learning strategy and the current countermeasure agent strategy are deployed, and the required training samples are obtained by sampling. The obtained training samples are then used to train the countermeasure agent strategy.
The embodiment of the application provides a model training method, which comprises the following steps: processing the first data through a first reinforcement learning model to obtain a first processing result; wherein the first data indicates a state of a target object, and the first processing result is used as control information when a target task is executed on the target object; processing the first data through a first target neural network to obtain a second processing result; the second processing result is used as interference information when the target task is executed, the first target neural network is selected from a plurality of first neural networks, and each first neural network is an iteration result obtained by an iterative training process of a first initial neural network; executing the target task according to the first processing result and the second processing result to obtain a third processing result; and updating the first reinforcement learning model according to the third processing result to obtain an updated first reinforcement learning model. Through the mode, when the confrontation intelligent agent is trained, the confrontation intelligent agent obtained by the current latest iteration is not only used for outputting the interference aiming at the target task, but also the historical training result of the confrontation intelligent agent in history (the confrontation intelligent agent obtained in the historical iteration process) can be used for outputting the interference aiming at the target task, so that more effective interference aiming at the target task under different scenes can be obtained, and the training effect and the generalization of the model are improved.
Next, taking a target object as a robot and a target task as a robot control as an example, a software architecture of the embodiment of the present application is described:
referring to fig. 5, fig. 5 shows a robot control system, and as shown in fig. 5, the robot control system may include: the robot comprises a state sensing and processing module, a robustness decision-making module and a robot control module.
Regarding the state sensing and processing module: the function of this module is to sense information of the robot (for example, information describing the state of the target object, such as the first data and the second data introduced in the above embodiments). Specifically, the module integrates the information transmitted by each sensor and determines the robot's own state, which includes basic information of the robot (position, speed) and the state of each joint (position, angle, speed, acceleration), and transmits this information to the decision module.
Regarding the robustness decision module: the function of this module is to output upper layer behavior decisions (e.g. control information in the above described embodiments when performing a target task on the target object) for a period of time in the future, based on the current robot state and the task being performed. Specifically, the module may output a behavior decision for a period of time in the future by using the method corresponding to fig. 4 according to the current state of the robot output by the state sensing and processing module, and transmit the behavior decision to the robot control module.
Regarding the robot control module: the module executes the behavior output by the robustness decision module by controlling the joints of the robot, and controls the robot to move.
Specifically, referring to fig. 6, fig. 6 is a flow chart illustrating the application of the model training method in the embodiment of the present application to a robot control simulation scenario. Using the model training method in the embodiment of the present application, through a multi-task framework and game-theoretic optimization, the robot finally outputs the behavior decision that maximizes its forward speed, thereby obtaining more rewards. The implementation method is described in detail below.
S1, input the multi-task learning parameters Φ = [φ_1, φ_2, …], initialize the reinforcement learning strategy π, and initialize a countermeasure agent strategy μ_i for each task i, selecting the i-th parameter in Φ as the action-space parameter of the countermeasure agent strategy μ_i for the simulated robot. A plurality of tasks are thereby constructed, and in each task the countermeasure agent can apply a disturbing force to the body of the simulated robot. The initial Nash equilibrium distribution may be a uniform distribution.
S2, according to Φ, sequentially select the corresponding countermeasure agent as the current countermeasure task.
S3, according to the Nash equilibrium distribution of the countermeasure agents, select by sampling a historical countermeasure agent strategy μ_{i,t} as the adversary for the current countermeasure task, and deploy it to the training environment.
S4, according to the reinforcement learning strategy π and the countermeasure agent strategy μ_{i,t}, control the robot to sample in the training environment and obtain M samples (s, a_pro, a_adv, s′, r), where s is the state, s′ is the next state, r is the reward, and a_pro and a_adv are the behavior output by the reinforcement learning strategy and the behavior of the countermeasure agent, respectively.
S5, updating the reinforcement learning strategy pi, wherein the updated objective function is as follows:
$\max_{\pi}\ \mathbb{E}\left[\sum_{t} r(s_t, a_t^{pro}, a_t^{adv})\right]$
that is, to maximize the accumulated reward obtained. The update method may adopt a reinforcement learning algorithm for continuous action spaces, optionally the trust region policy optimization (TRPO) algorithm.
S6, according to the Nash equilibrium distribution of the historical reinforcement learning strategies, select by sampling a historical reinforcement learning strategy π_{i,t} as the reinforcement learning strategy to be disturbed by the current countermeasure agent strategy, and deploy it to the training environment.
S7, according to the reinforcement learning strategy π_{i,t} and the countermeasure agent strategy μ_i, control the robot to sample in the training environment and obtain M samples (s, a_pro, a_adv, s′, r).
S8, update the countermeasure agent strategy μ_i, where the objective function of the update is:

$\min_{\mu_i}\ \mathbb{E}\left[\sum_{t} r(s_t, a_t^{pro}, a_t^{adv})\right]$

that is, to minimize the accumulated reward obtained by the reinforcement learning strategy and prevent the reinforcement learning agent from achieving its goal. The update method may adopt a reinforcement learning algorithm for continuous action spaces, optionally the trust region policy optimization (TRPO) algorithm.
S9, every k steps, for each task, add the updated reinforcement learning strategy and countermeasure agent strategy to the Nash equilibrium matrix, obtain the value matrix of the newly added strategies by traversing and testing the performance of the newly added strategies against the existing historical strategies in the training environment, and calculate the Nash equilibrium from the value matrix to obtain the Nash equilibrium distributions of the reinforcement learning agent and the countermeasure agents.
Then judge whether the current task is finished; if not, return to step S2, otherwise execute step S10.
And S10, deploying the reinforcement learning strategy obtained by training to a test environment different from the training environment, and testing robustness.
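Putting steps S1 to S10 together, the following skeleton shows one possible shape of the training loop; every function here is a stub standing in for the real environment, policies, learners, and Nash-equilibrium computation, so this is a structural sketch under stated assumptions rather than an implementation of the embodiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# ---- stubs standing in for the real policies, environment and learners ----
def init_policy():
    return {"theta": np.zeros(4)}                   # placeholder parameters

def rollout(task_bound, protagonist, adversary, n_samples=8):
    # stands in for S4/S7: both policies act in the training environment
    return [("s", "a_pro", "a_adv", "s_next", rng.normal()) for _ in range(n_samples)]

def rl_update(policy, samples):                     # S5 stand-in: maximise the return
    return policy

def adversary_update(policy, samples):              # S8 stand-in: minimise the protagonist return
    return policy

def nash_distribution(payoffs):
    # S9 stand-in: here simply a softmax over how harmful each checkpoint was
    z = -np.asarray(payoffs); z -= z.max()
    p = np.exp(z)
    return p / p.sum()

# ---- S1: multi-task parameters, one countermeasure agent per task ----
phi = [5.0, 10.0, 20.0]                             # hypothetical action-space bounds
protagonist = init_policy()
adversaries = {i: [init_policy()] for i in range(len(phi))}   # checkpoint pools
adv_probs = {i: np.array([1.0]) for i in range(len(phi))}     # initially uniform

for outer_iter in range(3):
    for i, bound in enumerate(phi):                 # S2: pick the current task
        # S3: sample a historical countermeasure checkpoint from its distribution
        adv = adversaries[i][rng.choice(len(adversaries[i]), p=adv_probs[i])]
        # S4-S5: sample with the current protagonist and update it
        protagonist = rl_update(protagonist, rollout(bound, protagonist, adv))
        # S6-S8: sample again and update the current countermeasure agent
        new_adv = adversary_update(adv, rollout(bound, protagonist, adv))
        adversaries[i].append(new_adv)              # S9: grow the pool and refresh
        payoffs = [rng.normal() for _ in adversaries[i]]   # placeholder evaluation
        adv_probs[i] = nash_distribution(payoffs)
# S10: the trained protagonist policy would now be deployed to a different
# test environment for robustness testing
```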
The embodiment described above adopts a robust reinforcement learning control framework based on multi-task learning and game theory, and improves the robustness of the reinforcement learning algorithm by changing the action space of the confrontation agent to construct a plurality of confrontation tasks. In addition, an optimization framework based on a game theory is introduced, and the most appropriate countermeasure strategy is selected according to the historical strategy performance in the training process of each task, so that the reinforcement learning strategy is more robust.
It should be understood that the game theory optimization framework in the embodiment of the present application includes, but is not limited to, policy-space response oracles (PSRO) and the like; training of the reinforcement learning model includes, but is not limited to, reinforcement learning algorithms such as trust region policy optimization (TRPO) and proximal policy optimization (PPO).
The application provides a model training method, which comprises the following steps: processing the first data through a first reinforcement learning model to obtain a first processing result; wherein the first data indicates a state of a target object, and the first processing result is used as control information when a target task is executed on the target object; processing the first data through a first target neural network to obtain a second processing result; the second processing result is used as interference information when the target task is executed, the first target neural network is selected from a plurality of first neural networks, and each first neural network is an iteration result obtained by an iterative training process of a first initial neural network; executing the target task according to the first processing result and the second processing result to obtain a third processing result; and updating the first reinforcement learning model according to the third processing result to obtain an updated first reinforcement learning model. In the embodiment of the present application, on one hand, a plurality of countermeasures agents for outputting interference information can be trained, and interference information output by different countermeasures agents can perform different types of interference for a target task, and on the other hand, when training the countermeasures agents, not only the countermeasures agent obtained by the current latest iteration is used to output the interference for the target task, but also historical training results (the countermeasures agents obtained in the historical iteration process) of the countermeasures agents in history are used to output the interference for the target task, so that more effective interference for the target task under different scenes can be obtained, and the training effect and the generalization of a model are improved.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a model training apparatus provided in an embodiment of the present application, and as shown in fig. 7, the apparatus 700 includes:
a data processing module 701, configured to process the first data through the first reinforcement learning model to obtain a first processing result; wherein the first data indicates a state of a target object, and the first processing result is used as control information when a target task is executed on the target object;
processing the first data through a first target neural network to obtain a second processing result; the second processing result is used as interference information when the target task is executed, the first target neural network is selected from a plurality of first neural networks, and each first neural network is an iteration result obtained by an iterative training process of a first initial neural network;
executing the target task according to the first processing result and the second processing result to obtain a third processing result;
for a detailed description of the data processing module 701, reference may be made to the descriptions of step 401, step 402, and step 403 in the foregoing embodiments, which are not described herein again.
A model updating module 702, configured to update the first reinforcement learning model according to the third processing result, so as to obtain an updated first reinforcement learning model.
For a detailed description of the model updating module 702, reference may be made to the description of step 404 in the foregoing embodiment, which is not described herein again.
In the embodiment of the present application, on one hand, a plurality of countermeasures agents for outputting interference information can be trained, and interference information output by different countermeasures agents can perform different types of interference for a target task, and on the other hand, when training the countermeasures agents, not only the countermeasures agent obtained by the current latest iteration is used to output the interference for the target task, but also historical training results (the countermeasures agents obtained in the historical iteration process) of the countermeasures agents in history are used to output the interference for the target task, so that more effective interference for the target task under different scenes can be obtained, and the training effect and the generalization of a model are improved.
In one possible implementation of the method of the invention,
the target object is a robot; the target task is attitude control of the robot, and the first processing result is attitude control information of the robot; or,
the target object is a vehicle; the target task is automatic driving of the vehicle, and the first processing result is driving control information of the vehicle.
In one possible implementation, the first target neural network is selected from a plurality of first neural networks based on a first selection probability corresponding to each of the plurality of first neural networks.
In a possible implementation, the model updating module is specifically configured to:
obtaining an award value corresponding to the target task according to the third processing result;
updating the first reinforcement learning model according to the reward value;
the model update module is further configured to:
and updating the first selection probability corresponding to the first target neural network according to the reward value.
In one possible implementation, the data processing module is further configured to:
processing the first data through a second target neural network to obtain a fourth processing result; the fourth processing result is used as interference information when the target task is executed, the second target neural network is selected from a plurality of second neural networks, and each second neural network is an iteration result obtained by an iterative training process of a second initial neural network; the first initial neural network and the second initial neural network are different;
the data processing module is specifically configured to:
and executing the target task according to the first processing result, the fourth processing result and the second processing result to obtain a third processing result.
In one possible implementation, the second processing result and the fourth processing result are of different interference types; or,
the interference objects of the second processing result and the fourth processing result are different; or,
the first target neural network is configured to determine the second processing result from a first range of values based on the first data, and the second target neural network is configured to determine the fourth processing result from a second range of values based on the first data, the second range of values being different from the first range of values.
In one possible implementation, the data processing module is further configured to:
processing the second data through a second reinforcement learning model to obtain a fifth processing result; the second reinforcement learning model is selected from a plurality of reinforcement learning models including the updated first reinforcement learning model, and each reinforcement learning model is an iteration result obtained in the process of performing iterative training on the initial reinforcement learning model; the second data indicates a state of a target object, and the fifth processing result is used as control information when the target task is executed on the target object;
processing the second data through a third target neural network to obtain a sixth processing result; the third target neural network belongs to the plurality of first neural networks; the sixth processing result is used as interference information when the target task is executed;
executing the target task according to the fifth processing result and the sixth processing result to obtain a seventh processing result;
the model update module is further configured to:
and updating the third target neural network according to the seventh processing result to obtain an updated third target neural network.
In one possible implementation, the second reinforcement learning model is selected from a plurality of reinforcement learning models based on a second selection probability corresponding to each of the plurality of reinforcement learning models.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an execution device provided in the embodiment of the present application, and the execution device 800 may be embodied as a mobile phone, a tablet, a notebook computer, an intelligent wearable device, and the like, which is not limited herein. Specifically, the execution apparatus 800 includes: a receiver 801, a transmitter 802, a processor 803, and a memory 804 (wherein the number of processors 803 in the execution device 800 may be one or more, and one processor is taken as an example in fig. 8), wherein the processor 803 may include an application processor 8031 and a communication processor 8032. In some embodiments of the present application, the receiver 801, the transmitter 802, the processor 803, and the memory 804 may be connected by a bus or other means.
The memory 804 may include a read-only memory and a random access memory, and provides instructions and data to the processor 803. A portion of the memory 804 may also include non-volatile random access memory (NVRAM). The memory 804 stores the processor and operating instructions, executable modules or data structures, or a subset or expanded set thereof, wherein the operating instructions may include various operating instructions for performing various operations.
The processor 803 controls the operation of the execution apparatus. In a particular application, the various components of the execution device are coupled together by a bus system that may include a power bus, a control bus, a status signal bus, etc., in addition to a data bus. For clarity of illustration, the various buses are referred to in the figures as a bus system.
The method disclosed in the embodiments of the present application can be applied to the processor 803, or implemented by the processor 803. The processor 803 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 803. The processor 803 may be a general-purpose processor, a Digital Signal Processor (DSP), a microprocessor or a microcontroller, and may further include an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components. The processor 803 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 804, and the processor 803 reads the information in the memory 804 to complete the steps of the method in combination with the hardware thereof.
Receiver 801 may be used to receive input numeric or character information and generate signal inputs related to performing device related settings and function control. Transmitter 802 may be used to output numeric or character information; the transmitter 802 may also be used to send instructions to the disk groups to modify the data in the disk groups.
In one embodiment of the present application, the processor 803 is configured to execute the steps of the model obtained by the model training method in the embodiment corresponding to fig. 4.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a server provided in the embodiment of the present application. Specifically, the server 900 is implemented by one or more servers. The server 900 may vary greatly due to different configurations or performance, and may include one or more central processing units (CPUs) 99 (e.g., one or more processors), a memory 932, and one or more storage media 930 (e.g., one or more mass storage devices) storing applications 942 or data 944. The memory 932 and the storage media 930 may be transient storage or persistent storage. The program stored on the storage medium 930 may include one or more modules (not shown), and each module may include a series of instruction operations on the server. Furthermore, the central processing unit 99 may be configured to communicate with the storage medium 930 and execute, on the server 900, the series of instruction operations in the storage medium 930.
The server 900 may also include one or more power supplies 99, one or more wired or wireless network interfaces 950, one or more input-output interfaces 958; or, one or more operating systems 941, such as Windows Server, mac OS XTM, unixTM, linuxTM, freeBSDTM, etc.
In the embodiment of the present application, the central processing unit 99 is configured to execute the steps of the model training method in the embodiment corresponding to fig. 4.
Also provided in embodiments of the present application is a computer program product comprising computer readable instructions, which when run on a computer, cause the computer to perform the steps as performed by the aforementioned execution apparatus, or cause the computer to perform the steps as performed by the aforementioned training apparatus.
In an embodiment of the present application, a computer-readable storage medium is further provided, where a program for signal processing is stored, and when the program runs on a computer, the program causes the computer to execute the steps performed by the foregoing execution device, or causes the computer to execute the steps performed by the foregoing training device.
The execution device, the training device, or the terminal device provided in the embodiment of the present application may specifically be a chip, where the chip includes: a processing unit, which may be for example a processor, and a communication unit, which may be for example an input/output interface, a pin or a circuit, etc. The processing unit may execute the computer execution instructions stored in the storage unit to enable the chip in the execution device to execute the model training method described in the above embodiment, or to enable the chip in the training device to execute the steps related to the model training in the above embodiment. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, and the like, and the storage unit may also be a storage unit located outside the chip in the radio access device, such as a read-only memory (ROM) or another type of static storage device that may store static information and instructions, a Random Access Memory (RAM), and the like.
Specifically, referring to fig. 10, fig. 10 is a schematic structural diagram of a chip provided in an embodiment of the present application. The chip may be implemented as a neural network processor NPU 1000. The NPU 1000 is mounted on a host CPU as a coprocessor, and the host CPU allocates tasks to it. The core part of the NPU is the arithmetic circuit 1003, and the controller 1004 controls the arithmetic circuit 1003 to extract matrix data from the memory and perform multiplication.
In some implementations, the arithmetic circuit 1003 internally includes a plurality of processing units (PEs). In some implementations, the arithmetic circuit 1003 is a two-dimensional systolic array. The arithmetic circuit 1003 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1003 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 1002 and buffers the data in each PE of the arithmetic circuit. The arithmetic circuit then fetches the data of the matrix A from the input memory 1001, performs a matrix operation with the matrix B, and stores a partial result or a final result of the obtained matrix in the accumulator 1008.
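Purely as a software illustration of the computation described above (a NumPy sketch; the matrix sizes and the tile width are arbitrary assumptions, and no claim is made about the hardware dataflow of the arithmetic circuit):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 16))   # input matrix A (read from the input memory)
B = rng.normal(size=(16, 8))   # weight matrix B (read from the weight memory)

# Accumulate partial products tile by tile, the way partial results are gathered
# in an accumulator before the final output matrix C is available.
tile = 4
C = np.zeros((8, 8))
for k in range(0, A.shape[1], tile):
    C += A[:, k:k + tile] @ B[k:k + tile, :]   # partial result added to the accumulator

assert np.allclose(C, A @ B)                   # final result equals the full product
```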
The unified memory 1006 is used to store input data and output data. The weight data is directly transferred to the weight memory 1002 through the direct memory access controller (DMAC) 1005. The input data is also transferred to the unified memory 1006 through the DMAC.
The bus interface unit (BIU) 1010 is used for interaction among the AXI bus, the DMAC, and the instruction fetch buffer (IFB) 1009.
The bus interface unit 1010 is used by the instruction fetch buffer 1009 to obtain instructions from an external memory, and is further used by the memory unit access controller 1005 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1006 or to transfer weight data into the weight memory 1002 or to transfer input data into the input memory 1001.
The vector calculation unit 1007 includes a plurality of operation processing units, and, if necessary, further processes the output of the arithmetic circuit, for example, by vector multiplication, vector addition, exponential operation, logarithmic operation, or magnitude comparison. It is mainly used for non-convolution/fully-connected layer computation in a neural network, such as batch normalization, pixel-level summation, and up-sampling of a feature plane.
In some implementations, the vector calculation unit 1007 can store a processed output vector in the unified memory 1006. For example, the vector calculation unit 1007 may apply a linear function or a non-linear function to the output of the arithmetic circuit 1003, for example, perform linear interpolation on a feature plane extracted by a convolutional layer, or apply a non-linear function to a vector of accumulated values to generate an activation value. In some implementations, the vector calculation unit 1007 generates a normalized value, a pixel-level summed value, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 1003, for example, for use in a subsequent layer of the neural network.
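As a rough software analogue of such post-processing (a sketch only; the ReLU activation, the normalization constants, the 2x2 pooling, and the nearest-neighbour up-sampling are example choices, not the fixed function set of the vector calculation unit):

```python
import numpy as np

rng = np.random.default_rng(0)
acc = rng.normal(size=(8, 8))                  # accumulated output of the arithmetic circuit

activated = np.maximum(acc, 0.0)               # non-linear function applied to the accumulated values
normalized = (activated - activated.mean()) / (activated.std() + 1e-6)   # batch-normalization-like step
pooled = activated.reshape(4, 2, 4, 2).mean(axis=(1, 3))                 # pixel-level 2x2 pooling/summation
upsampled = np.repeat(np.repeat(pooled, 2, axis=0), 2, axis=1)           # up-sampling of the feature plane
```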
The instruction fetch buffer 1009 is connected to the controller 1004 and is used to store instructions used by the controller 1004.
The unified memory 1006, the input memory 1001, the weight memory 1002, and the instruction fetch buffer 1009 are all on-chip memories. The external memory is private to the NPU hardware architecture.
The processor mentioned in any of the above may be a general purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above programs.
It should be noted that the above-described embodiments of the apparatus are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiments of the apparatus provided in the present application, the connection relationship between the modules indicates that there is a communication connection therebetween, and may be implemented as one or more communication buses or signal lines.
Through the above description of the embodiments, a person skilled in the art can clearly understand that the present application may be implemented by software plus necessary general-purpose hardware, and certainly may also be implemented by dedicated hardware, including a dedicated integrated circuit, a dedicated CPU, a dedicated memory, dedicated components, and the like. Generally, any function performed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structures used to implement the same function may be various, such as an analog circuit, a digital circuit, or a dedicated circuit. However, for the present application, an implementation by a software program is usually preferable. Based on such an understanding, the technical solutions of the present application may essentially be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a training device, or a network device) to perform the method described in the embodiments of the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (e.g., infrared, radio, or microwave) manner. The computer-readable storage medium may be any available medium that can be stored by a computer, or a data storage device, such as a training device or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)).

Claims (21)

1. A method of model training, the method comprising:
processing the first data through a first reinforcement learning model to obtain a first processing result; wherein the first data indicates a state of a target object, and the first processing result is used as control information when a target task is executed on the target object;
processing the first data through a first target neural network to obtain a second processing result; the second processing result is used as interference information when the target task is executed, the first target neural network is selected from a plurality of first neural networks, and each first neural network is an iteration result obtained by an iterative training process of a first initial neural network;
executing the target task according to the first processing result and the second processing result to obtain a third processing result;
and updating the first reinforcement learning model according to the third processing result to obtain an updated first reinforcement learning model.
2. The method of claim 1,
the target object is a robot; the target task is attitude control of the robot, and the first processing result is attitude control information of the robot; or,
the target object is a vehicle; the target task is automatic driving of the vehicle, and the first processing result is driving control information of the vehicle.
3. The method according to claim 1 or 2,
the first target neural network is selected from the plurality of first neural networks based on the first selection probability corresponding to each of the plurality of first neural networks.
4. The method of claim 3, wherein each first neural network processes data to obtain a processing result used as interference when the target task is executed, and the first selection probability is positively correlated with the degree of interference that the processing result output by the corresponding first neural network causes to the target task.
5. The method according to claim 3 or 4, wherein the updating the first reinforcement learning model according to the third processing result comprises:
obtaining an award value corresponding to the target task according to the third processing result;
updating the first reinforcement learning model according to the reward value;
the method further comprises the following steps:
and updating the first selection probability corresponding to the first target neural network according to the reward value.
6. The method of any of claims 1 to 5, further comprising:
processing the first data through a second target neural network to obtain a fourth processing result; the fourth processing result is used as interference information when the target task is executed, the second target neural network is selected from a plurality of second neural networks, and each second neural network is an iteration result obtained by an iterative training process of a second initial neural network; the first initial neural network and the second initial neural network are different;
executing the target task according to the first processing result and the second processing result to obtain a third processing result, including:
and executing the target task according to the first processing result, the fourth processing result and the second processing result to obtain a third processing result.
7. The method of claim 6,
the second processing result and the fourth processing result have different interference types; or,
the interference objects of the second processing result and the fourth processing result are different; or,
the first target neural network is configured to determine the second processing result from a first range of values based on the first data, and the second target neural network is configured to determine the fourth processing result from a second range of values based on the first data, the second range of values being different from the first range of values.
8. The method of any of claims 1 to 7, further comprising:
processing the second data through a second reinforcement learning model to obtain a fifth processing result; the second reinforcement learning model is selected from a plurality of reinforcement learning models including the updated first reinforcement learning model, and each reinforcement learning model is an iteration result obtained in the process of performing iterative training on the initial reinforcement learning model; the second data indicates a state of a target object, and the fifth processing result is used as control information when the target task is executed on the target object;
processing the second data through a third target neural network to obtain a sixth processing result; the third target neural network belongs to the plurality of first neural networks; the sixth processing result is used as interference information when the target task is executed;
executing the target task according to the fifth processing result and the sixth processing result to obtain a seventh processing result;
and updating the third target neural network according to the seventh processing result to obtain an updated third target neural network.
9. The method of claim 8, wherein the second reinforcement learning model is selected from a plurality of reinforcement learning models based on a second selection probability corresponding to each reinforcement learning model in the plurality of reinforcement learning models.
10. A model training apparatus, the apparatus comprising:
the data processing module is used for processing the first data through the first reinforcement learning model to obtain a first processing result; wherein the first data indicates a state of a target object, and the first processing result is used as control information when a target task is executed on the target object;
processing the first data through a first target neural network to obtain a second processing result; the second processing result is used as interference information when the target task is executed, the first target neural network is selected from a plurality of first neural networks, and each first neural network is an iteration result obtained by an iterative training process of a first initial neural network;
executing the target task according to the first processing result and the second processing result to obtain a third processing result;
and the model updating module is used for updating the first reinforcement learning model according to the third processing result so as to obtain the updated first reinforcement learning model.
11. The apparatus of claim 10,
the target object is a robot; the target task is attitude control of the robot, and the first processing result is attitude control information of the robot; or,
the target object is a vehicle; the target task is automatic driving of the vehicle, and the first processing result is driving control information of the vehicle.
12. The apparatus of claim 10 or 11,
the first target neural network is selected from the plurality of first neural networks based on the first selection probability corresponding to each of the plurality of first neural networks.
13. The apparatus of claim 12, wherein each first neural network processes data to obtain a processing result used as interference when the target task is executed, and the first selection probability is positively correlated with the degree of interference that the processing result output by the corresponding first neural network causes to the target task.
14. The apparatus according to claim 12 or 13, wherein the model update module is specifically configured to:
obtaining an award value corresponding to the target task according to the third processing result;
updating the first reinforcement learning model according to the reward value;
the model update module is further configured to:
and updating the first selection probability corresponding to the first target neural network according to the reward value.
15. The apparatus according to any one of claims 10 to 14, wherein the data processing module is further configured to:
processing the first data through a second target neural network to obtain a fourth processing result; the fourth processing result is used as interference information when the target task is executed, the second target neural network is selected from a plurality of second neural networks, and each second neural network is an iteration result obtained by performing an iterative training process on a second initial neural network; the first initial neural network and the second initial neural network are different;
the data processing module is specifically configured to:
and executing the target task according to the first processing result, the fourth processing result and the second processing result to obtain a third processing result.
16. The apparatus of claim 15,
the second processing result and the fourth processing result have different interference types; or,
the interference objects of the second processing result and the fourth processing result are different; or,
the first target neural network is configured to determine the second processing result from a first range of values based on the first data, and the second target neural network is configured to determine the fourth processing result from a second range of values based on the first data, the second range of values being different from the first range of values.
17. The apparatus according to any one of claims 10 to 16, wherein the data processing module is further configured to:
processing the second data through a second reinforcement learning model to obtain a fifth processing result; the second reinforcement learning model is selected from a plurality of reinforcement learning models including the updated first reinforcement learning model, and each reinforcement learning model is an iteration result obtained in the process of performing iterative training on the initial reinforcement learning model; the second data indicates a state of a target object, and the fifth processing result is used as control information when the target task is executed on the target object;
processing the second data through a third target neural network to obtain a sixth processing result; the third target neural network belongs to the plurality of first neural networks; the sixth processing result is used as interference information when the target task is executed;
executing the target task according to the fifth processing result and the sixth processing result to obtain a seventh processing result;
the model update module is further configured to:
and updating the third target neural network according to the seventh processing result to obtain an updated third target neural network.
18. The apparatus of claim 17, wherein the second reinforcement learning model is selected from a plurality of reinforcement learning models based on a second selection probability corresponding to each reinforcement learning model in the plurality of reinforcement learning models.
19. A model training apparatus, the apparatus comprising a memory and a processor; the memory stores code, and the processor is configured to retrieve the code and perform the method of any of claims 1 to 9.
20. A computer readable storage medium comprising computer readable instructions which, when run on a computer device, cause the computer device to perform the method of any of claims 1 to 9.
21. A computer program product comprising computer readable instructions which, when run on a computer device, cause the computer device to perform the method of any one of claims 1 to 9.
CN202210705971.0A 2022-06-21 2022-06-21 Model training method and related equipment Pending CN115293227A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210705971.0A CN115293227A (en) 2022-06-21 2022-06-21 Model training method and related equipment
PCT/CN2023/101527 WO2023246819A1 (en) 2022-06-21 2023-06-20 Model training method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210705971.0A CN115293227A (en) 2022-06-21 2022-06-21 Model training method and related equipment

Publications (1)

Publication Number Publication Date
CN115293227A true CN115293227A (en) 2022-11-04

Family

ID=83821246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210705971.0A Pending CN115293227A (en) 2022-06-21 2022-06-21 Model training method and related equipment

Country Status (2)

Country Link
CN (1) CN115293227A (en)
WO (1) WO2023246819A1 (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6119112A (en) * 1997-11-19 2000-09-12 International Business Machines Corporation Optimum cessation of training in neural networks
CN113011575A (en) * 2019-12-19 2021-06-22 华为技术有限公司 Neural network model updating method, image processing method and device
CN114565092A (en) * 2020-11-13 2022-05-31 华为技术有限公司 Neural network structure determining method and device
CN113919482A (en) * 2021-09-22 2022-01-11 上海浦东发展银行股份有限公司 Intelligent agent training method and device, computer equipment and storage medium
CN113988196A (en) * 2021-11-01 2022-01-28 乐聚(深圳)机器人技术有限公司 Robot moving method, device, equipment and storage medium
CN115293227A (en) * 2022-06-21 2022-11-04 华为技术有限公司 Model training method and related equipment

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023246819A1 (en) * 2022-06-21 2023-12-28 华为技术有限公司 Model training method and related device
CN116330290A (en) * 2023-04-10 2023-06-27 大连理工大学 Multi-agent deep reinforcement learning-based five-finger smart robot control method
CN116330290B (en) * 2023-04-10 2023-08-18 大连理工大学 Multi-agent deep reinforcement learning-based five-finger smart robot control method
CN116996403A (en) * 2023-09-26 2023-11-03 深圳市乙辰科技股份有限公司 Network traffic diagnosis method and system applying AI model
CN116996403B (en) * 2023-09-26 2023-12-15 深圳市乙辰科技股份有限公司 Network traffic diagnosis method and system applying AI model

Also Published As

Publication number Publication date
WO2023246819A1 (en) 2023-12-28

Similar Documents

Publication Publication Date Title
WO2022068623A1 (en) Model training method and related device
CN115293227A (en) Model training method and related equipment
CN112580369B (en) Sentence repeating method, method and device for training sentence repeating model
CN112651511A (en) Model training method, data processing method and device
CN113065636A (en) Pruning processing method, data processing method and equipment for convolutional neural network
CN113240079A (en) Model training method and device
CN111612215A (en) Method for training time sequence prediction model, time sequence prediction method and device
CN114997412A (en) Recommendation method, training method and device
WO2023274052A1 (en) Image classification method and related device thereof
CN113065633A (en) Model training method and associated equipment
CN115238909A (en) Data value evaluation method based on federal learning and related equipment thereof
CN113627163A (en) Attention model, feature extraction method and related device
CN114140841A (en) Point cloud data processing method, neural network training method and related equipment
CN111160049B (en) Text translation method, apparatus, machine translation system, and storage medium
CN111652349A (en) Neural network processing method and related equipment
CN115048560A (en) Data processing method and related device
CN114169393A (en) Image classification method and related equipment thereof
CN113407820B (en) Method for processing data by using model, related system and storage medium
WO2023045949A1 (en) Model training method and related device
CN116739154A (en) Fault prediction method and related equipment thereof
CN115795025A (en) Abstract generation method and related equipment thereof
CN116109449A (en) Data processing method and related equipment
CN114707070A (en) User behavior prediction method and related equipment thereof
CN114707643A (en) Model segmentation method and related equipment thereof
CN115866291A (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination