CN110738221B - Computing system and method - Google Patents

Computing system and method

Info

Publication number
CN110738221B
CN110738221B
Authority
CN
China
Prior art keywords
vector
action
current
feature
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810789039.4A
Other languages
Chinese (zh)
Other versions
CN110738221A (en)
Inventor
费旭东
邹斯骋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201810789039.4A
Publication of CN110738221A
Application granted
Publication of CN110738221B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Feedback Control In General (AREA)

Abstract

An embodiment of the application relates to a computing system and a computing method. The computing system comprises two components: a feature extraction unit that recognizes the environment and effectively extracts its conceptual features, and an action generation unit that interacts with the environment. During computation the action generation unit inherits and uses the feature extraction unit's ability to acquire concepts, and that ability can in turn be influenced by the action generation unit. With this implementation, recognition of the environment is combined with machine learning through interaction with the environment, so that the environment is better understood and the optimal action can be selected based on that understanding.

Description

Computing system and method
Technical Field
The embodiments of the application relate to the field of machine learning, and in particular to a computing system and a computing method.
Background
Deep learning has been applied very successfully and is developing rapidly. The main directions include back-propagation (BP) based learning, unsupervised learning, weakly supervised learning, and so on.
With BP, any complex mapping function defined by a set of parameters can be obtained automatically from samples by a learning operation, provided that there are enough labeled samples. This approach has succeeded on classical artificial-intelligence problems long considered very difficult, such as speech recognition and image classification, and has therefore fueled the broad enthusiasm for the technology, its applications, and investment in recent years.
However, this method requires a large number of manually labeled data samples, which is not only costly; the limitations of manual labeling also restrict the adaptability of the resulting model and its ability to solve more complex problems.
For this reason, the industry has shifted its focus toward unsupervised learning and weakly supervised learning.
One way is to take unlabeled data samples as input and learn the underlying concepts of those samples. Such methods convert a sample vector in the observable (apparent) space into a sample vector in a hidden space through a mapping called an encoder. An effective mapping can transform a complex distribution in the apparent space, for example a complex manifold, into a simple distribution in the hidden space, for example a Gaussian distribution. The autoencoder (AE), the restricted Boltzmann machine (RBM), and the generative adversarial network (GAN) all belong to this class of methods.
Accordingly, in implementation, such methods typically require another transform, called a decoder, to convert a vector in the hidden space back into a vector in the apparent space; this is known as the generation process.
If both mappings are good enough, the reconstructed apparent-space vector should be identical to the original vector. However, this is clearly only an ideal, because the capability obtained purely by observation depends on the coverage of the training samples; when sample coverage is insufficient, reliability is low. For example, the data encountered in a real application may be of types beyond those covered by the training samples, i.e. the data arising in the actual application go beyond what was observed in the samples, making the learned capability unreliable.
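As an illustration of this encoder/decoder (generation) structure, the following is a minimal sketch of a fully connected autoencoder in PyTorch; the layer sizes, optimizer, and placeholder data are illustrative assumptions, not details taken from this application.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Minimal encoder/decoder pair: apparent space -> hidden space -> apparent space."""
    def __init__(self, apparent_dim=784, hidden_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(apparent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, hidden_dim))
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, 128), nn.ReLU(),
                                     nn.Linear(128, apparent_dim))

    def forward(self, x):
        z = self.encoder(x)       # hidden-space (latent) vector
        return self.decoder(z)    # reconstructed apparent-space vector

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 784)                    # unlabeled samples (placeholder data)
loss = nn.functional.mse_loss(model(x), x)  # reconstruction error between X and X'
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

If the reconstruction error stays large on data the training set never covered, the learned mapping is exactly the unreliable capability described above.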
Disclosure of Invention
The embodiments of the application provide a computing system and a computing method, so as to make the capabilities derived from machine learning more reliable.
In a first aspect, a computing system is provided. The computing system comprises two components: a feature extraction unit, which recognizes the environment and effectively extracts its conceptual features, and an action generation unit, which interacts with the environment. During computation the action generation unit inherits and uses the feature extraction unit's ability to acquire concepts, and that ability can in turn be influenced by the action generation unit.
According to this embodiment of the invention, recognition of the environment and interaction with the environment can be combined for machine learning, the environment can be better understood, and the optimal action can be selected based on that understanding, so that the learned capability is more reliable.
In an optional implementation, the feature extraction unit is configured to: acquire a current data vector based on the environment; extract a current feature vector from one or more data vectors, where the one or more data vectors include the current data vector; and optimize the feature extraction unit according to the one or more data vectors and the current feature vector.
The action generation unit is configured to: determine a current action vector according to one or more feature vectors extracted by the feature extraction unit, where the one or more feature vectors include the current feature vector; apply the current action vector to the environment, so that the feature extraction unit obtains the next data vector based on the environment after the action vector has acted on it; acquire current reward and punishment feedback based on the environment, where the current reward and punishment feedback is produced by one or more action vectors acting on the environment, and the one or more action vectors include the current action vector; and optimize the action generation unit according to the current reward and punishment feedback.
By this embodiment of the invention, recognition of the environment is combined with machine learning through interaction with the environment, so that the environment can be better understood and the optimal action can be selected according to that understanding.
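As a high-level illustration of how the two units could cooperate, the following is a hedged Python sketch of the observe-extract-act-feedback loop described above; the env, feature_unit, and action_unit objects and their method names are hypothetical and only mirror the roles described in this section.

```python
# Hypothetical interfaces: the objects and method names are illustrative,
# not taken from this application.
def run_loop(env, feature_unit, action_unit, steps=1000):
    data_history, feature_history, action_history = [], [], []
    for t in range(steps):
        x_t = env.observe()                        # current data vector X(t)
        data_history.append(x_t)
        z_t = feature_unit.extract(data_history)   # current feature vector Z(t)
        feature_history.append(z_t)
        feature_unit.optimize(data_history, z_t)   # optimize the feature extraction unit

        a_t = action_unit.decide(feature_history)  # current action vector a(t)
        action_history.append(a_t)
        env.apply(a_t)                             # apply the action to the environment
        feedback = env.feedback(action_history)    # reward and punishment feedback
        action_unit.optimize(feedback)             # optimize the action generation unit
```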
In another optional implementation, the action generation unit is further configured to optimize the feature extraction unit according to the current reward and punishment feedback.
This embodiment of the invention allows knowledge of the environment to be refined through feedback from the environment, so that the environment is better understood and better actions are selected.
In another alternative implementation, the feature extraction unit optimizes itself with a first probability, and the action generation unit optimizes the feature extraction unit with a second probability, where the sum of the first probability and the second probability is 1.
According to this embodiment of the invention, optimization of the feature extraction unit driven by environment feedback and optimization of the feature extraction unit by itself can proceed in an orderly way according to the assigned probabilities, reducing the possibility of conflict between the two optimizations.
In another alternative implementation, the feature extraction unit is further configured to learn in advance from one or more training data vectors. A training data vector may be determined from simulated environment information, or from real environment information acquired in advance. The action generation unit is further configured to learn in advance from one or more training feature vectors predetermined by the feature extraction unit. The feature extraction unit and the action generation unit may learn independently in time, or simultaneously; when learning simultaneously, the feature extraction unit optimizes itself with the first probability, and the action generation unit optimizes the feature extraction unit with the second probability.
Pre-training (or pre-learning) gives the feature extraction unit a preliminary knowledge of the environment. When the feature extraction unit is trained alone, it continuously observes the environment, comes to recognize it through that observation, and grasps its concepts, so that features can later be extracted from the environment more effectively. When the action generation unit is trained alone, it determines the optimal action strategy based on the concepts already grasped. When the feature extraction unit and the action generation unit are trained simultaneously, knowledge of the environment is continuously updated while the optimal action strategy is selected.
In another alternative implementation, the computing system includes one or more subsystems, each of which includes a feature extraction unit and an action generation unit; a subsystem can be regarded as an individual, and the computing system as a population. Each subsystem follows a survival-of-the-fittest mechanism: during operation the subsystem maintains an accumulated reward and punishment value based on the reward and punishment feedback from the environment, increasing the accumulated value when the current feedback is a reward and decreasing it when the current feedback is a punishment; subsystems whose accumulated value is higher than a first threshold are duplicated, and subsystems whose accumulated value is lower than a second threshold are eliminated.
Through this competitive screening mechanism over the population of individuals, this embodiment of the invention can further improve the reliability of the computing system's learned capability.
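As an illustration of this duplicate-or-eliminate mechanism, below is a minimal Python sketch under assumed thresholds; the subsystem objects (with an accumulated attribute) and the use of deepcopy for duplication are hypothetical.

```python
import copy

def evolve(subsystems, feedbacks, dup_threshold=10.0, drop_threshold=-10.0):
    """feedbacks: the latest reward (+) or punishment (-) received by each subsystem."""
    survivors = []
    for sub, fb in zip(subsystems, feedbacks):
        sub.accumulated += fb                  # raise on reward, lower on punishment
        if sub.accumulated < drop_threshold:   # eliminated
            continue
        survivors.append(sub)
        if sub.accumulated > dup_threshold:    # duplicated
            survivors.append(copy.deepcopy(sub))
    return survivors
```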
In another optional implementation, the feature extraction unit is specifically configured to extract the current feature vector from one or more data vectors through a first operation, generate a data vector from the current feature vector through a second operation, and optimize the first operation and the second operation according to the error between the generated data vector and the one or more data vectors.
According to this embodiment of the invention, the feature extraction unit can be optimized by reconstructing the data vector and evaluating the reconstruction error.
In another optional implementation, the action generation unit is specifically configured to map the one or more feature vectors extracted by the feature extraction unit to the current action vector through a third operation; acquire current reward and punishment feedback based on the environment and map it to a current reward and punishment value; and optimize the third operation and the first operation according to the current reward and punishment value.
According to this embodiment of the invention, by mapping the reward and punishment feedback to a reward and punishment value, both the feature extraction unit and the action generation unit can be optimized according to that value.
In another alternative implementation, the third operation includes an operation through a neural network and a selection operation; the neural network maps the one or more feature vectors extracted by the feature extraction unit to one or more pending action vectors, and the selection operation selects the optimal one of the pending action vectors as the current action vector.
In another alternative implementation, the selection operation further includes a search operation, such as a Monte Carlo tree search (MCTS) operation. The search operation repeatedly selects an action vector from the one or more pending action vectors, runs a simulation for each selection, and chooses the action vector with the best simulation result as the current action vector.
In this way, the information given by the action network can be utilized to the greatest extent. In addition, the best selection path and its hypothetical results can serve as a basis for optimizing the action generation unit.
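The following is a hedged Python sketch of a simulation-based selection operation of this kind; it performs simple flat rollouts rather than a full Monte Carlo tree search, and the simulate() helper and trial count are assumptions.

```python
import random

def select_action(pending_actions, simulate, n_trials=50):
    """Pick the pending action vector whose simulated outcome scores best.

    simulate(action) is an assumed helper returning a scalar score for one
    hypothetical execution of the action on a model of the environment."""
    scores = {i: 0.0 for i in range(len(pending_actions))}
    counts = {i: 0 for i in range(len(pending_actions))}
    for _ in range(n_trials):
        i = random.randrange(len(pending_actions))  # select one pending action
        scores[i] += simulate(pending_actions[i])   # run a simulation for it
        counts[i] += 1
    best = max(scores, key=lambda i: scores[i] / counts[i] if counts[i] else float("-inf"))
    return pending_actions[best]
```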
In another alternative implementation, the first operation, the second operation, or the third operation includes an operation through a recurrent neural network (RNN).
In another alternative implementation, an action vector may represent an action that affects the environment, or an action that affects the learning process itself; for example, modifications to the network structure, branch pruning, and parameter modifications can all be represented by action vectors. On this basis, the action generation unit is further configured to adjust, according to the current action vector, the manner in which the feature extraction unit or the action generation unit is optimized.
In another alternative implementation, the aforementioned computing system is applied to a camera (e.g., a digital camera, a cell phone or tablet computer with photographing function, etc.), a robot (e.g., a sweeping robot), or an autopilot tool (e.g., an autopilot car, a drone, etc.).
In a second aspect, an embodiment of the present invention provides a computing method. The method is applicable to a computing system and comprises the following steps:
acquiring a current data vector based on the environment;
extracting a current feature vector from one or more data vectors, where the one or more data vectors include the current data vector;
optimizing the manner of extracting feature vectors according to the one or more data vectors and the current feature vector;
determining a current action vector according to one or more feature vectors, where the one or more feature vectors include the current feature vector;
applying the current action vector to the environment, so that the feature extraction unit obtains the next data vector based on the environment after the current action vector has acted on it;
acquiring current reward and punishment feedback based on the environment, where the current reward and punishment feedback is produced by one or more action vectors acting on the environment, and the one or more action vectors include the current action vector; and
optimizing the manner of determining action vectors according to the reward and punishment feedback.
In an alternative implementation, the method further includes:
optimizing the manner of extracting feature vectors according to the current reward and punishment feedback.
In another alternative implementation, optimizing the manner of extracting feature vectors according to the one or more data vectors and the current feature vector includes: optimizing, with a first probability, the manner of extracting feature vectors according to the one or more data vectors and the current feature vector;
and optimizing the manner of extracting feature vectors according to the current reward and punishment feedback includes: optimizing, with a second probability, the manner of extracting feature vectors according to the current reward and punishment feedback;
where the sum of the first probability and the second probability is 1.
In another alternative implementation, the method further includes: learning in advance based on one or more training data vectors.
In another alternative implementation, the method further includes: learning in advance based on one or more training feature vectors.
In another alternative implementation, learning in advance from the one or more training data vectors and learning in advance from the one or more training feature vectors are performed separately in time, or are performed simultaneously in time.
In another alternative implementation, the method further includes:
determining an accumulated reward and punishment value according to the reward and punishment feedback from the environment during operation, where the accumulated value is increased if the current reward and punishment feedback is a reward and decreased if it is a punishment;
a computing system whose accumulated reward and punishment value is higher than a first threshold is duplicated; and
a computing system whose accumulated reward and punishment value is lower than a second threshold is eliminated.
In another alternative implementation, extracting the current feature vector from the one or more data vectors includes extracting the current feature vector from the one or more data vectors through a first operation;
and optimizing the manner of extracting feature vectors according to the one or more data vectors and the current feature vector includes: generating a data vector from the current feature vector through a second operation, and optimizing the first operation and the second operation according to the error between the generated data vector and the one or more data vectors.
In another optional implementation, determining the current action vector according to the one or more feature vectors includes mapping the one or more feature vectors to the current action vector through a third operation;
and optimizing the manner of determining action vectors according to the current reward and punishment feedback includes: mapping the current reward and punishment feedback to a current reward and punishment value, and optimizing the third operation and the first operation according to the current reward and punishment value.
In another alternative implementation, the third operation includes an operation through a first neural network and a selection operation; mapping the one or more feature vectors to the current action vector through the third operation includes:
determining one or more pending action vectors through the operation of the first neural network; and
selecting the optimal one of the one or more pending action vectors as the current action vector through the selection operation.
In another alternative implementation, the selection operation further includes a search operation.
In another alternative implementation, selecting the optimal one of the one or more pending action vectors as the current action vector through the selection operation includes:
selecting from the one or more pending action vectors multiple times, running a simulation for each selected action vector, and choosing the action vector with the best simulation result as the current action vector.
In another alternative implementation, the first operation, the second operation, or the third operation includes an operation through a recurrent neural network (RNN).
In another alternative implementation, the method further includes: adjusting the manner of optimizing feature vector extraction or the manner of optimizing action vector determination according to the current action vector.
In a third aspect, an embodiment of the present invention provides a computing device. The device includes a processor and a memory; the memory is configured to store a program; and the processor is configured to execute the program stored in the memory, so as to control the computing device to perform the method of the second aspect and its optional implementations.
In a fourth aspect, an embodiment of the present invention provides an autopilot tool. The autopilot tool comprises a propulsion system, a sensor system, a control system, and a computing system, where the propulsion system is configured to provide power for the autopilot tool, and the computing system is configured to control the sensor system to acquire a current data vector based on the environment;
the computing system is further configured to extract a current feature vector from one or more data vectors, where the one or more data vectors include the current data vector;
the computing system is further configured to optimize the manner of extracting feature vectors according to the one or more data vectors and the current feature vector;
the computing system is further configured to determine a current action vector according to one or more feature vectors, where the one or more feature vectors include the current feature vector;
the computing system is further configured to control the control system to apply the current action vector to the environment, so that the feature extraction unit obtains the next data vector based on the environment after the current action vector has acted on it;
the computing system is further configured to control the sensor system to acquire current reward and punishment feedback based on the environment, where the current reward and punishment feedback is produced by one or more action vectors acting on the environment, and the one or more action vectors include the current action vector; and
the computing system is further configured to optimize the manner of determining action vectors according to the reward and punishment feedback.
In a fifth aspect, an embodiment of the present invention provides a camera, comprising a shooting system and a computing system;
the computing system is configured to control the shooting system to acquire a current data vector based on the environment;
the computing system is further configured to extract a current feature vector from one or more data vectors, where the one or more data vectors include the current data vector;
the computing system is further configured to optimize the manner of extracting feature vectors according to the one or more data vectors and the current feature vector;
the computing system is further configured to determine a current action vector according to one or more feature vectors, where the one or more feature vectors include the current feature vector;
the computing system is further configured to control the shooting system to apply the current action vector to the environment, so that the feature extraction unit obtains the next data vector based on the environment after the current action vector has acted on it;
the computing system is further configured to acquire current reward and punishment feedback based on the environment, where the current reward and punishment feedback is produced by one or more action vectors acting on the environment, and the one or more action vectors include the current action vector; and
the computing system is further configured to optimize the manner of determining action vectors according to the reward and punishment feedback.
In a sixth aspect, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the method of the second aspect and its optional implementations.
In a seventh aspect, a computer program product comprising instructions is provided which, when run on a computer, causes the computer to perform the method of the second aspect and its optional implementations.
In an eighth aspect, a chip apparatus is provided. The chip apparatus comprises a processor and a memory; the memory is configured to store a program; and the processor runs the program to perform the method of the second aspect and its optional implementations.
Drawings
FIG. 1 is a schematic diagram of an application scenario provided in an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a computing system according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of another computing system according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of another computing system according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of another computing system according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an autopilot tool according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a camera according to an embodiment of the present invention;
FIG. 8 is a schematic flowchart of a computing method according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a computing device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
Through analysis, the inventors of the present application found that existing unsupervised learning or weakly supervised learning mainly takes two forms. One form takes unlabeled data samples as input and learns the underlying concepts of those samples; this is a learning method based on continuous observation. Such methods convert a sample vector in the apparent space into a sample vector in the hidden space through a mapping called an encoder. Correspondingly, another transform, called a decoder, is required to convert the hidden-space vector back into an apparent-space vector, a process called generation. If both mappings are good enough, the reconstructed apparent-space vector should be identical to the original vector.
The other form is reinforcement learning (RL), a learning method that solves for an optimal action sequence in order to obtain possibly delayed rewards. Specifically, a behavior is determined from the current state, and the behavior in turn causes a change of state. During a series of behaviors and state changes, the intelligent agent receives rewards or penalties, from which the model parameters are optimized.
A reinforcement-learning agent may know very little at first, but after continuous learning and improvement it can form its own experiential knowledge and adopt the best action strategy. Many of the athletic skills we see, such as basketball and running, can be understood as results of this reinforcement-learning pattern. Biological evolution, and to a large extent the history of human development, can also be understood as processes of reinforcement learning.
For such methods, a course of action and opportunities for practice are prerequisites for carrying out reinforcement learning, and a clear reward and punishment mechanism is also necessary for it to be effective. In many cases, however, such as face recognition, there is no sufficiently rich space of actions to explore and practice, which makes reinforcement learning difficult to implement.
To address the above problems, the embodiments of the application provide a computing system and a computing method. The system combines a learning method based on continuous observation with a reinforcement-learning method, thereby realizing both knowledge of the environment and interaction with it. The two kinds of learning influence and reinforce each other during the learning process, so that the system becomes familiar with the environment, masters it, and acquires the ability to understand the environment, adapt to it, prevail in it, and so on.
The following describes embodiments of the present application with reference to the drawings.
FIG. 1 is a schematic diagram of an application scenario provided in an embodiment of the present invention. The computing device 100 shown in FIG. 1 may be a smart machine, for example a smart robot (e.g. a sweeping robot), an autopilot tool (e.g. a self-driving car or a drone), or a smart camera (e.g. a digital camera, or a mobile phone or tablet computer with a photographing function). The computing device can acquire environment data and determine its own actions according to that data; those actions in turn influence the environment. The computing device obtains from the environment the reward and punishment feedback caused by its actions, and optimizes its computation according to that feedback so as to increase the probability that the feedback obtained by its actions is a reward in the desired direction. The computing device thus mainly comprises the ability to recognize the environment and the ability to interact with it, and improves these abilities by combining recognition of the environment with interaction with it.
The embodiments of the present application are further described below with reference to FIGS. 2 to 5. FIG. 2 is a schematic structural diagram of a computing system according to an embodiment of the present application. The computing system may be implemented in the computing device shown in FIG. 1, for example integrated into a system on chip (SoC) of the computing device 100, or implemented as an application-specific integrated circuit (ASIC) of the computing device 100 (for example, an ASIC of a cloud computing system). As shown in FIG. 2, the system may include a feature extraction unit 210 and an action generation unit 220. Both units, like the other units in this application, may be implemented in hardware and software, for example based on a CPU and a memory (the functions of a unit are performed by the CPU reading the corresponding code stored in the memory) or based on a hardware processor (e.g. an FPGA or ASIC), i.e. by associated hardware circuitry.
The feature extraction unit 210 mainly realizes knowledge of the environment: specifically, it extracts features from data containing environment information, and those features can be regarded as knowledge of that information. The action generation unit 220 mainly interacts with the environment: specifically, it selects the optimal action according to the knowledge of the environment, applies that action to the environment, and optimizes the selection of the optimal action according to the reward and punishment feedback.
In a specific embodiment, as shown in FIG. 2, the feature extraction unit 210 is mainly configured to: acquire a current data vector based on the environment; extract a current feature vector from one or more data vectors, where the one or more data vectors include the current data vector; and optimize the feature extraction unit 210 according to the one or more data vectors and the current feature vector. For example, the feature extraction unit 210 is configured to map the current data vector to the current feature vector, and the mapping process is optimized using the current data vector together with the current feature vector.
A data vector is a vector containing environment information; for example, it may be generated from a picture of the environment, or obtained by aggregating environment data acquired by sensors. Environment data are data about a specific environment, including but not limited to images, sound, air pressure, and temperature acquired by various sensors; the specific environment data depend on the application scenario. For example, in an autonomous-driving scenario the environment data include image data acquired by a camera.
The aggregation mentioned above can be understood as data preprocessing, i.e. preprocessing the raw environment data to obtain a data vector that can be used in the subsequent steps, for example converting the data into a uniform format to facilitate later processing, or removing redundant information. Different data call for different preprocessing; for image data, for example, preprocessing includes but is not limited to denoising, color conversion, and filtering. The specific implementations of these preprocessing steps are prior art and are not described in detail here.
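As a small illustration of this kind of preprocessing, the following Python sketch denoises an image, converts its color space, and flattens it into a data vector using OpenCV and NumPy; the kernel size and target resolution are illustrative assumptions.

```python
import cv2
import numpy as np

def preprocess(image_bgr):
    """Turn a raw camera frame into a data vector X(t)."""
    denoised = cv2.GaussianBlur(image_bgr, (5, 5), 0)      # denoising
    gray = cv2.cvtColor(denoised, cv2.COLOR_BGR2GRAY)      # color conversion
    resized = cv2.resize(gray, (64, 64))                   # uniform format
    return (resized.astype(np.float32) / 255.0).ravel()    # flatten into a vector

x_t = preprocess(np.zeros((480, 640, 3), dtype=np.uint8))  # placeholder frame
```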
A feature vector is a vector extracted from the data vector by the feature extraction unit 210 and reflects the features of things. A feature vector may contain several pieces of information at once; for example, a feature vector extracted from an image obtained during autonomous driving may contain information about roads, vehicles, pedestrians, and so on. The format of each data element in the feature vector may be consistent with the format of the data elements in the data vector, but the dimensions of the two vectors may differ.
In addition, knowledge of the environment may relate only to the current environment rather than to the historical environment, or it may relate to the current environment together with some or all of the historical environment. Accordingly, the one or more data vectors may include only the current data vector, or may include the current data vector together with all or part of the historical data vectors.
In one example, as shown in FIG. 3, the feature extraction unit 210 extracts the current feature vector 2112 from the data vector 2111 through an operation 2121, and generates a data vector 2113 from the current feature vector 2112 through an operation 2122; the feature extraction unit 210 can then optimize the operations 2121 and 2122 based on the error between the data vector 2113 and the data vector 2111. For example, the operation 2121 may be implemented by a feature extraction network and the operation 2122 by a generation network, and each of these networks may be implemented by any neural network, for example a multi-layer deep neural network (DNN). If knowledge of the environment is relevant not only to the current environment but also to some or all of the historical environment, the operations 2121 and 2122 may be implemented by an RNN, performing feature extraction, reconstruction, and error-based optimization over all or part of the data vectors up to and including the data vector 2111.
For example, referring to FIG. 4 and taking the case where the operation 2121 is a feature extraction network and the operation 2122 is a generation network, the feature extraction unit 210 is specifically configured to perform the following steps:
The computing system continuously collects signals from the real, changing environment through its sensors; the signals collected at time t can be aggregated into a vector X(t), and time t can be regarded as the current time;
The computing system extracts a feature vector Z(t) from X(t) using the feature extraction network, which can be expressed as Z(t) = f(X(t), φ), where φ is the set of parameters that defines f, i.e. f changes as φ changes;
The computing system updates φ according to X(t) and Z(t) to optimize the feature extraction network. In one example of updating φ according to X(t) and Z(t), the computing system reconstructs X(t) from Z(t) through the generation network, which can be expressed as X'(t) = g(Z(t), θ), where θ is the set of parameters that defines g, i.e. g changes as θ changes; the computing system then updates φ and θ by gradient descent according to the error E between X'(t) and X(t), thereby optimizing the feature extraction network and the generation network.
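To make these steps concrete, here is a hedged PyTorch sketch of the f/g pair and of a gradient-descent update of φ and θ driven by the reconstruction error E; the network shapes and learning rate are illustrative assumptions, not values from this application.

```python
import torch
import torch.nn as nn

feature_net = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 16))     # f(., phi)
generation_net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 256))  # g(., theta)
optimizer = torch.optim.SGD(list(feature_net.parameters()) +
                            list(generation_net.parameters()), lr=1e-2)

x_t = torch.randn(1, 256)                   # X(t): aggregated sensor signals at time t
z_t = feature_net(x_t)                      # Z(t) = f(X(t), phi)
x_rec = generation_net(z_t)                 # X'(t) = g(Z(t), theta)
error = nn.functional.mse_loss(x_rec, x_t)  # E(X(t), X'(t))
optimizer.zero_grad()
error.backward()                            # gradients of E w.r.t. phi and theta
optimizer.step()                            # gradient-descent update of phi and theta
```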
There are two possibilities for the relationship of Z (t) and X (t):
In one example, Z (t) is related to X (t) only at the current instant and not to the past history of X (t), in which case there is only one spatial mapping from X (t) to Z (t).
In another example, Z (t) is related not only to the current state of X (t), but also to some or all of the historical state of X (t). In implementing the mapping process from X (t) to Z (t), all or part of the history memory of X (t) is preserved, for example, the mapping process from X (t) to Z (t) can be implemented by an RNN model.
With continued reference to FIG. 2, the action generation unit 220 is mainly configured to: determine a current action vector according to one or more feature vectors extracted by the feature extraction unit 210, where the one or more feature vectors include the current feature vector; apply the current action vector to the environment, so that the feature extraction unit 210 obtains the next data vector based on the environment after the current action vector has acted on it; acquire current reward and punishment feedback based on the environment, where the current reward and punishment feedback is produced by one or more action vectors acting on the environment, and the one or more action vectors include the current action vector; and optimize the action generation unit 220 according to the reward and punishment feedback. For example, the action generation unit 220 is configured to map the current feature vector to the current action vector, and to optimize its own mapping process and the mapping process of the feature extraction unit 210 according to the reward and punishment feedback.
An action vector can represent an action that affects the environment; for example, in an autonomous-driving scenario the action vector may consist of a throttle parameter, a brake parameter, a steering-wheel parameter, and so on. It can also represent an action that affects the learning process itself; for example, modifications to the network structure, branch pruning, and parameter modifications can all be represented by action vectors.
In addition, during interaction with the environment, the strategy of selecting the best action according to knowledge of the environment may relate only to knowledge of the current environment rather than to history, or to knowledge of the current environment together with some or all of the historical environment. Accordingly, the one or more action vectors may include only the current action vector, or may include the current action vector together with all or part of the historical action vectors.
In one example, as shown in FIG. 3, the action generation unit 220 maps the feature vector 2112 extracted by the feature extraction unit 210 to the action vector 2211 through an operation 2221; acquires current reward and punishment feedback 2212 based on the environment, and maps the current reward and punishment feedback 2212 to a reward and punishment value through an operation 2222; and optimizes the operation 2221 according to the reward and punishment value. In one example, the operation 2221 may be implemented by an action network together with a selection operation, where the action network may be implemented by any neural network (for example a multi-layer DNN) and the selection operation chooses an optimal solution. The action network maps the feature vector extracted by the feature extraction unit 210 to a plurality of pending action vectors, and the selection operation selects the optimal one of them as the current action vector. Selecting the optimal pending action vector can be done in several ways; for example, an action vector is selected repeatedly from the pending action vectors, a simulation is run for each selection, and the action vector with the best simulation result is taken as the current action vector. In another example, the selection operation further includes a search operation, which hypothesizes an action option and simulates its execution, then simulates the features produced by executing that action on the environment, and continues such hypothesizing and simulated execution under the new features. Because there are branches for the multiple hypothetical action options, the search yields a search tree from which the best path is determined. For example, the search operation may be a Monte Carlo tree search (MCTS), which finally gives the statistically best selection path. In this way the information given by the action network can be utilized to the greatest extent; in addition, the best selection path and its hypothetical results can serve as a basis for optimizing the action generation unit 220.
If the strategy for selecting the best action based on knowledge of the environment relates not only to knowledge of the current environment but also to knowledge of the historical environment, the operation 2221 may be implemented by an RNN: the action vector is generated from all or part of the feature vectors up to and including the feature vector 2112, and the operation 2221 is in turn optimized based on the reward and punishment feedback for that action vector.
With continued reference to FIG. 2, the action generation unit 220 is further configured to optimize the feature extraction unit 210 according to the current reward and punishment feedback. To avoid conflicts when optimizing the feature extraction unit 210, the feature extraction unit 210 optimizes itself with probability P1, and the action generation unit 220 optimizes the feature extraction unit 210 with probability P2, where the sum of P1 and P2 is 1.
For example, referring to FIG. 4 and taking the case where the operation 2221 includes an action network and a selection operation, the computing system may perform the following steps:
The computing system computes an action-set vector A(t) from Z(t) using the action network, which can be expressed as A(t) = h(Z(t), β), where β is the set of parameters that defines h, i.e. the specific h changes as β changes.
The computing system selects an action from the action-set vector through the selection operation, which can be expressed as a(t) = select(A(t)).
Applying a(t) to the environment, the environment is updated under the action vector a(t): X(t+1) = a(t)[X(t)].
The computing system obtains the reward and punishment feedback that the environment produces in response to the action a(t); the reward or punishment is not limited to the result of the current action and may also be the combined result of the whole preceding action sequence.
The computing system updates β and φ according to the reward and punishment feedback, so as to optimize the action generation unit 220 and the feature extraction unit 210; the selection operation may also be updated according to the reward and punishment feedback.
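As one possible concrete reading of these steps, below is a hedged PyTorch sketch that treats the reward-driven update of β (and, through the shared feature network, φ) as a simple REINFORCE-style policy-gradient step; the application does not prescribe this particular update rule, and the env_step() helper is a placeholder.

```python
import torch
import torch.nn as nn

def env_step(action_index):
    # Placeholder for applying a(t) to the environment and reading the
    # reward and punishment feedback; a real system would query sensors here.
    return 1.0

feature_net = nn.Linear(256, 16)  # f(., phi), shared with the feature extraction unit
action_net = nn.Linear(16, 4)     # h(., beta): features -> logits over the action set A(t)
optimizer = torch.optim.SGD(list(feature_net.parameters()) +
                            list(action_net.parameters()), lr=1e-2)

x_t = torch.randn(1, 256)                                        # X(t)
z_t = feature_net(x_t)                                           # Z(t) = f(X(t), phi)
dist = torch.distributions.Categorical(logits=action_net(z_t))   # A(t) = h(Z(t), beta)
a_t = dist.sample()                            # a(t) = select(A(t)), here a stochastic choice
reward = env_step(a_t.item())                  # X(t+1) = a(t)[X(t)], then feedback
loss = -(dist.log_prob(a_t) * reward).mean()   # REINFORCE-style surrogate objective
optimizer.zero_grad()
loss.backward()                                # gradients w.r.t. beta and phi
optimizer.step()                               # update beta and phi from the feedback
```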
The relationship between A(t) and Z(t) has two possibilities:
In one example, A(t) is related only to Z(t) at the current instant and not to the past history of Z(t); in this case there is only a spatial mapping from Z(t) to A(t).
In another example, A(t) is related not only to the current state of Z(t) but also to some or all of its historical states. The mapping process from Z(t) to A(t) then preserves all or part of the history of Z(t); for example, it can be implemented by an RNN model.
An action may affect the environment, or it may affect the learning process itself; for example, modifications to the network structure, branch pruning, and parameter modifications can all be defined as possible actions. Specifically, the action generation unit 220 is further configured to adjust, according to the action vector, the manner in which the feature extraction unit 210 and/or the action generation unit 220 is optimized.
In another embodiment, as shown in FIG. 5, the feature extraction unit 210's ability to recognize the environment and the action generation unit 220's ability to interact with the environment may both be trained in advance.
Pre-training the feature extraction unit 210 can be understood as a process of constantly observing the environment, coming to recognize it through that observation, and grasping its concepts (i.e. features). Taking learning to drive as an example, it is like sitting in the front passenger seat to learn driving by observing only, without operating the vehicle. Through this process the learner becomes familiar with road conditions and the various situations (i.e. the environment) that arise while driving; it is a process of becoming familiar with environmental concepts, including roads, lane lines, various vehicles, different speed relationships, pedestrians, various abnormal situations, and so on. On this basis, the feature extraction unit 210 may learn in advance from one or more training data vectors, which can be obtained by simulation or acquired in advance.
For example, for an autonomous-driving scenario, environment information such as the data of a front image sensor, lidar data, and images or distance-sensor readings at other positions may be simulated when the feature extraction unit 210 is pre-trained. All of this simulated environment information can be aggregated into one input vector X(t). The feature extraction unit 210 extracts a feature Z(t) from X(t), where Z(t) is a feature vector, and is optimized based on X(t) and Z(t). For pre-training, one or more input vectors X(t) may be determined, and the feature extraction unit 210 trained on each of them.
Training the action generation unit 220 can be understood as learning the optimal action strategy based on the concepts (features) already grasped. Still taking learning to drive as an example, it is like a learner practicing on a simulator: supposing that the vehicle ahead suddenly decelerates, the learner practices an optimal deceleration-and-stop procedure. Through repeated practice, safe driving can be learned, and a smooth, comfortable stopping procedure can be learned instead of simply slamming the brakes.
For example, in an autonomous-driving scenario, the optimal action a(t) is selected based on the feature Z(t) extracted during pre-training of the feature extraction unit 210, where a(t) is an action vector. Executing a(t) affects the surrounding environment; the reward and punishment feedback that the environment produces in response to the action is obtained, the action generation unit 220 is optimized according to that feedback, and the feature extraction unit 210 may also be optimized according to it. For pre-training, the action generation unit 220 may be trained on one or more input vectors Z(t) determined by the feature extraction unit 210.
The training of the feature extraction unit 210 and the training of the action generation unit 220 described above may be performed independently.
Suppose that training of the action generation unit 220 is stopped and only the feature extraction unit 210 is trained. The computing system then works in an observation state, which is a process of continuously observing the environment, coming to recognize it through that observation, and grasping its concepts. This is similar to a person learning to drive from the front passenger seat, only observing and not operating. Such a learning process is basic and useful, because the learner becomes familiar with road conditions and the various situations that arise while driving, which is fundamental and important for the later driving practice. It is also a process of becoming familiar with environmental concepts, including roads, lane lines, various vehicles, different speed relationships, pedestrians, various anomalies, and so on.
The other case is that training of the feature extraction unit 210 is stopped and only the action generation unit 220 is trained. The intelligent system then learns the optimal action strategy from the concepts it has already mastered. Still taking learning to drive as an example, it is like a learner practicing on a simulator: supposing that the vehicle ahead suddenly decelerates, the learner closes his eyes (since the concept at this moment has already been determined, closing the eyes means that the computation uses only the concept and does not consider the original image data) and practices an optimal deceleration-and-stop procedure. Through repeated practice, safe driving can be learned, and a smooth, comfortable stopping procedure can be learned instead of simply slamming the brakes. Instead of closing the eyes, the learner may freeze the picture on the simulator; the learning process under this condition not only learns a good sequence of control actions, but may also lead to a better understanding of the concept represented by the frozen picture, whereas before this correction the learning was based on observed concepts only.
For an actual application scenario, the above two processes can be performed independently in time, or simultaneously without interfering with each other. The structural design of this scheme makes such cooperation fully feasible.
When the two kinds of training are performed simultaneously or alternately, if the training performed by the action generation unit 220 involves updating the feature extraction unit 210, then, since the feature extraction unit 210 is also updated by its own training, questions arise as to which update dominates and how the weights are allocated. Both learning mechanisms naturally help establish the correct concepts: training of the feature extraction unit 210 is based on unsupervised data that can be observed in large quantities, while training of the action generation unit 220 helps correct the concepts based on outcomes. Depending on the application scenario, a developer can fix the ratio between the two or adjust it dynamically. For example, a learner who has established the basic concepts by first riding along many times in the passenger seat may then rely on continuous driving practice; if accidents keep occurring, for instance problems when reversing, some observational corrective practice may be needed. In practice, the two learning processes may be assigned probabilities, for example p1 and p2 with p1 + p2 = 1, so that training of the feature extraction unit 210 is performed with probability p1 and training of the action generation unit 220 with probability p2.
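A minimal sketch of this probabilistic split, assuming the two kinds of training step are already available as callables; the value of p1 below is illustrative.

```python
import random

def training_step(observe_step, interact_step, p1=0.7):
    """With probability p1 run the observation-based update of the feature
    extraction unit; otherwise run the reward-driven update applied to it
    by the action generation unit. Since p1 + p2 = 1, p2 is implicit."""
    if random.random() < p1:
        observe_step()    # optimize feature extraction from the reconstruction error
    else:
        interact_step()   # optimize feature extraction from reward and punishment feedback
```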
In this embodiment of the invention, the operations 2121, 2122, and 2221 may be implemented by neural-network operations and optimized through learning operations. The learning operations may include back-propagation or other learning operations; further, back-propagation may optimize according to the error of the mapping result, and other optimization criteria may also be used.
Specifically, any function Y(t) may be approximated by a function z(X(t), λ) defined by a neural-network operation, where λ is the parameter to be solved and z() is a generic name for a family of functions; the function Y(t) may be the aforementioned operation 2121, operation 2122, operation 2221, and so on. The approximation process of z(X(t), λ) is:
Y1(t) = z1(X(t), λ1),
Y2(t) = z2(Y1(t), λ2),
...
Yp(t) = zp(Yp-1(t), λp),
and finally Yp(t) is taken as the approximation of Y(t).
Each of the above functions z() may take the form s(XW + b), where XW denotes the dot product of the data vector X and the weight vector W, and b is an adjustable offset. Both W and b are part of the parameter λ, which includes λ1, λ2, ..., λp. s() is a nonlinear mapping, for example a rectified linear unit (ReLU) mapping, which outputs 0 when the input is less than 0 and outputs the input itself when the input is greater than or equal to 0.
In this iterative manner, once a parameter set λ is given, the mapping relationship between the input X(t) and the output Y(t) is determined.
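As an illustration of this layer-by-layer composition, here is a hedged NumPy sketch of the building block s(XW + b) and its iteration; the layer widths and initialization are arbitrary assumptions.

```python
import numpy as np

def relu(v):
    # s(): output 0 where the input is below 0, the input itself otherwise
    return np.maximum(v, 0.0)

def layer(x, W, b):
    # one function z(): s(XW + b)
    return relu(x @ W + b)

rng = np.random.default_rng(0)
sizes = [256, 64, 16]                              # assumed widths of the stacked layers
params = [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]  # lambda = {(W1, b1), ..., (Wp, bp)}

x_t = rng.standard_normal((1, 256))                # input X(t)
y = x_t
for W, b in params:                                # Y1(t), Y2(t), ..., Yp(t)
    y = layer(y, W, b)
# y is now Yp(t), the approximation of Y(t) under the given parameter set lambda
```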
For an arbitrary λ, the mapping does not necessarily map X(t) to the correct Y(t); suppose it maps to Y'(t). The goal of the back-propagation learning operation is therefore to find a λ such that Y'(t) is closest to Y(t).
An evaluation criterion, for example the mean square error criterion, is established to evaluate the error between Y(t) and Y'(t), expressed as:
E = Σ [Y'(t) − Y(t)]²  (equation one)
The back-propagation learning operation is defined as: λn = λn−1 + Δλ, where Δλ = −ε·∂E/∂λ and ε is a constant; this operation is also called a gradient operation. Its meaning is that at each step of the learning process an optimal downhill step (in the direction of the greatest gradient) is pursued, finally reaching the valley floor, i.e., the target point, which is the λ position corresponding to the minimum value of E.
In an implementation, if the data vector is unchanged (i.e., the environmental data is unchanged), a threshold may be set, and the iterative process is terminated when E is less than the threshold. If the data vector is in the process of continuously changing, the iterative process continues.
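A minimal sketch of the gradient operation with a threshold-based termination is given below; grad_E and error are assumed callables that return ∂E/∂λ and the current error E for a given λ.

```python
def gradient_descent(lam, grad_E, error, epsilon=1e-3, threshold=1e-6, max_steps=10000):
    """Iteratively update λ by a step against the gradient of E.

    Terminates when the error E falls below the threshold (data vector unchanged),
    or after max_steps when the data keeps changing and iteration would continue.
    """
    for _ in range(max_steps):
        if error(lam) < threshold:
            break
        lam = lam - epsilon * grad_E(lam)   # Δλ = -ε ∂E/∂λ
    return lam
```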
In the embodiment of the present invention, the purpose of optimizing the feature extraction unit 210 according to one or more data vectors and the current feature vector is to obtain the best feature extraction unit 210 according to the current data vector, so as to generate the concept (feature) that is most accurate for the environmental awareness. The description will be given taking an example in which the arithmetic unit includes a feature extraction network and a generation network.
In one example, optimization may be achieved by back-propagation operations. Specifically, the feature extraction unit 210 extracts a feature vector Z(t) from X(t) using a feature extraction network, which can be expressed as: Z(t) = f(X(t), φ), where φ is a set of parameters used to define f, i.e., f changes as φ changes; X(t) is then reconstructed from Z(t) using a generation network, which can be expressed as: X'(t) = g(Z(t), θ), where θ is a set of parameters used to define g, i.e., a particular g changes as θ changes. Updating φ and θ based on the evaluation criterion E[X(t), X'(t)] and the gradient learning rule derived from the back-propagation operation achieves the optimization of the feature extraction unit 210.
The feature extraction network f(X(t), φ) and the generation network g(Z(t), θ) can each be implemented by a neural network, which is an iterative process of a series of function operations, as follows:
X1(t) = f1(X(t), φ1), X2(t) = f2(X1(t), φ2), …, Xu(t) = fu(Xu-1(t), φu), and finally Z(t) = Xu(t);
Z1(t) = g1(Z(t), θ1), Z2(t) = g2(Z1(t), θ2), …, Zv(t) = gv(Zv-1(t), θv), and finally X'(t) = Zv(t).
Each of these functions may take the form s(XW + b), where XW represents the dot product of the data vector X and the weight vector W, b is an adjustable offset, and s() is a nonlinear mapping, such as a ReLU mapping.
Once the mapping relationship (i.e., the neural network model and the model parameters) is determined, a corresponding X'(t) can be generated for any data vector X(t). Ideally, X(t) and X'(t) are identical, which means that the features extracted by the feature extraction network are exact, but in practice there is always a certain deviation E. Taking the minimum deviation, which can be measured by the mean square error, as the objective, the optimal parameters of the feature extraction network can be solved.
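A minimal reconstruction-based sketch is given below, assuming NumPy and single-layer linear maps for the feature extraction network f (parameters Wf, playing the role of φ) and the generation network g (parameters Wg, playing the role of θ); a practical implementation would typically use deeper networks and an automatic-differentiation framework.

```python
import numpy as np

def reconstruction_step(X, Wf, Wg, lr=1e-2):
    """One optimization step: Z = f(X, φ), X' = g(Z, θ), minimizing E = mean||X - X'||^2.

    Wf plays the role of φ and Wg the role of θ; both are single linear layers for brevity.
    """
    Z = X @ Wf                         # feature vector Z(t)
    X_rec = Z @ Wg                     # reconstructed data vector X'(t)
    err = X_rec - X
    E = float(np.mean(err ** 2))       # mean-square evaluation criterion E[X, X']
    n = X.shape[0]
    grad_Wg = Z.T @ err / n            # ∂E/∂θ (up to a constant factor)
    grad_Wf = X.T @ (err @ Wg.T) / n   # ∂E/∂φ (up to a constant factor)
    return Wf - lr * grad_Wf, Wg - lr * grad_Wg, E
```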
Referring to fig. 5, the process of solving the optimal parameters φ and θ of the feature extraction network and the generation network for a given target is an optimization process through learning operations. The solution typically involves an iterative process in which the parameters are updated according to the gradients of the objective function with respect to the parameters, which can be expressed as:
φ ← φ − ε1·∂E/∂φ, θ ← θ − ε2·∂E/∂θ,
where ε1 and ε2 are adjustable step size parameters.
It should be noted that reconstructing X'(t) and updating the parameters with the mean square error between X(t) and X'(t) is only one type of learning operation; other operations may be chosen instead. For example, the STDP learning operation actually performed in the human brain does not need to reconstruct X'(t).
The principle of the STDP operation resembles an election mechanism: anyone can vote and be voted for, and between people there is a parameter representing the degree of trust; once enough people have voted for me, I in turn cast votes supporting others. After being elected, I increase my trust in those who voted for me before the election moment, and the closer to that moment, the more the trust increases; for those who voted for me after the election moment, I decrease my trust, and the closer to that moment, the more the trust decreases. Such trust relationships naturally form different communities in the population, and the whole population acquires the ability to refine and extract complex concepts. If the parameter is denoted by φ, the learning rule can be expressed as: Δφ = ε·S(Δt), where Δt = t_in − t_out, ε is a constant, S is a mapping function, t_in denotes the trigger time of the input signal, and t_out denotes the trigger time of the output signal.
The mapping function S behaves as follows: when Δt < 0, S is positive, and the larger the absolute value of Δt, the smaller the absolute value of S; when Δt > 0, S is negative, and the larger the absolute value of Δt, the smaller the absolute value of S.
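A hedged sketch of the STDP-style update Δφ = ε·S(Δt) is shown below; the exponential form of S is only an assumption chosen to satisfy the stated sign and decay behavior.

```python
import math

def stdp_update(phi, t_in, t_out, eps=0.01, tau=20.0):
    """Update the trust/connection parameter φ according to spike timing.

    Δt = t_in - t_out. For Δt < 0 (input before output) S is positive and decays
    with |Δt|; for Δt > 0 S is negative and its magnitude also decays with |Δt|.
    """
    dt = t_in - t_out
    if dt < 0:
        S = math.exp(dt / tau)        # positive, smaller as |Δt| grows
    elif dt > 0:
        S = -math.exp(-dt / tau)      # negative, smaller magnitude as |Δt| grows
    else:
        S = 0.0
    return phi + eps * S
```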
In the embodiment of the present invention, the objective of optimizing the action generation unit 220 and the feature extraction unit 210 according to reward and punishment feedback is mainly to select the optimal action, that is, to make the computing system produce a response (i.e., an action) closer to that of a person.
A possible action set A(t) is generated through the action network, expressed as: A(t) = h(Z(t), β), where Z(t) is the feature vector derived from the environmental input X(t) through the feature extraction network. The action network may also be implemented with a neural network, for example:
A1(t) = h1(Z(t), β1), A2(t) = h2(A1(t), β2), …, As(t) = hs(As-1(t), βs), and finally A(t) = As(t).
Each of these functions also takes the form s(XW + b) and is not described in detail here.
In addition, the set of actions output by the action network includes parameters related to the action, such as expected value of the action, probability of suggesting the action to be taken, and so forth.
In one example, the selection operation may, based on these parameters, directly select the action considered optimal, such as the action with the highest expected value, or the highest probability, or a combination of both. The weight parameters of the action network are then modified according to the result of the action, such as the deviation between the obtained true value and the expected value, so that the value estimate becomes more accurate. This process of modifying the weight parameters is the process of optimizing the action network.
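A minimal sketch of this direct selection is given below; the assumption that each candidate action carries an expected value and a suggestion probability, and the scoring weights themselves, are illustrative.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CandidateAction:
    params: tuple        # e.g. throttle / brake / steering parameters
    expected_value: float
    probability: float   # probability that the action network suggests taking it

def select_action(candidates: List[CandidateAction],
                  w_value: float = 0.5, w_prob: float = 0.5) -> CandidateAction:
    """Directly select the action considered optimal: most valuable, most probable,
    or a weighted combination of both (the weights are an assumption)."""
    return max(candidates, key=lambda a: w_value * a.expected_value + w_prob * a.probability)
```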
For the optimization of the action network, once the objective function is determined, the parameter β is updated according to the gradient of the objective function with respect to the parameter, and the parameter φ of the feature extraction network may optionally be updated as well.
In another example, the selection operation may also include a search operation. The idea of the search operation is to assume an action option and assume that the action is performed, thereby generating the environmental state resulting from that action, and to continue making such assumptions in the new environmental state. Because multiple hypothetical action options branch out, such a search produces a search tree, and one operation that performs this search effectively is the MCTS operation, which maximizes the use of the information provided by the action network. The MCTS operation finally gives a statistically optimal selection path, and the best selection path together with the hypothetical results can serve as the basis for revising the action network.
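The following is a simplified rollout-style sketch in the spirit of such a search (not a full MCTS implementation); simulate_env and the candidate action list are assumed to be provided by the caller.

```python
import random

def rollout_search(state, candidate_actions, simulate_env, depth=5, rollouts=20):
    """For each candidate action, assume it is performed, simulate the resulting
    environment states for a few steps, and keep the action with the best average
    hypothetical reward. simulate_env(state, action) -> (next_state, reward)."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        total = 0.0
        for _ in range(rollouts):
            s, a, score = state, action, 0.0
            for _ in range(depth):
                s, r = simulate_env(s, a)
                score += r
                a = random.choice(candidate_actions)   # random continuation policy
            total += score
        if total / rollouts > best_score:
            best_action, best_score = action, total / rollouts
    return best_action
```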
As shown in fig. 5, the process of solving the action network parameter β for a given target is an optimization process through learning operations. The solution usually involves an iterative calculation process: the optimal parameters are solved according to the target value E corresponding to the hypothetical result (the target value may be a reward-and-punishment value associated with the result), and the weight parameters β and φ are updated according to the gradient of the target value with respect to the weight parameters, which can be expressed as:
β ← β − ε3·∂E/∂β, φ ← φ − ε4·∂E/∂φ,
where ε3 and ε4 are adjustable step size parameters.
In another embodiment, intelligence does not exist in isolation; population characteristics are a very important mechanism of intelligent development, as in multicellular nervous systems, populations of species, and the like. From the above method it can be seen that, as an individual, its level of intelligence is limited if there is no inheritance and no repeated learning and improvement. Continuous learning by the individual, including virtual practice processes embodied as thought processes, is the basis for building basic intelligence. On this basis, population learning, including competitive screening mechanisms, is an important mechanism of intelligent progress.
On the other hand, for digital systems, duplication and deletion of information is much easier than evolution of the physical world, and this difference provides the possibility of machine intelligence community development. For machine intelligence, the community need not be physical, but may be multiple logical copies in a digital system.
In an embodiment of the present invention, the computing system 100 includes one or more subsystems, each subsystem including a feature extraction unit and an action generation unit, respectively; and each subsystem is retained or eliminated (survival of the fittest) according to preset conditions.
In one example, each subsystem being retained or eliminated according to the preset conditions may specifically include: determining a reward-and-punishment accumulated value according to the reward and punishment feedback of the environment during the operation of the subsystem, wherein the accumulated value is increased if the current feedback is a reward and decreased if the current feedback is a punishment; subsystems whose accumulated value is higher than a first threshold are duplicated; subsystems whose accumulated value is lower than a second threshold are eliminated.
In particular, the preset condition may include a termination mechanism. When the learning of a subsystem is biased for some reason, it performs poorly under the reward and punishment mechanism of the environment; continuing to learn would actually waste resources and the bias may be difficult to correct, and at this point a reasonable approach is to terminate the subsystem. Based on this, in one example, the termination mechanism may be implemented specifically as follows:
the subsystem is endowed with an energy value during initialization, and the subsequent energy is acquired from the environment;
During operation of the subsystem, energy may be harvested from the environment or returned to the environment according to rewards and punishments.
The operation of the subsystem itself also consumes energy, which is returned to the environment. For example, the optimization process consumes a certain amount of energy, and the propagation of information within the system also consumes a certain amount of energy.
The subsystem terminates if the individual system's energy is exhausted.
For a population system, if there is only the termination mechanism—especially when the individual systems are still in a very immature state at the beginning—the population cannot be sustained, and the end result of such convergence is that the intelligent system is difficult to cultivate and develop. Based on this, the preset condition may also include a replication mechanism. The balance between replication and termination gives the developing population intelligence the opportunity to keep growing. The steps of the replication mechanism may be:
Assume that the environment and computing system have a total energy value (GE):
Randomly, according to a certain proportion r, a selected subsystem is split into two or N. The splitting process also involves some degree of variation. If there is no subsystem in the system, one subsystem is generated. The energy for replication may come directly from the environment. The success rate of replication is related to the energy level obtained from the environment; the greater the energy, the lower the probability. For example, an exponential relationship between probability and energy may be defined: p = exp(−E/K), where K is a constant.
The environmental energy is reduced by the energy consumed by replication and increased by the energy returned to the environment after termination. The environmental energy is initially GE and the population energy is initially 0. As the energy of the population grows, its upper limit is GE, so there is a dynamic balance of energy between the environment and the population.
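A schematic sketch of the energy-based termination and replication bookkeeping is given below; the Subsystem interface (an energy attribute and a copy_with_variation() method), the constant K, and the initial energy are illustrative assumptions.

```python
import math
import random

def population_step(subsystems, environment_energy, K=10.0, r=0.05, init_energy=1.0):
    """One bookkeeping step of the population: terminate exhausted subsystems and
    probabilistically replicate others. Subsystem objects are assumed to expose an
    `energy` attribute and a `copy_with_variation()` method."""
    survivors = []
    for sub in subsystems:
        if sub.energy <= 0.0:          # energy exhausted: the subsystem terminates
            continue
        survivors.append(sub)
        # a random fraction r of subsystems attempts to split; the success
        # probability p = exp(-E / K) is tied to the energy obtained from the environment
        if random.random() < r and random.random() < math.exp(-sub.energy / K):
            child = sub.copy_with_variation()       # splitting involves some variation
            child.energy = init_energy              # the replicated energy comes from the environment
            environment_energy -= init_energy
            survivors.append(child)
    return survivors, environment_energy
```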
In one example, an autopilot tool 600 implemented based on the computing system 100 is illustrated. The autopilot tool 600 may be implemented on a car or may take the form of a car. However, the example system may also be implemented on or take the form of other vehicles, such as trucks, motorcycles, buses, boats, planes, helicopters, lawn mowers, snowmobiles, recreational vehicles, agricultural equipment, construction equipment, golf carts, trains, and trams. Furthermore, robotic devices, such as a sweeping robot, may also be used to perform the methods and systems described herein.
At present, an automatic driving system mainly determines proper driving behaviors of a vehicle from a specific geometric physical position relation by identifying specific people, vehicles and lane lines and judging the relative position relation and speed relation of the people, the vehicles and the driven vehicles, wherein the proper driving behaviors comprise control of a steering wheel, acceleration and deceleration control and the like. The disadvantage of such a system is that only specific rules can be met and the robustness of the operation is poor. If an unprecedented, unaccounted for scene is encountered, natural, accurate reactions may not occur.
The embodiment of the invention can generate a response directly from the image input, just as a person does, rather than only after carrying out a logical analysis process, so that a driving response relatively close to that of a natural person is produced.
Autopilot tool 600 includes propulsion system 601, sensor system 602, control system 603, and computing system 604. The computing system 604 may include a processor, memory, and the like. The computing system 604 may be a controller or a portion of a controller of the autopilot tool 600. The memory may include instructions executable by the processor. The components of the autopilot tool 600 may be configured to operate in interconnection with each other and/or with other components coupled to the systems.
The autopilot tool 600 may include more, fewer, or different systems, and each system may include more, fewer, or different components. Furthermore, the illustrated systems and components may be combined or divided in any number of ways, for example, the autopilot 600 may also include a power source, a display screen, speakers, and the like.
The propulsion system 601 may be used to provide powered movement of the autopilot tool 600. For example, propulsion system 601 includes an engine/generator, an energy source, a transmission (transmission), wheels/tires, and so forth.
The engine/motor may be or include any combination of an internal combustion engine, an electric motor, a steam engine, a Stirling engine, and the like. Other engines and motors are also possible. In some examples, propulsion system 601 may include multiple types of engines and/or motors. For example, a hybrid gas-electric car may include a gasoline engine and an electric motor. Other examples are possible. The energy source may be a source of energy that wholly or partially powers the engine/motor. In some examples, the energy source may also provide energy to other systems of the autopilot tool 600. The transmission may be used to transfer mechanical power from the engine/motor to the wheels/tires. To this end, the transmission may include a gearbox, clutch, differential, drive shaft, and/or other elements. In examples where the transmission includes a drive shaft, the drive shaft includes one or more axles for coupling to the wheels/tires. The wheels/tires of the autopilot tool 600 may be configured in a variety of forms, including unicycle, bicycle/motorcycle, tricycle, or car/truck four-wheel forms. Other wheel/tire forms are also possible, such as those comprising six or more wheels. The wheels/tires of the autopilot tool 600 may be configured to rotate differentially relative to other wheels/tires. In some examples, the wheels/tires may include one or more wheels fixedly attached to the transmission and one or more tires coupled to the rims of the wheels that contact the driving surface. The wheels/tires may comprise any combination of metal and rubber, or other materials. Propulsion system 601 may additionally or alternatively include components other than those shown.
The sensor system 602 may include several sensors for sensing information about the environment in which the autopilot tool 600 is located. For example, the sensors of the sensor system include a global positioning system (global positioning system, GPS), an inertial measurement unit (inertial measurement unit, IMU), a radar unit, a laser ranging unit, a camera, and actuators for modifying the position and/or orientation of the sensor. The sensor system 602 may also include additional sensors including, for example, sensors that monitor internal systems of the autopilot tool 600 (e.g., O2 monitors, fuel gauges, oil temperature, etc.). The sensor system 602 may also include other sensors. Wherein the GPS module may be any sensor for estimating the geographic location of the autopilot tool 600. To this end, the GPS module may include a transceiver that estimates the position of the autopilot tool 600 relative to the earth based on satellite positioning data. The IMU may be configured to sense changes in the position and orientation of the autopilot tool 600 based on inertial acceleration and any combination thereof. In some examples, the combination of sensors may include, for example, an accelerometer and a gyroscope. Other combinations of sensors are also possible. Radar units may be regarded as a physical detection system for detecting characteristics of an object, such as a distance, height, direction or speed of the object, using radio waves. The radar unit may be configured to transmit radio waves or microwave pulses, which may bounce off any object in the wave's path. The object may return a portion of the energy of the wave to a receiver (e.g., a dish or antenna), which may also be part of the radar unit. The radar unit may also be configured to perform digital signal processing on the received signal (as reflected off of the object) and may be configured to identify the object. Other radar-like systems have been used in other parts of the electromagnetic spectrum. One example is light detection and ranging, which may use visible light from laser light, rather than radio waves.
The camera may be used for any camera (e.g., still camera, video camera, etc.) that acquires images of the environment in which the autopilot tool 600 is located. To this end, the camera may be configured to detect visible light, or may be configured to detect light from other portions of the spectrum (such as infrared or ultraviolet light). Other types of cameras are also possible. The camera may be a two-dimensional detector, or may have a three-dimensional spatial range. In some examples, the camera may be, for example, a distance detector configured to generate a two-dimensional image indicative of a distance from the camera to a number of points in the environment. To this end, the camera may use one or more distance detection techniques. For example, the camera may be configured to use structured light technology, wherein the autopilot tool 600 illuminates objects in the environment with a predetermined light pattern, such as a grid or checkerboard pattern, and detects reflections of the predetermined light pattern from the objects using the camera. Based on the distortion in the reflected light pattern, the autopilot tool 600 may be configured to detect a distance to a point on the object. The predetermined light pattern may comprise infrared light or light of other wavelengths.
The actuator may be configured to modify the position and/or orientation of the sensor. The sensor system 602 may additionally or alternatively include components other than those shown.
The control system 603 may be configured to control the operation of the autopilot tool 600 and its components. To this end, the control system 603 may comprise a steering unit, a throttle or braking unit, etc.
The steering unit may be any combination of mechanisms configured to adjust the direction of advance or direction of the autopilot tool 600.
The throttle may be any combination of mechanisms configured to control the operating speed and acceleration of the engine/engine and, in turn, the speed and acceleration of the autopilot 600.
The braking unit may be any combination of mechanisms configured to slow down the autopilot tool 600. For example, the brake unit may use friction to slow down the wheel/tire. As another example, the brake unit may be configured to regenerate and convert the kinetic energy of the wheel/tire into an electric current. The brake unit may take other forms as well.
The control system 603 may additionally or alternatively include components other than those shown.
The processors included in computing system 604 may include one or more general purpose processors and/or one or more special purpose processors (e.g., an image processor, a digital signal processor, etc.). Where more than one processor is included, such processors can operate individually or in combination. The computing system 604 may implement functionality to control the autopilot tool 600 based on inputs received through a user interface.
The memory, in turn, may include one or more volatile memory components and/or one or more non-volatile memory components, such as optical, magnetic, and/or organic memory devices, and the memory may be fully or partially integrated with the processor. The memory may contain instructions (e.g., program logic) executable by the processor to perform various vehicle functions, including any of the functions or methods described herein.
The components of the autopilot tool 600 may be configured to operate in interconnection with other components internal and/or external to their respective systems. To this end, the components and systems of the autopilot tool 600 may be communicatively linked together via a system bus, network, and/or other connection mechanism.
In the embodiment of the present invention, the autopilot tool 600 includes various modules specifically configured to:
The computing system 604 collects signals from a real-time, changing environment, such as data from a front image sensor, lidar data, and other location images or distance sensor data, via the sensor system 602. The computing system 604 may summarize the data acquired this time by the sensor system 602 into a data vector X (t).
The computing system 604 extracts the feature vector Z(t) from X(t) using the feature extraction network, which can be expressed as: Z(t) = f(X(t), φ), where the feature extraction network may be an arbitrary neural network with φ as a parameter, such as a multi-layer DNN.
The computing system 604 reconstructs X(t) from Z(t) using the generation network, the reconstructed vector being X'(t). This can be expressed as:
X'(t) = g(Z(t), θ), where θ is a set of parameters used to define g, e.g., the generation network is a neural network with θ as a parameter.
φ and θ are updated according to the evaluation criterion E[X(t), X'(t)] and the gradient learning rule, where the correction values of the parameters φ and θ are proportional to the gradient of the evaluation criterion E, which may be a squared-error criterion, with respect to the parameters.
The computing system 604 computes an optional set of action vectors A (t) from Z (t) using the action network, which can be expressed as: a (t) =h (Z (t), β), and the action network is a neural network with β as a parameter. The action vector may include, among other things, increasing/decreasing the throttle, stepping on the brake, turning left the steering wheel, turning right the steering wheel, etc. Each element in the action vector corresponds to a specific driving action, and each vector corresponds to a specific throttle, brake, steering wheel parameter, and the number of vectors in the set of possible action vectors represents the number of possible actions.
The computing system 604 selects a particular action from the set of possible action vectors, which may be expressed as: a (t) =select (a (t)), the selection process may be implemented by a selection operation, such as a search operation.
The computing system 604 controls the autopilot tool 600 via the control system 603 to perform action a(t). After the autopilot tool 600 performs action a(t), the sensor system 602 generates an update corresponding to the data acquired from the environment: X(t+1) = a(t)[X(t)], where a(t) is an operator.
The computing system 604 obtains, through the sensor system 602, the reward and punishment feedback generated by the environment according to action a(t). For example, the reward and punishment feedback may be feedback on whether the autopilot tool 600 has crashed, violated a traffic rule, or is driving stably, and the reward-and-punishment value may be determined according to the feedback and the reward-and-punishment criterion. In one example, R may be used to represent the reward-and-punishment value; then R = −Cc may be set to represent an accident or a loss due to a traffic rule violation, and the criterion may specify a sustained reward for normal driving, for example R = Cd. In another example, the reward-and-punishment criterion may also be determined based on whether the vehicle is traveling smoothly or on more complex driving objectives. Different criteria can be set for different driving styles; for example, comfortable driving and sporty driving may correspond to different reward-and-punishment criteria.
The reward-and-punishment values may also be empirical, and on a normalized basis the current value affects the subsequent behavior of the autopilot tool 600. For example, if the feedback corresponds to a crash but the penalty determined for the crash is not large, the autopilot tool 600 may still tend to keep the crash-producing behavior as a candidate behavior.
The computing system 604 may update the parameters β, Φ based on reward and punishment feedback, action selection process, and gradient learning rules.
In updating the parameters β and Φ of the neural network, the correction value of the parameter is proportional to the gradient of the punishment function L (a (t)) with respect to the parameter. Wherein, the punishment function L is a mapping function, the input of the L is a (t), and the output is a punishment value R. The relationship of L and R may be either instantaneous, depending only on the current R; or accumulation, namely accumulation of all reward and punishment values in the whole process; or may be short-term, i.e., a weighted sum of reward and punishment values within a time window is selected.
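A small sketch of the three possible relationships between L and R described above is given below; the window length and weights are illustrative assumptions.

```python
def instantaneous_L(rewards):
    """L depends only on the current reward-and-punishment value R."""
    return rewards[-1]

def cumulative_L(rewards):
    """L is the accumulation of all reward-and-punishment values over the whole process."""
    return sum(rewards)

def windowed_L(rewards, window=10, weights=None):
    """L is a weighted sum of reward-and-punishment values inside a time window."""
    recent = rewards[-window:]
    if weights is None:
        weights = [1.0] * len(recent)
    return sum(w * r for w, r in zip(weights, recent))
```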
In the implementation of embodiments of the present invention, the autopilot tool 600 is required to undergo a training process. All operations can be firstly operated on a simulator, and the simulator not only simulates the automatic driving, but also simulates the road condition environment. After a sufficient number of simulation runs, the autopilot system will have sufficient driving capabilities, including various natural reactions in the event of anomalies.
In another example, a camera 700 implemented based on the computing system 100 is illustrated. The camera 700 may include a capture system 701 and an arithmetic system 702. The computing system 702 may include a processor, memory, and the like. The computing system 702 may be a controller or a portion of a controller of the camera 700. The memory may include instructions executable by the processor. The components of camera 700 may be configured to operate in interconnection with each other and/or with other components coupled to the systems.
The photographing system 701 includes a camera, an image capturing unit, other sensors, and the like. The camera mainly implements the photographing function and has an optical unit into which light from the imaging object (the photographed object) is input; the image capturing unit is disposed behind the optical axis of the optical unit and captures the imaging object by means of the optical unit. Other sensors may include motion sensors, such as accelerometers. In one example, the optical unit may further include a zoom lens, a correction lens, a diaphragm (aperture) mechanism, a focus lens, and the like. The zoom lens is movable in the optical axis direction by a zoom motor, and the focus lens is movable in the optical axis direction by a focus motor. The correction lens may also be controlled by a correction lens motor so that the angle of the incident light with respect to the image capturing surface remains substantially constant. The diaphragm of the diaphragm mechanism can also be controlled by a diaphragm (iris) motor. In addition, the computing system 702 may control the various motors described above via electric drivers.
The image capturing unit may include: a charge coupled device (charged coupled device, CCD) image sensor that generates an image signal of the photographic subject from the light from the optical unit; a correlated double sampling (correlated double sampling, CDS) circuit implementing correlated double sampling processing that eliminates the noise portion contained in the image signal read by the CCD image sensor; an analog-to-digital (analog to digital converter, A/D) converter that converts the analog signal processed by the CDS circuit into a digital signal; a timing signal generator (timing generator, TG) that generates a timing signal that drives the CCD image sensor; and the like.
A display unit may further be included for displaying information input by the user or provided to the user and the various menus of the photographing apparatus. The display unit may include a display panel, which may optionally be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, a touch panel may overlay the display panel; upon detection of a touch operation on or near it, the touch panel passes the operation to the processor to determine the type of touch event, and the processor then provides a corresponding visual output on the display panel based on the type of touch event. In some embodiments, the touch panel may be integrated with the display panel to implement input and output functions.
In this case, the parameter adjustment is over a combination of parameters, including basic elements such as aperture, white balance, and sensitivity, as well as more complicated elements such as composition and lighting.
In the embodiment of the present invention, the camera 700 includes various modules specifically for:
The photographing system 701 collects image signals of an external subject, and the computing system 702 may summarize the data collected by the photographing system 701 into one data vector X (t).
The computing system 702 extracts the feature Z (t) from X (t) using a feature extraction network, expressed as: z (t) =f (X (t), phi), the feature extraction network may be a neural network parameterized by phi.
The computing system 702 reconstructs X (t) from Z (t) using the generation network, which can be expressed as: x' (t) =g (Z (t), θ), where θ is a set of parameters used to define g, i.e., the generation network is a neural network with θ as a parameter.
The computing system 702 updates phi and theta based on the evaluation criteria E [ X (t), X' (t) ] and the gradient learning rules. The correction value of the parameter is proportional to the gradient of the evaluation criterion E with respect to the parameter, which may be a square error criterion.
The computing system 702 computes an optional set of action vectors a (t) from Z (t) using the action network, which can be expressed as: a (t) =h (Z (t), β), and the action network is a neural network with β as a parameter. The action vector is a combination of photographing parameters, such as a weighting parameter corresponding to R, G, B, a suggested photographing angle, a distance, a tilt angle, and the like. Each action vector corresponds to a specific photographing parameter combination, for example, a (parameter 1, parameter 2, parameter 3) = (value 1, probability 1) is one of them, and the number of action vectors in a (t) represents the number of possible parameter combinations.
The computing system 702 selects a particular action vector from the set of optional action vectors, which can be expressed as: a (t) =select (a (t)).
The arithmetic system 702 controls the photographing system 701 to perform photographing according to a (t). The environment acquired by the photographing system 701 at this time is updated as follows: x (t+1) =a (t) [ X (t) ]. The X (t+1) may be a result of photographing performed by the camera 700 according to a (t). In one example, camera 700 may take multiple pictures (e.g., 3 pictures) at a time for selection by a user.
The computing system 702 obtains the reward and punishment feedback generated according to a(t). The reward and punishment feedback may be the user's action on the photographed picture, such as a delete, save, or share action. The reward-and-punishment value corresponding to the feedback can be determined according to a reward-and-punishment rule: for example, if the user triggers a delete action and deletes the photo, a penalty is determined (the value is negative), which may be expressed as R = −Ce; if a keep or share action is triggered, a reward is determined (the value is positive), which may be denoted as R = Ck or R = Cs. Here Ck and Cs may be different reward values, e.g., Ck corresponds to the reward for keeping a photo and Cs to the reward for sharing a photo, and the reward for sharing may be higher than the reward for keeping. In addition, several reward-and-punishment factors influencing the subjective evaluation of the photo, such as composition, lighting, and white balance, may also serve as the basis of the reward-and-punishment rule.
The specific determination of the reward and punishment value may be empirical, and the magnitude of the reward and punishment influences the behavior of the system in the future on a normalized basis, e.g., if the penalty caused by deleting a photo is not too great, the system may tend to hold more photos that are not too good.
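As an illustrative sketch only, the mapping from user actions to reward-and-punishment values described above could look as follows; the numeric values of Ce, Ck, and Cs are assumptions.

```python
# assumed constants: penalty for deletion, rewards for keeping and sharing
C_E, C_K, C_S = 1.0, 0.5, 1.5

def photo_reward(user_action: str) -> float:
    """Map the user's handling of a photo to a reward-and-punishment value R."""
    if user_action == "delete":
        return -C_E            # R = -Ce: deletion is a penalty
    if user_action == "keep":
        return C_K             # R = Ck: keeping the photo is a reward
    if user_action == "share":
        return C_S             # R = Cs: sharing is a (typically higher) reward
    return 0.0                 # other actions carry no feedback in this sketch
```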
The computing system 702 updates the parameters β and φ according to the reward and punishment conditions and the gradient learning rule. The correction value of a parameter is proportional to the gradient of the punishment function L(a(t)) with respect to that parameter.
In addition, in the embodiment of the invention, for photographing, the resulting rewards and punishments are not necessarily limited to specific users, and can be guided by judgment of experienced photographers based on comprehensive preference of user groups or a large number of users. Finally, the settings of the camera 700 may balance the best photographing experience with the personal preferences of a particular user.
By the embodiment of the invention, the environment is recognized and the machine learning is performed by combining the environment interaction with the environment, so that the environment can be better understood, and the optimal action selection can be made according to the understanding of the environment. In addition, the feature extraction and action selection can be based on memory space, and more accords with the basic rule of intelligent system behavior. By adjusting the action network with an environmental reward and punishment mechanism, the parameters of the feature extraction network can also be updated, which is advantageous for improving the understanding of the concept from the outcome of the action, so-called "practice awareness".
Fig. 8 is a flowchart of an operation method according to an embodiment of the present invention. The method is applicable to an arithmetic system, which may be any one of the systems shown in fig. 1-7, an autopilot tool shown in fig. 6, or a camera shown in fig. 7, as will be understood with reference to each other. As shown in fig. 8, the method specifically includes the following steps:
And S810, acquiring the current data vector based on the environment.
Wherein environmental data may be acquired by a sensor (e.g., the sensor system of fig. 6 or the camera system of fig. 7), and a data vector may be obtained by preprocessing based on the environmental data. For example, preprocessing herein includes, but is not limited to, data cleansing, data integration, data transformation, data reduction, and the like.
The present environmental data may be data in a specified time window, or may be data at a moment, where the length of the specified time window may be determined according to actual needs, for example, the time window may include all historic times.
In addition, the data vector may be acquired in real-time or periodically.
S820, extracting the current feature vector according to one or more data vectors, wherein the one or more data vectors comprise the current data vector.
The extraction method of the feature vector may include various methods, in one example, the feature vector is related to the current data vector only, and is irrelevant to the historical data vector, and at this time, the current data vector may be directly mapped to the current feature vector.
In another example, one case is that the feature vector is related not only to the current data vector, but also to the historical data vector. For example, the present feature vector may be extracted by a first operation based on one or more data vectors. The first operation may be implemented by a neural network, e.g., RNN.
The extraction of the feature vector in S820 may be described with reference to fig. 2-5 (e.g., operation 2121 or the feature extraction network described above).
And S830, optimizing the feature vector extraction mode according to one or more data vectors and the feature vector.
Optimization of the feature vector extraction approach may include a variety of approaches. For example, optimization can be performed through reconstruction and gradient learning rules, or through a biologically inspired process (spike-timing-dependent plasticity, STDP) that adjusts the strength of connections between neurons according to the order in which they fire.
In one example, the data vector may be reconstructed from the current feature vector. And optimizing a feature extraction mode according to errors and gradient descent operation between the reconstructed data vector and the current data vector and an evaluation criterion. For example, a data vector is generated by a second operation based on the present feature vector, and the first operation and the second operation are optimized based on an error of the generated data vector and one or more data vectors. Wherein the second operation may be implemented by a neural network, e.g. by RNN.
The process of optimizing the feature extraction mode according to the error between the reconstructed data vector and the current data vector, gradient descent operation and evaluation criteria is an iterative process, a threshold can be set, and when the error is smaller than the threshold, the iteration is terminated.
Wherein S810-S830 may be implemented by the feature extraction unit 210 in connection with the embodiments shown in the foregoing fig. 2-7, which may be understood with reference to each other.
S840, determining the current action vector according to one or more feature vectors, wherein the one or more feature vectors comprise the current feature vector.
The manner in which the action is generated may include a variety of ways. In one example, the current action vector is related to the current feature vector only, and is not related to the historical feature vector, and at this time, the current feature vector may be directly mapped to the current action vector.
In another example, the action vector is related not only to the current feature vector but also to historical feature vectors. For example, the current action vector may be obtained through a third operation mapping according to one or more feature vectors. The third operation may be implemented by a neural network, for example an RNN.
Wherein the third operation includes an operation through the first neural network and a selection operation; s840 may be specifically implemented by the following steps:
determining one or more pending action vectors by operation of a first neural network;
selecting an optimal action vector from one or more pending action vectors as the action vector through selection operation, for example, selecting one action vector from the one or more pending action vectors for multiple times, performing simulation operation, and selecting the optimal action vector in the simulation operation result as the action vector. In addition, the selection operation also includes a search operation.
Wherein, the generation of the action in S840 may be described with reference to fig. 2-7 (e.g., operation 2221 or action network described above).
S850, the current action vector is acted on the environment, so that the feature extraction unit obtains the next data vector based on the environment acted by the current action vector.
In the specific execution process of the computing system, the action is continuously determined according to the environment, the environment is influenced by executing the action, and then the action is determined according to the influenced environment.
Applying the action vector to the environment can be understood as the process in which the computing system executes the action, or controls an intelligent device to execute the action. For applying the current action vector to the environment in S850, reference may be made to the descriptions in fig. 2-7 relating to applying the current action vector to the environment or executing the current action vector.
S860, obtaining current reward and punishment feedback based on the environment, wherein the current reward and punishment feedback is generated by one or more action vectors acting on the environment, and the one or more action vectors comprise the current action vector.
After an action vector acts on the environment, the environment generates reward and punishment feedback, and this feedback is not limited to the result of the current action; it may be the combined result of all or part of the preceding action sequence. For example, when an automatically driven automobile faces an obstacle, it selects a left turn and thereby avoids the obstacle, whereas a right turn would have led to an accident; the environment ahead then no longer contains the obstacle, and this situation can be considered as the environment generating a reward feedback. Punishment feedback is generated in the same way. For example, the current reward-and-punishment value can be obtained from the current reward and punishment feedback through a mapping, and the third operation can be optimized according to the current reward-and-punishment value.
Wherein, in S860, the penalty feedback can be referred to as the related description in fig. 2-7.
And S870, optimizing the action vector determination mode according to the current reward and punishment feedback.
The manner of optimizing the action vector determination may take various forms, for example optimization based on reward and punishment feedback, gradient descent operations, and evaluation criteria. The action vector determination may also be optimized based on the deviation between the true value and the expected value of the action result, gradient descent operations, and evaluation criteria.
Wherein, the optimization action vector determination method in S870 may refer to the optimization action vector determination method related description in fig. 2-7 (e.g., the optimization operation 2221, or the optimization action network, etc.).
In addition, S840-S870 may be implemented by action generating unit 220 in connection with the embodiments illustrated in FIGS. 2-5 described above, as will be understood with reference to each other.
In one embodiment, the method further comprises the following:
and optimizing the feature vector extraction mode according to the feedback of the current reward and punishment.
In order to avoid confusion caused by optimization, the feature vector extraction mode can be optimized according to the one or more data vectors and the current feature vector according to a first probability; according to the second probability, the feature vector extraction mode is optimized according to the current reward and punishment feedback; wherein the sum of the first probability and the second probability is 1.
In another embodiment, the method further comprises: learning is performed in advance based on one or more training data vectors. And pre-learning based on the one or more training feature vectors. Wherein the pre-learning based on the one or more training data vectors and the pre-learning based on the one or more training feature vectors are performed separately in time; or the pre-learning from one or more training data vectors is performed simultaneously in time with the pre-learning from one or more training feature vectors.
In another embodiment, a plurality of computing systems may further be configured, and the plurality of computing systems operate following a survival-of-the-fittest rule. Specifically, the method further includes: the computing system determines a reward-and-punishment accumulated value according to the reward and punishment feedback of the environment during operation, wherein the accumulated value is increased if the current feedback is a reward and decreased if the current feedback is a punishment; computing systems whose accumulated value is higher than a first threshold are duplicated; computing systems whose accumulated value is lower than a second threshold are eliminated.
In another embodiment, the action is not only an action affecting the environment, but may also be an action affecting the learning process itself. For example, modifications to the network structure, branch pruning, and parameter modifications may all be defined as possible actions. Based on this, the method further includes: adjusting the manner of optimizing the feature vector extraction or the manner of optimizing the action vector determination according to the action vector.
By the embodiment of the invention, the environment is recognized and the machine learning is performed by combining the environment interaction with the environment, so that the environment can be better understood, and the optimal action selection can be made according to the understanding of the environment. In addition, the feature extraction and action selection can be based on memory space, and more accords with the basic rule of intelligent system behavior. By adjusting the action network with an environmental reward and punishment mechanism, the parameters of the feature extraction network can also be updated, which is advantageous for improving the understanding of the concept from the outcome of the action, so-called "practice awareness".
Fig. 9 is a schematic structural diagram of an arithmetic device according to an embodiment of the present application. As shown in fig. 9, the computing device 900 includes a transceiver 901, a processor 902, and a memory 903, the transceiver 901 being configured to receive data of a data bus; the memory 903 is used for storing programs and data; the processor 902 is configured to execute the program stored in the memory 903 and read the data stored in the memory 903, so as to execute steps S820-S840 and S860-S870 in fig. 8, and control the transceiver 901 to execute steps S810 and S850. Wherein the computing system as described in fig. 2-7 may be implemented by the computing device 900.
The embodiment of the invention provides a chip device, which comprises a processor and a memory; the memory is used for storing programs; the processor runs the program to perform the method and/or steps described above with respect to fig. 8.
In an embodiment of the present invention, the chip device may be a chip running in an operation device, where the chip includes: a processing unit and a communication unit, the processing unit may be, for example, a processor, which may be a processor of the various types described above. The communication unit may be, for example, an input/output interface, pins or circuitry, etc., which comprises a system bus. Optionally, the chip further includes a memory unit, which may be a memory inside the chip, such as a register, a cache, a random access memory (random access memory, RAM), an EEPROM, or a FLASH, etc.; the memory unit may also be a memory located outside the chip, which may be of the various types of memory described above. A processor is coupled to the memory and is operable to execute instructions stored in the memory to cause the chip apparatus to perform the method of fig. 8 described above.
In the foregoing embodiments of the invention, it may be implemented in whole or in part in software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present invention, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable medium to another computer-readable medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line (digital subscriber line, DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), etc.
The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims (31)

1. An arithmetic system for use with an autopilot tool, comprising:
The feature extraction unit is used for acquiring a current data vector based on the environment, wherein data in the data vector comprises image data; extracting a current feature vector according to one or more data vectors, wherein the one or more data vectors comprise the current data vector; and optimizing the feature extraction unit according to the one or more data vectors and the current feature vector;
The action generating unit is used for determining a current action vector according to the one or more feature vectors extracted by the feature extracting unit, wherein the one or more feature vectors comprise the current feature vector, and the action vector is an action affecting the environment and comprises an accelerator parameter, a brake parameter and/or a steering wheel parameter; the motion vector is acted on the environment, so that the feature extraction unit obtains the next data vector based on the environment acted by the motion vector; acquiring current reward and punishment feedback based on an environment, wherein the current reward and punishment feedback is a result generated by one or more action vectors acting on the environment, and the one or more action vectors comprise the current action vector; optimizing the action generating unit according to the current reward and punishment feedback;
the feature extraction unit optimizes the feature extraction unit according to a first probability, and the action generation unit optimizes the feature extraction unit according to a second probability, wherein the sum of the first probability and the second probability is 1.
2. The system of claim 1, wherein the action generating unit is further configured to optimize the feature extraction unit based on the current reward and punishment feedback.
3. The system according to claim 1 or 2, wherein the feature extraction unit is further configured to learn in advance based on one or more training feature vectors; the training feature vector is a feature vector extracted from an input vector, and the input vector comprises data simulating a front image sensor, laser radar data and images of other parts or data of a distance sensor.
4. A system according to claim 3, wherein the action generating unit is further adapted to learn in advance based on one or more training feature vectors predetermined by the feature extraction unit.
5. The system of claim 4, wherein the feature extraction unit learns separately in time from the action generating unit; or the feature extraction unit learns simultaneously in time with the action generation unit.
6. The system according to claim 1 or 2, characterized in that the system comprises one or more subsystems, each subsystem comprising a feature extraction unit and an action generation unit, respectively; determining a reward and punishment accumulated value according to the reward and punishment feedback of the environment in the operation process of the subsystem, wherein the reward and punishment accumulated value is increased if the current reward and punishment feedback is rewarded, and the reward and punishment accumulated value is reduced if the current reward and punishment feedback is punishment; subsystems with accumulated prize and punishment values higher than a first threshold are duplicated; subsystems with prize and punishment accumulated values lower than the second threshold are eliminated.
7. The system according to claim 1, wherein the feature extraction unit is specifically configured to extract the feature vector through a first operation according to one or more data vectors; and generating a data vector through a second operation according to the characteristic vector, and optimizing the first operation and the second operation according to errors of the generated data vector and the one or more data vectors.
8. The system according to claim 7, wherein the action generation unit is specifically configured to map the one or more feature vectors extracted by the feature extraction unit to the current action vector through a third operation; acquire current reward and punishment feedback based on the environment and map the current reward and punishment feedback to a current reward and punishment value; and optimize the third operation and the first operation according to the current reward and punishment value.
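Claims 7 and 8 can be read as an encoder/decoder pair trained on reconstruction error plus a reward-driven mapping to actions. The numerical sketch below assumes all three operations are plain linear maps and uses a crude reward-weighted update in place of whatever optimization the patent actually employs; the dimensions, learning rate and variable names are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
d, k, m = 8, 4, 3                          # data, feature and action dimensions (arbitrary)
W_e = rng.normal(size=(k, d))              # first operation: data vector -> feature vector
W_d = rng.normal(size=(d, k))              # second operation: feature vector -> regenerated data vector
W_a = rng.normal(size=(m, k))              # third operation: feature vector -> pending action scores
lr = 0.01

def step(x, reward):
    global W_e, W_d, W_a
    h = W_e @ x                            # current feature vector
    recon = W_d @ h                        # regenerated data vector
    err = recon - x                        # error between generated and original data vector
    grad_d = 2 * np.outer(err, h)          # gradient of the squared reconstruction error w.r.t. W_d
    grad_e = 2 * np.outer(W_d.T @ err, x)  # gradient of the same error w.r.t. W_e
    W_d -= lr * grad_d                     # optimize the second operation
    W_e -= lr * grad_e                     # optimize the first operation
    scores = W_a @ h
    a = int(np.argmax(scores))             # index of the chosen pending action vector
    W_e += lr * reward * np.outer(W_a[a], x)   # crude reward-driven nudge of the first operation
    W_a[a] += lr * reward * h                  # reward-weighted update of the third operation
    return a

step(rng.normal(size=d), reward=1.0)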
9. The system according to claim 8, wherein the third operation includes an operation through a neural network for mapping the one or more feature vectors extracted by the feature extraction unit to a plurality of pending action vectors, and a selection operation for selecting an optimal one of the plurality of pending action vectors as the current action vector.
10. The system of claim 9, wherein the selection operation further comprises a search operation.
11. The system of claim 10, wherein the search operation is specifically configured to select one action vector from the plurality of pending action vectors, perform a simulation operation on it, and select the optimal action vector from the simulation results as the current action vector.
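The search operation of claims 10-11 amounts to trying pending actions in a simulation and keeping the best outcome. The sketch below assumes a scalar-scoring simulate callable and a fixed number of rollouts; both are illustrative assumptions, not details from the patent.

from typing import Callable, Sequence, TypeVar

A = TypeVar("A")

def search_best_action(pending_actions: Sequence[A],
                       simulate: Callable[[A], float],
                       rollouts: int = 4) -> A:
    """Simulate each pending action a few times and return the best-scoring one."""
    def score(action: A) -> float:
        return sum(simulate(action) for _ in range(rollouts)) / rollouts
    return max(pending_actions, key=score)

# Toy usage: a simulator that penalizes large steering values (purely illustrative).
best = search_best_action([-0.5, 0.0, 0.5], simulate=lambda steer: -abs(steer))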
12. The system of any of claims 8-11, wherein the first operation, the second operation, or the third operation comprises an operation through a recurrent neural network RNN.
13. The system according to claim 1 or 2, wherein the action generation unit is further configured to adjust the manner of optimizing the feature extraction unit or the action generation unit according to the current action vector.
14. The system of claim 1 or 2, wherein the system is further applied to a camera or a robot, the action vector being an action affecting the learning manner itself, the action comprising network structure modification, branch pruning, and parameter modification.
15. A computing method for a computing system of an autonomous driving tool, comprising:
acquiring a current data vector based on the environment, wherein data in the data vector comprises image data;
extracting a current feature vector according to one or more data vectors, wherein the one or more data vectors comprise the current data vector;
optimizing a feature vector extraction manner according to the one or more data vectors and the current feature vector;
determining a current action vector according to one or more feature vectors, wherein the one or more feature vectors comprise the current feature vector, and the action vector is an action affecting the environment and comprises an accelerator parameter, a brake parameter and/or a steering wheel parameter;
applying the action vector to the environment, so that the next data vector is acquired based on the environment acted on by the action vector;
acquiring current reward and punishment feedback based on the environment, wherein the current reward and punishment feedback is a result produced by one or more action vectors acting on the environment, and the one or more action vectors comprise the current action vector; and
optimizing a manner of determining the action vector according to the reward and punishment feedback;
wherein the optimizing a feature vector extraction manner according to the one or more data vectors and the current feature vector comprises: optimizing the feature vector extraction manner according to the one or more data vectors and the current feature vector according to a first probability; and the optimizing of the feature vector extraction manner according to the current reward and punishment feedback comprises: optimizing the feature vector extraction manner according to the current reward and punishment feedback according to a second probability;
wherein the sum of the first probability and the second probability is 1.
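Read as a control loop, method claim 15 alternates perception, action and the two probability-weighted optimization paths. The sketch below fixes only the ordering of the steps; env, feature_extractor and action_generator are placeholder objects whose methods are assumptions, not APIs defined by the patent.

import random

def run(env, feature_extractor, action_generator, steps=100, first_probability=0.7):
    data_history, feature_history = [], []
    data = env.observe()                                   # acquire current data vector (image data, etc.)
    for _ in range(steps):
        data_history.append(data)
        feature = feature_extractor.extract(data_history)  # extract current feature vector
        feature_history.append(feature)
        action = action_generator.decide(feature_history)  # accelerator / brake / steering wheel parameters
        data = env.apply(action)                           # apply the action, acquire the next data vector
        reward = env.feedback()                            # current reward and punishment feedback
        action_generator.optimize(reward)                  # optimize the manner of determining the action vector
        if random.random() < first_probability:            # first probability: optimize extraction from data/feature
            feature_extractor.optimize(data_history, feature)
        else:                                              # second probability = 1 - first probability
            feature_extractor.optimize_from_reward(reward)
    return data_history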
16. The method of claim 15, wherein optimizing the feature vector extraction manner according to the one or more data vectors and the current feature vector further comprises: learning in advance according to one or more training feature vectors; the training feature vector is a feature vector extracted from an input vector, and the input vector comprises data simulating a front image sensor, lidar data, images from other positions, or data from a distance sensor.
17. The method of claim 16, wherein optimizing the feature vector extraction manner according to the current reward and punishment feedback further comprises: learning in advance based on one or more training feature vectors.
18. The method of claim 17, wherein the manner of optimizing feature vector extraction according to the one or more data vectors and the current feature vector and the manner of optimizing feature vector extraction according to the current reward and punishment feedback are learned separately in time, or are learned simultaneously in time.
19. The method as recited in claim 15, further comprising:
determining a reward and punishment accumulated value according to the reward and punishment feedback of the environment during operation, wherein the reward and punishment accumulated value is increased if the current reward and punishment feedback is a reward and decreased if the current reward and punishment feedback is a punishment;
duplicating the computing system if its reward and punishment accumulated value is higher than a first threshold; and
eliminating the computing system if its reward and punishment accumulated value is lower than a second threshold.
20. The method of claim 15, wherein extracting the current feature vector from the one or more data vectors comprises extracting the current feature vector from the one or more data vectors via a first operation;
and the optimizing of the feature vector extraction manner according to the one or more data vectors and the current feature vector comprises: generating a data vector through a second operation according to the feature vector, and optimizing the first operation and the second operation according to the error between the generated data vector and the one or more data vectors.
21. The method of claim 20, wherein determining the current action vector according to the one or more feature vectors comprises obtaining the current action vector by mapping the one or more feature vectors through a third operation;
and optimizing the manner of determining the action vector according to the current reward and punishment feedback comprises: mapping the current reward and punishment feedback to a current reward and punishment value, and optimizing the third operation and the first operation according to the current reward and punishment value.
22. The method of claim 21, wherein the third operation comprises an operation through a first neural network and a selection operation; and obtaining the current action vector by mapping the one or more feature vectors through the third operation comprises:
Determining one or more pending action vectors by operation of the first neural network;
and selecting the optimal one from the one or more pending action vectors as the current action vector through the selection operation.
23. The method of claim 22, wherein the selection operation further comprises a search operation.
24. The method of claim 23, wherein selecting an optimal one of the one or more pending action vectors as the current action vector by the selecting operation comprises:
selecting, a plurality of times, one action vector from the one or more pending action vectors, performing a simulation operation on each selected action vector, and selecting the optimal action vector from the simulation results as the current action vector.
25. The method of any of claims 21-24, wherein the first operation, the second operation, or the third operation comprises an operation through a recurrent neural network RNN.
26. The method as recited in claim 15, further comprising: adjusting the manner of optimizing feature vector extraction or the manner of determining the action vector according to the action vector.
27. The method of claim 15, wherein the computing system is further applied to a camera or a robot, and the action vector is an action affecting the learning manner itself, comprising network structure modification, branch pruning, and parameter modification.
28. A computing device, comprising a processor and a memory, wherein the memory is configured to store a program, and the processor is configured to execute the program stored in the memory to control the computing device to perform the method of any one of claims 15-27.
29. An autonomous driving tool, characterized by comprising a propulsion system, a sensor system, a control system and a computing system, wherein the propulsion system is configured to provide power for the autonomous driving tool, and the computing system is configured to control the sensor system to acquire a current data vector based on the environment, wherein data in the data vector comprises image data;
the computing system is further configured to extract a current feature vector according to one or more data vectors, wherein the one or more data vectors comprise the current data vector;
the computing system is further configured to optimize a feature vector extraction manner according to the one or more data vectors and the current feature vector, comprising: optimizing the feature vector extraction manner according to the one or more data vectors and the current feature vector according to a first probability;
the computing system is further configured to determine a current action vector according to one or more feature vectors, wherein the one or more feature vectors comprise the current feature vector, and the action vector is an action affecting the environment and comprises an accelerator parameter, a brake parameter and/or a steering wheel parameter;
the computing system is further configured to control the control system to apply the current action vector to the environment, so as to acquire the next data vector based on the environment acted on by the current action vector;
the computing system is further configured to control the sensor system to acquire current reward and punishment feedback based on the environment, wherein the current reward and punishment feedback is a result produced by one or more action vectors acting on the environment, and the one or more action vectors comprise the current action vector;
the computing system is further configured to optimize a manner of determining the action vector according to the current reward and punishment feedback, comprising: optimizing the feature vector extraction manner according to the current reward and punishment feedback according to a second probability;
wherein the sum of the first probability and the second probability is 1.
30. A camera, characterized by comprising a shooting system and a computing system;
the computing system is configured to control the shooting system to acquire a current data vector based on the environment, wherein data in the data vector comprises image data;
the computing system is further configured to extract a current feature vector according to one or more data vectors, wherein the one or more data vectors comprise the current data vector;
the computing system is further configured to optimize a feature vector extraction manner according to the one or more data vectors and the current feature vector, comprising: optimizing the feature vector extraction manner according to the one or more data vectors and the current feature vector according to a first probability;
the computing system is further configured to determine a current action vector according to one or more feature vectors, wherein the one or more feature vectors comprise the current feature vector, and the actions in the action vector are actions affecting the environment and comprise accelerator parameters, brake parameters and/or steering wheel parameters;
the computing system is further configured to control the shooting system to apply the current action vector to the environment, so as to acquire the next data vector based on the environment acted on by the current action vector;
the computing system is further configured to acquire current reward and punishment feedback based on the environment, wherein the current reward and punishment feedback is a result produced by one or more action vectors acting on the environment, and the one or more action vectors comprise the current action vector;
the computing system is further configured to optimize a manner of determining the action vector according to the current reward and punishment feedback, comprising: optimizing the feature vector extraction manner according to the current reward and punishment feedback according to a second probability;
wherein the sum of the first probability and the second probability is 1.
31. A computer readable storage medium comprising computer readable instructions which, when read and executed by a computer, cause the computer to perform the method of any of claims 15-27.
CN201810789039.4A 2018-07-18 2018-07-18 Computing system and method Active CN110738221B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810789039.4A CN110738221B (en) 2018-07-18 2018-07-18 Computing system and method

Publications (2)

Publication Number Publication Date
CN110738221A CN110738221A (en) 2020-01-31
CN110738221B (en) 2024-04-26

Family

ID=69233637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810789039.4A Active CN110738221B (en) 2018-07-18 2018-07-18 Computing system and method

Country Status (1)

Country Link
CN (1) CN110738221B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2899669A1 (en) * 2014-01-22 2015-07-29 Honda Research Institute Europe GmbH Lane relative position estimation method and system for driver assistance systems
DE202016004627U1 (en) * 2016-07-27 2016-09-23 Google Inc. Training a neural value network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017158058A1 (en) * 2016-03-15 2017-09-21 Imra Europe Sas Method for classification of unique/rare cases by reinforcement learning in neural networks
CN107169567A (en) * 2017-03-30 2017-09-15 深圳先进技术研究院 The generation method and device of a kind of decision networks model for Vehicular automatic driving
CN107506830A (en) * 2017-06-20 2017-12-22 同济大学 Towards the artificial intelligence training platform of intelligent automobile programmed decision-making module
CN107479547A (en) * 2017-08-11 2017-12-15 同济大学 Decision tree behaviour decision making algorithm based on learning from instruction
CN107609502A (en) * 2017-09-05 2018-01-19 百度在线网络技术(北京)有限公司 Method and apparatus for controlling automatic driving vehicle
CN108288094A (en) * 2018-01-31 2018-07-17 清华大学 Deeply learning method and device based on ambient condition prediction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Smart Driving Assistance Systems: Designing and Evaluating Ecological and Conventional Displays; Stewart A. Birrell et al.; Driver Distraction and Inattention; pp. 373-388 *
Vehicle lane-change assisted driving decision method based on dynamic probability grid and Bayesian decision network; Hui Fei; Mu Kenan; Zhao Xiangmo; Journal of Traffic and Transportation Engineering (No. 02); pp. 152-162 *


Similar Documents

Publication Publication Date Title
CN110850861B (en) Attention-based hierarchical lane-changing depth reinforcement learning
CN113392935B (en) Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism
CN107169567B (en) Method and device for generating decision network model for automatic vehicle driving
CN112232490B (en) Visual-based depth simulation reinforcement learning driving strategy training method
EP3719603B1 (en) Action control method and apparatus
CN111137292A (en) Spatial and temporal attention based deep reinforcement learning for hierarchical lane change strategies for controlling autonomous vehicles
CN112965499A (en) Unmanned vehicle driving decision-making method based on attention model and deep reinforcement learning
CN110745136A (en) Driving self-adaptive control method
US11840261B2 (en) Ground truth based metrics for evaluation of machine learning based models for predicting attributes of traffic entities for navigating autonomous vehicles
CN110956154A (en) Vibration information terrain classification and identification method based on CNN-LSTM
Zong et al. Obstacle avoidance for self-driving vehicle with reinforcement learning
CN111874007A (en) Knowledge and data drive-based unmanned vehicle hierarchical decision method, system and device
CN115033022A (en) DDPG unmanned aerial vehicle landing method based on expert experience and oriented to mobile platform
Cai et al. A driving fingerprint map method of driving characteristic representation for driver identification
CN113561995B (en) Automatic driving decision method based on multi-dimensional reward architecture deep Q learning
CN114179830A (en) Autonomous overtaking method and system for automatic driving vehicle
CN110738221B (en) Computing system and method
CN113379654A (en) Block discriminator for dynamic routing
Jaladi et al. End-to-end training and testing gamification framework to learn human highway driving
Pak et al. CarNet: A dynamic autoencoder for learning latent dynamics in autonomous driving tasks
CN115700626A (en) Reward function for a vehicle
US11603119B2 (en) Method and apparatus for out-of-distribution detection
CN114549610A (en) Point cloud data processing method and related device
Tiwari et al. Deep learning based lateral control system
Liu et al. Personalized Automatic Driving System Based on Reinforcement Learning Technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant