CN110738221A - operation system and method - Google Patents

operation system and method

Info

Publication number
CN110738221A
CN110738221A
Authority
CN
China
Prior art keywords: action, vector, current, reward, vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810789039.4A
Other languages
Chinese (zh)
Other versions
CN110738221B (en)
Inventor
费旭东
邹斯骋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201810789039.4A
Publication of CN110738221A
Application granted
Publication of CN110738221B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Feedback Control In General (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the application provides a computing system and method. The computing system comprises two components: one component is a feature extraction unit for recognizing the environment and effectively extracting conceptual features of the environment, and the other component is an action generation unit that interacts with the environment. During computation by the feature extraction unit and the action generation unit, the action generation unit inherits and uses the feature extraction unit's ability to obtain concepts, and can even influence that ability.

Description

operation system and method
Technical Field
The embodiment of the application relates to the field of machine learning, and in particular to an operation system and method.
Background
Deep learning algorithms have been used with great success and are in the process of rapid development. The main directions at present are Back Propagation (BP) operation, unsupervised learning operation, weakly supervised learning operation, and the like.
Given enough labeled samples, the BP operation can automatically learn any complex mapping function that is represented by the samples and defined by a set of parameters.
However, this method requires a large amount of manually labeled data samples, which is not only costly; the limitations of manual labeling also restrict the adaptability of the obtained model and its ability to solve more complex problems.
For this reason, the industry has shifted the center of gravity to the direction of unsupervised learning operations and weakly supervised learning operations.
In one approach, a mapping (called an encoder) converts the sample vector of the explicit space into a sample vector of the implicit space, so that a complex distribution in the explicit space, such as a distribution over a complex manifold, is converted into a simple distribution in the implicit space, such as a Gaussian distribution.
Accordingly, in implementation this kind of method usually needs to introduce another transform (a decoder) to convert the vector of the implicit space back into a vector of the explicit space, which is called the generation process.
If the two mappings are good enough, the reconstructed explicit-space vector should match the original vector. This is clearly the ideal effect; the capability obtained by observation depends on the coverage of the training samples, and when sample coverage is insufficient the reliability is not high.
Disclosure of Invention
The embodiment of the application provides a computing system and method so that the machine learning capability is more reliable.
In a first aspect, an operation system is provided, which includes two parts: one part is a feature extraction unit for recognizing the environment and effectively extracting conceptual features of the environment, and the other part is an action generation unit that interacts with the environment. During operation, the action generation unit inherits and uses the feature extraction unit's ability to obtain concepts, and can even influence that ability.
According to the embodiment of the invention, recognition of the environment and interaction with the environment can be combined for machine learning, so that the environment is better understood and the optimal action selection can be made according to that understanding, which makes the learning capability more reliable.
In an optional implementation, the feature extraction unit is configured to obtain the current data vector based on the environment, extract the current feature vector based on one or more data vectors, where the one or more data vectors include the current data vector, and optimize the feature extraction unit based on the one or more data vectors and the current feature vector;
the action generation unit is configured to determine the current action vector according to one or more feature vectors extracted by the feature extraction unit, where the one or more feature vectors include the current feature vector; apply the current action vector to the environment, so that the feature extraction unit obtains the next data vector based on the environment after the current action vector has acted on it; obtain the current reward and punishment feedback based on the environment, where the current reward and punishment feedback is generated by one or more action vectors acting on the environment, and the one or more action vectors include the current action vector; and optimize the action generation unit according to the current reward and punishment feedback.
By the embodiment of the invention, the understanding of the environment and the interaction with the environment are combined to carry out machine learning, so that the environment can be better understood and the optimal action selection can be made according to the understanding of the environment.
In another optional implementation, the action generation unit is further configured to optimize the feature extraction unit according to the current reward and punishment feedback.
The embodiment of the invention can optimize the knowledge of the environment through environmental feedback, so as to better understand the environment and in turn select actions better.
In another alternative implementation, the feature extraction unit is optimized according to a first probability, the action generation unit is optimized according to a second probability, and the sum of the first probability and the second probability is 1.
According to the embodiment of the invention, the optimization of the feature extraction unit and the optimization of the action generation unit can be performed in an orderly way according to the specified probabilities based on the environmental feedback, which reduces the possibility of conflict between the two optimizations.
In another alternative implementation, the feature extraction unit is further configured to learn in advance based on one or more training data vectors, which may be determined based on simulated environment information or based on pre-acquired real environment information; the action generation unit is further configured to learn in advance based on one or more training feature vectors, which are predetermined by the feature extraction unit. The feature extraction unit and the action generation unit may learn separately in time, or simultaneously; in the case of simultaneous learning, the feature extraction unit optimizes the feature extraction unit according to a first probability, and the action generation unit optimizes the feature extraction unit according to a second probability.
The feature extraction unit is trained (or learns) in advance so that it has an initial recognition of the environment. When the feature extraction unit is trained alone, it is in an observation state; by continuously observing the environment, it comes to recognize the environment and master the concepts of the environment, so that in actual use it can effectively extract features from the environment.
In another optional implementation, the operation system includes one or more subsystems, each subsystem including a feature extraction unit and an action generation unit. Each subsystem can be regarded as an individual, and the operation system as a population. Each subsystem follows a survival-of-the-fittest mechanism: for example, a reward and punishment accumulated value is determined according to the reward and punishment feedback of the environment during the subsystem's operation, where the accumulated value is increased if the feedback is a reward and decreased if the feedback is a penalty; subsystems whose accumulated value is higher than a first threshold are copied, and subsystems whose accumulated value is lower than a second threshold are eliminated.
By the embodiment of the invention, the reliability of the learning ability of the computing system can be further improved through a competitive screening mechanism of the population.
In another optional implementation, the feature extraction unit is specifically configured to extract the current feature vector from one or more data vectors via a first operation, generate a data vector from the current feature vector via a second operation, and optimize the first operation and the second operation based on the error between the generated data vector and the one or more data vectors.
According to the embodiment of the invention, optimization of the feature extraction unit can be realized by reconstructing the data vector and evaluating the reconstruction error.
In another optional implementation, the action generation unit is specifically configured to obtain the current action vector through a third operation according to one or more feature vectors extracted by the feature extraction unit, obtain the current reward and punishment feedback based on the environment, map the current reward and punishment feedback to a current reward and punishment value, and optimize the third operation and the first operation according to the current reward and punishment value.
According to the embodiment of the invention, the optimization of the feature extraction unit and the action generation unit can be realized according to the reward and punishment value by mapping the reward and punishment feedback to the reward and punishment value.
In another optional implementation, the third operation includes an operation through a neural network for mapping the extracted one or more feature vectors to one or more pending action vectors, and a selection operation for selecting an optimal one from the one or more pending action vectors as the current action vector.
In another optional implementation, the selection operation further includes a search operation, such as a Monte Carlo tree search (MCTS) operation, where the search operation is specifically configured to make multiple selections from the one or more pending action vectors, perform a simulation operation for each selected action vector, and select the optimal action vector among the simulation results as the current action vector.
Thus, the information provided by the action network can be utilized to the maximum extent. In addition, the optimal selection path and its hypothetical result can be used as a basis for optimizing the action generation unit.
In another alternative implementation, the first operation, the second operation, or the third operation includes an operation through a recurrent neural network (RNN).
In another alternative implementation, the action vector can represent actions that affect the environment and can also represent actions that affect the learning mode itself.
In another alternative implementation, the aforementioned computing system is applied to a camera (e.g., a digital camera, a cell phone or tablet with a camera function, etc.), a robot (e.g., a sweeping robot), or an autonomous driving tool (e.g., an autonomous driving vehicle, a drone, etc.).
In a second aspect, an embodiment of the present invention provides an operation method, where the method is applied to an operation system, and the method includes:
obtaining the current data vector based on the environment;
extracting the current feature vector according to one or more data vectors, wherein the one or more data vectors comprise the current data vector;
optimizing the feature vector extraction mode according to the one or more data vectors and the current feature vector;
determining the current action vector according to one or more feature vectors, wherein the one or more feature vectors comprise the current feature vector;
applying the current action vector to the environment, so that the feature extraction unit obtains the next data vector based on the environment after the current action vector has acted on it;
obtaining the current reward and punishment feedback based on the environment, wherein the current reward and punishment feedback is generated by one or more action vectors acting on the environment, and the one or more action vectors comprise the current action vector;
and optimizing the action vector determination mode according to the current reward and punishment feedback.
In an optional implementation, the method further comprises:
optimizing the feature vector extraction mode according to the current reward and punishment feedback.
In another optional implementation, optimizing the feature vector extraction mode according to the one or more data vectors and the current feature vector includes optimizing the feature vector extraction mode according to the one or more data vectors and the current feature vector with a first probability;
and optimizing the feature vector extraction mode according to the current reward and punishment feedback with a second probability;
wherein the sum of the first probability and the second probability is 1.
In another optional implementation, the method further includes pre-learning based on one or more training data vectors.
In another optional implementation, the method further includes pre-learning based on one or more training feature vectors.
In another alternative implementation, the learning based on one or more training data vectors is performed separately in time from the learning based on one or more training feature vectors, or the two are performed simultaneously in time.
In another optional implementation, the method further includes:
determining a reward and punishment accumulated value according to reward and punishment feedback of the environment in the operation process, wherein the reward and punishment accumulated value is increased if the reward and punishment feedback is reward, and the reward and punishment accumulated value is reduced if the reward and punishment feedback is punishment;
the operation system with the reward and punishment accumulated value higher than the first threshold value is copied;
the operation system with the accumulated reward and punishment value lower than the second threshold value is eliminated.
In another alternative implementation, extracting the current feature vector from the one or more data vectors includes extracting the current feature vector from the one or more data vectors by a first operation;
optimizing the feature vector extraction mode based on the one or more data vectors and the current feature vector includes generating a data vector by a second operation based on the current feature vector, and optimizing the first operation and the second operation based on the error between the generated data vector and the one or more data vectors.
In another optional implementation, determining the current action vector based on one or more feature vectors includes mapping the one or more feature vectors to the current action vector through a third operation;
optimizing the action vector determination mode according to the current reward and punishment feedback includes mapping the current reward and punishment feedback to a current reward and punishment value, and optimizing the third operation and the first operation according to the current reward and punishment value.
In another optional implementation, the third operation includes an operation through a neural network and a selection operation, and obtaining the current action vector according to the one or more feature vectors through the third operation includes:
determining one or more pending action vectors by computing the one or more feature vectors through the neural network;
and selecting the optimal one from the one or more pending action vectors as the current action vector through the selection operation.
In another alternative implementation, the selection operation further includes a search operation.
In another optional implementation, selecting the optimal one of the one or more pending action vectors as the current action vector by the selection operation includes:
making multiple selections from the one or more pending action vectors, performing a simulation operation for each selected action vector, and selecting the optimal action vector among the simulation results as the current action vector.
In another alternative implementation, the first operation, the second operation, or the third operation includes an operation through a recurrent neural network (RNN).
In another optional implementation, the method further includes adjusting the manner of optimizing the feature vector extraction or the manner of optimizing the action vector determination according to the current action vector.
In a third aspect, an embodiment of the present invention provides a computing device, the device including a processor and a memory, the memory being configured to store a program, and the processor being configured to execute the program stored in the memory to control the computing device to perform the method described in the second aspect and its optional implementations.
In a fourth aspect, an embodiment of the present invention provides an autonomous driving tool, which includes a propulsion system, a sensor system, a control system, and a computing system, where the propulsion system is configured to provide power for the autonomous driving tool, and the computing system is configured to control the sensor system to obtain the current data vector based on the environment;
the computing system is further configured to extract the current feature vector according to one or more data vectors, where the one or more data vectors include the current data vector;
the computing system is further configured to optimize the feature vector extraction mode according to the one or more data vectors and the current feature vector;
the computing system is further configured to determine the current action vector according to one or more feature vectors, wherein the one or more feature vectors comprise the current feature vector;
the computing system is further configured to control the control system to apply the current action vector to the environment, so that the feature extraction unit obtains the next data vector based on the environment after the current action vector has acted on it;
the computing system is further configured to control the sensor system to acquire the current reward and punishment feedback based on the environment, wherein the current reward and punishment feedback is generated by one or more action vectors acting on the environment, and the one or more action vectors comprise the current action vector;
the computing system is further configured to optimize the action vector determination mode according to the current reward and punishment feedback.
In a fifth aspect, an embodiment of the present invention provides a camera, including a shooting system and an operation system;
the operation system is configured to control the shooting system to obtain the current data vector based on the environment;
the operation system is further configured to extract the current feature vector according to one or more data vectors, where the one or more data vectors include the current data vector;
the operation system is further configured to optimize the feature vector extraction mode according to the one or more data vectors and the current feature vector;
the operation system is further configured to determine the current action vector according to one or more feature vectors, wherein the one or more feature vectors comprise the current feature vector;
the operation system is further configured to control the shooting system to apply the current action vector, so that the feature extraction unit obtains the next data vector based on the environment after the current action vector has acted on it;
the operation system is further configured to obtain the current reward and punishment feedback based on the environment, wherein the current reward and punishment feedback is generated by one or more action vectors acting on the environment, and the one or more action vectors comprise the current action vector;
the operation system is further configured to optimize the action vector determination mode according to the current reward and punishment feedback.
In a sixth aspect, a computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, implements the method described in the second aspect above and its optional implementations.
In a seventh aspect, a computer program product is provided, comprising instructions which, when run on a computer, cause the computer to perform the method described in the second aspect above and its optional implementations.
In an eighth aspect, a chip device is provided, the chip device including a processor and a memory, the memory storing a program, and the processor executing the program to perform the method of the second aspect and its optional implementations.
Drawings
Fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a computing system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of another computing system according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another computing system according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of another computing system according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an autonomous driving tool provided by an embodiment of the invention;
fig. 7 is a schematic structural diagram of a camera according to an embodiment of the present invention;
FIG. 8 is a flow chart illustrating an operation method according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a computing device according to an embodiment of the present invention.
Detailed Description
The embodiments of the present application will be described below with reference to the drawings.
In one approach, the method converts a sample vector of the explicit space into a sample vector of the implicit space through a mapping (called an encoder); correspondingly, a transformation (a decoder) is needed to convert the vector of the implicit space back into a vector of the explicit space, and this process is called the generation process.
In another approach, reinforcement learning (RL) operations learn by solving for optimal action sequences under rewards that may be delayed. Specifically, actions are determined based on the state, and the actions in turn result in a change of state; in this series of actions and state changes, the intelligent agent receives rewards or penalties, from which operations to optimize the model parameters can be derived.
A reinforcement learning agent may know very little at first, but through continuous learning and improvement it accumulates its own experience and adopts the best action strategy.
For such operations, opportunities to act and to practice are a precondition for implementing the reinforcement learning operation, and a well-defined reward and punishment mechanism is also a necessary condition for implementing it effectively. However, in many cases, such as face recognition, there are insufficient opportunities for exploration and practice of actions, which makes implementation difficult.
Based on the above problems, the present application proposes a computing system and method, which combine a learning method based on continuous observation with a reinforcement learning method to realize recognition of the environment and interaction with the environment.
The following describes aspects of embodiments of the present application with reference to the drawings.
Fig. 1 is a schematic view of an operation scenario provided by an embodiment of the present invention. The operation device 100 shown in fig. 1 may be an intelligent machine, such as an intelligent robot (e.g., a sweeping robot), an autonomous driving tool (e.g., an autonomous driving automobile, an unmanned aerial vehicle, etc.), or an intelligent camera (e.g., a digital camera, a mobile phone with a shooting function, a tablet computer, etc.). The operation device may obtain environment data and determine an action of the operation device according to the environment data, where the action of the operation device may affect the environment. The operation device may obtain from the environment the reward and punishment feedback caused by its action, and optimize its operation manner according to that feedback, so as to improve the probability that the feedback obtained for the device's actions is a reward and that the device develops in the desired direction.
In the following, the steps are described with reference to fig. 2-5. Fig. 2 is a schematic diagram of a computing system structure provided by an embodiment of the present invention. The computing system may be implemented in the computing device shown in fig. 1, for example integrated into a system on chip (SoC) of the computing device 100, or implemented as an application-specific integrated circuit (ASIC) of the computing device 100 (for example, as an ASIC of a cloud computing system). As shown in fig. 2, the system may include a feature extraction unit 210 and an action generation unit 220, where both units, and the other units in the present application, may be implemented on the basis of software and hardware, for example on the basis of a CPU and a memory (e.g., the CPU reads corresponding code stored in the memory to execute the functions that the units can perform), or on the basis of a hardware processor (e.g., an FPGA or an ASIC), that is, with related hardware circuits performing the corresponding functions of the units.
The feature extraction unit 210 mainly implements the awareness of the environment, and specifically, the feature extraction unit 210 is configured to extract a feature from data including environment information, where the feature is considered to be the awareness of the environment information. The action generating unit 220 is mainly used for interacting with the environment, and specifically, the action generating unit 220 selects an optimal action according to the cognition of the environment, acts the optimal action on the environment, and optimizes the selection of the optimal action according to reward and punishment feedback.
In a specific embodiment, as shown in fig. 2, the feature extraction unit 210 is mainly configured to obtain the current data vector based on the environment, extract the current feature vector according to one or more data vectors, where the one or more data vectors include the current data vector, and optimize the feature extraction unit 210 according to the one or more data vectors and the current feature vector. For example, the feature extraction unit 210 is configured to map the current data vector to the current feature vector, and to use the current data vector and the current feature vector together to optimize the mapping process of the feature extraction unit 210.
The data vector may refer to a vector including environment information, and for example, the data vector may be generated according to an environment picture, or may be summarized according to environment data acquired by a sensor. The environment data is data obtained by various sensors for a specific environment, including but not limited to image, sound, air pressure, temperature, and the like, and the specific environment data may be determined in accordance with a specific application scenario, for example, for an automatic driving scenario, image data obtained by a camera is included.
The above-mentioned summarization may be understood as a process of data preprocessing, that is, preprocessing the original environment data to obtain a data vector that can be used in subsequent steps: for example, performing format conversion to convert the data into a format the system can handle, or removing redundant information, and the like. The preprocessing manner may differ for different data; for image data, for example, it includes but is not limited to preprocessing methods such as denoising, color transformation, and filtering.
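To make the preprocessing concrete, the following Python sketch shows one possible way of turning a raw camera frame into a data vector X(t); the grayscale conversion, box-filter denoising, and normalization steps, and the function name preprocess_frame, are illustrative assumptions rather than steps fixed by this application.

```python
import numpy as np

def preprocess_frame(frame: np.ndarray) -> np.ndarray:
    """Turn a raw H x W x 3 image into a flat data vector X(t) (assumed pipeline)."""
    gray = frame.mean(axis=2)                # color transformation to grayscale
    kernel = np.ones((3, 3)) / 9.0           # simple box filter for denoising
    padded = np.pad(gray, 1, mode="edge")
    denoised = np.zeros_like(gray)
    for i in range(gray.shape[0]):
        for j in range(gray.shape[1]):
            denoised[i, j] = (padded[i:i + 3, j:j + 3] * kernel).sum()
    return (denoised / 255.0).ravel()        # format conversion: normalized flat vector
```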
The feature vector is extracted from the data vector after processing by the feature extraction unit 210. The feature vector reflects features of the object, and one or more pieces of information can be contained in it simultaneously; for example, for images obtained in an autonomous driving process, it can include information about roads, vehicles, pedestrians, and the like.
Based on this, the one or more data vectors may include only the current data vector, or may include the current data vector together with all or part of the historical data vectors.
In one example, as shown in fig. 3, the feature extraction unit 210 extracts the current feature vector 2112 by an operation 2121 according to a data vector 2111, and generates a data vector 2113 by an operation 2122 according to the current feature vector 2112. The feature extraction unit 210 may optimize operations 2121 and 2122 according to the error between the data vector 2113 and the data vector 2111. For example, the operation 2121 may be implemented by a feature extraction network, and the operation 2122 may be implemented by a generation network, each of which may be implemented by any neural network, for example a multi-layer deep neural network (DNN).
For example, referring to fig. 4, taking an operation 2121 as a feature extraction network and an operation 2122 as a generation network as an example, the feature extraction unit 210 is specifically configured to perform the following steps:
the computing system continuously collects signals from a real-time existing and changing environment through a sensor, the signals collected at the t time can be summarized into vectors X (t), and the t time can be considered as the current time;
the computing system extracts the feature vector z (t) from x (t) using a feature extraction network, which can be expressed as:
Figure BDA0001734375650000071
wherein the content of the first and second substances,
Figure BDA0001734375650000072
is sets of parameters defining f, i.e. f is dependent on
Figure BDA0001734375650000073
Is changed;
the computing system updates based on X (t) and Z (t)
Figure BDA0001734375650000074
To achieve optimization of the feature extraction network. Updating in the computing system based on X (t) and Z (t)
Figure BDA0001734375650000075
In examples to achieve optimization of the feature extraction network, the computing system reconstructs X (t) from z (t) by generating a network, which may be expressed as X '(t) g (z (t), θ), where θ is the set of parameters that define g, i.e., g changes as θ changes, and updates by gradient descent method according to the error E of X' (t) and X (t)
Figure BDA0001734375650000076
Theta to enable optimization of the feature extraction network and the generation network.
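A minimal PyTorch sketch of the scheme just described, assuming small fully connected networks for f and g and a mean squared reconstruction error E; the layer sizes, learning rate, and the use of MSE are assumptions for illustration only.

```python
import torch
import torch.nn as nn

X_DIM, Z_DIM = 128, 16                       # assumed sizes of X(t) and Z(t)
f = nn.Sequential(nn.Linear(X_DIM, 64), nn.ReLU(), nn.Linear(64, Z_DIM))  # Z(t) = f(X(t), phi)
g = nn.Sequential(nn.Linear(Z_DIM, 64), nn.ReLU(), nn.Linear(64, X_DIM))  # X'(t) = g(Z(t), theta)
opt = torch.optim.SGD(list(f.parameters()) + list(g.parameters()), lr=1e-2)

def observe_and_optimize(x_t: torch.Tensor) -> torch.Tensor:
    """One observation step: extract Z(t), reconstruct X'(t), update phi and theta."""
    z_t = f(x_t)                             # feature extraction
    x_rec = g(z_t)                           # reconstruction by the generation network
    err = torch.mean((x_rec - x_t) ** 2)     # error E between X'(t) and X(t)
    opt.zero_grad()
    err.backward()                           # gradients of E w.r.t. phi and theta
    opt.step()                               # gradient-descent update
    return z_t.detach()
```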
There are two possibilities for the relationship of Z (t) and X (t):
in examples, Z (t) is only relevant to X (t) at the current instant, and not to X (t) past history, in which case there are only spatial mappings from X (t) to Z (t).
In another examples, Z (t) is not only related to the current state of X (t), but also related to part or all of the historical state of X (t). in implementing the mapping process from X (t) to Z (t), all or part of the historical memory of X (t) is retained, for example, the mapping process from X (t) to Z (t) can be implemented by RNN model.
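When Z(t) should retain part of the history of X(t), the mapping can carry hidden state forward. A minimal sketch, using a GRU cell as one possible RNN choice and the same assumed dimensions as in the earlier sketch:

```python
import torch
import torch.nn as nn

X_DIM, Z_DIM = 128, 16                                    # assumed sizes of X(t) and Z(t)
rnn_cell = nn.GRUCell(input_size=X_DIM, hidden_size=Z_DIM)
h = torch.zeros(1, Z_DIM)                                 # memory of past inputs

def extract_with_history(x_t: torch.Tensor) -> torch.Tensor:
    """Map X(t) to Z(t) while retaining part of the history of X(t) in h."""
    global h
    h = rnn_cell(x_t.unsqueeze(0), h)     # Z(t) depends on X(t) and on the past via h
    return h.squeeze(0)
```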
With reference to fig. 2, the action generation unit 220 is mainly configured to determine the current action vector according to one or more feature vectors extracted by the feature extraction unit 210, where the one or more feature vectors include the current feature vector; apply the current action vector to the environment, so that the feature extraction unit 210 obtains the next data vector based on the environment after the current action vector has acted on it; obtain the current reward and punishment feedback based on the environment, where the current reward and punishment feedback is generated by one or more action vectors acting on the environment, and the one or more action vectors include the current action vector; and optimize the action generation unit 220 according to the current reward and punishment feedback. For example, the action generation unit 220 is configured to map the current feature vector to the current action vector, and to optimize the mapping process of the action generation unit 220 and the mapping process of the feature extraction unit 210 according to the reward and punishment feedback.
The action vector may represent actions affecting the environment; for example, in an autonomous driving scenario the action vector may be composed of throttle parameters, brake parameters, and steering wheel parameters. It may also represent actions affecting the learning method itself; for example, modification of the network structure, branch pruning, and parameter modification may be represented by the action vector.
Based on this, the one or more action vectors may include only the current action vector, or may include the current action vector together with all or part of the historical action vectors.
In one example, as shown in fig. 3, the action generation unit 220 is configured to obtain the current action vector 2211 through an operation 2221 according to the feature vector 2112 extracted by the feature extraction unit 210, obtain the current reward and punishment feedback 2212 based on the environment, obtain a reward and punishment value through an operation 2222 according to the current reward and punishment feedback 2212, and optimize the operation 2221 according to the reward and punishment value. In one example, the operation 2221 may be implemented through an action network and a selection operation; the action network may be implemented through a multi-layer DNN, and the selection operation is an optimal-solution selection operation. The action network is configured to map the feature vector 2112 extracted by the feature extraction unit 210 to a plurality of pending action vectors, and the selection operation is configured to select the optimal one of the pending action vectors as the current action vector 2211. Selecting the optimal one from the pending action vectors may be done in several ways; for example, an action vector may be selected directly according to the parameters output by the action network, or multiple selections may be made from the pending action vectors, a simulation (search) operation such as a Monte Carlo tree search may be performed for each selected action vector, and the action vector with the best simulation result may be chosen as the current action vector.
In addition, if the strategy for selecting the best action according to the cognition of the environment is related not only to the cognition of the current environment but also to part or all of the cognition of the historical environment, action vectors are generated sequentially through the operation 2221 according to all or part of the feature vectors up to and including the feature vector 2112, and optimization is performed according to the reward and punishment feedback of those action vectors; for example, the operation 2221 can be implemented through an RNN.
With continued reference to fig. 2, the action generation unit 220 is further configured to optimize the feature extraction unit 210 according to the current reward and punishment feedback. To avoid confusion arising from both units optimizing the feature extraction unit 210, the feature extraction unit 210 optimizes the feature extraction unit 210 according to a probability P1, and the action generation unit 220 optimizes the feature extraction unit 210 according to a probability P2, where the sum of the probability P1 and the probability P2 is 1.
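A minimal sketch of this arbitration rule, assuming fixed constants P1 and P2 with P1 + P2 = 1 and two callables standing for the competing updates of the feature extraction unit:

```python
import random

P1, P2 = 0.7, 0.3   # assumed values; P1 + P2 == 1

def update_feature_extractor(reconstruction_update, reward_update):
    """Apply exactly one of the two competing optimizations of the feature extraction unit."""
    if random.random() < P1:
        reconstruction_update()   # optimization driven by the feature extraction unit itself
    else:
        reward_update()           # optimization driven by the action generation unit's feedback
```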
For example, referring to fig. 4, taking the case where the operation 2221 includes the action network and the selection operation as an example, the computing system may perform the following steps:
the computing system uses the action network to compute an action set vector A(t) from Z(t), which may be expressed as A(t) = h(Z(t), β), where β is the set of parameters defining h, i.e. the specific h varies with β.
The operation system selects an action from the action set vector by the selection operation, which can be expressed as a(t) = Select(A(t)).
Applying a(t) to the environment, the environment generates an update of itself based on the action vector a(t), which can be written as X(t+1) = a(t)(X(t)).
The operation system obtains the reward and punishment feedback that the environment generates according to the action a(t); the reward and punishment is not limited to the result of the current action, but can also be the integrated result of all previous action sequences.
The operation system updates β and φ according to the reward and punishment feedback, to optimize the feature extraction unit 210 and the action generation unit 220; of course, the selection operation may also be updated according to the reward and punishment.
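The following sketch strings these steps into one loop. The application does not fix a particular rule for updating β from the reward and punishment feedback; a simple REINFORCE-style policy-gradient update is used here purely as an illustrative assumption, and env.step is a hypothetical environment interface that applies the selected action and returns the feedback.

```python
import torch
import torch.nn as nn

Z_DIM, N_ACTIONS = 16, 4                                 # assumed sizes of Z(t) and A(t)
h_net = nn.Sequential(nn.Linear(Z_DIM, 32), nn.ReLU(),
                      nn.Linear(32, N_ACTIONS))          # A(t) = h(Z(t), beta)
beta_opt = torch.optim.SGD(h_net.parameters(), lr=1e-3)

def act_and_learn(z_t: torch.Tensor, env) -> None:
    """One interaction step: select a(t), apply it to the environment, learn from the feedback."""
    scores = h_net(z_t)                                  # action set vector A(t)
    probs = torch.softmax(scores, dim=-1)
    a_t = torch.multinomial(probs, 1).item()             # selection operation Select(A(t))
    reward = env.step(a_t)                               # hypothetical environment update and feedback
    loss = -torch.log(probs[a_t]) * reward               # REINFORCE-style surrogate (assumption)
    beta_opt.zero_grad()
    loss.backward()
    beta_opt.step()                                      # update beta according to the feedback
```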
There are two possibilities for whether A(t) is related to the history of Z(t):
In one example, A(t) is only related to Z(t) at the current instant, and not to the past history of Z(t); in this case there is only a spatial mapping from Z(t) to A(t).
In another example, A(t) is related not only to the current state of Z(t), but also to some or all of the historical states of Z(t). In implementing the mapping process from Z(t) to A(t), all or some of the historical memory of Z(t) is retained; for example, the mapping process from Z(t) to A(t) can be implemented by an RNN model.
The action may be an action that affects the environment, or an action that affects the learning operation itself. For example, modifications to the network structure, branch pruning, and parameter modifications may be defined as possible actions. Specifically, the action generation unit 220 is further configured to adjust the manner of optimizing the feature extraction unit 210 and/or optimizing the action generation unit 220 according to the current action vector.
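As a sketch of an action vector that affects the learning operation itself rather than the environment, the example below interprets two assumed components of the action vector as a learning-rate change and a pruning request; the encoding is purely hypothetical.

```python
import torch

def apply_meta_action(action_vec, optimizer, network):
    """Interpret an action vector as modifications to the learning process itself.

    Assumed encoding: action_vec[0] scales the learning rate (parameter modification),
    action_vec[1] > 0 requests pruning of small weights (branch pruning).
    """
    for group in optimizer.param_groups:
        group["lr"] *= float(action_vec[0])
    if action_vec[1] > 0:
        with torch.no_grad():
            for p in network.parameters():
                p[p.abs() < 1e-3] = 0.0
```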
In other embodiments, as shown in FIG. 5, the ability of the feature extraction unit 210 to recognize the environment and the ability of the action generation unit 220 to interact with the environment may be pre-trained.
Pre-training the feature extraction unit 210 can be understood as continuously observing the environment and, in the process of observing it, coming to recognize the environment and master the environment's concepts (a concept of the environment being a feature). Taking learning to drive as an example, it is like an individual sitting in the passenger seat to learn driving, only observing and not operating; as the learner becomes familiar with road conditions and driving, he or she becomes familiar with environment concepts including roads, lane lines, various vehicles, the relations between different vehicle speeds, pedestrians, various abnormal situations, and the like.
For example, for an autonomous driving scenario, when the feature extraction unit 210 is pre-trained, the simulated environment information, such as data simulating an image sensor in front of the vehicle, lidar data, and data of image or distance sensors at other positions, can all be summarized into an input vector X(t). The feature extraction unit 210 extracts a feature Z(t) from X(t), where Z(t) is a feature vector, and the feature extraction unit 210 is optimized according to X(t) and Z(t). During pre-training, one or more input vectors X(t) may be determined, and the feature extraction unit 210 may be trained according to each of the one or more input vectors X(t).
The training of the action generation unit 220 may be understood as learning an optimal action strategy based on the learned concepts (features). Still taking learning to drive as an example, it can be understood as the learner practicing on a simulator: assuming that the preceding vehicle suddenly decelerates, the learner practices the optimal deceleration-and-stop process. Through many exercises, safe driving can be learned, and a sufficiently reasonable and comfortable stopping process can be learned instead of simple emergency braking.
For example, for an autonomous driving scenario, an optimal action a(t) is selected according to the feature Z(t) extracted when the feature extraction unit 210 performs pre-training, where a(t) is an action vector; by executing a(t), the reward and punishment feedback generated by the environment according to the action is obtained, the action generation unit 220 is optimized according to the reward and punishment feedback, and the feature extraction unit 210 may be further optimized according to the reward and punishment feedback.
The above-described training of the feature extraction unit 210 and the training of the action generation unit 220 may be independently operated.
When the operation system runs in the observation state, it continuously observes the environment, recognizing the environment and grasping its concepts from the observation. This is similar to an individual learning to drive in the passenger seat, only observing and not operating: the learner becomes familiar with the road conditions and the various situations that occur while driving, which is essential and important for learning to drive later, and in this process becomes familiar with environment concepts including the road, the lane lines, various vehicles, the relations between different vehicle speeds, pedestrians, various abnormal situations, and the like.
The intelligent system learns the optimal deceleration-and-stop process by practicing on the simulator, assuming that the preceding vehicle suddenly decelerates while the learner then closes his eyes (because the concepts at this moment are already determined, closing the eyes means that the operation uses only the concepts without considering the original image data); after many exercises it learns a safe driving process and a stopping process that is reasonably comfortable, rather than simple sudden braking.
For practical application scenarios, the above two processes can be performed independently in time, or simultaneously in time without mutual interference.
In the case where the two trainings are performed simultaneously or sequentially, if the training of the action generation unit 220 involves an update to the feature extraction unit 210, then, since the training of the feature extraction unit 210 also updates it, there is the question of which update predominates and how weights should be assigned. Both learning mechanisms contribute in essence to establishing correct concepts: the training of the feature extraction unit 210 is based on unsupervised data that can be observed in large quantities, while the training of the action generation unit 220 helps to correct the concepts better on the basis of results. The two can be assigned probabilities p1 and p2 with p1 + p2 = 1; the training of the feature extraction unit 210 performs its own update according to the probability p1, and the training of the action generation unit 220 performs the update of the feature extraction unit 210 according to the probability p2.
In this embodiment of the present invention, the operations 2121, 2122 and 2221 may be implemented by neural network operations and may be optimized by learning operations, where the learning operations may include back-propagation operations or other learning operations. Further, the back-propagation operations may be optimized according to the error of the mapping result, or may use other optimization bases.
Specifically, any function Y(t) may be approximated by a function z(X(t), λ) defined by a neural network operation, where λ is the parameter to be solved and z() is a general term for a series of functions; Y(t) may be the aforementioned operations 2121, 2122, 2221, and so on.
Y1(t) = z1(X(t), λ1),
Y2(t) = z2(Y1(t), λ2),
…
Yp(t) = zp(Yp-1(t), λp),
and finally Yp(t) is taken as the approximation of Y(t).
Each of the functions z() described above may take the form s(XW + b), where XW represents the product of the data vector X and the weight matrix W, and b is an adjustable offset; W and b are both parts of the parameter λ, which comprises λ1, λ2, …, λp. s() is a non-linear mapping, e.g. the rectified linear unit (ReLU) activation, defined by an output of 0 when the input is less than 0 and an output equal to the input when the input is greater than or equal to 0.
In this iterative manner, when a parameter set λ is given, a mapping between the input X(t) and the output Y(t) is determined.
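A one-line rendering of the layer form s(XW + b) with the ReLU non-linearity described above, assuming X is a row vector and W a weight matrix:

```python
import numpy as np

def layer(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Compute s(XW + b) with s() = ReLU: 0 for negative inputs, identity otherwise."""
    return np.maximum(x @ W + b, 0.0)
```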
For an arbitrary λ, the mapping does not necessarily map X(t) to the correct Y(t); assume that it maps to Y'(t).
An evaluation criterion is established for evaluating the error between Y(t) and Y'(t), for example a mean square error criterion, expressed as:
E = Σ [Y'(t) - Y(t)]² (Formula 1)
The back-propagation learning operation is defined as λn = λn-1 + Δλ, where
Δλ = -ε·∂E/∂λ, with ε an adjustable step-size parameter.
The meaning of the algorithm is that an optimal downhill step (in the direction of the steepest gradient) is taken at every step of the learning process, finally reaching the bottom of the valley, i.e. the target point, which is the λ position corresponding to the minimum value of E.
In a particular implementation, if the data vector is invariant (i.e., the environment data is unchanged), a limit may be set, and the iterative process terminates when E is less than the limit.
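A minimal numerical sketch of the rule λn = λn-1 + Δλ with Δλ proportional to the negative gradient, using a toy one-parameter squared error and an assumed termination limit:

```python
def gradient_descent(grad_E, E, lam=0.0, eps=0.1, limit=1e-6, max_steps=10_000):
    """Iterate lambda_n = lambda_{n-1} - eps * dE/dlambda until E drops below the limit."""
    for _ in range(max_steps):
        if E(lam) < limit:                   # data vector unchanged: stop once the error is small
            break
        lam = lam - eps * grad_E(lam)
    return lam

# Toy example: E(lambda) = (lambda - 3)^2, minimized at lambda = 3.
best_lam = gradient_descent(grad_E=lambda l: 2 * (l - 3), E=lambda l: (l - 3) ** 2)
```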
In the embodiment of the present invention, the purpose of optimizing the feature extraction unit 210 according to one or more data vectors and the current feature vector is to obtain, from the current data vector, the best feature extraction unit 210 that generates the most accurate concepts (features) for environmental awareness.
In particular, the feature extraction unit 210 extracts the feature vector Z(t) from X(t) using a feature extraction network, which may be expressed as:
Z(t) = f(X(t), φ)
where φ is a set of parameters defining f, i.e. f changes as φ changes; and reconstructs X(t) from Z(t) using the generation network, which may be expressed as X'(t) = g(Z(t), θ), where θ is the set of parameters defining g, i.e. the specific g changes as θ changes. According to an evaluation criterion E[X(t), X'(t)] and the gradient learning rule derived from the back-propagation operation, φ and θ are updated to achieve optimization of the feature extraction unit 210.
The feature extraction network f(X(t), φ) and the generation network g(Z(t), θ) can each be realized by a neural network, and the iterative process of the neural network, which is a series of function operations, can be as follows:
X1(t) = f1(X(t), φ1), X2(t) = f2(X1(t), φ2), …, Xu(t) = fu(Xu-1(t), φu), and finally Z(t) = Xu(t);
Z1(t) = g1(Z(t), θ1), Z2(t) = g2(Z1(t), θ2), …, Zv(t) = gv(Zv-1(t), θv), and finally X'(t) = Zv(t).
Each of the functions may take the form s(XW + b), where XW represents the product of the data vector X and the weight matrix W, b is an adjustable offset, and s() is a non-linear mapping, such as the ReLU mapping.
Once the mapping relationship (i.e. the neural network model and model parameters) is determined, a corresponding X'(t) can be generated for every data vector X(t). Ideally X(t) and X'(t) are identical, indicating that the feature extracted by the feature extraction network is accurate, but in practice there is always some deviation E, which can be measured, for example, as the mean square error; minimizing this deviation is the target when solving for the best parameters of the feature extraction network.
As shown in FIG. 5, solving for the optimal parameters φ of the feature extraction network and θ of the generation network for a given target is an optimization process. The solution typically involves an iterative process in which the parameters are updated according to the gradient of the objective function with respect to the parameters, which can be expressed by the following formulas:
φ ← φ - ε1·∂E/∂φ
θ ← θ - ε2·∂E/∂θ
where ε1 and ε2 are adjustable step-size parameters.
It should be noted that reconstructing X'(t) and updating the parameters through the mean square error between X(t) and X'(t) is only one type of learning operation; other operations can be chosen instead. For example, the spike-timing-dependent plasticity (STDP) learning operation actually performed in the human brain does not need to reconstruct X'(t).
The principle of the STDP operation is similar to an election mechanism: any person can vote and be voted for, and parameters representing degrees of trust are set between persons. After enough people have voted for me, I go and vote to support others. After I am elected, I increase my trust in the people who voted for me before the election, and the closer the vote to the moment of election, the larger the increase; for the people who voted for me after the election, I decrease the trust, and the closer the vote, the larger the decrease. These trust relationships naturally form different communities within the population, and the community as a whole has the ability to abstract and summarize complex concepts. If the parameters are represented by φ, the learning rule can be expressed as:
Δφ = ε·S(Δt)
where Δt = t_in - t_out, ε is a constant, S is a mapping function, t_in represents the trigger time of the input signal, and t_out represents the trigger time of the output signal.
The law of the mapping function S is: S is positive when Δt < 0, and the larger the absolute value of Δt, the smaller the absolute value of S; S is negative when Δt > 0, and the larger the absolute value of Δt, the smaller the absolute value of S.
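A sketch of an S(Δt) obeying the law above, together with the corresponding update of φ; the exponential shape and the time constant are assumptions consistent with the stated sign and decay behaviour, not a form fixed by this application.

```python
import math

def S(dt: float, tau: float = 20.0) -> float:
    """Positive for dt < 0, negative for dt >= 0, with |S| shrinking as |dt| grows."""
    return math.exp(dt / tau) if dt < 0 else -math.exp(-dt / tau)

def stdp_update(phi: float, t_in: float, t_out: float, eps: float = 0.01) -> float:
    """Update a trust/weight parameter phi by eps * S(t_in - t_out)."""
    return phi + eps * S(t_in - t_out)
```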
In the embodiment of the present invention, the purpose of optimizing the action generation unit 220 and the feature extraction unit 210 according to the reward and punishment feedback is mainly to select the optimal action, that is, to make the operation system produce a reaction (i.e. action) closer to a human one.
The possible action set A(t) is generated by the action network and is represented as A(t) = h(Z(t), β), where Z(t) is the feature vector obtained from the environment input X(t) by the feature extraction network.
A1(t) = h1(Z(t), β1), A2(t) = h2(A1(t), β2), …, As(t) = hs(As-1(t), βs), and finally A(t) = As(t).
Each of the functions also takes the form s(XW + b), which is not described again here.
In addition, the action set output by the action network includes action-related parameters, such as the expected value of the action, the probability of suggesting that the action be taken, and the like.
In one example, the selection operation can directly select the action considered to be the best based on these parameters, such as the action with the highest expected value, the highest probability, or a combination of both.
For the optimization of the action network, once the objective function is determined, the parameter β is updated according to the gradient of the objective function with respect to the parameters, and the parameter φ of the feature extraction network can also be selectively updated.
In another example, the selection operation may also include a search operation. The idea of the search operation is to hypothesize an action option and assume it is performed, thus producing the environment state resulting from that action, and then to continue making such hypotheses and assumed executions in the new environment state.
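A sketch of a selection operation with such a search step, under the assumption that a simulate(state, action) function is available which hypothetically performs the action on a copy of the environment state and returns an estimated reward and punishment value for the resulting trajectory (the full MCTS bookkeeping is omitted):

```python
def select_action(pending_actions, state, simulate, n_rollouts=8):
    """Pick the pending action whose hypothetical rollouts score best."""
    best_action, best_value = None, float("-inf")
    for action in pending_actions:
        # Assume the action is performed and average the simulated outcomes.
        value = sum(simulate(state, action) for _ in range(n_rollouts)) / n_rollouts
        if value > best_value:
            best_action, best_value = action, value
    return best_action
```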
Referring to fig. 5, the process of solving for the action network parameters β for a given target is an optimization process through learning operations. The solution generally involves an iterative process: the best parameters are solved according to the target value E corresponding to the hypothetical result (the target value may be a reward and punishment value associated with the result), and the weight parameters β and φ are updated according to the gradient of the target value with respect to the weight parameters, which can be expressed by the following formulas:
β ← β - ε3·∂E/∂β
φ ← φ - ε4·∂E/∂φ
where ε3 and ε4 are adjustable step-size parameters.
In another embodiments, intelligence is not isolated, and population characteristics are the most important mechanisms for intelligence development, such as the multicellular nervous system, population of species, etc. for the above method we can see that, as individuals, their level of intelligence is limited if they are not inherited to be repeatedly learned and improved.
For example, in a digital system the copying and deletion of information is much easier than in the evolution of the physical world, and this difference makes the group development of machine intelligence possible, because machine intelligence is not necessarily physical and may exist as multiple logical copies in a digital system.
In the embodiment of the invention, the computing system 100 comprises one or more subsystems, each subsystem comprising a feature extraction unit and an action generation unit, and each subsystem undergoes survival-of-the-fittest elimination according to preset conditions.
In some examples, performing survival-of-the-fittest elimination on each subsystem according to the preset conditions specifically includes: the subsystem determines a cumulative reward and punishment value according to the reward and punishment feedback from the environment during operation, where the cumulative value is increased if the feedback is a reward and decreased if the feedback is a penalty; a subsystem whose cumulative value is higher than a first threshold is copied, and a subsystem whose cumulative value is lower than a second threshold is eliminated.
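The copy/eliminate rule can be sketched as follows; the threshold values and the behavior of the clone (its cumulative value restarting at zero) are placeholders for this sketch, not values fixed by the embodiment.

```python
import copy

def accumulate(sub, reward):
    """Reward increases the cumulative reward-punishment value, penalty decreases it."""
    sub["score"] += reward

def step_population(subsystems, upper=100.0, lower=-100.0):
    """Copy subsystems above the first threshold, eliminate those below the second."""
    survivors = []
    for sub in subsystems:
        if sub["score"] < lower:          # below second threshold: eliminated
            continue
        survivors.append(sub)
        if sub["score"] > upper:          # above first threshold: copied
            clone = copy.deepcopy(sub)
            clone["score"] = 0.0          # the copy starts accumulating afresh (assumption)
            survivors.append(clone)
    return survivors
```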
When the learning of a subsystem deviates for some reason, it performs poorly under the environment's reward and punishment mechanism; continuing its learning would actually waste resources and the deviation would be difficult to correct, so a reasonable approach at this point is to terminate the subsystem.
The subsystem is given an initial energy value at initialization, and subsequent energy is obtained from the environment;
during the operation of the subsystem, energy can be obtained from the environment according to reward and punishment, or the energy can be returned to the environment.
For example, the optimization process consumes a specified amount of energy, and propagating information within the system also consumes a specified amount of energy.
If the energy of an individual subsystem is exhausted, the subsystem terminates.
The preset conditions may also include a growth mechanism for the population of systems: if only the termination mechanism is applied, especially when individual systems are terminated while still in a very immature state, the population cannot be sustained, it eventually converges, and the intelligent system becomes difficult to develop.
Assume that the environment and the computing system have a total energy value (GE):
Randomly, according to a certain ratio r, a selected subsystem splits into two (or into n), and this process also involves a certain degree of variation. If there is no subsystem left in the system, a new subsystem is generated.
The environmental energy is initially GE and the population energy is initially 0; as the population energy grows step by step, its limit of growth is GE, so there is a dynamic balance of energy between the environment and the population.
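Under stated assumptions (a split probability, a fixed per-step operating cost, and termination at zero energy), the energy exchange between the environment and the population might be sketched as below; all constants are illustrative only.

```python
import random

def energy_step(subsystems, env_energy, split_prob=0.1, op_cost=1.0):
    """One round of energy exchange: reward moves energy from the environment to a
    subsystem, operation returns energy to the environment, exhausted subsystems
    terminate, and a randomly chosen subsystem may split into two."""
    alive = []
    for sub in subsystems:
        gain = min(env_energy, sub.pop("pending_reward", 0.0))  # energy drawn from environment
        sub["energy"] += gain - op_cost
        env_energy += op_cost - gain                             # operating cost flows back
        if sub["energy"] > 0:
            alive.append(sub)                                    # exhausted subsystems terminate
    if alive and random.random() < split_prob:
        parent = random.choice(alive)
        child = {"energy": parent["energy"] / 2}                 # split into two, sharing energy
        parent["energy"] /= 2
        alive.append(child)
    if not alive:                                                # if no subsystem remains, create one
        seed = min(env_energy, 10.0)
        alive.append({"energy": seed})
        env_energy -= seed
    return alive, env_energy
```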
Referring to fig. 6, one example is an autonomous driving tool 600 implemented based on the computing system 100. However, example systems may also be implemented on, or take the form of, other vehicles, such as cars, trucks, motorcycles, buses, boats, airplanes, helicopters, lawn mowers, snow shovels, recreational vehicles, amusement park vehicles, farming equipment, construction equipment, trams, golf carts, and trains.
At present, an automatic driving system mainly identifies specific people, vehicles and lane lines, judges their relative position and speed relationships with respect to the driven vehicle, and determines appropriate driving behaviors, including steering wheel control and acceleration/deceleration control, from this specific geometric and physical position relationship. The disadvantage of this kind of system is that it can only respond to specific rules, and the robustness of its operation is relatively poor. If an unprecedented scene is encountered that was not designed for or considered, a natural and accurate reaction may not be produced.
With the embodiment of the invention, the driving response can be generated directly from the image input, much as a human driver reacts, rather than only after a logical analysis process, so that driving responses closer to those of a natural person are produced.
The autonomous driving tool 600 includes a propulsion system 601, a sensor system 602, a control system 603, and a computing system 604, where the computing system 604 may include a processor, a memory, and the like. The computing system 604 may be the controller of the autonomous driving tool 600 or a portion of the controller.
The autopilot vehicle 600 may include more, fewer, or different systems, and each system may include more, fewer, or different components. In addition, the illustrated systems and components may be combined or divided in any number of ways, for example, the autopilot device 600 may also include a power source, a display screen, and speakers, among other things.
The propulsion system 601 may be used to power movement of the autonomous tool 600. For example, the propulsion system 601 includes an engine, a power source, a transmission, wheels/tires, and the like.
In some examples, the propulsion system 601 may include multiple types of engines and/or motors; for instance, a hybrid vehicle may include both a gasoline engine and an electric motor. The power source may be a source of energy that powers the engine/motor in whole or in part; in some examples, the power source may also provide power for other systems of the autonomous driving tool 600. The transmission may be used to transmit mechanical power from the engine/motor to the wheels/tires; to this end, the transmission may include a gearbox, a clutch, a differential, a drive shaft, and/or other elements. In examples where the transmission includes drive shafts, the drive shafts include one or more axles for coupling to the wheels/tires. The wheels/tires of the autonomous driving tool 600 may be configured in various forms, including single-wheel, bicycle/motorcycle, or four-wheel (car/truck) forms; other wheel/tire forms are also possible, such as those including six or more wheels/tires, and the vehicle may include a plurality of wheels/tires coupled to a rubber wheel assembly.
The sensor system 602 may include a GPS module, an inertial measurement unit (IMU), a radar unit, and a camera, and may also include sensors that monitor the internal systems of the autonomous driving tool 600 (e.g., an O2 monitor, a fuel gauge, an oil temperature sensor, etc.), as well as other sensors. The GPS module may be any sensor for estimating the geographic location of the autonomous driving tool 600; to this end, the GPS module may include a transceiver that estimates the location of the autonomous driving tool 600 relative to the earth based on satellite positioning data. The IMU may be any combination of sensors used to sense position and orientation changes of the autonomous driving tool 600 based on inertial acceleration; in some examples, the combination of sensors may include, for example, accelerometers. The radar unit may be configured to use radio signals to detect objects, including their location, in the environment of the autonomous driving tool 600.
The camera may be any camera (e.g., a still camera, a video camera, etc.) that acquires images of the environment in which the autonomous driving tool 600 is located. To this end, the camera may be configured to detect visible light, or may be configured to detect light from other portions of the spectrum (such as infrared or ultraviolet light).
The actuator may be configured to modify the position and/or orientation of the sensor. The sensor system 602 may additionally or alternatively include components other than those shown.
The control system 603 may be configured to control the operation of the autonomous driving tool 600 and its components. To this end, the control system 603 may include a steering unit, a throttle, a brake unit, and the like.
The steering unit may be any combination of mechanisms configured to adjust the heading or direction of the autonomous vehicle 600.
The throttle may be any combination of mechanisms configured to control the operating speed and acceleration of the engine/motor and, in turn, the speed and acceleration of the autonomous driving tool 600.
The brake unit may be any combination of mechanisms configured to decelerate the autonomous driving tool 600. For example, the brake unit may use friction to slow the wheels/tires. As another example, the brake unit may be configured to be regenerative, converting the kinetic energy of the wheels/tires into electric current.
The control system 603 may additionally or alternatively include components other than those shown.
The processors included in the computing system 604 may include one or more general-purpose processors and/or one or more special-purpose processors (e.g., image processors, digital signal processors, etc.). To the extent that the computing system includes more than one processor, such processors may work alone or in combination.
The memory, in turn, may include one or more volatile storage components and/or one or more non-volatile storage components, such as optical, magnetic, and/or organic storage devices, and the memory may be integrated in whole or in part with the processor.
To this end, the components and systems of the autopilot tool 600 may be communicatively linked via a system bus, network, and/or other connection mechanism.
In an embodiment of the present invention, the autopilot tool 600 includes modules specifically configured to:
the computing system 604 collects signals from the real-time, changing environment via the sensor system 602, such as image sensor data from an image sensor at the front of the vehicle, lidar data, and image or range sensor data from other locations. The computing system 604 may summarize the data acquired this time by the sensor system 602 into a data vector X(t).
The computing system 604 extracts the feature vector Z(t) from X(t) using the feature extraction network, which may be expressed as Z(t) = f(X(t), φ), where the feature extraction network may be an arbitrary neural network with φ as its parameters, such as a multi-layer DNN.
The computing system 604 reconstructs X(t) from Z(t) using the generating network, the reconstructed vector being X'(t). This can be expressed as:
X'(t) = g(Z(t), θ), where θ is the set of parameters used to define g; for example, the generating network is a neural network with θ as its parameters.
The computing system 604 updates φ and θ according to the evaluation criterion E[X(t), X'(t)] and the gradient learning rule, where the correction values for the parameters φ and θ are proportional to the gradient of the evaluation criterion E (which may be a squared-error criterion) with respect to the parameters.
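A minimal sketch of this extract-reconstruct-update step is given below, using a squared-error criterion and single linear layers for f and g so the gradients stay readable; a real implementation would use deeper networks and an autodiff framework, so this is only an assumption-laden illustration.

```python
import numpy as np

def feature_recon_step(x, phi, theta, lr=1e-3):
    """One gradient step on the squared-error criterion E = ||x - x_rec||^2,
    with linear f and g (constant factors absorbed into the learning rate)."""
    z = phi @ x                               # feature extraction: z = f(x, phi)
    x_rec = theta @ z                         # reconstruction: x_rec = g(z, theta)
    err = x_rec - x                           # reconstruction error
    grad_theta = np.outer(err, z)             # dE/dtheta (up to a factor of 2)
    grad_phi = np.outer(theta.T @ err, x)     # dE/dphi via the chain rule through z
    theta -= lr * grad_theta                  # correction proportional to the gradient
    phi -= lr * grad_phi
    return z, float(err @ err)

# Illustrative shapes: data vector of size 6, feature vector of size 3.
rng = np.random.default_rng(0)
phi = rng.normal(size=(3, 6))
theta = rng.normal(size=(6, 3))
z, e = feature_recon_step(rng.normal(size=6), phi, theta)
```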
The computing system 604 calculates a set of candidate action vectors A(t) from Z(t) using the action network, which may be expressed as A(t) = h(Z(t), β), where the action network is a neural network with β as its parameters. The action vectors may include actions such as increasing/decreasing the throttle, braking, turning left, turning right, and so on. Each element of an action vector corresponds to a specific driving action, each vector corresponds to a specific combination of throttle, brake, and steering-wheel parameters, and the number of vectors in the candidate set represents the number of possible actions.
The computing system 604 selects a specific action vector a(t) from the set of candidate action vectors, which may be expressed as a(t) = Select(A(t)); the selection process may be implemented by a selection operation, such as a search operation.
The computing system 604 controls the autonomous driving tool 600 to perform the action a(t). After the autonomous driving tool 600 performs the action a(t), the environment becomes X(t+1) = a(t)[X(t)], where a(t) acts as an operator; the updated X(t+1) is produced from the data that the sensor system 602 subsequently acquires from the environment.
The computing system 604 obtains, via the sensor system 602, the reward and punishment feedback that the environment generates according to the action a(t). For example, the reward and punishment feedback may indicate whether the autonomous driving tool 600 has had a collision, has violated a traffic rule, or whether the vehicle is running smoothly. In some examples, the reward and punishment feedback and a reward and punishment standard may be used to determine a reward and punishment value R. For instance, R = -Cc may be set to represent the loss caused by an accident or a traffic-rule violation, and the standard may specify that a persistent reward is given when the vehicle runs normally, for example R = Cd. In other examples, the reward and punishment standard may be determined according to whether the vehicle runs smoothly, or may be set according to the driving style; for example, comfortable driving and sporty driving may correspond to different reward and punishment standards.
The reward and punishment value may also be determined empirically, and its magnitude affects the future behavior of the autonomous driving tool 600 to a certain extent. For example, if the reward and punishment feedback obtained is a collision but the penalty determined for the collision is not large, the autonomous driving tool 600 may tend to regard behaviors that produce collisions as acceptable choices.
The computing system 604 may update the parameters β, phi according to the reward and punishment feedback, the action selection process, and the gradient learning rule.
It should be noted that, when updating the parameters β and φ of the neural networks, the correction value of a parameter is proportional to the gradient of the reward-penalty function L(a(t)) with respect to that parameter, where the reward-penalty function L is a mapping function whose input is a(t) and whose output is a reward-penalty value R. The relationship between L and R may be instantaneous, depending only on the current R; cumulative, i.e., the accumulation of all reward-penalty values over the whole process; or short-term, i.e., a weighted sum of the reward-penalty values within a selected time window.
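Putting the steps of this section together, one closed-loop iteration for the autonomous driving tool might look like the following sketch. The network objects, the `env` interface, and the constants Cc and Cd are placeholders standing in for whatever a concrete deployment provides; this is not a definitive implementation of the embodiment.

```python
def driving_step(env, feature_net, generator_net, action_net, Cc=10.0, Cd=0.1):
    """One perceive-act-learn cycle as described for computing system 604 (sketch)."""
    x_t = env.sense()                                    # data vector from the sensor system
    z_t = feature_net.forward(x_t)                       # Z(t) = f(X(t), phi)
    x_rec = generator_net.forward(z_t)                   # X'(t) = g(Z(t), theta)
    feature_net.update_from_reconstruction(x_t, x_rec)   # optimize phi, theta by E[X, X']

    candidates = action_net.forward(z_t)                 # A(t) = h(Z(t), beta)
    a_t = max(candidates, key=lambda a: a.expected_value)  # a(t) = Select(A(t))
    env.execute(a_t)                                     # environment becomes X(t+1) = a(t)[X(t)]

    if env.collision_or_violation():
        reward = -Cc                                     # penalty for accident / rule violation
    else:
        reward = Cd                                      # persistent reward for normal driving
    action_net.update_from_reward(a_t, reward)           # optimize beta (and optionally phi)
    return reward
```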
In the specific implementation of the embodiment of the invention, the autonomous driving tool 600 needs to go through a training process. All of the above operations can first be run on a simulator, which simulates not only the autonomous driving itself but also the road conditions. After a sufficient period of simulated operation, the autonomous driving system acquires a good enough driving capability, including natural reactions to various abnormal conditions.
In other examples, referring to fig. 7, a camera 700 implemented based on the computing system 100 is illustrated. The camera 700 may include a capture system 701 and a computing system 702, where the computing system 702 may include a processor, a memory, and the like. The computing system 702 may be the controller of the camera 700 or a portion of the controller.
The capture system 701 includes a camera and other sensors, where the other sensors may include a motion sensor such as an accelerometer. The camera has an optical unit into which light from the imaging object (the subject to be photographed) is input, an image capturing unit which is disposed behind the optical axis of the optical unit and photographs the imaging object by means of the optical unit, and the like. In some examples, the optical unit may further include a zoom lens, a correction lens, a diaphragm mechanism, a focus lens, and the like, where the zoom lens may be moved along the optical axis by a zoom motor and the focus lens may be moved along the optical axis by a focus motor. The correction lens may further be controlled by a correction lens motor so that the angle of incident light relative to the image capturing surface remains substantially constant, and the diaphragm mechanism may be controlled by an aperture (iris) motor. The aforementioned motors may be controlled by the computing system 702 through electric drivers.
The image capturing unit may include: a charge-coupled device (CCD) image sensor that generates an image signal of the subject from the light received through the optical unit; a correlated double sampling (CDS) circuit that performs correlated double sampling processing to eliminate the noise component contained in the image signal read by the CCD image sensor; an analog-to-digital (A/D) converter that converts the analog signal processed by the CDS circuit into a digital signal; a timing generator (TG) that generates the timing signal driving the CCD image sensor; and so on.
The display unit may include a display panel and a touch panel. The display panel may optionally be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. The touch panel may cover the display panel; when a touch operation is detected on or near the touch panel, it may be transmitted to the processor to determine the type of touch event, and the processor may then provide a corresponding visual output on the display panel according to the type of touch event.
Taking a good photograph involves a combination of parameter adjustments, including basic elements such as aperture, white balance, and sensitivity, as well as more complicated elements such as composition and lighting.
In the embodiment of the present invention, each module included in the camera 700 is specifically configured to:
the capture system 701 collects image signals of the external scene, and the computing system 702 may summarize the data collected by the capture system 701 into a data vector X(t).
The computing system 702 extracts the feature vector Z(t) from X(t) using the feature extraction network, expressed as Z(t) = f(X(t), φ), where the feature extraction network may be a neural network with φ as its parameters.
The computing system 702 reconstructs X(t) from Z(t) using the generating network, which may be expressed as X'(t) = g(Z(t), θ), where θ is the set of parameters used to define g, i.e., the generating network is a neural network with θ as its parameters.
The computing system 702 updates φ and θ according to the evaluation criteria E [ X (t), X' (t) ] and the gradient learning rules. The correction value of the parameter is proportional to the gradient of the evaluation criterion E, which may be a square error criterion, with respect to the parameter.
The computing system 702 calculates a set of candidate action vectors A(t) from Z(t) using the action network, which may be expressed as A(t) = h(Z(t), β), where the action network is a neural network with β as its parameters. Here the action vectors are combinations of photographing parameters, such as weighting parameters corresponding to R, G, and B, a suggested photographing angle, distance, inclination, and the like. Each action vector corresponds to a specific combination of photographing parameters; for example, a = [(parameter 1, parameter 2, parameter 3), (value 1, probability 1)] is one such combination, and the number of action vectors in A(t) represents the number of possible parameter combinations.
The computing system 702 selects a specific action vector from the candidate action vector set, which may be expressed as a(t) = Select(A(t)).
The computing system 702 controls the capture system 701 to shoot according to a(t). At this time, the environment acquired by the capture system 701 is updated to X(t+1) = a(t)[X(t)], where X(t+1) can be the shooting result obtained by the camera 700 according to a(t). In some examples, the camera 700 can shoot several pictures at once (for example, 3 pictures) for the user to select from.
The computing system 702 obtains the reward and punishment feedback generated according to a(t). The reward and punishment feedback can be a user action with respect to the captured photograph, such as a delete, save, or share action. The reward and punishment value corresponding to the feedback can be determined according to a reward and punishment rule: for example, if the user triggers a deletion action and deletes the photo, the value is determined as a penalty (a negative value), which can be represented as R = -Ce; if a retention or sharing action is triggered, it is determined as a reward (a positive value), which may be expressed as R = Ck or R = Cs. Here Ck and Cs may be different reward values, e.g., Ck corresponds to the reward for retaining the photo and Cs to the reward for sharing it, and the reward for sharing may be higher than the reward for retaining. In addition, various reward and punishment factors that influence the subjective evaluation of the photo, such as composition, lighting, and white balance, can also serve as the basis of the reward and punishment rules.
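The reward rule for user feedback on a photo can be sketched as follows; the numeric values of Ce, Ck, and Cs are arbitrary placeholders, with the sharing reward assumed to exceed the retention reward as described above.

```python
def photo_reward(user_action, Ce=1.0, Ck=0.5, Cs=1.5):
    """Map the user's handling of a photo to a reward-punishment value R."""
    if user_action == "delete":
        return -Ce            # deletion is a penalty (negative value, R = -Ce)
    if user_action == "keep":
        return Ck             # retaining the photo is a reward (R = Ck)
    if user_action == "share":
        return Cs             # sharing is assumed to be a larger reward (R = Cs)
    return 0.0                # no feedback yet

print(photo_reward("share"))  # 1.5
```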
The specific determination of the reward and punishment value may be empirical, and the size of the reward or penalty affects the future behavior of the system to a certain extent; for example, if the penalty caused by deleting a photo is not too large, the system tends to retain more photos that are not very good.
The computing system 702 updates the parameters β and φ according to the reward and punishment condition and the gradient learning rule, where the correction value of a parameter is proportional to the gradient of the reward-penalty function L(a(t)) with respect to that parameter.
In addition, in the embodiment of the present invention, the result reward and punishment for taking a picture is not limited to a specific user, but can be based on the general preference of a user group or a large number of users, and can be guided by the judgment of an experienced photographer.
By the embodiment of the invention, the understanding of the environment and the interaction with the environment are combined to carry out machine learning, so that the environment can be better understood and the optimal action selection can be made according to the understanding of the environment. In addition, the feature extraction and action selection can be based on memory space, and the basic rules of intelligent system behaviors are better met. The parameters of the feature extraction network can be updated while the action network is adjusted by utilizing an environment reward and punishment mechanism, so that the concept understanding, namely the so-called 'true-to-practice' can be improved from the action result.
Fig. 8 is a schematic flow chart of a computing method according to an embodiment of the present invention. The method is applied to a computing system, which may be any system shown in fig. 1-7, or may be the autonomous driving tool shown in fig. 6 or the camera shown in fig. 7; these may be understood with reference to one another. As shown in fig. 8, the method may specifically include the following steps:
S810, acquiring the current data vector based on the environment.
In which environmental data may be acquired by a sensor (e.g., the sensor system in fig. 6 or the photographing system in fig. 7), and a data vector is obtained by preprocessing the environmental data. For example, pre-processing herein includes, but is not limited to, data cleansing, data integration, data transformation, and data reduction, among others.
The current environment data may be the data within a specified time window, or the data at a single time instant; the length of the specified time window may be determined according to actual needs, for example, the time window may include all historical time.
In addition, the data vectors may be acquired in real time or periodically.
S820, extracting the current feature vector according to one or more data vectors, where the one or more data vectors include the current data vector.
The feature vector may be extracted in multiple ways. In some examples, the feature vector is related only to the current data vector and not to historical data vectors; in this case, the current data vector may be directly mapped to the current feature vector.
In other examples, the feature vector is related not only to the current data vector but also to historical data vectors. In this case, the feature vector may be extracted from one or more data vectors by a first operation, and this operation may be implemented by a neural network, e.g., an RNN.
For the extraction of the feature vector in S820, reference may be made to the relevant description in fig. 2-5 (for example, the foregoing operation 2121 or feature extraction network).
S830, optimizing the feature vector extraction mode according to the one or more data vectors and the current feature vector.
The feature vector extraction mode may be optimized in various ways. For example, the optimization may be performed through reconstruction and gradient learning rules, or through a biologically inspired process (STDP) that adjusts the strength of the connections between neurons according to the relative timing of their firing.
In some examples, a data vector can be reconstructed according to the current feature vector, and the feature extraction mode can be optimized according to the error between the reconstructed data vector and the current data vector, a gradient descent operation, and an evaluation criterion. For example, a data vector is generated from the current feature vector through a second operation, and the first operation and the second operation are optimized according to the error between the generated data vector and the one or more data vectors.
The process of optimizing the feature extraction mode according to the error between the reconstructed data vector and the current data vector, the gradient descent operation, and the evaluation criterion is iterative; a limit can be set, and the iteration terminates when the error is smaller than the limit.
Wherein, S810-S830 can be implemented by the feature extraction unit 210 in combination with the embodiments shown in fig. 2-7, which can be understood by referring to each other.
S840, determining the current action vector according to one or more feature vectors, where the one or more feature vectors include the current feature vector.
In some examples, the current action vector is related only to the current feature vector and not to historical feature vectors; in this case, the current feature vector can be directly mapped to the current action vector.
In other examples, the action vector is related not only to the current feature vector but also to historical feature vectors.
The third operation includes an operation through a first neural network and a selection operation, and S840 can be implemented by the following steps:
determining one or more pending action vectors by computing the one or more feature vectors through the first neural network;
selecting the optimal action vector from the one or more pending action vectors as the current action vector through the selection operation; for example, action vectors are selected multiple times from the one or more pending action vectors, a simulation operation is carried out for each, and the optimal action vector among the simulation results is selected as the current action vector.
For action generation in S840, see the relevant descriptions in fig. 2-7 (e.g., the aforementioned operation 2221 or action network).
S850, applying the current action vector to the environment, so that the feature extraction unit acquires the next data vector based on the environment after the current action vector has acted on it.
In the specific execution process of the computing system, the action can be continuously determined according to the environment, the environment is influenced by the execution of the action, and then the action is determined according to the influenced environment.
Applying the current action vector to the environment can be understood as the process in which the computing system executes the current action or controls an intelligent device to execute the current action. For applying the current action vector to the environment in S850, reference may be made to the descriptions in fig. 2 to fig. 7 about applying the current action vector to the environment or executing the current action vector.
S860, obtaining the current reward and punishment feedback based on the environment, where the current reward and punishment feedback is generated by one or more action vectors acting on the environment, and the one or more action vectors include the current action vector.
For example, when an autonomously driven automobile faces an obstacle and avoids it after choosing a left turn, whereas turning right would have led to an accident, the environment ahead no longer contains the obstacle; this situation can be regarded as the environment generating a reward feedback. Similarly, penalty feedback and the like can be generated.
For obtaining the reward punishment feedback of this time in S860, reference may be made to relevant descriptions in fig. 2 to fig. 7.
S870, optimizing the action vector determination mode according to the current reward and punishment feedback.
The action vector determination mode may be optimized in various ways, for example, according to the current reward and punishment feedback, a gradient descent operation, and an evaluation criterion, or according to the deviation between the actual value and the expected value of the action result, a gradient descent operation, and an evaluation criterion.
For the determination manner of the optimized action vector in S870, see the description related to the determination manner of the optimized action vector in fig. 2-7 (e.g., optimization operation 2221, or optimized action network, etc.).
In addition, S840-S870 may be implemented by action generating unit 220 in conjunction with the embodiments shown in FIGS. 2-5, as described above, as will be understood by reference to each other.
In some embodiments, the method further comprises:
optimizing the feature vector extraction mode according to the current reward and punishment feedback.
To avoid confusion caused by simultaneous optimization, the feature vector extraction mode can be optimized according to the one or more data vectors and the current feature vector with a first probability, and optimized according to the current reward and punishment feedback with a second probability, where the sum of the first probability and the second probability is 1.
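A sketch of this probabilistic alternation between the two optimization paths is shown below; `optimize_by_reconstruction` and `optimize_by_reward` stand for the two procedures already described and are not defined here, so they are assumptions of the example.

```python
import random

def optimize_feature_extractor(p1, optimize_by_reconstruction, optimize_by_reward):
    """With probability p1, optimize from the data vectors and the current feature
    vector; otherwise (probability 1 - p1), optimize from the current reward and
    punishment feedback."""
    assert 0.0 <= p1 <= 1.0
    if random.random() < p1:
        optimize_by_reconstruction()
    else:
        optimize_by_reward()
```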
In still further embodiments, the method further comprises learning in advance from one or more training data vectors and learning in advance from one or more training feature vectors, where the learning from the training data vectors is either separate in time from the learning from the training feature vectors or simultaneous with it.
In other embodiments, a plurality of computing systems can be provided, operating under a survival-of-the-fittest rule. Specifically, the method further includes: a computing system determines a cumulative reward and punishment value according to the reward and punishment feedback from the environment during operation, where the cumulative value is increased if the feedback is a reward and decreased if the feedback is a penalty; a computing system whose cumulative value is higher than a first threshold is copied, and a computing system whose cumulative value is lower than a second threshold is eliminated.
In other embodiments, an action is primarily an action that affects the environment, but it can also be an action that affects the learning itself.
By the embodiment of the invention, the understanding of the environment and the interaction with the environment are combined to carry out machine learning, so that the environment can be better understood and the optimal action selection can be made according to the understanding of the environment. In addition, the feature extraction and action selection can be based on memory space, and the basic rules of intelligent system behaviors are better met. The parameters of the feature extraction network can be updated while the action network is adjusted by utilizing an environment reward and punishment mechanism, so that the concept understanding, namely the so-called 'true-to-practice' can be improved from the action result.
Fig. 9 is a schematic structural diagram of a computing device according to an embodiment of the present disclosure. As shown in fig. 9, the computing device 900 includes a transceiver 901, a processor 902, and a memory 903, where the transceiver 901 is configured to receive data from a data bus, the memory 903 is configured to store programs and data, and the processor 902 is configured to execute the programs stored in the memory 903 and read the data stored in the memory 903 so as to perform steps S820-S840 and S860-S870 in fig. 8, and to control the transceiver 901 to perform steps S810 and S850. The computing system described in fig. 2 to fig. 7 may be implemented by the computing device 900.
An embodiment of the invention provides a chip device, which includes a processor and a memory, where the memory is used to store a program and the processor runs the program to execute the method and/or the steps in fig. 8 above.
In the embodiment of the present invention, the chip device may be a chip operating in an arithmetic device, where the chip includes: a processing unit, which may be, for example, a processor, which may be of the various types described hereinbefore, and a communication unit. The communication unit, which may be, for example, an input/output interface, a pin or a circuit, etc., includes a system bus. Optionally, the chip further includes a storage unit, which may be a memory inside the chip, such as a register, a cache, a Random Access Memory (RAM), an EEPROM, or a FLASH; the memory unit may also be a memory located outside the chip, which may be of the various types described hereinbefore. The processor is coupled to the memory and is operable to execute the instructions stored in the memory to cause the chip arrangement to perform the method of fig. 8 as described above.
When the computer program instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present invention are generated in whole or in part. The instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable medium to another, for example from a website, computer, server, or data center to another website, computer, server, or data center via wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any usable medium accessible by a computer, such as a magnetic medium (e.g., magnetic tape or other magnetic storage medium), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid-state storage medium).
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (35)

  1. An operation system, comprising:
    a feature extraction unit, configured to acquire a current data vector based on an environment, extract a current feature vector according to one or more data vectors, where the one or more data vectors include the current data vector, and optimize the feature extraction unit according to the one or more data vectors and the current feature vector;
    an action generation unit, configured to determine a current action vector according to one or more feature vectors extracted by the feature extraction unit, where the one or more feature vectors include the current feature vector, apply the current action vector to the environment so that the feature extraction unit acquires the next data vector based on the environment after the current action vector has acted on it, obtain current reward and punishment feedback based on the environment, where the current reward and punishment feedback is generated by one or more action vectors acting on the environment and the one or more action vectors include the current action vector, and optimize the action generation unit according to the current reward and punishment feedback.
  2. The system of claim 1, wherein the action generation unit is further configured to optimize the feature extraction unit according to the current reward and punishment feedback.
  3. The system of claim 2, wherein the feature extraction unit optimizes the feature extraction unit according to a first probability, the action generation unit optimizes the feature extraction unit according to a second probability, and the sum of the first probability and the second probability is 1.
  4. The system according to any one of claims 1-3, wherein the feature extraction unit is further configured to perform learning in advance based on one or more training data vectors.
  5. The system of claim 4, wherein the action generation unit is further configured to learn in advance according to one or more training feature vectors extracted in advance by the feature extraction unit.
  6. The system according to claim 5, wherein the feature extraction unit and the action generation unit learn separately in time; alternatively, the feature extraction unit and the action generation unit learn at the same time.
  7. The system of any one of claims 1-6, wherein the system includes one or more subsystems, each subsystem includes a feature extraction unit and an action generation unit, and each subsystem determines a cumulative reward and punishment value according to reward and punishment feedback from the environment during operation, wherein the cumulative value is increased if the feedback is a reward and decreased if the feedback is a penalty, a subsystem whose cumulative value is higher than a first threshold is copied, and a subsystem whose cumulative value is lower than a second threshold is eliminated.
  8. The system of claim 1, wherein the feature extraction unit is specifically configured to extract the current feature vector from the one or more data vectors by a first operation, generate a data vector from the current feature vector by a second operation, and optimize the first operation and the second operation according to an error between the generated data vector and the one or more data vectors.
  9. The system of claim 8, wherein the action generation unit is specifically configured to obtain the current action vector through a third operation mapping according to the one or more feature vectors extracted by the feature extraction unit, obtain the current reward and punishment feedback based on the environment, obtain the current reward and punishment value according to the current reward and punishment feedback mapping, and optimize the third operation and the first operation according to the current reward and punishment value.
  10. The system of claim 9, wherein the third operation comprises an operation of a first neural network and a selection operation, the first neural network being used to map the one or more feature vectors extracted by the feature extraction unit to a plurality of pending action vectors, and the selection operation being used to select an optimal one of the plurality of pending action vectors as the current action vector.
  11. The system of claim 10, wherein the selection operation further comprises a search operation.
  12. The system of claim 11, wherein the search operation is specifically configured to select an action vector and perform a simulation operation each time a selection is made from the plurality of pending action vectors, the selection being performed multiple times, and to select the optimal action vector in the simulation operation results as the current action vector.
  13. The system of any one of claims 9-12, wherein the first operation, the second operation, or the third operation comprises an operation through a recurrent neural network (RNN).
  14. The system according to any one of claims 1-13, wherein the action generation unit is further configured to adjust the way of optimizing the feature extraction unit or optimizing the action generation unit according to the current action vector.
  15. The system of any one of claims 1-14, applied to a camera, a robot, or an autonomous vehicle.
  16. An operation method, applied to an operation system, comprising:
    acquiring a current data vector based on an environment;
    extracting a current feature vector according to one or more data vectors, where the one or more data vectors include the current data vector;
    optimizing a feature vector extraction mode according to the one or more data vectors and the current feature vector;
    determining a current action vector according to one or more feature vectors, where the one or more feature vectors include the current feature vector;
    applying the current action vector to the environment, so that the feature extraction unit acquires the next data vector based on the environment after the current action vector has acted on it;
    obtaining current reward and punishment feedback based on the environment, where the current reward and punishment feedback is generated by one or more action vectors acting on the environment, and the one or more action vectors include the current action vector;
    and optimizing an action vector determination mode according to the current reward and punishment feedback.
  17. The method of claim 16, further comprising:
    optimizing the feature vector extraction mode according to the current reward and punishment feedback.
  18. The method of claim 17, wherein
    the optimizing of the feature vector extraction mode according to the one or more data vectors and the current feature vector comprises optimizing the feature vector extraction mode according to the one or more data vectors and the current feature vector with a first probability, and the optimizing of the feature vector extraction mode according to the current reward and punishment feedback comprises optimizing the feature vector extraction mode according to the current reward and punishment feedback with a second probability;
    wherein the sum of the first probability and the second probability is 1.
  19. The method of any one of claims 16-18, further comprising learning in advance from one or more training data vectors.
  20. The method of claim 19, further comprising learning in advance based on one or more training feature vectors.
  21. The method of claim 20, wherein the learning from the one or more training data vectors is performed separately in time from the learning from the one or more training feature vectors, or the learning from the one or more training data vectors is performed simultaneously with the learning from the one or more training feature vectors.
  22. The method of any one of claims 16-21, further comprising:
    determining a cumulative reward and punishment value according to reward and punishment feedback from the environment during operation, wherein the cumulative value is increased if the feedback is a reward and decreased if the feedback is a penalty;
    an operation system whose cumulative reward and punishment value is higher than a first threshold is copied;
    an operation system whose cumulative reward and punishment value is lower than a second threshold is eliminated.
  23. The method of claim 16, wherein the extracting of the current feature vector from the one or more data vectors comprises extracting the current feature vector from the one or more data vectors by a first operation;
    the optimizing of the feature vector extraction mode according to the one or more data vectors and the current feature vector comprises generating a data vector from the current feature vector through a second operation, and optimizing the first operation and the second operation according to an error between the generated data vector and the one or more data vectors.
  24. The method according to claim 23, wherein the determining of the current action vector based on the one or more feature vectors comprises mapping the current action vector from the one or more feature vectors by a third operation;
    the optimizing of the action vector determination mode according to the current reward and punishment feedback comprises obtaining a current reward and punishment value according to the current reward and punishment feedback mapping, and optimizing the third operation and the first operation according to the current reward and punishment value.
  25. The method of claim 24, wherein the third operation comprises an operation of a first neural network and a selection operation, and the mapping of the current action vector from the one or more feature vectors through the third operation comprises:
    determining one or more pending action vectors from the one or more feature vectors operated on by the first neural network;
    and selecting the optimal vector from the one or more pending action vectors as the current action vector through the selection operation.
  26. The method of claim 25, wherein the selection operation further comprises a search operation.
  27. The method of claim 26, wherein the selecting, from the one or more pending action vectors, of the optimal vector as the current action vector by the selection operation comprises:
    selecting multiple times from the one or more pending action vectors, selecting an action vector each time and performing a simulation operation, and selecting the optimal action vector in the simulation operation results as the current action vector.
  28. The method of any one of claims 24-27, wherein the first operation, the second operation, or the third operation comprises an operation through a recurrent neural network (RNN).
  29. The method of any one of claims 16-28, further comprising adjusting the way of optimizing the feature vector extraction mode or the way of optimizing the action vector determination mode according to the current action vector.
  30. The method of any one of claims 16-29, wherein the operation system is applied to a camera, a robot, or an autonomous vehicle.
  31. A computing device, comprising a processor and a memory, wherein the memory is used for storing a program, and the processor is used for executing the program stored in the memory to control the computing device to execute the method of any one of claims 16-30.
  32. An autonomous driving tool, comprising a propulsion system, a sensor system, a control system, and an operation system, wherein the propulsion system is used for providing power for the autonomous driving tool, and the operation system is used for controlling the sensor system to acquire a current data vector based on an environment;
    the operation system is further configured to extract a current feature vector according to one or more data vectors, where the one or more data vectors include the current data vector;
    the operation system is further configured to optimize a feature vector extraction mode according to the one or more data vectors and the current feature vector;
    the operation system is further configured to determine a current action vector according to one or more feature vectors, where the one or more feature vectors include the current feature vector;
    the operation system is further configured to control the control system to apply the current action vector to the environment, so that the feature extraction unit acquires the next data vector based on the environment acted on by the current action vector;
    the operation system is further configured to control the sensor system to obtain current reward and punishment feedback based on the environment, where the current reward and punishment feedback is generated by one or more action vectors acting on the environment, and the one or more action vectors include the current action vector;
    and the operation system is further configured to optimize an action vector determination mode according to the current reward and punishment feedback.
  33. A camera, comprising a shooting system and an operation system;
    the operation system is configured to control the shooting system to acquire a current data vector based on an environment;
    the operation system is further configured to extract a current feature vector according to one or more data vectors, where the one or more data vectors include the current data vector;
    the operation system is further configured to optimize a feature vector extraction mode according to the one or more data vectors and the current feature vector;
    the operation system is further configured to determine a current action vector according to one or more feature vectors, where the one or more feature vectors include the current feature vector;
    the operation system is further configured to control the shooting system to apply the current action vector to the environment, so that the feature extraction unit acquires the next data vector based on the environment acted on by the current action vector;
    the operation system is further configured to obtain current reward and punishment feedback based on the environment, where the current reward and punishment feedback is generated by one or more action vectors acting on the environment, and the one or more action vectors include the current action vector;
    and the operation system is further configured to optimize an action vector determination mode according to the current reward and punishment feedback.
  34. A computer-readable storage medium, comprising computer-readable instructions which, when read and executed by a computer, cause the computer to perform the method of any one of claims 16-30.
  35. A computer program product, comprising computer-readable instructions which, when read and executed by a computer, cause the computer to perform the method of any one of claims 16-30.
CN201810789039.4A 2018-07-18 2018-07-18 Computing system and method Active CN110738221B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810789039.4A CN110738221B (en) 2018-07-18 2018-07-18 Computing system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810789039.4A CN110738221B (en) 2018-07-18 2018-07-18 Computing system and method

Publications (2)

Publication Number Publication Date
CN110738221A true CN110738221A (en) 2020-01-31
CN110738221B CN110738221B (en) 2024-04-26

Family

ID=69233637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810789039.4A Active CN110738221B (en) 2018-07-18 2018-07-18 Computing system and method

Country Status (1)

Country Link
CN (1) CN110738221B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150203114A1 (en) * 2014-01-22 2015-07-23 Honda Research Institute Europe Gmbh Lane relative position estimation method and system for driver assistance systems
CN107169567A (en) * 2017-03-30 2017-09-15 深圳先进技术研究院 The generation method and device of a kind of decision networks model for Vehicular automatic driving
WO2017158058A1 (en) * 2016-03-15 2017-09-21 Imra Europe Sas Method for classification of unique/rare cases by reinforcement learning in neural networks
CN107479547A (en) * 2017-08-11 2017-12-15 同济大学 Decision tree behaviour decision making algorithm based on learning from instruction
CN107506830A (en) * 2017-06-20 2017-12-22 同济大学 Towards the artificial intelligence training platform of intelligent automobile programmed decision-making module
CN107609502A (en) * 2017-09-05 2018-01-19 百度在线网络技术(北京)有限公司 Method and apparatus for controlling automatic driving vehicle
US20180032863A1 (en) * 2016-07-27 2018-02-01 Google Inc. Training a policy neural network and a value neural network
CN108288094A (en) * 2018-01-31 2018-07-17 清华大学 Deeply learning method and device based on ambient condition prediction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
STEWART A. BIRRELL等: "Smart Driving Assistance Systems: Designing and Evaluating Ecological and Conventional Displays", 《DRIVER DISTRACTION AND INATTENTION》, pages 373 - 388 *
惠飞;穆柯楠;赵祥模;: "基于动态概率网格和贝叶斯决策网络的车辆变道辅助驾驶决策方法", 交通运输工程学报, no. 02, pages 152 - 162 *

Also Published As

Publication number Publication date
CN110738221B (en) 2024-04-26

Similar Documents

Publication Publication Date Title
CN109901572B (en) Automatic driving method, training method and related device
US11055544B2 (en) Electronic device and control method thereof
CN112655000B (en) In-vehicle user positioning method, vehicle-mounted interaction method, vehicle-mounted device and vehicle
CN112965499A (en) Unmanned vehicle driving decision-making method based on attention model and deep reinforcement learning
WO2023131065A1 (en) Image processing method, lane line detection method and related device
CN112232490A (en) Deep simulation reinforcement learning driving strategy training method based on vision
CN110119714B (en) Driver fatigue detection method and device based on convolutional neural network
US12005922B2 (en) Toward simulation of driver behavior in driving automation
CN110371132A (en) Driver&#39;s adapter tube appraisal procedure and device
CN110956154A (en) Vibration information terrain classification and identification method based on CNN-LSTM
CN110930323A (en) Method and device for removing light reflection of image
WO2021231986A1 (en) Scenario identification for validation and training of machine learning based models for autonomous vehicles
CN111814667B (en) Intelligent road condition identification method
CN111874007A (en) Knowledge and data drive-based unmanned vehicle hierarchical decision method, system and device
CN115033022A (en) DDPG unmanned aerial vehicle landing method based on expert experience and oriented to mobile platform
Cai et al. A driving fingerprint map method of driving characteristic representation for driver identification
WO2020183776A1 (en) Arithmetic device
CN113561995B (en) Automatic driving decision method based on multi-dimensional reward architecture deep Q learning
WO2022246612A1 (en) Liveness detection method, training method for liveness detection model, apparatus thereof, and system
CN111950386A (en) Functional intelligence-based environment self-adaptive navigation scene recognition method for micro unmanned aerial vehicle
CN110738221A (en) operation system and method
Schenkel et al. Domain adaptation for semantic segmentation using convolutional neural networks
US11603119B2 (en) Method and apparatus for out-of-distribution detection
Pak et al. CarNet: A dynamic autoencoder for learning latent dynamics in autonomous driving tasks
US20210201533A1 (en) Information processing device, mobile body, and learning device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant