RU2523218C1 - Modified intelligent controller with adaptive critic

Info

Publication number: RU2523218C1
Application number: RU2013108118/08A
Authority: RU
Inventors: Владимир Игнатьевич Ключко; Евгений Александрович Шумков; Роман Оганесович Карнизьян
Priority date: 2013-02-22
Filing date: 2013-02-22
Publication date: 2014-07-20

Abstract

FIELD: information technology.

SUBSTANCE: system comprises interconnected control object, critic unit, decision neural network, action unit, reinforcement calculation unit, time difference calculation unit, action selection unit, action screening unit, critic training unit and action entry unit.

EFFECT: improved adaptive properties of the control system based on an intelligent controller and faster operation thereof.

1 dwg

Description

The invention relates to intelligent controllers that use reinforcement learning and fuzzy logic, and can be used to control complex systems in a non-deterministic environment.

An intelligent controller based on adaptive critic networks is known, for example from US patent 5448681 (IPC G06F 15/18). This device consists of a control object, a critic network and a control neural network. The output of the control object is connected to the first input of the critic network and to the first input of the control neural network; the output of the control neural network is connected to the input of the control object and to the second input of the critic network; the output of the critic network is connected to the second input of the control neural network.

The device according to US patent 5448681 (IPC G06F 15/18) operates as follows: the control object outputs a signal describing its state, the critic network computes the reinforcement for the current time iteration and object state, and the control neural network computes the control action taking the reinforcement into account.

A common drawback of devices based on adaptive critic networks is that the algorithm is neither general nor sufficient for building a universal adaptive control system for an object operating in a non-deterministic environment; the sheer number of implementation methods and structures (DHP, HDP, BAC, GDHP, FACL, GIFACL and others) makes this evident. A further drawback, given that the system must be controlled in real time, is the large amount of computation. In addition, a control system built on an adaptive critic cannot radically change its behavior and develop new responses when it encounters completely new, previously unknown data about the state of the environment and the control object (D. Prokhorov, D. Wunsch. Adaptive critic designs. IEEE Transactions on Neural Networks, September 1997, pp. 997-1007).

The closest technical solution is RF patent No. 2450336 (IPC G06F 15/00), "Modified intelligent controller with adaptive critic". The controller according to that patent consists of several structural components: a control object, an action unit, a decision neural network, a critic unit, a time difference calculation unit, a reinforcement calculation unit and an action selection unit. The control object is connected through its state and action outputs to the action unit, the parameter prediction unit, the time difference calculation unit and the reinforcement calculation unit. The output of the action unit is connected to the second input of the critic unit, and the output of the parameter prediction unit is connected to the first input of the critic unit. The output of the critic unit is connected to the input of the action selection unit and to the input of the time difference calculation unit. The time difference calculation unit is connected to the inputs and the output of the critic. The reinforcement calculation unit is connected to the time difference calculation unit. The outputs of the action selection unit are connected to the inputs of the action unit and to the control object.

The device according to RF patent No. 2450336 (IPC G06F 15/00) operates as follows: the control object outputs state and action signals, from which the action unit selects the actions possible in the given situation and passes them to the critic unit in parallel with the predicted value of the working parameter computed by the parameter prediction unit. The critic, on receiving these data, sequentially evaluates the consequences of the possible actions and passes the evaluations to the action selection unit, which uses the "greedy" rule to select an action and submits it to the control object for execution. In parallel with this process, the reinforcement calculation unit computes the reinforcement received and passes it to the time difference calculation unit, which computes the time difference error; if the error is significant, the time difference calculation unit halts the critic and retrains it on new data.

The drawbacks of this controller are insufficient adaptive properties, the complexity of implementing and training the critic unit, and the limited capabilities of the action selection unit.

The objective is to develop a modified intelligent controller with an adaptive critic.

The technical result of the proposed device is improved adaptive properties of a control system based on an intelligent controller, faster operation of the controller, and a simpler final implementation for the developer.

The technical result is achieved in that, in a modified intelligent controller with an adaptive critic comprising a control object, a critic unit, a decision neural network, an action unit, a reinforcement calculation unit, a time difference calculation unit and an action selection unit, in which the first and second outputs of the control object are connected to the first and second inputs of the decision neural network, to the first and second inputs of the time difference calculation unit and to the first and second inputs of the reinforcement calculation unit, the output of the decision neural network is connected to the first input of the critic unit, the output of the critic unit is connected to the input of the action selection unit, the output of the reinforcement calculation unit is connected to the third input of the time difference calculation unit, and the output of the action selection unit is connected to the input of the control object, there are additionally introduced an action screening unit, a critic training unit and an action entry unit, wherein the first and second outputs of the control object are also connected to the first and second inputs of the action screening unit, the first output of the action screening unit is connected to the first input of the action unit, the second output of the action screening unit is connected to the second input of the critic unit, the third output of the action screening unit is connected to the third input of the decision neural network, the fourth output of the action screening unit is connected to the second input of the action selection unit, the output of the action unit is connected to the third input of the action screening unit, the output of the reinforcement calculation unit is also connected to the second input of the action entry unit, the output of the time difference calculation unit is connected to the first input of the critic training unit, the first and second outputs of the critic training unit are connected respectively to the first and second inputs of the critic unit, the third output of the critic training unit is connected to the fourth input of the time difference calculation unit, the output of the critic unit is also connected to the second input of the critic training unit, the output of the action selection unit is also connected to the first input of the action entry unit, and the output of the action entry unit is connected to the second input of the action unit.

The adaptive properties are improved by moving the training of the critic unit's neural network into a separate component, the critic training unit. Another important point is that work with the action unit follows a new principle that uses two new units, the action screening unit and the action entry unit. The system operates faster thanks to the action screening unit, which discards possible actions that do not meet the specified minimum reinforcement. The implementation is simplified for the developer by decomposing the action unit into several units and by moving the training of the critic unit's neural network into a separate unit.

Thus, the set of essential features set forth in the claims makes it possible to achieve the desired result.

Fig. 1 shows a diagram of the modified intelligent controller with an adaptive critic.

The system consists of several structural components: reinforcement calculation unit 1, time difference calculation unit 2, critic training unit 3, critic unit 4, decision neural network 5, action selection unit 6, action screening unit 7, action unit 8, action entry unit 9 and control object 10.

The system also contains the following connections. The control object outputs the object state signal 11, which is fed to the inputs of the action screening unit (11.1), the decision neural network (11.2), the reinforcement calculation unit (11.3) and the time difference calculation unit (11.4). The control object also outputs the environment state signal 12, which is fed to the inputs of the action screening unit (12.1), the decision neural network (12.2), the reinforcement calculation unit (12.3) and the time difference calculation unit (12.4). The output of the action unit is connected to the action screening unit by signal 13. The action screening unit sends signals to the action unit (14), the decision neural network (15), the critic unit (16) and the action selection unit (17). The decision neural network sends signal 18 to the input of the critic unit. The outputs of the critic training unit are connected to the inputs of the critic unit by signals 19 and 20, and the output 21 of the critic unit is connected to the critic training unit (21.1) and the action selection unit (21.2). The time difference calculation unit sends signal 22 to the critic training unit, and there is feedback 23 from the critic training unit to the time difference calculation unit. The output 24 of the reinforcement calculation unit is connected to the inputs of the time difference calculation unit (24.1) and the action entry unit (24.2). The output 25 of the action selection unit is connected to the inputs of the control object (25.1) and the action entry unit (25.2). The action entry unit sends signal 26 to the action unit.

The reinforcement calculation unit 1 computes the reinforcement r(t). The formula for computing the reinforcement is specified by the developer.

The time difference calculation unit 2 computes the time difference by the formula

δ(t) = r(t) + γ·V(t) - V(t-1),

where γ ∈ (0; 1] is the forgetting coefficient.
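The formula can be written out directly. The following is a minimal sketch, assuming scalar values for r(t), V(t) and V(t-1); the function name `td_error` and the default value of `gamma` are illustrative and do not come from the patent.

```python
def td_error(r_t: float, v_t: float, v_prev: float, gamma: float = 0.9) -> float:
    """Return delta(t) = r(t) + gamma * V(t) - V(t-1).

    gamma is the forgetting coefficient and must lie in (0, 1].
    The default of 0.9 is an arbitrary illustrative choice.
    """
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in the interval (0, 1]")
    return r_t + gamma * v_t - v_prev


if __name__ == "__main__":
    # Reinforcement 1.0, current estimate 2.0, previous estimate 1.5, gamma 0.5.
    print(td_error(1.0, 2.0, 1.5, gamma=0.5))
```

A positive value means the outcome was better than the critic's previous estimate suggested; a negative value means it was worse.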

The critic training unit 3 trains and retrains the neural network of the critic unit.

The critic unit 4 computes the predicted value of the situation quality V(t) that would follow if a particular action were chosen. The situation quality is computed with a layered, fully connected feedforward neural network (a multilayer perceptron).
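As a rough illustration of such a network, the sketch below evaluates a perceptron with one hidden layer. The class name `CriticMLP`, the single hidden layer, the tanh activation, the random initialization and the numeric encoding of the inputs are all assumptions; the description fixes only the network type.

```python
import math
import random
from typing import List, Sequence


class CriticMLP:
    """Hypothetical one-hidden-layer perceptron producing a scalar V."""

    def __init__(self, n_inputs: int, n_hidden: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.n_inputs = n_inputs
        # Hidden-layer weights: one row of n_inputs weights per hidden unit.
        self.w1: List[List[float]] = [
            [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
            for _ in range(n_hidden)
        ]
        self.b1: List[float] = [0.0] * n_hidden
        # Output-layer weights: one weight per hidden unit, linear output.
        self.w2: List[float] = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2: float = 0.0

    def forward(self, x: Sequence[float]) -> float:
        """Propagate the input through the hidden layer to the output."""
        if len(x) != self.n_inputs:
            raise ValueError("unexpected input length")
        hidden = [
            math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.w1, self.b1)
        ]
        return sum(w * h for w, h in zip(self.w2, hidden)) + self.b2


if __name__ == "__main__":
    critic = CriticMLP(n_inputs=2, n_hidden=3)
    # Inputs here stand for an encoded candidate action and a predicted parameter.
    print(critic.forward([1.0, 0.5]))
```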

The decision neural network 5 predicts the next value of the working parameter of the system (or of several parameters). A working parameter is a system parameter by which the system can assess how well it is operating, or a parameter that serves as a reference for the system's operation (there may be several working parameters).

The action selection unit 6 selects a specific action from all those possible in the given situation. Selection uses the so-called "ε-greedy rule" (Sutton R., Barto A. Reinforcement Learning: An Introduction. - Cambridge: MIT Press, 1998), which can be stated as: "with probability (1 - ε), the action corresponding to the maximum value of the situation quality is selected, where 0 < ε << 1".
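A minimal sketch of ε-greedy selection follows. The description states only the greedy branch; choosing uniformly at random in the remaining ε fraction of cases is the usual convention from Sutton and Barto and is an assumption here, as are the function name `epsilon_greedy` and the default ε.

```python
import random
from typing import Optional, Sequence


def epsilon_greedy(
    values: Sequence[float],
    epsilon: float = 0.05,
    rng: Optional[random.Random] = None,
) -> int:
    """Return the index of the chosen action.

    values[i] is the critic's quality estimate for candidate action i.
    With probability (1 - epsilon) the best-valued action is returned;
    otherwise (assumption) an action is drawn uniformly at random.
    """
    if not values:
        raise ValueError("no candidate actions")
    if rng is None:
        rng = random.Random()
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=lambda i: values[i])


if __name__ == "__main__":
    estimates = [0.1, 0.7, 0.3]
    print(epsilon_greedy(estimates, epsilon=0.05, rng=random.Random(42)))
```

Passing an explicit `random.Random` instance keeps runs reproducible, which helps when comparing controller behavior across experiments.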

The action screening unit 7 selects all actions possible in the given situation, subject to the minimum accumulated reinforcement required for a possible action.

The action unit 8 stores a table of the possible actions in all possible situations, together with the reinforcement accumulated from performing a given action in a given situation.

The action entry unit 9 makes corrections to the action unit 8. Once the control object has carried out an action, this unit updates the accumulated reinforcement in the cell of the action that was selected at the previous iteration.

The claimed device operates as follows. The control object 10 performs an action and outputs the object state signal 11 and the environment state signal 12, which are fed to the action screening unit 7 over connections 11.1 and 12.1, to the decision neural network 5 over 11.2 and 12.2, to the reinforcement calculation unit 1 over 11.3 and 12.3, and to the time difference calculation unit 2 over 11.4 and 12.4. When new data on the state of the object and the environment arrive, the action screening unit 7 queries the action unit 8 over connection 14 for the actions possible in the given situation. At system start-up the developer sets the minimum value of total received reinforcement starting from which possible actions are selected in a given situation. The developer also sets the number of control iterations during which the history of reinforcement received in various situations for various performed actions is accumulated (the so-called "environment exploration stage"). Having received the possible actions for the given situation over connection 13, the action screening unit 7, in synchrony with the parameter prediction unit 5 (that is, the decision neural network 5), begins feeding the possible actions to the critic unit 4 over signal 16. The decision neural network 5 and the action screening unit 7 are synchronized over signal 15.

The decision neural network 5, having received the current state of the control object 11.2 and of the environment 12.2, computes the predicted value of the working parameter for the next time iteration and passes the computed value 18 to the critic unit 4 together with signal 16 from the action screening unit 7.

The critic unit 4, sequentially receiving pairs of values {possible_action; working_parameter_prediction}, predicts the possible future reinforcement 21 that would result from performing the given possible action.

The reinforcement calculation unit 1, receiving the current state of the environment 12.3 and of the control object 11.3, computes by the given formula the reinforcement received for the last completed control iteration. The computed reinforcement 24.1 is fed to the time difference calculation unit 2, which computes the time difference and forms {inputs; outputs} sets for training the neural network of the critic unit 4. If the time difference error is large, that is, above the limit set by the developer, the time difference calculation unit 2 starts retraining of the neural network of the critic unit 4 by sending signal 22 to the critic training unit 3. The critic training unit 3 receives the signal to start retraining and begins retraining the neural network of the critic unit 4. The connections to inputs 19 and 20 of the neural network of the critic unit 4 are activated, which blocks the data coming from the action screening unit 7 and the parameter prediction unit 5 (that is, the decision neural network 5). The critic training unit 3 trains the neural network of the critic unit 4 using the error backpropagation algorithm (see Rumelhart D.E., Hinton G.E., Williams R.J., "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533-536, 1986). Training examples are chosen at random: the request for a new example goes over connection 23, and the number of the training example is returned over connection 22. The inputs of the selected example are applied to inputs 19 and 20 of the neural network of the critic unit 4, then the output value 21.1 of the network is read and compared with the output of the training example. If the error is above the value set by the developer, the next training example is formed, and so on.
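The retraining loop described above can be sketched as follows. This is only an illustration under several assumptions: the network has one tanh hidden layer and a linear output, training is plain stochastic backpropagation on squared error, and retraining stops when the worst error over the stored examples falls to the developer's tolerance. The names `TinyCritic` and `retrain`, the learning rate and the step limit are hypothetical.

```python
import math
import random
from typing import List, Tuple

# One training example: (network inputs, target output).
Example = Tuple[List[float], float]


class TinyCritic:
    """Hypothetical perceptron: tanh hidden layer, linear scalar output."""

    def __init__(self, n_in: int, n_hidden: int, rng: random.Random) -> None:
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x: List[float]) -> Tuple[float, List[float]]:
        """Return the output and the hidden activations."""
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        return sum(w * hj for w, hj in zip(self.w2, h)) + self.b2, h

    def backprop_step(self, x: List[float], target: float, lr: float) -> None:
        """One gradient step on 0.5 * (output - target) ** 2."""
        y, h = self.forward(x)
        err = y - target
        for j, hj in enumerate(h):
            # Gradient at hidden unit j, taken before w2[j] is changed.
            grad_h = err * self.w2[j] * (1.0 - hj * hj)
            self.w2[j] -= lr * err * hj
            self.b1[j] -= lr * grad_h
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * grad_h * xi
        self.b2 -= lr * err


def worst_error(critic: TinyCritic, examples: List[Example]) -> float:
    """Largest absolute output error over the stored examples."""
    return max(abs(critic.forward(xs)[0] - t) for xs, t in examples)


def retrain(critic: TinyCritic, examples: List[Example], tolerance: float,
            lr: float = 0.05, max_steps: int = 20000, seed: int = 0) -> int:
    """Train on randomly chosen examples; return the number of steps used."""
    rng = random.Random(seed)
    for step in range(1, max_steps + 1):
        x, target = rng.choice(examples)  # examples are picked at random
        critic.backprop_step(x, target, lr)
        if worst_error(critic, examples) <= tolerance:
            return step
    return max_steps


if __name__ == "__main__":
    data: List[Example] = [([0.0, 0.0], 0.0), ([1.0, 1.0], 1.0)]
    net = TinyCritic(2, 4, random.Random(1))
    print(retrain(net, data, tolerance=0.05), worst_error(net, data))
```

The step limit guards against a tolerance that the network cannot reach; the description itself does not say how such a case is handled.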

Once the neural network of critic unit 4 has been trained, inputs 19 and 20 of critic unit 4 are deactivated, signals 16 and 18 from action screening unit 7 and decision neural network 5 begin to arrive at it, and critic unit 4 resumes normal operation, that is, it calculates the possible reinforcement for each possible action. The calculated reinforcement for each possible action is passed to action selection unit 6, which applies the so-called "greedy rule" to select an action and sends it to control object 10 via signal 25.1 and to action entry unit 9 via signal 25.2. On signal 26, action entry unit 9 updates the value of the corresponding cell in action unit 8. That is, if the K-th action was selected at iteration (i-1) in control situation S_{i-1}, was executed by the control object, and yielded reinforcement r_{i-1} (possibly negative), that reinforcement is entered into the cell of the K-th action for situation S_{i-1}. Thus action entry unit 9 holds the selected action 25.2 for one cycle and updates the accumulated reinforcement of that action in the corresponding situation once the result of the action has been received in the form of reinforcement.
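The greedy selection and the one-cycle delayed update described above might look like the following minimal sketch. The assumptions are: the critic's per-action estimates are available as a dictionary, the "greedy rule" means picking the action with the highest estimate, and the accumulated reinforcement is kept as a running sum (the text does not state the accumulation formula). The names `greedy_action` and `ActionEntryUnit` are hypothetical.

```python
def greedy_action(predicted_reinforcement):
    """Greedy rule: pick the action with the highest predicted reinforcement."""
    return max(predicted_reinforcement, key=predicted_reinforcement.get)


class ActionEntryUnit:
    """Holds the chosen action for one cycle, then credits it with the
    reinforcement that arrives after the control object has executed it."""

    def __init__(self, action_table):
        # action_table maps (situation, action) -> accumulated reinforcement.
        self.action_table = action_table
        self.pending = None

    def record_choice(self, situation, action):
        # Corresponds to receiving the selected action (signal 25.2).
        self.pending = (situation, action)

    def apply_reinforcement(self, reinforcement):
        # Corresponds to updating the cell of the action unit (signal 26).
        if self.pending is None:
            return
        previous = self.action_table.get(self.pending, 0.0)
        # Assumption: accumulation is a plain running sum.
        self.action_table[self.pending] = previous + reinforcement
        self.pending = None


if __name__ == "__main__":
    table = {}
    entry_unit = ActionEntryUnit(table)
    action = greedy_action({"open_valve": 0.2, "close_valve": 0.7})
    entry_unit.record_choice("S0", action)
    entry_unit.apply_reinforcement(-0.5)  # reinforcement may be negative
    print(table)
```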

Action unit 8 operates as follows. It consists of quadruples of values, which can be implemented as a matrix. Each quadruple has the form {state of the external environment, state of the control object, accumulated efficiency coefficient, possible action}. Accordingly, action screening unit 7 selects those actions that satisfy the criterion of the minimum specified accumulated reinforcement, and action entry unit 9 updates the accumulated efficiency coefficient of the selected action for the given state of the external environment and of the control object.
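As an illustration of the quadruple structure and the screening criterion, the following sketch stores the action unit as a list of records and filters it by a minimum accumulated efficiency. The record layout follows the quadruple given above; the names `ActionRecord` and `screen_actions`, the string encoding of states, and the use of `>=` for the threshold comparison are assumptions made for the example.

```python
from typing import List, NamedTuple


class ActionRecord(NamedTuple):
    """One row of the action unit: the quadruple from the description."""
    environment_state: str
    object_state: str
    accumulated_efficiency: float
    action: str


def screen_actions(records: List[ActionRecord],
                   environment_state: str,
                   object_state: str,
                   min_efficiency: float) -> List[str]:
    """Return the actions stored for the given states whose accumulated
    efficiency is at least min_efficiency (assumed inclusive threshold)."""
    return [
        record.action
        for record in records
        if record.environment_state == environment_state
        and record.object_state == object_state
        and record.accumulated_efficiency >= min_efficiency
    ]


if __name__ == "__main__":
    # Hypothetical contents of the action unit.
    records = [
        ActionRecord("calm", "idle", 0.8, "start"),
        ActionRecord("calm", "idle", -0.3, "stop"),
        ActionRecord("storm", "idle", 0.9, "stop"),
    ]
    print(screen_actions(records, "calm", "idle", min_efficiency=0.0))
```

Only the screened actions would then be offered to the critic for evaluation, which is how the screening step reduces the number of actions considered per iteration.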

Claims

A modified intelligent controller with an adaptive critic, comprising a control object, a critic unit, a decision neural network, an action unit, a reinforcement calculation unit, a time difference calculation unit and an action selection unit, wherein the first and second outputs of the control object are connected to the first and second inputs of the decision neural network, to the first and second inputs of the time difference calculation unit and to the first and second inputs of the reinforcement calculation unit, the output of the decision neural network is connected to the first input of the critic unit, the output of the critic unit is connected to the input of the action selection unit, the output of the reinforcement calculation unit is connected to the third input of the time difference calculation unit, and the output of the action selection unit is connected to the input of the control object, characterized in that an action screening unit, a critic training unit and an action entry unit are introduced into it, wherein the first and second outputs of the control object are also connected to the first and second inputs of the action screening unit, the first output of the action screening unit is connected to the first input of the action unit, the second output of the action screening unit is connected to the second input of the critic unit, the third output of the action screening unit is connected to the third input of the decision neural network, the fourth output of the action screening unit is connected to the second input of the action selection unit, the output of the action unit is connected to the third input of the action screening unit, the output of the reinforcement calculation unit is also connected to the second input of the action entry unit, the output of the time difference calculation unit is connected to the first input of the critic training unit, the first and second outputs of the critic training unit are connected to the first and second inputs of the critic unit, respectively, the third output of the critic training unit is connected to the fourth input of the time difference calculation unit, the output of the critic unit is also connected to the second input of the critic training unit, the output of the action selection unit is also connected to the first input of the action entry unit, and the output of the action entry unit is connected to the second input of the action unit.