KR20220102395A

KR20220102395A - System and Method for Improving of Advanced Deep Reinforcement Learning Based Traffic in Non signalalized Intersections for the Multiple Self driving Vehicles

Info

Publication number: KR20220102395A
Application number: KR1020210004701A
Authority: KR
Inventors: 배상훈
Original assignee: 부경대학교 산학협력단; 에스에이엠(주)
Priority date: 2021-01-13
Filing date: 2021-01-13
Publication date: 2022-07-20
Also published as: KR102461831B1

Abstract

The present invention relates to an apparatus and a method for reinforcement-learning-based traffic improvement at non-signalized intersections for platooning of autonomous vehicles, which improve traffic through a non-signalized intersection and secure safety by learning autonomous vehicle platooning in a mixed traffic situation in which platooned autonomous vehicles and human-driven vehicles coexist. The apparatus comprises: a Simulation of Urban MObility (SUMO) simulation execution unit, which builds a simulation environment using SUMO and transfers the speed, position, and sensor data of the autonomous vehicles to a FLOW application unit; the FLOW application unit, which builds a simulation environment in FLOW, a reinforcement learning platform that can be linked with SUMO, derives driving behavior without reinforcement learning, controls the vehicles, updates the simulation state, and transfers state and reward information to a reinforcement learning library environment building unit; and the reinforcement learning library environment building unit, which uses FLOW to optimize traffic control by multi-agent deep reinforcement learning. The SUMO simulation execution unit presents results for AV penetration rates from 1 to 100% in steps of 10%, in a scenario where groups of vehicles approach the non-signalized intersection and drive straight through from four different directions; lane changes and left turns at the intersection are disregarded for all vehicles.

Description

Apparatus and Method for Reinforcement-Learning-Based Traffic Improvement at Non-Signalized Intersections for Platooning of Autonomous Vehicles {System and Method for Improving of Advanced Deep Reinforcement Learning Based Traffic in Non signalalized Intersections for the Multiple Self driving Vehicles}

The present invention relates to operation control of multiple autonomous vehicles. Specifically, it relates to an apparatus and method for reinforcement-learning-based traffic improvement at non-signalized intersections for platooning of autonomous vehicles, which improve traffic through non-signalized intersections and secure safety by learning autonomous vehicle platooning in a mixed traffic flow in which platooned autonomous vehicles and human-driven vehicles coexist.

An autonomous vehicle is a vehicle equipped with technology that recognizes lanes and steers automatically using a camera or a forward object detection sensor. Based on camera image processing or forward object sensing, the autonomous vehicle measures the lane width, the lateral position of the vehicle in the lane, the distance to the lane markings on both sides, the shape of the lane, and the radius of curvature of the road. It uses the vehicle position and road information obtained in this way to estimate the driving trajectory, and changes lanes along the estimated trajectory.

An autonomous vehicle can also maintain an appropriate distance from the preceding vehicle. Using the position and distance of the preceding vehicle detected by a front-mounted camera or forward object detection sensor, it automatically controls the throttle valve, brake, and transmission to accelerate or decelerate as needed.

However, when such an autonomous vehicle passes through an intersection, it stops according to the traffic signal and, when starting again, departs only after detecting the movement of the preceding vehicle. Departures are therefore delayed from vehicle to vehicle, which can cause congestion at the intersection.

In particular, when the driving environment is perceived from sensor input, as in an autonomous vehicle, driving through a non-signalized intersection is far more difficult than driving on an ordinary road.

Meanwhile, advances in wireless communication technology have spurred active research on the Internet of Things (IoT), and the Internet of Vehicles (IoV) is drawing attention alongside it. A Vehicular Ad-hoc Network (VANET), a wireless network in which each vehicle acts as a node for inter-vehicle communication, is a form of Mobile Ad-hoc Network (MANET).

Simulation of Urban MObility (SUMO) is open-source software designed to simulate road traffic networks.

With SUMO, traffic flow can be predicted by observing how vehicles move relative to one another on the road.

Using these technologies, research is under way on enabling autonomous vehicles to perceive the driving environment and drive efficiently through non-signalized intersections. However, many problems remain to be solved for non-signalized intersection traffic with autonomous vehicle platooning in a mixed traffic flow (where autonomous vehicles and human drivers coexist).

Accordingly, new technology is needed to improve traffic and secure safety at non-signalized intersections under autonomous vehicle platooning.

Republic of Korea Patent Publication No. 10-2020-0071406
Republic of Korea Patent Publication No. 10-2020-0058613
Republic of Korea Patent Publication No. 10-2018-0065196

The present invention is intended to solve the problems of prior-art autonomous vehicle operation control technology. Its object is to provide an apparatus and method for reinforcement-learning-based traffic improvement at non-signalized intersections for platooning of autonomous vehicles, which improve traffic through non-signalized intersections and secure safety by learning autonomous vehicle platooning in a mixed traffic flow in which platooned autonomous vehicles and human-driven vehicles coexist.

Another object of the present invention is to provide such an apparatus and method that maximize the reinforcement learning reward for actions by applying reinforcement learning together with a partially observable Markov decision process (POMDP), so that learning uses only the information an autonomous vehicle can observe, as in a real situation.

Another object of the present invention is to provide such an apparatus and method that optimize traffic control by applying PPO, an algorithm for optimizing the training of artificial neural networks, to learn the driving behavior of autonomous vehicles in a mixed traffic flow in which platooned autonomous vehicles and human-driven vehicles coexist.

Another object of the present invention is to provide such an apparatus and method that build the experimental environment with SUMO (Simulation of Urban MObility), define the human driver with an ACC (Adaptive Cruise Control) system, and use FLOW, a reinforcement learning platform that can be linked with SUMO, to optimize traffic control by multi-agent deep reinforcement learning.

Another object of the present invention is to provide such an apparatus and method that, by tuning the reinforcement learning parameters and by optimizing and verifying operation for each autonomous vehicle penetration rate, raise the average travel speed at non-signalized intersections in a fully autonomous environment compared with a fully human-driven environment.

Another object of the present invention is to provide such an apparatus and method that determine the behavior of autonomous vehicles in the simulation environment according to a partially observable Markov decision process (POMDP), learn with the average speed as the reward, and apply the PPO (Proximal Policy Optimization) algorithm for multi-agent deep reinforcement learning to optimize action decisions.

Another object of the present invention is to provide such an apparatus and method that, in order to mimic a real autonomous driving environment in simulation, base learning and action decisions on data obtained through the autonomous vehicle sensors (partial observation) rather than on the full simulation state, and maximize the reinforcement learning reward for the resulting actions.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve the above objects, an apparatus for reinforcement-learning-based traffic improvement at non-signalized intersections for platooning of autonomous vehicles according to the present invention comprises: a SUMO simulation execution unit that builds a simulation environment using SUMO (Simulation of Urban MObility) and transfers the speed, position, and sensor data of the autonomous vehicles to a FLOW application unit; the FLOW application unit, which builds a simulation environment in FLOW, a reinforcement learning platform that can be linked with SUMO, derives driving behavior without reinforcement learning, controls the vehicles, updates the simulation state, and transfers state and reward information to a reinforcement learning library environment building unit; and the reinforcement learning library environment building unit, which uses FLOW to optimize traffic control by multi-agent deep reinforcement learning.

To achieve another object, a method for reinforcement-learning-based traffic improvement at non-signalized intersections for platooning of autonomous vehicles according to the present invention comprises: a SUMO simulation execution step of building a simulation environment using SUMO (Simulation of Urban MObility) and transferring the speed, position, and sensor data of the autonomous vehicles to the FLOW application unit; a FLOW application step of building a simulation environment in FLOW, a reinforcement learning platform that can be linked with SUMO, deriving driving behavior without reinforcement learning, controlling the vehicles, updating the simulation state, and transferring state and reward information to the reinforcement learning library environment building unit; and a reinforcement learning library environment building step of using FLOW to optimize traffic control by multi-agent deep reinforcement learning.

The apparatus and method for reinforcement-learning-based traffic improvement at non-signalized intersections for platooning of autonomous vehicles according to the present invention, as described above, have the following effects.

First, in a mixed traffic flow in which platooned autonomous vehicles and human-driven vehicles coexist, learning autonomous vehicle platooning improves traffic through non-signalized intersections and secures safety.

Second, applying reinforcement learning together with a partially observable Markov decision process (POMDP), so that learning uses only the information an autonomous vehicle can observe, as in a real situation, maximizes the reinforcement learning reward for actions.

Third, traffic control can be optimized by applying PPO, an algorithm for optimizing the training of artificial neural networks, to learn the driving behavior of autonomous vehicles in a mixed traffic flow in which platooned autonomous vehicles and human-driven vehicles coexist.

Fourth, the experimental environment is built with SUMO (Simulation of Urban MObility), the human driver is defined with an ACC (Adaptive Cruise Control) system, and FLOW, a reinforcement learning platform that can be linked with SUMO, is used to optimize traffic control by multi-agent deep reinforcement learning.

Fifth, tuning the reinforcement learning parameters and optimizing and verifying operation for each autonomous vehicle penetration rate raise the average travel speed at non-signalized intersections in a fully autonomous environment compared with a fully human-driven environment.

Sixth, the behavior of autonomous vehicles in the simulation environment is determined according to a partially observable Markov decision process (POMDP), learning uses the average speed as the reward, and the PPO (Proximal Policy Optimization) algorithm is applied for multi-agent deep reinforcement learning to optimize action decisions.

Seventh, in order to mimic a real autonomous driving environment in simulation, learning and action decisions are based on data obtained through the autonomous vehicle sensors (partial observation) rather than on the full simulation state, and the reinforcement learning reward for the resulting actions is maximized.

FIG. 1 is a diagram of the deep reinforcement learning architecture at a non-signalized intersection according to the present invention.
FIG. 2 shows the PPO algorithm using the adaptive KL penalty.
FIG. 3 is a block diagram of the apparatus for reinforcement-learning-based traffic improvement at non-signalized intersections for platooning of autonomous vehicles according to the present invention.
FIG. 4 is a flowchart of the method for reinforcement-learning-based traffic improvement at non-signalized intersections for platooning of autonomous vehicles according to the present invention.
FIG. 5 is a diagram of a typical SUMO simulator configuration at a non-signalized intersection.
FIG. 6 is a diagram showing an example of a typical observation space.
FIG. 7 is a diagram showing the experimental characteristics of leading autonomous vehicles at a non-signalized intersection.
FIG. 8 is a diagram comparing experiments at a non-signalized intersection.
FIG. 9 is a graph of the average reward curves over 200 or more iterations for each AV penetration rate.
FIG. 10 is a graph of the spatio-temporal dynamics at a non-signalized intersection by AV penetration rate.
FIG. 11 is a graph of the average speed, average delay, average fuel consumption, and average emissions derived in the SUMO simulation environment.

Hereinafter, preferred embodiments of the apparatus and method for reinforcement-learning-based traffic improvement at non-signalized intersections for platooning of autonomous vehicles according to the present invention are described in detail.

The features and advantages of the apparatus and method according to the present invention will become apparent from the detailed description of each embodiment below.

FIG. 1 is a diagram of the deep reinforcement learning architecture at a non-signalized intersection according to the present invention, and FIG. 2 shows the PPO algorithm using the adaptive KL penalty.

The apparatus and method according to the present invention improve traffic through non-signalized intersections and secure safety by learning autonomous vehicle platooning in a mixed traffic flow in which platooned autonomous vehicles and human-driven vehicles coexist.

To this end, the present invention may include a configuration that maximizes the reinforcement learning reward for actions by applying reinforcement learning together with a partially observable Markov decision process (POMDP), so that learning uses only the information an autonomous vehicle can observe, as in a real situation.

The present invention may include a configuration that optimizes traffic control by applying PPO, an algorithm for optimizing the training of artificial neural networks, to learn the driving behavior of autonomous vehicles in a mixed traffic flow in which platooned autonomous vehicles and human-driven vehicles coexist.

The present invention may include a configuration that builds the experimental environment with SUMO (Simulation of Urban MObility), defines the human driver with an ACC (Adaptive Cruise Control) system, and uses FLOW, a reinforcement learning platform that can be linked with SUMO, to optimize traffic control by multi-agent deep reinforcement learning.

The present invention may include a configuration that tunes the reinforcement learning parameters and optimizes and verifies operation for each autonomous vehicle penetration rate; a configuration that determines the behavior of autonomous vehicles in the simulation environment according to a partially observable Markov decision process (POMDP) and learns with the average speed as the reward; and a configuration that applies the PPO (Proximal Policy Optimization) algorithm for multi-agent deep reinforcement learning to optimize action decisions.
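The average-speed reward mentioned above can be illustrated with a minimal Python sketch. The function name `average_speed_reward` and the choice to return 0.0 for an empty network are assumptions made for illustration; the text only states that the average speed serves as the reward.

```python
from typing import Sequence


def average_speed_reward(speeds: Sequence[float]) -> float:
    """Return the mean speed (m/s) of the vehicles currently in the network.

    Hypothetical helper: an empty network yields 0.0 so the reward stays defined.
    """
    if len(speeds) == 0:
        return 0.0
    return sum(speeds) / len(speeds)


if __name__ == "__main__":
    # Three vehicles travelling at 8, 10 and 12 m/s.
    print(average_speed_reward([8.0, 10.0, 12.0]))  # 10.0
```

A reward of this form grows when vehicles keep moving through the intersection and shrinks when they queue, which is why it suits a traffic improvement objective.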

Reinforcement learning (RL) is a subfield of machine learning in which an agent interacts with its environment and learns actions that maximize the cumulative reward.

RL problems are typically formalized as a Markov decision process (MDP), a powerful framework for determining the appropriate action given the full set of observations.

An MDP is a tuple $(S, A, P, R, \rho_0, \gamma, T)$, where $S$ and $A$ are the states and actions of the agent, respectively, $P$ defines the transition probability, $R$ defines the reward for the selected action, $\rho_0$ defines the initial state distribution, $\gamma$ is the discount factor, with a value from 0 to 1, and $T$ is the time horizon.

However, automated vehicles operate in an uncertain environment that includes inaccuracy, unknown intent, and sensor noise. To address this, the partially observable MDP (POMDP) was proposed, which adds two elements: $O$, the set of observations, and $Z$, the observation function.
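As a rough illustration of partial observability, the following Python sketch implements a toy observation function in the role of $Z$: the ego vehicle sees only vehicles within a sensing radius, optionally with Gaussian noise. The `Vehicle` class, the 50 m range, and the noise model are hypothetical and are not taken from the invention.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Vehicle:
    x: float      # position along the east-west axis (m)
    y: float      # position along the north-south axis (m)
    speed: float  # speed (m/s)


def observe(ego: Vehicle, others: List[Vehicle], sensor_range: float = 50.0,
            noise_std: float = 0.0,
            rng: Optional[random.Random] = None) -> List[Vehicle]:
    """Hypothetical observation function Z: noisy copies of vehicles in range."""
    rng = rng or random.Random(0)
    visible = []
    for v in others:
        # Vehicles beyond the sensor range are simply not observed.
        if math.hypot(v.x - ego.x, v.y - ego.y) <= sensor_range:
            visible.append(Vehicle(
                v.x + rng.gauss(0.0, noise_std),
                v.y + rng.gauss(0.0, noise_std),
                v.speed + rng.gauss(0.0, noise_std),
            ))
    return visible


if __name__ == "__main__":
    ego = Vehicle(0.0, 0.0, 5.0)
    traffic = [Vehicle(10.0, 0.0, 8.0), Vehicle(100.0, 0.0, 9.0)]
    print(len(observe(ego, traffic)))  # only the vehicle 10 m away is seen
```

The policy then acts on the returned list rather than on the full simulator state, which is the distinction between an MDP and a POMDP drawn above.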

The objective of the RL agent is to optimize the policy $\pi_\theta$ so as to maximize the expected cumulative discounted reward over a number of time steps.

Deep neural networks (DNNs) can perform feature extraction automatically thanks to their multiple hidden representation layers. For continuous controllers, artificial neural networks (ANNs) that use several hidden layers to represent complex functions are a common choice.

In this work, a multilayer perceptron (MLP) is applied to produce the output set (the policy) from the input set (states and observations). In addition, PPO, which is based on gradient-descent optimization, is applied to improve the performance of the DNN.

The proposed deep RL framework, which combines the MLP with RL, is designed to account for the effect of AVs at non-signalized intersections.

First, the SUMO simulator executes one simulation step.

Second, the Flow framework sends information about the state of the SUMO simulator to the RL library. The RL library (RLlib) then computes the appropriate action for that state through the MLP. The MLP policy is applied to maximize the cumulative reward of the RL algorithm based on the traffic data.

Finally, the simulation resets and the RL process repeats.
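The step, act, reset cycle described above can be sketched with a toy stand-in. `ToySimulator`, `policy`, and `run` are hypothetical names: the simulator below is not SUMO, the policy is a fixed rule rather than a trained MLP, and no Flow or RLlib API is used.

```python
from typing import List, Tuple


class ToySimulator:
    """Hypothetical stand-in for the simulator: one vehicle with a speed state."""

    def __init__(self, horizon: int = 5, dt: float = 1.0) -> None:
        self.horizon = horizon
        self.dt = dt
        self.reset()

    def reset(self) -> float:
        """Start a new episode and return the initial state (speed in m/s)."""
        self.t = 0
        self.speed = 0.0
        return self.speed

    def step(self, accel: float) -> Tuple[float, float, bool]:
        """Advance one simulation step; return (state, reward, done)."""
        self.speed = max(0.0, self.speed + accel * self.dt)
        self.t += 1
        # The reward is the current speed, mirroring the average-speed reward.
        return self.speed, self.speed, self.t >= self.horizon


def policy(state: float) -> float:
    """Stand-in for the MLP policy: accelerate until 10 m/s is reached."""
    return 1.0 if state < 10.0 else 0.0


def run(sim: ToySimulator, episodes: int) -> List[float]:
    """Run whole episodes and collect the undiscounted return of each."""
    returns = []
    for _ in range(episodes):
        state, done, total = sim.reset(), False, 0.0
        while not done:
            action = policy(state)                  # the policy picks an action
            state, reward, done = sim.step(action)  # the simulator advances
            total += reward
        returns.append(total)                       # then the episode resets
    return returns


if __name__ == "__main__":
    print(run(ToySimulator(), episodes=2))  # [15.0, 15.0]
```

In a real setup the learning algorithm would also update the policy parameters between episodes; that step is omitted here to keep the control flow visible.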

FIG. 1 shows the deep reinforcement learning architecture at a non-signalized intersection.

Importantly, the "policy" refers to the blueprint that maps perception of the environment to action. In other words, the policy plays the role of the controller in the traffic simulation.

In this work, the controller is an MLP policy with several hidden layers.

The parameters of the controller are updated iteratively so that the MLP policy maximizes the cumulative reward based on traffic data sampled from the SUMO simulator.

The main goal of the agent is to learn how to optimize a stochastic policy as follows:

$$\theta^{*} = \arg\max_{\theta}\, \eta(\pi_{\theta})$$

where $\eta(\pi_{\theta}) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right]$ is the expected cumulative discounted reward, computed from the discount factor ($\gamma$) and the reward ($r_t$).
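For a single sampled episode, the cumulative discounted reward inside the expectation can be computed as in the following minimal Python sketch. The function name `discounted_return` is an assumption made for illustration.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma**t * rewards[t] for one episode."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

Averaging this quantity over many episodes sampled under the current policy gives an estimate of the expected cumulative discounted reward that the agent maximizes.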

The longitudinal dynamic models are described as follows.

Basic vehicle dynamics can be defined by a car-following model, which describes the longitudinal dynamics of a human-driven vehicle based on observations of the vehicle itself and the vehicle ahead.

The standard car-following model is:

$$a_i = f(h_i, \dot{h}_i, v_i)$$

where $a_i$ is the acceleration of vehicle $i$, $f$ is a nonlinear function, and $v_i$, $\dot{h}_i$, and $h_i$ are the speed, relative speed, and headway of vehicle $i$, respectively.

In the present invention, the Intelligent Driver Model (IDM), a type of ACC system, is applied for longitudinal control of human-driven vehicles because of its ability to reproduce driver behavior.

IDM is a widely used car-following model.

For the IDM acceleration command, the vehicle speed, the identifier (ID) of the leading vehicle, and the headway to the leading vehicle in the non-signalized intersection environment can be obtained through "get" methods.

The acceleration of the vehicle is calculated as follows:

$$a_{\mathrm{IDM}} = \frac{dv}{dt} = a\left[1 - \left(\frac{v}{v_0}\right)^{\delta} - \left(\frac{s^{*}(v, \Delta v)}{s}\right)^{2}\right]$$

where $a_{\mathrm{IDM}}$ is the acceleration of the vehicle, $v_0$ is the desired speed, $\delta$ is the acceleration exponent, $s$ is the headway of the vehicle (the distance to the vehicle ahead), and $s^{*}(v, \Delta v)$ is the desired headway, expressed as:

$$s^{*}(v, \Delta v) = s_0 + \max\left(0,\; vT + \frac{v\,\Delta v}{2\sqrt{ab}}\right)$$

where $s_0$ is the minimum gap, $T$ is the time gap, $\Delta v$ is the speed difference relative to the leading vehicle (current speed minus leading speed), $a$ is the acceleration term, and $b$ is the comfortable deceleration.
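The IDM acceleration can be sketched in Python as below. The function name `idm_acceleration` and the default parameter values are placeholders chosen for illustration; they are not the values of Table 1, which is not reproduced here.

```python
import math


def idm_acceleration(v: float, lead_v: float, s: float,
                     v0: float = 30.0, T: float = 1.0, a: float = 1.0,
                     b: float = 1.5, delta: float = 4.0,
                     s0: float = 2.0) -> float:
    """IDM acceleration (m/s^2) for speed v, leader speed lead_v and headway s.

    The default parameters are illustrative assumptions, not Table 1 values.
    """
    if s <= 0.0:
        raise ValueError("headway s must be positive")
    dv = v - lead_v  # current speed minus leading speed
    # Desired headway: minimum gap plus a non-negative dynamic term.
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a * b)))
    return a * (1.0 - (v / v0) ** delta - (s_star / s) ** 2)


if __name__ == "__main__":
    # Free road from standstill: acceleration approaches the parameter a.
    print(idm_acceleration(v=0.0, lead_v=0.0, s=1e9))
    # Closing fast on a stopped leader 5 m ahead: strong braking.
    print(idm_acceleration(v=10.0, lead_v=0.0, s=5.0))
```

The free-road term $(v/v_0)^{\delta}$ limits the speed to the desired speed, while the interaction term $(s^{*}/s)^{2}$ dominates when the gap becomes small.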

Representative parameters of the IDM controller for urban traffic are shown in Table 1.

Policy optimization is described as follows.

Policy gradient methods compute an estimator of the gradient of a parameterized policy function and optimize the policy directly with a gradient-based algorithm, rather than deriving it from an action-value or state-value function.

They therefore avoid the convergence problems that nonlinear approximation and partial observation cause in value function estimation.

The present invention applies a multilayer perceptron (MLP) policy to directly optimize the control policy in the simulation of the non-signalized intersection. The policy gradient method, which is based on the expectation over the probability of the policy action ($\pi_\theta$) and an estimate of the advantage function ($\hat{A}_t$) at time step $t$, is expressed as follows.

$$\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$$

where $\hat{\mathbb{E}}_t$ is the expectation operator over a finite batch of samples, $\pi_\theta$ is the stochastic policy, $\hat{A}_t$ is the advantage estimate, defined by the discounted sum of rewards and a baseline estimate, and $a_t$ and $s_t$ are the action and state at time step $t$, respectively.
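The advantage estimate and the batch average described above can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the gradients of the log-probabilities are assumed to be supplied by the caller, and the advantage is taken as the discounted return minus a baseline value.

```python
import numpy as np


def discounted_returns(rewards, gamma=0.99):
    """Discounted sum of future rewards for every time step of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def policy_gradient_estimate(grad_log_probs, rewards, baselines, gamma=0.99):
    """Batch average of grad(log pi(a_t | s_t)) * advantage_t.

    grad_log_probs : array of shape (T, n_params), assumed precomputed
    rewards        : length-T sequence of rewards
    baselines      : length-T sequence of baseline (state-value) estimates
    """
    advantages = discounted_returns(rewards, gamma) - np.asarray(baselines, dtype=float)
    grads = np.asarray(grad_log_probs, dtype=float)
    return (grads * advantages[:, None]).mean(axis=0)
```

In a real training run the gradients would come from automatic differentiation of the policy network; they are passed in here only to keep the sketch self-contained.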

PPO, proposed by Schulman et al., is a simplified variant of TRPO and is provided by the RLlib library.

That is, PPO has the same goal as TRPO, which uses a trust-region constraint to force each policy update to keep the new policy from moving too far from the old policy.

There are two variants of PPO: one with an adaptive Kullback-Leibler (KL) penalty and one with a clipped objective.

PPO generates policy updates by adopting a surrogate loss function. This process prevents performance from degrading during training.

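The surrogate objective ($L^{CPI}$) is described as follows.

$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t\right] = \hat{\mathbb{E}}_t\left[r_t(\theta)\, \hat{A}_t\right]$$

where $\theta_{old}$ denotes the policy parameters before the update, $\theta$ denotes the policy parameters after the update, and $r_t(\theta)$ denotes the probability ratio.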
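The probability ratio and the surrogate objective can be illustrated with a short NumPy sketch. The function name and the use of log-probabilities as inputs are assumptions made for illustration; working in log space is a common way to form the ratio without dividing by small probabilities.

```python
import numpy as np


def surrogate_objective(logp_new, logp_old, advantages):
    """Batch mean of (probability ratio * advantage).

    logp_new   : log-probabilities of the taken actions under the updated policy
    logp_old   : log-probabilities of the same actions under the pre-update policy
    advantages : advantage estimates for the same time steps
    """
    # Ratio pi_new(a|s) / pi_old(a|s), computed as exp(logp_new - logp_old).
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    return float(np.mean(ratio * np.asarray(advantages, dtype=float)))
```

When the two policies are identical the ratio is 1 everywhere, so the objective reduces to the mean advantage of the batch.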

For continuous actions, the PPO policy outputs the parameters of a Gaussian distribution for each action.

The policy then generates continuous outputs based on these distributions.

In the present invention, PPO with an adaptive KL penalty is used, which optimizes the KL-penalized objective with minibatch stochastic gradient descent (SGD) as follows.

$$L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\left[r_t(\theta)\, \hat{A}_t - \beta\, \mathrm{KL}\left[\pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right]$$

where $\beta$ is a weight control coefficient that is updated after every policy update.

$\beta$ is increased when the current KL divergence is greater than the target KL divergence, and decreased when the current KL divergence is less than the target KL divergence.
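The KL-penalized objective and the adaptive coefficient update can be sketched as follows. The function names are hypothetical, and the doubling/halving factor and the 1.5 tolerance band are assumptions borrowed from the heuristic in the PPO paper by Schulman et al.; this description only states the direction of the adjustment.

```python
import numpy as np


def kl_penalty_objective(ratio, advantages, kl, beta):
    """Batch mean of (ratio * advantage - beta * KL) over sampled time steps."""
    ratio = np.asarray(ratio, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    kl = np.asarray(kl, dtype=float)
    return float(np.mean(ratio * advantages - beta * kl))


def update_beta(beta, current_kl, target_kl, factor=2.0, tolerance=1.5):
    """Adapt the penalty weight after a policy update.

    factor and tolerance are assumed heuristic values, not taken from the patent.
    """
    if current_kl > target_kl * tolerance:
        return beta * factor   # policy moved too far: penalise KL more strongly
    if current_kl < target_kl / tolerance:
        return beta / factor   # policy barely moved: relax the penalty
    return beta                # within the tolerance band: keep beta unchanged
```

The tolerance band keeps the coefficient from oscillating when the measured KL divergence is already close to the target.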

In the PPO algorithm, the current policy first interacts with the environment to generate a sequence of episodes. Next, the advantage function is estimated using a baseline estimate of the state value.

Finally, all experiences are collected and a gradient descent algorithm is run on the policy network. The complete PPO algorithm with the adaptive KL penalty is given as the pseudocode of Algorithm 1 in FIG. 2.

FIG. 3 is a block diagram of a device for reinforcement learning-based traffic improvement at a non-signalized intersection for group driving of autonomous vehicles according to the present invention.

The device for reinforcement learning-based traffic improvement at a non-signalized intersection for group driving of autonomous vehicles according to the present invention includes: a SUMO simulation execution unit 100 that builds a simulation environment using SUMO (Simulation of Urban MObility) and transfers data obtained from the speed, position, and sensors of the autonomous vehicles to a FLOW application unit 200; the FLOW application unit 200, which builds a simulation environment in the FLOW environment, a reinforcement learning platform that can be linked with SUMO, derives driving behavior to which reinforcement learning is not applied, controls the vehicles, updates the simulation state, and transfers state and reward information to a reinforcement learning library environment construction unit 300; and the reinforcement learning library environment construction unit 300, which uses the reinforcement learning platform FLOW that can be linked with SUMO to optimize traffic control through multi-agent deep reinforcement learning.

Here, the SUMO simulation execution unit 100 includes a SUMO simulation unit 10 that builds a simulation environment using SUMO and performs the SUMO simulation, and a result file generation unit 11 that generates exhaust gas, speed, and position value files and transfers the data obtained from the speed, position, and sensors of the autonomous vehicles to the FLOW application unit 200.

The FLOW application unit 200 includes: a simulation initialization unit 20 that defines the human drivers, sets the deep reinforcement learning input values, and configures the simulation environment, such as the speed, acceleration, and starting point of the vehicles; a FLOW environment construction unit 21 that builds the FLOW environment and transfers the state to the reinforcement learning library; a driving behavior derivation unit 22 that derives driving behavior to which reinforcement learning is not applied; a vehicle control module 23 that controls the vehicles and transfers the control information to the SUMO simulation execution unit 100; and an update unit 24 that receives the simulation state from the SUMO simulation execution unit 100, updates the simulation state, and transfers state and reward information to the reinforcement learning library environment construction unit 300.

The reinforcement learning library environment construction unit 300 includes: a reinforcement learning library 31 that receives the state from the FLOW application unit 200; a data sampling unit 32 that samples the data to be learned; a policy training unit 33 that trains the driving behavior (policy); a training result evaluation unit 34 that evaluates the training result and transfers the learned behavior (driving method) to the FLOW application unit 200; a policy optimization unit 35 that applies reinforcement learning together with a partially observable Markov decision process (POMDP), so that learning uses only the information within the range observable by the autonomous vehicle, in order to maximize the reinforcement learning reward for actions; a policy update storage unit 36 that receives the state of the autonomous vehicles from the FLOW application unit 200 and updates and stores the policy; and a learning loop condition determination unit 37 that determines whether the updated policy satisfies the learning loop condition.

FIG. 4 is an operation flowchart of a method for reinforcement learning-based traffic improvement at a non-signalized intersection for group driving of autonomous vehicles according to the present invention.

The method for reinforcement learning-based traffic improvement at a non-signalized intersection for group driving of autonomous vehicles according to the present invention includes: a SUMO simulation execution step of building a simulation environment using SUMO (Simulation of Urban MObility) and transferring data obtained from the speed, position, and sensors of the autonomous vehicles to the FLOW application unit 200; a FLOW application step of building a simulation environment in the FLOW environment, a reinforcement learning platform that can be linked with SUMO, deriving driving behavior to which reinforcement learning is not applied, controlling the vehicles, updating the simulation state, and transferring state and reward information to the reinforcement learning library environment construction unit 300; and a reinforcement learning library environment construction step of using the reinforcement learning platform FLOW that can be linked with SUMO to optimize traffic control through multi-agent deep reinforcement learning.

Here, the SUMO simulation execution step includes a SUMO simulation step (S409) of building a simulation environment using SUMO and performing the SUMO simulation, and a result file generation step (S410) of generating exhaust gas, speed, and position value files and transferring the data obtained from the speed, position, and sensors of the autonomous vehicles to the FLOW application unit 200.

The FLOW application step includes: a simulation initialization step (S401) of defining the human drivers, setting the deep reinforcement learning input values, and configuring the simulation environment, such as the speed, acceleration, and starting point of the vehicles; a FLOW environment construction step (S402) of building the FLOW environment and transferring the state to the reinforcement learning library; a driving behavior derivation step (S403) of deriving driving behavior to which reinforcement learning is not applied; a vehicle control step (S408) of controlling the vehicles and transferring the control information to the SUMO simulation execution unit 100; and an update step (S411) of receiving the simulation state from the SUMO simulation execution unit 100, updating the simulation state, and transferring state and reward information to the reinforcement learning library environment construction unit 300.

The reinforcement learning library environment construction step includes: a step (S404) in which the reinforcement learning library 31 receives the state from the FLOW application unit 200; a data sampling step (S405) of sampling the data to be learned; a policy training step (S406) of training the driving behavior (policy); a training result evaluation step (S407) of evaluating the training result and transferring the learned behavior (driving method) to the FLOW application unit 200; a policy optimization step (S412) of applying reinforcement learning together with a partially observable Markov decision process (POMDP), so that learning uses only the information within the range observable by the autonomous vehicle, in order to maximize the reinforcement learning reward for actions; a policy update storage step (S413) of receiving the state of the autonomous vehicles from the FLOW application unit 200 and updating and storing the policy; and a learning loop condition determination step (S414) of determining whether the updated policy satisfies the learning loop condition.

FIG. 5 is a diagram of a general SUMO simulator at a non-signalized intersection.

SUMO, developed by the Institute of Transportation Systems at the German Aerospace Center, is an open-source microscopic traffic simulator. SUMO can simulate city-scale traffic networks with traffic lights, vehicles, pedestrians, and public transport. In addition, TraCI allows SUMO to be connected to Python so that deep RL can be applied to the SUMO simulator.

A general SUMO simulator at a non-signalized intersection is shown in FIG. 5.

Flow, developed at UC Berkeley, provides an interface between deep RL algorithms and custom road networks. Flow can also analyze and validate trained policies.

The advantages of Flow include the ability to easily implement various road networks in order to improve the controllers of autonomous vehicles through deep RL. In Flow, a custom environment can be used to create the main subclasses for various scenarios, including the initialized simulation, observation space, state space, action space, controllers, and reward function.

The initialized simulation refers to the initial settings of the simulation environment at the start of an episode.

In the present invention, the position, speed, acceleration, starting point, trajectory, and number of vehicles are set, as well as the parameters of the IDM rule and the deep RL framework.

In particular, the trajectories of all vehicles are set by the SUMO simulator during the initial simulation process, and consist of specific nodes (the positions of points in the network), specific edges (which connect the nodes together), and specific routes (the sequences of edges that vehicles traverse).
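The relationship between nodes, edges, and routes can be illustrated with plain Python data structures for a four-way intersection with straight-through traffic only. This is a hypothetical layout: the arm length, the identifiers, and the dictionary format are assumptions for illustration and are not the actual SUMO or Flow network files.

```python
ARM_LENGTH = 200.0  # metres from each entry point to the intersection (assumed)

# Nodes: named point positions (x, y) in the network.
NODES = {
    "center": (0.0, 0.0),
    "north": (0.0, ARM_LENGTH),
    "south": (0.0, -ARM_LENGTH),
    "east": (ARM_LENGTH, 0.0),
    "west": (-ARM_LENGTH, 0.0),
}

# Edges: directed links between nodes, one inbound and one outbound per arm.
EDGES = {}
for arm in ("north", "south", "east", "west"):
    EDGES[f"{arm}_in"] = {"from": arm, "to": "center", "length": ARM_LENGTH}
    EDGES[f"{arm}_out"] = {"from": "center", "to": arm, "length": ARM_LENGTH}

# Routes: the sequence of edges a vehicle traverses (straight through only).
OPPOSITE = {"north": "south", "south": "north", "east": "west", "west": "east"}
ROUTES = {f"{arm}_in": [f"{arm}_in", f"{OPPOSITE[arm]}_out"] for arm in OPPOSITE}


def route_is_connected(route):
    """Check that each edge of a route ends where the next edge begins."""
    return all(EDGES[a]["to"] == EDGES[b]["from"] for a, b in zip(route, route[1:]))
```

Restricting every route to an inbound edge followed by the opposite outbound edge reflects the straight-through scenario used in the experiments, in which turns and lane changes are disregarded.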

Next, the acceleration of human-driven vehicles is controlled by the SUMO simulator, and the acceleration of AVs is controlled by the RLlib library.

FIG. 6 is a diagram showing an example of a general observation space.

The observation space represents the number and types of observable features, such as the AV speed (ego vehicle speed), the AV position (ego vehicle position), and the speeds and bumper-to-bumper headways of the corresponding preceding and following AVs.

The observable output is fed into the state space to predict an appropriate policy.

The state space represents a vector of the autonomous agent and its surrounding vehicles based on the observation space, including the position and speed of the AV as well as those of the preceding and following AVs.

The features within the environment are extracted using the get_state method and fed to the policy.

First, the IDs of all vehicles at the non-signalized intersection are obtained. The positions and speeds of all vehicles are then collected to generate the state space.

Importantly, the current position is measured relative to a predefined starting point.

The state space is defined as follows.

$$S = (x_0, v_0, v_l, v_f, d_l, d_f)$$

where $S$ is the state of a specific vehicle, $x_0$ is the corresponding coordinate of the AV, $v_0$, $v_l$, and $v_f$ are the corresponding speeds of the AV, the preceding AV, and the following AV, respectively, and $d_l$ and $d_f$ are the bumper-to-bumper headways of the preceding AV and the following AV, respectively.
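A minimal sketch of how such a state vector might be assembled for one AV is shown below. The function name, the dictionary keys, the ordering of the entries, and the 5 m vehicle length are assumptions for illustration; positions are taken as front-bumper distances from the predefined starting point along the same route.

```python
def build_state(ego, leader, follower, vehicle_length=5.0):
    """Assemble the state of one AV from its own and its neighbours' data.

    Each argument is a dict with 'pos' (m from the starting point, front bumper)
    and 'speed' (m/s). vehicle_length is an assumed common vehicle length.
    """
    # Bumper-to-bumper headway to the preceding vehicle:
    # its rear bumper (front position minus length) minus our front bumper.
    headway_lead = leader["pos"] - vehicle_length - ego["pos"]
    # Bumper-to-bumper headway to the following vehicle:
    # our rear bumper minus its front bumper.
    headway_follow = ego["pos"] - vehicle_length - follower["pos"]
    return [
        ego["pos"],         # position of the AV
        ego["speed"],       # speed of the AV
        leader["speed"],    # speed of the preceding vehicle
        follower["speed"],  # speed of the following vehicle
        headway_lead,
        headway_follow,
    ]
```

For example, an AV at 100 m travelling at 10 m/s, with a leader at 120 m and a follower at 85 m, has headways of 15 m and 10 m under the assumed 5 m vehicle length.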

The action space represents the actions of the autonomous agent in the traffic environment provided by OpenAI Gym.

The standard action of an automated vehicle is acceleration, and the actions in the action space range from maximum deceleration to maximum acceleration.

The apply_rl_actions function is applied to convert a specific command into an actual action in the SUMO simulator.

First, all AVs at the non-signalized intersection are identified. The action commands are then converted into accelerations using the base environment method.
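The mapping from a Gaussian policy output to a bounded acceleration command can be sketched as follows. The function name and signature are hypothetical; the default bounds of 3 m/s² mirror the simulation settings given later in this description, and clipping is one common way to enforce the range, assumed here for illustration.

```python
import numpy as np


def sample_accelerations(means, log_stds, max_accel=3.0, max_decel=3.0, rng=None):
    """Sample one acceleration per AV from a Gaussian and clip it to the action range.

    means, log_stds : per-vehicle Gaussian parameters produced by the policy
    max_accel       : largest allowed acceleration (m/s^2)
    max_decel       : largest allowed deceleration magnitude (m/s^2)
    rng             : optional numpy random Generator for reproducibility
    """
    rng = np.random.default_rng() if rng is None else rng
    means = np.asarray(means, dtype=float)
    stds = np.exp(np.asarray(log_stds, dtype=float))
    raw = rng.normal(means, stds)
    # Keep every command between maximum deceleration and maximum acceleration.
    return np.clip(raw, -abs(max_decel), max_accel)
```

Parameterising the spread by its logarithm keeps the standard deviation positive regardless of the raw network output.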

The controller governs the behavior of the actors, including human-driven vehicles and AVs. With shared control, a single controller can be applied to multiple actors. In the present invention, human-driven vehicles are controlled by the Flow framework, and automated vehicles are controlled by the RLlib library.

The reward function is described as follows.

To reduce traffic congestion, the average speed of the network should be optimized by reducing the delay time and the queue length. Average speed is therefore a promising metric for training deep RL policies in practice.

The reward function defines how the autonomous agent optimizes its policy.

The goal of the RL agent in the present invention is to achieve a high average speed while preventing collisions between vehicles at the non-signalized intersection.

In the present invention, the L2 norm is used to measure, as a positive distance, how far the given vehicle speeds at the non-signalized intersection are from the target speed (the desired speed of all vehicles at the non-signalized intersection).

Specifically, the get_speed method is applied to obtain the current speeds of all vehicles at the non-signalized intersection, and the average speed is then returned as the reward.

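The reward function is expressed as in Equation 11.

$$r = \frac{\max\left(\lVert v_{des} \cdot \mathbb{1}^{k} \rVert_2 - \lVert v_{des} - v \rVert_2,\; 0\right)}{\lVert v_{des} \cdot \mathbb{1}^{k} \rVert_2}$$

where $v_{des}$ denotes an arbitrary desired speed, $v$ denotes the speeds of all vehicles at the non-signalized intersection, and $\mathbb{1}^{k}$ is the all-ones vector with one entry per vehicle.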
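A minimal NumPy sketch of a desired-speed reward of this kind is given below. It assumes the normalised L2 form used by Flow's desired-velocity reward, in which the reward is 1 when every vehicle travels at the desired speed and falls to 0 as speeds deviate; the function name and arguments are hypothetical.

```python
import numpy as np


def desired_velocity_reward(speeds, v_des):
    """Normalised L2 reward in [0, 1] for a vector of vehicle speeds.

    speeds : current speeds (m/s) of all vehicles at the intersection
    v_des  : desired speed (m/s), assumed to be strictly positive
    """
    speeds = np.asarray(speeds, dtype=float)
    if speeds.size == 0:
        return 0.0  # no vehicles observed: nothing to reward
    # Largest possible deviation: every vehicle at standstill.
    max_cost = np.linalg.norm(np.full(speeds.shape, float(v_des)))
    # Actual deviation of the speed vector from the desired speed.
    cost = np.linalg.norm(speeds - v_des)
    return float(max(max_cost - cost, 0.0) / max_cost)
```

For instance, two vehicles travelling at 6 m/s with a desired speed of 12 m/s yield a reward of 0.5, halfway between standstill and the target.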

The simulation environment settings and the results obtained with the device and method for reinforcement learning-based traffic improvement at a non-signalized intersection for group driving of autonomous vehicles according to the present invention are described as follows.

FIG. 7 is a diagram showing the characteristics of the leading autonomous vehicle experiment at a non-signalized intersection, and FIG. 8 is a diagram comparing the experiments at a non-signalized intersection.

Table 2 shows an example of the simulation settings for the non-signalized intersection.

The simulation scenario is as follows.

In the present invention, vehicles crossing the non-signalized intersection followed the right-of-way rule provided by the SUMO simulator. The purpose of the right-of-way rule is to enforce traffic rules and prevent traffic collisions.

In addition, the positions of all vehicles were observed, which switched the environment from a POMDP to an MDP. Importantly, the autonomous agents learn to optimize a specific reward over rollouts using the RLlib library. The simulations represent the overall traffic flow under human driving and under mixed autonomy with RL agents.

The RL agent receives the updated state and obtains a new state at each 0.1-second time step, while the acceleration behavior of human-driven vehicles is controlled by the IDM model. In addition, continuous routing is applied to keep the vehicles within the network.

The simulation experiments were performed with a time step of 0.1 s, a lane width of 3.2 m, two lanes in each direction, a lane length of 420 m, a maximum acceleration of 3 m/s², a minimum acceleration of -3 m/s², a maximum speed of 12 m/s, a horizon of 600, and 200 iterations of the training process.

An inflow of 1,000 vehicles per hour was set for each direction, and the non-signalized intersection spanned the range from 200 m to 220 m.
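The settings listed above can be gathered into a small configuration object, which also makes a few derived quantities explicit. The class and field names are hypothetical, and treating the horizon of 600 as a number of simulation steps is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntersectionSettings:
    """Simulation settings quoted in the description (field names are illustrative)."""
    sim_step: float = 0.1              # s per simulation step
    lane_width: float = 3.2            # m
    lanes_per_direction: int = 2
    lane_length: float = 420.0         # m
    max_accel: float = 3.0             # m/s^2
    min_accel: float = -3.0            # m/s^2
    max_speed: float = 12.0            # m/s
    horizon: int = 600                 # assumed to be counted in simulation steps
    training_iterations: int = 200
    inflow_veh_per_hour: float = 1000.0  # per direction

    @property
    def episode_seconds(self):
        """Simulated duration of one rollout under the step-count assumption."""
        return self.horizon * self.sim_step

    @property
    def expected_arrivals_per_step(self):
        """Average number of vehicles entering per direction in one step."""
        return self.inflow_veh_per_hour / 3600.0 * self.sim_step
```

Under these assumptions one rollout covers 60 simulated seconds, and each direction receives on average about 0.028 vehicles per step.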

Although various scenarios should be simulated in the field, the present invention limits its focus to the effect of the leading autonomous vehicle at a non-signalized intersection.

A group of vehicles approaches the non-signalized intersection and drives straight along four different directions. Results are presented for AV penetration rates from 1% to 100% in increments of 10%, and lane changes and left turns are disregarded for all vehicles at the non-signalized intersection.

FIG. 7(a) shows the leading autonomous vehicle experiment environment at a non-signalized intersection in mixed traffic with AV penetration rates ranging from 10% to 90%, and FIG. 7(b) shows the experiment environment with a 100% AV penetration rate.

To demonstrate the superiority of the leading autonomous vehicle experiment, it was compared with other experiments, namely the leading human-driven vehicle experiment and the all human-driven vehicle experiment. FIG. 8 shows the comparison of the experiments at a non-signalized intersection.

FIG. 9 is a graph of the average reward curves over 200 iterations for different AV penetration rates.

The performance of the trained policy is as follows.

The RL training performance across AV penetration rates was used to evaluate the learning performance. FIG. 9 shows the average reward curves over 200 iterations for different AV penetration rates.

The flattening of the curves in all cases indicates that the trained policy had nearly converged. In addition, the average reward increased as the AV penetration rate at the non-signalized intersection increased, except at a 50% AV penetration rate. Full autonomy outperformed the other AV penetration rates, yielding the highest average reward and pronounced curve flattening. In particular, full autonomy improved the average reward by a factor of 6.8 compared with a 10% AV penetration rate.

Full autonomy therefore outperformed the other AV penetration rates in all cases, and the effect of the leading autonomous vehicle experiment at the non-signalized intersection became more evident as the AV penetration rate increased.

FIG. 10 is a graph of the spatio-temporal dynamics for different AV penetration rates at a non-signalized intersection.

The effect of the leading autonomous vehicle on smoothing the driving speed is as follows.

In FIG. 10, the points are color-coded according to speed. Points close to the top indicate smooth traffic, whereas points close to the bottom indicate congested traffic.

At low AV penetration rates, the stop-and-go waves in the behavior of human-driven vehicles caused perturbations that reduced the speed in the non-signalized intersection area (the range from 200 m to 220 m). As shown in FIG. 10, at low AV penetration rates almost all points at the non-signalized intersection are close to the bottom.

This is because human-driven vehicles approach the non-signalized intersection area at the same time and slow down according to the right-of-way rule. At high AV penetration rates, the points are close to the top and the AVs slow down for a shorter time, so stop-and-go waves at the non-signalized intersection become increasingly rare.

Full autonomy achieved the highest smoothed driving speed of all AV penetration rates. Traffic congestion was thus partially relieved, and the traffic flow became smoother as the AV penetration rate increased.

FIG. 11 is a graph showing the average speed, average delay time, average fuel consumption, and average emissions obtained in the SUMO simulation environment.

FIG. 11 shows the measure of effectiveness (MOE) evaluation in terms of average speed, delay time, fuel consumption, and emissions for different AV penetration rates. The MOE evaluation results indicate that the simulation became more effective as the AV penetration rate increased.

Regarding mobility, the average speed gradually increased and the delay time gradually decreased as the AV penetration rate increased.

As shown in FIG. 11(a) and (b), full autonomy improved the average speed by a factor of 1.19 and the delay time by a factor of 1.76 compared with a 10% AV penetration rate. Regarding energy efficiency, fuel consumption and emissions decreased slightly as the AV penetration rate increased.

As shown in FIG. 11(c) and (d), full autonomy improved fuel consumption by a factor of 1.05 and emissions by a factor of 1.22 compared with a 10% AV penetration rate.

It can therefore be confirmed that the leading autonomous vehicle is more effective in terms of mobility and energy efficiency as the AV penetration rate increases.

The device and method for reinforcement learning-based traffic improvement at a non-signalized intersection for group driving of autonomous vehicles according to the present invention, as described above, improve traffic at a non-signalized intersection and secure safety through group-driving learning of autonomous vehicles in a mixed traffic situation in which a group of autonomous vehicles and human-driven vehicles coexist.

The present invention applies reinforcement learning together with a partially observable Markov decision process (POMDP), so that learning uses only the information within the range observable by the autonomous vehicle, as in a real situation, in order to maximize the reinforcement learning reward for actions.

As described above, it will be understood that the present invention may be implemented in modified forms without departing from its essential characteristics.

The disclosed embodiments should therefore be considered in an illustrative rather than a restrictive sense. The scope of the present invention is indicated by the claims rather than by the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

100. SUMO simulation execution unit
200. FLOW application unit
300. Reinforcement learning library environment building unit

Claims

An apparatus for reinforcement learning-based traffic improvement at non-signalized intersections for group driving of autonomous vehicles, comprising:
a SUMO simulation execution unit that builds a simulation environment using SUMO (Simulation of Urban MObility) and transfers data obtained from the speed, position, and sensors of an autonomous vehicle to a FLOW application unit;
the FLOW application unit, which builds a simulation environment in FLOW, a reinforcement learning platform that can be linked with SUMO, derives driving behavior to which reinforcement learning is not applied, controls vehicles, updates the simulation state, and transfers state and reward information to a reinforcement learning library environment building unit; and
the reinforcement learning library environment building unit, which optimizes traffic control through multi-agent deep reinforcement learning using FLOW, the reinforcement learning platform that can be linked with SUMO.

The apparatus of claim 1, wherein the SUMO simulation execution unit comprises:
a SUMO simulation unit that builds a simulation environment using SUMO (Simulation of Urban MObility) and performs the SUMO simulation; and
a result file generation unit that generates exhaust emission, speed, and position value files and transfers the data obtained from the speed, position, and sensors of the autonomous vehicle to the FLOW application unit.

The apparatus of claim 1, wherein, in a situation where groups of vehicles approach a non-signalized intersection and drive straight along four different directions, the SUMO simulation execution unit
presents results for AV penetration rates from 1% to 100% in units of 10%, and disregards lane changes and left turns for all vehicles at the non-signalized intersection.
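By way of illustration only, the penetration-rate sweep in this claim can be pictured as splitting a fixed fleet into autonomous and human-driven vehicles. The helper name `split_fleet`, the fleet size of 20, and the reading of "1% to 100% in units of 10%" as the rates 10%, 20%, ..., 100% are assumptions made for this sketch and are not details stated in the claim.

```python
def split_fleet(total_vehicles, penetration_pct):
    """Split a fleet into (autonomous, human-driven) vehicle counts.

    Hypothetical helper: the AV count is rounded half-up to a whole vehicle.
    """
    n_av = (total_vehicles * penetration_pct + 50) // 100
    return n_av, total_vehicles - n_av


# Assumed sweep: 10%, 20%, ..., 100% AV penetration for a 20-vehicle fleet.
sweep = {pct: split_fleet(20, pct) for pct in range(10, 101, 10)}
```

Each entry of `sweep` would correspond to one simulation scenario whose results are reported separately.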

The apparatus of claim 1, wherein the FLOW application unit comprises:
a simulation initialization unit that defines human drivers, sets the input values for deep reinforcement learning, and configures the simulation environment, including vehicle speed, acceleration, and starting point;
a FLOW environment building unit that builds the FLOW environment and transfers the state to the reinforcement learning library;
a driving behavior derivation unit that derives driving behavior to which reinforcement learning is not applied;
a vehicle control module that controls the vehicles and transfers control information to the SUMO simulation execution unit; and
an update unit that receives the simulation state from the SUMO simulation execution unit, updates the simulation state, and transfers the state and reward information to the reinforcement learning library environment building unit.
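The data flow among the units of this claim can be sketched, for illustration only, as a toy control loop. Nothing below uses the real SUMO or FLOW APIs: `ToySimulator`, `human_accel`, `flow_step`, `constant_policy`, the one-dimensional lane, the target-speed driving rule, and the mean-speed reward are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class Vehicle:
    position: float  # metres along the approach lane
    speed: float     # metres per second
    is_av: bool      # True if driven by the learned policy


class ToySimulator:
    """Stand-in for the SUMO simulation execution unit (not the SUMO API)."""

    def __init__(self, vehicles, dt=0.1):
        self.vehicles = vehicles
        self.dt = dt

    def step(self, accels):
        # Apply one acceleration per vehicle and advance time by dt.
        for veh, accel in zip(self.vehicles, accels):
            veh.speed = max(0.0, veh.speed + accel * self.dt)
            veh.position += veh.speed * self.dt
        return self.vehicles


def human_accel(veh, target_speed=10.0, gain=0.5):
    # Driving behavior without reinforcement learning (assumed simple rule).
    return gain * (target_speed - veh.speed)


def flow_step(sim, policy):
    """Stand-in for the FLOW application unit: control, update, report."""
    accels = [policy(v) if v.is_av else human_accel(v) for v in sim.vehicles]
    vehicles = sim.step(accels)                       # vehicle control
    state = [(v.position, v.speed) for v in vehicles]  # updated state
    reward = sum(v.speed for v in vehicles) / len(vehicles)  # mean speed
    return state, reward


def constant_policy(veh):
    # Placeholder for the policy supplied by the reinforcement learning unit.
    return 1.0


sim = ToySimulator([Vehicle(0.0, 5.0, True), Vehicle(-20.0, 5.0, False)])
state, reward = flow_step(sim, constant_policy)
```

In this sketch the `state` and `reward` returned by `flow_step` play the role of the information handed to the reinforcement learning library environment building unit.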

The apparatus of claim 1, wherein the reinforcement learning library environment building unit comprises:
a reinforcement learning library that receives the state from the FLOW application unit;
a data sampling unit that samples the data to be learned;
a policy training unit that trains the driving behavior (policy);
a training result evaluation unit that evaluates the training result and transfers the learned behavior to the FLOW application unit; and
a policy optimization unit that applies reinforcement learning and a partially observable Markov decision process (POMDP), so that learning uses only information within the range the autonomous vehicle can observe and the reinforcement learning reward for actions is maximized.
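The partial observability referred to in this claim can be illustrated with a minimal sketch in which an autonomous vehicle only sees neighbors inside its sensing range. The function name `local_observation`, the 50 m range, and the choice of (distance, relative speed) as the observed features are assumptions for the example; the claim does not specify them.

```python
import math


def local_observation(ego, others, sensing_range=50.0):
    """Return (distance, relative speed) pairs for vehicles the ego can sense.

    Each vehicle is an (x, y, speed) tuple; vehicles beyond sensing_range
    are hidden from the ego vehicle, which is what makes the problem a POMDP.
    """
    ex, ey, ev = ego
    visible = []
    for x, y, v in others:
        dist = math.hypot(x - ex, y - ey)
        if dist <= sensing_range:
            visible.append((dist, v - ev))
    return sorted(visible)  # nearest vehicle first


ego = (0.0, 0.0, 8.0)
others = [(30.0, 0.0, 6.0), (0.0, 120.0, 9.0), (-10.0, 0.0, 8.0)]
obs = local_observation(ego, others)
```

Here the vehicle 120 m away is omitted from `obs`, so the policy would act on an incomplete view of the intersection.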

The apparatus of claim 5, wherein the reinforcement learning library environment building unit further comprises:
a policy update storage unit that receives the state of the autonomous vehicle from the FLOW application unit and updates and stores the policy; and
a learning loop condition determination unit that determines whether the updated policy satisfies the learning loop condition.

The apparatus of claim 5, wherein the reinforcement learning library environment building unit
applies, for policy optimization, policy gradient methods that compute an estimator of a parameterized policy function using a gradient descent algorithm rather than an action-value or state-value function,
thereby avoiding the convergence problems that arise in estimated functions due to nonlinear approximation and partial observation.

The apparatus of claim 7, wherein the reinforcement learning library environment building unit
applies a multilayer perceptron (MLP) policy to directly optimize the control policy in the simulation of the non-signalized intersection, and
the policy gradient method, which is based on the expectation over the probability of the policy action ($\pi_\theta$) and an estimate of the advantage function at time step $t$ ($\hat{A}_t$), is defined as

$$\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right],$$

where $\hat{\mathbb{E}}_t$ is the expectation operator over a finite batch of samples, $\pi_\theta$ is a stochastic policy, $\hat{A}_t$ is defined by the discounted sum of rewards and a baseline estimate, and $a_t$ and $s_t$ are the action and state at time step $t$.
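As a minimal sketch of the policy gradient estimator described in this claim, the following example averages the product of the score function and the advantage estimate over a small batch. It assumes, purely for illustration, a one-parameter Gaussian policy with mean `theta * s` and fixed standard deviation in place of the MLP policy, and takes the advantage as the discounted return minus a supplied baseline; the function names are hypothetical.

```python
def discounted_returns(rewards, gamma=0.99):
    """Discounted sum of future rewards at every time step."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def policy_gradient(states, actions, rewards, baselines,
                    theta, sigma=1.0, gamma=0.99):
    """Batch average of grad_theta log pi(a_t | s_t) * A_t.

    Assumed policy: a_t ~ Normal(theta * s_t, sigma), so that
    grad_theta log pi = (a_t - theta * s_t) * s_t / sigma**2.
    """
    returns = discounted_returns(rewards, gamma)
    terms = []
    for s, a, g, b in zip(states, actions, returns, baselines):
        advantage = g - b  # discounted return minus baseline estimate
        grad_log_pi = (a - theta * s) * s / sigma ** 2
        terms.append(grad_log_pi * advantage)
    return sum(terms) / len(terms)


g_hat = policy_gradient(states=[1.0, 2.0], actions=[0.5, 1.0],
                        rewards=[1.0, 1.0], baselines=[0.0, 0.0],
                        theta=0.0, gamma=1.0)
```

A training step would then adjust `theta` in the direction indicated by `g_hat`; with an MLP policy the same quantity is normally obtained by automatic differentiation.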

The apparatus of claim 5, wherein the reinforcement learning library environment building unit
applies proximal policy optimization (PPO), which generates policy updates by adopting a surrogate loss function in order to prevent performance degradation during training, and
the surrogate objective ($L^{CPI}$) is defined as

$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\,\hat{A}_t\right] = \hat{\mathbb{E}}_t\left[r_t(\theta)\,\hat{A}_t\right],$$

where $\theta_{old}$ is the policy parameter before the update, $\theta$ is the policy parameter after the update, and $r_t(\theta)$ is the probability ratio.
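For illustration only, the surrogate objective of this claim can be evaluated from stored log-probabilities as the batch average of the probability ratio times the advantage estimate. The function name `surrogate_objective` and the choice to pass log-probabilities rather than probabilities are assumptions of this sketch.

```python
import math


def surrogate_objective(logp_new, logp_old, advantages):
    """Batch average of r_t * A_t, where r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t).

    The ratio is computed as exp(logp_new - logp_old) for numerical stability.
    """
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    return sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)


# Before any update the two policies coincide, so every ratio equals 1
# and the objective reduces to the mean advantage.
logp_old = [-1.0, -2.0]
logp_new = [-1.0, -2.0]
objective = surrogate_objective(logp_new, logp_old, [2.0, 4.0])
```

As the updated policy makes an action with positive advantage more likely, its ratio rises above 1 and the objective increases accordingly.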

The apparatus of claim 9, wherein, for continuous actions, the policy output of PPO is the parameters of a Gaussian distribution for each action, and
PPO with an adaptive KL penalty is used to optimize the KL-penalized objective using minibatch stochastic gradient descent (SGD),

$$L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\,\hat{A}_t - \beta\,\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot \mid s_t),\,\pi_\theta(\cdot \mid s_t)\right]\right],$$

where $\beta$ is a weight control coefficient that is updated after every policy update.
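The KL-penalized objective of this claim can be sketched, under stated assumptions, for a one-dimensional Gaussian policy. The closed-form KL divergence between two univariate Gaussians is a standard result; the function names and the scalar action space are hypothetical choices for the example.

```python
import math


def gaussian_kl(mu_old, std_old, mu_new, std_new):
    """KL[N(mu_old, std_old) || N(mu_new, std_new)] for scalar Gaussians."""
    return (math.log(std_new / std_old)
            + (std_old ** 2 + (mu_old - mu_new) ** 2) / (2 * std_new ** 2)
            - 0.5)


def kl_penalized_objective(ratios, advantages, kls, beta):
    """Batch average of r_t * A_t - beta * KL_t."""
    terms = [r * a - beta * kl for r, a, kl in zip(ratios, advantages, kls)]
    return sum(terms) / len(terms)


# Two samples whose new policy mean has shifted by one standard deviation.
kl = gaussian_kl(0.0, 1.0, 1.0, 1.0)
value = kl_penalized_objective([1.0, 1.0], [2.0, 4.0], [kl, kl], beta=2.0)
```

A larger weight control coefficient `beta` penalizes the same policy shift more heavily, which keeps each update close to the previous policy.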

The apparatus of claim 10, wherein the weight control coefficient is increased when the current KL divergence is greater than the target KL divergence and decreased when the current KL divergence is less than the target KL divergence, and
in the PPO algorithm, the current policy first interacts with the environment to generate a sequence of episodes, the advantage function is estimated using a baseline estimate of the state value, all experience is collected, and a gradient descent algorithm is run on the policy network.
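A minimal sketch of the coefficient adjustment described in this claim follows. The claim states only the direction of the change; the tolerance band of 1.5 and the scaling factor of 2 are assumptions borrowed from the heuristic commonly used with adaptive-KL PPO, and `update_beta` is a hypothetical name. Setting `tolerance=1.0` gives the literal rule of the claim.

```python
def update_beta(beta, kl, kl_target, tolerance=1.5, factor=2.0):
    """Adapt the KL penalty weight after a policy update.

    Increase beta when the measured KL divergence overshoots the target,
    decrease it when the divergence undershoots, otherwise leave it as is.
    """
    if kl > kl_target * tolerance:
        return beta * factor
    if kl < kl_target / tolerance:
        return beta / factor
    return beta


beta = 1.0
beta = update_beta(beta, kl=0.05, kl_target=0.01)  # policy moved too far
```

After this call `beta` is doubled, so the next update is penalized more strongly for diverging from the previous policy.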

A method for reinforcement learning-based traffic improvement at non-signalized intersections for group driving of autonomous vehicles, comprising:
a SUMO simulation execution step of building a simulation environment using SUMO (Simulation of Urban MObility) and transferring data obtained from the speed, position, and sensors of an autonomous vehicle to a FLOW application unit;
a FLOW application step of building a simulation environment in FLOW, a reinforcement learning platform that can be linked with SUMO, deriving driving behavior to which reinforcement learning is not applied, controlling vehicles, updating the simulation state, and transferring state and reward information to a reinforcement learning library environment building unit; and
a reinforcement learning library environment building step of optimizing traffic control through multi-agent deep reinforcement learning using FLOW, the reinforcement learning platform that can be linked with SUMO.

The method of claim 12, wherein the SUMO simulation execution step comprises:
a SUMO simulation step of building a simulation environment using SUMO (Simulation of Urban MObility) and performing the SUMO simulation; and
a result file generation step of generating exhaust emission, speed, and position value files and transferring the data obtained from the speed, position, and sensors of the autonomous vehicle to the FLOW application unit.

The method of claim 12, wherein the FLOW application step comprises:
a simulation initialization step of defining human drivers, setting the input values for deep reinforcement learning, and configuring the simulation environment, such as vehicle speed, acceleration, and starting point;
a FLOW environment building step of building the FLOW environment and transferring the state to the reinforcement learning library;
a driving behavior derivation step of deriving driving behavior to which reinforcement learning is not applied;
a vehicle control step of controlling the vehicles and transferring control information to the SUMO simulation execution unit; and
an update step of receiving the simulation state from the SUMO simulation execution unit, updating the simulation state, and transferring the state and reward information to the reinforcement learning library environment building unit.

The method of claim 12, wherein the reinforcement learning library environment building step comprises:
a step in which the reinforcement learning library receives the state from the FLOW application unit;
a data sampling step of sampling the data to be learned;
a policy training step of training the driving behavior (policy);
a training result evaluation step of evaluating the training result and transferring the learned behavior to the FLOW application unit;
a policy optimization step of applying reinforcement learning and a partially observable Markov decision process (POMDP), so that learning uses only information within the range the autonomous vehicle can observe and the reinforcement learning reward for actions is maximized;
a policy update storage step of receiving the state of the autonomous vehicle from the FLOW application unit and updating and storing the policy; and
a learning loop condition determination step of determining whether the updated policy satisfies the learning loop condition.