KR102380573B1

KR102380573B1 - Object detection apparatus and method to minimize end-to-end delay and advanced driver assistance device using the same

Info

Publication number: KR102380573B1
Application number: KR1020200148121A
Authority: KR
Inventors: 김종찬; 장원석
Original assignee: 국민대학교산학협력단
Priority date: 2020-11-06
Filing date: 2020-11-06
Publication date: 2022-03-31
Also published as: WO2022097829A1

Abstract

The present invention relates to an object detection apparatus and method for minimizing end-to-end delay, and to an advanced driver assistance device using the same. The object detection apparatus comprises: a first thread processing unit that fetches a camera image during the first time interval of every cycle; a second thread processing unit that performs inference on the fetched camera image during a second time interval immediately adjacent to the first time interval of every cycle; and a third thread processing unit that detects, during the first time interval of every cycle, whether an inferred camera image exists and, if so, displays the inferred camera image.

Description

OBJECT DETECTION APPARATUS AND METHOD TO MINIMIZE END-TO-END DELAY AND ADVANCED DRIVER ASSISTANCE DEVICE USING THE SAME

The present invention relates to real-time object detection technology and, more particularly, to an object detection apparatus and method for minimizing end-to-end delay, which reduce the time from the appearance of an object until the object is recognized by an object detector, and to an advanced driver assistance device using the same.

To realize safe autonomous driving, a vehicle's camera-based object detection system must detect dangerous obstacles on the road as quickly as possible to reduce the risk of collision. The delay between the appearance of an object and its detection therefore needs to be thoroughly analyzed and optimized.

Because most recent object detectors are based on deep neural networks (DNNs), much research has focused on reducing DNN inference delay by developing lightweight neural networks or hardware acceleration techniques.

However, relatively little attention has been paid to analyzing and optimizing the end-to-end delay of a real-time object detection system, which includes not only the inference delay but also other pre- and post-processing delays such as image resizing and result display.

Korean Patent Publication No. 10-2009-0062881 (June 17, 2009)

An embodiment of the present invention aims to provide an object detection apparatus and method for minimizing end-to-end delay, which can reduce the time from the appearance of an object until the object is recognized by an object detector, and an advanced driver assistance device using the same.

An embodiment of the present invention aims to provide an object detection apparatus and method for minimizing end-to-end delay, which can optimize the end-to-end delay in a total of three steps based on an analysis of the system's internal structure, and an advanced driver assistance device using the same.

Among the embodiments, an object detection apparatus for minimizing end-to-end delay includes: a first thread processing unit that fetches a camera image during the first time interval of every cycle; a second thread processing unit that performs inference on the fetched camera image during a second time interval immediately adjacent to the first time interval of every cycle; and a third thread processing unit that detects, during the first time interval of every cycle, whether an inferred camera image exists and, if so, displays the inferred camera image.

The first thread processing unit may perform on-demand capture, receiving the camera image directly from the camera during the first time interval without a separate memory buffer for storing the camera image.

The first thread processing unit may run on a central processing unit (CPU) and may receive the camera image by accessing the camera directly during the first time interval.

The first thread processing unit may receive the camera image after the start of each cycle, which is determined by the normal operation of the camera, and store it in an image buffer that is free of contention between threads.

As soon as the first time interval ends, the second thread processing unit may perform image processing on the fetched camera image at the start of the second time interval.

The second thread processing unit may run on a graphics processing unit (GPU) and, during the second time interval, may perform object recognition on the fetched camera image without contending with the first thread processing unit for the contention-free image buffer.

The second thread processing unit may complete the object recognition, using a neural-network-based stage detector, by a deadline corresponding to the start of the first time interval of the next cycle.

The third thread processing unit may display the inferred camera image during the first time interval of every cycle, in synchronization with the time interval in which the camera image is fetched.

The third thread processing unit may run on a CPU and may highlight objects in the inferred camera image with an overlay.

Before the display, the third thread processing unit may analyze the object based on metadata about the object and highlight it differently according to the type of the object.

The apparatus may further include a control unit that synchronizes the cycles and provides the start point of each of the first and second time intervals in every cycle.
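The three-thread cycle described above can be sketched as a deterministic schedule. This is a minimal illustration, not the claimed implementation: the function name `schedule` and the frame numbering are assumptions, and real threads, cameras, and GPUs are not modeled. It only shows which frame each unit handles in each cycle when the fetch and the display share the first time interval and inference occupies the second.

```python
from typing import List, Optional, Tuple

# One row per cycle: (cycle, fetched_frame, inferred_frame, displayed_frame)
Row = Tuple[int, int, int, Optional[int]]

def schedule(num_cycles: int) -> List[Row]:
    """Hypothetical model of the three thread processing units per cycle."""
    rows: List[Row] = []
    pending_result: Optional[int] = None  # result left by the previous cycle
    for cycle in range(num_cycles):
        # First time interval: the third unit displays an inferred image
        # only if one exists; the first unit fetches a new frame.
        displayed = pending_result
        fetched = cycle
        # Second time interval: the second unit infers the frame just fetched.
        inferred = fetched
        pending_result = inferred
        rows.append((cycle, fetched, inferred, displayed))
    return rows

for row in schedule(3):
    print(row)
```

In this model a frame fetched in cycle k is displayed in the first interval of cycle k+1, and nothing is displayed in the very first cycle because no inferred image exists yet.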

Among the embodiments, an advanced driver assistance device includes: a first thread processing unit that fetches a camera image during the first time interval of every cycle; a second thread processing unit that performs inference on the fetched camera image during a second time interval immediately adjacent to the first time interval of every cycle; a third thread processing unit that detects, during the first time interval of every cycle, whether an inferred camera image exists and, if so, displays the inferred camera image; and a control unit that determines the urgency of the vehicle while driving based on the inferred camera image and controls the vehicle accordingly.

When an error occurs in the start point of the first or second time interval in a cycle, the control unit may skip inference on the fetched camera image, resynchronize the cycles, and reset the start point of the first time interval.

The disclosed technology may have the following effects. However, this does not mean that a specific embodiment must include all of the following effects or only the following effects, so the scope of rights of the disclosed technology should not be understood as being limited thereby.

The object detection apparatus and method for minimizing end-to-end delay according to an embodiment of the present invention, and the advanced driver assistance device using the same, can reduce the time from the appearance of an object until the object is recognized by the object detector.

The object detection apparatus and method for minimizing end-to-end delay according to an embodiment of the present invention, and the advanced driver assistance device using the same, can optimize the end-to-end delay in a total of three steps based on an analysis of the system's internal structure.

FIG. 1 is a diagram illustrating the hardware-software architecture of an object detection system.
FIG. 2 is a diagram illustrating the overall flow of the object detection system and the associated delays.
FIG. 3 is a diagram illustrating USB bandwidth reservation.
FIG. 4 is a diagram illustrating the transfer delay.
FIG. 5 is a diagram illustrating a multi-threaded pipeline architecture.
FIGS. 6 and 7 are diagrams illustrating the end-to-end delay optimization method according to the present invention.
FIG. 8 is a diagram comparing execution times with and without contention for shared memory bandwidth.

The description of the present invention is merely an embodiment for structural or functional explanation, so the scope of the present invention should not be construed as being limited by the embodiments described herein. That is, because the embodiments can be modified in various ways and can take various forms, the scope of the present invention should be understood to include equivalents capable of realizing the technical idea. In addition, the objects or effects presented in the present invention do not mean that a specific embodiment must include all of them or only such effects, so the scope of the present invention should not be understood as being limited thereby.

Meanwhile, the terms used in the present application should be understood as follows.

Terms such as "first" and "second" are used to distinguish one component from another, and the scope of rights should not be limited by these terms. For example, a first component may be termed a second component, and similarly, a second component may be termed a first component.

When a component is referred to as being "connected" to another component, it may be directly connected to the other component, but it should be understood that other components may exist in between. In contrast, when a component is referred to as being "directly connected" to another component, it should be understood that no other component exists in between. Other expressions describing the relationship between components, such as "between" and "immediately between" or "neighboring" and "directly neighboring", should be interpreted in the same way.

A singular expression should be understood to include the plural unless the context clearly indicates otherwise. Terms such as "include" or "have" are intended to specify the presence of a stated feature, number, step, operation, component, part, or combination thereof, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Identification codes for steps (for example, a, b, c) are used for convenience of description and do not describe the order of the steps. Unless a specific order is clearly stated in context, the steps may occur in an order different from the one given. That is, the steps may occur in the stated order, may be performed substantially simultaneously, or may be performed in the reverse order.

The present invention can be implemented as computer-readable code on a computer-readable recording medium, and the computer-readable recording medium includes all types of recording devices that store data readable by a computer system. Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. The computer-readable recording medium may also be distributed over network-connected computer systems so that the computer-readable code is stored and executed in a distributed manner.

Unless otherwise defined, all terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and should not be interpreted as having ideal or excessively formal meanings unless explicitly so defined in the present application.

FIG. 1 is a diagram illustrating the hardware-software architecture of an object detection system.

Referring to FIG. 1, the object detection system 100 may be implemented as a hardware-software architecture including a camera subsystem 110 and a detector subsystem 130.

The camera subsystem 110 may include a single camera connected directly to the detector subsystem 130 through a USB (Universal Serial Bus) connector. The detector subsystem 130 may be implemented as a heterogeneous CPU-GPU computing platform.

Here, the Linux operating system may be used to run the Darknet neural network framework, which is built on the Pthread library and the CUDA framework for managing CPU threads and GPU kernels, respectively.

The Darknet framework may use pre-trained YOLO v2, v3, and v4 DNNs. In addition, a V4L2 (Video for Linux Two) based USB camera driver may be used as the interface between the two subsystems.

To visualize the object detection results, the detector subsystem 130 may use the OpenCV library to draw bounding boxes around detected objects.

FIG. 2 is a diagram illustrating the overall flow of the object detection system and the associated delays.

Referring to FIG. 2, the object detection system 100 may include the camera subsystem 110, the detector subsystem 130, and a V4L2 camera driver queue (hereinafter, the queue) for buffering images, together with their associated delays, and the end-to-end flow of data between these components can be seen.

The object detection system 100 can be modeled as a single-server queue of size Q. The model uses the distribution A of image arrival intervals at the queue and the distribution S of object detection service intervals. Here, the service interval is defined as the time between two consecutive requests to retrieve an image from the queue (each request marking the start of an object detection cycle).

Delay in the object detection system 100 arises from six components: (i) capture, (ii) transfer, (iii) queue, (iv) fetch, (v) inference, and (vi) display. These components determine the end-to-end delay of the object detection system 100. The delay components can be defined independently, but some of them are correlated and jointly affect the end-to-end delay.

- Camera Subsystem Analysis

Images are captured by the camera at a frame rate F, so the camera's cycle time C between captures is 1/F. A newly appearing object may therefore wait up to C until the next capture instant. However, because the object is also affected by the service interval of the detector subsystem 130, a longer delay may occur before it is captured; this capture delay is denoted d_capt.

The amount of data in an image frame is determined by the resolution (X×Y) and the pixel format. Uncompressed pixel formats such as YUYV and RGBA have a fixed number of bits per pixel (bpp) P. In contrast, with compressed pixel formats such as H264 and MJPEG the frame size varies with the image content. The description here assumes an uncompressed pixel format. That is, assuming a fixed bpp, the size I of an image frame is I = (X·Y·P)/8 bytes.
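As a worked instance of I = (X·Y·P)/8, the following minimal sketch computes the frame size for an uncompressed format. The function name and the 640×480 resolution are illustrative assumptions; P = 16 for YUYV follows the text.

```python
def frame_size_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Size I of one uncompressed frame in bytes: I = (X * Y * P) / 8."""
    total_bits = width * height * bits_per_pixel
    if total_bits % 8 != 0:
        raise ValueError("frame does not occupy a whole number of bytes")
    return total_bits // 8

# Illustrative values only: a 640x480 frame in YUYV (P = 16 bits per pixel).
print(frame_size_bytes(640, 480, 16))  # 614400
```

The formula only holds for fixed-bpp formats; for compressed formats such as MJPEG the size depends on the image content and cannot be computed this way.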

Each captured image may be transferred to the detector subsystem 130 over the USB connection in isochronous transfer mode, which provides a known latency through bandwidth reservation. The camera driver negotiates with the camera to determine the available bandwidth. The bandwidth can be expressed as the number of bytes B available during each USB microframe of length U = 125 μs. The USB driver uses a buffer to group M microframes into one USB Request Block (URB), which is then processed as a unit. FIG. 3 shows an example of bandwidth reservation measured on the experimental platform using the YUYV pixel format with P = 16.

Considering the transfer delay d_tran, the time between image capture and entry into the queue, the required number of microframes can be computed from the given I, B, and U, and the delay can be expressed as Equation 1 below.

[Equation 1]

Here, j is the jitter caused by (i) waiting for the start of the next microframe after capture (worst case: U) and (ii) waiting for the current URB to be grouped (worst case: (M-1)·U). The latter is referred to as URB buffering jitter. Consequently, j takes a value in the range 0 ≤ j < M·U. The constant 2 represents additional microframes that carry protocol overhead.
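The body of Equation 1 does not appear in this text. As a minimal sketch, the code below assumes the form d_tran = (⌈I/B⌉ + 2)·U + j with 0 ≤ j < M·U, which is inferred from the definitions of I, B, U, j, and the constant 2 and is not copied from the patent. The function name and the example value B = 3072 bytes per microframe are likewise assumptions.

```python
import math
from typing import Tuple

U_US = 125  # length of one USB microframe in microseconds (from the text)

def transfer_delay_bounds_us(frame_bytes: int,
                             bytes_per_microframe: int,
                             microframes_per_urb: int) -> Tuple[int, int]:
    """Return (jitter-free delay, exclusive upper bound) in microseconds.

    Assumed form, inferred from the surrounding definitions:
        d_tran = (ceil(I / B) + 2) * U + j,   0 <= j < M * U
    """
    n_microframes = math.ceil(frame_bytes / bytes_per_microframe) + 2
    base = n_microframes * U_US
    max_jitter = microframes_per_urb * U_US  # j stays strictly below this
    return base, base + max_jitter

# Illustrative values only: I = 614400 bytes, B = 3072 bytes, M = 32.
print(transfer_delay_bounds_us(614400, 3072, 32))  # (25250, 29250)
```

Under these assumed values the frame needs 200 microframes plus 2 for overhead, so the delay lies between 25.25 ms and just under 29.25 ms.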

FIG. 4 shows the transfer delay with this jitter taken into account. Because the camera driver checks for image arrivals at that interval, the transfer delay is a multiple of the URB buffering period of M microframes. M·U = 32·125 μs = 4 ms holds for every row of FIG. 3, but it may differ from camera to camera.

Therefore, A can be a discrete probability distribution with (usually) two values, a^min and a^max, and the probabilities that the image arrival interval a equals each of these values are defined as in Equation 2 below.

[Equation 2]

p₁ = Pr{a = a^min}, p₂ = Pr{a = a^max}

Here, a^min = ⌊C/(M·U)⌋·(M·U) and a^max = ⌈C/(M·U)⌉·(M·U); that is, they are the multiples of M·U nearest to C from below (floor) and from above (ceiling), respectively. In addition, the expected value of a, E(a), can be approximated by C; that is, a^min·p₁ + a^max·p₂ ≈ C. Since p₁ + p₂ = 1, the probability distribution can be derived from the given C, M, and U.
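The derivation of the arrival-interval distribution can be sketched as follows. The code takes a^min and a^max as the multiples of M·U just below and just above C, and solves a^min·p₁ + a^max·p₂ = C together with p₁ + p₂ = 1. The function name and the 30 fps example are illustrative assumptions; treating the approximation E(a) ≈ C as an equality is also an assumption of this sketch.

```python
import math
from typing import Tuple

def arrival_distribution(cycle_us: float, m: int,
                         u_us: float) -> Tuple[float, float, float, float]:
    """Return (a_min, a_max, p1, p2) for the image arrival interval."""
    period = m * u_us                       # URB buffering period M * U
    a_min = math.floor(cycle_us / period) * period
    a_max = math.ceil(cycle_us / period) * period
    if a_min == a_max:                      # C is an exact multiple of M * U
        return a_min, a_max, 1.0, 0.0
    # Solve a_min * p1 + a_max * p2 = C with p1 + p2 = 1.
    p2 = (cycle_us - a_min) / (a_max - a_min)
    return a_min, a_max, 1.0 - p2, p2

# Illustrative values only: a 30 fps camera, M = 32, U = 125 microseconds.
a_min, a_max, p1, p2 = arrival_distribution(1_000_000 / 30, 32, 125)
print(a_min, a_max, round(p1, 3), round(p2, 3))
```

With these example values the arrival interval is 32 ms about two thirds of the time and 36 ms about one third of the time, which averages to the 33.3 ms camera cycle.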

When an image frame arrives at the head of the queue, it is either accepted or dropped by the queue. If images arrive faster than they are retrieved, some drops are unavoidable. In practice, the queuing decision is made before the image has fully arrived: it is made when the first URB buffering period containing the image frame ends, and this moment is called the queuing decision point, as shown in FIG. 4. A variable time gap exists between the actual image capture and the queuing decision, which can be as long as one URB buffering period (M·U). This gap is referred to as queuing decision jitter.

Consider the capture delay d_capt of an object, defined as the time elapsed between the physical appearance of the object and its successful capture. A successful capture means that the object is present in the queue after the captured image has been transferred. A failed capture corresponds to the case where the frame associated with the object is dropped because the queue does not have enough space to accept it. Thus, d_capt is related to the service interval distribution S of the detector subsystem 130. At the start of each object detection cycle, an image frame is retrieved from the queue, which makes room for a new image frame. The queuing decision point plays an important role in adding a frame to the queue: the frame can be accepted by the queue only if its queuing decision point occurs after that retrieval.

In the best case, d_capt = 0: the object appears just before the capture instant (that is, just before a new capture cycle begins) and the queue has enough space to accept the image frame associated with the object. Conversely, in the worst case, the object appears just after a capture instant and the queue has just become full, so that the image just captured and all subsequently captured images are dropped until the queue becomes available again. The worst-case d_capt can be expressed using the maximum number of image frames dropped consecutively within the longest object detection service interval, denoted s^max. Taking the effect of the maximum queuing decision jitter (M·U) into account, this number can be estimated by counting the queuing decision points within s^max. The time interval spanned by a given number of queuing decision points can be shorter, by up to M·U, than the interval spanned by the same number of capture instants. Therefore, to account for the maximum queuing decision jitter, M·U is added to s^max before dividing by C to obtain d_capt, which determines the range of d_capt as in Equation 3 below.
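The body of Equation 3 does not appear in this text. The sketch below assumes one plausible reading of the paragraph above, namely 0 ≤ d_capt ≤ ⌈(s^max + M·U)/C⌉·C: add M·U to s^max, count how many capture cycles that span covers, and convert the count back to time. The ceiling, the final multiplication by C, the function name, and the example values are all assumptions rather than the patent's stated formula.

```python
import math

def capture_delay_upper_bound_us(s_max_us: float, m: int, u_us: float,
                                 cycle_us: float) -> float:
    """Assumed worst-case capture delay: ceil((s_max + M*U) / C) * C."""
    padded_service_interval = s_max_us + m * u_us  # account for decision jitter
    cycles_waited = math.ceil(padded_service_interval / cycle_us)
    return cycles_waited * cycle_us

# Illustrative values only: s_max = 100 ms, M = 32, U = 125 us, 30 fps camera.
cycle_us = 1_000_000 / 30
print(capture_delay_upper_bound_us(100_000, 32, 125, cycle_us))
```

Under these example values the padded interval of 104 ms spans four capture cycles, so the assumed bound is about 133.3 ms; the lower bound remains 0 as stated in the best case.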

[Equation 3]

Summing Equations 1 and 3 yields the minimum and maximum camera subsystem delays, as expressed in Equation 4 below.

[Equation 4]

- Detector Subsystem Analysis

The detector subsystem 130 may consist of three modules: (i) fetch, (ii) inference, and (iii) display. The service interval distribution S of the detector subsystem 130 can be identified by analyzing and profiling the three modules on an Nvidia Jetson AGX Xavier. No other background workload is assumed.

(i) Fetch: The fetch module retrieves an image frame from the queue. Because V4L2 supports a streaming I/O method in which only pointers to buffers are exchanged between the driver and the application, no actual memory copy may be needed for the data transfer. Most of the execution time is spent resizing the input image to match the resolution of the DNN input layer. Therefore, the execution time e_fetch of the fetch module can be expressed as in Equation 5 below.

[Equation 5]

e_fetch ~ D_fetch(r_in, r_out)

Here, D_fetch(·) is a probability distribution that is a function of the input resolution (r_in) and the output resolution (r_out) of the resizing. The delay of the fetch module, denoted d_fetch, is defined as the time elapsed between its release (the start of the fetch thread) and its completion (the end of image resizing), and can be expressed as in Equation 6 below.

[Equation 6]

d_fetch = e_fetch + b_fetch

Here, b_fetch is the blocking time factor. Because no other background workload is assumed, no preemption delay applies to d_fetch. Regarding b_fetch, a blocking system call may occur in the fetch module when an image is retrieved from the queue. More specifically, if the fetch module tries to retrieve an image while the queue is empty, it is blocked until a new image arrives.

(ii) Inference: The inference module performs forward propagation through the neural network, which can be decomposed into a CPU part and a GPU part. The CPU part simply issues asynchronous GPU operations through CUDA function calls; these may include data copies between CPU and GPU memory and the GPU kernel functions of the neural network. The GPU operations are then inserted into a CUDA stream, a FIFO (first-in, first-out) queue maintained by the CUDA runtime, where they wait to be executed. The CPU thread can block until the asynchronous GPU operations complete by explicitly calling a CUDA synchronization function. The last GPU operation may copy the detection result to CPU memory. For example, to execute a single forward propagation, YOLOv3 may issue 4 copy operations and 322 GPU kernel operations (including repetitions). The execution time of the inference module may depend on the resolution r_nn of the input to the neural network. Accordingly, the CPU execution time e_infer^CPU and the GPU execution time e_infer^GPU of the inference module can be expressed as random variables as in Equation 7 below.

[Equation 7]

e_infer^CPU ~ D_infer^CPU(r_nn), e_infer^GPU ~ D_infer^GPU(r_nn)

Here, D_infer^CPU(·) and D_infer^GPU(·) are the probability distributions of the CPU and GPU execution times, respectively. Each may depend on r_nn. For notational simplicity, the (combined) execution time e_infer of the inference module is defined as e_infer = e_infer^CPU + e_infer^GPU.

Because there is no additional blocking factor, the delay d_infer of the inference module can be expressed as in Equation 8 below.

[Equation 8]

d_infer = e_infer ~ D_infer(r_nn)

Here, D_infer(·) can be estimated as the convolution of D_infer^CPU and D_infer^GPU, as in Equation 9 below.

[Equation 9]

D_infer(r_nn) = D_infer^CPU(r_nn) * D_infer^GPU(r_nn), where * denotes convolution
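The convolution of the CPU and GPU execution time distributions can be illustrated with discrete distributions. The following is a minimal sketch, assuming the two execution times are independent and that profiling yields a probability mass function per part; the function name `convolve_pmf` and the millisecond values are hypothetical and not taken from the measurements.

```python
from collections import defaultdict


def convolve_pmf(pmf_a, pmf_b):
    """Distribution of the sum of two independent discrete random variables.

    Each PMF maps a value (for example, milliseconds) to its probability.
    """
    result = defaultdict(float)
    for value_a, prob_a in pmf_a.items():
        for value_b, prob_b in pmf_b.items():
            result[value_a + value_b] += prob_a * prob_b
    return dict(result)


# Hypothetical profiling results in milliseconds (illustrative only).
d_infer_cpu = {2: 0.5, 3: 0.5}
d_infer_gpu = {20: 0.75, 25: 0.25}

# e_infer = e_infer^CPU + e_infer^GPU, so its distribution is the convolution.
d_infer = convolve_pmf(d_infer_cpu, d_infer_gpu)
```

With measured histograms in place of the hypothetical values, the same routine would give an estimate of D_infer for a given input resolution.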

(iii) Display: The display module visualizes the detection result by displaying the image with a bounding box around each detected object. To do so, it may perform the following steps. First, the raw detection results, which have varying confidence levels, are filtered with a given confidence threshold. Next, irrelevant duplicates are merged (or removed) by a non-maximum suppression algorithm. Finally, the display module draws a rectangle around each detected object. Accordingly, the execution time of the display step, denoted e_disp, may depend on the number of detected objects n_obj and can be expressed as in Equation 10 below.

[Equation 10]

e_disp ~ D_disp(n_obj)

Here, the number of raw detections and the final number of detected objects may be very highly positively correlated. Therefore, only the final number of detected objects is used to define the probability distribution, rather than both variables. In addition, there may be a blocking factor b_disp caused by calling the OpenCV function that renders the image on the screen. Accordingly, the delay d_disp of the display module can be expressed as in Equation 11 below.

[Equation 11]

d_disp = e_disp + b_disp

Here, for worst-case analysis, the maximum number of detectable objects may be assumed to be n_obj = 30.
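The first two display steps (confidence filtering, then non-maximum suppression) can be sketched as follows. This is a simplified, class-agnostic illustration rather than Darknet's implementation; the function names and the threshold values 0.5 and 0.45 are assumptions chosen for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(detections, conf_threshold=0.5, iou_threshold=0.45):
    """detections: list of (box, confidence). Returns the detections to draw."""
    # Step 1: drop raw detections below the confidence threshold.
    candidates = [d for d in detections if d[1] >= conf_threshold]
    # Step 2: greedy non-maximum suppression, highest confidence first.
    candidates.sort(key=lambda d: d[1], reverse=True)
    kept = []
    for box, conf in candidates:
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, conf))
    return kept


# Hypothetical raw detections: two overlapping boxes, one distant box,
# and one low-confidence box.
raw = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8),
       ((50, 50, 60, 60), 0.7), ((0, 0, 5, 5), 0.3)]
kept = filter_detections(raw)
```

The number of entries in `kept` plays the role of n_obj: the drawing step that follows does work proportional to it.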

- Multithreaded pipeline architecture

In Darknet, the three modules above may be organized in a multithreaded pipeline architecture that uses a fork-join model to exploit multi-core processors. That is, there may be three parallel threads, one per module: (i) a fetch thread, (ii) an inference thread, and (iii) a display thread. The three threads are forked at the start of each object detection cycle and joined at the end of the loop, after which the next cycle can begin. FIG. 5 shows the timeline of the multithreaded pipeline architecture. In general, in the i-th object detection cycle, the i-th image frame is fetched while the (i-1)-th image frame is processed by the neural network and the (i-2)-th image is displayed on the screen. Because inference takes the longest in FIG. 5, the fetch and display threads may wait at the end of each object detection cycle until the inference thread completes. Accordingly, the object detection service interval, denoted s, can in general be expressed as in Equation 12 below.

[Equation 12]

s = max({d_fetch, d_infer, d_disp})

Here, s may also correspond to the object detection cycle, depending on the situation. With respect to s, d_infer may be considerably longer than d_fetch and d_disp. Accordingly, the service interval distribution S can be assumed to be as in Equation 13 below.

[Equation 13]

S = D_infer(·)
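The fork-join structure described above for Darknet can be sketched with three threads per cycle. This is a minimal illustration, not Darknet's code: the stage bodies are stand-ins that only pass frame indices along, and the name `run_pipeline` is hypothetical.

```python
import threading


def run_pipeline(num_cycles):
    """Fork-join pipeline; returns (cycle, frame) pairs in display order."""
    fetched = None    # frame fetched in the previous cycle
    inferred = None   # frame inferred in the previous cycle
    displayed = []
    for cycle in range(num_cycles):
        out = {}

        def fetch():
            out["fetched"] = cycle      # stand-in for grabbing and resizing a frame

        def infer(frame=fetched):
            out["inferred"] = frame     # stand-in for the forward pass

        def display(frame=inferred):
            if frame is not None:
                displayed.append((cycle, frame))

        threads = [threading.Thread(target=f) for f in (fetch, infer, display)]
        for t in threads:               # fork at the start of the cycle
            t.start()
        for t in threads:               # join at the end of the cycle
            t.join()
        fetched, inferred = out["fetched"], out["inferred"]
    return displayed


timeline = run_pipeline(5)
```

Each pair in `timeline` shows that the frame displayed in cycle i is the one fetched in cycle i-2, and because every cycle ends with a join, its length is set by the slowest of the three threads.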

In general, however, the service interval distribution S of the detector subsystem 130 under Equation 12 is determined by the possible combinations of values of the three random variables {d_fetch, d_infer, d_disp}. That is, the distribution is defined over the set of maxima of the three values, taken for each possible combination. If d_fetch = A, d_infer = B, and d_disp = C, the probability of that combination is the product of the three probabilities, that is, Pr{d_fetch = A} × Pr{d_infer = B} × Pr{d_disp = C}. The final S is obtained by summing, for each service interval value, the probabilities of all combinations that yield it. Based on S, the delay of the detector subsystem 130, denoted d_detector, can be expressed as in Equation 14 below.

[Equation 14]

d_detector = 2s + d_disp

As shown in FIG. 5, this corresponds to two object detection cycles plus the delay of the display thread. The minimum and maximum detector subsystem delays can then be calculated as in Equation 15 below.

[Equation 15]

d_detector^min = 2·s^min + d_disp^min
d_detector^max = 2·s^max + d_disp^max
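The enumeration of combinations described above can be written out directly. The sketch below assumes the three delays are independent discrete random variables; the function names and the millisecond values are hypothetical. The bounds are obtained by taking the extremes of Equation 14.

```python
from collections import defaultdict
from itertools import product


def service_interval_pmf(d_fetch, d_infer, d_disp):
    """PMF of s = max(d_fetch, d_infer, d_disp) for independent delays."""
    s_pmf = defaultdict(float)
    for (a, pa), (b, pb), (c, pc) in product(
            d_fetch.items(), d_infer.items(), d_disp.items()):
        # Probability of this combination is the product of the three;
        # combinations with the same maximum are summed.
        s_pmf[max(a, b, c)] += pa * pb * pc
    return dict(s_pmf)


def detector_delay_bounds(s_pmf, d_disp):
    """Smallest and largest value of d_detector = 2s + d_disp."""
    return (2 * min(s_pmf) + min(d_disp), 2 * max(s_pmf) + max(d_disp))


# Hypothetical delay distributions in milliseconds (illustrative only).
d_fetch = {5: 1.0}
d_infer = {20: 0.5, 30: 0.5}
d_disp = {8: 0.5, 25: 0.5}

s_pmf = service_interval_pmf(d_fetch, d_infer, d_disp)
bounds = detector_delay_bounds(s_pmf, d_disp)
```

In this example the display delay occasionally exceeds the inference delay, so S differs from D_infer; when d_infer dominates in every combination, the result reduces to the simplification of Equation 13.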

- Queue delay

Based on the image arrival interval distribution A and the object detection service interval distribution S, the queue delay d_queue, defined as the time each image frame spends in the queue, can be analyzed. Darknet may use the OpenCV VideoCapture library for queue management, with a default queue size of Q = 4. A full queue cannot accept further incoming image frames. The analysis of d_queue needs to consider the following three cases.

(i) Case 1: min(A) > max(S). In this case, the object detection service interval is always shorter (faster) than the image arrival interval. From the fetch thread's point of view, the queue is then empty every time it tries to retrieve an image frame, so the fetch thread blocks until an image frame arrives. Accordingly, the queue delay can be expressed as in Equation 16 below.

[Equation 16]

d_queue = 0

In this case, b_fetch is eventually extended until d_fetch = b_fetch + e_fetch = a, where a is the image arrival interval. As a result, the object detection cycle time is governed by the image arrival interval regardless of d_infer and d_disp.

(ii) Case 2: max(A) < min(S). In this case, the image arrival interval is always shorter (faster) than the object detection service interval, so the queue is full every time the fetch thread tries to retrieve an image frame. Accordingly, b_fetch = 0 always holds. As for the queue delay, an image that enters at the tail of the queue needs Q object detection cycles (that is, Q fetches) before it is dequeued. In the worst case, therefore, approximately Q·s^max may be needed. More precisely, the maximum queue delay can be calculated as in Equation 17 below.

[Equation 17]

The subtracted term is the minimum time that elapses between the start of an object detection cycle (that is, the moment space is freed for a newly captured image) and the actual arrival of the image frame. The maximum queuing decision jitter (M·U) needs to be taken into account because some portion of the image frame may be transmitted after capture even before the object detection cycle begins. The best-case value of the queue delay can be calculated as in Equation 18 below.

[Equation 18]

This case may occur when the object appears immediately after one capture instant and is successfully captured at the next capture instant. A delay of up to C is then accounted for in the capture delay rather than in the queue delay. In this case, the queuing decision jitter can be assumed to be zero when calculating the minimum queue delay.

(iii) Case 3: all other conditions. In this case, the queue state may be non-deterministic. A long-term average can be estimated only when the means of the two distributions are equal. Moreover, because the queue length is bounded within [0, Q], the queue state at a particular instant cannot be predicted. Accordingly, the queue delay may be 0 in the best case, as in Case 1, or may grow up to the value given by Equation 17 in the worst case, as in Case 2.
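The three cases depend only on how the ranges of the arrival interval and the service interval compare. A minimal sketch of that classification, with a hypothetical function name and illustrative millisecond values:

```python
def queue_case(a_min, a_max, s_min, s_max):
    """Classify the queue behavior from the ranges of A and S.

    Returns 1 when the queue is always empty at fetch time (min(A) > max(S)),
    2 when it is always full (max(A) < min(S)), and 3 otherwise.
    """
    if a_min > s_max:
        return 1
    if a_max < s_min:
        return 2
    return 3


# Hypothetical interval ranges in milliseconds.
slow_camera = queue_case(a_min=33, a_max=34, s_min=20, s_max=30)
slow_detector = queue_case(a_min=10, a_max=12, s_min=25, s_max=40)
overlapping = queue_case(a_min=20, a_max=40, s_min=25, s_max=35)
```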

The following describes approaches for minimizing the end-to-end delay in the object detection system 130. Specifically, three techniques are described: (i) on-demand capture, (ii) the zero-slack pipeline, and (iii) the contention-free pipeline.

FIGS. 6 and 7 are diagrams illustrating the end-to-end delay optimization method according to the present invention, and FIG. 8 is a diagram comparing execution times with and without shared memory bandwidth contention.

FIGS. 6 and 7 show how these techniques progressively reduce the end-to-end delay from the original (vanilla) architecture to the final architecture. The effect of the optimization may depend on the relative execution times of the three threads. For ease of explanation, d_infer is assumed to be much longer than d_fetch and d_disp, as shown in figure (a). In this case, the object detection cycle time s is dominated by d_infer, which may be the most common case in embedded systems.

A. On-demand Capture

From Equations 17 and 18, the queue delay can be reduced by reducing the queue size Q (especially in Case 2 and partly in Case 3). In Case 1, reducing Q has no negative side effect, because the system already behaves as if there were no queue. The queue can therefore be removed entirely so that an image frame is received only when it is needed. This is called the on-demand capture method.
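The effect of the queue size on the queue delay in Case 2 can be checked with a small deterministic simulation. This is a simplified sketch under stated assumptions: arrival and service intervals are constant, a frame arriving at a full queue is dropped, a fetch on an empty queue is simply skipped rather than blocked, and all names and numbers are hypothetical.

```python
from collections import deque


def simulate_queue_delay(arrival_interval, service_interval, queue_size, horizon):
    """Return the queue delay of every fetched frame (arbitrary time units)."""
    queue = deque()
    delays = []
    next_arrival = 0
    next_fetch = 0
    while min(next_arrival, next_fetch) <= horizon:
        if next_arrival <= next_fetch:
            # A frame arrives; a full queue cannot accept it, so it is dropped.
            if len(queue) < queue_size:
                queue.append(next_arrival)
            next_arrival += arrival_interval
        else:
            # A detection cycle starts and takes the oldest queued frame, if any.
            if queue:
                delays.append(next_fetch - queue.popleft())
            next_fetch += service_interval
    return delays


# Case 2: frames arrive every 10 units, a detection cycle takes 30 units.
delays_q4 = simulate_queue_delay(10, 30, queue_size=4, horizon=600)
delays_q1 = simulate_queue_delay(10, 30, queue_size=1, horizon=600)
```

In this run the worst queue delay is 110 time units with Q = 4 and 20 time units with Q = 1, both somewhat below Q·s, which is consistent with the argument for shrinking and ultimately removing the queue.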

In the initialization phase, a fixed queue size Q is requested from the driver with an ioctl command (REQBUFS). Then Q empty buffers, that is, image holders, are enqueued with the ioctl command (QBUF). After that, the fetch loop starts. It first makes a select system call to check whether an image frame is available. On return from the select system call, the buffer containing the image is dequeued with the ioctl command (DQBUF), and an empty buffer is enqueued again with the ioctl command (QBUF). The next loop iteration then starts immediately.

Figure (a) shows the vanilla architecture with a queue, whereas figure (b) shows the on-demand capture architecture. Comparing the two shows that the on-demand capture architecture no longer has a queue delay. However, a blocking time b_fetch may occur at the select system call.

More specifically, the minimum and maximum blocking times can be determined as follows.

[Equation 19]

These are identical to the subtracted terms in Equations 17 and 18, each of which corresponds to the waiting time from the start of an object detection cycle until the new image frame, which will be used later, arrives at the queue, even though no blocking occurs there. In the on-demand capture architecture, by contrast, the fetch thread is blocked for exactly that waiting time, and once it is unblocked the arriving image frame is fed directly to the fetch thread without any queue delay.

In figure (b), the queue delay is completely eliminated by the on-demand capture method described above. That is,

[Equation 20]

d_queue = 0

As a side effect, d_camera and d_fetch overlap by b_fetch in the figure, because the fetch thread is initially blocked for b_fetch while waiting for the image to arrive. Therefore, when the detector subsystem delay is calculated, the overlapping time (b_fetch), which is already counted in d_camera, must be subtracted from the original delay equation, Equation 15.

[Equation 21]

B. Zero-slack Pipeline

In figure (b), because the three pipeline stages (d_fetch, d_infer, and d_disp) differ in length, a considerable idle gap can be observed between the end of the fetch and the end of the object detection cycle. Removing this idle time reduces the end-to-end delay further. To this end, the slack and offset of the fetch thread are defined as follows.

- Fetch thread slack (δ_fetch): the time interval between the end of the fetch thread and the end of the object detection cycle.

- Fetch thread offset (θ_fetch): a constant delay artificially applied each time the fetch thread is released.

Taking the fetch offset into account, d_fetch increases as follows.

[Equation 22]

d_fetch = θ_fetch + e_fetch + b_fetch

The slack can then be calculated as follows.

[Equation 23]

δ_fetch = s - d_fetch

Here, s denotes the object detection cycle time.

Therefore, as shown in figure (c), θ_fetch can be increased until δ_fetch reaches 0 (zero slack), which eventually eliminates the idle gap. The detector subsystem delay can then be reduced further from Equation 21 as follows.

[Equation 24]

However, applying too large a θ_fetch may lengthen the object detection cycle itself. It is therefore necessary to find the optimal θ_fetch that minimizes the end-to-end delay as far as possible without increasing s. To this end, θ_fetch can be determined conservatively as follows.

[Equation 25]

The increased d_fetch hardly affects the minimum object detection cycle time s^min (that is, max(d_fetch) ≤ s^min). Because of this conservative choice, not all of the idle gap can be removed. Nevertheless, a zero-slack pipeline with an appropriate θ_fetch can reduce the end-to-end delay to some extent.
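The slack and offset definitions of Equations 22 and 23 can be expressed in a few lines. The offset rule in `conservative_offset` is an assumption made for illustration, since the formula of Equation 25 is not reproduced in this text: it picks the largest offset for which even the slowest fetch still ends within the shortest cycle. All names and millisecond values are hypothetical.

```python
def fetch_delay(theta_fetch, e_fetch, b_fetch):
    """Equation 22: d_fetch = theta_fetch + e_fetch + b_fetch."""
    return theta_fetch + e_fetch + b_fetch


def fetch_slack(s, theta_fetch, e_fetch, b_fetch):
    """Equation 23: delta_fetch = s - d_fetch."""
    return s - fetch_delay(theta_fetch, e_fetch, b_fetch)


def conservative_offset(s_min, e_fetch_max, b_fetch_max):
    """Assumed rule: keep the worst-case fetch within the shortest cycle."""
    return max(0, s_min - (e_fetch_max + b_fetch_max))


# Hypothetical profile in milliseconds.
theta = conservative_offset(s_min=28, e_fetch_max=6, b_fetch_max=4)

slack_before = fetch_slack(s=30, theta_fetch=0, e_fetch=5, b_fetch=3)
slack_after = fetch_slack(s=30, theta_fetch=theta, e_fetch=5, b_fetch=3)
slack_worst = fetch_slack(s=28, theta_fetch=theta, e_fetch=6, b_fetch=4)
```

With these numbers the slack in a typical cycle shrinks from 22 ms to 4 ms, and it reaches exactly zero only in the worst case, which reflects why a conservative offset cannot remove the whole idle gap.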

C. Contention-free Pipeline

When multiple threads run on a shared-memory multi-core processor, considerable memory bandwidth contention may arise between concurrent threads. The effect may be even worse when CPU threads run alongside GPU kernels on an integrated GPU. To quantify the effect of memory contention, FIG. 8 shows execution time distributions with and without contention. The execution times with contention are measured in the vanilla architecture, whereas the execution times without contention are measured while the object detection stages are executed sequentially. In particular, figure (b) of FIG. 8 shows that the GPU execution time of the inference thread increases significantly (about 28 ms). By contrast, figures (a) and (c) of FIG. 8 show relatively small effects. The figure makes clear that the inference thread suffers most from memory bandwidth contention. Therefore, the inference thread is isolated to minimize contention, while the fetch and display threads run concurrently as in the original architecture. Figure (d) shows the new contention-free pipeline architecture.

In figure (d), the object detection cycle time may change to a new value, defined as follows.

[Equation 26]

However, because d_infer is also reduced by contention-free execution, the increase from s to the new cycle time may be negligible. The delay of the detector subsystem 130 can then finally be reduced as follows.

[Equation 27]

The delay components of the object detection system 100 according to the present invention are the camera subsystem, the queue, and the detector subsystem. The camera subsystem delay is the time from when an object appears until the image frame capturing it arrives at the queue. The detector subsystem delay is the time from when the image is fetched from the queue until Darknet YOLO recognizes the object in the image. Finally, the queue delay is the time an image supplied by the camera subsystem stays in the queue until the detector subsystem consumes it.

Analysis of the internal structure of the system confirmed delays caused by an unnecessary queue, by execution time imbalance between the internal stages of Darknet YOLO, and by hardware resource contention. The end-to-end delay can be optimized in three steps: (I) on-demand capture, (II) the zero-slack pipeline structure, and (III) the contention-free pipeline structure.

First, the delay caused by the unnecessary queue is removed by modifying Darknet YOLO so that it requests and receives an image from the camera driver when an image is needed. Second, the idle time caused by the execution time imbalance is removed by starting the stage with the short execution time later. Finally, when a stage that uses the GPU and a stage that uses the CPU run in parallel among the internal stages of Darknet YOLO, memory bandwidth interference adversely affects the execution time of the GPU stage. To remove the delay caused by this interference, only the stages that use the CPU run in parallel with each other; the stage that uses the GPU is protected and runs alone.
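The resulting cycle structure, in which the CPU stages run together and the GPU stage then runs alone, can be sketched as a two-section loop. This is an interpretation of the description above rather than the actual implementation: the stage bodies are stand-ins that only log frame indices, and the name `run_contention_free` is hypothetical.

```python
import threading


def run_contention_free(num_cycles):
    """Return the ordered event log of a two-section detection cycle."""
    events = []
    lock = threading.Lock()
    result = None  # frame whose detection result is ready to display

    def record(kind, frame):
        with lock:
            events.append((kind, frame))

    for cycle in range(num_cycles):
        out = {}

        def fetch():
            out["frame"] = cycle
            record("fetch", cycle)

        def display(prev=result):
            if prev is not None:        # display only if a result exists
                record("display", prev)

        # First section: the two CPU stages run in parallel.
        first_section = [threading.Thread(target=fetch),
                         threading.Thread(target=display)]
        for t in first_section:
            t.start()
        for t in first_section:
            t.join()

        # Second section: the GPU stage runs alone, right after the first.
        record("infer", out["frame"])
        result = out["frame"]
    return events


events = run_contention_free(3)
```

In the log, every inference entry follows the fetch of the same frame and never overlaps a fetch or display, and each frame is displayed in the cycle after it was fetched.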

Each optimization method according to the present invention was verified with Darknet YOLO (v1, v2, v3) on the Jetson AGX Xavier platform, and the average end-to-end delay for v3 was confirmed to be 76% lower than that of the original. In addition, because the optimization methods according to the present invention change only the system structure and not the neural network structure, they may not affect object recognition accuracy.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that it can be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

100: object detection system
110: camera subsystem
130: detector subsystem

Claims

An object detection apparatus for minimizing end-to-end delay, comprising:
a first thread processing unit that fetches a camera image during a first time section of every cycle;
a second thread processing unit that performs inference on the fetched camera image during a second time section immediately adjacent to the first time section of every cycle; and
a third thread processing unit that detects, during the first time section of every cycle, whether an inferred camera image exists and, if so, displays the inferred camera image,
wherein the first thread processing unit performs on-demand capture, receiving the camera image directly from a camera in the first time section without a separate memory buffer for storing the camera image.

(Deleted)

제1항에 있어서, 상기 제1 쓰레드 처리부는
CPU (Central Processing Unit) 기반으로 수행되고 상기 제1 시간 구간에 직접적으로 상기 카메라에 접근하여 상기 카메라 영상을 수신하는 것을 특징으로 하는 종단간 지연을 최소화하기 위한 객체검출 장치.
The method of claim 1, wherein the first thread processing unit
An object detection apparatus for minimizing end-to-end delay, which is performed based on a CPU (Central Processing Unit) and directly accesses the camera during the first time period to receive the camera image.

제3항에 있어서, 상기 제1 쓰레드 처리부는
상기 카메라의 정상 동작에 따라 결정되는 상기 매 주기의 시작 시점 이후부터 상기 카메라 영상을 수신하여 쓰레드 간의 경쟁 프리 영상 버퍼에 저장하는 것을 특징으로 하는 종단간 지연을 최소화하기 위한 객체검출 장치.
The method of claim 3, wherein the first thread processing unit
The object detection apparatus for minimizing end-to-end delay, characterized in that the camera image is received from the start time of each cycle determined according to the normal operation of the camera and stored in a contention-free image buffer between threads.

제1항에 있어서, 상기 제2 쓰레드 처리부는
상기 제1 시간 구간이 끝나자 마자 상기 제2 시간 구간의 시점에 상기 가져온 카메라 영상에 관한 영상 처리를 수행하는 것을 특징으로 하는 종단간 지연을 최소화하기 위한 객체검출 장치.
The method of claim 1, wherein the second thread processing unit
The object detection apparatus for minimizing the end-to-end delay, characterized in that as soon as the first time period ends, the image processing on the imported camera image is performed at the time point of the second time period.

제4항에 있어서, 상기 제2 쓰레드 처리부는
GPU (Graphic Processing Unit) 기반으로 수행되고 상기 제2 시간 구간에 상기 제1 쓰레드 처리부와 쓰레드 간의 경쟁 프리 영상 버퍼를 경쟁하지 않고 상기 가져온 카메라 영상에 관한 객체 인식을 수행하는 것을 특징으로 하는 종단간 지연을 최소화하기 위한 객체검출 장치.
5. The method of claim 4, wherein the second thread processing unit
End-to-end delay, which is performed based on GPU (Graphic Processing Unit) and performs object recognition on the fetched camera image without competing with the contention-free image buffer between the first thread processing unit and the thread in the second time period object detection device to minimize

제6항에 있어서, 상기 제2 쓰레드 처리부는
신경망 기반의 스테이지 검출기를 통해 상기 매 주기의 다음 제1 시간 구간의 시점에 해당하는 데드라인에 맞춰 상기 객체 인식을 완료하는 것을 특징으로 하는 종단간 지연을 최소화하기 위한 객체검출 장치.
The method of claim 6, wherein the second thread processing unit
The object detection apparatus for minimizing end-to-end delay, characterized in that through a neural network-based stage detector, the object recognition is completed according to a deadline corresponding to the time point of the first time section next to each cycle.

제1항에 있어서, 상기 제3 쓰레드 처리부는
상기 카메라 영상을 가져오는 시간 구간과 동기에 맞춰 상기 매 주기의 제1 시간 구간 동안마다 상기 추론된 카메라 영상을 디스플레이하는 것을 특징으로 하는 종단간 지연을 최소화하기 위한 객체검출 장치.
The method of claim 1 , wherein the third thread processing unit
The object detection apparatus for minimizing end-to-end delay, characterized in that the inferred camera image is displayed during the first time period of each period in synchronization with a time period for obtaining the camera image.

The object detection apparatus of claim 8, wherein the third thread processing unit runs on a CPU (Central Processing Unit) and highlights objects in the inferred camera image with an overlay.

The object detection apparatus of claim 9, wherein the third thread processing unit analyzes the object based on metadata about the object before the display, and highlights the object differently according to the type of the object.
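
The type-dependent highlighting described in this claim can be pictured with a minimal sketch. The metadata layout (a `label` and a `box` per detection), the color table, and the function name `plan_overlays` are assumptions made for illustration; the claim itself does not specify them.

```python
from typing import Dict, List, Tuple

Color = Tuple[int, int, int]          # (R, G, B)
Box = Tuple[int, int, int, int]       # (x, y, width, height)

# Hypothetical mapping from object type to highlight color (not from the patent).
HIGHLIGHT_COLORS: Dict[str, Color] = {
    "pedestrian": (255, 0, 0),
    "vehicle": (0, 0, 255),
}
DEFAULT_COLOR: Color = (255, 255, 0)  # used for any type not listed above


def plan_overlays(detections: List[dict]) -> List[Tuple[Box, Color]]:
    """Choose an overlay color per detected object from its metadata.

    Each detection is assumed to be a dict with a 'label' (object type)
    and a 'box' (bounding box); the result is what a display step would draw.
    """
    return [
        (d["box"], HIGHLIGHT_COLORS.get(d["label"], DEFAULT_COLOR))
        for d in detections
    ]
```

In this sketch the analysis happens before anything is drawn, so the display step only has to render the returned boxes and colors.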

The object detection apparatus of claim 1, further comprising a control unit that synchronizes the cycles and provides the start time of each of the first and second time periods in each cycle.

An object detection method for minimizing end-to-end delay, comprising:
fetching a camera image during the first time period of every cycle;
inferring the fetched camera image during a second time period immediately adjacent to the first time period of every cycle; and
detecting, during the first time period of every cycle, whether an inferred camera image exists and, if so, displaying the inferred camera image,
wherein fetching the camera image includes performing on-demand capture, in which the camera image is received directly from the camera during the first time period without a separate memory buffer for storing the camera image.
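
The per-cycle ordering of the three steps can be sketched as a small deterministic simulation. This is only an illustration under stated assumptions: the names `CycleConfig` and `run_cycles`, the fixed inference time, and the choice to list the display event before the fetch event (the two run concurrently in the first time period) are not taken from the patent.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CycleConfig:
    period_ms: float  # length of one cycle (assumed to match the camera frame interval)
    fetch_ms: float   # length of the first time period (fetch and display)


def run_cycles(num_cycles: int, cfg: CycleConfig, infer_ms: float) -> List[Tuple[float, str]]:
    """Return a timeline of (time_ms, event) pairs for a fixed inference time."""
    timeline: List[Tuple[float, str]] = []
    inferred: Optional[int] = None  # frame index whose inference result is ready
    for k in range(num_cycles):
        start = k * cfg.period_ms
        # First time period: display the previous result only if one exists.
        if inferred is not None:
            timeline.append((start, f"display frame {inferred}"))
            inferred = None
        # First time period: fetch a fresh frame directly from the camera.
        timeline.append((start, f"fetch frame {k}"))
        # Second time period begins as soon as the first one ends.
        infer_start = start + cfg.fetch_ms
        timeline.append((infer_start, f"infer frame {k}"))
        # The result counts as available only if inference finishes
        # before the next cycle starts.
        if infer_start + infer_ms <= start + cfg.period_ms:
            inferred = k
    return timeline


if __name__ == "__main__":
    # Example values are arbitrary: a 33 ms cycle with a 3 ms fetch window.
    for t, event in run_cycles(3, CycleConfig(period_ms=33.0, fetch_ms=3.0), infer_ms=25.0):
        print(f"{t:6.1f} ms  {event}")
```

When inference overruns the cycle in this sketch, no inferred image exists at the next first time period, so nothing is displayed for that frame.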

An advanced driver assistance device comprising:
a first thread processing unit that fetches a camera image during the first time period of every cycle;
a second thread processing unit that infers the fetched camera image during a second time period immediately adjacent to the first time period of every cycle;
a third thread processing unit that detects, during the first time period of every cycle, whether an inferred camera image exists and, if so, displays the inferred camera image; and
a control unit that determines the urgency of a vehicle while driving based on the inferred camera image and controls the vehicle,
wherein the first thread processing unit performs on-demand capture, in which the camera image is received directly from the camera during the first time period without a separate memory buffer for storing the camera image.

The advanced driver assistance device of claim 13, wherein, when an error occurs in the start time of each of the first and second time periods in a cycle, the control unit skips inference of the fetched camera image, re-establishes the synchronization of the cycles, and resets the start time of the first time period.
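
One way to picture this recovery behavior is the sketch below. It assumes that a timing error means the observed start of a cycle drifts from the expected start by more than a tolerance, and that resynchronizing means re-anchoring the next cycle at the observed time; the claim does not define either, and the name `next_cycle_start` and its parameters are hypothetical.

```python
from typing import Tuple


def next_cycle_start(expected_start: int, actual_start: int,
                     period: int, tolerance: int) -> Tuple[bool, int]:
    """Decide whether to run inference this cycle and when the next cycle starts.

    All times are in the same unit (for example, milliseconds).
    Returns (run_inference, next_start).
    """
    drift = actual_start - expected_start
    if abs(drift) <= tolerance:
        # On schedule: infer as usual and keep the existing cycle timing.
        return True, expected_start + period
    # Timing error (assumed definition): skip inference for this frame and
    # re-anchor the cycle so the next first time period starts from the
    # observed time rather than the stale schedule.
    return False, actual_start + period
```

Skipping a single inference in this way gives up one frame so that later cycles start on the corrected schedule.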