KR102182660B1

KR102182660B1 - System and method for detecting violence using spatiotemporal features and computer readable recording media storing program for executing method thereof

Info

Publication number: KR102182660B1
Application number: KR1020190155473A
Authority: KR
Inventors: 백성욱; 노승민; 이미영; 유 민 울라 파튜; 아민 울라; 울 하크 이자즈
Original assignee: 세종대학교산학협력단
Priority date: 2019-11-28
Filing date: 2019-11-28
Publication date: 2020-11-25

Abstract

The present invention relates to a violence detection system and method using spatiotemporal features, and to a computer-readable recording medium storing a program for performing the method. Specifically, the system applies a CNN model to detect people so that unnecessary frames can be discarded from the acquired video, applies a 3D-CNN to a frame sequence (a sequence of frames in which a person was detected) to predict and determine violent activity in the video, and sends an alert to the relevant security department when violent activity is predicted and determined.

Description

System and method for detecting violence using spatiotemporal features, and computer-readable recording medium storing a program for executing the method

The present invention relates to a violence detection system and method using spatiotemporal features, and to a computer-readable recording medium storing a program for performing the method. More specifically, it relates to a system, method, and recording medium that detect people by applying a CNN model in order to discard unnecessary frames from the acquired video, extract spatiotemporal features by applying a 3D-CNN to a frame sequence (a run of consecutive frames in which a person was detected) to predict and determine violent activity in the video, and provide a warning to the relevant security department when violent activity is predicted and determined.

In general, violence refers to physical coercion that causes bodily harm and exerts mental and psychological pressure. As defined by law, violence takes many forms, such as injuring or threatening another person, confining a person, breaking into a residence, and damaging property.

As society becomes more marked by group egoism, individualism, and the nuclear family, and as psychological anxiety (for example over job insecurity) and academic pressure grow, violence is also increasing. Such violence occurs frequently in many places, including schools, streets, parks, and medical centers.

To prevent such violence and to support follow-up measures when it occurs, numerous closed-circuit televisions (CCTVs) are installed in surveillance areas such as schools, kindergartens, streets, parks, medical centers, and shelters.

Video captured by these CCTVs is stored on a server of the organization that installed them, is monitored in real time by that organization's administrator, and is provided as evidence to related agencies such as the police.

However, even when an administrator monitors many CCTV feeds in real time, having to watch them all at once makes careful monitoring difficult, and monitoring continuously for 24 hours is very hard.

Moreover, even if the administrator monitors the video carefully and finds footage that appears to show violence, it is difficult to judge whether it is actual violence, and because judging and acting on it involves many procedures, a great deal of time is required.

To solve these problems, methods for detecting violence from acquired video have been developed, including hand-crafted feature-based approaches such as those of Datta et al., Ngynen et al., Mahadevan et al., Hassner, Huang et al., and Guo et al., and deep-learning-based approaches such as those of Chen, Lloyed et al., Fu et al., and Sudhakaran et al.

These approaches attempted to address many problems in violence detection, including camera viewpoint, complex crowd patterns, and intensity changes. However, they fail to extract discriminative and effective features when the appearance of the human body varies owing to viewpoint, significant mutual occlusion, and scale.

In particular, because conventional methods use low-level features, they are inefficient at recognizing complex patterns and are difficult to implement for real-time surveillance.

In addition, conventional methods process a large number of unimportant frames, so they occupy more memory and take longer.

In addition, conventional methods cannot learn effective patterns because of the scarcity of data in violence detection benchmark datasets and because of low accuracy.

Korean Registered Patent Publication No. 10-1541272 (published 2015.08.03)

Accordingly, an object of the present invention is to provide a violence detection system and method using spatiotemporal features, and a computer-readable recording medium storing a program for performing the method, which detect people by applying a CNN model in order to discard unnecessary frames from the acquired video, extract spatiotemporal features by applying a 3D-CNN to a frame sequence (a run of consecutive frames in which a person was detected) to predict and determine violent activity in the video, and provide a warning to the relevant security department when violent activity is predicted and determined.

To achieve the above object, a violence detection system using spatiotemporal features according to the present invention comprises: an image acquisition unit that acquires and outputs video of a surveillance area; a person detection unit that detects people in the video output from the image acquisition unit and captures and outputs the frames in which a person is detected; a frame sequence generation unit that generates and outputs a frame sequence by grouping the frames output from the person detection unit into a fixed number of frames; a spatiotemporal feature extraction unit that detects spatiotemporal features from the frame sequence output from the frame sequence generation unit, classifies the resulting spatiotemporal feature information, and outputs violence presence information indicating whether violence is present in the frame sequence; and a violence detection determination unit that determines, from the violence presence information, whether the frame sequence is one that contains violence or one that does not.

The image acquisition unit acquires video captured by a closed-circuit television (CCTV) monitoring the surveillance area.

The image acquisition unit acquires the video from either a mobile terminal that captured the surveillance area or a service application server that provides a service for video of the surveillance area.

The person detection unit detects and outputs people in each frame by applying a MobileNet-SSD CNN model that comprises 28 layers when the depthwise and pointwise convolutions are counted separately, and in which batch normalization and a ReLU nonlinearity follow every layer except the final fully connected (FC) layer.

The frame sequence generation unit generates and outputs a frame sequence consisting of 16 frames in which a person is detected.

The spatiotemporal feature extraction unit generates spatiotemporal feature information, comprising spatial correlation information and temporal correlation information between the person-detected frames of the frame sequence input from the frame sequence generation unit, and, based on that information, classifies and outputs whether or not violence is present in the video of the frame sequence.

The spatiotemporal feature extraction unit comprises eight convolution (Conv) layers, five pooling layers, and two fully connected (FC) layers with a softmax output layer that classifies the video as containing violence or not.
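The final classification step can be illustrated with a minimal sketch. It assumes the last FC layer produces two raw scores (logits), one per class; the label order and the function names `softmax` and `classify` are hypothetical and not taken from the source.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical label order; the source only names the two classes.
LABELS = ("non-violent", "violent")

def classify(logits):
    """Return (label, probability) of the most probable class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]
```

Calling `classify([0.2, 2.3])` would select the "violent" label, since the second logit is larger.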

The image acquisition unit further acquires and outputs location information for the surveillance area, and the system further comprises an alarm unit that, when the violence detection determination unit determines that the frame sequence contains violence, sends violence occurrence notification information, including the location information of where the video of the frame sequence was acquired, to a related agency.

The location information is obtained from a closed-circuit television management server that holds the location information of where the closed-circuit television capturing the surveillance area is installed.

To achieve the above object, a violence detection method using spatiotemporal features according to the present invention comprises: an image acquisition process in which the image acquisition unit acquires and outputs video of a surveillance area; a person-detected frame extraction process in which the person detection unit detects people in the video output from the image acquisition unit and captures and outputs the frames in which a person is detected; a frame sequence generation process in which the frame sequence generation unit generates and outputs a frame sequence by grouping the frames output from the person detection unit into a fixed number of frames; a spatiotemporal feature extraction process in which the spatiotemporal feature extraction unit detects spatiotemporal features from the frame sequence output from the frame sequence generation unit, classifies the resulting spatiotemporal feature information, and outputs violence presence information indicating whether violence is present in the frame sequence; and a violence detection determination process in which the violence detection determination unit determines, from the violence presence information, whether the frame sequence is one that contains violence or one that does not.

The spatiotemporal feature extraction process comprises: a spatial correlation information acquisition step of obtaining spatial correlation information between the person-detected frames of the frame sequence input from the frame sequence generation unit; a temporal correlation information acquisition step of obtaining temporal correlation information between the frames of the frame sequence; a spatiotemporal feature information generation step of generating spatiotemporal feature information comprising the spatial correlation information and the temporal correlation information; and a classification step of classifying and outputting, based on the generated spatiotemporal feature information, whether or not violence is present in the video of the frame sequence.

The image acquisition process comprises an image acquisition step of acquiring video of the surveillance area and a location information acquisition step of acquiring location information of the surveillance area. The method further comprises an alarm process in which, when the violence detection determination unit determines that the frame sequence contains violence, the alarm unit sends violence occurrence notification information, including the location information acquired in the location information acquisition step for the video of the frame sequence, to a related agency.

To achieve the above object, a computer-readable recording medium according to the present invention stores a program for performing a violence detection method using spatiotemporal features, the method comprising: an image acquisition process in which the image acquisition unit acquires and outputs video of a surveillance area; a person-detected frame extraction process in which the person detection unit detects people in the video output from the image acquisition unit and captures and outputs the frames in which a person is detected; a frame sequence generation process in which the frame sequence generation unit generates and outputs a frame sequence by grouping the frames output from the person detection unit into a fixed number of frames; a spatiotemporal feature extraction process in which the spatiotemporal feature extraction unit detects spatiotemporal features from the frame sequence output from the frame sequence generation unit, classifies the resulting spatiotemporal feature information, and outputs violence presence information indicating whether violence is present in the frame sequence; and a violence detection determination process in which the violence detection determination unit determines, from the violence presence information, whether the frame sequence is one that contains violence or one that does not.

In its first stage, the present invention uses an efficient CNN model to detect people and discard unwanted frames, which reduces the overall processing time.

In addition, frame sequences containing people are fed to a 3D CNN model trained on three benchmark datasets to extract spatiotemporal features, so patterns can be learned effectively despite the relatively small datasets, and accuracy can be improved.

In addition, the present invention optimizes the model by applying an optimization module, which increases speed and improves performance on the final platform.

FIG. 1 is a diagram showing the configuration of a communication system including the violence detection system using spatiotemporal features according to the present invention.
FIG. 2 is a diagram showing the configuration of the violence detection system using spatiotemporal features according to the present invention.
FIG. 3 is a diagram showing the violence detection process using spatiotemporal features according to the present invention.
FIG. 4 is a diagram showing video frames in which people are detected by the violence detection system using spatiotemporal features according to an embodiment of the present invention.
FIG. 5 is a diagram showing the structure of the 3D CNN used for spatiotemporal feature extraction in the violence detection system according to the present invention.
FIG. 6 is a diagram showing video frames at different convolution depths in the violence detection system using spatiotemporal features according to the present invention.
FIG. 7 is a flowchart showing the violence detection method using spatiotemporal features according to the present invention.

Hereinafter, the configuration and operation of the violence detection system using spatiotemporal features according to the present invention are described with reference to the accompanying drawings, and the violence detection method performed in the system is described in detail.

FIG. 1 is a diagram showing the configuration of a communication system including the violence detection system using spatiotemporal features according to the present invention.

Referring to FIG. 1, the violence detection system 300 using spatiotemporal features of the present invention is connected to a wired/wireless data communication network 10 and can acquire video by performing data communication with an image providing device 100, an image providing unit 200, and the like that are connected to the wired/wireless data communication network 10.

The image providing device 100 may be a plurality of closed-circuit televisions (CCTVs) 110 that capture a surveillance area at a specific location and a mobile terminal 120, such as a smartphone or smart pad, that captures a surveillance area. The surveillance area means the area captured by the CCTV 110 and the mobile terminal 120.

The image providing device 100 may transmit the video of the surveillance area over the wired/wireless data communication network 10 either to the violence detection system 300 of the present invention or to the image providing unit 200.

The image providing unit 200 may be a CCTV management server 210-1 that stores and manages CCTV identification information for the CCTVs 110 and the location information of where each CCTV 110 is installed, stores the video received from the CCTVs 110 mapped to the CCTV identification information and location information, and provides it in real time to the violence detection system 300 of the present invention. It may also include at least one application server 220 that stores and manages terminal identification information for the mobile terminals 120, receives video through an installed video capture application, stores the video mapped to the terminal identification information, and transmits the video to the violence detection system 300 of the present invention.

The mobile terminal 120 may also transmit the captured video and the location information of the surveillance area (the location information of the mobile terminal) to one or more of the violence detection system 300 and the corresponding application server 220.

The related-agency server 400 is a server installed at a related agency such as a police station or hospital. On receiving violence occurrence notification information from the violence detection system 300, it can raise an alarm through a separate alarm device (not shown) and display the violence occurrence notification information on an administrator terminal (not shown). The violence occurrence notification information preferably includes the location information of the detection area where the violence occurred.

The violence detection system 300 acquires video, or video together with location information, from either the image providing device 100 or the image providing unit 200 described above, analyzes the acquired video to monitor whether violence occurs, and, on detecting violence, transmits violence occurrence notification information to the related-agency server 400 so that the violence can be dealt with quickly.
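The notification step can be pictured with a small sketch. The record layout below is purely illustrative: the source only states that the notification includes location information, so the class name `ViolenceNotification`, the field `source_id`, and the function `build_notification` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ViolenceNotification:
    source_id: str  # hypothetical identifier of the CCTV or mobile terminal
    location: str   # location information of the surveillance area
    message: str = "violence detected"

def build_notification(is_violent: bool, source_id: str,
                       location: str) -> Optional[ViolenceNotification]:
    """Return a notification only when the frame sequence was judged violent."""
    if not is_violent:
        return None
    return ViolenceNotification(source_id=source_id, location=location)
```

A caller would forward a non-`None` result to the related-agency server; how it is transported is outside the scope of this sketch.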

The detailed configuration and operation of the violence detection system 300 are described below with reference to FIG. 2.

FIG. 2 is a diagram showing the configuration of the violence detection system using spatiotemporal features according to the present invention, FIG. 3 is a diagram showing the violence detection process using spatiotemporal features according to the present invention, FIG. 4 is a diagram showing video frames in which people are detected by the violence detection system according to an embodiment of the present invention, FIG. 5 is a diagram showing the structure of the 3D CNN used for spatiotemporal feature extraction in the violence detection system according to the present invention, and FIG. 6 is a diagram showing video frames at different convolution depths in the violence detection system according to the present invention. The description below refers to FIGS. 2 to 6.

The violence detection system 300 according to the present invention includes an image acquisition unit 310, a person detection unit 320, a frame sequence generation unit 330, a spatiotemporal feature extraction unit 340, and a violence detection determination unit 350, and, depending on the embodiment, may further include an alarm unit 360 and a learning unit 380.

As shown at 511 in 510 of FIG. 3, the image acquisition unit 310 acquires and outputs video, or location information and video, from one or more of the image providing device 100 and the image providing unit 200 over the wired/wireless data communication network 10. The video may be acquired and output as a stream or as video data, but acquiring it as a stream is preferable so that it can be monitored in real time.

The person detection unit 320 monitors, frame by frame, whether a person is present in the video input from the image acquisition unit 310, and outputs the frame when a person is detected.

As shown at 512 in FIG. 3, the person detection unit 320 applies a pretrained MobileNet-SSD (Single Shot MultiBox Detector) convolutional neural network (CNN) model to detect people, the objects of interest, in the frames of the acquired video.

Instead of regular convolutions, MobileNet uses depthwise separable convolutions to detect objects.
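To see why depthwise separable convolutions are cheaper, the weight counts of the two layer types can be compared. The sketch below uses the general formulas (bias terms ignored); the 3x3, 32-to-64-channel example is an illustrative choice, not a figure taken from this document.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a regular convolution: one k x k x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out

def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise (k x k per input channel) plus pointwise (1 x 1) pair."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

regular = standard_conv_params(3, 32, 64)          # 18432
separable = depthwise_separable_params(3, 32, 64)  # 2336
```

In this example the separable form needs roughly eight times fewer weights than the regular convolution.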

The MobileNet-SSD CNN model according to the present invention comprises 28 layers when the depthwise and pointwise convolutions are counted separately, and batch normalization and a ReLU nonlinearity follow every layer except the final fully connected (FC) layer. ReLU is one of the activation functions widely used in machine learning.

The first convolution layer of the MobileNet-SSD CNN model according to the present invention has a stride of 2, a filter shape of 3*3*3*32, and an input size of 224*224*3. The depthwise convolution that follows has a stride of 1, a filter shape of 3*3*32, and an input size of 112*112*32.
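The layer sizes quoted above are consistent with the usual convolution output-size formula. The sketch below checks them, assuming a padding of 1 pixel on each side; the padding is an assumption, since the text gives only filter shape, stride, and input size.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# First layer: 224x224 input, 3x3 filter, stride 2 (padding 1 assumed).
first = conv_output_size(224, kernel=3, stride=2, padding=1)

# Following depthwise layer: 3x3 filter, stride 1 (padding 1 assumed).
second = conv_output_size(first, kernel=3, stride=1, padding=1)
```

Under that assumption the stride-2 layer halves 224 to 112, which matches the 112*112*32 input of the next layer, and the stride-1 depthwise layer keeps the spatial size unchanged.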

MobileNet is mainly applied to classification, whereas SSD, a multibox detector, is used to find the exact locations of objects; combined, they perform object detection. To this end, SSD is appended to the end of the network, where it performs feed-forward convolutions, generates a fixed-size collection of bounding boxes, and extracts and applies feature maps to establish the presence of object instances in those boxes. The convolution filters and bounding boxes yield predicted classes with a probability for each class, and the class with the highest probability indicates the object.

The MobileNet-SSD CNN model according to the present invention, that is, the person detection unit 320, detects people in an ice hockey video as shown in FIG. 4 and captures and outputs the frames in which a person is detected. Because unnecessary frames are removed from the video and only frames containing people are output, memory is used efficiently and the analysis can be performed efficiently.
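The frame-filtering role of the person detection unit can be sketched as a generator that keeps only frames in which a detector reports a person. The detector is passed in as a callable returning (label, confidence) pairs; that interface, the label string "person", and the 0.5 threshold are all assumptions made for illustration and do not describe any particular MobileNet-SSD implementation.

```python
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

Frame = TypeVar("Frame")
Detection = Tuple[str, float]  # (class label, confidence score)

def person_frames(frames: Iterable[Frame],
                  detect: Callable[[Frame], List[Detection]],
                  threshold: float = 0.5) -> Iterator[Frame]:
    """Yield only the frames in which `detect` reports a confident person."""
    for frame in frames:
        detections = detect(frame)
        if any(label == "person" and score >= threshold
               for label, score in detections):
            yield frame
```

Any real detector would be wrapped so that it matches the `detect` signature before being passed in.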

The frame sequence generation unit 330 collects at least a predetermined number of consecutive frames output from the person detection unit 320, then generates and outputs a frame sequence of the collected number of frames. The frame sequence preferably consists of 16 frames, but is not limited to this.
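A minimal sketch of the grouping step follows. It assumes non-overlapping windows and drops a trailing group that has fewer than `length` frames; the text does not say whether windows overlap or how leftovers are handled, so both choices are assumptions.

```python
from typing import Iterable, Iterator, List, TypeVar

Frame = TypeVar("Frame")

def frame_sequences(frames: Iterable[Frame],
                    length: int = 16) -> Iterator[List[Frame]]:
    """Group incoming frames into consecutive, non-overlapping sequences."""
    buffer: List[Frame] = []
    for frame in frames:
        buffer.append(frame)
        if len(buffer) == length:
            yield buffer
            buffer = []  # start collecting the next sequence
    # An incomplete trailing buffer is discarded (assumption).
```

For example, 40 person-detected frames produce two 16-frame sequences, and the last 8 frames are left over.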

The spatiotemporal feature extraction unit 340 receives the frame sequence output from the frame sequence generation unit 330, detects spatiotemporal features from the frame sequence, classifies the resulting spatiotemporal feature information, and outputs violence presence information indicating whether violence is present in the frame sequence.

The spatiotemporal feature extraction unit 340 extracts spatiotemporal features by applying a 3D CNN model, and outputs spatiotemporal feature information for the extracted features.

As shown at 520 in FIG. 3 and in FIG. 5, the spatiotemporal feature extraction unit 340 of the present invention extracts spatial information through a 2D CNN, and can extract temporal information more effectively through 3D convolution (Conv) and pooling operations. The 3D convolution of the present invention is computed by convolving a 3D mask over a cube formed by stacking the frames of the frame sequence, and the feature maps obtained from a convolution layer are linked to multiple contiguous frames of the previous layer, thereby capturing motion information. Accordingly, the value at position (x, y, z) in the q-th feature map of the p-th layer, with bias t_pq, can be expressed by Equation 1.

Here, Cp is the size of the 3D mask along the temporal dimension, and the weight term is the (a, b, c) value of the 3D mask connected to the k-th feature map of the previous layer. Because the kernel weights are replicated across the entire cube, a single 3D convolution mask extracts only one type of feature from the frame cube.
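Equation 1 itself is not reproduced in this text. A form consistent with the surrounding definitions, assuming the standard 3D convolution formulation, is sketched below. The symbols v (feature map value) and w (mask weight), the spatial mask sizes A_p and B_p, and the tanh activation are assumptions; only t_pq, C_p, the indices p, q, k, and the offsets (a, b, c) come from the description.

```latex
v_{pq}^{xyz} = \tanh\!\left( t_{pq}
  + \sum_{k} \sum_{a=0}^{A_p-1} \sum_{b=0}^{B_p-1} \sum_{c=0}^{C_p-1}
    w_{pqk}^{abc} \, v_{(p-1)k}^{(x+a)(y+b)(z+c)} \right)
```

Read this way, each output value sums the mask weights applied to a neighbourhood of the previous layer's k-th feature map that extends over C_p frames in time, which is how motion across adjacent frames enters the feature map.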

As shown in FIG. 6, the spatiotemporal feature extraction unit 340 receives an input frame sequence and obtains progressively deeper features as the convolution process proceeds to the third and fifth convolution layers.

The spatiotemporal feature extraction unit 340 includes eight convolution layers (Conv), five pooling layers, and two fully connected layers (FC layers, fc) with a softmax output layer.

Each convolution layer has a 3*3*3 kernel with a stride of 1. All pooling layers are max pooling with a 2*2*2 kernel, except the first pooling layer, which has a 1*2*2 kernel with a stride of 2. This max pooling preserves temporal information. The number of filters is 64, 128, and 256 for the first, second, and third convolution layers, respectively. The kernel of each convolution layer has a temporal depth whose size is defined as D. The kernel size and padding used when applying the convolutions are kept at 3 and 1, respectively. The two fully connected layers (fc6 and fc7) each contain 4096 neurons, and the softmax layer has N outputs, where N depends on the data set. In the present invention there are only two classes, violent scenes and non-violent scenes, so N for the softmax layer is 2.
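The effect of the five pooling layers on a 16-frame, 112*112 input can be traced with a few lines of arithmetic. This sketch assumes that each pooling layer uses a stride equal to its kernel in every dimension, with no padding and with odd sizes rounded down; the description does not spell these details out. Convolutions with a kernel of 3, stride of 1, and padding of 1 leave the shape unchanged, so they are omitted.

```python
def trace_pooling(shape, pool_kernels):
    """Return the (frames, height, width) shape after each pooling layer."""
    shapes = [shape]
    for kernel in pool_kernels:
        # Assumption: stride equals kernel size, no padding, floor division.
        shape = tuple(dim // k for dim, k in zip(shape, kernel))
        shapes.append(shape)
    return shapes


# First pooling layer is 1*2*2; the remaining four are 2*2*2.
POOL_KERNELS = [(1, 2, 2)] + [(2, 2, 2)] * 4
shapes = trace_pooling((16, 112, 112), POOL_KERNELS)

for layer, shape in enumerate(shapes):
    print(layer, shape)
```

Under these assumptions the first pooling layer keeps all 16 frames and only halves the spatial size, which is the sense in which it preserves temporal information; the temporal length is then halved by each of the later pooling layers.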

The spatiotemporal feature extraction unit 340 of the present invention takes short frame sequences of 16 frames. During training, however, it uses random crops of size 3*16*112*112 from the original input sequence for effective learning, and then performs 3D convolution and pooling operations on the frame sequence as shown in FIGS. 3 and 5.
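A random spatial crop of a clip can be sketched with plain nested lists. The layout `clip[channel][frame][row][column]` mirrors the 3*16*112*112 ordering in the text; the function names are hypothetical, and applying the same offsets to every frame of the clip is an assumption.

```python
import random


def random_crop_offsets(height, width, crop=112, rng=random):
    """Pick the top-left corner of a crop*crop window inside height*width."""
    top = rng.randint(0, height - crop)
    left = rng.randint(0, width - crop)
    return top, left


def crop_clip(clip, top, left, crop=112):
    """Crop every frame of clip[channel][frame][row][column] at the same offsets."""
    return [[[row[left:left + crop] for row in frame[top:top + crop]]
             for frame in channel]
            for channel in clip]
```

Using one pair of offsets for the whole clip keeps the frames spatially aligned, so the motion between frames is not disturbed by the augmentation.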

When the softmax result value (0 or 1) for a frame sequence is input from the spatiotemporal feature extraction unit 340, the violence detection determination unit 350 determines whether the result value indicates violence (for example, 0 or 1) or non-violence (for example, 1 or 0), and, if it indicates violence, outputs violence occurrence information to the alarm unit 360.
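The decision step can be sketched as taking the class with the highest softmax probability and comparing it with the index that denotes violence. Because the description allows either 0 or 1 to stand for violence, the index is a parameter here; the default of 1 and the function name are assumptions.

```python
VIOLENCE_INDEX = 1  # Assumption: which output index means "violence" is configurable.


def is_violent(softmax_output, violence_index=VIOLENCE_INDEX):
    """Return True if the most probable class is the violence class."""
    predicted = max(range(len(softmax_output)), key=softmax_output.__getitem__)
    return predicted == violence_index


print(is_violent([0.2, 0.8]))
print(is_violent([0.9, 0.1]))
```

A True result corresponds to the case in which violence occurrence information would be passed to the alarm unit 360.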

When violence occurrence information is input from the violence detection determination unit 350, the alarm unit 360 generates violence occurrence notification information that includes the location information of the surveillance area of the frame sequence acquired through the image acquisition unit 310, and transmits it to a preset related institution server 400 to notify it that violence has occurred.

The learning unit 380 learns violent activity patterns from multiple video data sets. The video data sets are divided into two categories, a violent class and a non-violent class; that is, the learning unit 380 is trained on both violent and non-violent footage. The data sets may be split into frame sequences of 16 frames, with an 8-frame overlap between two consecutive clips.
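Splitting a video into 16-frame clips with an 8-frame overlap amounts to a sliding window with a step of 8. In the sketch below, dropping a trailing partial clip is an assumption, and `split_into_clips` is a hypothetical name.

```python
def split_into_clips(frames, clip_length=16, overlap=8):
    """Split a frame list into fixed-length clips that overlap by `overlap` frames."""
    step = clip_length - overlap
    last_start = len(frames) - clip_length
    # Assumption: frames left over after the last full clip are discarded.
    return [frames[start:start + clip_length]
            for start in range(0, last_start + 1, step)]


clips = split_into_clips(list(range(40)))
print(len(clips), [clip[0] for clip in clips])
```

For a 40-frame video this yields clips starting at frames 0, 8, 16, and 24, so each clip shares its last 8 frames with the first 8 frames of the next.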

In addition, the learning unit 380 may be configured to learn, in real time, the frames, frame sequences, and classification results that are acquired through the image acquisition unit 310 and output through the spatiotemporal feature extraction unit 340, and to optimize the learned training data so as to update the 3D CNN model of the spatiotemporal feature extraction unit 340.

FIG. 7 is a flowchart illustrating a violence detection method using spatiotemporal features according to the present invention.

Referring to FIG. 7, the learning unit 380 of the violence detection system 300 must first perform learning, that is, training, with the 3D CNN model according to the present invention, using as its data set a large number of previously prepared videos that contain assault and a large number that do not, and must have registered the learning (training) data resulting from that training.

The image acquisition unit 310 monitors whether an image acquisition event occurs (S111). The image acquisition event may occur at regular time intervals, may occur when an image is received from either the image providing device 100 or the image providing unit 200, or may occur at the request of an administrator.

When an image acquisition event occurs, the image acquisition unit 310 acquires the image received from the image providing device 100 and the image providing unit 200, or requests an image and acquires the image received in response, and outputs the acquired image to the person detection unit 320 (S113).

When an image is input from the image acquisition unit 310, the person detection unit 320 begins monitoring whether a person is present in the image using the MobileNet-SSD CNN model according to the present invention (S115), and checks whether a person is detected (S117).

When a person is detected, the person detection unit 320 begins capturing the frames in which the person is detected and outputting them to the frame sequence generation unit 330 (S119).

Once frames begin to arrive from the person detection unit 320, the frame sequence generation unit 330 determines whether a preset number of consecutive frames has been input (S121). When the preset number of frames has been input consecutively, it generates a frame sequence and outputs it to the spatiotemporal feature extraction unit 340 (S123).

The spatiotemporal feature extraction unit 340 extracts spatiotemporal features from the input frame sequence, classifies the video of the frame sequence as containing violence or not containing violence according to the extracted features, and outputs the classification result to the violence detection determination unit 350 (S125).

When the classification result is input from the spatiotemporal feature extraction unit 340, the violence detection determination unit 350 outputs violence occurrence information to the alarm unit 360 if the result indicates that the video of the frame sequence contains violence (S127).

When violence occurrence information is input from the violence detection determination unit 350, the alarm unit 360 transmits, to the related institution server 400, violence occurrence notification information that includes the location information of the surveillance area in which the video of the frame sequence was acquired (S129). That is, the alarm unit 360 notifies related institutions such as police stations and hospitals that violence has occurred at a specific location (area).
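The flow from S113 to S129 can be summarized in a short control-flow sketch. The person detector, the 3D CNN classifier, and the notification transport are passed in as callables and are not implemented here; the function name, the parameter names, and the dictionary layout of the notification are all hypothetical.

```python
def run_detection(frames, detect_person, classify_sequence, notify,
                  location, length=16):
    """Sketch of steps S113 to S129; returns the number of alerts sent."""
    buffer = []
    alerts = 0
    for frame in frames:                     # S113: acquired frames arrive
        if not detect_person(frame):         # S115-S117: is a person present?
            buffer.clear()                   # assumption: a gap resets the run
            continue
        buffer.append(frame)                 # S119: keep the person-detected frame
        if len(buffer) < length:             # S121: enough consecutive frames?
            continue
        sequence, buffer = buffer, []        # S123: emit the frame sequence
        if classify_sequence(sequence):      # S125-S127: violence present?
            notify({"event": "violence", "location": location})  # S129
            alerts += 1
    return alerts
```

Passing the three stages in as callables keeps the sketch independent of any particular model or messaging library; in the described system they would correspond to units 320, 340 and 350, and 360.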

Meanwhile, those of ordinary skill in the art will readily understand that the present invention is not limited to the typical preferred embodiments described above, and can be practiced with various improvements, modifications, substitutions, or additions without departing from the gist of the present invention. Where an implementation with such improvements, modifications, substitutions, or additions falls within the scope of the appended claims, its technical idea should also be regarded as belonging to the present invention.

10: wired/wireless data communication network 100: image providing device
200: image providing unit 300: violence detection system
310: image acquisition unit 320: person detection unit
330: frame sequence generation unit 340: spatiotemporal feature extraction unit
350: violence detection determination unit 360: alarm unit
380: learning unit 400: related institution server

Claims

A violence detection system using spatiotemporal features, comprising:
an image acquisition unit that acquires and outputs an image captured of a surveillance area;
a person detection unit that detects a person in the image output from the image acquisition unit, and captures and outputs the frames in which a person is detected;
a frame sequence generation unit that generates and outputs a frame sequence in which the frames output from the person detection unit are grouped into a predetermined number of frames;
a spatiotemporal feature extraction unit that detects spatiotemporal features from the frame sequence output from the frame sequence generation unit, classifies the resulting spatiotemporal feature information, and outputs violence presence information indicating whether violence is present in the frame sequence; and
a violence detection determination unit that determines, based on the violence presence information, whether the frame sequence is a frame sequence with violence or a frame sequence without violence.

The violence detection system using spatiotemporal features of claim 1,
wherein the image acquisition unit
acquires an image captured by a closed-circuit television (CCTV) monitoring the surveillance area.

The violence detection system using spatiotemporal features of claim 1,
wherein the image acquisition unit
acquires the image from either an arbitrary mobile terminal that has captured the surveillance area or a service application server that provides an arbitrary service for images captured of the surveillance area.

The violence detection system using spatiotemporal features of claim 1,
wherein the person detection unit
detects and outputs a person for each frame by applying a MobileNet-SSD CNN model that comprises 28 layers when the depthwise and pointwise convolutions applied to the frames of the acquired image are counted separately, and in which every layer except the final fully connected layer (FC layer) is followed by batch normalization and a ReLU nonlinearity.

The violence detection system using spatiotemporal features of claim 4,
wherein the frame sequence generation unit
generates and outputs a frame sequence comprising 16 frames in which a person is detected.

The violence detection system using spatiotemporal features of claim 1 or 5,
wherein the spatiotemporal feature extraction unit
generates spatiotemporal feature information comprising spatial correlation information and temporal correlation information between the person-detected frames of the frame sequence input from the frame sequence generation unit, and, based on the generated spatiotemporal feature information, classifies and outputs whether or not violence is present in the video of the frame sequence.

The violence detection system using spatiotemporal features of claim 6,
wherein the spatiotemporal feature extraction unit
comprises eight convolution layers (Conv), five pooling layers, and two fully connected layers (FC layers) with a softmax output layer that classifies whether the video contains violence or does not contain violence.

The violence detection system using spatiotemporal features of claim 1,
wherein the image acquisition unit
further acquires and outputs location information for the surveillance area,
the system further comprising an alarm unit that, when the violence detection determination unit determines that the input frame sequence is a frame sequence with violence, notifies a related institution of violence occurrence notification information including the location information of where the video of the frame sequence was acquired.

The violence detection system using spatiotemporal features of claim 8,
wherein the location information
is obtained from a closed-circuit television management server that holds location information on where the closed-circuit television capturing the surveillance area is installed.

A violence detection method using spatiotemporal features, comprising:
an image acquisition process in which an image acquisition unit acquires and outputs an image captured of a surveillance area;
a person detection frame extraction process in which a person detection unit detects a person in the image output from the image acquisition unit, and captures and outputs the frames in which a person is detected;
a frame sequence generation process in which a frame sequence generation unit generates and outputs a frame sequence in which the frames output from the person detection unit are grouped into a predetermined number of frames;
a spatiotemporal feature extraction process in which a spatiotemporal feature extraction unit detects spatiotemporal features from the frame sequence output from the frame sequence generation unit, classifies the resulting spatiotemporal feature information, and outputs violence presence information indicating whether violence is present in the frame sequence; and
a violence detection determination process in which a violence detection determination unit determines, based on the violence presence information, whether the frame sequence is a frame sequence with violence or a frame sequence without violence.

The violence detection method using spatiotemporal features of claim 10,
wherein the spatiotemporal feature extraction process comprises:
a spatial correlation information acquisition step of acquiring spatial correlation information between the person-detected frames of the frame sequence input from the frame sequence generation unit;
a temporal correlation information acquisition step of acquiring temporal correlation information between the frames of the frame sequence;
a spatiotemporal feature information generation step of generating spatiotemporal feature information comprising the spatial correlation information and the temporal correlation information; and
a classification step of classifying and outputting, based on the generated spatiotemporal feature information, whether or not violence is present in the video of the frame sequence.

The violence detection method using spatiotemporal features of claim 10,
wherein the image acquisition process comprises:
an image acquisition step of acquiring an image of the surveillance area; and
a location information acquisition step of acquiring location information of the surveillance area,
the method further comprising an alarm process in which, when the violence detection determination unit determines that the input frame sequence is a frame sequence with violence, an alarm unit notifies a related institution of violence occurrence notification information including the location information, acquired in the location information acquisition step of the image acquisition process, of where the video of the frame sequence was acquired.

A computer-readable recording medium storing a program for performing a violence detection method using spatiotemporal features, the method comprising:
an image acquisition process in which an image acquisition unit acquires and outputs an image captured of a surveillance area;
a person detection frame extraction process in which a person detection unit detects a person in the image output from the image acquisition unit, and captures and outputs the frames in which a person is detected;
a frame sequence generation process in which a frame sequence generation unit generates and outputs a frame sequence in which the frames output from the person detection unit are grouped into a predetermined number of frames;
a spatiotemporal feature extraction process in which a spatiotemporal feature extraction unit detects spatiotemporal features from the frame sequence output from the frame sequence generation unit, classifies the resulting spatiotemporal feature information, and outputs violence presence information indicating whether violence is present in the frame sequence; and
a violence detection determination process in which a violence detection determination unit determines, based on the violence presence information, whether the frame sequence is a frame sequence with violence or a frame sequence without violence.

The computer-readable recording medium of claim 13, storing a program for performing the violence detection method using spatiotemporal features,
wherein the spatiotemporal feature extraction process comprises:
a spatial correlation information acquisition step of acquiring spatial correlation information between the person-detected frames of the frame sequence input from the frame sequence generation unit;
a temporal correlation information acquisition step of acquiring temporal correlation information between the frames of the frame sequence;
a spatiotemporal feature information generation step of generating spatiotemporal feature information comprising the spatial correlation information and the temporal correlation information; and
a classification step of classifying and outputting, based on the generated spatiotemporal feature information, whether or not violence is present in the video of the frame sequence.

The computer-readable recording medium of claim 13, storing a program for performing the violence detection method using spatiotemporal features,
wherein the image acquisition process comprises:
an image acquisition step of acquiring an image of the surveillance area; and
a location information acquisition step of acquiring location information of the surveillance area,
the method further comprising an alarm process in which, when the violence detection determination unit determines that the input frame sequence is a frame sequence with violence, an alarm unit notifies a related institution of violence occurrence notification information including the location information, acquired in the location information acquisition step of the image acquisition process, of where the video of the frame sequence was acquired.