KR102675619B1 - Deep learning-based posture estimation system and metaverse system for real-time animation

Info

Publication number: KR102675619B1
Application number: KR1020220032959A
Authority: KR
Inventors: 이준; 김다은
Original assignee: 호서대학교 산학협력단
Priority date: 2022-03-16
Filing date: 2022-03-16
Publication date: 2024-06-19
Also published as: KR20230135449A

Abstract

The present invention provides a deep learning-based posture estimation system comprising: an input module that receives a video file in which a user's motion is captured; a 3D engine module that implements a virtual character; a learning module that labels the user's joint information in the video file and estimates the user's posture through deep learning; and a character module that generates a skeleton model from the results estimated by the learning module and applies the motion of the skeleton model to the virtual character constructed through the 3D engine module. In the learning module, a deep neural network is built from residual blocks, in which the input value (x) is connected directly to the output F(x) obtained by passing the input value (x) through multiple weight (w) layers, and the network is trained so that the output value of [F(x)+x] is minimized.

Description

Deep learning-based posture estimation system and deep learning-based metaverse system for real-time animation {DEEP LEARNING-BASED POSTURE ESTIMATION SYSTEM AND METAVERSE SYSTEM FOR REAL-TIME ANIMATION}

The present invention relates to a deep learning-based posture estimation system for real-time implementation of an avatar in a metaverse environment, and to a metaverse system to which the deep learning-based posture estimation system is applied.

With advances in information technology and device technology, the related industries have grown, and the influence of virtual worlds on daily life is steadily increasing. In particular, the COVID-19 pandemic in 2020 made contact-free ("untact") living familiar and drove a rapid shift toward a non-face-to-face society. This shift, together with the development of related technologies, has sharply increased interest in the metaverse.

The metaverse is an immersive media environment that implements various content, such as communication and economic activity, in a 3D virtual environment without the physical constraints of time and space (Bridges et al., 2021). The "avatar", the agent of activity in the metaverse, is another self that the user creates within it. Through the avatar, the user explores the metaverse, expresses opinions, and forms relationships by interacting with others. For user experience and interaction in the virtual space of the metaverse to be active, a sense of immersion between the avatar and the user is a prerequisite.

Conventionally, technologies for recognizing user motion in virtual reality devices have reproduced movement by having the user wear special equipment, operate a controller device, or touch a smartphone screen. Motion tracking, the technique mainly used for motion recognition in metaverse systems, is a motion recognition technology that has mostly been used in film production. Infrared-reflective markers are attached to the user's joints, and when the user performs motions in a space equipped with several infrared cameras, the information recognized through the markers is matched to the person's major joints. By transferring this motion to a virtual character in real time, the technique identifies and reproduces the user's motion accurately and quickly.

However, this marker-based method requires the user to keep markers, which are expensive special equipment, attached to the body at all times. As a result, it is difficult to popularize among general users outside special fields such as the film industry, and its usability is poor. To address this problem, research on Markerless Human Pose Estimation, which uses deep learning to estimate pose without attached markers, has been actively pursued. OpenPose, a representative example of Markerless Human Pose Estimation, estimates the poses of multiple people but is limited by slow speed when implemented (Cao et al., 2018).

As another form of motion recognition, markerless motion recognition using a camera recognizes and computes body movement indirectly through the camera, so the user can move freely. Compared with marker-based motion recognition, camera-based recognition is less cumbersome during the recognition process, but its error rate can be large, and accurate recognition is difficult in situations such as body occlusion.

Apart from these, user motion recognition in existing metaverse content, with the exception of VRChat, which supports Vive trackers, mostly relies on manipulating the character with devices such as a keyboard or mouse. In this case, users cannot effectively reflect their own body movements in the virtual world, so it is difficult for them to become immersed in, feel present in, and enjoy metaverse content in which the boundary between the real and the virtual is blurred.

Research and development continue on applying technologies that can satisfy the user's five senses in virtual and augmented reality environments, and on giving the user a sense of immersion in and identification with the avatar in virtual space. Reproducing the movement of an avatar in the metaverse by attaching markers or by operating a device does little to give users a sense of realism and immersion. Therefore, for a more effective metaverse content experience, research is needed on using a deep learning-based motion recognition system to reproduce the user's motion in real time through an avatar.

Korean Patent Publication No. 10-2021-0061211

The present invention aims to provide a deep learning-based posture estimation system, and a deep learning-based metaverse system for real-time animation, that recognize a user's motion without markers using ordinary video capture equipment such as a webcam, with improved accuracy and speed of recognition.

To achieve the above object, the present invention provides a deep learning-based posture estimation system comprising: an input module that receives a video file in which a user's motion is captured; a 3D engine module that implements a virtual character; a learning module that labels the user's joint information in the video file and estimates the user's posture through deep learning; and a character module that generates a skeleton model from the results estimated by the learning module and applies the motion of the skeleton model to the virtual character constructed through the 3D engine module. One feature is that, in the learning module, a deep neural network is built from residual blocks, in which the input value (x) is connected directly to the output F(x) obtained by passing the input value (x) through multiple weight (w) layers, and the network is trained so that the output value of [F(x)+x] is minimized.

Preferably, the input module is linked to a video capture device that captures the user's image in real time, and the system may further include an output module that, when the user's real-time motion is input from the video capture device, outputs the user's real-time motion as tracked by the character module using the joint positions and posture estimated by the trained learning module.

Preferably, the 3D engine module may be the Unity (Unity-Chan) 3D engine.

Preferably, the learning module performs learning by dividing the input data, which is the video file, into three channels corresponding to the head, upper body, and lower body, and may label 24 items of joint information as shown in [Table 1] below.

[Table 1]

Preferably, the learning module passes the input data, which is the video file, through the deep neural network and then estimates predicted values for k joints, two values per joint for the (x, y) coordinates, learning them as a vector of 2k dimensions in total.

Preferably, the learning module may normalize the ground truth pose vector, which is the correct-answer class, using the position, width, and height of a bounding box for the user's body.

Preferably, the learning module may normalize the ground truth pose vector based on [Equation 1] below.

[Equation 1]

Here, y is the ground truth pose vector, y_i contains the x and y coordinates of the i-th joint, a labeled image is represented by (x, y), b is the bounding box for the body, b_c is the center of the box, b_w is the width of the box, and b_h is the height of the box.
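The equation image itself is not reproduced in this text. As a minimal sketch consistent with the symbol definitions above, and assuming the normalization translates each joint y_i by the box center b_c and scales it by 1/b_w and 1/b_h, the step could look as follows. The function names and the example box are hypothetical and are not taken from the patent.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def normalize_pose(joints: List[Point], box_center: Point,
                   box_width: float, box_height: float) -> List[Point]:
    """Assumed form: N(y_i; b) = ((x_i - b_c.x) / b_w, (y_i - b_c.y) / b_h)."""
    if box_width <= 0 or box_height <= 0:
        raise ValueError("bounding box width and height must be positive")
    cx, cy = box_center
    return [((x - cx) / box_width, (y - cy) / box_height) for x, y in joints]


def denormalize_pose(normalized: List[Point], box_center: Point,
                     box_width: float, box_height: float) -> List[Point]:
    """Inverse mapping: back from box-relative to image coordinates."""
    cx, cy = box_center
    return [(u * box_width + cx, v * box_height + cy) for u, v in normalized]


# Hypothetical example: a 200 x 400 pixel box centered at (300, 250).
joints = [(300.0, 250.0), (200.0, 50.0), (400.0, 450.0)]
normalized = normalize_pose(joints, (300.0, 250.0), 200.0, 400.0)
restored = denormalize_pose(normalized, (300.0, 250.0), 200.0, 400.0)
```

Under this assumption, joints inside the box fall in the range [-0.5, 0.5] on both axes regardless of where the person appears in the frame or how large they are.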

Preferably, the learning module may use a lightweight cross-platform neural network inference library that supports inference on both GPU and CPU.

Preferably, the learning module uses Barracuda as the lightweight cross-platform neural network inference library, and may convert the learning algorithm into the ONNX format so that a neural network model from an external framework can be applied in a manner compatible with Barracuda.

Preferably, the character module may be configured with a Kalman filter that smooths the motion of the character and of the skeleton model whose joint information has been recognized through the learning module.

In addition, the present invention provides a deep learning-based metaverse system for real-time animation comprising: a video capture device that captures a user's motion in real time; an input module that is linked to the video capture device and receives a video file in which the user's motion is captured; a 3D engine module that implements a virtual character; a learning module that labels the user's joint information in the video file and estimates the user's posture through deep learning, in which a deep neural network is built from residual blocks, where the input value (x) is connected directly to the output F(x) obtained by passing the input value (x) through multiple weight (w) layers, and which is trained so that the output value of [F(x)+x] is minimized; and a character module that generates a skeleton model from the results estimated by the learning module and applies the motion of the skeleton model to the virtual character constructed through the 3D engine module. Another feature is that, when the user's real-time motion is input from the video capture device, the character module tracks and outputs the user's real-time motion using the joint positions and posture estimated by the trained learning module.

According to the present invention, a user's free movement can be reproduced as an avatar in metaverse content using only a webcam or video data, without a separate device and free from restrictions on the variety of poses. This improves accessibility and immersion and makes it possible to produce and use a wide range of metaverse-type content.

In addition, in recognizing user motion from video data containing the whole body, the present invention reproduced 24 items of skeleton information at an average recognition rate of 95.6% while maintaining 22 to 23 fps, so that posture can be estimated at nearly the same speed as the original video.

In addition, considering that in a conventional CNN structure learning performance degrades beyond a certain depth when the network is stacked deeply, the present invention provides the learning module in a residual block form. This overcomes the limitation that learning results deteriorate as layers become deeper, so learning performance remains excellent even with deep layers.

Figure 1 is a configuration diagram of a deep learning-based posture estimation system according to an embodiment of the present invention.
Figure 2 schematically illustrates the operation of the deep learning-based posture estimation system.
Figure 3 shows a plain CNN network (a) and the residual block structure (b) of the learning module according to this embodiment.
Figure 4 shows the network structure of the learning module of the deep learning-based posture estimation system.
Figure 5 shows posture estimation results of the deep learning-based posture estimation system.
Figure 6 shows an experimental example comparing the deep learning-based posture estimation system according to an embodiment of the present invention with OpenPose, a conventional posture estimation system.

Hereinafter, the present invention is described in detail with reference to the accompanying drawings. However, the present invention is not restricted or limited by the exemplary embodiments. The same reference numerals in the drawings denote members that perform substantially the same function.

The objects and effects of the present invention will be naturally understood or made clearer by the following description, and they are not limited to what is described below. In addition, where a detailed description of known technology related to the present invention is judged to obscure the gist of the invention unnecessarily, that description is omitted.

Figure 1 is a configuration diagram of a deep learning-based posture estimation system according to an embodiment of the present invention. Figure 2 schematically illustrates the operation of the deep learning-based posture estimation system.

Referring to Figures 1 and 2, the deep learning-based posture estimation system (1) may include an input module (10), a 3D engine module (20), a learning module (30), a character module (50), and an output module (70).

The deep learning-based posture estimation system (1) according to this embodiment may be provided in the form of a program or application stored on a medium, which uses computer resources and runs in conjunction with an external video capture device.

The input module (10) receives a video file in which the user's motion is captured. The input module (10) may be linked to a video capture device that captures the user's image in real time. In this embodiment, the input module (10) may be linked to ordinary camera equipment, including a webcam, which is a general video capture device capable of capturing the user in real time.

The input module (10) may receive information captured in real time in conjunction with the video capture device, and may also receive data in the form of a previously captured video file or image file. For initial training, the input module (10) receives previously captured video files or image data and transmits them to the learning module (30). In this embodiment, the input module (10) received image data from the MPII Human Pose Dataset, a representative dataset used in Human Pose Estimation research, consisting of about 25,000 images containing more than 40,000 people together with annotation information for 16 keypoints, and posture estimation training was performed through the learning module (30) described below.

The 3D engine module (20) can implement a virtual character. The 3D engine module (20) may be provided as the Unity (Unity-Chan) 3D engine. The Unity (Unity-Chan) 3D engine is a game engine provided by Unity that can be applied not only to the PC platform but also to mobile platforms such as iPhone OS (iOS) and Android and to game consoles such as the PS3 and Xbox, and it is an engine platform on which the design and modeling of 3D characters can be built easily. The widely used Unity engine may be applied as the 3D engine module (20) in this embodiment, and in another embodiment the module may be linked with another engine such as Unreal Engine.

The learning module (30) can label the user's joint information in the video file and estimate the user's posture through deep learning. In the learning module (30), a deep neural network is built from residual blocks, in which the input value (x) is connected directly to the output F(x) obtained by passing the input value (x) through multiple weight (w) layers, and the network can be trained so that the output value of [F(x)+x] is minimized.

Figure 3 shows a plain CNN network (a) and the residual block structure (b) of the learning module (30) according to this embodiment. Figure 3(a) shows the conventional CNN structure (plain CNN). The conventional CNN structure has the limitation that, when the network is simply stacked deeper, beyond a certain depth training does not proceed properly or slows down and produces poor results. Figure 3(b) shows the residual block structure of the learning module (30) according to this embodiment. The residual block differs in that a shortcut is added that adds the input value to the output value. In Figure 3(a), the conventional CNN passes the input value x through two weight layers to obtain H(x) as the output value. Referring to Figure 3(b), the residual block has a shortcut connection that links the input value x directly to the output, so that F(x)+x is the resulting value to be minimized. Here, the input value x is a fixed value that does not change. This residual block structure overcomes the limitation that learning results worsen as the layers become deeper, enabling efficient learning even with deep layers.
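The forward computation of a residual block with two weight layers can be sketched as follows. This is only an illustration of the F(x)+x structure described for Figure 3(b), assuming fully connected layers, a ReLU activation, and hand-picked weights; the patent does not specify the layer types or sizes, and all names here are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def matvec(w: Matrix, v: Vector) -> Vector:
    """Multiply matrix w by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def residual_branch(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """F(x): the input passed through two weight layers."""
    return matvec(w2, relu(matvec(w1, x)))


def residual_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """Block output ReLU(F(x) + x); the shortcut adds x unchanged."""
    fx = residual_branch(x, w1, w2)
    return relu([f + xi for f, xi in zip(fx, x)])


x = [1.0, 2.0]
zero = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
half = [[0.5, 0.0], [0.0, 0.5]]

passthrough = residual_block(x, zero, zero)   # F(x) = 0, so the output is x
adjusted = residual_block(x, identity, half)  # F(x) = [0.5, 1.0]
```

With all weights at zero the block reduces to the identity for non-negative inputs, which is one way to see why adding more such blocks need not make the output worse than a shallower network.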

The learning module (30) performs learning by dividing the input data, which is the video file, into three channels corresponding to the head, upper body, and lower body, and may label 24 items of joint information as shown in [Table 1] below.

[Table 1]

In this embodiment, the learning module (30) uses video data with an input image size of 448x448. The learning module (30) preprocesses the input data by dividing it into three channels, Input 1, Input 4, and Input 7. The same image is assigned internally to the head, upper body, and lower body regions so that learning can proceed appropriately for each region.

To extract the user's skeleton data, the learning module (30) defines the 24 items of joint information in [Table 1] as labeling data, learns them, tracks the posture estimation results, and can output the trained information in heatmap form. Once the heatmap information is formed in the heatmap cells, as if sketched there, the character module (50) described later takes the resulting 24 items of joint information and, in conjunction with the 3D game engine (20), binds them to the virtual character to form a skeleton.

Figure 4 shows the network structure of the learning module (30) of the deep learning-based posture estimation system. Referring to Figure 4, the channels Input 1, Input 4, and Input 7 are combined sequentially for CNN training, and the residual block structure, in which shortcuts are formed every certain number of layers in the CNN, can be seen.

The learning module (30) passes the input data, which is the video file, through the deep neural network and then estimates predicted values for k joints, two values per joint for the (x, y) coordinates, and can learn them as a vector of 2k dimensions in total. Expressed as a formula, this is [Relational Expression 1].

[Relational Expression 1]
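The expression image is not reproduced in this text. As a minimal sketch of the 2k-dimensional representation described above, and assuming the coordinates are interleaved as [x_1, y_1, ..., x_k, y_k] (the ordering is an assumption, as are the function names), the packing and unpacking could be written as:

```python
from typing import List, Tuple

Point = Tuple[float, float]


def to_pose_vector(joints: List[Point]) -> List[float]:
    """Flatten k (x, y) joints into one vector of length 2k."""
    vec: List[float] = []
    for x, y in joints:
        vec.extend([float(x), float(y)])
    return vec


def from_pose_vector(vec: List[float]) -> List[Point]:
    """Recover the k (x, y) joints from a vector of length 2k."""
    if len(vec) % 2 != 0:
        raise ValueError("pose vector length must be even")
    return [(vec[i], vec[i + 1]) for i in range(0, len(vec), 2)]


# With the k = 24 labeled joints of the embodiment, the vector has 48 entries.
k = 24
joints = [(float(i), float(2 * i)) for i in range(k)]  # placeholder coordinates
pose = to_pose_vector(joints)
```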

The learning module (30) can normalize the ground truth pose vector, which is the correct-answer class, using the position, width, and height of a bounding box for the user's body. In this embodiment, the learning module (30) may use a 3D heatmap cube so that the extracted heatmap information can be extended from a 2D form, which yields the (x, y) values of the keypoints, to a 3D form that also yields (x, y, z) coordinate values. In addition to the 24 labeled joints, the positions of four further joints required for recognition, namely the hip, head, neck, and spine, are extracted through analysis based on the structure of the human body.
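The patent does not state how coordinates are read out of the 3D heatmap cube. One common choice, used here purely as an assumption, is to take the cell with the highest score for each joint. The sketch below does this for a single joint, with the cube stored as nested lists indexed cube[z][y][x]; the layout and names are hypothetical.

```python
from typing import List, Tuple

Cube = List[List[List[float]]]


def argmax_3d(cube: Cube) -> Tuple[Tuple[int, int, int], float]:
    """Return the (x, y, z) index of the highest-scoring cell and its score."""
    best_index = (0, 0, 0)
    best_value = float("-inf")
    for z, plane in enumerate(cube):
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                if value > best_value:
                    best_value = value
                    best_index = (x, y, z)
    return best_index, best_value


# Hypothetical cube: depth 2, height 3, width 4, with a single peak.
depth, height, width = 2, 3, 4
cube = [[[0.0 for _ in range(width)] for _ in range(height)]
        for _ in range(depth)]
cube[1][2][3] = 0.9

joint_xyz, score = argmax_3d(cube)
```

Cell indices obtained this way would still need to be scaled back to image or world units before being applied to a character rig.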

The learning module (30) can normalize the ground truth pose vector based on [Relational Expression 2] below.

[Relational Expression 2]

The learning module (30) may use a lightweight cross-platform neural network inference library that supports inference on both GPU and CPU.

The learning module (30) uses Barracuda as the lightweight cross-platform neural network inference library, and may convert the learning algorithm into the ONNX (Open Neural Network Exchange) format so that a neural network model from an external framework can be applied in a manner compatible with Barracuda. ONNX is open-source software for deep learning support that allows deep learning models trained in PyTorch, TensorFlow, and similar frameworks to be called and used in real time from a variety of applications.

The learning module (30) basically uses a top-down approach to motion recognition. In the course of recognizing motion, it identifies skeleton information and converts it into 3D information in real time through the 3D heatmap cube. Compared with bottom-up approaches such as OpenPose, this allows a faster implementation and makes it possible to reproduce the user's motion identically through the avatar in real time.

The character module (50) generates a skeleton model from the results estimated by the learning module (30) and can apply the motion of the skeleton model to the virtual character constructed through the 3D engine module (20). The character module (50) may be configured with a Kalman filter that smooths the motion of the character and of the skeleton model whose joint information has been recognized through the learning module (30).
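The patent does not give the Kalman filter's state model or noise settings. As a minimal sketch, assuming a scalar constant-position model applied independently to each joint coordinate, with hypothetical variance values, the smoothing step could look like this:

```python
from typing import List, Optional


class ScalarKalman:
    """One-dimensional Kalman filter with a constant-position model."""

    def __init__(self, process_var: float = 1e-3,
                 measurement_var: float = 1e-1) -> None:
        self.q = process_var      # how much the true value may drift per frame
        self.r = measurement_var  # how noisy each measurement is
        self.x: Optional[float] = None  # current estimate
        self.p = 1.0                    # variance of the estimate

    def update(self, z: float) -> float:
        """Fold one new measurement z into the estimate and return it."""
        if self.x is None:
            self.x = z            # initialize from the first measurement
            return self.x
        self.p += self.q                  # predict: uncertainty grows
        gain = self.p / (self.p + self.r)  # Kalman gain
        self.x += gain * (z - self.x)      # correct toward the measurement
        self.p *= (1.0 - gain)             # uncertainty shrinks
        return self.x


def smooth(values: List[float]) -> List[float]:
    """Filter one coordinate of one joint across consecutive frames."""
    kf = ScalarKalman()
    return [kf.update(v) for v in values]


# Hypothetical jittery x-coordinate of a single joint over five frames.
raw = [1.0, 1.2, 0.8, 1.1, 0.9]
smoothed = smooth(raw)
```

A larger measurement variance relative to the process variance gives smoother but more delayed motion, so the two values would need tuning against the frame rate of the capture device.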

Figure 5 shows posture estimation results of the deep learning-based posture estimation system (1).

When the user's real-time motion is input from the video capture device, the output module (70) can output the user's real-time motion as tracked by the character module (50) using the joint positions and posture estimated by the trained learning module (30).

Experimental example: performance evaluation

The deep learning-based posture estimation system 1 according to this embodiment was implemented in the following environment. Python 3.9.7 was used with PyTorch 1.10.1 as the deep learning library, on a Windows 10 (x64) PC equipped with a CPU (AMD Ryzen 7 4700U 2.00 GHz, Intel Core i7-11370H 3.1 GHz), 16 GB of memory, and an NVIDIA RTX 3060 graphics card with 6 GB of GDDR6 memory. The Unity engine version was 2019.4.21.f1.

Figure 6 shows an experimental example comparing the deep learning-based posture estimation system according to an embodiment of the present invention with OpenPose, a conventional posture estimation system.

To evaluate the performance of the deep learning-based posture estimation system 1 according to this embodiment, it was compared with OpenPose on the same dataset in the same environment. The comparison covered accuracy and speed. Because the number of joints recognized by the learning model (30) applied in this experimental example differs from the number recognized by OpenPose, accuracy was calculated by matching the 24 recognized joint points. The test used data extracted from the MPII Human Pose Dataset shown in Figure 6, limited to images in which a single person appears with the whole body included. The average recognition accuracy of the learning model according to this embodiment was measured at about 91%, against about 72% for OpenPose, and the average speed was 22 fps for the learning model according to this embodiment against 11 fps for OpenPose. Judged only on this single-person, whole-body posture estimation dataset, the learning model according to this embodiment outperformed OpenPose in both accuracy and speed.
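The idea of scoring two systems that recognize different joint sets only on the joints they share can be sketched as below. The text does not name the accuracy metric or any distance threshold, so the threshold-based hit rate used here (similar in spirit to PCK), the joint names, and the function name are assumptions for illustration, not the evaluation procedure actually used.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def matched_accuracy(pred: Dict[str, Point], truth: Dict[str, Point], threshold: float) -> float:
    """Fraction of shared joints predicted within `threshold` of the ground truth.

    Joints present in only one of the two dictionaries are ignored, so models
    with different joint sets are compared on the same matched points.
    """
    shared = sorted(set(pred) & set(truth))
    if not shared:
        raise ValueError("no joints in common")
    hits = sum(1 for name in shared if math.dist(pred[name], truth[name]) <= threshold)
    return hits / len(shared)


# Hypothetical joint names and pixel coordinates.
predicted = {"head": (0.0, 0.0), "neck": (10.0, 10.0), "extra": (5.0, 5.0)}
ground_truth = {"head": (0.0, 1.0), "neck": (20.0, 20.0)}
score = matched_accuracy(predicted, ground_truth, threshold=2.0)
```

Averaging this score over all test images gives one number per system, which is how a figure such as an average recognition accuracy could be obtained under these assumptions.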

Although the present invention has been described in detail above through representative embodiments, those of ordinary skill in the art to which the present invention pertains will understand that various modifications can be made to the described embodiments without departing from the scope of the present invention. The scope of rights of the present invention should therefore not be limited to the described embodiments, but should be determined by the claims set out below and by all changes or modifications derived from the claims and their equivalents.

1: Deep learning-based posture estimation system
10: Input module
20: 3D engine module
30: Learning module
50: Character module
70: Output module

Claims

A deep learning-based posture estimation system comprising:
an input module that receives a video file in which a user's movements are captured;
a 3D engine module that implements a virtual character;
a learning module that labels the user's joint information in the video file and estimates the user's posture through deep learning; and
a character module that generates a skeleton model using the results estimated by the learning module and applies the motion of the skeleton model to the virtual character constructed through the 3D engine module,
wherein the learning module
has a deep neural network composed of residual blocks in which the input value (x) is connected directly to the output F(x) obtained by passing the input value (x) through multiple weight (w) layers, and is configured to learn so that the output value of [F(x)+x] is minimized, and
performs learning by dividing the input data, which is the video file, into three channels for the head, upper body, and lower body, and labels 24 items of joint information as shown in [Table 1] below.
[Table 1]

The deep learning-based posture estimation system according to claim 1,
wherein the input module
is linked to a video recording device that captures video of the user in real time, and
the system further comprises an output module that, when the user's real-time motion is input from the video recording device, outputs the result of the character module tracking the user's real-time motion based on the user's joint positions and posture estimates learned by the learning module.

The deep learning-based posture estimation system according to claim 1,
wherein the 3D engine module
is the Unity 3D engine.

(Deleted)

The deep learning-based posture estimation system according to claim 1,
wherein the learning module
passes the input data, which is the video file, through the deep neural network, estimates predicted values for k joints with two values per joint for the (x, y) coordinates, and learns them as a vector of 2k dimensions in total.

The deep learning-based posture estimation system according to claim 1,
wherein the learning module
normalizes the ground truth pose vector, which is the correct-answer class, by the position, width, and height of the bounding box for the user's body.

The deep learning-based posture estimation system according to claim 6,
wherein the learning module
normalizes the ground truth pose vector based on [Equation 1] below.
[Equation 1]

where y is the ground truth pose vector, y_i contains the x and y coordinates of the i-th joint, a labeled image is represented by (x, y), b is the bounding box for the body, b_c is the center of the box, b_w is the width of the box, and b_h is the height of the box.
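Given the symbol definitions above, one normalization consistent with them translates each joint by the box center and scales it by the box width and height. It is offered as an assumption based on those definitions, not as the claimed formula, and the operator name N is introduced here only for illustration.

```latex
N(y_i; b) =
\begin{pmatrix}
1/b_w & 0 \\
0 & 1/b_h
\end{pmatrix}
\left( y_i - b_c \right)
```

Under this form, a joint at the box center maps to (0, 0), and a joint on the right edge of the box maps to an x-coordinate of 1/2.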

The deep learning-based posture estimation system according to claim 1,
wherein the learning module
uses a lightweight, cross-platform neural network inference library that supports inference on both GPU and CPU.

The deep learning-based posture estimation system according to claim 8,
wherein the learning module
uses Barracuda as the lightweight, cross-platform neural network inference library, and converts the learning algorithm to the ONNX (Open Neural Network Exchange) format so that, compatibly with Barracuda, a neural network model from an external framework can be applied.

The deep learning-based posture estimation system according to claim 1,
wherein the character module
has a Kalman filter set to smooth the motion of the character and of the skeleton model whose joint information has been recognized by the learning module.

A deep learning-based metaverse system for real-time animation, comprising:
a video recording device that captures a user's movements in real time;
an input module that is linked to the video recording device and receives a video file in which the user's movements are captured;
a 3D engine module that implements a virtual character;
a learning module that labels the user's joint information in the video file and estimates the user's posture through deep learning, the learning module having a deep neural network composed of residual blocks in which the input value (x) is connected directly to the output F(x) obtained by passing the input value (x) through multiple weight (w) layers, and learning so that the output value of [F(x)+x] is minimized; and
a character module that generates a skeleton model using the results estimated by the learning module and applies the motion of the skeleton model to the virtual character constructed through the 3D engine module,
wherein, when the user's real-time motion is input from the video recording device, the character module tracks and outputs the user's real-time motion based on the user's joint positions and posture estimates learned by the learning module, and
wherein the learning module
performs learning by dividing the input data, which is the video file, into three channels for the head, upper body, and lower body, and labels 24 items of joint information as shown in [Table 1] below.
[Table 1]