KR102649818B1

KR102649818B1 - Apparatus and method for generating 3D lip sync video

Info

Publication number: KR102649818B1
Application number: KR1020220064510A
Authority: KR
Inventors: 채경수; 김두현; 곽희태; 조혜진; 이기혁
Original assignee: 주식회사 딥브레인에이아이
Priority date: 2022-05-26
Filing date: 2022-05-26
Publication date: 2024-03-21
Also published as: KR20230164854A; WO2023229091A1

Abstract

An apparatus and method for generating a 3D lip-sync video are disclosed. A 3D lip-sync video generating apparatus according to an embodiment includes: a voice converter that generates speech audio from input text; and a 3D lip-sync video generation model that generates a 3D lip-sync video in which a 3D model of a person speaks, based on the generated speech audio, a 2D video in which the person is filmed speaking, and 3D data obtained from the person speaking.

Description

Apparatus and method for generating 3D lip sync video

The present disclosure relates to an apparatus and method for generating a 3D lip-sync video.

Recently, with technological advances in the field of artificial intelligence, various types of content are being generated using artificial intelligence technology. For example, when there is a voice message to be conveyed, one may wish to attract attention by generating multimedia in which a famous person (for example, a president) appears to speak that message. This is implemented by generating mouth shapes and the like that match the specific message, so that the famous person in a video appears to be speaking it.

To this end, the conventional approach first generates speech-related landmarks or keypoints from an existing speech video, trains a model on them, and then uses the trained model to synthesize a video that matches the input speech.

However, this prior art necessarily requires extracting keypoints for training and transforming them to, and back from, a standard space (a position facing forward from the center of the screen). It also requires both a keypoint synthesis step and a video synthesis step, so the procedure is complicated.

Meanwhile, methods that do not use keypoints crop only the face, align its size and position, and then synthesize a video that matches the input speech. Because this does not reflect the natural movement of the person, the result looks unnatural.

Korean Registered Patent Publication No. 10-1177408 (2012.08.27)

An object is to provide an apparatus and method for generating a 3D lip-sync video.

A 3D lip-sync video generating apparatus according to one aspect may include: a voice converter that generates speech audio from input text; and a 3D lip-sync video generation model that generates a 3D lip-sync video in which a 3D model of a person speaks, based on the generated speech audio, a 2D video in which the person is filmed speaking, and 3D data obtained from the person speaking.

The 2D video may include the upper body of the person, with the speech-related part covered by a mask.

The 3D lip-sync video generation model may generate the 3D lip-sync video by restoring the part covered by the mask so that it corresponds to the speech audio.

The 3D data may include first 3D data representing the geometric structure of the person and second 3D data representing contraction and expansion of the skin, movement of the uvula, and movement of the shoulder muscles.

The first 3D data may include information about at least one of vertices, polygons, and a mesh, and the second 3D data may include information about the positions and movements of a plurality of markers placed on the person.

The 3D lip-sync video generation model may include: a first encoder that extracts a first feature vector from the 2D video; a second encoder that extracts a second feature vector from the 3D data; a third encoder that extracts a third feature vector from the speech audio; a combination unit that generates a combination vector by combining the first feature vector, the second feature vector, and the third feature vector; and a decoder that generates the 3D lip-sync video from the combination vector.

The 3D lip-sync video generation model may include: a first encoder that extracts a first feature vector from the 2D video and the 3D data; a second encoder that extracts a second feature vector from the speech audio; a combination unit that generates a combination vector by combining the first feature vector and the second feature vector; and a decoder that generates the 3D lip-sync video from the combination vector.

The 3D data may include first 3D data representing the geometric structure of the person and second 3D data representing contraction and expansion of the skin, movement of the uvula, and movement of the shoulder muscles, and the 3D lip-sync video generation model may include: a first encoder that extracts a first feature vector from the 2D video; a second encoder that extracts a second feature vector from the first 3D data; a third encoder that extracts a third feature vector from the second 3D data; a fourth encoder that extracts a fourth feature vector from the speech audio; a combination unit that generates a combination vector by combining the first, second, third, and fourth feature vectors; and a decoder that generates the 3D lip-sync video from the combination vector.

The 3D lip-sync video generating apparatus may further include a sync unit that generates 3D multimedia by synchronizing the speech audio with the 3D lip-sync video.

A 3D lip-sync video generating method performed by a computing device according to another aspect may include: generating speech audio from input text; and generating a 3D lip-sync video in which a 3D model of a person speaks, based on the generated speech audio, a 2D video in which the person is filmed speaking, and 3D data obtained from the person speaking.

The generating of the 3D lip-sync video may include generating the 3D lip-sync video by restoring the part covered by the mask so that it corresponds to the speech audio.

The generating of the 3D lip-sync video may include: extracting a first feature vector from the 2D video; extracting a second feature vector from the 3D data; extracting a third feature vector from the speech audio; generating a combination vector by combining the first feature vector, the second feature vector, and the third feature vector; and generating the 3D lip-sync video from the combination vector.

The generating of the 3D lip-sync video may include: extracting a first feature vector from the 2D video and the 3D data; extracting a second feature vector from the speech audio; generating a combination vector by combining the first feature vector and the second feature vector; and generating the 3D lip-sync video from the combination vector.

The 3D data may include first 3D data representing the geometric structure of the person and second 3D data representing contraction and expansion of the skin, movement of the uvula, and movement of the shoulder muscles, and the generating of the 3D lip-sync video may include: extracting a first feature vector from the 2D video; extracting a second feature vector from the first 3D data; extracting a third feature vector from the second 3D data; extracting a fourth feature vector from the speech audio; generating a combination vector by combining the first, second, third, and fourth feature vectors; and generating the 3D lip-sync video from the combination vector.

The 3D lip-sync video generating method may further include generating 3D multimedia by synchronizing the speech audio with the 3D lip-sync video.

A 3D lip-sync video generating apparatus according to an exemplary embodiment can generate a 3D lip-sync video in which a 3D model of the person in a 2D video speaks, based on speech audio that is unrelated to the 2D video.

In addition, by using a 2D video in which the speech-related part is masked together with 3D data obtained from the person speaking, the apparatus can generate a 3D lip-sync video that reflects not only the face, neck, and shoulder movements that appear when the person speaks, but also gestures or characteristics unique to that person, such as contraction and expansion of the skin, movement of the uvula, and movement of the shoulder muscles. As a result, a more natural 3D lip-sync video can be generated.

Furthermore, because the masked speech-related part of the 2D video is restored from the speech audio, a 3D lip-sync video can be generated by a single neural network model without a separate keypoint prediction process.

FIG. 1 is a block diagram illustrating a 3D lip-sync video generating apparatus according to an exemplary embodiment.
FIG. 2 is an example of a 2D video in which a person is filmed speaking.
FIG. 3 is a diagram illustrating a 3D lip-sync video generation model according to an embodiment.
FIG. 4 is a diagram illustrating a 3D lip-sync video generation model according to another embodiment.
FIG. 5 is a diagram illustrating a 3D lip-sync video generation model according to yet another embodiment.
FIG. 6 is a diagram illustrating a 3D lip-sync video generating method according to an exemplary embodiment.
FIG. 7 is a diagram illustrating a 3D lip-sync video generation process 620 according to an embodiment.
FIG. 8 is a diagram illustrating a 3D lip-sync video generation process 620 according to another embodiment.
FIG. 9 is a diagram illustrating a 3D lip-sync video generation process 620 according to yet another embodiment.
FIG. 10 is a block diagram illustrating a computing environment including a computing device suitable for use in exemplary embodiments.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings. In assigning reference numerals to the components in each drawing, identical components are given the same numerals wherever possible, even when they appear in different drawings. In addition, where a detailed description of a related known function or configuration could unnecessarily obscure the gist of the present invention, that description is omitted.

Unless the context clearly specifies a particular order, the steps may occur in an order different from the one stated. That is, the steps may be performed in the stated order, substantially simultaneously, or in reverse order.

The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or custom of a user or operator. Their definitions should therefore be based on the content of this specification as a whole.

Terms such as "first" and "second" may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another. A singular expression includes the plural unless the context clearly indicates otherwise, and terms such as "include" or "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification; they do not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In addition, the components in this specification are divided merely according to the main function each one performs. That is, two or more components may be combined into one, or one component may be divided into two or more components with more finely divided functions. Each component may, in addition to its own main function, perform some or all of the functions of other components, and some of the main functions of a component may be performed entirely by another component. Each component may be implemented in hardware, in software, or in a combination of hardware and software.

FIG. 1 is a block diagram illustrating a 3D lip-sync video generating apparatus according to an exemplary embodiment, and FIG. 2 is an example of a 2D video in which a person is filmed speaking.

Referring to FIG. 1, a 3D lip-sync video generating apparatus 100 according to an exemplary embodiment may include a 3D lip-sync video generation model 110.

The 3D lip-sync video generation model 110 may generate a 3D lip-sync video in which a 3D model of a person speaks, based on a 2D video in which the person is filmed speaking, 3D data obtained from the person speaking, and speech audio.

In the 2D video in which the person is filmed speaking, the speech-related part may be masked. For example, as shown in FIG. 2, the speech-related part of the 2D video, such as the mouth and the area around the mouth, may be covered by a mask 210.
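The masking described above can be illustrated with a minimal sketch. The patent does not specify how the mask is applied; the rectangular region, the fill value, the grayscale list-of-lists frame layout, and the name `mask_region` are all assumptions made for illustration.

```python
from typing import List

# Assumed layout: a grayscale frame as rows of integer pixel intensities.
Frame = List[List[int]]


def mask_region(frame: Frame, top: int, left: int,
                height: int, width: int, fill: int = 0) -> Frame:
    """Return a copy of `frame` with a rectangle overwritten by `fill`.

    The rectangle stands in for the speech-related part (mouth and
    surrounding area); its coordinates are hypothetical inputs.
    """
    masked = [row[:] for row in frame]  # copy so the input frame is untouched
    for y in range(max(top, 0), min(top + height, len(masked))):
        row = masked[y]
        for x in range(max(left, 0), min(left + width, len(row))):
            row[x] = fill
    return masked


# A 6x6 toy frame whose pixel at (y, x) holds 10*y + x.
frame = [[10 * y + x for x in range(6)] for y in range(6)]
masked = mask_region(frame, top=3, left=1, height=2, width=4)
```

Only the pixels inside the rectangle change, so everything outside the mask remains available to the model as context.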

As shown in FIG. 2, a plurality of markers 220 may be placed on the person. For example, the markers 220 may be placed on the face, neck, and/or shoulders, and, as described below, may be used to represent contraction and expansion of the skin, movement of the uvula, movement of the shoulder muscles, and the like.

The 3D data may include first 3D data representing the geometric structure of the person and second 3D data representing contraction and expansion of the skin, movement of the uvula, and movement of the shoulder muscles.

According to an exemplary embodiment, the first 3D data may include information about vertices, polygons, and/or a mesh. The first 3D data may be acquired using a depth camera and/or an infrared camera. For example, the first 3D data may be obtained with a machine learning model from depth information of the person acquired with a low-resolution depth camera (at or below a first resolution) and an infrared image of the person acquired with a high-resolution infrared camera (at or above a second resolution). This is only one embodiment, however, and the disclosure is not limited to it; the first 3D data may be obtained using any of various known techniques.

According to an exemplary embodiment, the second 3D data may include information about the positions and/or movements of the markers (220 in FIG. 2) placed on the person. The positions and/or movements of the markers 220 placed on the face, neck, and/or shoulders can represent contraction and expansion of the skin, movement of the uvula, movement of the shoulder muscles, and the like.
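One way to picture the two kinds of 3D data is as simple containers. The sketch below is an assumption for illustration only: the class names `GeometryData` and `MarkerFrame`, the marker names, and the per-frame displacement computation are hypothetical and do not come from the patent, which does not define a data layout.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point3 = Tuple[float, float, float]


@dataclass
class GeometryData:
    """Stand-in for the first 3D data: vertices plus polygons over them."""
    vertices: List[Point3]
    polygons: List[Tuple[int, ...]]  # each polygon lists indices into `vertices`


@dataclass
class MarkerFrame:
    """Stand-in for one time step of the second 3D data: marker positions."""
    positions: Dict[str, Point3]


def marker_displacements(prev: MarkerFrame, curr: MarkerFrame) -> Dict[str, Point3]:
    """Per-marker movement between two frames, for markers present in both."""
    moves: Dict[str, Point3] = {}
    for name, (x1, y1, z1) in curr.positions.items():
        if name in prev.positions:
            x0, y0, z0 = prev.positions[name]
            moves[name] = (x1 - x0, y1 - y0, z1 - z0)
    return moves


# A single triangle as toy geometry.
geometry = GeometryData(
    vertices=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    polygons=[(0, 1, 2)],
)
prev = MarkerFrame({"neck_1": (0.0, 1.0, 2.0)})
curr = MarkerFrame({"neck_1": (0.5, 1.0, 1.5), "shoulder_1": (1.0, 1.0, 1.0)})
moves = marker_displacements(prev, curr)
```

Here the geometry is static structure, while the marker frames carry the time-varying position and movement information.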

The speech audio may be unrelated to the 2D video in which the person is filmed speaking. For example, the speech audio may be audio spoken by a person other than the person in the video, audio spoken by the person in the video in a background or situation unrelated to that of the video, or audio generated from text through a text-to-speech (TTS) technique.

The 3D lip-sync video generation model 110 may be trained in advance so that, based on the 2D video in which the person is filmed speaking, the 3D data obtained from the person speaking, and the speech audio, it restores the speech-related part covered by the mask (210 in FIG. 2) and generates a 3D lip-sync video in which a 3D model of the person in the 2D video speaks.

For example, a 2D video in which a person is filmed speaking, the speech audio of that person recorded together with the 2D video, and 3D data obtained from that person speaking may be acquired as training data, and a 3D video in which that person is filmed speaking may be acquired as target data. In this case, a plurality of markers may be placed on the person, and the speech-related part of the 2D video may be masked.

The 3D lip-sync video generation model 110 may be trained in advance by adjusting its learning parameters (for example, the weights and biases of each layer) so that the 3D lip-sync video generated from the training data approaches the target 3D video, that is, so that the difference between the 3D lip-sync video and the 3D video is minimized.
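The idea of adjusting weights and biases so that the output approaches the target can be shown on a toy problem. The patent does not name a loss function or an optimizer; mean squared error, plain gradient descent, the one-weight linear model, and the names `mse` and `train_step` are assumptions chosen only to make the principle concrete.

```python
from typing import List, Tuple


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared difference between generated and target values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def train_step(w: float, b: float, xs: List[float],
               targets: List[float], lr: float) -> Tuple[float, float]:
    """One gradient-descent update of a weight and a bias for y = w*x + b."""
    n = len(xs)
    preds = [w * x + b for x in xs]
    grad_w = sum(2 * (p - t) * x for p, t, x in zip(preds, targets, xs)) / n
    grad_b = sum(2 * (p - t) for p, t in zip(preds, targets)) / n
    return w - lr * grad_w, b - lr * grad_b


xs = [0.0, 1.0, 2.0, 3.0]
targets = [1.0, 3.0, 5.0, 7.0]  # toy target data produced by w=2, b=1

w, b = 0.0, 0.0
initial_loss = mse([w * x + b for x in xs], targets)
for _ in range(500):
    w, b = train_step(w, b, xs, targets, lr=0.05)
final_loss = mse([w * x + b for x in xs], targets)
```

In the model described here the same loop would run over all layer parameters, with the generated 3D lip-sync video compared against the target 3D video instead of four scalars.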

According to an exemplary embodiment, the 3D lip-sync video generating apparatus 100 may further include a voice converter 120 and/or a sync unit 130.

The voice converter 120 may receive text as input and generate speech, that is, speech audio. The speech audio generated by the voice converter 120 may be provided as input to the 3D lip-sync video generation model 110. The voice converter 120 may use any of various known text-to-speech (TTS) techniques.

The sync unit 130 may generate 3D multimedia by synchronizing the speech audio with the 3D lip-sync video.
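The patent does not describe how the synchronization is performed. As a minimal sketch under stated assumptions, one ingredient of any such alignment is mapping each video frame to the span of audio samples it covers; the frame rate, the sample rate, and the function names below are hypothetical.

```python
from typing import Tuple


def audio_span_for_frame(frame_index: int, fps: int,
                         sample_rate: int) -> Tuple[int, int]:
    """Half-open range [start, end) of audio samples covered by one frame."""
    start = frame_index * sample_rate // fps
    end = (frame_index + 1) * sample_rate // fps
    return start, end


def frames_needed(num_samples: int, fps: int, sample_rate: int) -> int:
    """Number of video frames required to cover `num_samples` of audio."""
    return -(-num_samples * fps // sample_rate)  # ceiling division


# Assumed values: 25 frames per second and 16 kHz audio.
span_first = audio_span_for_frame(0, fps=25, sample_rate=16000)
span_tenth = audio_span_for_frame(10, fps=25, sample_rate=16000)
frames_one_second = frames_needed(16000, fps=25, sample_rate=16000)
frames_one_extra = frames_needed(16001, fps=25, sample_rate=16000)
```

Integer arithmetic keeps consecutive spans contiguous, so no audio sample is dropped or assigned to two frames.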

The 3D lip-sync video generating apparatus 100 according to an exemplary embodiment can generate a 3D lip-sync video in which a 3D model of the person in a 2D video speaks, based on speech audio that is unrelated to the 2D video. In addition, by using a 2D video in which the speech-related part is masked together with 3D data obtained from the person speaking, it can generate a 3D lip-sync video that reflects not only the face, neck, and shoulder movements that appear when the person speaks, but also gestures or characteristics unique to that person, such as contraction and expansion of the skin, movement of the uvula, and movement of the shoulder muscles, and can thereby generate a more natural 3D lip-sync video. Furthermore, because the masked speech-related part of the 2D video is restored from the speech audio, a 3D lip-sync video can be generated by a single neural network model without a separate keypoint prediction process.

Hereinafter, the 3D lip-sync video generation model 110 is described in detail with reference to FIGS. 3 to 5.

FIG. 3 is a diagram illustrating a 3D lip-sync video generation model according to an embodiment. The 3D lip-sync video generation model 110a of FIG. 3 may be one embodiment of the 3D lip-sync video generation model 110 of FIG. 1.

Referring to FIG. 3, the 3D lip-sync video generation model 110a may include a first encoder 310, a second encoder 320, a third encoder 330, a combination unit 340, and a decoder 350.

The first encoder 310 may extract a first feature vector from the 2D video in which the person is filmed speaking. Hereinafter, "vector" may be used in a sense that includes "tensor". The 2D video may be one in which the upper body of the person is filmed, so that the movements of the face, neck, shoulders, and the like that appear when the person speaks can be seen.

As described above, the speech-related part of the 2D video in which the person is filmed speaking may be masked. For example, as shown in FIG. 2, the speech-related part of the 2D video, such as the mouth and the area around the mouth, may be covered by the mask 210. In this case, the first encoder 310 may extract the first feature vector based on the part of the 2D video that is not covered by the mask 210, that is, the part other than the speech-related part (for example, the mouth and the area around the mouth).

The first encoder 310 may be a machine learning model trained to extract the first feature vector from the 2D video in which the person is filmed speaking. For example, the first encoder 310 may include one or more convolutional layers and one or more pooling layers, but is not limited thereto. A convolutional layer may extract feature values of the pixels corresponding to a filter of a preset size (for example, 3×3 pixels) while moving the filter across the input 2D video at regular intervals. A pooling layer may receive the output of a convolutional layer as input and perform downsampling.
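The filter-sliding and downsampling operations described above can be sketched in plain Python on a single grayscale frame. This is an illustration under assumptions, not the encoder itself: the all-ones 3×3 filter, the stride of 1, the 2×2 max pooling, and the function names are choices made here, and a trained encoder would learn its filter values and stack many such layers.

```python
from typing import List

Matrix = List[List[float]]


def conv2d(image: Matrix, kernel: Matrix, stride: int = 1) -> Matrix:
    """Slide `kernel` over `image` at `stride` intervals (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out: Matrix = []
    for y in range(0, len(image) - kh + 1, stride):
        row = []
        for x in range(0, len(image[0]) - kw + 1, stride):
            acc = 0.0
            for dy in range(kh):
                for dx in range(kw):
                    acc += image[y + dy][x + dx] * kernel[dy][dx]
            row.append(acc)
        out.append(row)
    return out


def max_pool2d(feature_map: Matrix, size: int = 2) -> Matrix:
    """Downsample by keeping the maximum of each size x size block."""
    out: Matrix = []
    for y in range(0, len(feature_map) - size + 1, size):
        row = []
        for x in range(0, len(feature_map[0]) - size + 1, size):
            row.append(max(feature_map[y + dy][x + dx]
                           for dy in range(size) for dx in range(size)))
        out.append(row)
    return out


# A 6x6 toy frame whose pixel at (y, x) holds 6*y + x.
image = [[float(6 * y + x) for x in range(6)] for y in range(6)]
kernel = [[1.0] * 3 for _ in range(3)]  # assumed filter: sums each 3x3 window

conv_out = conv2d(image, kernel)   # 4x4 feature map
features = max_pool2d(conv_out)    # 2x2 map after downsampling
```

The 6×6 input shrinks to 4×4 after the convolution and to 2×2 after pooling, which is the size reduction the downsampling step is meant to provide.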

The second encoder 320 may extract a second feature vector from the 3D data obtained from the person speaking. As described above, the 3D data may include first 3D data representing the geometric structure of the person and second 3D data representing contraction and expansion of the skin, movement of the uvula, and movement of the shoulder muscles.

The second encoder 320 may be a machine learning model trained to extract the second feature vector from the 3D data obtained from the person speaking. For example, the second encoder 320 may include one or more convolutional layers and one or more pooling layers, but is not limited thereto.

The third encoder 330 may extract a third feature vector from the speech audio. Here, the speech audio may be unrelated to the 2D video input to the first encoder 310. For example, the speech audio may be audio spoken by a person other than the person in the video, audio spoken by the person in the video in a background or situation unrelated to the video, or audio generated through a text-to-speech (TTS) technique.

제3 인코더(330)는 발화 오디오로부터 제3 특징 벡터를 추출하도록 학습된 머신러닝 모델일 수 있다. 예를 들어, 제3 인코더(330)는 하나 이상의 합성곱 층(Convolutional Layer) 및 하나 이상의 풀링 층(Pooling Layer)를 포함할 수 있으나 이에 한정되는 것은 아니다.The third encoder 330 may be a machine learning model learned to extract a third feature vector from speech audio. For example, the third encoder 330 may include one or more convolutional layers and one or more pooling layers, but is not limited thereto.

조합부(340)는 제1 특징 벡터, 제2 특징 벡터 및 제3 특징 벡터를 조합하여 조합 벡터를 생성할 수 있다. 예를 들어, 조합부(340)는 제1 특징 벡터, 제2 특징 벡터 및 제3 특징 벡터를 연결(Concatenate)하여 조합 벡터를 생성할 수 있으나 이에 한정되는 것은 아니다.The combination unit 340 may generate a combination vector by combining the first feature vector, the second feature vector, and the third feature vector. For example, the combination unit 340 may generate a combination vector by concatenating the first feature vector, the second feature vector, and the third feature vector, but the method is not limited to this.
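The concatenation performed by the combination unit can be sketched as follows. This is a minimal illustration that assumes each feature vector is a flat list of numbers; the function name `combine` and the sample values are hypothetical. Because it accepts any number of vectors, the same sketch would also cover embodiments that combine two or four feature vectors.

```python
from typing import List, Sequence


def combine(*feature_vectors: Sequence[float]) -> List[float]:
    """Concatenate feature vectors end to end into one combination vector."""
    combined: List[float] = []
    for vec in feature_vectors:
        combined.extend(vec)
    return combined


# Illustrative values only.
video_feat = [0.1, 0.2]        # first feature vector (2D video)
data3d_feat = [0.3, 0.4, 0.5]  # second feature vector (3D data)
audio_feat = [0.6]             # third feature vector (speech audio)

combo = combine(video_feat, data3d_feat, audio_feat)
```

The length of the combination vector is the sum of the input lengths, so the decoder's input size depends on how many encoders feed the combination unit.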

디코더(350)는 조합 벡터를 기반으로 2D 비디오 속 인물의 3D 모델이 발화하는 3D 립싱크 비디오를 생성할 수 있다. 구체적으로, 디코더(350)는 2D 비디오로부터 추출된 제1 특징 벡터, 3D 데이터로부터 추출된 제2 특징 벡터 및 발화 오디오로부터 추출된 제3 특징 벡터를 기반으로, 마스킹 처리된 부분(발화와 관련된 부분)을 발화 오디오에 대응하도록 복원하여 인물의 3D 모델이 발화하는 3D 립싱크 비디오를 생성할 수 있다. The decoder 350 may generate, based on the combination vector, a 3D lip sync video in which a 3D model of the person in the 2D video speaks. Specifically, based on the first feature vector extracted from the 2D video, the second feature vector extracted from the 3D data, and the third feature vector extracted from the speech audio, the decoder 350 may restore the masked portion (the speech-related portion) so that it corresponds to the speech audio, thereby generating a 3D lip sync video in which the 3D model of the person speaks.

디코더(350)는 조합 벡터를 기반으로 3D 립싱크 비디오를 생성하도록 학습된 머신러닝 모델일 수 있다. 예를 들어, 디코더(350)는 조합 벡터에 역 합성곱(Deconvolution)을 수행한 후 업 샘플링(Up Sampling)을 수행하여 3D 립싱크 비디오를 생성할 수 있다.The decoder 350 may be a machine learning model trained to generate a 3D lip sync video based on the combination vector. For example, the decoder 350 may generate a 3D lip sync video by performing deconvolution on the combination vector and then performing up sampling.
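The decoder's two stages can be illustrated with a small one-dimensional sketch. It assumes that "deconvolution" refers to a transposed convolution, as the term is commonly used in deep learning, and that up-sampling is nearest-neighbor repetition; both are assumptions, as are the function names and sample values. The specification does not state the kernel, stride, or interpolation method.

```python
from typing import List


def conv_transpose1d(signal: List[float], kernel: List[float], stride: int = 1) -> List[float]:
    """Transposed 1D convolution: each input value scatters a scaled kernel copy."""
    out = [0.0] * ((len(signal) - 1) * stride + len(kernel))
    for i, x in enumerate(signal):
        for j, w in enumerate(kernel):
            out[i * stride + j] += x * w
    return out


def upsample_nearest(signal: List[float], factor: int) -> List[float]:
    """Nearest-neighbor up-sampling: repeat every sample `factor` times."""
    return [x for x in signal for _ in range(factor)]


# Illustrative combination vector and kernel only.
expanded = conv_transpose1d([1.0, 2.0], [1.0, 0.5], stride=2)
decoded = upsample_nearest(expanded, 2)
```

Both stages increase the length of the representation, which is how a compact combination vector can be expanded back toward the resolution of the output video.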

도 4는 다른 실시예에 따른 3D 립싱크 비디오 생성 모델을 도시한 도면이다. 도 4의 3D 립싱크 비디오 생성 모델(110b)은 도 1의 3D 립싱크 비디오 생성 모델(110)의 다른 실시예일 수 있다.Figure 4 is a diagram illustrating a 3D lip sync video generation model according to another embodiment. The 3D lip-sync video generation model 110b of FIG. 4 may be another embodiment of the 3D lip-sync video generation model 110 of FIG. 1.

도 4를 참조하면, 다른 실시예에 따른 3D 립싱크 비디오 생성 모델(110b)은 제1 인코더(410), 제2 인코더(420), 조합부(430) 및 디코더(440)를 포함할 수 있다.Referring to FIG. 4, a 3D lip sync video generation model 110b according to another embodiment may include a first encoder 410, a second encoder 420, a combination unit 430, and a decoder 440.

제1 인코더(410)는 인물의 발화 모습이 촬영된 2D 비디오와 인물의 발화 모습으로부터 획득된 3D 데이터에서 제1 특징 벡터를 추출할 수 있다. 2D 비디오는 해당 인물의 상반신이 촬영된 2D 비디오일 수 있으며, 2D 비디오에서 발화와 관련된 부분은 마스킹 처리될 수 있다. 또한, 3D 데이터는 인물의 기하학적 구조를 표현하는 정점들(vertices), 폴리곤들(polygons) 및/또는 메쉬(mesh)에 대한 정보를 포함하는 제1 3D 데이터와, 피부의 수축과 팽창, 목젖의 움직임, 어깨 근육의 움직임을 표현하는 인물에 표시된 마커의 위치 및/또는 이동에 관한 정보를 포함하는 제2 3D 데이터를 포함할 수 있다. The first encoder 410 may extract a first feature vector from the 2D video capturing the person speaking and from the 3D data obtained from the person speaking. The 2D video may capture the upper body of the person, and the speech-related portion of the 2D video may be masked. In addition, the 3D data may include first 3D data containing information about vertices, polygons, and/or a mesh representing the geometric structure of the person, and second 3D data containing information about the position and/or movement of markers placed on the person to represent contraction and expansion of the skin, movement of the uvula, and movement of the shoulder muscles.

제2 인코더(420)는 발화 오디오로부터 제2 특징 벡터를 추출할 수 있다. 발화 오디오는 제1 인코더(410)에 입력되는 2D 비디오와는 관련 없는 것일 수 있다. The second encoder 420 may extract a second feature vector from the speech audio. The speech audio may be unrelated to the 2D video input to the first encoder 410.

조합부(430)는 제1 특징 벡터 및 제2 특징 벡터를 조합하여 조합 벡터를 생성할 수 있으며, 디코더(440)는 조합 벡터를 기반으로 2D 비디오 속 인물의 3차원 모델이 발화하는 3D 립싱크 비디오를 생성할 수 있다. The combination unit 430 may generate a combination vector by combining the first feature vector and the second feature vector, and the decoder 440 may generate, based on the combination vector, a 3D lip sync video in which a 3D model of the person in the 2D video speaks.

도 5는 또 다른 실시예에 따른 3D 립싱크 비디오 생성 모델을 도시한 도면이다. 도 5의 3D 립싱크 비디오 생성 모델(110c)는 도 1의 3D 립싱크 비디오 생성 모델(110)의 또 다른 실시예일 수 있다.Figure 5 is a diagram illustrating a 3D lip sync video generation model according to another embodiment. The 3D lip sync video generation model 110c of FIG. 5 may be another embodiment of the 3D lip sync video generation model 110 of FIG. 1.

도 5를 참조하면, 또 다른 실시예에 따른 3D 립싱크 비디오 생성 모델(110c)는 제1 인코더(510), 제2 인코더(520), 제3 인코더(530), 제4 인코더(540), 조합부(550) 및 디코더(560)를 포함할 수 있다. Referring to FIG. 5, the 3D lip sync video generation model 110c according to still another embodiment may include a first encoder 510, a second encoder 520, a third encoder 530, a fourth encoder 540, a combination unit 550, and a decoder 560.

제1 인코더(510)는 인물의 발화 모습이 촬영된 2D 비디오에서 제1 특징 벡터를 추출할 수 있다. 2D 비디오는 해당 인물의 상반신이 촬영된 2D 비디오일 수 있으며, 2D 비디오에서 발화와 관련된 부분은 마스킹 처리될 수 있다.The first encoder 510 may extract a first feature vector from a 2D video in which a person's speech is captured. The 2D video may be a 2D video in which the upper body of the person in question is filmed, and parts related to speech in the 2D video may be masked.
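Masking the speech-related portion of a frame can be sketched as follows. This is a minimal illustration that assumes a frame is a 2D grid of pixel values and that the masked portion is an axis-aligned rectangle filled with a constant; the function name `mask_region`, the box coordinates, and the fill value are hypothetical. How the speech-related region is located is not specified here.

```python
from typing import List, Tuple

Frame = List[List[float]]


def mask_region(frame: Frame, box: Tuple[int, int, int, int], fill: float = 0.0) -> Frame:
    """Return a copy of `frame` with rows [top, bottom) and columns [left, right) set to `fill`."""
    top, left, bottom, right = box
    return [
        [fill if top <= r < bottom and left <= c < right else px
         for c, px in enumerate(row)]
        for r, row in enumerate(frame)
    ]


# Illustrative 4x4 frame; the lower-middle block stands in for the mouth area.
frame = [[1.0] * 4 for _ in range(4)]
masked = mask_region(frame, (2, 1, 4, 3))
```

The original frame is left unmodified, so the unmasked version remains available, for example as a reference during training.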

제2 인코더(520)는 제1 3D 데이터에서 제2 특징 벡터를 추출할 수 있다. 제1 3D 데이터는 인물의 기하학적 구조를 표현하는 정점들(vertices), 폴리곤들(polygons) 및/또는 메쉬(mesh)에 대한 정보를 포함할 수 있다.The second encoder 520 may extract a second feature vector from the first 3D data. The first 3D data may include information about vertices, polygons, and/or mesh representing the geometric structure of the person.

제3 인코더(530)는 제2 3D 데이터에서 제3 특징 벡터를 추출할 수 있다. 제2 3D 데이터는 피부의 수축과 팽창, 목젖의 움직임, 어깨 근육의 움직임을 표현하는 인물에 표시된 마커의 위치 및/또는 이동에 관한 정보를 포함할 수 있다.The third encoder 530 may extract a third feature vector from the second 3D data. The second 3D data may include information about the position and/or movement of a marker displayed on the person representing contraction and expansion of the skin, movement of the uvula, and movement of the shoulder muscles.

제4 인코더(540)는 발화 오디오로부터 제4 특징 벡터를 추출할 수 있다. 발화 오디오는 제1 인코더(510)에 입력되는 2D 비디오와는 관련 없는 것일 수 있다.The fourth encoder 540 may extract the fourth feature vector from the speech audio. The speech audio may be unrelated to the 2D video input to the first encoder 510.

조합부(550)는 제1 특징 벡터, 제2 특징 벡터, 제3 특징 벡터 및 제4 특징 벡터를 조합하여 조합 벡터를 생성할 수 있으며, 디코더(560)는 조합 벡터를 기반으로 2D 비디오 속 인물의 3차원 모델이 발화하는 3D 립싱크 비디오를 생성할 수 있다. The combination unit 550 may generate a combination vector by combining the first, second, third, and fourth feature vectors, and the decoder 560 may generate, based on the combination vector, a 3D lip sync video in which a 3D model of the person in the 2D video speaks.

도 6은 예시적 실시예에 따른 3D 립싱크 비디오 생성 방법을 도시한 도면이다. 도 6의 3D 립싱크 비디오 생성 방법은 도 1의 3D 립싱크 비디오 생성 장치(100)에 의해 수행될 수 있다.Figure 6 is a diagram illustrating a method for generating a 3D lip-sync video according to an exemplary embodiment. The 3D lip-sync video generating method of FIG. 6 may be performed by the 3D lip-sync video generating apparatus 100 of FIG. 1.

도 6을 참조하면, 3D 립싱크 비디오 생성 장치는 텍스트를 입력받아 음성, 즉 발화 오디오를 생성할 수 있다(610). 예를 들어, 3D 립싱크 비디오 생성 장치는 공개된 다양한 텍스트 음성 변환 기법(Text to Speech, TTS)을 이용하여 텍스트로부터 발화 오디오를 생성할 수 있다. Referring to FIG. 6, the 3D lip sync video generating apparatus may receive text as input and generate voice, that is, speech audio (610). For example, the 3D lip sync video generating apparatus may generate speech audio from text using any of various publicly known text-to-speech (TTS) techniques.

3D 립싱크 비디오 생성 장치는 인물의 발화 모습이 촬영된 2D 비디오, 인물의 발화 모습으로부터 획득된 3D 데이터, 및 발화 오디오를 기반으로, 인물의 3차원 모델이 발화하는(말하는) 3D 립싱크 비디오를 생성할 수 있다(620). 여기서 2D 비디오는 해당 인물의 상반신이 촬영된 2D 비디오일 수 있으며, 2D 비디오에서 발화와 관련된 부분은 마스킹(masking) 처리될 수 있다. 또한 인물에는 피부의 수축과 팽창, 목젖의 움직임, 어깨 근육의 움직임 등을 나타내는데 이용될 수 있는 복수의 마커가 표시될 수 있다. The 3D lip sync video generating apparatus may generate a 3D lip sync video in which a 3D model of the person speaks, based on the 2D video capturing the person speaking, the 3D data obtained from the person speaking, and the speech audio (620). Here, the 2D video may capture the upper body of the person, and the speech-related portion of the 2D video may be masked. In addition, a plurality of markers that can be used to indicate contraction and expansion of the skin, movement of the uvula, movement of the shoulder muscles, and the like may be placed on the person.

예를 들어, 제1 3D 데이터는 정점들(vertices), 폴리곤들(polygons) 및/또는 메쉬(mesh)에 대한 정보를 포함할 수 있다. 제1 3D 데이터는 깊이 카메라 및/또는 적외선 카메라를 이용하여 획득될 수 있다. 예를 들어, 제1 3D 데이터는 제1 해상도 이하의 저해상도 깊이 카메라를 이용하여 획득된 인물의 깊이 정보와, 제2 해상도 이상의 고해상도 적외선 카메라를 이용하여 획득된 인물에 대한 적외선 이미지를 기반으로 머신러닝 모델을 이용하여 획득될 수 있다. 그러나 이는 일 실시예에 불과할 뿐 이에 한정되는 것은 아니다. 즉, 공개된 다양한 기법을 이용하여 제1 3D 데이터를 획득할 수 있다. For example, the first 3D data may include information about vertices, polygons, and/or a mesh. The first 3D data may be acquired using a depth camera and/or an infrared camera. For example, the first 3D data may be obtained using a machine learning model, based on depth information of the person acquired using a low-resolution depth camera with a resolution at or below a first resolution and an infrared image of the person acquired using a high-resolution infrared camera with a resolution at or above a second resolution. However, this is merely one embodiment, and the disclosure is not limited thereto. That is, the first 3D data may be obtained using any of various publicly known techniques.
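One possible in-memory representation of the first 3D data is an indexed mesh, in which polygons refer to vertices by index. The sketch below is an assumption made for illustration; the class name `Mesh`, its fields, and the validity check are hypothetical and are not prescribed by the specification.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Mesh:
    """Hypothetical container for geometric structure: vertices plus indexed polygons."""
    vertices: List[Tuple[float, float, float]]
    polygons: List[Tuple[int, ...]]  # each polygon lists indices into `vertices`

    def is_valid(self) -> bool:
        """Check that every polygon index refers to an existing vertex."""
        n = len(self.vertices)
        return all(0 <= i < n for poly in self.polygons for i in poly)


# A single triangle as the smallest possible example.
tri = Mesh(
    vertices=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    polygons=[(0, 1, 2)],
)
```

Storing polygons as index tuples lets neighboring polygons share vertices, so moving one vertex deforms every polygon that uses it.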

예를 들어, 제2 3D 데이터는 인물에 표시된 마커(도 2의 220)의 위치 및/또는 이동에 관한 정보를 포함할 수 있다. 얼굴, 목 및/또는 어깨 등에 표시된 마커(220)의 위치 및/또는 이동에 관한 정보는 피부의 수축과 팽창, 목젖의 움직임, 어깨 근육의 움직임 등을 표현할 수 있다.For example, the second 3D data may include information about the location and/or movement of the marker (220 in FIG. 2) displayed on the person. Information about the position and/or movement of the marker 220 displayed on the face, neck, and/or shoulder, etc. may express contraction and expansion of the skin, movement of the uvula, movement of shoulder muscles, etc.
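Marker movement between two frames can be derived from marker positions, as in the following minimal sketch. It assumes each marker has an identifier and a 3D position per frame; the function name `marker_displacements`, the marker names, and the coordinates are hypothetical values chosen for illustration.

```python
from typing import Dict, Tuple

Point3D = Tuple[float, float, float]


def marker_displacements(
    prev: Dict[str, Point3D], curr: Dict[str, Point3D]
) -> Dict[str, Point3D]:
    """Per-marker displacement (curr - prev) for markers present in both frames."""
    return {
        marker_id: tuple(c - p for p, c in zip(prev[marker_id], curr[marker_id]))
        for marker_id in prev
        if marker_id in curr
    }


# Illustrative marker positions in two consecutive frames.
prev_frame = {"jaw": (0.0, 0.0, 0.0), "neck": (1.0, 2.0, 3.0)}
curr_frame = {"jaw": (0.0, -0.5, 0.0), "neck": (1.0, 2.0, 3.5)}

moves = marker_displacements(prev_frame, curr_frame)
```

Markers that are missing from either frame, for example because of occlusion, are simply skipped in this sketch.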

3D 립싱크 비디오 생성 장치는 발화 오디오와 3D 립싱크 비디오의 싱크를 맞춰 3D 멀티미디어를 생성할 수 있다(630).The 3D lip sync video generating device can generate 3D multimedia by synchronizing speech audio and 3D lip sync video (630).
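The overall flow of steps 610 to 630 can be sketched as a short pipeline. This is only an illustration under stated assumptions: the three stages are passed in as callables, and the stand-in functions `fake_tts`, `fake_model`, and `fake_sync` are hypothetical placeholders that return strings rather than real audio or video.

```python
from typing import Any, Callable


def generate_3d_multimedia(
    text: str,
    video_2d: Any,
    data_3d: Any,
    tts: Callable[[str], Any],
    lip_sync_model: Callable[[Any, Any, Any], Any],
    sync: Callable[[Any, Any], Any],
) -> Any:
    """Run the three steps in order and return the synchronized result."""
    audio = tts(text)                                          # step 610: text to speech audio
    lip_sync_video = lip_sync_model(audio, video_2d, data_3d)  # step 620: 3D lip sync video
    return sync(audio, lip_sync_video)                         # step 630: synchronize audio and video


# Placeholder stages, for illustration only.
def fake_tts(text: str) -> str:
    return f"audio<{text}>"


def fake_model(audio: str, video: str, data: str) -> str:
    return f"lipsync<{audio},{video},{data}>"


def fake_sync(audio: str, video: str) -> dict:
    return {"audio": audio, "video": video}


result = generate_3d_multimedia("hello", "video2d", "data3d", fake_tts, fake_model, fake_sync)
```

Passing the stages as parameters keeps the flow independent of any particular TTS technique or model variant, which matches the way the embodiments leave those choices open.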

이하 도 7 내지 도 9를 참조하여 3D 립싱크 비디오를 생성하는 과정(620)을 구체적으로 설명한다.Hereinafter, the process 620 of generating a 3D lip sync video will be described in detail with reference to FIGS. 7 to 9.

도 7은 일 실시예에 따른 3D 립싱크 비디오 생성 과정(620)을 도시한 도면이다.FIG. 7 is a diagram illustrating a 3D lip sync video generation process 620 according to an embodiment.

도 7을 참조하면, 3D 립싱크 비디오 생성 장치는 인물의 발화 모습이 촬영된 2D 비디오에서 제1 특징 벡터를 추출할 수 있다(710).Referring to FIG. 7, the 3D lip sync video generating device can extract a first feature vector from a 2D video in which a person's speech is captured (710).

3D 립싱크 비디오 생성 장치는 인물의 발화 모습으로부터 획득된 3D 데이터에서 제2 특징 벡터를 추출할 수 있다(720). 여기서 3D 데이터는 인물의 기하학적 구조를 표현하는 제1 3D 데이터와, 피부의 수축과 팽창, 목젖의 움직임, 어깨 근육의 움직임을 표현하는 제2 3D 데이터를 포함할 수 있다.The 3D lip sync video generating device may extract a second feature vector from 3D data obtained from the person's speech (720). Here, the 3D data may include first 3D data expressing the geometric structure of the person, and second 3D data expressing contraction and expansion of the skin, movement of the uvula, and movement of the shoulder muscles.

3D 립싱크 비디오 생성 장치는 발화 오디오로부터 제3 특징 벡터를 추출할 수 있다(730). 이때 발화 오디오는 2D 비디오와는 관련 없는 것으로 단계 610에서 텍스트로부터 생성된 발화 오디오일 수 있다.The 3D lip sync video generating device may extract a third feature vector from the speech audio (730). At this time, the speech audio is not related to the 2D video and may be speech audio generated from text in step 610.

3D 립싱크 비디오 생성 장치는 제1 특징 벡터, 제2 특징 벡터 및 제3 특징 벡터를 조합하여 조합 벡터를 생성할 수 있다(740). 예를 들어, 3D 립싱크 비디오 생성 장치는 제1 특징 벡터, 제2 특징 벡터 및 제3 특징 벡터를 연결(Concatenate)하여 조합 벡터를 생성할 수 있다.The 3D lip sync video generating device may generate a combination vector by combining the first feature vector, the second feature vector, and the third feature vector (740). For example, a 3D lip sync video generating device may generate a combination vector by concatenating a first feature vector, a second feature vector, and a third feature vector.

3D 립싱크 비디오 생성 장치는 조합 벡터를 기반으로 2D 비디오 속 인물의 3D 모델이 발화하는 3D 립싱크 비디오를 생성할 수 있다(750). 구체적으로, 3D 립싱크 비디오 생성 장치는 조합 벡터를 기반으로 마스킹 처리된 부분(발화와 관련된 부분)을 발화 오디오에 대응하도록 복원한 3D 립싱크 비디오를 생성할 수 있다.The 3D lip sync video generating device can generate a 3D lip sync video in which a 3D model of a person in a 2D video speaks based on a combination vector (750). Specifically, the 3D lip sync video generating device can generate a 3D lip sync video in which the masked portion (part related to speech) is restored to correspond to the speech audio based on the combination vector.

도 8은 다른 실시예에 따른 3D 립싱크 비디오 생성 과정(620)을 도시한 도면이다.FIG. 8 is a diagram illustrating a 3D lip sync video generation process 620 according to another embodiment.

도 8을 참조하면, 3D 립싱크 비디오 생성 장치는 인물의 발화 모습이 촬영된 2D 비디오와 인물의 발화 모습으로부터 획득된 3D 데이터에서 제1 특징 벡터를 추출할 수 있다(810). 2D 비디오는 해당 인물의 상반신이 촬영된 2D 비디오일 수 있으며, 2D 비디오에서 발화와 관련된 부분은 마스킹 처리될 수 있다. 또한, 3D 데이터는 인물의 기하학적 구조를 표현하는 정점들(vertices), 폴리곤들(polygons) 및/또는 메쉬(mesh)에 대한 정보를 포함하는 제1 3D 데이터와, 피부의 수축과 팽창, 목젖의 움직임, 어깨 근육의 움직임을 표현하는 인물에 표시된 마커의 위치 및/또는 이동에 관한 정보를 포함하는 제2 3D 데이터를 포함할 수 있다. Referring to FIG. 8, the 3D lip sync video generating apparatus may extract a first feature vector from the 2D video capturing the person speaking and from the 3D data obtained from the person speaking (810). The 2D video may capture the upper body of the person, and the speech-related portion of the 2D video may be masked. In addition, the 3D data may include first 3D data containing information about vertices, polygons, and/or a mesh representing the geometric structure of the person, and second 3D data containing information about the position and/or movement of markers placed on the person to represent contraction and expansion of the skin, movement of the uvula, and movement of the shoulder muscles.

3D 립싱크 비디오 생성 장치는 발화 오디오로부터 제2 특징 벡터를 추출할 수 있다(820). 이때 발화 오디오는 2D 비디오와는 관련 없는 것으로 단계 610에서 텍스트로부터 생성된 발화 오디오일 수 있다.The 3D lip sync video generating device may extract a second feature vector from the speech audio (820). At this time, the speech audio is not related to the 2D video and may be speech audio generated from text in step 610.

3D 립싱크 비디오 생성 장치는 제1 특징 벡터 및 제2 특징 벡터를 조합하여 조합 벡터를 생성할 수 있다(830). 예를 들어, 3D 립싱크 비디오 생성 장치는 제1 특징 벡터 및 제2 특징 벡터를 연결(Concatenate)하여 조합 벡터를 생성할 수 있다.The 3D lip sync video generating device may generate a combination vector by combining the first feature vector and the second feature vector (830). For example, a 3D lip sync video generating device may generate a combination vector by concatenating a first feature vector and a second feature vector.

3D 립싱크 비디오 생성 장치는 조합 벡터를 기반으로 2D 비디오 속 인물의 3차원 모델이 발화하는 3D 립싱크 비디오를 생성할 수 있다(840). 구체적으로, 3D 립싱크 비디오 생성 장치는 조합 벡터를 기반으로 마스킹 처리된 부분(발화와 관련된 부분)을 발화 오디오에 대응하도록 복원한 3D 립싱크 비디오를 생성할 수 있다. The 3D lip sync video generating device can generate a 3D lip sync video in which a 3D model of a person in a 2D video speaks based on a combination vector (840). Specifically, the 3D lip sync video generating device can generate a 3D lip sync video in which the masked portion (part related to speech) is restored to correspond to the speech audio based on the combination vector.

도 9는 또 다른 실시예에 따른 3D 립싱크 비디오 생성 과정(620)을 도시한 도면이다.FIG. 9 is a diagram illustrating a 3D lip sync video generation process 620 according to another embodiment.

도 9를 참조하면, 3D 립싱크 비디오 생성 장치는 인물의 발화 모습이 촬영된 2D 비디오에서 제1 특징 벡터를 추출할 수 있다(910). 2D 비디오는 해당 인물의 상반신이 촬영된 2D 비디오일 수 있으며, 2D 비디오에서 발화와 관련된 부분은 마스킹 처리될 수 있다.Referring to FIG. 9, the 3D lip sync video generating device can extract a first feature vector from a 2D video in which a person's speech is captured (910). The 2D video may be a 2D video in which the upper body of the person in question is filmed, and parts related to speech in the 2D video may be masked.

3D 립싱크 비디오 생성 장치는 인물의 발화 모습으로부터 획득된 제1 3D 데이터에서 제2 특징 벡터를 추출할 수 있다(920). 여기서 제1 3D 데이터는 인물의 기하학적 구조를 표현하는 정점들(vertices), 폴리곤들(polygons) 및/또는 메쉬(mesh)에 대한 정보를 포함할 수 있다.The 3D lip sync video generating device may extract a second feature vector from the first 3D data obtained from the person's speech (920). Here, the first 3D data may include information about vertices, polygons, and/or mesh representing the geometric structure of the person.

3D 립싱크 비디오 생성 장치는 인물의 발화 모습으로부터 획득된 제2 3D 데이터에서 제3 특징 벡터를 추출할 수 있다(930). 여기서 제2 3D 데이터는 피부의 수축과 팽창, 목젖의 움직임, 어깨 근육의 움직임을 표현하는 인물에 표시된 마커의 위치 및/또는 이동에 관한 정보를 포함할 수 있다.The 3D lip sync video generating device may extract a third feature vector from the second 3D data obtained from the person's speech appearance (930). Here, the second 3D data may include information about the position and/or movement of the marker displayed on the person expressing the contraction and expansion of the skin, the movement of the uvula, and the movement of the shoulder muscles.

3D 립싱크 비디오 생성 장치는 발화 오디오로부터 제4 특징 벡터를 추출할 수 있다(940). 이때 발화 오디오는 2D 비디오와는 관련 없는 것으로 단계 610에서 텍스트로부터 생성된 발화 오디오일 수 있다. The 3D lip sync video generating apparatus may extract a fourth feature vector from the speech audio (940). Here, the speech audio is unrelated to the 2D video and may be the speech audio generated from text in step 610.

3D 립싱크 비디오 생성 장치는 제1 특징 벡터, 제2 특징 벡터, 제3 특징 벡터 및 제4 특징 벡터를 조합하여 조합 벡터를 생성할 수 있다(950). 예를 들어, 3D 립싱크 비디오 생성 장치는 제1 특징 벡터, 제2 특징 벡터, 제3 특징 벡터 및 제4 특징 벡터를 연결(Concatenate)하여 조합 벡터를 생성할 수 있다.The 3D lip sync video generating apparatus may generate a combination vector by combining the first feature vector, the second feature vector, the third feature vector, and the fourth feature vector (950). For example, a 3D lip sync video generating device may generate a combination vector by concatenating a first feature vector, a second feature vector, a third feature vector, and a fourth feature vector.

3D 립싱크 비디오 생성 장치는 조합 벡터를 기반으로 2D 비디오 속 인물의 3D 모델이 발화하는 3D 립싱크 비디오를 생성할 수 있다(960). 구체적으로, 3D 립싱크 비디오 생성 장치는 조합 벡터를 기반으로 마스킹 처리된 부분(발화와 관련된 부분)을 발화 오디오에 대응하도록 복원한 3D 립싱크 비디오를 생성할 수 있다.The 3D lip sync video generating device can generate a 3D lip sync video in which a 3D model of a person in a 2D video speaks based on a combination vector (960). Specifically, the 3D lip sync video generating device can generate a 3D lip sync video in which the masked portion (part related to speech) is restored to correspond to the speech audio based on the combination vector.

도 10은 예시적인 실시예들에서 사용되기에 적합한 컴퓨팅 장치를 포함하는 컴퓨팅 환경(10)을 예시하여 설명하기 위한 블록도이다. 도시된 실시예에서, 각 컴포넌트들은 이하에 기술된 것 이외에 상이한 기능 및 능력을 가질 수 있고, 이하에 기술된 것 이외에도 추가적인 컴포넌트를 포함할 수 있다. FIG. 10 is a block diagram illustrating a computing environment 10 that includes a computing device suitable for use in the exemplary embodiments. In the illustrated embodiment, each component may have functions and capabilities different from those described below, and additional components beyond those described below may be included.

도시된 컴퓨팅 환경(10)은 컴퓨팅 장치(12)를 포함한다. 일 실시예에서, 컴퓨팅 장치(12)는 3D 립싱크 비디오 생성 장치(100)일 수 있다.The illustrated computing environment 10 includes a computing device 12 . In one embodiment, computing device 12 may be a 3D lip sync video generation device 100.

컴퓨팅 장치(12)는 적어도 하나의 프로세서(14), 컴퓨터 판독 가능 저장 매체(16) 및 통신 버스(18)를 포함할 수 있다. 프로세서(14)는 컴퓨팅 장치(12)로 하여금 앞서 언급된 예시적인 실시예에 따라 동작하도록 할 수 있다. 예컨대, 프로세서(14)는 컴퓨터 판독 가능 저장 매체(16)에 저장된 하나 이상의 프로그램들을 실행할 수 있다. 하나 이상의 프로그램들은 하나 이상의 컴퓨터 실행 가능 명령어를 포함할 수 있으며, 컴퓨터 실행 가능 명령어는 프로세서(14)에 의해 실행되는 경우 컴퓨팅 장치(12)로 하여금 예시적인 실시예에 따른 동작들을 수행하도록 구성될 수 있다. The computing device 12 may include at least one processor 14, a computer-readable storage medium 16, and a communication bus 18. The processor 14 may cause the computing device 12 to operate in accordance with the exemplary embodiments described above. For example, the processor 14 may execute one or more programs stored on the computer-readable storage medium 16. The one or more programs may include one or more computer-executable instructions, which, when executed by the processor 14, may be configured to cause the computing device 12 to perform operations according to the exemplary embodiments.

컴퓨터 판독 가능 저장 매체(16)는 컴퓨터 실행 가능 명령어 내지 프로그램 코드, 프로그램 데이터 및/또는 다른 적합한 형태의 정보를 저장하도록 구성될 수 있다. 컴퓨터 판독 가능 저장 매체(16)에 저장된 프로그램(20)은 프로세서(14)에 의해 실행 가능한 명령어의 집합을 포함할 수 있다. 일 실시예에서, 컴퓨터 판독 가능 저장 매체(16)는 메모리(랜덤 액세스 메모리와 같은 휘발성 메모리, 비휘발성 메모리, 또는 이들의 적절한 조합), 하나 이상의 자기 디스크 저장 디바이스들, 광학 디스크 저장 디바이스들, 플래시 메모리 디바이스들, 그 밖에 컴퓨팅 장치(12)에 의해 액세스되고 원하는 정보를 저장할 수 있는 다른 형태의 저장 매체, 또는 이들의 적합한 조합일 수 있다.Computer-readable storage medium 16 may be configured to store computer-executable instructions or program code, program data, and/or other suitable forms of information. The program 20 stored in the computer-readable storage medium 16 may include a set of instructions executable by the processor 14. In one embodiment, computer-readable storage medium 16 includes memory (volatile memory, such as random access memory, non-volatile memory, or an appropriate combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash It may be memory devices, another form of storage medium that can be accessed by computing device 12 and store desired information, or a suitable combination thereof.

통신 버스(18)는 컴퓨팅 장치(12)의 다른 다양한 컴포넌트들을 상호 연결할 수 있다.Communication bus 18 may interconnect various other components of computing device 12.

컴퓨팅 장치(12)는 또한 하나 이상의 입출력 장치(24)를 위한 인터페이스를 제공하는 하나 이상의 입출력 인터페이스(22) 및 하나 이상의 네트워크 통신 인터페이스(26)를 포함할 수 있다. 입출력 인터페이스(22) 및 네트워크 통신 인터페이스(26)는 통신 버스(18)에 연결될 수 있다. 입출력 장치(24)는 입출력 인터페이스(22)를 통해 컴퓨팅 장치(12)의 다른 컴포넌트들에 연결될 수 있다. 예시적인 입출력 장치(24)는 포인팅 장치(마우스 또는 트랙패드 등), 키보드, 터치 입력 장치(터치패드 또는 터치스크린 등), 음성 또는 소리 입력 장치, 다양한 종류의 센서 장치 및/또는 촬영 장치와 같은 입력 장치, 및/또는 디스플레이 장치, 프린터, 스피커 및/또는 네트워크 카드와 같은 출력 장치를 포함할 수 있다. 예시적인 입출력 장치(24)는 컴퓨팅 장치(12)를 구성하는 일 컴포넌트로서 컴퓨팅 장치(12)의 내부에 포함될 수도 있고, 컴퓨팅 장치(12)와는 구별되는 별개의 장치로 컴퓨팅 장치(12)와 연결될 수도 있다. The computing device 12 may also include one or more input/output interfaces 22, which provide an interface for one or more input/output devices 24, and one or more network communication interfaces 26. The input/output interface 22 and the network communication interface 26 may be connected to the communication bus 18. The input/output device 24 may be connected to other components of the computing device 12 through the input/output interface 22. Exemplary input/output devices 24 may include input devices such as a pointing device (a mouse, a trackpad, or the like), a keyboard, a touch input device (a touchpad, a touch screen, or the like), a voice or sound input device, various types of sensor devices, and/or an imaging device, and/or output devices such as a display device, a printer, a speaker, and/or a network card. The exemplary input/output device 24 may be included inside the computing device 12 as one of its components, or may be connected to the computing device 12 as a separate device distinct from it.

이제까지 본 발명에 대하여 그 바람직한 실시 예들을 중심으로 살펴보았다. 본 발명이 속하는 기술 분야에서 통상의 지식을 가진 자는 본 발명이 본 발명의 본질적인 특성에서 벗어나지 않는 범위에서 변형된 형태로 구현될 수 있음을 이해할 수 있을 것이다. 따라서, 본 발명의 범위는 전술한 실시 예에 한정되지 않고 특허 청구범위에 기재된 내용과 동등한 범위 내에 있는 다양한 실시 형태가 포함되도록 해석되어야 할 것이다.So far, the present invention has been examined focusing on its preferred embodiments. A person skilled in the art to which the present invention pertains will understand that the present invention may be implemented in a modified form without departing from the essential characteristics of the present invention. Accordingly, the scope of the present invention is not limited to the above-described embodiments, but should be construed to include various embodiments within the scope equivalent to the content described in the patent claims.

100: 3D 립싱크 비디오 생성 장치
110, 110a, 110b, 110c: 3D 립싱크 비디오 생성 모델
120: 음성 변환부
130: 싱크부
310, 410, 510: 제1 인코더
320, 420, 520: 제2 인코더
330, 530: 제3 인코더
540: 제4 인코더
340, 430, 550: 조합부
350, 440, 560: 디코더

100: 3D lip sync video generating apparatus
110, 110a, 110b, 110c: 3D lip sync video generation model
120: Voice conversion unit
130: Sync unit
310, 410, 510: First encoder
320, 420, 520: Second encoder
330, 530: Third encoder
540: Fourth encoder
340, 430, 550: Combination unit
350, 440, 560: Decoder

Claims

입력 텍스트를 기반으로 발화 오디오를 생성하는 음성 변환부; 및
상기 생성된 발화 오디오, 인물의 발화 모습이 촬영된 2D 비디오 및 상기 인물의 발화 모습으로부터 획득된 3D 데이터를 기반으로 상기 인물의 3차원 모델이 발화하는 3D 립싱크 비디오를 생성하는 3D 립싱크 비디오 생성 모델; 을 포함하고,
상기 3D 데이터는 상기 인물의 기하학적 구조를 표현하는 정점들(vertices), 폴리곤들(polygons) 및 메쉬(mesh)에 대한 정보를 포함하는 제1 3D 데이터와, 피부의 수축과 팽창, 목젖의 움직임 및 어깨 근육의 움직임을 표현하는 상기 인물에 표시된 복수의 마커의 위치 및 이동에 관한 정보를 포함하는 제2 3D 데이터를 포함하고,
상기 제1 3D 데이터는 제1 해상도 이하의 저해상도 깊이 카메라를 이용하여 획득된 상기 인물의 깊이 정보와, 제2 해상도 이상의 고해상도 적외선 카메라를 이용하여 획득된 상기 인물에 대한 적외선 이미지를 기반으로 머신러닝 모델을 이용하여 획득되며,
상기 복수의 마커는 상기 인물의 얼굴, 목 및 어깨에 표시되고,
상기 3D 립싱크 비디오 생성 모델은,
인물의 발화 모습이 촬영된 2D 비디오, 상기 2D 비디오의 촬영과 함께 녹음된 해당 인물의 발화 오디오, 및 상기 해당 인물의 발화 모습으로부터 획득된 3D 데이터를 학습 데이터로 획득하고, 상기 해당 인물의 발화 모습이 촬영된 3D 비디오를 정답(target) 데이터로 획득하며, 상기 학습 데이터를 입력으로 하여 생성된 3D 립싱크 비디오와 정답 데이터인 상기 3D 비디오의 차이가 최소가 되도록 각 레이어의 가중치 및 바이어스를 조절하는 방식으로 미리 학습한,
3D 립싱크 비디오 생성 장치.

A 3D lip sync video generating apparatus comprising:
a voice conversion unit that generates speech audio based on input text; and
a 3D lip sync video generation model that generates a 3D lip sync video in which a 3D model of a person speaks, based on the generated speech audio, a 2D video capturing the person speaking, and 3D data obtained from the person speaking,
wherein the 3D data includes first 3D data containing information about vertices, polygons, and a mesh representing the geometric structure of the person, and second 3D data containing information about the positions and movements of a plurality of markers placed on the person to represent contraction and expansion of the skin, movement of the uvula, and movement of the shoulder muscles,
wherein the first 3D data is obtained using a machine learning model, based on depth information of the person acquired using a low-resolution depth camera with a resolution at or below a first resolution and an infrared image of the person acquired using a high-resolution infrared camera with a resolution at or above a second resolution,
wherein the plurality of markers are placed on the face, neck, and shoulders of the person, and
wherein the 3D lip sync video generation model is trained in advance by acquiring, as training data, a 2D video capturing a person speaking, speech audio of that person recorded together with the 2D video, and 3D data obtained from that person speaking; acquiring, as target data, a 3D video capturing that person speaking; and adjusting the weights and biases of each layer so as to minimize the difference between the 3D lip sync video generated from the training data and the 3D video serving as the target data.

제1항에 있어서,
상기 2D 비디오는 발화와 관련된 부분이 마스크로 가려지고, 상기 인물의 상반신을 포함하는,
3D 립싱크 비디오 생성 장치.

The apparatus of claim 1,
wherein the 2D video includes the upper body of the person, with the speech-related portion covered by a mask.

제2항에 있어서,
상기 3D 립싱크 비디오 생성 모델은 상기 마스크로 가려진 부분을 상기 발화 오디오에 대응하도록 복원하여 상기 3D 립싱크 비디오를 생성하는,
3D 립싱크 비디오 생성 장치.

The apparatus of claim 2,
wherein the 3D lip sync video generation model generates the 3D lip sync video by restoring the portion covered by the mask so that it corresponds to the speech audio.

삭제 (Deleted)

제1항에 있어서,
상기 3D 립싱크 비디오 생성 모델은,
상기 2D 비디오에서 제1 특징 벡터를 추출하는 제1 인코더;
상기 3D 데이터에서 제2 특징 벡터를 추출하는 제2 인코더;
상기 발화 오디오로부터 제3 특징 벡터를 추출하는 제3 인코더;
상기 제1 특징 벡터, 상기 제2 특징 벡터 및 상기 제3 특징 벡터를 조합하여 조합 벡터를 생성하는 조합부; 및
상기 조합 벡터로부터 상기 3D 립싱크 비디오를 생성하는 디코더; 를 포함하는,
3D 립싱크 비디오 생성 장치.

The apparatus of claim 1,
wherein the 3D lip sync video generation model comprises:
a first encoder that extracts a first feature vector from the 2D video;
a second encoder that extracts a second feature vector from the 3D data;
a third encoder that extracts a third feature vector from the speech audio;
a combination unit that generates a combination vector by combining the first feature vector, the second feature vector, and the third feature vector; and
a decoder that generates the 3D lip sync video from the combination vector.

제1항에 있어서,
상기 3D 립싱크 비디오 생성 모델은,
상기 2D 비디오와 상기 3D 데이터에서 제1 특징 벡터를 추출하는 제1 인코더;
상기 발화 오디오로부터 제2 특징 벡터를 추출하는 제2 인코더;
상기 제1 특징 벡터 및 상기 제2 특징 벡터를 조합하여 조합 벡터를 생성하는 조합부; 및
상기 조합 벡터로부터 상기 3D 립싱크 비디오를 생성하는 디코더; 를 포함하는,
3D 립싱크 비디오 생성 장치.

The apparatus of claim 1,
wherein the 3D lip sync video generation model comprises:
a first encoder that extracts a first feature vector from the 2D video and the 3D data;
a second encoder that extracts a second feature vector from the speech audio;
a combination unit that generates a combination vector by combining the first feature vector and the second feature vector; and
a decoder that generates the 3D lip sync video from the combination vector.

제1항에 있어서,
상기 3D 립싱크 비디오 생성 모델은,
상기 2D 비디오에서 제1 특징 벡터를 추출하는 제1 인코더;
상기 제1 3D 데이터에서 제2 특징 벡터를 추출하는 제2 인코더;
상기 제2 3D 데이터에서 제3 특징 벡터를 추출하는 제3 인코더;
상기 발화 오디오로부터 제4 특징 벡터를 추출하는 제4 인코더;
상기 제1 특징 벡터, 상기 제2 특징 벡터, 상기 제3 특징 벡터 및 상기 제4 특징 벡터를 조합하여 조합 벡터를 생성하는 조합부; 및
상기 조합 벡터로부터 상기 3D 립싱크 비디오를 생성하는 디코더; 를 포함하는,
3D 립싱크 비디오 생성 장치.

The apparatus of claim 1,
wherein the 3D lip sync video generation model comprises:
a first encoder that extracts a first feature vector from the 2D video;
a second encoder that extracts a second feature vector from the first 3D data;
a third encoder that extracts a third feature vector from the second 3D data;
a fourth encoder that extracts a fourth feature vector from the speech audio;
a combination unit that generates a combination vector by combining the first feature vector, the second feature vector, the third feature vector, and the fourth feature vector; and
a decoder that generates the 3D lip sync video from the combination vector.

제1항에 있어서,
상기 발화 오디오와 상기 3D 립싱크 비디오의 싱크를 맞춰 3D 멀티미디어를 생성하는 싱크부; 를 더 포함하는,
3D 립싱크 비디오 생성 장치.

The apparatus of claim 1, further comprising:
a sync unit that generates 3D multimedia by synchronizing the speech audio with the 3D lip sync video.

컴퓨팅 장치에 의해 수행되는 3D 립싱크 비디오 생성 방법에 있어서,
인물의 발화 모습이 촬영된 2D 비디오, 상기 2D 비디오의 촬영과 함께 녹음된 해당 인물의 발화 오디오, 및 상기 해당 인물의 발화 모습으로부터 획득된 3D 데이터를 학습 데이터로 획득하고, 상기 해당 인물의 발화 모습이 촬영된 3D 비디오를 정답(target) 데이터로 획득하는 단계;
상기 학습 데이터를 입력으로 하여 생성된 3D 립싱크 비디오와 정답 데이터인 상기 3D 비디오의 차이가 최소가 되도록 각 레이어의 가중치 및 바이어스를 조절하는 방식으로 미리 학습하는 단계;
입력 텍스트를 기반으로 발화 오디오를 생성하는 단계; 및
상기 생성된 발화 오디오, 인물의 발화 모습이 촬영된 2D 비디오 및 상기 인물의 발화 모습으로부터 획득된 3D 데이터를 기반으로 상기 인물의 3차원 모델이 발화하는 3D 립싱크 비디오를 생성하는 단계; 을 포함하고,
상기 3D 데이터는 상기 인물의 기하학적 구조를 표현하는 정점들(vertices), 폴리곤들(polygons) 및 메쉬(mesh)에 대한 정보를 포함하는 제1 3D 데이터와, 피부의 수축과 팽창, 목젖의 움직임 및 어깨 근육의 움직임을 표현하는 상기 인물에 표시된 복수의 마커의 위치 및 이동에 관한 정보를 포함하는 제2 3D 데이터를 포함하고,
상기 제1 3D 데이터는 제1 해상도 이하의 저해상도 깊이 카메라를 이용하여 획득된 상기 인물의 깊이 정보와, 제2 해상도 이상의 고해상도 적외선 카메라를 이용하여 획득된 상기 인물에 대한 적외선 이미지를 기반으로 머신러닝 모델을 이용하여 획득되고,
상기 복수의 마커는 상기 인물의 얼굴, 목 및 어깨에 표시되는,
3D 립싱크 비디오 생성 방법.In a method of generating a 3D lip sync video performed by a computing device,
A 2D video in which a person's speech is captured, audio of the person's speech recorded along with the shooting of the 2D video, and 3D data obtained from the person's speech are acquired as learning data, and the person's speech is recorded as learning data. Obtaining the captured 3D video as target data;
Pre-learning by adjusting the weight and bias of each layer to minimize the difference between the 3D lip sync video generated using the learning data as input and the 3D video that is the correct answer data;
generating speech audio based on input text; and
Generating a 3D lip-sync video in which a 3D model of the person speaks based on the generated speech audio, a 2D video of the person speaking, and 3D data obtained from the person's speech; Including,
The 3D data includes first 3D data containing information about vertices, polygons, and mesh representing the geometric structure of the person, contraction and expansion of the skin, and movement of the uvula. and second 3D data including information about the position and movement of a plurality of markers displayed on the person representing movement of shoulder muscles,
The first 3D data is a machine learning model based on depth information of the person acquired using a low-resolution depth camera with a first resolution or lower and an infrared image of the person acquired using a high-resolution infrared camera with a second resolution or higher. Obtained using,
The plurality of markers are displayed on the face, neck, and shoulders of the person,
How to create a 3D lip sync video.
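
The pre-training step of this claim (adjusting weights and biases so that the difference between the generated output and the target is minimized) can be illustrated with a deliberately tiny sketch. The claim does not name a loss function or an optimizer; the mean squared error, plain gradient descent, the single linear "layer", and the scalar stand-ins for video data below are all assumptions made for illustration, and the names `train` and `mse` are hypothetical.

```python
# Toy stand-in for the pre-training step: one linear "layer" y = w * x + b.
# Assumptions (not from the claim): MSE loss, plain gradient descent,
# scalar inputs and targets in place of real video and 3D data.

def mse(w, b, inputs, targets):
    """Mean squared difference between generated outputs and targets."""
    return sum((w * x + b - t) ** 2 for x, t in zip(inputs, targets)) / len(inputs)

def train(inputs, targets, lr=0.05, epochs=1000):
    """Adjust the weight and bias so the output approaches the targets."""
    w, b = 0.0, 0.0
    n = len(inputs)
    for _ in range(epochs):
        # Forward pass: outputs "generated" from the training inputs.
        preds = [w * x + b for x in inputs]
        # Gradients of the MSE with respect to the weight and the bias.
        grad_w = sum(2 * (p - t) * x for p, t, x in zip(preds, targets, inputs)) / n
        grad_b = sum(2 * (p - t) for p, t in zip(preds, targets)) / n
        # Update step: move both parameters against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

inputs = [0.0, 1.0, 2.0, 3.0]
targets = [1.0, 3.0, 5.0, 7.0]  # produced by y = 2x + 1
w, b = train(inputs, targets)
```

In a real system the single weight and bias would be replaced by the parameters of every layer of the generation model, and the scalar targets by frames of the captured 3D video.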

The method of claim 10, wherein the 2D video includes the upper body of the person, with the speech-related part covered by a mask.
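
As a rough sketch of what covering the speech-related part with a mask could look like, the snippet below blanks a rectangular region of a single frame. The claim does not state how the region is located or what value fills it; the box coordinates, the fill value of 0, the nested-list frame representation, and the function name `mask_speech_region` are hypothetical choices for illustration.

```python
def mask_speech_region(frame, top, bottom, left, right, fill=0):
    """Return a copy of `frame` (a list of pixel rows) with a box filled in.

    The box covers rows top..bottom-1 and columns left..right-1.
    The original frame is left unchanged.
    """
    masked = [row[:] for row in frame]
    for r in range(top, bottom):
        for c in range(left, right):
            masked[r][c] = fill
    return masked

# A 6x6 single-channel frame in which every pixel has the value 9.
frame = [[9] * 6 for _ in range(6)]

# Hypothetical box standing in for the mouth area of the frame.
masked = mask_speech_region(frame, top=3, bottom=5, left=1, right=5)
```

The generation model would then be responsible for filling the blanked region so that it matches the speech audio.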

The method of claim 11, wherein the generating of the 3D lip sync video comprises:
generating the 3D lip sync video by restoring the part covered by the mask so that it corresponds to the speech audio.

(Deleted)

The method of claim 10, wherein the generating of the 3D lip sync video comprises:
extracting a first feature vector from the 2D video;
extracting a second feature vector from the 3D data;
extracting a third feature vector from the speech audio;
generating a combination vector by combining the first feature vector, the second feature vector, and the third feature vector; and
generating the 3D lip sync video from the combination vector.
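
The extract-then-combine flow of this claim can be sketched as follows. The claim says only that the three feature vectors are combined; concatenation is one common way to do that and is an assumption here. The three encoder functions are hypothetical hand-written stand-ins for learned encoders, and the decoder is omitted because the claim gives no detail about it.

```python
def encode_video(frames):
    """Hypothetical 2D-video encoder: mean pixel value of each frame."""
    return [sum(frame) / len(frame) for frame in frames]

def encode_3d(vertices):
    """Hypothetical 3D-data encoder: centroid of the (x, y, z) vertices."""
    n = len(vertices)
    return [sum(v[axis] for v in vertices) / n for axis in range(3)]

def encode_audio(samples):
    """Hypothetical audio encoder: mean energy and peak amplitude."""
    energy = sum(s * s for s in samples) / len(samples)
    peak = max(abs(s) for s in samples)
    return [energy, peak]

def combine(*feature_vectors):
    """Combine feature vectors by concatenation (an assumed choice)."""
    combined = []
    for vector in feature_vectors:
        combined.extend(vector)
    return combined

first = encode_video([[0, 2], [4, 6]])          # from the 2D video
second = encode_3d([(0, 0, 0), (2, 4, 6)])      # from the 3D data
third = encode_audio([0.5, -0.5])               # from the speech audio
combination_vector = combine(first, second, third)
```

Other combination schemes, such as element-wise addition of equally sized vectors, would fit the claim language equally well.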

The method of claim 10, wherein the generating of the 3D lip sync video comprises:
extracting a first feature vector from the 2D video and the 3D data;
extracting a second feature vector from the speech audio;
generating a combination vector by combining the first feature vector and the second feature vector; and
generating the 3D lip sync video from the combination vector.

The method of claim 10, wherein the generating of the 3D lip sync video comprises:
extracting a first feature vector from the 2D video;
extracting a second feature vector from the first 3D data;
extracting a third feature vector from the second 3D data;
extracting a fourth feature vector from the speech audio;
generating a combination vector by combining the first feature vector, the second feature vector, the third feature vector, and the fourth feature vector; and
generating the 3D lip sync video from the combination vector.
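
One plausible reason for giving the first and second 3D data separate feature extractors is that the two have different shapes: the first describes static geometry (vertices, polygons, mesh), while the second describes marker positions that change over time. The sketch below shows one possible in-memory layout for each. The class names, field names, marker names, and the displacement feature are all hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class MeshData:
    """Stand-in for the first 3D data: geometric structure of the person."""
    vertices: list   # [(x, y, z), ...]
    polygons: list   # [(i, j, k), ...] indices into `vertices`

@dataclass
class MarkerData:
    """Stand-in for the second 3D data: marker positions over time."""
    positions: dict  # marker name -> [(x, y, z) for each frame]

def marker_displacement(markers):
    """Movement of each marker between its first and last recorded frame."""
    result = {}
    for name, track in markers.positions.items():
        (x0, y0, z0), (x1, y1, z1) = track[0], track[-1]
        result[name] = (x1 - x0, y1 - y0, z1 - z0)
    return result

mesh = MeshData(
    vertices=[(0, 0, 0), (1, 0, 0), (0, 1, 0)],
    polygons=[(0, 1, 2)],
)
markers = MarkerData(positions={
    "chin": [(0, 0, 0), (0, -1, 0)],           # moves down while speaking
    "left_shoulder": [(1, 1, 1), (1, 1, 1)],   # stays still
})
movement = marker_displacement(markers)
```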

The method of claim 10, further comprising:
generating 3D multimedia by synchronizing the speech audio with the 3D lip sync video.
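
The claim does not describe how the synchronization is carried out. As a minimal sketch under the assumption of a constant frame rate and a constant audio sample rate, each video frame can be paired with the slice of audio samples that plays while the frame is shown. The function names and the values of 25 frames per second and 16000 samples per second are hypothetical.

```python
def audio_span_for_frame(frame_index, fps, sample_rate):
    """Return the (start, end) audio sample indices covered by one frame."""
    start = round(frame_index * sample_rate / fps)
    end = round((frame_index + 1) * sample_rate / fps)
    return start, end

def pair_frames_with_audio(num_frames, samples, fps, sample_rate):
    """Pair each frame index with the audio samples played during it."""
    pairs = []
    for index in range(num_frames):
        start, end = audio_span_for_frame(index, fps, sample_rate)
        pairs.append((index, samples[start:end]))
    return pairs

fps = 25
sample_rate = 16000
samples = list(range(1280))  # two frames' worth of placeholder audio
pairs = pair_frames_with_audio(2, samples, fps, sample_rate)
```

With these assumed rates each frame spans 640 audio samples, so frame 1 is paired with samples 640 through 1279.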