KR20200145315A

KR20200145315A - Method of predicting lip position for synthesizing a person's speech video based on modified CNN

Info

Publication number: KR20200145315A
Application number: KR1020190074167A
Authority: KR
Inventors: 황금별; 채경수; 박성우; 장세영
Original assignee: 주식회사 머니브레인
Priority date: 2019-06-21
Filing date: 2019-06-21
Publication date: 2020-12-30

Abstract

According to the present disclosure, a modified CNN is applied in place of the existing LSTM RNN model when predicting key points, such as lip coordinates, for synthesizing a speaking video. The CNN used for this prediction may be modified to take information into account asymmetrically, so that it mainly considers the preceding part of the input while also giving slight consideration to the part that follows the present. A given input can be progressively combined with other information as it passes up through the hidden layers, and the desired key point, that is, the target value, can be predicted by a causal CNN that predicts the current target value in consideration of its causal relationship with previous data.

Description

Method of predicting lip position for synthesizing a speaking video of a given person, based on a modified CNN {METHOD OF PREDICTING LIP POSITION FOR SYNTHESIZING A PERSON'S SPEECH VIDEO BASED ON MODIFIED CNN}

The present disclosure relates to synthesizing a speaking video of a specific person and, more specifically, to a method of predicting lip positions for synthesizing such a video of that person.

In recent years, with advances in artificial intelligence, various types of content have been created and used based on artificial intelligence technology. As part of this trend, when there is a message to be conveyed, attempts are sometimes made to attract people's attention by generating and providing a video in which a specific person, for example a celebrity, appears to speak that message. Technology is attracting interest that starts from a video of a specific person and generates the mouth shape and the like to match particular words, so as to synthesize a video in which the person appears to say those words even though he or she never actually said them.

In this regard, there is an approach in which, from a plurality of speaking-video data, the coordinates of speech-related facial parts such as the lips are extracted from the face in the video for each audio signal, and those coordinate values are learned and predicted, so that when an audio signal is given, the mouth shape matching that audio is output during video synthesis by applying the coordinate prediction (see, e.g., http://grail.cs.washington.edu/projects/AudioToObama/siggraph17_obama.pdf, https://arxiv.org/pdf/1801.01442.pdf, https://www.youtube.com/watch?v=9Yq67Cj Dqvw, https://ritheshkumar.com/obamanet/). Such conventional deep learning models follow a structure in which only the mouth region of the background video is blacked out, coordinates are output based on the audio, the lip coordinates are overlaid on the masked region, and the result is fed into an image-synthesis model that finally produces the synthesized video for that audio. That is, the conventional approach required a two-step process: first generating landmarks or key points related to the audio, and then synthesizing the images using that information.

In general, when predicting key points such as lip coordinates for synthesizing a speaking video, the preceding part of the audio affects the context of the current utterance, so the current mouth shape must be predicted with the preceding part taken into account. In addition, experiments show that the part that follows can also have a slight influence on the current mouth shape, so some of the following content also needs to be considered when predicting the current mouth shape. For this reason, models that predict key points such as lip coordinates based on an LSTM RNN have mainly been used. In practice, however, the LSTM RNN model produces mouth shapes that change slowly and only slightly, which makes the result look less lively. It has also been pointed out as a problem that, with the LSTM RNN model, the mouth sometimes does not close completely.

The present disclosure aims to provide a neural network model that responds faster and more accurately when predicting speech-related image information, such as lip coordinates.

According to the present disclosure, a modified CNN is applied in place of the existing LSTM RNN model when predicting key points, such as lip coordinates, for synthesizing a speaking video.

According to the present disclosure, the CNN used to predict key points, such as lip coordinates, for synthesizing a speaking video may be a CNN modified to take information into account asymmetrically, so that it mainly considers the preceding part while also giving slight consideration to the part that follows the present. According to the present disclosure, the information in a given input can be progressively combined as it passes up through the hidden layers, and the desired key point, that is, the target value, can be predicted by a causal CNN that predicts the current target value in consideration of its causal relationship with previous data.

According to the present disclosure, by predicting key points such as lip positions with an asymmetrically configured CNN, the mouth movements in the resulting synthesized video can be larger and can change faster than with the conventional RNN approach.

FIG. 1 is a diagram schematically showing a process of predicting lip positions and the like by an LSTM RNN method according to the prior art.
FIG. 2 is a diagram schematically showing a process of predicting a person's lip positions and the like for synthesizing a speaking video by an asymmetrically modified CNN method according to the present disclosure.

Hereinafter, embodiments of the present disclosure are described in detail with reference to the accompanying drawings. Detailed descriptions of well-known functions and configurations are omitted where they might unnecessarily obscure the subject matter of the present disclosure. It should also be understood that the following description relates only to one embodiment of the present disclosure, and the present disclosure is not limited thereto.

The terms used in the present disclosure are used only to describe specific embodiments and are not intended to limit the present disclosure. For example, a component expressed in the singular should be understood to include a plurality of such components unless the context clearly indicates the singular only. The term "and/or" as used in the present disclosure should be understood to encompass any and all possible combinations of one or more of the listed items. Terms such as "comprise" or "have" as used in the present disclosure are intended only to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described herein, and their use is not intended to exclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In the embodiments of the present disclosure, a "module" or "unit" means a functional part that performs at least one function or operation, and may be implemented in hardware, in software, or in a combination of hardware and software. In addition, a plurality of "modules" or "units" may be integrated into at least one software module and implemented by at least one processor, except for a "module" or "unit" that needs to be implemented in specific hardware.

When predicting key points such as lip coordinates for synthesizing a speaking video, the preceding part of the audio affects the context of the current utterance, so the current mouth shape must be predicted with the preceding part taken into account. The part that follows can also have a slight influence on the current mouth shape, so some of the following content also needs to be considered when predicting the current mouth shape. For this reason, models that predict key points such as lip coordinates based on an LSTM RNN have mainly been used. In practice, however, such an LSTM RNN model produces mouth shapes that change slowly and only slightly, which makes the result look less lively.

Instead of the conventional approach of constructing a time-delayed loss with an LSTM, the present disclosure uses a CNN approach and can thereby provide a neural network model that responds faster and more accurately when predicting speech-related image information, such as lip coordinates. According to the present disclosure, a modified CNN is applied in place of the existing LSTM RNN model when predicting key points, such as lip coordinates, for synthesizing a speaking video.

For reference, an RNN (Recurrent Neural Network) is a neural network in which the output value of the previous step is fed back as an input, and it is mainly used for analyzing time-series data. An LSTM (Long Short-Term Memory) is a network in the RNN family whose structure is designed to enable both long-term and short-term memory. A CNN (Convolutional Neural Network) is a network built by stacking multiple convolutional layers.
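As a rough illustration of the distinction drawn above, a simple recurrent layer and a one-dimensional convolutional layer can be written in their standard textbook forms. These forms are not taken from the disclosure, and the symbols $x_t$, $h_t$, $y_t$, $W$, $U$, $b$, $w_k$, $\phi$, and $K$ are assumed notation introduced here for illustration only.

```latex
% Recurrent layer: the hidden state of the previous step is fed back as an input.
h_t = \phi\left(W x_t + U h_{t-1} + b\right)

% 1-D convolutional layer: each output is a weighted sum over a fixed window of K inputs.
y_t = \sum_{k=0}^{K-1} w_k \, x_{t-k}
```

In the recurrent form each $h_t$ depends on $h_{t-1}$ and so must be computed step by step, whereas in the convolutional form each $y_t$ depends only on a fixed window of inputs.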

According to the present disclosure, the CNN used to predict key points, such as lip coordinates, for synthesizing a speaking video may be a CNN modified to take information into account asymmetrically, so that it mainly considers the preceding part while also giving slight consideration to the part that follows the present. The CNN of the present disclosure may therefore be called a causal CNN. With the causal CNN according to the present disclosure, the information in a given input can be progressively combined as it passes up through the hidden layers, so that the current target value can be predicted in consideration of its causal relationship with previous data. According to an embodiment of the present disclosure, the causal CNN may also be auto-regressive, that is, it may predict the current target value based on past predicted values. According to the present disclosure, by predicting key points such as lip positions with an asymmetrically configured CNN, the mouth movements in the resulting synthesized video can be larger and can change faster than with the conventional RNN approach.
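The asymmetric window described above can be sketched as a one-dimensional convolution whose kernel covers several past frames, the current frame, and only a small number of look-ahead frames. The following is a minimal sketch under stated assumptions, not the implementation of the disclosure: the function names `asymmetric_conv1d` and `receptive_field`, the `future` parameter, the zero padding, the kernel size, and the scalar toy input are all hypothetical choices made for illustration. The disclosure does not specify kernel sizes, layer counts, or input features.

```python
from typing import List, Sequence, Tuple


def asymmetric_conv1d(
    x: Sequence[float], weights: Sequence[float], future: int
) -> List[float]:
    """1-D convolution over a mostly-past window (hypothetical helper).

    The window for output t covers `len(weights) - 1 - future` past frames,
    the current frame, and `future` look-ahead frames. weights[0] applies to
    the oldest frame in the window. Zero padding at both ends keeps the
    output the same length as the input.
    """
    k = len(weights)
    past = k - 1 - future
    if future < 0 or past < 0:
        raise ValueError("future must be between 0 and len(weights) - 1")
    padded = [0.0] * past + list(x) + [0.0] * future
    return [
        sum(w * padded[t + j] for j, w in enumerate(weights))
        for t in range(len(x))
    ]


def receptive_field(num_layers: int, kernel_size: int, future: int) -> Tuple[int, int]:
    """Return (past frames, future frames) seen after stacking identical layers."""
    past = kernel_size - 1 - future
    return num_layers * past, num_layers * future


# Toy input: a single impulse at frame 5 of a 10-frame sequence.
impulse = [0.0] * 10
impulse[5] = 1.0

# One layer: kernel of 4 taps = 2 past frames + current frame + 1 future frame.
layer1 = asymmetric_conv1d(impulse, [1.0, 1.0, 1.0, 1.0], future=1)

# A second identical layer widens the window: information is combined
# progressively as it passes up through the layers.
layer2 = asymmetric_conv1d(layer1, [1.0, 1.0, 1.0, 1.0], future=1)

print(layer1)  # nonzero only at frames 4..7
print(layer2)  # nonzero only at frames 3..9
print(receptive_field(2, 4, 1))  # (4, 2)
```

In this toy run the impulse at frame 5 reaches output frame 4 only through the single look-ahead tap, while it reaches frames 5 to 7 as current or past context, which is the asymmetry the text describes. Setting `future=0` gives a strictly causal convolution that uses no look-ahead at all.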

As those skilled in the art will appreciate, the present disclosure is not limited to the examples described herein and may be variously modified, reconfigured, and substituted without departing from its scope. It should be understood that the various techniques described herein may be implemented in hardware, in software, or in a combination of hardware and software.

A computer program according to an embodiment of the present disclosure may be implemented in a form stored on various types of storage media readable by a computer processor or the like, including non-volatile memory such as EPROM, EEPROM, and flash memory devices, magnetic disks such as internal hard disks and removable disks, magneto-optical disks, and CD-ROM disks. The program code(s) may also be implemented in assembly language or machine language. All modifications and changes that fall within the true spirit and scope of the present disclosure are intended to be covered by the following claims.

Claims

A method of synthesizing a speaking video of a given person according to a neural network model, the method comprising:
receiving a background video of the given person;
receiving audio data;
obtaining, from the audio data, a series of key coordinates corresponding to the audio data; and
synthesizing the speaking video from the background video using the obtained series of key coordinates,
wherein the obtaining of the series of key coordinates is performed based on a causal CNN neural network model.

The method of claim 1,
wherein the series of key coordinates is a series of lip coordinates of the given person.

The method of claim 1,
wherein the causal CNN neural network model is configured to predict current data in consideration of a causal relationship with previous data.