KR102042168B1

KR102042168B1 - Methods and apparatuses for generating text to video based on time series adversarial neural network

Info

Publication number: KR102042168B1
Application number: KR1020180049255A
Authority: KR
Inventors: 이지형; 세르게이 데니소브; 김누리; 임윤규; 우상명
Original assignee: 성균관대학교산학협력단
Priority date: 2018-04-27
Filing date: 2018-04-27
Publication date: 2019-11-07
Also published as: KR20190125029A

Abstract

The present invention relates to a text-to-video generation method and apparatus based on a time-series adversarial neural network. A text-to-video generation method based on a time-series adversarial neural network according to an embodiment of the present invention includes: inputting data, in which a caption embedding of text for video generation, a video feature of a previous frame, and noise are concatenated, into a first time-series neural network to extract sequential features; generating a next frame from the extracted sequential features to generate a real-like video; sampling either a pre-stored real video or the generated video; and determining, using a second time-series neural network, whether the sampled video was sampled from the generated video or from the real video.

Description

METHODS AND APPARATUSES FOR GENERATING TEXT TO VIDEO BASED ON TIME SERIES ADVERSARIAL NEURAL NETWORK

The present invention relates to a text-to-video generation method and apparatus based on a time-series adversarial neural network.

Methods for creating captions that convey the content of a particular video have been widely studied.

In contrast, there has been very little research on generating video from text. A video generation process that uses text must generate the frames of the video sequentially according to the content of the text. This requires research on selecting the meaningful attributes of the text, and the techniques for generating the content of video frames from those attributes are very complicated. Because of this complexity, research on video generation has made little progress.

Meanwhile, visualizing a given text or caption is one of the recent problems in computer vision. For the image generation problem, many approaches show impressive results. The video generation problem, being a much more complex task, remains open. On the one hand, the relationship between the text content and an individual video frame can be captured; on the other hand, the text content must be mapped to a sequence of frames. The frames must be arranged in the correct order, yet they are frequently arranged in the wrong order. This temporal ordering is the most important part of the video generation problem.

Because conventional research has rarely addressed generating video from text, it is difficult to represent the sequential features of video.

Embodiments of the present invention aim to provide a text-to-video generation method and apparatus based on a time-series adversarial neural network that can provide a video better matched to the text by generating the video in consideration of the text features of the caption and the previous frame of each frame.

According to an embodiment of the present invention, there may be provided a text-to-video generation method based on a time-series adversarial neural network, performed by a text-to-video generation apparatus, the method including: inputting a caption embedding of text for video generation, a video feature of a previous frame, and noise into a first time-series neural network to extract sequential features; generating a next frame from the extracted sequential features to generate a real-like video; sampling either a pre-stored real video or the generated video; and determining, using a second time-series neural network, whether the sampled video was sampled from the generated video or from the real video.

The method may further include converting a caption into a one-hot vector through caption processing, and generating the caption embedding using the converted one-hot vector and the first time-series neural network.

The method may further include extracting the video feature of the previous frame by inputting the previous frame into a first convolutional neural network through video frame processing.

The method may further include sampling the noise from a random distribution.

The extracting of the sequential features may include inputting the caption embedding, the video feature of the previous frame, and the noise into a first long short-term memory (LSTM) network to extract the sequential features.

The generating of the real-like video may include inputting the extracted sequential features into a deconvolutional neural network to generate the next frame of the video.

The generating of the video may include sequentially generating at least one frame following the next frame by using sequential features newly extracted from the generated next frame, thereby generating the real-like video.

The determining may include inputting the sampled video into a second convolutional neural network to extract features of the sampled video.

The determining may include inputting the sampled video into a second LSTM network to extract sequential features.

The determining may include comparing the generated video with the video sampled from the real video to determine whether the generated video is identical to the video sampled from the real video.

Meanwhile, according to another embodiment of the present invention, there may be provided a text-to-video generation apparatus based on a time-series adversarial neural network, the apparatus including: a generator that inputs a caption embedding of text for video generation, a video feature of a previous frame, and noise into a first time-series neural network to extract sequential features, and generates a next frame from the extracted sequential features to generate a real-like video; and a discriminator that samples either a pre-stored real video or the generated video, and determines, using a second time-series neural network, whether the sampled video was sampled from the generated video or from the real video.

The generator may convert a caption into a one-hot vector through caption processing, and generate the caption embedding using the converted one-hot vector and the first time-series neural network.

The generator may extract the video feature of the previous frame by inputting the previous frame into a first convolutional neural network through video frame processing.

The generator may sample the noise from a random distribution.

The generator may extract the sequential features by inputting the caption embedding, the video feature of the previous frame, and the noise into a first long short-term memory (LSTM) network.

The generator may generate the next frame of the video by inputting the extracted sequential features into a deconvolutional neural network.

The generator may generate the real-like video by sequentially generating at least one frame following the next frame, using sequential features newly extracted from the generated next frame.

The discriminator may extract features of the sampled video by inputting the sampled video into a second convolutional neural network.

The discriminator may extract sequential features by inputting the sampled video into a second LSTM network.

The discriminator may compare the generated video with the video sampled from the real video to determine whether the generated video is identical to the video sampled from the real video.

Meanwhile, according to another embodiment of the present invention, there may be provided a computer-readable recording medium storing a program for causing a computer to execute a text-to-video generation method based on a time-series adversarial neural network, the program executing: inputting data, in which a caption embedding, a video feature of a previous frame, and noise are concatenated, into a first time-series neural network to extract sequential features; generating a next frame from the extracted sequential features to generate a real-like video; sampling either a pre-stored real video or the generated video; and determining, using a second time-series neural network, whether the sampled video was sampled from the generated video or from the real video.

Embodiments of the present invention can provide a video better matched to the text by generating the video, based on a time-series adversarial neural network, in consideration of the text features of the caption and the previous frame of each frame.

Embodiments of the present invention can generate a video that matches the text more closely and also has higher image quality by generating the video on the basis of the features of the text content.

By using a time-series neural network together with a two-dimensional convolutional network, embodiments of the present invention can generate a video in consideration of the content of the previous frame of each frame, which conventional three-dimensional convolutional networks do not consider.

FIG. 1 is a diagram for explaining a convolutional neural network applied to an embodiment of the present invention.
FIG. 2 is a diagram for explaining a recurrent neural network applied to an embodiment of the present invention.
FIG. 3 is a diagram for explaining a generative adversarial network applied to an embodiment of the present invention.
FIG. 4 is a block diagram illustrating the configuration of a text-to-video generation apparatus based on a time-series adversarial neural network according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating the configuration of the generator in a text-to-video generation apparatus according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating the configuration of the discriminator in a text-to-video generation apparatus according to an embodiment of the present invention.
FIGS. 7 and 8 are diagrams for explaining video frames generated by a text-to-video generation apparatus according to an embodiment of the present invention.
FIG. 9 is a block diagram illustrating the caption embedding generation process in a text-to-video generation method according to an embodiment of the present invention.
FIG. 10 is a block diagram illustrating the video feature extraction process in a text-to-video generation method according to an embodiment of the present invention.
FIG. 11 is a block diagram illustrating the video generation process in a text-to-video generation method according to an embodiment of the present invention.
FIG. 12 is a block diagram illustrating the video discrimination process in a text-to-video generation method according to an embodiment of the present invention.
FIG. 13 is a diagram for explaining experimental results of a text-to-video generation method based on a time-series adversarial neural network according to an embodiment of the present invention.

Because the present invention allows various modifications and may have several embodiments, specific embodiments are illustrated in the drawings and described in detail.

However, this is not intended to limit the present invention to the specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope.

Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be referred to as a second component, and similarly a second component may be referred to as a first component. The term "and/or" includes a combination of a plurality of related listed items or any one of the plurality of related listed items.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may exist in between. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no other component exists in between.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in this application.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. To facilitate overall understanding, the same reference numerals are used for the same components in the drawings, and redundant descriptions of the same components are omitted.

FIG. 1 is a diagram for explaining a convolutional neural network applied to an embodiment of the present invention.

Convolutional neural networks (CNNs) are widely used in image and video processing because they take the structural characteristics of image data into account. CNNs came into broad use after demonstrating excellent performance in image classification. Since then, they have proven reliable in many other tasks, including image processing and question answering problems, because the network structure makes it possible to learn the spatial characteristics of the data.

The input layer of a CNN receives two-dimensional data. The weights of a CNN act as filters on the input that learn the latent features present in the data. The layers that learn features are called convolutional layers. Each convolutional layer consists of several filters, each of which takes a window of predefined size and slides it over the image, covering a portion of the two-dimensional data at each position. After the convolution, a subsampling step reduces the size of the data. The most common subsampling technique is max pooling, in which only the most prominent features pass through the subsampling layer. For classification problems, the convolutional layers are typically followed by dense layers that learn how to classify the given data.

FIG. 1 shows an example of the convolution operation of a CNN. Here, P_n denotes each pixel of the image, and the letters A through I are the results of kernel estimation. In addition, h[i,j] is the output at a pixel for the given kernel.
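The sliding-window convolution and the max pooling step described above can be sketched as follows. This is a minimal illustration and not the patented implementation: the 6x6 image, the 3x3 averaging kernel, and the function names `conv2d_valid` and `max_pool2d` are all assumptions chosen for the example. As in most deep learning libraries, the kernel is applied without flipping (cross-correlation).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return h[i, j]."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    h = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]   # pixels under the kernel
            h[i, j] = np.sum(window * kernel)    # weighted sum for this position
    return h

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    rows = feature_map.shape[0] // size
    cols = feature_map.shape[1] // size
    trimmed = feature_map[:rows * size, :cols * size]
    return trimmed.reshape(rows, size, cols, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)  # illustrative stand-in for pixels P_n
kernel = np.ones((3, 3)) / 9.0                    # illustrative stand-in for values A..I
h = conv2d_valid(image, kernel)                   # shape (4, 4)
pooled = max_pool2d(h)                            # shape (2, 2)
```

Pooling halves each spatial dimension here, which is the size reduction the subsampling step refers to.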

FIG. 2 is a diagram for explaining a recurrent neural network applied to an embodiment of the present invention.

A recurrent neural network (RNN) is a structure, based on a regular neural network, that can learn sequential data. The main difference between a conventional neural network and a recurrent one is the recurrent connection, that is, the connection of a node across time. This connection learns the temporal dependencies of the data. Each neuron of an RNN has an input, an output, and two recurrent connections: one receives the previous state of the node, and the other sends the current state of the node to the next time step. Unlike feedforward neural networks, RNNs use internal memory to process arbitrary input sequences.
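The recurrent connection can be illustrated with a plain tanh recurrence. The sketch below is an assumption-laden example, not taken from the patent: the dimensions, the random weights, and the name `rnn_forward` are hypothetical.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over `inputs` (shape [T, input_dim]); return every hidden state."""
    hidden = np.zeros(W_hh.shape[0])          # initial state before the first time step
    states = []
    for x_t in inputs:
        # The recurrent connection feeds the previous state back into the node.
        hidden = np.tanh(W_xh @ x_t + W_hh @ hidden + b_h)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 5, 3, 4            # illustrative sizes
inputs = rng.normal(size=(T, input_dim))
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
states = rnn_forward(inputs, W_xh, W_hh, b_h)  # shape (5, 4)
```

Because the same weights are reused at every step, the loop works for any sequence length `T`.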

However, when a sufficiently long sequence is applied to an RNN, the gradient vanishes during backpropagation, and the weight updates for the beginning of the sequence can proceed very slowly. A neural network called the long short-term memory (LSTM) network addresses this problem of RNNs.

As shown in FIG. 2, the recurrent neural network 200 may include an input gate 210, a forget gate 220, a memory cell 230, an output gate 240, and an update gate 250.

The input gate 210 may represent the input vector for the current time step of the data. The forget gate 220 may represent a forget function that determines whether the information in the memory cell is kept. The memory cell 230 may represent a memory cell that stores the most representative sequential features of the data. The output gate 240 may represent a function that computes the output vector. The update gate 250 may represent a function that updates the cell memory using the input vector.

FIG. 2 shows the LSTM structure. A single LSTM node contains several functions and a memory cell. The LSTM controls whether the data in the cell is propagated or retained in place. The LSTM can also decide whether information should be carried to the next time step or simply forgotten.
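A single LSTM time step in the commonly used textbook formulation can be sketched as below. The mapping of the code variables to the gates named above (input, forget, output, and a candidate term playing the role of the update) is an assumption made for illustration; the patent does not give the gate equations, and the name `lstm_step` and the weight layout are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape [4*H, D+H]; row blocks: input, forget, output, candidate."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[0:H])            # input gate: how much new content to write
    f = sigmoid(z[H:2 * H])        # forget gate: how much of the old cell to keep
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much of the cell to expose
    g = np.tanh(z[3 * H:4 * H])    # candidate update computed from the input
    c_t = f * c_prev + i * g       # memory cell carried to the next time step
    h_t = o * np.tanh(c_t)         # output vector for this time step
    return h_t, c_t

D, H = 3, 2                        # illustrative input and hidden sizes
W = np.zeros((4 * H, D + H))       # zero weights make every gate equal 0.5
b = np.zeros(4 * H)
h_t, c_t = lstm_step(np.ones(D), np.zeros(H), np.ones(H), W, b)
```

With zero weights the forget gate is 0.5, so the cell keeps half of its previous content, which shows how the gate value decides between retaining and forgetting.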

FIG. 3 is a diagram for explaining a generative adversarial network applied to an embodiment of the present invention.

As shown in FIG. 3, a generative adversarial network (GAN) 300 includes two separate neural networks: a generator network 310 and a discriminator network 320.

The generator network 310 is a generative model that produces new data from a given data distribution. The generator network 310 focuses on generating samples from given noise or a given data distribution. These samples should belong to the same distribution as the source data.

The discriminator network 320 receives the generated data and attempts to distinguish whether the data is real or was produced by the generator network 310. Accordingly, the main task of the generator network 310 is to deceive the discriminator network 320, while the discriminator network 320 must correctly distinguish generated data from existing data.
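The opposing goals of the two networks are usually written as a pair of cross-entropy losses. The sketch below uses the widely used GAN formulation with the non-saturating generator loss; whether the patent uses exactly this form is not stated, so the functions `discriminator_loss` and `generator_loss` and the example scores are assumptions for illustration.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy: real samples should score near 1, generated samples near 0."""
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))

def generator_loss(d_fake, eps=1e-12):
    """Non-saturating variant: low when the discriminator scores generated samples as real."""
    return -np.mean(np.log(d_fake + eps))

d_real = np.array([0.9, 0.8])   # illustrative discriminator scores on real samples
d_fake = np.array([0.2, 0.1])   # illustrative discriminator scores on generated samples
loss_d = discriminator_loss(d_real, d_fake)
loss_g = generator_loss(d_fake)
```

The generator loss falls as `d_fake` rises, while the discriminator loss rises, which is the adversarial relationship described above.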

FIG. 4 is a block diagram illustrating the configuration of a text-to-video generation apparatus based on a time-series adversarial neural network according to an embodiment of the present invention.

As shown in FIG. 4, a text-to-video generation apparatus 400 based on a time-series adversarial neural network according to an embodiment of the present invention includes a generator 410 and a discriminator 420. Here, the generator 410 may include a first convolutional network 411, a first long short-term memory (LSTM) network, and a deconvolutional network 413. The discriminator 420 may include a second convolutional network 421 and a second LSTM network 422. However, not all of the illustrated components are essential. The text-to-video generation apparatus 400 may be implemented with more components than those illustrated, or with fewer.

Hereinafter, the specific configuration and operation of each component of the text-to-video generation apparatus 400 of FIG. 4 are described.

The generator 410 inputs a caption embedding of text for video generation, a video feature of a previous frame, and noise into a first time-series neural network to extract sequential features, and generates a next frame from the extracted sequential features to generate a real-like video. Here, the real-like video differs from a real video; it denotes the video that the generator 410 produces so as to resemble a real video as closely as possible. In this way, the generator 410 of the text-to-video generation apparatus 400 can generate video frames from a given caption using a recurrent time-series adversarial neural network. To this end, the generator 410 may extract the attributes of the sentence in the given caption and use those attributes to generate the frames of the video one by one through a convolutional neural network.

Here, the generator 410 may convert the caption into a one-hot vector through caption processing, and generate the caption embedding using the converted one-hot vector and the first time-series neural network.
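The one-hot conversion step can be sketched as follows. The whitespace tokenisation, the four-word vocabulary, and the function name `caption_to_one_hot` are assumptions for illustration; the patent does not specify how captions are tokenised. The sketch covers only the one-hot step, not the subsequent embedding by the time-series network.

```python
import numpy as np

def caption_to_one_hot(caption, vocab):
    """Map each token of `caption` to a one-hot row vector over `vocab`."""
    tokens = caption.lower().split()          # assumed tokenisation: lowercase, split on spaces
    one_hot = np.zeros((len(tokens), len(vocab)))
    for row, token in enumerate(tokens):
        one_hot[row, vocab[token]] = 1.0      # raises KeyError for out-of-vocabulary tokens
    return one_hot

vocab = {"a": 0, "person": 1, "is": 2, "walking": 3}   # hypothetical vocabulary
one_hot = caption_to_one_hot("A person is walking", vocab)
```

Each row has exactly one non-zero entry, so a caption of length T over a vocabulary of size V becomes a T x V matrix that can be fed to the network one row per time step.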

The generator 410 may extract the video feature of the previous frame by inputting the previous frame into the first convolutional neural network through video frame processing.

The generator 410 may sample the noise from a random distribution.

The generator 410 may extract the sequential features by inputting the caption embedding, the video feature of the previous frame, and the noise into the first long short-term memory (LSTM) network.
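The abstract describes the three inputs as concatenated data. A minimal sketch of building that per-step input is shown below; the vector sizes (16, 32, 8) and the use of a standard normal distribution for the noise are assumptions, since the patent only says the noise comes from a random distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
caption_embedding = rng.normal(size=16)    # placeholder for the caption embedding
prev_frame_feature = rng.normal(size=32)   # placeholder for the previous-frame feature
noise = rng.normal(size=8)                 # noise sampled from a standard normal (assumed)

# One vector per time step, ready to be fed to the recurrent network.
lstm_input = np.concatenate([caption_embedding, prev_frame_feature, noise])
```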

The generator 410 may generate the next frame of the video by inputting the extracted sequential features into the deconvolutional neural network.

The generator 410 may generate the real-like video by sequentially generating at least one frame following the next frame, using sequential features newly extracted from the generated next frame.
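The frame-by-frame loop of the generator can be sketched as below. Every component is a deliberately simplified stand-in: a linear projection replaces the convolutional network, a plain tanh recurrence replaces the LSTM, and another linear projection replaces the deconvolutional network. All weights are random, and the sizes, the blank starting frame, and the function names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME_SHAPE = (8, 8)
CAPTION_DIM, FEATURE_DIM, NOISE_DIM, HIDDEN_DIM = 16, 32, 8, 24
INPUT_DIM = CAPTION_DIM + FEATURE_DIM + NOISE_DIM
PIXELS = FRAME_SHAPE[0] * FRAME_SHAPE[1]

# Randomly initialised stand-ins; a trained model would learn these weights.
W_enc = rng.normal(scale=0.1, size=(FEATURE_DIM, PIXELS))
W_in = rng.normal(scale=0.1, size=(HIDDEN_DIM, INPUT_DIM))
W_rec = rng.normal(scale=0.1, size=(HIDDEN_DIM, HIDDEN_DIM))
W_dec = rng.normal(scale=0.1, size=(PIXELS, HIDDEN_DIM))

def encode_frame(frame):
    """Stand-in for the first convolutional network: frame -> feature vector."""
    return np.tanh(W_enc @ frame.ravel())

def recurrent_step(step_input, hidden):
    """Stand-in for the first LSTM network (a plain tanh recurrence here)."""
    return np.tanh(W_in @ step_input + W_rec @ hidden)

def decode_frame(hidden):
    """Stand-in for the deconvolutional network: sequential feature -> frame."""
    return np.tanh(W_dec @ hidden).reshape(FRAME_SHAPE)

def generate_video(caption_embedding, num_frames):
    frame = np.zeros(FRAME_SHAPE)              # assumed blank starting frame
    hidden = np.zeros(HIDDEN_DIM)
    frames = []
    for _ in range(num_frames):
        noise = rng.normal(size=NOISE_DIM)
        step_input = np.concatenate([caption_embedding, encode_frame(frame), noise])
        hidden = recurrent_step(step_input, hidden)
        frame = decode_frame(hidden)           # becomes the "previous frame" of the next step
        frames.append(frame)
    return np.stack(frames)

video = generate_video(rng.normal(size=CAPTION_DIM), num_frames=5)  # shape (5, 8, 8)
```

The key point is the feedback: each generated frame is encoded and fed back in, so every frame after the first depends on its predecessor as well as on the caption.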

Thereafter, the discriminator 420 samples either a pre-stored real video or the generated video, and determines, using a second time-series neural network, whether the sampled video was sampled from the video generated by the generator 410 or from the real video.
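The sampling step can be sketched as drawing from one of two pools and recording which pool was used, so that the label can serve as the training target of the discriminator. The equal 50/50 split, the string placeholders standing in for videos, and the function name `sample_video` are assumptions; the patent does not state the sampling probabilities.

```python
import random

def sample_video(real_videos, generated_videos, rng):
    """Pick one video from either pool; label 1 means real, 0 means generated."""
    if rng.random() < 0.5:                     # assumed equal chance for each pool
        return rng.choice(real_videos), 1
    return rng.choice(generated_videos), 0

rng = random.Random(0)
real_videos = ["real_0", "real_1"]             # placeholders for stored real videos
generated_videos = ["gen_0", "gen_1"]          # placeholders for generator outputs
samples = [sample_video(real_videos, generated_videos, rng) for _ in range(100)]
```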

여기서, 판별기(420)는, 샘플링된 비디오를 제2 컨벌루션 신경망에 입력시켜 상기 샘플링된 비디오 특징을 추출할 수 있다. Here, the determiner 420 may input the sampled video to the second convolutional neural network to extract the sampled video feature.

판별기(420)는, 샘플링된 비디오를 제2 LSTM 네트워크(422)에 입력시켜 순차적인 특징을 추출할 수 있다.The discriminator 420 may input the sampled video to the second LSTM network 422 to extract sequential features.

판별기(420)는, 생성된 비디오 및 실제 비디오로부터 샘플링된 비디오를 비교하여, 생성된 비디오가 실제 비디오로부터 샘플링된 비디오와 동일한지를 판단할 수 있다.The discriminator 420 may compare the generated video with the video sampled from the real video to determine whether the generated video is the same as the video sampled from the real video.

Next, the objective function used in an embodiment of the present invention is described. The text-video generating apparatus 400 according to an embodiment of the present invention may effectively learn scene dynamics based on a generative adversarial network (GAN) and generate video frames in the proper order. The text-video generating apparatus 400 may generate optimal video using a GAN architecture, which is perfectly suited to learning and generating data under a given distribution condition.

This is described in detail below. First, assume that a text data set T is sampled from a distribution D_T and that video data V is sampled from a distribution D_V. The text-video generating apparatus 400 learns the model that minimizes a combined loss. Here, the adversarial term of the combined loss is expressed as in [Equation 1] below.
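The image of [Equation 1] is not reproduced in this text. One plausible form, assuming the standard GAN objective, is sketched below. All symbols other than $D$, $x$, $D_T$ and $D_V$ are assumptions introduced for illustration: $G$ for the generator 410, $\mathcal{L}_{GAN}$ for the adversarial term, $t$ for a caption, and $z$ for the noise.

```latex
\mathcal{L}_{GAN} = \mathbb{E}_{x \sim D_V}\left[\log D(x)\right]
  + \mathbb{E}_{t \sim D_T,\, z}\left[\log\left(1 - D\left(G(t, z)\right)\right)\right]
```

In the standard formulation the discriminator is trained to increase this value while the generator is trained to decrease it.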

Here, D is a classifier that determines whether the data is a video sample from the original distribution, that is, real video, or a video sample generated by the generator 410, and x is an input data sample.

Next, the distance loss measures the distance between the generated video sample and the video sample from the original distribution. The distance loss is defined as in [Equation 2] below.

Here, the distance measure is the mean squared error. Other experiments were tried with different measures, such as the mean absolute error, but no noticeable difference in performance was observed. The distance loss used in an embodiment of the present invention is therefore not limited to a specific measure. The total loss thus consists of two terms.
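Since the equation images are not reproduced in this text, the distance loss with the mean squared error and the resulting two-term total loss could be written as follows. The notation is assumed for illustration: $G(t, z)$ is the video generated from caption $t$ and noise $z$, $x$ is the real video paired with $t$, and $N$ is the number of pixel values in a video. The unweighted sum of the two terms is also an assumption, because the text does not state any weighting.

```latex
\mathcal{L}_{dist} = \mathbb{E}_{(t, x)}\left[\frac{1}{N}\sum_{i=1}^{N}\left(x_i - G(t, z)_i\right)^{2}\right],
\qquad
\mathcal{L} = \mathcal{L}_{GAN} + \mathcal{L}_{dist}
```

Replacing the squared difference with an absolute difference gives the mean absolute error variant mentioned in the text.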

The first term, the adversarial loss, can bring the generated video samples closer to the original data distribution.

The second term, the distance loss, on the other hand, can correct the samples so that they look like real images.
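How the two terms combine on the generator side can be illustrated numerically. The sketch below assumes the standard GAN generator term log(1 - D(G(...))) and an unweighted sum with the mean squared error; the function names and toy values are hypothetical and do not come from the source.

```python
import math

def distance_loss(real, fake):
    # Mean squared error between a real and a generated sample.
    return sum((r - f) ** 2 for r, f in zip(real, fake)) / len(real)

def generator_loss(d_fake, real, fake):
    # Adversarial term (assumed standard form) plus the distance term.
    # d_fake is the discriminator's probability that the fake is real.
    return math.log(1.0 - d_fake) + distance_loss(real, fake)

real = [0.0, 0.5, 1.0, 0.5]
fake = [0.0, 0.5, 0.5, 0.5]
loss = generator_loss(d_fake=0.5, real=real, fake=fake)
```

The adversarial term falls as the discriminator is fooled more often, while the distance term falls as the generated sample approaches its paired real sample.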

Meanwhile, the time-series adversarial neural network based text-video generating apparatus 400 according to an embodiment of the present invention is described again as follows.

The text-video generating apparatus 400 may generate images by training a generative model on given randomly distributed data. The text-video generating apparatus 400 may generate video as well as images. The text-video generating apparatus 400 may train a time-series neural network using the intrinsic features of the text in the same way as the randomly distributed data.

The generator 410 of the text-video generating apparatus 400 may generate, with the time-series neural network, a characteristic hidden vector for each frame of the video, and may generate the frames of the video from the hidden vectors using a two-dimensional convolutional neural network.

At this time, the discriminator 420 may compare the video frames generated by the generator 410 with the video frames in the original data, that is, the real video, and judge whether the generated video frames are the same as the original data. The generator 410 therefore generates video that is as similar as possible to the original data, while the discriminator 420 distinguishes how similar the video generated by the generator 410 is. As a result, by learning the distribution of the original data, the generator 410 can generate real-like video.

FIG. 5 is a diagram illustrating the configuration of the generator in the text-video generating apparatus according to an embodiment of the present invention.

The generator 410 includes a first convolutional network 411, a first long short-term memory (LSTM) network, and a deconvolutional network 413. The generator 410 receives the caption of the video, the first frame of the video, and noise as inputs, and generates a second frame, which is the next frame of the video. Only the first frame of the video is provided to the generator 410. Thereafter, the generator 410 takes the second frame that it generated itself as input again.

The generator 410 includes a CNN, which learns the spatial features of a given frame, followed by an LSTM network (LSTM layer). The LSTM layer takes the features learned by the CNN and the caption as inputs and can learn the temporal or sequential features of the data.

In the next step, the learned sequential features are propagated to the deconvolutional network 413. The deconvolutional network 413 generates the second frame, that is, the next frame, from the propagated sequential features. In addition, the generator 410 uses residual connections between CNN layers of the same size. The residual connections help back-propagation through the structure proceed more quickly and allow meaningful features of the data to be shared directly with later layers. The generator 410 may take randomly distributed noise as input and generate a sequence of frames in order.
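The idea of a residual connection between same-size layers can be shown with a few lines of Python. This is a generic sketch, not the layer layout of the embodiment: `layer` is a hypothetical stand-in for one same-size CNN layer with a ReLU.

```python
def layer(x, weight):
    # Hypothetical stand-in for one same-size CNN layer with ReLU.
    return [max(0.0, weight * v) for v in x]

def residual_block(x, weight):
    # The skip path adds the input directly to the layer output,
    # so input features reach the next layer unchanged.
    return [a + b for a, b in zip(layer(x, weight), x)]

out = residual_block([1.0, -2.0, 3.0], weight=0.5)
```

Because the input is added back unchanged, the block only has to learn a correction to its input, which is the usual explanation for why such networks train faster.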

FIG. 6 is a diagram illustrating the configuration of the discriminator in the text-video generating apparatus according to an embodiment of the present invention.

The discriminator 420 includes a second convolutional network 421 (CNN) and a second LSTM network. The discriminator 420, however, has a single output that decides from which distribution, real video or generated video, the input data was drawn. The discriminator 420 takes the generated video frames and randomly shuffles them together with video samples from the real data distribution. The discriminator 420 may then select one video sample and classify which distribution the selected sample came from.

The text-video generating apparatus 400 may generate video from text in consideration of the three most important characteristics of the visualization problem. First, by optimizing the generator 410 with the distance loss, the text-video generating apparatus 400 can generate video that is meaningful and matches the given text. Second, the structure of the LSTM network can ensure that the video frames are generated in the proper order, so that the generated video cannot be distinguished from a real scene. Finally, the loss function of the text-video generating apparatus 400 optimizes the architecture so that the generated video data samples follow the same distribution as the original data; the text-video generating apparatus 400 can therefore make the distribution of the generated video match that of the real video. Here, the discriminator 420 serves as a guide for the generator 410: it can inform the generator 410 whether a video looks like real video.

FIGS. 7 and 8 are diagrams illustrating video frames generated by the text-video generating apparatus according to an embodiment of the present invention.

FIG. 7 shows an example in which the text-video generating apparatus 400 receives the caption "Person is 'handwaving'", the first frame of the video corresponding to that caption, and noise, and generates a video of a person waving a hand in accordance with the text 'handwaving'.

FIG. 8 shows an example in which the text-video generating apparatus 400 receives the caption "Person is 'walking' to the left", the first frame of the video corresponding to that caption, and noise, and generates a video of a person walking to the left in accordance with the text "'walking' to the left".

FIG. 9 is a diagram illustrating the caption embedding generation process in the text-video generation method according to an embodiment of the present invention.

In step S101, the text-video generating apparatus 400 processes the caption.

In step S102, the text-video generating apparatus 400 converts the caption into a one-hot vector.

In step S103, the text-video generating apparatus 400 generates the caption embedding using an LSTM.
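Steps S101 and S102 can be sketched as follows. The example assumes word-level tokens, lowercasing, and a small fixed vocabulary, none of which is specified in the source; the LSTM of step S103 is not shown.

```python
def one_hot_encode(caption, vocabulary):
    # Map each token to a vector with a single 1 at its vocabulary index.
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for token in caption.lower().split():
        vec = [0] * len(vocabulary)
        vec[index[token]] = 1
        vectors.append(vec)
    return vectors

vocab = ["person", "is", "handwaving", "walking"]  # hypothetical vocabulary
encoded = one_hot_encode("Person is handwaving", vocab)
```

The resulting sequence of one-hot vectors is what a recurrent network would consume, one vector per time step, to produce a fixed-size caption embedding.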

FIG. 10 is a diagram illustrating the video feature extraction process in the text-video generation method according to an embodiment of the present invention.

In step S201, the text-video generating apparatus 400 processes the video.

In step S202, the text-video generating apparatus 400 inputs the previous video frame into the CNN.

In step S203, the text-video generating apparatus 400 extracts video features using the CNN.

FIG. 11 is a diagram illustrating the video generation process in the text-video generation method according to an embodiment of the present invention.

In step S301, the text-video generating apparatus 400 obtains the video features, the caption embedding vector, and the noise.

In step S302, the text-video generating apparatus 400 concatenates the video features, the caption embedding vector, and the noise.

In step S303, the text-video generating apparatus 400 inputs the concatenated vector into the LSTM network to extract sequential features.

In step S304, the text-video generating apparatus 400 inputs the sequential features into the DCNN and generates the next video frame from the sequential features.

In step S305, after the generator 410 has been trained, the text-video generating apparatus 400 generates real-like video.
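Steps S301 and S302 amount to building one input vector from three parts. The following sketch assumes Gaussian noise and a hypothetical noise dimension of 4; the source only says that the noise is sampled from a random distribution.

```python
import random

def build_lstm_input(video_features, caption_embedding, rng, noise_dim=4):
    # S301: sample noise (a standard normal distribution is assumed here).
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    # S302: concatenate the three parts into one input vector.
    return video_features + caption_embedding + noise

x = build_lstm_input([0.1, 0.2, 0.3], [1.0, 0.0], random.Random(0))
```

The length of the concatenated vector is the sum of the three part lengths, which fixes the input size of the LSTM in step S303.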

FIG. 12 is a diagram illustrating the video discrimination process in the text-video generation method according to an embodiment of the present invention.

In step S401, the text-video generating apparatus 400 obtains the generated real-like video.

In step S402, the text-video generating apparatus 400 samples one video from either the pre-stored real video or the generated video.

In step S403, the text-video generating apparatus 400 inputs the sampled video into the CNN.

In step S404, the text-video generating apparatus 400 extracts video features using the CNN.

In step S405, the text-video generating apparatus 400 inputs the video features into the LSTM network to extract sequential features.

In step S406, the text-video generating apparatus 400 uses the sequential features to determine whether the sampled video was drawn from the generated video or from the real video.

In step S407, the text-video generating apparatus 400 updates the discriminator 420 and the generator 410.
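The sampling in step S402 and the bookkeeping needed to score the decision of step S406 can be sketched as below. The equal probability for the two sources and the label convention (1 for real, 0 for generated) are assumptions; the videos are represented by placeholder strings.

```python
import random

def sample_video(real_videos, generated_videos, rng):
    # S402: pick a source at random, then a video from it.
    # Label 1 marks a real video, label 0 a generated one.
    if rng.random() < 0.5:
        return rng.choice(real_videos), 1
    return rng.choice(generated_videos), 0

def accuracy(predictions, labels):
    # Fraction of sampled videos whose source was identified correctly.
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

rng = random.Random(0)
samples = [sample_video(["real_a", "real_b"], ["fake_a"], rng) for _ in range(10)]
labels = [label for _, label in samples]
```

Keeping the source label next to each sampled video is what allows the discriminator's decision to be checked and both networks to be updated in step S407.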

FIG. 13 is a diagram illustrating experimental results of the time-series adversarial neural network based text-video generation method according to an embodiment of the present invention.

The text-video generation method uses a time-series adversarial network capable of generating video frames from a given caption. First, the method extracts the sequential relationship between the frames in the video. The method may then concatenate the extracted features with the internal representation of the caption sentence and generate a sequence of video frames, one for each time step. To evaluate its predictions, the method is compared with different baseline methods on the KTH action data set.

The text-video generation method according to an embodiment of the present invention may use an architecture that generates video from a given caption rather than generating random video. The method can implement a model that addresses the generation process from three different perspectives. First, the method can check whether the generated video is an appropriate visualization of the text content of the given caption. Second, the method can generate the video frames sequentially. Third, the method can check whether the generated video frames resemble real video.

To solve the text visualization problem, the text-video generation method not only captures the relationship between the given caption and the video frames but also handles the correct order of the frames within the video. The method is based on a GAN, which shows high performance in generation, and on an LSTM structure, which can obtain the latent features of sequence data. Experiments are performed on the KTH video data set, in which each video is assigned a text caption. Because video generation from caption text has hardly been studied, the most popular models that can be applied to generating consecutive video frames from a given caption are selected and compared with an embodiment of the present invention. The text-video generation method shows the best performance on all three evaluation metrics, which demonstrates that it has learned all the necessary features of video generation.

First, the recurrent adversarial network is evaluated on the video generation task using three metrics.

Regarding the data set, the experiments used the KTH action data set, which contains six types of human actions (walking, jogging, running, boxing, hand waving, and hand clapping) performed several times by 25 subjects in four scenarios. The data set was split into a 70% training set (420 items) and a 30% test set (180 items). Each video is 64x64 pixels in size and 16 frames long.
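The split sizes quoted above follow from the data set structure, as a short check shows. The calculation assumes one item per combination of action, subject, and scenario, and one channel per pixel; neither is stated explicitly in the text.

```python
actions, subjects, scenarios = 6, 25, 4
total = actions * subjects * scenarios  # 600 items, one per combination (assumed)
train = total * 70 // 100               # 420 items for training
test = total - train                    # 180 items for testing
values_per_video = 64 * 64 * 16         # 65536 pixel values, one channel assumed
```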

The experimental setup is as follows. In the time-series adversarial neural network based text-video generation method, four convolutional and deconvolutional networks 413 were used for the generator 410 and the discriminator 420. Residual connections are implemented in the generator 410 from the third CNN layer to the sixth layer and from the fourth layer to the fifth layer, bypassing the LSTM layer. In the experiments, a network with residual connections could reconstruct the background of the video faster than a network without them. Each layer includes a ReLU nonlinearity. The output of the generator 410 has a hyperbolic tangent (tanh) nonlinearity, and the output of the discriminator 420 uses a softmax function.

Regarding the evaluation measures, three metrics are used to assess the quality of the generated samples. First, the peak signal-to-noise ratio (PSNR) between the real frame Y and the predicted frame Ŷ is computed. In the PSNR, the maximum term denotes the maximum possible squared value of the image intensities.
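The PSNR equation itself is not reproduced in this text. Its standard definition is given below for reference; $Y$ is the real frame and $\hat{Y}$ the predicted frame, while $N$ (the number of pixels per frame) and the notation $\max_{\hat{Y}}$ for the maximum possible intensity are assumptions.

```latex
\mathrm{PSNR}(Y, \hat{Y}) = 10 \log_{10}
  \frac{\max_{\hat{Y}}^{2}}{\frac{1}{N}\sum_{i=1}^{N}\left(Y_i - \hat{Y}_i\right)^{2}}
```

The denominator is the mean squared error between the two frames, so a smaller error gives a larger PSNR.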

The structural similarity index measure (SSIM) is also reported. SSIM ranges between -1 and 1, and a larger score means greater similarity between the two images. The last metric is the mean squared error, which measures the pixel-wise error between two samples.
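Two of the three metrics are simple enough to compute by hand. The sketch below uses the standard definitions of the mean squared error and PSNR on flattened frames; the maximum intensity of 255 assumes 8-bit images, which the source does not state, and SSIM is omitted because it needs windowed statistics.

```python
import math

def mse(real, predicted):
    # Mean squared error over corresponding pixel values.
    return sum((y - p) ** 2 for y, p in zip(real, predicted)) / len(real)

def psnr(real, predicted, max_value=255.0):
    # 10 * log10(max^2 / MSE); infinite when the frames are identical.
    error = mse(real, predicted)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)

real = [0.0, 128.0, 255.0, 64.0]
pred = [0.0, 128.0, 245.0, 64.0]
score = psnr(real, pred)
```

Lower mean squared error and higher PSNR both indicate that the predicted frame is closer to the real one.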

FIG. 13 shows a comparison of the videos generated from the KTH human action data set, with experimental results for three different models. No large-scale model for generating video from text exists against which the results of the text-video generation method according to an embodiment of the present invention could be compared. Two generative architectures were therefore newly implemented for comparison.

The first model is a 3D CNN GAN, which generates video from a caption using a three-dimensional convolutional network (3D CNN). The 3D CNN GAN generates the whole video at once, rather than frame by frame as an embodiment of the present invention does.

The second model is the DperFrame model. The network structure of the DperFrame model is the same as that of an embodiment of the present invention, except for the discriminator. Because the discriminator 420 of the DperFrame model has an output for each frame, it decides separately for each frame whether that frame is real.

The third model is the DperVideo model according to an embodiment of the present invention. Here, the discriminator 420 provides only one output per video.
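The difference between the two discriminator designs lies in the shape of the output. The sketch below is purely illustrative: the per-frame scores are made-up numbers, and averaging them is a hypothetical placeholder for the single output that the LSTM-based discriminator would produce for a whole video.

```python
def per_frame_decisions(frame_scores, threshold=0.5):
    # DperFrame-style: one real/fake decision for every frame.
    return [score > threshold for score in frame_scores]

def per_video_decision(frame_scores, threshold=0.5):
    # DperVideo-style: a single decision for the whole clip.
    # Averaging is a placeholder for the LSTM's final output.
    return sum(frame_scores) / len(frame_scores) > threshold

scores = [0.9, 0.8, 0.3, 0.7]
frame_out = per_frame_decisions(scores)
video_out = per_video_decision(scores)
```

A per-video output judges the clip as a sequence, so a single weak frame does not by itself change the verdict.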

As can be seen in FIG. 13, the text-video generating apparatus 400 according to an embodiment of the present invention outperforms the other approaches on all metrics. The difference is largest for the SSIM metric, which means that the video is reconstructed without significant noise and that the connections between frames are meaningful. This is attributable to the architecture of the video generation network. The generator 410 takes the caption and maps it to video frames frame by frame. The objective function provides appropriate feedback to the entire structure. The text-video generating apparatus 400 can therefore reconstruct video sequentially from text, with assurance that the content of the video matches the content of the text. Finally, the discriminator loss function of the text-video generating apparatus 400 can keep the distribution of the generated data the same as that of the original data.

Embodiments of the present invention relate to a conceptually new text-video generation method and apparatus obtained by modifying the GAN structure. The text-video generating apparatus 400 according to an embodiment of the present invention can optimize the video generation process from three different perspectives and learn all temporal and spatial latent features. The experiments show that an embodiment of the present invention achieves improved performance compared with the other approaches.

The time-series adversarial neural network based text-video generation method according to the embodiments of the present invention described above can be implemented as computer-readable code on a computer-readable recording medium. The method may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable recording medium.

A computer-readable recording medium may be provided that records a program for causing a computer to execute the time-series adversarial neural network based text-video generation method according to the embodiments of the present invention, the method including: inputting data in which a caption embedding, video features of a previous frame, and noise are concatenated into a first time-series neural network to extract sequential features; generating a next frame from the extracted sequential features to generate a real-like video; sampling one video from either a pre-stored real video or the generated video; and determining, using a second time-series neural network, whether the sampled video was sampled from the generated video or from the real video.

The computer-readable recording medium includes every kind of recording medium that stores data readable by a computer system. Examples include read-only memory (ROM), random access memory (RAM), magnetic tape, magnetic disks, flash memory, and optical data storage devices. The computer-readable recording medium may also be distributed over computer systems connected by a computer network, so that the code is stored and executed in a distributed manner.

Although the invention has been described above with reference to the drawings and embodiments, this does not mean that the scope of protection of the present invention is limited by those drawings or embodiments, and those skilled in the art will understand that the present invention can be modified and changed in various ways without departing from the spirit and scope of the invention set forth in the claims below.

Specifically, the described features may be implemented in digital electronic circuitry, or in computer hardware, firmware, or combinations thereof. The features may be implemented in a computer program product embodied in a machine-readable storage device for execution by a programmable processor. The features may be performed by a programmable processor executing a program of instructions that carries out the functions of the described embodiments by operating on input data and generating output. The described features may be implemented in one or more computer programs executable on a programmable system that includes at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. A computer program includes a set of instructions that can be used, directly or indirectly, in a computer to perform a certain activity or bring about a certain result. A computer program may be written in any form of programming language, including compiled or interpreted languages, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

지시어들의 프로그램의 실행을 위한 적합한 프로세서들은, 예를 들어, 범용 및 특수 용도 마이크로프로세서들 둘 모두, 및 단독 프로세서 또는 다른 종류의 컴퓨터의 다중 프로세서들 중 하나를 포함한다. 또한 설명된 특징들을 구현하는 컴퓨터 프로그램 지시어들 및 데이터를 구현하기 적합한 저장 디바이스들은 예컨대, EPROM, EEPROM, 및 플래쉬 메모리 디바이스들과 같은 반도체 메모리 디바이스들, 내부 하드 디스크들 및 제거 가능한 디스크들과 같은 자기 디바이스들, 광자기 디스크들 및 CD-ROM 및 DVD-ROM 디스크들을 포함하는 비휘발성 메모리의 모든 형태들을 포함한다. 프로세서 및 메모리는 ASIC들(application-specific integrated circuits) 내에서 통합되거나 또는 ASIC들에 의해 추가될 수 있다.Suitable processors for the execution of a program of instructions include, for example, both general purpose and special purpose microprocessors, and one of a single processor or multiple processors of another kind of computer. Computer program instructions and data storage devices suitable for implementing the described features are, for example, magnetic memory such as semiconductor memory devices, internal hard disks and removable disks such as EPROM, EEPROM, and flash memory devices. Devices, magneto-optical disks and all forms of non-volatile memory including CD-ROM and DVD-ROM disks. The processor and memory may be integrated in application-specific integrated circuits (ASICs) or added by ASICs.

Although the present invention has been described above on the basis of a series of functional blocks, it is not limited by the foregoing embodiments and the accompanying drawings. It will be apparent to those of ordinary skill in the art to which the present invention pertains that various substitutions, modifications, and changes are possible without departing from the technical spirit of the present invention.

Combinations of the foregoing embodiments are not limited to those described above; in addition to the foregoing embodiments, various other combinations may be provided depending on implementation and/or need.

In the foregoing embodiments, the methods are described on the basis of flowcharts as a series of steps or blocks, but the present invention is not limited to the order of the steps; some steps may occur in a different order from, or simultaneously with, other steps described above. In addition, those of ordinary skill in the art will understand that the steps shown in a flowchart are not exclusive, that other steps may be included, and that one or more steps of a flowchart may be deleted without affecting the scope of the present invention.

The foregoing embodiments include examples of various aspects. Although not every possible combination for representing the various aspects can be described, those of ordinary skill in the art will recognize that other combinations are possible. Accordingly, the present invention is intended to embrace all other replacements, modifications, and changes that fall within the scope of the following claims.

Although the present invention has been described above with reference to the drawings and embodiments, this does not mean that the scope of protection of the present invention is limited by those drawings or embodiments. Those skilled in the art will understand that the present invention may be variously modified and changed without departing from the spirit and scope of the present invention as set forth in the following claims.

200: recurrent neural network
210: input gate
220: forget gate
230: memory cell
240: output gate
250: update gate
310: generator network
320: discriminator network
400: text-to-video generation apparatus
410: generator
411: first convolutional network
412: first LSTM network
413: deconvolutional network
420: discriminator
421: second convolutional network
422: second LSTM network

Claims

A text-to-video generation method based on a time-series adversarial neural network, performed by a text-to-video generation apparatus, the method comprising:
extracting sequential features by inputting a caption embedding of text for video generation, video features of a previous frame, and noise into a first time-series neural network, which is a first long short-term memory (LSTM) network;
generating a real-like video by generating a next frame from the extracted sequential features;
sampling either a pre-stored real video or the generated video; and
determining, using a second time-series neural network, which is a second LSTM network, whether the sampled video was sampled from the generated video or from the real video,
wherein the generated next frame is input again to the first time-series neural network in order to generate a frame subsequent to the next frame, and
wherein the generating of the video comprises sequentially generating at least one frame subsequent to the next frame, using sequential features newly extracted from the re-input next frame, to generate the real-like video.
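The feedback loop recited in this claim can be illustrated with a minimal sketch. It is not the patented implementation: the class `TinyLSTMCell`, the function `generate_video`, the layer sizes, the untrained random weights, the Gaussian noise, and the linear stand-in decoder are all assumptions made for illustration, and each frame is represented directly by a small feature vector rather than by an image passed through a convolutional network.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMCell:
    """Hypothetical minimal LSTM cell with random, untrained weights."""

    def __init__(self, input_size, hidden_size, rng):
        cols = input_size + hidden_size
        # One weight matrix and bias per gate: input, forget, candidate, output.
        self.w = {g: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                      for _ in range(hidden_size)] for g in "ifgo"}
        self.b = {g: [0.0] * hidden_size for g in "ifgo"}

    def _affine(self, gate, xh):
        return [sum(w * v for w, v in zip(row, xh)) + b
                for row, b in zip(self.w[gate], self.b[gate])]

    def step(self, x, h, c):
        xh = x + h  # list concatenation of input and previous hidden state
        i = [sigmoid(v) for v in self._affine("i", xh)]
        f = [sigmoid(v) for v in self._affine("f", xh)]
        g = [math.tanh(v) for v in self._affine("g", xh)]
        o = [sigmoid(v) for v in self._affine("o", xh)]
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new


def generate_video(caption_embedding, first_frame_feature, num_frames,
                   noise_size=4, hidden_size=8, seed=0):
    """Generate frame features one at a time, feeding each one back in."""
    rng = random.Random(seed)
    feat_size = len(first_frame_feature)
    cell = TinyLSTMCell(len(caption_embedding) + feat_size + noise_size,
                        hidden_size, rng)
    # Stand-in decoder (assumption): linear map from hidden state to a frame.
    decoder = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
               for _ in range(feat_size)]
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    prev, frames = list(first_frame_feature), []
    for _ in range(num_frames):
        noise = [rng.gauss(0.0, 1.0) for _ in range(noise_size)]
        x = list(caption_embedding) + prev + noise  # concatenated input
        h, c = cell.step(x, h, c)                   # sequential feature
        frame = [math.tanh(sum(w * v for w, v in zip(row, h)))
                 for row in decoder]
        frames.append(frame)
        prev = frame  # the generated frame becomes the next step's input
    return frames
```

Calling `generate_video([0.1, 0.2, 0.3], [0.0] * 5, num_frames=4)` returns four frame vectors of length five. The point of the sketch is only the data flow: every generated frame is re-entered into the same recurrent cell to produce the frame after it.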

The method of claim 1,
further comprising converting a caption into a one-hot vector through caption processing, and generating the caption embedding using the converted one-hot vector and the first time-series neural network.
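As an illustration of the one-hot conversion only (the subsequent embedding through the first time-series neural network is not shown), a minimal sketch might look like the following. The helper names `build_vocab` and `caption_to_one_hot`, the lowercase whitespace tokenization, and the sorted vocabulary order are assumptions, not details taken from the patent.

```python
def build_vocab(captions):
    """Map each distinct word to an index (assumed: sorted, lowercase)."""
    words = sorted({w for c in captions for w in c.lower().split()})
    return {w: i for i, w in enumerate(words)}


def caption_to_one_hot(caption, vocab):
    """Return one one-hot vector per word of the caption."""
    vectors = []
    for word in caption.lower().split():
        vec = [0] * len(vocab)
        vec[vocab[word]] = 1  # raises KeyError for out-of-vocabulary words
        vectors.append(vec)
    return vectors


vocab = build_vocab(["a dog runs"])
print(caption_to_one_hot("dog runs", vocab))  # [[0, 1, 0], [0, 0, 1]]
```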

The method of claim 1,
further comprising extracting the video features of the previous frame by inputting the previous frame into a first convolutional neural network through video frame processing.
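The core operation of a convolutional feature extractor can be sketched as a single "valid" 2D convolution (strictly, a cross-correlation, as is usual in neural networks). The function name `conv2d_valid`, the single channel, and the example kernel are assumptions for illustration; a real first convolutional neural network would stack several learned layers.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image with no padding and stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


frame = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
print(conv2d_valid(frame, kernel))  # [[-4, -4], [-4, -4]]
```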

The method of claim 1,
further comprising sampling the noise from a random distribution.
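A minimal sketch of the noise sampling step follows. The claim only requires a random distribution; the standard normal distribution, the function name `sample_noise`, and the vector length used here are assumptions.

```python
import random


def sample_noise(size, rng=None):
    """Draw a noise vector from a standard normal distribution (assumed)."""
    if rng is None:
        rng = random.Random()
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


# A seeded generator makes the draw reproducible.
z = sample_noise(16, random.Random(0))
print(len(z))  # 16
```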

(Deleted)

The method of claim 1,
wherein the generating of the real-like video comprises
generating the next frame of the video by inputting the extracted sequential features into a deconvolutional neural network.
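The upsampling performed by a deconvolutional (transposed convolution) layer can be sketched in one dimension. The function name `conv_transpose_1d`, the stride of 2, and the all-ones kernel are assumptions chosen to keep the arithmetic easy to follow; the patent's network would operate on two-dimensional feature maps with learned kernels.

```python
def conv_transpose_1d(features, kernel, stride=2):
    """Spread each input value over the output, weighted by the kernel."""
    out_len = (len(features) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, value in enumerate(features):
        for k, weight in enumerate(kernel):
            out[i * stride + k] += value * weight
    return out


# Two input values become five output values; overlapping positions add up.
print(conv_transpose_1d([1.0, 2.0], [1.0, 1.0, 1.0]))
# [1.0, 1.0, 3.0, 2.0, 2.0]
```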

(Deleted)

The method of claim 1,
wherein the determining comprises
extracting features of the sampled video by inputting the sampled video into a second convolutional neural network.

(Deleted)

The method of claim 1,
wherein the determining comprises
comparing the generated video with a video sampled from the real video to determine whether the generated video is identical to the video sampled from the real video.

A text-to-video generation apparatus based on a time-series adversarial neural network, the apparatus comprising:
a generator configured to extract sequential features by inputting a caption embedding of text for video generation, video features of a previous frame, and noise into a first time-series neural network, which is a first long short-term memory (LSTM) network, and to generate a real-like video by generating a next frame from the extracted sequential features; and
a discriminator configured to sample either a pre-stored real video or the generated video, and
to determine, using a second time-series neural network, which is a second LSTM network, whether the sampled video was sampled from the generated video or from the real video,
wherein the generated next frame is input again to the first time-series neural network in order to generate a frame subsequent to the next frame, and
wherein the generator sequentially generates at least one frame subsequent to the next frame, using sequential features newly extracted from the re-input next frame, to generate the real-like video.
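The discriminator's sampling step and the training signal it yields can be sketched as follows. The function names `sample_video` and `bce_loss`, the 50/50 choice between the two pools, and the use of binary cross-entropy are assumptions; binary cross-entropy is the conventional discriminator loss in adversarial training, but this section of the patent does not state which loss is used.

```python
import math
import random


def sample_video(real_videos, generated_videos, rng):
    """Pick one video from either pool; label 1 = real, 0 = generated."""
    if rng.random() < 0.5:  # assumed: both pools are equally likely
        return rng.choice(real_videos), 1
    return rng.choice(generated_videos), 0


def bce_loss(prob_real, label, eps=1e-7):
    """Binary cross-entropy between the predicted probability and the label."""
    p = min(max(prob_real, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


rng = random.Random(0)
video, label = sample_video(["real_a", "real_b"], ["generated_a"], rng)
# A discriminator that outputs 0.5 for every video incurs a loss of log(2).
print(round(bce_loss(0.5, label), 4))  # 0.6931
```

In training, the discriminator would be updated to lower this loss while the generator would be updated to raise it on generated videos, which is what makes the two networks adversarial.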

The apparatus of claim 11,
wherein the generator
converts a caption into a one-hot vector through caption processing, and generates the caption embedding using the converted one-hot vector and the first time-series neural network.

The apparatus of claim 11,
wherein the generator
extracts the video features of the previous frame by inputting the previous frame into a first convolutional neural network through video frame processing.

The apparatus of claim 11,
wherein the generator
samples the noise from a random distribution.

(Deleted)

The apparatus of claim 11,
wherein the generator
generates the next frame of the video by inputting the extracted sequential features into a deconvolutional neural network.

(Deleted)

The apparatus of claim 11,
wherein the discriminator
extracts features of the sampled video by inputting the sampled video into a second convolutional neural network.

(Deleted)

The apparatus of claim 11,
wherein the discriminator
compares the generated video with a video sampled from the real video to determine whether the generated video is identical to the video sampled from the real video.

A computer-readable recording medium having recorded thereon a program for causing a computer to execute a text-to-video generation method based on a time-series adversarial neural network, the method comprising:
extracting sequential features by inputting data in which a caption embedding, video features of a previous frame, and noise are concatenated into a first time-series neural network, which is a first long short-term memory (LSTM) network;
generating a real-like video by generating a next frame from the extracted sequential features;
sampling either a pre-stored real video or the generated video; and
determining, using a second time-series neural network, which is a second LSTM network, whether the sampled video was sampled from the generated video or from the real video,
wherein the generated next frame is input again to the first time-series neural network in order to generate a frame subsequent to the next frame, and
wherein the generating of the video comprises sequentially generating at least one frame subsequent to the next frame, using sequential features newly extracted from the re-input next frame, to generate the real-like video.