JP7509403B2

JP7509403B2 - Synchronization device, synchronization method, program, and recording medium

Info

Publication number: JP7509403B2
Application number: JP2020036493A
Authority: JP
Inventors: 雄紀山田
Original assignee: NEC Solution Innovators Ltd
Current assignee: NEC Solution Innovators Ltd
Priority date: 2020-03-04
Filing date: 2020-03-04
Publication date: 2024-07-02
Anticipated expiration: 2040-03-04
Also published as: JP2021139992A

Description

本発明は、同期装置、同期方法、プログラム及び記録媒体に関する。 The present invention relates to a synchronization device, a synchronization method, a program, and a recording medium.

動画マスタから、映像と音声とを抽出し、前記映像中の人物や前記音声を、別の人物や音声に変換する技術が報告されている（例えば、特許文献１等）。 Technology has been reported that extracts video and audio from a video master and converts the person in the video and the audio into a different person and audio (for example, Patent Document 1, etc.).

特開２０００－１１２４８８号公報 JP 2000-112488 A

しかしながら、変換した映像及び音声を用いて再生すると、映像と音声との再生のタイミングに大きな乖離が生じるという問題がある。 However, when the converted video and audio are used for playback, there is a problem that a large discrepancy occurs in the timing of the playback of the video and audio.

そこで、本発明は、再生される映像と音声とのタイミングの乖離を抑制可能な同期装置、及び、同期方法の提供を目的とする。 The present invention aims to provide a synchronization device and a synchronization method that can reduce the discrepancy in timing between the video and audio being played.

前記目的を達成するために、本発明の同期装置は、
動画マスタ取得手段、抽出手段、フレーム群形成手段、分割手段、変換手段、及び再生手段を含み、
前記動画マスタ取得手段は、動画マスタを取得し、
前記抽出手段は、前記動画マスタから、再生時間に紐づけて、複数のフレームから構成される映像と音声とを抽出し、
前記フレーム群形成手段は、前記再生時間の再生開始時から前記音声が所定時間発生しなかった場合に、「前記音声が最初に発生した再生時間と紐づけられている前記フレーム」から「前記所定時間の経過後において最初に前記音声が発生した再生時間と紐づけられている前記フレームの直前のフレーム」までを一連のフレームとしてフレーム群を形成し、
前記分割手段は、「前記音声が最初に発生した再生時間」から「前記所定時間の経過後において最初に前記音声が発生する再生時間の直前の再生時間」までの前記音声を１単位として分割し、
前記変換手段は、機械学習により、前記音声の単位毎に前記音声を他の音声に変換し、且つ前記フレーム群毎に前記映像内の人物を他の人物に変換し、
前記再生手段は、前記フレーム群における最初のフレームの再生開始時間毎に、前記変換した音声の単位毎における最初の音声の再生開始時間を同期して再生する、装置である。 In order to achieve the above object, the synchronization device of the present invention comprises:
a video master acquisition means, an extraction means, a frame group formation means, a division means, a conversion means, and a playback means, wherein
the video master acquisition means acquires a video master,
the extraction means extracts, from the video master, audio and video composed of a plurality of frames, each linked to a playback time,
the frame group formation means, when the audio has not occurred for a predetermined time from the start of playback, forms a frame group as a series of frames from "the frame linked to the playback time at which the audio first occurred" to "the frame immediately preceding the frame linked to the playback time at which the audio first occurred after the predetermined time has elapsed",
the division means divides off, as one unit, the audio from "the playback time at which the audio first occurs" to "the playback time immediately preceding the playback time at which the audio first occurs after the predetermined time has elapsed",
the conversion means, by machine learning, converts the audio into other audio for each unit of the audio, and converts a person in the video into another person for each frame group, and
the playback means performs playback such that, at each playback start time of the first frame in a frame group, the playback start time of the first audio in the corresponding converted audio unit is synchronized with it.

本発明の同期方法は、
動画マスタ取得工程、抽出工程、フレーム群形成工程、分割工程、変換工程、及び再生工程を含み、
前記動画マスタ取得工程は、動画マスタを取得し、
前記抽出工程は、前記動画マスタから、再生時間に紐づけて、複数のフレームから構成される映像と音声とを抽出し、
前記フレーム群形成工程は、前記再生時間の再生開始時から前記音声が所定時間発生しなかった場合に、「前記音声が最初に発生した再生時間と紐づけられている前記フレーム」から「前記所定時間の経過後において最初に前記音声が発生した再生時間と紐づけられている前記フレームの直前のフレーム」までを一連のフレームとしてフレーム群を形成し、
前記分割工程は、「前記音声が最初に発生した再生時間」から「前記所定時間の経過後において最初に前記音声が発生する再生時間の直前の再生時間」までの前記音声を１単位として分割し、
前記変換工程は、機械学習により、前記音声の単位毎に前記音声を他の音声に変換し、且つ前記フレーム群毎に前記映像内の人物を他の人物に変換し、
前記再生工程は、前記フレーム群における最初のフレームの再生開始時間毎に、前記変換した音声の単位毎における最初の音声の再生開始時間を同期して再生する、方法である。 The synchronization method of the present invention comprises the steps of:
a video master acquisition step, an extraction step, a frame group formation step, a division step, a conversion step, and a playback step, wherein
the video master acquisition step acquires a video master,
the extraction step extracts, from the video master, audio and video composed of a plurality of frames, each linked to a playback time,
the frame group formation step, when the audio has not occurred for a predetermined time from the start of playback, forms a frame group as a series of frames from "the frame linked to the playback time at which the audio first occurred" to "the frame immediately preceding the frame linked to the playback time at which the audio first occurred after the predetermined time has elapsed",
the division step divides off, as one unit, the audio from "the playback time at which the audio first occurs" to "the playback time immediately preceding the playback time at which the audio first occurs after the predetermined time has elapsed",
the conversion step, by machine learning, converts the audio into other audio for each unit of the audio, and converts a person in the video into another person for each frame group, and
the playback step performs playback such that, at each playback start time of the first frame in a frame group, the playback start time of the first audio in the corresponding converted audio unit is synchronized with it.

本発明のプログラムは、
コンピュータに、動画マスタ取得手順、抽出手順、フレーム群形成手順、分割手順、変換手順、及び再生手順を含む手順を実行させるためのプログラムであって、
前記動画マスタ取得手順は、動画マスタを取得し、
前記抽出手順は、前記動画マスタから、再生時間に紐づけて、複数のフレームから構成される映像と音声とを抽出し、
前記フレーム群形成手順は、前記再生時間の再生開始時から前記音声が所定時間発生しなかった場合に、「前記音声が最初に発生した再生時間と紐づけられている前記フレーム」から「前記所定時間の経過後において最初に前記音声が発生した再生時間と紐づけられている前記フレームの直前のフレーム」までを一連のフレームとしてフレーム群を形成し、
前記分割手順は、「前記音声が最初に発生した再生時間」から「前記所定時間の経過後において最初に前記音声が発生する再生時間の直前の再生時間」までの前記音声を１単位として分割し、
前記変換手順は、機械学習により、前記音声の単位毎に前記音声を他の音声に変換し、且つ前記フレーム群毎に前記映像内の人物を他の人物に変換し、
前記再生手順は、前記フレーム群における最初のフレームの再生開始時間毎に、前記変換した音声の単位毎における最初の音声の再生開始時間を同期して再生する、プログラムである。 The program of the present invention is
a program for causing a computer to execute procedures including a video master acquisition procedure, an extraction procedure, a frame group formation procedure, a division procedure, a conversion procedure, and a playback procedure, wherein
the video master acquisition procedure acquires a video master,
the extraction procedure extracts, from the video master, audio and video composed of a plurality of frames, each linked to a playback time,
the frame group formation procedure, when the audio has not occurred for a predetermined time from the start of playback, forms a frame group as a series of frames from "the frame linked to the playback time at which the audio first occurred" to "the frame immediately preceding the frame linked to the playback time at which the audio first occurred after the predetermined time has elapsed",
the division procedure divides off, as one unit, the audio from "the playback time at which the audio first occurs" to "the playback time immediately preceding the playback time at which the audio first occurs after the predetermined time has elapsed",
the conversion procedure, by machine learning, converts the audio into other audio for each unit of the audio, and converts a person in the video into another person for each frame group, and
the playback procedure performs playback such that, at each playback start time of the first frame in a frame group, the playback start time of the first audio in the corresponding converted audio unit is synchronized with it.

本発明によれば、再生される映像と音声とのタイミングの乖離を抑制することができる。 The present invention makes it possible to reduce the discrepancy in timing between the video and audio being played.

図１は、実施形態１の装置の一例の構成を示すブロック図である。 FIG. 1 is a block diagram showing an example of the configuration of the device of Embodiment 1.
図２は、実施形態１の装置のハードウエア構成の一例を示すブロック図である。 FIG. 2 is a block diagram showing an example of the hardware configuration of the device of Embodiment 1.
図３は、実施形態１の装置における処理の一例を示すフローチャートである。 FIG. 3 is a flowchart showing an example of processing in the device of Embodiment 1.
図４は、実施形態１の装置におけるフレーム群形成手段及び分割手段における処理の一例を示す模式図である。 FIG. 4 is a schematic diagram showing an example of processing by the frame group formation means and the division means in the device of Embodiment 1.
図５は、実施形態１の装置が選択手段を含む場合の使用の一例を示す模式図である。 FIG. 5 is a schematic diagram showing an example of use of the device of Embodiment 1 when it includes a selection means.

本発明の同期装置は、例えば、さらに、選択手段を含み、
前記選択手段は、モデル人物及びモデル音声が格納されたモデルデータベースから、任意のモデル人物及びモデル音声の少なくとも一方を選択し、
前記変換手段は、さらに、前記映像内の人物及び前記音声を、前記選択したモデル人物及びモデル音声の少なくとも一方に変換し、
前記再生手段は、前記変換したモデル人物及びモデル音声の少なくとも一方を用いて、前記再生を実行する、という態様であってもよい。 The synchronization device of the present invention further includes, for example, a selection means,
The selection means selects at least one of a model person and a model voice from a model database in which model people and model voices are stored,
The conversion means further converts the person and the voice in the video into at least one of the selected model person and model voice,
The reproducing means may execute the reproduction using at least one of the converted model person and the converted model voice.

本発明の同期装置において、例えば、前記再生手段は、前記変換したモデル人物と、前記変換したモデル音声とを合成してから再生する、という態様であってもよい。 In the synchronization device of the present invention, for example, the playback means may synthesize the converted model person and the converted model voice and then play them back.

本発明の同期方法は、例えば、さらに、選択工程を含み、
前記選択工程は、モデル人物及びモデル音声が格納されたモデルデータベースから、任意のモデル人物及びモデル音声の少なくとも一方を選択し、
前記変換工程は、さらに、前記映像内の人物及び前記音声を、前記選択したモデル人物及びモデル音声の少なくとも一方に変換し、
前記再生工程は、前記変換したモデル人物及びモデル音声の少なくとも一方を用いて、前記再生を実行する、という態様であってもよい。 The synchronization method of the present invention further includes, for example, a selection step,
The selection step includes selecting at least one of a model person and a model voice from a model database in which model people and model voices are stored,
The conversion step further includes converting the person and the voice in the video into at least one of the selected model person and model voice,
The reproducing step may be performed using at least one of the converted model person and the converted model voice.

本発明の同期方法において、例えば、前記再生工程は、前記変換したモデル人物と、前記変換したモデル音声とを合成してから再生する、という態様であってもよい。 In the synchronization method of the present invention, for example, the playback step may involve synthesizing the converted model person and the converted model voice before playing them back.

本発明のプログラムは、例えば、さらに、選択手順を含み、
前記選択手順は、モデル人物及びモデル音声が格納されたモデルデータベースから、任意のモデル人物及びモデル音声の少なくとも一方を選択し、
前記変換手順は、さらに、前記映像内の人物及び前記音声を、前記選択したモデル人物及びモデル音声の少なくとも一方に変換し、
前記再生手順は、前記変換したモデル人物及びモデル音声の少なくとも一方を用いて、前記再生を実行する、という態様であってもよい。 The program of the present invention may further include, for example, a selection procedure, wherein
the selection procedure selects at least one of an arbitrary model person and an arbitrary model voice from a model database in which model persons and model voices are stored,
the conversion procedure further converts the person in the video and the audio into at least one of the selected model person and the selected model voice, and
the playback procedure performs the playback using at least one of the converted model person and the converted model voice.

本発明のプログラムにおいて、例えば、前記再生手順は、前記変換したモデル人物と、前記変換したモデル音声とを合成してから再生する、という態様であってもよい。 In the program of the present invention, for example, the playback procedure may combine the converted model person and the converted model voice and then play them back.

本発明の記録媒体は、本発明のプログラムを記録しているコンピュータ読み取り可能な記録媒体である。 The recording medium of the present invention is a computer-readable recording medium on which the program of the present invention is recorded.

本発明の適用分野は、特に制限されず、動画を視聴する分野であれば適用可能である。特に、本発明は、教育機関（学校、予備校等）等の教育支援の分野や、講演会やセミナー等のイベント分野において、有用である。 The application field of the present invention is not particularly limited, and it can be applied to any field where videos are viewed. In particular, the present invention is useful in the field of educational support in educational institutions (schools, preparatory schools, etc.) and in the field of events such as lectures and seminars.

本発明において、「マスタ」とは、マスターデータを意味する。 In this invention, "master" means master data.

本発明の実施形態について図を用いて説明する。本発明は、以下の実施形態には限定されない。以下の各図において、同一部分には、同一符号を付している。また、各実施形態の説明は、特に言及がない限り、互いの説明を援用でき、各実施形態の構成は、特に言及がない限り、組合せ可能である。 The following describes an embodiment of the present invention with reference to the drawings. The present invention is not limited to the following embodiment. In each of the drawings, the same parts are given the same reference numerals. Furthermore, the explanations of each embodiment can be mutually incorporated unless otherwise specified, and the configurations of each embodiment can be combined unless otherwise specified.

［実施形態１］
図１は、本実施形態の同期装置１０の一例の構成を示すブロック図である。図１に示すように、本装置１０は、動画マスタ取得手段１１、抽出手段１２、フレーム群形成手段１３、分割手段１４、変換手段１５、及び再生手段１６を含む。 [Embodiment 1]
FIG. 1 is a block diagram showing an example of the configuration of a synchronization device 10 according to the present embodiment. As shown in FIG. 1, the device 10 includes a video master acquisition means 11, an extraction means 12, a frame group formation means 13, a division means 14, a conversion means 15, and a playback means 16.

本装置１０は、例えば、前記各部を含む１つの装置でもよいし、前記各部が、通信回線網を介して接続可能な装置でもよい。また、本装置１０は、前記通信回線網を介して、後述する外部装置と接続可能である。前記通信回線網は、特に制限されず、公知のネットワークを使用でき、例えば、有線でも無線でもよい。前記通信回線網は、例えば、インターネット回線、ＷＷＷ（ＷｏｒｌｄＷｉｄｅＷｅｂ）、電話回線、ＬＡＮ（ＬｏｃａｌＡｒｅａＮｅｔｗｏｒｋ）、ＳＡＮ（ＳｔｏｒａｇｅＡｒｅａＮｅｔｗｏｒｋ）、ＤＴＮ（ＤｅｌａｙＴｏｌｅｒａｎｔＮｅｔｗｏｒｋｉｎｇ）等があげられる。無線通信としては、例えば、ＷｉＦｉ（ＷｉｒｅｌｅｓｓＦｉｄｅｌｉｔｙ）、Ｂｌｕｅｔｏｏｔｈ（登録商標）等が挙げられる。前記無線通信としては、各装置が直接通信する形態（ＡｄＨｏｃ通信）、アクセスポイントを介した間接通信のいずれであってもよい。本装置１０は、例えば、システムとしてサーバに組み込まれていてもよい。また、本装置１０は、例えば、本発明のプログラムがインストールされたパーソナルコンピュータ（ＰＣ、例えば、デスクトップ型、ノート型）、スマートフォン、タブレット端末等であってもよい。さらに、本装置１０は、例えば、動画マスタ取得手段１１、抽出手段１２、フレーム群形成手段１３、分割手段１４、及び変換手段１５がサーバ上にあり、再生手段１６がユーザ端末上にあるような、クラウドコンピューティングの形態であってもよい。 The device 10 may be, for example, a single device including each of the above-mentioned parts, or a device to which each of the above-mentioned parts can be connected via a communication line network. The device 10 may also be connected to an external device described later via the communication line network. The communication line network is not particularly limited, and any known network may be used, for example, wired or wireless. Examples of the communication line network include the Internet line, WWW (World Wide Web), telephone line, LAN (Local Area Network), SAN (Storage Area Network), and DTN (Delay Tolerant Networking). Examples of wireless communication include WiFi (Wireless Fidelity) and Bluetooth (registered trademark). The wireless communication may be either a form in which each device communicates directly (Ad Hoc communication) or an indirect communication via an access point. The device 10 may be incorporated into a server as a system. The device 10 may be, for example, a personal computer (PC, for example, desktop type or notebook type) in which the program of the present invention is installed, a smartphone, a tablet terminal, or the like. 
Furthermore, the device 10 may be in the form of cloud computing, for example, in which the video master acquisition means 11, the extraction means 12, the frame group formation means 13, the division means 14, and the conversion means 15 are on a server, and the playback means 16 is on a user terminal.

図２に、本装置１０のハードウエア構成のブロック図を例示する。本装置１０は、例えば、中央演算装置（ＣＰＵ，ＧＰＵ等）１０１、メモリ１０２、バス１０３、記憶装置１０４、入力装置１０５、表示装置１０６、通信デバイス１０７等を有する。本装置１０のハードウエア構成の各部は、それぞれのインタフェース（Ｉ／Ｆ）により、バス１０３を介して相互に接続されている。 Figure 2 shows an example block diagram of the hardware configuration of the device 10. The device 10 has, for example, a central processing unit (CPU, GPU, etc.) 101, memory 102, bus 103, storage device 104, input device 105, display device 106, communication device 107, etc. Each part of the hardware configuration of the device 10 is connected to each other via the bus 103 by their respective interfaces (I/F).

中央演算装置（中央処理装置）１０１は、本装置１０の全体の制御を担う。本装置１０において、中央演算装置１０１により、例えば、本発明のプログラムやその他のプログラムが実行され、また、各種情報の読み込みや書き込みが行われる。具体的には、例えば、中央演算装置１０１が、動画マスタ取得手段１１、抽出手段１２、フレーム群形成手段１３、分割手段１４、変換手段１５、及び再生手段１６として機能する。 The central processing unit (Central Processing Unit) 101 is responsible for the overall control of the device 10. In the device 10, the central processing unit 101 executes, for example, the program of the present invention and other programs, and also reads and writes various information. Specifically, for example, the central processing unit 101 functions as a video master acquisition means 11, an extraction means 12, a frame group formation means 13, a division means 14, a conversion means 15, and a playback means 16.

バス１０３は、例えば、外部装置とも接続できる。前記外部装置は、例えば、外部データベース、プリンター、記憶装置等があげられる。本装置１０は、例えば、バス１０３に接続された通信デバイス１０７により、前記通信回線網に接続でき、前記通信回線網を介して、外部装置と接続することもできる。 The bus 103 can also be connected to, for example, an external device. Examples of the external device include an external database, a printer, a storage device, etc. The present device 10 can be connected to the communication line network, for example, by a communication device 107 connected to the bus 103, and can also be connected to an external device via the communication line network.

メモリ１０２は、例えば、メインメモリ（主記憶装置）が挙げられる。中央演算装置１０１が処理を行う際には、例えば、後述する記憶装置１０４に記憶されている本発明のプログラム等の種々の動作プログラムを、メモリ１０２が読み込み、中央演算装置１０１は、メモリ１０２からデータを受け取って、プログラムを実行する。前記メインメモリは、例えば、ＲＡＭ（ランダムアクセスメモリ）である。また、メモリ１０２は、例えば、ＲＯＭ（読み出し専用メモリ）であってもよい。 The memory 102 may be, for example, a main memory (primary storage device). When the central processing unit 101 performs processing, the memory 102 reads various operating programs, such as the program of the present invention, stored in the storage device 104 described below, and the central processing unit 101 receives data from the memory 102 and executes the programs. The main memory may be, for example, a RAM (random access memory). The memory 102 may also be, for example, a ROM (read only memory).

記憶装置１０４は、例えば、前記メインメモリ（主記憶装置）に対して、いわゆる補助記憶装置ともいう。前述のように、記憶装置１０４には、本発明のプログラムを含む動作プログラムが格納されている。記憶装置１０４は、例えば、記録媒体と、記録媒体に読み書きするドライブとの組合せであってもよい。前記記録媒体は、特に制限されず、例えば、内蔵型でも外付け型でもよく、ＨＤ（ハードディスク）、ＣＤ－ＲＯＭ、ＣＤ－Ｒ、ＣＤ－ＲＷ、ＭＯ、ＤＶＤ、フラッシュメモリー、メモリーカード等が挙げられる。記憶装置１０４は、例えば、記録媒体とドライブとが一体化されたハードディスクドライブ（ＨＤＤ）、及びソリッドステートドライブ（ＳＳＤ）であってもよい。 The storage device 104 is also referred to as an auxiliary storage device, for example, in contrast to the main memory. As described above, the storage device 104 stores operating programs including the program of the present invention. The storage device 104 may be, for example, a combination of a recording medium and a drive that reads and writes from the recording medium. The recording medium is not particularly limited, and may be, for example, an internal or external type, such as a HD (hard disk), CD-ROM, CD-R, CD-RW, MO, DVD, flash memory, memory card, etc. The storage device 104 may be, for example, a hard disk drive (HDD) in which the recording medium and the drive are integrated, or a solid state drive (SSD).

本装置１０において、メモリ１０２及び記憶装置１０４は、管理者からのアクセス情報及びログ情報、並びに、外部データベース（図示せず）から取得した情報を記憶することも可能である。 In the device 10, the memory 102 and storage device 104 can also store access information and log information from an administrator, as well as information obtained from an external database (not shown).

本装置１０は、例えば、さらに、入力装置１０５、表示装置１０６を有する。入力装置１０５は、例えば、タッチパネル、キーボード、マウス等である。表示装置１０６は、例えば、ＬＥＤディスプレイ、液晶ディスプレイ等が挙げられる。 The device 10 further includes, for example, an input device 105 and a display device 106. The input device 105 is, for example, a touch panel, a keyboard, a mouse, etc. The display device 106 is, for example, an LED display, a liquid crystal display, etc.

つぎに、本実施形態の同期方法の一例を、図３のフローチャートに基づき説明する。本実施形態の同期方法は、例えば、図１の同期装置１０を用いて、次のように実施する。なお、本実施形態の同期方法は、図１の同期装置１０の使用には限定されない。 Next, an example of the synchronization method of this embodiment will be described based on the flowchart in FIG. 3. The synchronization method of this embodiment is implemented, for example, as follows using the synchronization device 10 in FIG. 1. Note that the synchronization method of this embodiment is not limited to the use of the synchronization device 10 in FIG. 1.

まず、動画マスタ取得手段１１により、動画マスタを取得する（Ｓ１）。前記動画マスタは、特に制限されず、例えば、講義や講演中の講師や講演者を撮像した動画、アニメーション等の人工的に作成された動画等の動画である。前記取得の形式は、特に制限されず、例えば、通信デバイス１０７を介して、外部の撮像装置（カメラ等）が撮像した前記動画を取得してもよい。また、本装置１０が、さらに、撮像手段を含み、前記動画を撮像することで、取得してもよい。前記撮像手段は、例えば、カメラ等の撮像装置によって機能する。 First, the video master acquisition means 11 acquires a video master (S1). The video master is not particularly limited and may be, for example, a video of a lecturer or speaker captured during a lecture or talk, or an artificially created video such as an animation. The manner of acquisition is not particularly limited; for example, a video captured by an external imaging device (such as a camera) may be acquired via the communication device 107. Alternatively, the device 10 may further include an imaging means and acquire the video by capturing it. The imaging means is implemented by, for example, an imaging device such as a camera.

次に、抽出手段１２により、前記動画マスタから、再生時間に紐づけて、複数のフレームから構成される映像と音声とを抽出する（Ｓ２）。以下、前記抽出した映像と音声とを、それぞれ、映像マスタ及び音声マスタともいう。前記映像マスタ及び前記音声マスタは、例えば、メモリ１０２及び記憶装置１０４等に保存されてもよい。前記映像マスタ及び前記音声マスタの抽出は、例えば、公知技術（例えば、ＯｐｅｎＣＶ、ＦＦｍｐｅｇ等）を用いて抽出できる。 Next, the extraction means 12 extracts video and audio consisting of multiple frames from the video master, linked to the playback time (S2). Hereinafter, the extracted video and audio are also referred to as the video master and audio master, respectively. The video master and audio master may be stored in, for example, the memory 102 and the storage device 104. The video master and audio master can be extracted, for example, using publicly known technology (for example, OpenCV, FFmpeg, etc.).
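As a rough illustration of step S2, the following Python sketch builds an FFmpeg command line that separates the audio track from a video master, and computes the playback time linked to each frame. The file names, the 16 kHz mono WAV output, the constant frame rate, and the helper names `audio_extract_cmd` and `frame_times` are assumptions made for illustration; the description itself only names FFmpeg and OpenCV as usable tools.

```python
from typing import List


def audio_extract_cmd(master_path: str, wav_path: str, sample_rate: int = 16000) -> List[str]:
    """Build an FFmpeg command that writes the audio track as mono PCM WAV."""
    return [
        "ffmpeg", "-y",            # overwrite the output file if it exists
        "-i", master_path,         # input: the video master
        "-vn",                     # drop the video stream
        "-acodec", "pcm_s16le",    # uncompressed 16-bit PCM
        "-ar", str(sample_rate),   # output sample rate (assumed value)
        "-ac", "1",                # mono
        wav_path,
    ]


def frame_times(frame_count: int, fps: float) -> List[float]:
    """Playback time in seconds linked to each frame index.

    Assumes a constant frame rate; variable-frame-rate masters would need
    per-frame timestamps read from the container instead.
    """
    return [i / fps for i in range(frame_count)]
```

The command list can be passed to `subprocess.run(cmd, check=True)` on a machine where FFmpeg is installed; the frame times give the "playback time" key that the later steps use to relate frames to audio.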

次に、フレーム群形成手段１３により、前記再生時間の再生開始時から前記音声が所定時間発生しなかった場合に、「前記音声が最初に発生した再生時間と紐づけられている前記フレーム」から「前記所定時間の経過後において最初に前記音声が発生した再生時間と紐づけられている前記フレームの直前のフレーム」までを一連のフレームとしてフレーム群を形成する（Ｓ３）。前記所定時間は、特に制限されず、任意に設定できる。「前記音声が最初に発生した再生時間」の「最初に」とは、例えば、前記所定時間の経過後からカウントした「最初」でもよいし、前記音声マスタの全体からカウントした「最初」でもよい。また、前記「直前のフレーム」とは、例えば、最初に前記音声が流れた再生時間と紐づけられている前記フレームの１つ前のフレームである。具体的には、後述する。 Next, if the sound does not occur for a predetermined time from the start of playback of the playback time, the frame group forming means 13 forms a frame group as a series of frames from "the frame linked to the playback time at which the sound first occurred" to "the frame immediately preceding the frame linked to the playback time at which the sound first occurred after the passage of the predetermined time" (S3). The predetermined time is not particularly limited and can be set arbitrarily. The "first" in "the playback time at which the sound first occurred" may be, for example, the "first" counted from the passage of the predetermined time, or the "first" counted from the entire sound master. The "immediately preceding frame" is, for example, the frame immediately preceding the frame linked to the playback time at which the sound was first played. Specific details will be described later.
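One possible reading of step S3 can be sketched as follows. The sketch assumes that a per-frame flag (`voiced`) already indicates whether audio occurs at the playback time linked to each frame; how that flag is obtained, the function name `form_frame_groups`, and the handling of leading silence and of the final group are assumptions, not details given in the description.

```python
from typing import List, Optional, Tuple


def form_frame_groups(voiced: List[bool], fps: float,
                      min_silence_s: float) -> List[Tuple[int, int]]:
    """Return (first_frame, last_frame) index pairs, both inclusive.

    A group starts at a frame where audio begins and runs up to the frame
    just before the next audio onset that follows at least `min_silence_s`
    seconds of silence (the "predetermined time").
    """
    # Convert the predetermined time to frames; at least one frame.
    min_silence_frames = max(1, int(round(min_silence_s * fps)))
    groups: List[Tuple[int, int]] = []
    start: Optional[int] = None   # first frame of the group being built
    silent_run = 0                # consecutive silent frames seen so far
    for i, has_audio in enumerate(voiced):
        if has_audio:
            if start is None:
                start = i                      # very first audio onset
            elif silent_run >= min_silence_frames:
                groups.append((start, i - 1))  # close at the preceding frame
                start = i                      # new group at this onset
            silent_run = 0
        else:
            silent_run += 1
    if start is not None:
        groups.append((start, len(voiced) - 1))  # last group runs to the end
    return groups
```

With this reading, a group contains both the frames where audio occurs and the silent frames that follow, and pauses shorter than the predetermined time do not split a group.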

次に、分割手段１４により、「前記音声が最初に発生した再生時間」から「前記所定時間の経過後において最初に前記音声が発生する再生時間の直前の再生時間」までの前記音声を１単位として分割する（Ｓ４）。「前記音声が最初に発生した再生時間」の「最初に」とは、例えば、前記所定時間の経過後からカウントした「最初」でもよいし、前記音声マスタの全体からの再生時間からカウントした「最初」でもよい。また、前記「直前の音声」とは、例えば、最初に前記音声が発生する再生時間の任意の時間（例えば、１秒、０．５秒等）前の再生時間である。具体的には、後述する。なお、前記工程（Ｓ４）は、前記工程（Ｓ３）の前に処理してもよいし、前記工程（Ｓ３）と並行して処理してもよい。 Next, the division means 14 divides off, as one unit, the audio from "the playback time at which the audio first occurs" to "the playback time immediately preceding the playback time at which the audio first occurs after the predetermined time has elapsed" (S4). The "first" in "the playback time at which the audio first occurs" may be, for example, the first counted from the point at which the predetermined time has elapsed, or the first counted over the playback time of the entire audio master. The "immediately preceding playback time" is, for example, the playback time that precedes the playback time at which the audio first occurs by an arbitrary interval (for example, 1 second or 0.5 seconds). Specific details are described later. Note that step (S4) may be performed before step (S3) or in parallel with step (S3).
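Step S4 can be sketched in the same spirit on the time axis. The sketch below assumes the audio master has already been reduced to a sorted list of (onset, offset) intervals, in seconds, during which audio occurs; the interval detection, the function name `split_audio_units`, and the use of half-open intervals to approximate "the immediately preceding playback time" are assumptions for illustration.

```python
from typing import List, Tuple


def split_audio_units(voiced_intervals: List[Tuple[float, float]],
                      total_s: float,
                      min_silence_s: float) -> List[Tuple[float, float]]:
    """Return (start, end) times of audio units as half-open intervals.

    A unit starts at an audio onset and ends at the next onset that follows
    a silence of at least `min_silence_s` seconds, so each unit holds both
    the audible part and the silence after it.
    """
    units: List[Tuple[float, float]] = []
    if not voiced_intervals:
        return units
    unit_start = voiced_intervals[0][0]
    prev_offset = voiced_intervals[0][1]
    for onset, offset in voiced_intervals[1:]:
        if onset - prev_offset >= min_silence_s:
            units.append((unit_start, onset))  # close the unit at the new onset
            unit_start = onset
        prev_offset = offset
    units.append((unit_start, total_s))        # last unit runs to the end
    return units
```

Because the unit boundaries are derived from the same onsets that delimit the frame groups, each audio unit corresponds to exactly one frame group, which is what the later synchronization relies on.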

次に、変換手段１５により、機械学習により、前記音声の単位毎に前記音声を他の音声に変換し、且つ前記フレーム群毎に前記映像内の人物を他の人物に変換する（Ｓ５）。以下、前記変換した音声を「変換済み音声」ともいい、前記変換した映像を「変換済み映像」ともいう。前記他の音声は、特に制限されず、例えば、歌手、声優、芸能人、著名人等の実在する人物の声でもよいし、コンピュータによって人工的に生成された合成音声、任意のキャラクターの声等でもよい。前記他の人物は、特に制限されず、例えば、歌手、声優、芸能人、著名人等の実在する人物でもよいし、キャラクター、人工的に生成された人物等でもよい。前記機械学習は、例えば、深層学習であり、前記変換の方法を自動的に学習する。前記変換の方法は、特に制限されず、例えば、前記音声の場合は、StyarGAN-VC、VQ-VAE、Voice Conversion Using Input-to-Output Highway Networks、NSF法、deep_VoiceChanger、become-yukarin等の方法があり、前記映像の場合は、talking-head-anime、Everybody Dance Now等の方法がある。具体的には、例えば、大学の講義を撮像した動画の場合、前記映像マスタにおいて、講義をしている教授を任意のキャラクターに変換して、前記任意のキャラクターが講義しているように学習させる。また、前記音声マスタにおいて、例えば、前記講義をしている教授の音声を任意の人物に変換して、前記任意の人物が講義しているように学習させる。これにより、例えば、後述の再生手段１６において、任意のキャラクターが任意の人物の声で講義しているように再生される。なお、前記他の音声と前記他の人物とは、対応関係がなくともいい。具体的には、例えば、前記変換済み映像における前記任意のキャラクターと、前記変換済み音声における前記任意のキャラクターとの声は、異なっていてもよい。 Next, the conversion means 15 converts the voice into another voice for each unit of the voice by machine learning, and converts the person in the video into another person for each group of frames (S5). Hereinafter, the converted voice is also referred to as "converted voice", and the converted video is also referred to as "converted video". The other voice is not particularly limited, and may be, for example, the voice of a real person such as a singer, voice actor, entertainer, or famous person, or may be a synthetic voice artificially generated by a computer, or the voice of any character. The other person is not particularly limited, and may be, for example, a real person such as a singer, voice actor, entertainer, or famous person, or may be a character, an artificially generated person, etc. The machine learning is, for example, deep learning, and automatically learns the method of the conversion. 
The conversion method is not particularly limited. For audio, examples include StarGAN-VC, VQ-VAE, Voice Conversion Using Input-to-Output Highway Networks, the NSF method, deep_VoiceChanger, and become-yukarin; for video, examples include talking-head-anime and Everybody Dance Now. Specifically, for a video of a university lecture, for example, the professor giving the lecture in the video master is converted into an arbitrary character, and the model is trained so that the character appears to be giving the lecture. Likewise, in the audio master, the voice of the professor giving the lecture is converted into the voice of an arbitrary person, and the model is trained so that this person appears to be giving the lecture. As a result, the playback means 16 described later plays back the video as if the arbitrary character were lecturing in the voice of the arbitrary person. Note that the other voice and the other person need not correspond to each other. For example, the arbitrary character in the converted video and the character whose voice is used in the converted audio may be different.

そして、再生手段１６により、前記フレーム群における最初のフレームの再生開始時間毎に、前記変換した音声の単位毎における最初の音声の再生開始時間を同期して再生し（Ｓ６）、終了する（ＥＮＤ）。 Then, the playback means 16 synchronizes and plays back the playback start time of the first audio in each converted audio unit with the playback start time of the first frame in the frame group (S6), and ends (END).
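The synchronization in step S6 can be illustrated by assembling a single audio track in which each converted audio unit begins exactly when its frame group begins. In this sketch, audio is represented as plain lists of samples, and a converted unit that is longer than its slot is trimmed while a shorter one is followed by silence; this trimming and padding policy and the function name `assemble_synced_audio` are assumptions, since the description only requires that the start times be synchronized.

```python
from typing import List


def assemble_synced_audio(group_starts_s: List[float],
                          units: List[List[float]],
                          total_s: float,
                          sample_rate: int) -> List[float]:
    """Place each converted audio unit at the start time of its frame group."""
    if len(group_starts_s) != len(units):
        raise ValueError("one converted audio unit is expected per frame group")
    total = int(round(total_s * sample_rate))
    track = [0.0] * total                      # silence by default
    for k, (start_s, unit) in enumerate(zip(group_starts_s, units)):
        begin = int(round(start_s * sample_rate))
        # The slot ends where the next group begins, or at the end of the track.
        if k + 1 < len(units):
            end = int(round(group_starts_s[k + 1] * sample_rate))
        else:
            end = total
        clipped = unit[: max(0, end - begin)]  # trim overrun; shortfall stays silent
        track[begin:begin + len(clipped)] = clipped
    return track
```

Re-anchoring the audio at every frame group start means that any length drift introduced by the conversion is confined to one unit instead of accumulating over the whole video.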

図４に、フレーム群形成手段１３及び分割手段１４の処理の一例を示す。図４において、上から下方向に向かって、再生時間が進行するものとする。また、音声マスタ及び変換済み音声の系列において、円形で示す箇所は、音声が発生していることを示し、映像マスタ及び変換済み映像の系列において、一部のフレームを省略している。図４に示すように、フレーム群形成手段１３は、「前記音声が最初に発生した再生時間Ａと紐づけられているフレーム１ａ」から「前記所定時間の経過後において最初に前記音声が発生した再生時間Ａと紐づけられている前記フレームの直前のフレーム１ｂ」までを一連のフレームとしてフレーム群を形成する。分割手段１４は、「前記音声が最初に発生した再生時間Ａ」から「前記所定時間の経過後において最初に前記音声が発生する再生時間Ａの直前の再生時間Ｂ」までの前記音声を１単位として分割する。このように、前記音声の１単位には、音声が発生していない時間と音声が発生している時間を含む。そして、図４に示すように、変換手段１５により、それぞれ変換し、再生手段１６により、前記変換済み音声と前記変換済み映像とを合わせて動画（同期済み動画ともいう）として再生する。 Figure 4 shows an example of the processing of the frame group forming means 13 and the division means 14. In Figure 4, the playback time progresses from top to bottom. In addition, in the series of audio master and converted audio, the parts shown in circles indicate that audio is occurring, and in the series of video master and converted video, some frames are omitted. As shown in Figure 4, the frame group forming means 13 forms a frame group as a series of frames from "frame 1a linked to playback time A at which the audio first occurred" to "frame 1b immediately before the frame linked to playback time A at which the audio first occurred after the predetermined time has elapsed". The division means 14 divides the audio from "playback time A at which the audio first occurred" to "playback time B immediately before playback time A at which the audio first occurred after the predetermined time has elapsed" as one unit. In this way, one unit of audio includes time when no audio is occurring and time when audio is occurring. Then, as shown in FIG. 4, the conversion means 15 converts each of them, and the playback means 16 combines the converted audio and the converted video to play back as a video (also called a synchronized video).

さらに、本装置１０は、例えば、選択手段を含んでもよい。前記選択手段は、例えば、前記工程（Ｓ５）の前に、モデル人物及びモデル音声が格納されたモデルデータベースから、任意のモデル人物及びモデル音声の少なくとも一方を選択する。この場合、変換手段１５は、さらに、前記映像内の人物及び前記音声を、前記選択したモデル人物及びモデル音声の少なくとも一方に変換する。本装置１０は、例えば、前記通信回線網を介して、前記モデルデータベースと通信可能である。前記モデル人物は、特に制限されず、例えば、実在する人物（歌手、声優、芸能人、著名人等）でもよいし、キャラクターでもよいし、コンピュータによって生成された架空の人物等でもよい。前記モデル音声は、特に制限されず、例えば、実在する人物（歌手、声優、芸能人、著名人等）の音声でもよいし、キャラクターの音声でもよいし、コンピュータによって人工的に生成された合成音声でもよい。そして、再生手段１６により、前記変換したモデル人物及びモデル音声の少なくとも一方を用いて、前記再生を実行してもよい。また、再生手段１６は、例えば、前記変換したモデル人物と、前記変換したモデル音声とを合成してから前記再生を実行してもよい。図５に、本装置１０が前記選択手段を含む場合の使用の一例を示す。図５において、本装置１０は、ＰＣとして示す。 Furthermore, the present device 10 may include, for example, a selection means. For example, the selection means selects at least one of an arbitrary model person and a model voice from a model database in which a model person and a model voice are stored before the step (S5). In this case, the conversion means 15 further converts the person in the video and the voice into at least one of the selected model person and model voice. The present device 10 can communicate with the model database, for example, via the communication line network. The model person is not particularly limited, and may be, for example, a real person (singer, voice actor, entertainer, famous person, etc.), a character, or a fictional character generated by a computer. The model voice is not particularly limited, and may be, for example, the voice of a real person (singer, voice actor, entertainer, famous person, etc.), the voice of a character, or a synthetic voice artificially generated by a computer. Then, the reproduction means 16 may perform the reproduction using at least one of the converted model person and model voice. In addition, the playback means 16 may, for example, synthesize the converted model person and the converted model voice before performing the playback. FIG. 5 shows an example of use of the device 10 when the device 10 includes the selection means. In FIG. 5, the device 10 is shown as a PC.
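The selection means can be pictured as a lookup against the model database that accepts a model person, a model voice, or both. The dictionary layout of the database, the key names `"persons"` and `"voices"`, and the function name `select_models` are hypothetical; the description does not specify how the model database is structured.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical in-memory stand-in for the model database.
ModelDatabase = Dict[str, Set[str]]


def select_models(db: ModelDatabase,
                  person: Optional[str] = None,
                  voice: Optional[str] = None) -> Tuple[Optional[str], Optional[str]]:
    """Validate a selection of a model person and/or a model voice.

    At least one of the two must be chosen; an unselected item is returned
    as None so the conversion step can leave that modality as it is.
    """
    if person is None and voice is None:
        raise ValueError("select at least one of a model person and a model voice")
    if person is not None and person not in db["persons"]:
        raise KeyError(f"unknown model person: {person}")
    if voice is not None and voice not in db["voices"]:
        raise KeyError(f"unknown model voice: {voice}")
    return person, voice
```

For example, `select_models(db, voice="voice_y")` would request that only the audio be converted to the chosen model voice.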

本実施形態の同期装置１０によれば、前記フレーム群毎に、前記変換済み音声の再生開始時間を合わせることで、再生される映像と音声とのタイミングの乖離を抑制することができる。また、本装置１０によれば、例えば、再生した動画を違和感なく視聴することができる。このため、ユーザは、動画形式の講義や講演等の視聴に集中することができる。さらに、本装置１０によれば、ユーザの好みに適した前記人物及び前記音声に変換可能であるため、より集中することができたり、動画の視聴が楽しくなるという効果がある。また、本装置１０によれば、動画マスタ内の人物及び音声を変換可能であるため、前記動画マスタ内の人物に関する情報を秘匿可能であり、前記動画マスタ内の人物が不特定多数の人物にさらされるリスクを低減することができる。 According to the synchronization device 10 of this embodiment, by synchronizing the playback start time of the converted audio for each frame group, it is possible to suppress the discrepancy in timing between the played video and audio. Furthermore, according to the present device 10, for example, the played video can be viewed without any sense of discomfort. This allows the user to concentrate on watching a lecture or speech in video format. Furthermore, according to the present device 10, since the person and the audio can be converted to suit the user's preferences, it has the effect of allowing the user to concentrate more and making watching the video more enjoyable. Furthermore, according to the present device 10, since the person and audio in the video master can be converted, it is possible to keep information about the person in the video master secret, and the risk of the person in the video master being exposed to an unspecified number of people can be reduced.

［実施形態２］ [Embodiment 2]
本実施形態のプログラムは、本発明の方法の各工程を、手順として、コンピュータに実行させるためのプログラムである。本発明において、「手順」は、「処理」と読み替えてもよい。また、本実施形態のプログラムは、例えば、コンピュータ読み取り可能な記録媒体に記録されていてもよい。前記記録媒体としては、特に限定されず、例えば、読み出し専用メモリ（ＲＯＭ）、ハードディスク（ＨＤ）、光ディスク等が挙げられる。 The program of the present embodiment is a program for causing a computer to execute each step of the method of the present invention as a procedure. In the present invention, "procedure" may be read as "processing." The program of the present embodiment may be recorded, for example, on a computer-readable recording medium. The recording medium is not particularly limited, and examples thereof include a read-only memory (ROM), a hard disk (HD), and an optical disk.

以上、実施形態を参照して本発明を説明したが、本発明は、上記実施形態に限定されるものではない。本発明の構成や詳細には、本発明のスコープ内で当業者が理解しうる様々な変更をできる。 The present invention has been described above with reference to the embodiments, but the present invention is not limited to the above embodiments. Various modifications that can be understood by a person skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.

本発明によれば、再生される映像と音声とのタイミングの乖離を抑制することができる。このため、本発明は、例えば、動画形式での授業を行う学校や予備校等の教育施設や、動画形式での講演を行う講演会やセミナー等のイベントにおいて、特に有用である。 The present invention can reduce the discrepancy in timing between the video and audio being played back. For this reason, the present invention is particularly useful in educational facilities such as schools and preparatory schools where lessons are given in video format, and in events such as lectures and seminars where lectures are given in video format.

１０同期装置
１１動画マスタ取得手段
１２抽出手段
１３フレーム群形成手段
１４分割手段
１５変換手段
１６再生手段

10 Synchronization device
11 Video master acquisition means
12 Extraction means
13 Frame group formation means
14 Division means
15 Conversion means
16 Playback means

Claims

動画マスタ取得手段、抽出手段、フレーム群形成手段、分割手段、変換手段、及び再生手段を含み、
前記動画マスタ取得手段は、動画マスタを取得し、
前記抽出手段は、前記動画マスタから、再生タイミングに紐づけて、複数のフレームから構成される映像と音声とを抽出し、
前記フレーム群形成手段は、再生期間の再生開始から前記音声が所定時間発生しなかった場合に、「前記音声が最初に発生した再生タイミングと紐づけられている前記フレーム」から「前記所定時間の経過後において最初に前記音声が発生した再生タイミングと紐づけられている前記フレームの直前のフレーム」までを一連のフレームとしてフレーム群を形成し、
前記分割手段は、「前記音声が最初に発生した再生タイミング」から「前記所定時間の経過後において最初に前記音声が発生する再生タイミングの直前の再生タイミング」までの前記音声を１単位として分割し、
前記変換手段は、機械学習により、前記音声の単位毎に前記音声を他の音声に変換し、且つ前記フレーム群毎に前記映像内の人物を他の人物に変換し、
前記再生手段は、前記フレーム群における最初のフレームの再生開始時間毎に、前記変換した音声の単位毎における最初の音声の再生開始時間を同期して再生する、同期装置。 A synchronization device comprising a video master acquisition means, an extraction means, a frame group formation means, a division means, a conversion means, and a playback means, wherein
the video master acquisition means acquires a video master,
the extraction means extracts, from the video master, audio and video composed of a plurality of frames, in association with playback timing,
the frame group formation means, when the audio has not occurred for a predetermined time from the start of playback of the playback period, forms a frame group as a series of frames from "the frame associated with the playback timing at which the audio first occurred" to "the frame immediately preceding the frame associated with the playback timing at which the audio first occurred after the predetermined time has elapsed",
the division means divides the audio into units, one unit being the audio from "the playback timing at which the audio first occurred" to "the playback timing immediately preceding the playback timing at which the audio first occurs after the predetermined time has elapsed",
the conversion means converts, by machine learning, the audio into other audio for each unit of the audio, and converts a person in the video into another person for each frame group, and
the playback means performs playback by synchronizing, for each playback start time of the first frame in a frame group, the playback start time of the first audio in each unit of the converted audio.

さらに、選択手段を含み、
前記選択手段は、モデル人物及びモデル音声が格納されたモデルデータベースから、任意のモデル人物及びモデル音声の少なくとも一方を選択し、
前記変換手段は、さらに、前記映像内の人物及び前記音声を、前記選択したモデル人物及びモデル音声の少なくとも一方に変換し、
前記再生手段は、前記変換したモデル人物及びモデル音声の少なくとも一方を用いて、前記再生を実行する、請求項１記載の同期装置。 2. The synchronization device according to claim 1, further comprising a selection means, wherein
the selection means selects at least one of an arbitrary model person and an arbitrary model voice from a model database in which model persons and model voices are stored,
the conversion means further converts the person in the video and the audio into at least one of the selected model person and model voice, and
the playback means executes the playback using at least one of the converted model person and model voice.

前記再生手段は、前記変換したモデル人物と、前記変換したモデル音声とを合成してから再生する、請求項２記載の同期装置。 The synchronization device according to claim 2, wherein the playback means synthesizes the converted model person and the converted model voice before playing them back.

動画マスタ取得工程、抽出工程、フレーム群形成工程、分割工程、変換工程、及び再生工程を含み、
前記動画マスタ取得工程は、動画マスタを取得し、
前記抽出工程は、前記動画マスタから、再生タイミングに紐づけて、複数のフレームから構成される映像と音声とを抽出し、
前記フレーム群形成工程は、再生期間の再生開始から前記音声が所定時間発生しなかった場合に、「前記音声が最初に発生した再生タイミングと紐づけられている前記フレーム」から「前記所定時間の経過後において最初に前記音声が発生した再生タイミングと紐づけられている前記フレームの直前のフレーム」までを一連のフレームとしてフレーム群を形成し、
前記分割工程は、「前記音声が最初に発生した再生タイミング」から「前記所定時間の経過後において最初に前記音声が発生する再生タイミングの直前の再生タイミング」までの前記音声を１単位として分割し、
前記変換工程は、機械学習により、前記音声の単位毎に前記音声を他の音声に変換し、且つ前記フレーム群毎に前記映像内の人物を他の人物に変換し、
前記再生工程は、前記フレーム群における最初のフレームの再生開始時間毎に、前記変換した音声の単位毎における最初の音声の再生開始時間を同期して再生する、同期方法。 A synchronization method comprising a video master acquisition step, an extraction step, a frame group formation step, a division step, a conversion step, and a playback step, wherein
the video master acquisition step acquires a video master,
the extraction step extracts, from the video master, audio and video composed of a plurality of frames, in association with playback timing,
the frame group formation step, when the audio has not occurred for a predetermined time from the start of playback of the playback period, forms a frame group as a series of frames from "the frame associated with the playback timing at which the audio first occurred" to "the frame immediately preceding the frame associated with the playback timing at which the audio first occurred after the predetermined time has elapsed",
the division step divides the audio into units, one unit being the audio from "the playback timing at which the audio first occurred" to "the playback timing immediately preceding the playback timing at which the audio first occurs after the predetermined time has elapsed",
the conversion step converts, by machine learning, the audio into other audio for each unit of the audio, and converts a person in the video into another person for each frame group, and
the playback step performs playback by synchronizing, for each playback start time of the first frame in a frame group, the playback start time of the first audio in each unit of the converted audio.

さらに、選択工程を含み、
前記選択工程は、モデル人物及びモデル音声が格納されたモデルデータベースから、任意のモデル人物及びモデル音声の少なくとも一方を選択し、
前記変換工程は、さらに、前記映像内の人物及び前記音声を、前記選択したモデル人物及びモデル音声の少なくとも一方に変換し、
前記再生工程は、前記変換したモデル人物及びモデル音声の少なくとも一方を用いて、前記再生を実行する、請求項４記載の同期方法。 5. The synchronization method according to claim 4, further comprising a selection step, wherein
the selection step selects at least one of an arbitrary model person and an arbitrary model voice from a model database in which model persons and model voices are stored,
the conversion step further converts the person in the video and the audio into at least one of the selected model person and model voice, and
the playback step executes the playback using at least one of the converted model person and model voice.

前記再生工程は、前記変換したモデル人物と、前記変換したモデル音声とを合成してから再生する、請求項５記載の同期方法。 The synchronization method according to claim 5, wherein the playback step synthesizes the converted model person and the converted model voice before playing them.

コンピュータに、動画マスタ取得手順、抽出手順、フレーム群形成手順、分割手順、変換手順、及び再生手順を含む手順を実行させるためのプログラムであって、
前記動画マスタ取得手順は、動画マスタを取得し、
前記抽出手順は、前記動画マスタから、再生タイミングに紐づけて、複数のフレームから構成される映像と音声とを抽出し、
前記フレーム群形成手順は、再生期間の再生開始から前記音声が所定時間発生しなかった場合に、「前記音声が最初に発生した再生タイミングと紐づけられている前記フレーム」から「前記所定時間の経過後において最初に前記音声が発生した再生タイミングと紐づけられている前記フレームの直前のフレーム」までを一連のフレームとしてフレーム群を形成し、
前記分割手順は、「前記音声が最初に発生した再生タイミング」から「前記所定時間の経過後において最初に前記音声が発生する再生タイミングの直前の再生タイミング」までの前記音声を１単位として分割し、
前記変換手順は、機械学習により、前記音声の単位毎に前記音声を他の音声に変換し、且つ前記フレーム群毎に前記映像内の人物を他の人物に変換し、
前記再生手順は、前記フレーム群における最初のフレームの再生開始時間毎に、前記変換した音声の単位毎における最初の音声の再生開始時間を同期して再生する、プログラム。 A program for causing a computer to execute procedures including a video master acquisition procedure, an extraction procedure, a frame group formation procedure, a division procedure, a conversion procedure, and a playback procedure, wherein
the video master acquisition procedure acquires a video master,
the extraction procedure extracts, from the video master, audio and video composed of a plurality of frames, in association with playback timing,
the frame group formation procedure, when the audio has not occurred for a predetermined time from the start of playback of the playback period, forms a frame group as a series of frames from "the frame associated with the playback timing at which the audio first occurred" to "the frame immediately preceding the frame associated with the playback timing at which the audio first occurred after the predetermined time has elapsed",
the division procedure divides the audio into units, one unit being the audio from "the playback timing at which the audio first occurred" to "the playback timing immediately preceding the playback timing at which the audio first occurs after the predetermined time has elapsed",
the conversion procedure converts, by machine learning, the audio into other audio for each unit of the audio, and converts a person in the video into another person for each frame group, and
the playback procedure performs playback by synchronizing, for each playback start time of the first frame in a frame group, the playback start time of the first audio in each unit of the converted audio.

さらに、選択手順を含み、
前記選択手順は、モデル人物及びモデル音声が格納されたモデルデータベースから、任意のモデル人物及びモデル音声の少なくとも一方を選択し、
前記変換手順は、さらに、前記映像内の人物及び前記音声を、前記選択したモデル人物及びモデル音声の少なくとも一方に変換し、
前記再生手順は、前記変換したモデル人物及びモデル音声の少なくとも一方を用いて、前記再生を実行する、請求項７記載のプログラム。 8. The program according to claim 7, further comprising a selection procedure, wherein
the selection procedure selects at least one of an arbitrary model person and an arbitrary model voice from a model database in which model persons and model voices are stored,
the conversion procedure further converts the person in the video and the audio into at least one of the selected model person and model voice, and
the playback procedure executes the playback using at least one of the converted model person and model voice.

前記再生手順は、前記変換したモデル人物と、前記変換したモデル音声とを合成してから再生する、請求項８記載のプログラム。 The program according to claim 8, wherein the playback procedure synthesizes the converted model person and the converted model voice before playing them back.

請求項７から９のいずれか一項に記載のプログラムを記録しているコンピュータ読み取り可能な記録媒体。 A computer-readable recording medium having recorded thereon the program according to any one of claims 7 to 9.