KR102487589B1

KR102487589B1 - The method for providing translation subtitles of video through voice recognition server, translation server, and collective intelligence and system using the same

Info

Publication number: KR102487589B1
Application number: KR1020200183145A
Authority: KR
Inventors: 이원재
Original assignee: 주식회사 소셜임팩트
Priority date: 2020-12-24
Filing date: 2020-12-24
Publication date: 2023-01-11
Also published as: KR20220091928A

Abstract

Disclosed are a method for providing subtitles for a video using a voice recognition server, a translation server, and collective intelligence, and a system using the same.
According to an embodiment, a method by which a subtitle providing server provides subtitles for a video using a voice recognition server, a translation server, and collective intelligence includes the steps of: (a) receiving a link to the video and a translation language from a user terminal; (b) obtaining the sound source of the video by the subtitle providing server accessing the video link; (c) converting the sound source of the video into audio subtitles; (d) converting the audio subtitles into subtitles, based on the translation language, through at least one of the translation server and collective intelligence; and (e) playing the video together with the subtitles on the user terminal.

Description

Method for providing subtitles for a video using a voice recognition server, a translation server, and collective intelligence, and system using the same

The present invention relates to a method for providing subtitles for a video and a system using the same, and more particularly, to a method for providing subtitles for a video through a voice recognition server, a translation server, and collective intelligence, and a system using the same.

As media content has advanced in recent years, a wide variety of media content has been developed on networks and has steadily gained popularity among users, so the media content field is growing rapidly.

Among media content, video is delivered over a network, so it has the advantage of being available to all network users regardless of region or language. However, because the language of a video differs by country, region, or language community, the videos a user can actually use are inevitably limited by that user's translation ability.

For example, when a domestic (Korean-speaking) user wants to use a video produced by a user in an English-speaking country, the audio and subtitles of that video are produced in English, so a domestic user with weak English translation skills is limited in using such videos.

One approach is to upload the English video to a voice recognition portal site to subtitle the English audio, and then to translate the English subtitles on a translation portal site. However, this requires uploading the English video, and the entire video must be subtitled even when only part of it is needed, which takes a long time. When an English-language video must be watched and understood urgently, obtaining a translation through such portal sites is therefore impractical.

In addition, when the audio of an English video is subtitled through a voice recognition portal site, the ambient noise of the video may also be subtitled, which lowers the accuracy of the audio subtitles and the translated subtitles.

In addition, when a hearing-impaired person wants to use a video without subtitles and the subtitles generated on the fly by a program or the like are of poor quality, the hearing-impaired person has difficulty understanding the content of the video.

An object of the present invention is to solve all of the problems described above.

Another object of the present invention is to provide accurate audio subtitles and translated subtitles by dividing the audio of a video into certain sections.

Still another object of the present invention is to provide a translation of a video quickly through a voice recognition server and a translation server or collective intelligence.

The characteristic configuration of the present invention for achieving the objects described above and realizing the characteristic effects described later is as follows.

A method for providing subtitles for a video using a voice recognition server, a translation server, and collective intelligence, the method comprising the steps of: (a) receiving a link to the video and a translation language from a user terminal; (b) obtaining the sound source of the video by a subtitle providing server accessing the video link; (c) converting the sound source of the video into audio subtitles; (d) converting the audio subtitles into subtitles, based on the translation language, through at least one of the translation server and collective intelligence; and (e) playing the video together with the audio subtitles and the translated subtitles on the user terminal.
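As a reading aid, steps (a) through (e) can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not the claimed implementation: the `Subtitle` type and the `fetch_audio`, `speech_to_text`, and `translate` callables are hypothetical placeholders standing in for the subtitle providing server's audio extraction, the voice recognition server, and the translation server or collective intelligence.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Subtitle:
    """One timed subtitle line (hypothetical structure, not from the patent)."""
    start: float  # seconds from the beginning of the video
    end: float
    text: str


def provide_subtitles(
    video_link: str,                                   # step (a): link from the user terminal
    target_language: str,                              # step (a): translation language
    fetch_audio: Callable[[str], bytes],               # hypothetical: step (b)
    speech_to_text: Callable[[bytes], List[Subtitle]], # hypothetical: step (c)
    translate: Callable[[str, str], str],              # hypothetical: step (d)
) -> Tuple[List[Subtitle], List[Subtitle]]:
    """Return (audio subtitles, translated subtitles) for playback in step (e)."""
    audio = fetch_audio(video_link)
    audio_subtitles = speech_to_text(audio)
    translated = [
        Subtitle(s.start, s.end, translate(s.text, target_language))
        for s in audio_subtitles
    ]
    return audio_subtitles, translated
```

The translated subtitles reuse the timing of the audio subtitles, so both tracks can be shown alongside the video in step (e).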

The present invention has the following effects.

According to an embodiment of the present invention, the sound source waveform of a video is analyzed by amplitude and shape so that ambient noise is distinguished from the voice in the video, which makes it possible to provide more accurate audio subtitles or translated subtitles than conventional inventions.

In addition, according to an embodiment of the present invention, using a voice recognition server and a translation server or collective intelligence makes it possible to provide translated subtitles for a video faster and more accurately than conventional inventions.

In addition, the method for providing subtitles using collective intelligence according to an embodiment of the present invention requires many subtitle providers or translators, and therefore has a job creation effect.

FIG. 1 is a block diagram of a video subtitle providing system using a voice recognition server, a translation server, and collective intelligence according to an embodiment of the present invention.
FIG. 2 is a flowchart of a video subtitle providing method using a voice recognition server, a translation server, and collective intelligence according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating an example of per-section shapes according to an embodiment of the present invention.
FIG. 4 shows various flowcharts of video audio subtitling according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating various examples of a second display unit according to an embodiment of the present invention.
FIG. 6 is a flowchart further including a language conversion database selection step according to an embodiment of the present invention.
FIG. 7 shows various flowcharts further including a subtitle sync adjustment step according to an embodiment of the present invention.

The following detailed description of the present invention refers to the accompanying drawings, which show, by way of illustration, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the invention, although different from one another, are not necessarily mutually exclusive. For example, a particular shape, structure, or characteristic described herein in connection with one embodiment may be implemented in another embodiment without departing from the spirit and scope of the invention. It should also be understood that the position or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the invention. Accordingly, the following detailed description is not to be taken in a limiting sense, and the scope of the invention, if properly described, is limited only by the appended claims, along with the full range of equivalents to which those claims are entitled. In the drawings, like reference numerals refer to the same or similar functions throughout the several aspects.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can easily practice it.

FIG. 1 is a block diagram for explaining video subtitle providing systems 1a and 1b using a voice recognition server 400, a translation server 500, and collective intelligence according to an embodiment of the present invention.

According to an embodiment of the present invention, as shown in FIG. 1A, the video subtitle providing system 1a may include a user terminal 100, a subtitle provider terminal 200, a subtitle providing server 300, and a voice recognition server 400.

These components are described in turn below.

The user terminal 100 according to an embodiment may include a first input unit 101, through which a link to a video requiring subtitles may be input, and a first display unit 102 capable of playing the video.

The subtitle provider terminal 200 according to an embodiment may include a second input unit 201, through which audio subtitles or translated subtitles for a video may be input, and a second display unit 202 capable of playing the video and displaying translation sections. The display of translation sections is described in detail with reference to FIG. 3.

In this specification, collective intelligence may mean obtaining audio subtitle and translated subtitle data for a video from subtitle providers or translators through a plurality of subtitle provider terminals 200. For example, when collective intelligence is used, a plurality of subtitle providers can connect at the same time and produce audio subtitles or translated subtitles for the audio of different sections.

The terminals 100 and 200 according to an embodiment are digital devices with a communication function. Any digital device that has memory means and a microprocessor, and therefore computing capability, such as a desktop computer, notebook computer, workstation, PDA, web pad, or mobile phone, may be adopted as the terminal 100 or 200 according to the present invention.

In addition, any device capable of providing input to the terminals 100 and 200 through a keyboard, mouse, touchpad, sound, or the like may be adopted as the input unit 101 or 201 according to the present invention.

In addition, the display units 102 and 202 are devices capable of outputting video and audio signals, and may be configured as an LCD (Liquid Crystal Display), TFT-LCD (Thin Film Transistor LCD), OLED (Organic Light Emitting Diode), LED (Light Emitting Diode), AMOLED (Active Matrix Organic LED), flexible display, three-dimensional (3D) display, or the like.

The subtitle providing server 300 according to an embodiment may include a communication unit 301, an analysis unit 302, a sync unit 303, and a storage unit 305.

The communication unit 301 according to an embodiment may implement communication between the subtitle providing server 300 and the terminals 100 and 200, the voice recognition server 400, and the translation server 500 through various communication technologies. That is, Wi-Fi, WCDMA (Wideband CDMA), HSDPA (High Speed Downlink Packet Access), HSUPA (High Speed Uplink Packet Access), HSPA (High Speed Packet Access), Mobile WiMAX, WiBro, LTE (Long Term Evolution), Bluetooth, infrared communication (IrDA, Infrared Data Association), NFC (Near Field Communication), Zigbee, wireless LAN technology, and the like may be applied. In addition, when the service is provided over the Internet, it may follow TCP/IP, the standard protocol for transmitting information on the Internet.

The analysis unit 302 according to an embodiment may distinguish and separate voice from ambient noise based on the amplitude and shape of the sound source waveform of the video. This is described in detail with reference to FIGS. 2 and 3.

The sync unit 303 according to an embodiment may synchronize the subtitles with the video. This is described in detail with reference to FIG. 7.

The storage unit 305 according to an embodiment may store audio subtitles produced through the voice recognition server 400 or the subtitle provider terminal 200.

The storage unit 305 may also store translated subtitles received from the translation server 500 or the subtitle provider terminal 200.

The voice recognition server 400 according to an embodiment may include an Audio to Text unit 401, and may recognize the sound source of a video through the Audio to Text unit 401 and convert it into audio subtitles.

Here, the Audio to Text unit 401 may include external websites and applications such as Google's speech to text and Audext, or an application inside the server that converts a sound source into text, and may recognize the sound source of the video and convert it into audio subtitles through such websites and applications.

The converted audio subtitles may be stored in the storage unit 305 of the subtitle providing server 300.

According to another embodiment of the present invention, as shown in FIG. 1B, the video subtitle providing system 1b may include a user terminal 100, a subtitle provider terminal 200, a subtitle providing server 300, a voice recognition server 400, and a translation server 500.

The subtitle providing server 300 according to this other embodiment may further include a database unit 304.

The database unit 304 according to an embodiment may be used when a translation field is selected before the translation server 500 or a translator performs the translation. This is described in detail with reference to FIG. 6.

The translation server 500 according to an embodiment may translate the audio subtitles of a video into the language set from the user terminal 100.

Here, the translation server 500 may include websites and applications such as Google's translation and Papago, and may translate the audio subtitles into the language set from the user terminal 100 through such websites and applications.

The translated subtitles may be stored in the storage unit 305 of the subtitle providing server 300.

The remaining components are the same as those described with reference to FIG. 1A, so a detailed description of them is omitted.

FIG. 2A is a flowchart of a method for providing audio subtitles for a video using the voice recognition server 400 and collective intelligence according to an embodiment of the present invention.

According to the embodiment shown in FIG. 2A, audio subtitles, in which the audio of a specific video is subtitled in its original language, may be provided without providing translated subtitles for that video.

In step 100a, a link to the video for which audio subtitles are to be obtained may be input through the first input unit 101 of the user terminal 100.

In step 110a, the subtitle providing server 300 may receive the video link input through the first input unit 101 of the user terminal 100, access the link, extract the sound source of the video, and store it in the storage unit 305. The extracted sound source may then be used in step 120a.

In step 120a, the sound source of the video may be converted into audio subtitles through at least one of the subtitle provider terminal 200 and the voice recognition server 400. For example, the subtitle providing server 300 may transmit the extracted sound source to the subtitle provider terminal 200 or the voice recognition server 400 and receive the subtitles corresponding to it.

Here, the video audio subtitling step (S120a) according to an embodiment may include a video sound source analysis step (S121) and an audio subtitle conversion step (S124).

In the video sound source analysis step (S121) according to an embodiment, the subtitle providing server 300 may analyze the sound source waveform and classify it into noise and voice.

In the video sound source analysis step (S121) according to an embodiment, the analysis unit 302 of the subtitle providing server 300 may recognize the sound source of the video as voice based on the amplitude and the shape of the sound source waveform.

According to an embodiment, when the amplitude of the sound source waveform of the video falls below a preset amplitude, the sound is recognized as noise rather than voice, and when the amplitude is at or above the preset amplitude, it is recognized as voice. The preset amplitude may be set through machine learning or by a user.
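The amplitude rule above can be illustrated with a short sketch. It assumes details the description does not specify: the audio is taken as a list of normalized samples, it is cut into fixed-size frames, and the peak absolute value of a frame is used as its amplitude. The names `classify_frames_by_amplitude`, `frame_size`, and `threshold` are hypothetical.

```python
from typing import List, Sequence


def classify_frames_by_amplitude(
    samples: Sequence[float],
    frame_size: int,
    threshold: float,
) -> List[bool]:
    """Return one flag per frame: True for voice, False for noise.

    A frame counts as voice when its peak absolute amplitude is at or
    above the preset threshold, and as noise when it is below it.
    """
    if frame_size <= 0:
        raise ValueError("frame_size must be positive")
    flags = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        peak = max(abs(x) for x in frame)  # frame is never empty here
        flags.append(peak >= threshold)
    return flags
```

In practice the threshold would come from machine learning or a user setting, as the description states; a fixed number is used here only to keep the sketch self-contained.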

According to another embodiment, when the shape of the waveform in the audio of the video does not reach a predetermined level of similarity to the reference shape of the learned sound source waveform, that portion is recognized as a noise section rather than voice, and when the shape does reach the predetermined level of similarity to the reference shape, it is recognized as a voice section.

Here, a sound source waveform that has a voice pattern obtained through machine learning is referred to as the learned sound source waveform, and its reference shape may be set through machine learning or by a user.

According to still another embodiment, the sound source waveform patterns of previously held voices are analyzed through machine learning, and a section whose similarity to those patterns is lower than a predetermined reference value is regarded as a noise section, while a section whose similarity is higher is regarded as a voice section.
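The shape-based embodiments compare a section of the waveform with a reference shape and apply a similarity threshold. The description does not name a similarity measure, so the sketch below uses cosine similarity purely as an assumption; `cosine_similarity`, `is_voice_by_shape`, and `min_similarity` are hypothetical names.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length waveforms (0.0 if either is silent)."""
    if len(a) != len(b):
        raise ValueError("waveforms must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_voice_by_shape(
    frame: Sequence[float],
    reference: Sequence[float],
    min_similarity: float,
) -> bool:
    """Treat the frame as voice when it is similar enough to the reference shape."""
    return cosine_similarity(frame, reference) >= min_similarity
```

Cosine similarity ignores overall loudness, so a quiet frame with the same shape as the reference still counts as voice; a learned model could replace this measure without changing the thresholding step.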

The audio subtitle conversion step (S124) according to an embodiment may consist of a separation step (S122) that separates voice sections based on the analysis of the sound source waveform, a display step (S123) that displays the voice sections section by section, and an audio subtitle conversion step (S124) that converts the voice of each voice section into audio subtitles.

In the separation step (S122) according to an embodiment, the analysis unit 302 may divide the sound source of the video into voice sections and noise sections based on the result of the video sound source analysis step (S121), and may separate out the voice of certain sections with the ambient noise excluded.
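One way to picture the separation step is to merge consecutive voice frames into timed sections and drop the noise frames between them. This is a minimal sketch assuming a per-frame voice/noise flag list and a fixed frame duration, neither of which is specified in the description; `voice_sections` and `frame_duration` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def voice_sections(
    flags: Sequence[bool],
    frame_duration: float,
) -> List[Tuple[float, float]]:
    """Merge runs of voice frames into (start_seconds, end_seconds) sections."""
    sections = []
    run_start = None  # index of the first frame in the current voice run
    for index, is_voice in enumerate(flags):
        if is_voice and run_start is None:
            run_start = index
        elif not is_voice and run_start is not None:
            sections.append((run_start * frame_duration, index * frame_duration))
            run_start = None
    if run_start is not None:  # the audio ended while still in a voice run
        sections.append((run_start * frame_duration, len(flags) * frame_duration))
    return sections
```

Each returned section could then be drawn as one per-section shape on the subtitle provider's display and sent for subtitling on its own.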

In the display step (S123) according to an embodiment, the voice of the plurality of sections is distinguished as a plurality of per-section shapes 600 according to the result of the separation step (S122), and the per-section shapes 600 and 610 may be displayed on the second display unit 202 of the subtitle provider terminal 200 so that they can be told apart. This is described in detail with reference to FIG. 3.

In the audio subtitle conversion step (S124) according to an embodiment, which converts the voice of each voice section into audio subtitles, the voice recognition server 400 may recognize the sound source of the video through the Audio to Text unit 401 and convert it into audio subtitles.

Alternatively, collective intelligence may be used to produce the audio subtitles. For example, the per-section sound sources of the video may be displayed to a subtitle provider through the second display unit 202 of the subtitle provider terminal 200, and the subtitle provider may click one of them, listen to the sound source of the video that is played, and input the corresponding audio subtitle through the second input unit 201.

Specifically, when the video contains the sound source “Good morning! Mr.Park!”, the translator can see, through the second display unit 202 of the subtitle provider terminal 200, the per-section shapes 600 and 610 or the subtitles corresponding to them, and clicking one plays the corresponding sound source of the video through the subtitle provider terminal 200. After listening to the sound source, the subtitle provider may input the corresponding audio subtitle, “Good morning! Mr.Park!”, through the second input unit 201. There may be a plurality of subtitle providers and a plurality of subtitle provider terminals 200, each inputting subtitles for a different section through the second input unit 201. Accordingly, when many subtitle providers and subtitle provider terminals 200 work at the same time, subtitling of the entire video can proceed quickly.
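The speed-up from parallel subtitling can be illustrated by spreading sections across providers. In the description, providers pick sections themselves by clicking them; the round-robin assignment below is only an assumed stand-in used to show how the work divides, and `assign_sections` and its parameters are hypothetical names.

```python
from typing import Dict, List, Sequence


def assign_sections(
    section_ids: Sequence[int],
    provider_ids: Sequence[str],
) -> Dict[str, List[int]]:
    """Spread sections across providers in round-robin order."""
    if not provider_ids:
        raise ValueError("at least one subtitle provider is required")
    assignment: Dict[str, List[int]] = {provider: [] for provider in provider_ids}
    for position, section in enumerate(section_ids):
        provider = provider_ids[position % len(provider_ids)]
        assignment[provider].append(section)
    return assignment
```

With N providers working at once, each handles roughly 1/N of the sections, which is the source of the faster overall subtitling described above.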

In step 140a, the video and the audio subtitles obtained in step 120a may be played together through the first display unit 102 of the user terminal 100.

Here, the sync unit 303 may synchronize the audio subtitles with the video for playback. This is described in detail with reference to FIG. 7A.

FIG. 2B is a flowchart of a method for providing translated subtitles for a video using the voice recognition server 400, the translation server 500, and collective intelligence according to an embodiment of the present invention.

According to the embodiment shown in FIG. 2B, both audio subtitles and translated subtitles may be provided for a specific video.

In step 100b, the video link may be input, and the subtitle language may be selected, through the first input unit 101 of the user terminal 100.

Here, the languages of various countries, such as English, Spanish, Korean, Chinese, Japanese, and German, may be selected as the subtitle language in this specification.

Steps 100b to 120b are the same as steps 100a to 120a, so the repeated description is omitted.

In step 130b, the audio subtitles may be converted, based on the subtitle language set in step 100b, into translated subtitles in that language through at least one of the subtitle provider terminal 200 and the translation server 500.

이때, 번역서버(500)는 자막제공 서버(300)를 통해 동영상의 음성자막을 획득하여 획득한 음성자막을 기초로 번역자막으로 변환할 수 있다.At this time, the translation server 500 may acquire audio subtitles of the video through the subtitle providing server 300 and convert them into translated subtitles based on the obtained audio subtitles.

또는, 집단지성이 설정한 자막언어의 번역자막으로 변환하는데 이용될 수 있다. 예를 들어, 동영상의 복수의 구간별 형상(600, 610)이 자막제공자 단말기(200)의 제2 디스플레이부(202)를 통해 자막제공자에게 표시될 수 있고, 자막제공자는 복수의 구간별 형상(600, 610) 중에서 하나의 구간별 형상(600)을 클릭하여, 제2 입력부(201)를 통해 하나의 구간별 형상(600)에 대응하는 음성의 번역자막을 입력할 수 있다.Alternatively, it may be used to convert subtitles into translated subtitles of a subtitle language set by collective intelligence. For example, the plurality of sections 600 and 610 of the video may be displayed to the caption provider through the second display unit 202 of the caption provider terminal 200, and the caption provider may display the plurality of sections (600, 610). By clicking on one of the shapes 600 for each section among 600 and 610 , translation subtitles for audio corresponding to one shape 600 for each section can be input through the second input unit 201 .

구체적으로, 동영상이 “Good morning! Mr.Park!의 음원을 포함할 때, 번역자는 자막제공자 단말기(200)의 제2 디스플레이부(202)를 통해 표시되는 구간별 형상(600, 610)에 대응하는 음성자막인 “Good morning! Mr.Park!”을 볼 수 있고, 구간별 형상(600)에 해당하는 동영상의 음원인 “Good morning!”은 자막제공자 단말기(200)를 통해 들을 수 있다. 이때 자막제공자는 동영상의 음원을 들은 후 번역자막인“좋은 아침이에요!”를 제2 입력부(201)를 통해 입력할 수 있다. 또는, 제막제공자는 동영상의 음성자막을 보고 번역자막인“좋은 아침이에요!”를 제2 입력부(201)를 통해 입력할 수 있다. 이때, 복수의 자막제공자와 복수의 자막제공자 단말기(200)가 존재할 수 있으며, 각각 다른 구간의 번역자막을 제2 입력부(201)를 통해 입력할 수 있다. 따라서, 자막제공자와 자막제공자 단말기(200)의 수가 많아서 자막화 작업이 동시에 진행된다면 전체 구간에 대한 번역자막 변환이 빠르게 진행될 수 있다. Specifically, if the video is “Good morning! When the sound source of Mr.Park! is included, the translator says “Good morning! Mr. Park!”, and “Good morning!”, which is the sound source of the video corresponding to the shape 600 for each section, can be heard through the caption provider terminal 200. At this time, the caption provider may listen to the sound source of the video and then input the translated caption “Good morning!” through the second input unit 201 . Alternatively, the festival provider may view the audio caption of the video and input the translated caption “Good morning!” through the second input unit 201 . At this time, a plurality of caption providers and a plurality of caption provider terminals 200 may exist, and translated captions of different sections may be input through the second input unit 201 . Therefore, if the captioning operation is performed simultaneously because the number of caption providers and caption provider terminals 200 is large, conversion of translated captions for the entire section can be performed quickly.

In step 140b, the video, the audio subtitles obtained in step 120b, and the translated subtitles obtained in step 130b may be played together on the first display unit 102 of the user terminal 100.

At this time, the sync unit 303 may synchronize the video, the audio subtitles, and the translated subtitles for playback. This is described in detail with reference to FIG. 7B.

FIG. 3 is a diagram showing section shapes obtained by separating the speech in the sound source of a video from ambient noise according to an embodiment of the present invention.

As can be seen in FIG. 3, the speech of the video may be displayed separately from ambient noise.

In one embodiment, the speech that requires audio subtitles or translated subtitles is separated from ambient noise and displayed as follows. Given the sound source waveform of the video, the video sound source analysis step (S121) may recognize portions of the waveform as speech based on the amplitude of the waveform or on the shape of learned sound source waveforms obtained through machine learning, and the separation step (S122), which separates speech sections, may separate the recognized speech from the ambient noise.

In addition, the separated speech of each section may be displayed as a plurality of preset section shapes 600 and 610 through the display step (S123), which displays speech section by section, and the separated speech of each section may be converted into audio subtitles through the audio subtitle conversion step (S124).
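The amplitude-based branch of the analysis and separation steps (S121, S122) can be sketched roughly as follows. This is a minimal illustration only, not the patented implementation: the function name `find_voice_sections`, the fixed `threshold`, and the `min_len` parameter are assumptions introduced here, and the machine-learning branch is not covered.

```python
from typing import List, Tuple


def find_voice_sections(
    amplitudes: List[float], threshold: float, min_len: int = 1
) -> List[Tuple[int, int]]:
    """Return (start, end) sample-index pairs, end exclusive, for runs of
    samples whose absolute amplitude is at or above the threshold.

    Hypothetical helper: the threshold rule stands in for the
    "amplitude of the waveform" criterion described in the text.
    """
    sections: List[Tuple[int, int]] = []
    start = None  # index where the current speech run began, if any
    for i, a in enumerate(amplitudes):
        is_voice = abs(a) >= threshold
        if is_voice and start is None:
            start = i  # a speech run begins
        elif not is_voice and start is not None:
            if i - start >= min_len:
                sections.append((start, i))  # a speech run ends
            start = None
    # Close a run that reaches the end of the waveform.
    if start is not None and len(amplitudes) - start >= min_len:
        sections.append((start, len(amplitudes)))
    return sections


# Toy waveform: two loud runs separated by low-level ambient noise.
samples = [0.01, 0.02, 0.6, 0.7, 0.5, 0.03, 0.02, 0.8, 0.9, 0.01]
sections = find_voice_sections(samples, threshold=0.3)
```

Each returned pair would correspond to one section shape such as 600 or 610 in the display step (S123).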

In this specification, the plurality of preset section shapes 600 and 610 are drawn as rectangles, as shown in FIG. 3, but they are not limited to rectangles and may be drawn as circles, ellipses, rhombuses, triangles, and the like.

Meanwhile, the position of the audio subtitle conversion step (S124) within the video audio subtitling steps (S120a, S120b) may be changed.

FIG. 4A is a flowchart showing the flow of the video audio subtitling steps (S120a, S120b) according to one embodiment, and FIG. 4B is a flowchart showing the video audio subtitling steps (S120a, S120b) according to another embodiment.

As shown in FIG. 4A, for video audio subtitling, the video sound source analysis step (S121), the separation step of separating speech sections (S122), the display step of displaying speech section by section (S123), and the audio subtitle conversion step (S124) may be performed in that order.

According to another embodiment of video audio subtitling, as shown in FIG. 4B, the audio subtitle conversion step (S124), the video sound source analysis step (S121), the separation step of separating speech sections (S122), and the display step of displaying speech section by section (S123) may be performed in that order.

When the audio subtitle conversion step (S124) is performed by the voice recognition server 400, the accuracy of the video audio subtitling steps (S120a, S120b) according to FIG. 4A may differ from that of the steps according to FIG. 4B.

For example, when the video sound source analysis step (S121), the separation step (S122), and the display step (S123) are performed first and the audio subtitle conversion step (S124) follows, as in FIG. 4A, the conversion step is applied to the speech section of the preset section shape 600 after it has been separated from ambient noise, so translation accuracy can be high.

At this time, there may be a plurality of section shapes 600 and 610. In the audio subtitle conversion step (S124), the analysis unit 302 may join only the section shapes 600 and 610 and recognize them as a single section shape 620, as shown in FIG. 3B, and then perform the audio subtitle conversion step (S124) on the speech corresponding to that single section shape 620 to convert it into audio subtitles.

Alternatively, the audio subtitle conversion step (S124) may be performed separately on the speech of each section corresponding to the section shapes 600 and 610, converting the speech of each section shape into audio subtitles.
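The two alternatives above (joining section shapes 600 and 610 into a single shape 620 before conversion, or converting each section on its own) can be contrasted in a short sketch. The names `merge_sections` and `convert_sections` are hypothetical, and `fake_transcribe` is a stand-in for whatever speech-to-text call the voice recognition server 400 would actually make.

```python
from typing import Callable, List, Sequence, Tuple

Section = Tuple[int, int]  # (start, end) sample indices, end exclusive


def merge_sections(samples: Sequence[float], sections: List[Section]) -> List[float]:
    """Concatenate only the speech sections, dropping the noise between them
    (the single joined shape 620 in the text)."""
    merged: List[float] = []
    for start, end in sections:
        merged.extend(samples[start:end])
    return merged


def convert_sections(
    samples: Sequence[float],
    sections: List[Section],
    transcribe: Callable[[Sequence[float]], str],
    merge: bool = False,
) -> List[str]:
    """Run the (assumed) transcribe callable once on the joined speech,
    or once per section."""
    if merge:
        return [transcribe(merge_sections(samples, sections))]
    return [transcribe(samples[start:end]) for start, end in sections]


def fake_transcribe(chunk: Sequence[float]) -> str:
    # Placeholder for a real speech recognizer: reports the chunk length only.
    return f"<{len(chunk)} samples>"


samples = list(range(10))
sections = [(2, 5), (7, 9)]
per_section = convert_sections(samples, sections, fake_transcribe)
joined = convert_sections(samples, sections, fake_transcribe, merge=True)
```

In both variants only the speech sections are passed to the recognizer; the difference is whether it receives them as one continuous input or as separate inputs.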

In contrast, when the audio subtitle conversion step (S124) is performed first, so that the entire sound source of the video is converted into audio subtitles, and the video sound source analysis step (S121), the separation step (S122), and the display step (S123) follow, as in FIG. 4B, the voice recognition server 400 performs the audio subtitle conversion step (S124) with the entire sound source of the video divided into sentence units.

In addition, the audio subtitle conversion step (S124) according to an embodiment may be performed through collective intelligence, that is, through a plurality of subtitle provider terminals 200.

When the video sound source analysis step (S121), the separation step (S122), and the display step (S123) are performed first and the audio subtitle conversion step (S124) follows, as in FIG. 4A, each of the subtitle provider terminals 200 may click one of the preset section shapes 600 and 610, which represent the speech portions of the video, and enter the audio subtitle corresponding to the speech of the clicked section shape. Because the subtitle provider directly converts the speech of the video into audio subtitles, accuracy is high.

For example, when one of the subtitle provider terminals 200 clicks one of the preset section shapes 600 and 610, that terminal is given the opportunity to enter the audio subtitle corresponding to the speech of the clicked section shape. The subtitle provider terminal 200 that clicked first may be given a priority input opportunity for a certain period of time.

The speed of audio subtitling may be determined by the number of participating subtitle provider terminals 200. As the number of subtitle providers grows and more subtitle provider terminals 200 enter audio subtitles, the time needed to produce the audio subtitles can decrease.

FIGS. 5A and 5B are diagrams showing the second display unit 202 according to one embodiment, and FIGS. 5(c) and 5(d) are diagrams showing the second display unit 202 according to another embodiment.

Referring to FIGS. 5A and 5B, the speech sections of the video may be displayed as rectangles, which are the plurality of section shapes 601a and 602a.

FIGS. 5A and 5B show that clicking each of the two different section shapes 601a and 602a brings up a different audio subtitle.

According to an embodiment, the video may be played on the second display unit 202 of the subtitle provider terminal 200, and a sound source waveform in which the speech is marked with the plurality of section shapes 601a and 602a may be located below the video.

Below the sound source waveform, an audio subtitle input field 650a may be located, in which the audio subtitles produced in the video audio subtitling step (S120a) by at least one of the subtitle provider terminal 200 and the voice recognition server 400 can be corrected and entered.

The correction and entry of audio subtitles, the production of audio subtitles using collective intelligence, and the evaluation of and penalties for subtitle providers are described in detail below, together with translated subtitles, with reference to FIGS. 5(c) and 5(d).

Referring to FIGS. 5(c) and 5(d), the speech sections of the video may be displayed as rectangles, which are the plurality of section shapes 601b and 602b.

FIGS. 5(c) and 5(d) show that clicking each of the two different section shapes 601b and 602b brings up different audio subtitles and translated subtitles.

According to an embodiment, the video may be played on the second display unit 202 of the subtitle provider terminal 200, and a sound source waveform in which the speech is marked with the plurality of section shapes 601b and 602b may be located below the video.

Below the sound source waveform, an audio subtitle input field 650b may be located, in which the audio subtitles produced in the video audio subtitling step (S120b) by at least one of the subtitle provider terminal 200 and the voice recognition server 400 can be corrected and entered.

Below the audio subtitle input field 650b, a translated subtitle input field 660b may be located, in which the translated subtitles of the video produced by at least one of the subtitle provider terminal 200 and the translation server 500 can be corrected and entered.

When the audio subtitle input field 650b is clicked, the audio subtitle can be corrected and entered through the second input unit 201 of the subtitle provider terminal 200, and when the translated subtitle input field 660b is clicked, the translated subtitle can be corrected and entered through the second input unit 201 of the subtitle provider terminal 200.

In the video subtitle providing system according to an embodiment, audio subtitles and translated subtitles may be provided through collective intelligence.

When a plurality of subtitle provider terminals 200 click the same section shape (601b or 602b) of the same video at different times, the subtitle provider terminal 200 that clicked first may, for a certain period of time, have priority in correcting and entering the audio subtitle in the audio subtitle input field 650b corresponding to that section shape.

Likewise, when a plurality of subtitle provider terminals 200 click the same section shape (601b or 602b) of the same video at different times, the subtitle provider terminal 200 that clicked first may, for a certain period of time, have priority in correcting and entering the translated subtitle in the translated subtitle input field 660b corresponding to that section shape.

In short, the subtitle provider terminal 200 that clicked first is given priority for a certain period of time and can correct and enter both the audio subtitle in the audio subtitle input field 650b and the translated subtitle in the translated subtitle input field 660b.

For example, suppose the plurality of subtitle provider terminals 200 consists of a first subtitle provider terminal and a second subtitle provider terminal. If the first subtitle provider terminal clicks a section shape (601b or 602b) of a video in order to correct its subtitles, and the second subtitle provider terminal clicks the same section shape of the same video 10 seconds later, only the first subtitle provider terminal can correct the audio subtitle in the audio subtitle input field 650b and the translated subtitle in the translated subtitle input field 660b for the certain period of time.
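The first-click priority described above behaves like a time-limited lock on each section shape. The sketch below is one possible reading under stated assumptions: the class name `SectionPriority`, the explicit `now` argument, and the 60-second window are all hypothetical, since the text only says "a certain period of time".

```python
from typing import Dict, Tuple


class SectionPriority:
    """Tracks which terminal currently holds editing priority per section."""

    def __init__(self, priority_seconds: float) -> None:
        self.priority_seconds = priority_seconds
        # section_id -> (terminal_id, time at which priority was granted)
        self._holders: Dict[str, Tuple[str, float]] = {}

    def click(self, section_id: str, terminal_id: str, now: float) -> bool:
        """Return True if this terminal may edit the section at time `now`."""
        holder = self._holders.get(section_id)
        if holder is not None:
            holder_id, granted_at = holder
            if now - granted_at < self.priority_seconds:
                # Priority window still open: only the first clicker may edit.
                return holder_id == terminal_id
        # No holder yet, or the window has expired: grant priority to the clicker.
        self._holders[section_id] = (terminal_id, now)
        return True


priority = SectionPriority(priority_seconds=60.0)  # assumed window length
first = priority.click("601b", "terminal-1", now=0.0)
second = priority.click("601b", "terminal-2", now=10.0)  # 10 s later, as in the example
after_expiry = priority.click("601b", "terminal-2", now=60.0)
```

Passing the clock value in explicitly keeps the sketch deterministic; a real server would more likely read a monotonic clock itself.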

As the number of translators working on the same video grows and more subtitle provider terminals 200 enter audio subtitles and translated subtitles, the time needed to produce the audio subtitles and translated subtitles of the video can decrease.

For example, when a plurality of subtitle providers with the same level of translation skill each have a subtitle provider terminal 200, producing subtitles for the same video with 10 subtitle provider terminals 200 may take less time than producing translated subtitles with 5 subtitle provider terminals 200.

According to an embodiment, when a plurality of audio subtitles exist for at least one of the speech sections, a plurality of user terminals 100 may rate those audio subtitles, and the audio subtitle with the highest rating may be displayed on the user terminal 100 with priority.

Similarly, when a plurality of translated subtitles exist for at least one of the speech sections, a plurality of user terminals 100 may rate those translated subtitles, and the translated subtitle with the highest rating may be displayed on the user terminal 100 with priority.

For example, suppose the maximum rating an audio subtitle of a specific speech section can receive is 3 points and users can rate in increments of 0.5 points. If one subtitle provider terminal 200 received a rating of 3 points for its audio subtitle of that section, another received 2 points, and another received 1.5 points, the audio subtitle entered by the subtitle provider terminal 200 with the highest rating, 3 points, may be displayed on the user terminal 100 with priority.

According to an embodiment, a subtitle provider terminal 200 whose rating for the audio subtitle of a specific speech section is lower than a preset rating may receive a penalty.

The preset rating can be set and changed by users and subtitle providers. A subtitle provider who entered an audio subtitle rated below the preset rating may be given a predetermined time in which to recover the rating, and use of the video subtitle providing system (1a, 1b) may be restricted until the rating is recovered.

According to an embodiment, a subtitle provider terminal 200 whose rating for the translated subtitle of a specific speech section is lower than the preset rating may likewise receive a penalty.

The preset rating can be set and changed by users and subtitle providers. A subtitle provider who entered a translated subtitle rated below the preset rating may be given a predetermined time in which to recover the rating, and use of the video subtitle providing system (1a, 1b) may be restricted until the rating is recovered.

For example, if a penalty is imposed on any subtitle provider terminal 200 with a rating below 1 point, a subtitle provider terminal 200 with a rating of 0.5 points may be restricted from using the video subtitle providing system (1a, 1b) until it obtains a rating of 1 point or higher.

According to an embodiment, when a plurality of subtitle provider terminals 200 have received the same rating for the audio subtitle of a specific speech section, the audio subtitle of the subtitle provider terminal 200 that entered its audio subtitle first may be displayed with priority.

Likewise, when a plurality of subtitle provider terminals 200 have received the same rating for the translated subtitle of a specific speech section, the translated subtitle of the subtitle provider terminal 200 that entered its translated subtitle first may be displayed with priority.

For example, when the ratings of a plurality of subtitle provider terminals 200 are at or above 1 point, the threshold below which a penalty is imposed, and those terminals are tied on the rating for the translated subtitle of a specific speech section, the translated subtitle of the subtitle provider terminal 200 that entered it first may be displayed on the user terminal 100 with priority.

According to an embodiment, if no subtitle provider terminal 200 has taken part in audio subtitle conversion for a specific speech section, the audio subtitle obtained through the voice recognition server 400 may be displayed with priority on the first display unit 102 of the user terminal 100.

Similarly, if no subtitle provider terminal 200 has taken part in translated subtitle conversion for a specific speech section, the translated subtitle obtained through the translation server 500 may be displayed with priority on the first display unit 102 of the user terminal 100.
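The display rules above (highest rating first, earliest entry on a tie, and the server output when nobody has contributed) can be combined into one selection function. This is a sketch under assumptions: the `Submission` record, its field names, and `select_subtitle` are illustrative names, not part of the described system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Submission:
    """One subtitle entered by a subtitle provider terminal (hypothetical record)."""
    terminal_id: str
    text: str
    rating: float        # rating given by user terminals, e.g. 0 to 3 in 0.5 steps
    submitted_at: float  # entry time; smaller means entered earlier


def select_subtitle(submissions: List[Submission], machine_text: str) -> str:
    """Pick the subtitle to display for one speech section."""
    if not submissions:
        # No provider took part: fall back to the recognition/translation server output.
        return machine_text
    # Highest rating wins; among equal ratings, the earliest entry wins.
    best = min(submissions, key=lambda s: (-s.rating, s.submitted_at))
    return best.text


candidates = [
    Submission("terminal-1", "subtitle A", rating=3.0, submitted_at=20.0),
    Submission("terminal-2", "subtitle B", rating=2.0, submitted_at=5.0),
    Submission("terminal-3", "subtitle C", rating=3.0, submitted_at=10.0),
]
chosen = select_subtitle(candidates, machine_text="machine subtitle")
fallback = select_subtitle([], machine_text="machine subtitle")
```

The same function applies to audio subtitles and translated subtitles alike, since the text states identical rules for both.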

Although this specification describes the evaluation method in terms of ratings, any method that lets users express a preference, such as likes and dislikes, may also serve as the evaluation method described in this specification.

FIG. 6 is a flowchart that further includes a language conversion database selection step (S150b) according to an embodiment.

As shown in FIG. 6, translation accuracy can be increased by performing the language conversion database selection step (S150b) before the step of converting subtitles into the set subtitle language (S130b).

The subtitle providing server 300 according to an embodiment may include a database unit 304, and the database unit 304 may include a language conversion database.

The language conversion database according to an embodiment is linked with the user terminal 100, and a language conversion database may be selected through the first input unit 101 of the user terminal 100.

The language conversion database according to an embodiment may cover translation fields (e.g., technology, agriculture, culture, sports, and politics).

For example, when politics is selected as the language conversion database through the user terminal 100, the audio subtitle “Party” may be interpreted as “정당” or “당” (a political party), whereas when culture is selected as the language conversion database, the audio subtitle “Party” may be interpreted as “파티” or “만찬” (a party or dinner).
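The “Party” example can be pictured as a per-field glossary lookup. The sketch below assumes a simple dictionary structure; `DOMAIN_GLOSSARIES` and `translate_term` are hypothetical names, and an actual language conversion database would presumably hold far more than single-word entries.

```python
from typing import Dict

# Hypothetical per-field glossaries keyed by lower-cased source term.
DOMAIN_GLOSSARIES: Dict[str, Dict[str, str]] = {
    "politics": {"party": "정당"},
    "culture": {"party": "파티"},
}


def translate_term(
    term: str, domain: str, glossaries: Dict[str, Dict[str, str]] = DOMAIN_GLOSSARIES
) -> str:
    """Translate a term using the glossary of the selected field.

    Returns the term unchanged when the field or the term is unknown,
    so a caller could fall back to a general translation.
    """
    glossary = glossaries.get(domain, {})
    return glossary.get(term.lower(), term)


political = translate_term("Party", "politics")
cultural = translate_term("Party", "culture")
```

Selecting the field before translation (step S150b before S130b) simply decides which glossary is consulted.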

The language conversion database is not limited to translation fields; any data that can improve the accuracy of foreign-language translation may be regarded as a language conversion database in this specification.

FIG. 7 shows flowcharts that further include a subtitle sync adjustment step according to embodiments of the present invention.

FIG. 7A is a flowchart that further includes subtitle sync adjustment (S160a) according to one embodiment, and FIG. 7B is a flowchart that further includes subtitle sync adjustment (S160b) according to another embodiment.

As shown in FIG. 7A, the subtitle sync adjustment step (S160a) may be performed before the audio subtitles and the video are played together (S140a), so that the user is provided with audio subtitles synchronized with the video.

The sync unit 303 according to an embodiment may align the starting point of the audio subtitle with the starting point of the speech of each section shape (601a, 602a) obtained in the video audio subtitling step (S120a).

For example, when section shape 601a and another section shape 602a exist at a certain time interval from each other, and each section shape has corresponding speech, audio subtitles, and translated subtitles, the sync unit 303 may align the points at which the speech and the audio subtitle of section shape 601a start to be output.

The speech and audio subtitle of section shape 601a, whose starting points have been aligned, may then be output on the display units 102 and 202 until the starting point of section shape 602a. By repeating this operation, the speech and the audio subtitles of a foreign-language video can be synchronized.

As shown in FIG. 7B, subtitle sync adjustment (S160b) may be performed before the translated subtitles and the video are played together (S140b), so that the user is provided with subtitles synchronized with the video.

The sync unit 303 according to an embodiment may align the starting point of the speech and audio subtitle of each section shape (601b, 602b) obtained in the video audio subtitling step (S120b) with the starting point of the translated subtitle obtained through subtitle conversion into the set subtitle language (S130b).

For example, when section shape 601b and another section shape 602b exist at a certain time interval from each other, and each section shape has corresponding speech, audio subtitles, and translated subtitles, the sync unit 303 may align the point at which the speech and the audio subtitle of section shape 601b start to be output with the point at which the translated subtitle corresponding to section shape 601b starts to be output.

The speech, audio subtitle, and translated subtitle of section shape 601b, whose starting points have been aligned, may then be output on the display units 102 and 202 until the starting point of section shape 602b. By repeating this operation, the speech, audio subtitles, and translated subtitles of a foreign-language video can be synchronized.
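The sync rule above (each subtitle starts with its section and stays on screen until the next section starts) maps naturally onto a list of timed cues. The following is a minimal sketch: `build_cues` and the tuple layout are hypothetical, and ending the last cue at `video_end` is an assumption, since the text does not say when the final subtitle disappears.

```python
from typing import List, Tuple

# (start_seconds, end_seconds, audio_subtitle, translated_subtitle)
Cue = Tuple[float, float, str, str]


def build_cues(
    section_starts: List[float],
    subtitles: List[Tuple[str, str]],
    video_end: float,
) -> List[Cue]:
    """Align each (audio, translated) subtitle pair with its section start and
    keep it displayed until the next section starts."""
    cues: List[Cue] = []
    for i, (start, (audio, translated)) in enumerate(zip(section_starts, subtitles)):
        # Assumption: the last cue runs to the end of the video.
        end = section_starts[i + 1] if i + 1 < len(section_starts) else video_end
        cues.append((start, end, audio, translated))
    return cues


cues = build_cues(
    section_starts=[1.0, 4.5],  # start times of section shapes 601b and 602b (made-up values)
    subtitles=[
        ("Good morning!", "좋은 아침이에요!"),
        ("audio subtitle 2", "translated subtitle 2"),  # placeholder text
    ],
    video_end=10.0,
)
```

For the audio-only flow of FIG. 7A, the same structure would apply with the translated field left empty.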

In the above, the present invention has been described with reference to specific matters such as concrete components, and to limited embodiments and drawings. These are provided only to aid a more general understanding of the present invention. The present invention is not limited to the above embodiments, and those having ordinary skill in the art to which the present invention pertains may make various modifications and variations based on this description.

Therefore, the spirit of the present invention should not be confined to the embodiments described above, and not only the claims set out below but also everything modified equally or equivalently to those claims falls within the scope of the spirit of the present invention.

1a, 1b: video subtitle providing system
100: user terminal
101: first input unit
102: first display unit
200: subtitle provider terminal
201: second input unit
202: second display unit
300: subtitle providing server
301: communication unit
302: analysis unit
303: sync unit
304: database unit
305: storage unit
400: voice recognition server
401: Audio to Text unit
500: translation server
600, 601a, 601b, 602a, 602b, 610, 620: section shapes
650a, 650b: audio subtitle input field
660b: translated subtitle input field

Claims

A method for providing subtitles of a video from a subtitle providing server using a voice recognition server and collective intelligence, the method comprising:
(a) receiving a link to the video from a user terminal;
(b) obtaining, by the subtitle providing server, a sound source of the video by accessing the video link;
(c) converting the sound source of the video into audio subtitles; and
(d) playing the video together with the audio subtitles on the user terminal,
wherein step (c) further comprises:
an analysis step of analyzing a sound source waveform of the sound source to classify the sound source into noise and voice; and a step of converting the sound source into the audio subtitles based on the analysis of the sound source waveform,
wherein the step of converting into the audio subtitles further comprises: a separation step of separating voice sections of the sound source based on the analysis of the sound source waveform; a display step of displaying the voice sections; and a step of converting the sound source into the audio subtitles based on the displayed voice sections, and
wherein, in the analysis step, based on the magnitude of the sound source waveform, the sound source of the video is not recognized as voice when the magnitude of the sound source waveform falls below a preset magnitude, and is recognized as voice when the magnitude of the sound source waveform rises to the preset magnitude or more; in the separation step, the sound source of the video is separated into voice of a plurality of sections, excluding a plurality of noises; and in the display step, the voice of the plurality of sections is displayed so as to be distinguished by a preset section shape.
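The magnitude-based analysis and separation steps of the claim above can be illustrated with a short sketch. It assumes the sound source is available as a list of amplitude samples; the function name `split_voice_sections` and the parameters `threshold` and `sample_rate` are hypothetical. A sample counts as voice when its absolute amplitude is at or above the preset magnitude, and consecutive voice samples are grouped into one section. A practical system would likely compare a smoothed envelope or per-frame energy rather than raw samples, since raw audio crosses zero many times per second.

```python
def split_voice_sections(samples, threshold, sample_rate):
    """Return (start_sec, end_sec) pairs where |amplitude| >= threshold.

    Samples below the threshold are treated as noise and excluded, so the
    result is the list of voice sections that remain after separation.
    """
    sections = []
    start = None  # index where the current voice section began, if any
    for i, sample in enumerate(samples):
        is_voice = abs(sample) >= threshold
        if is_voice and start is None:
            start = i                      # magnitude rose to the preset level
        elif not is_voice and start is not None:
            sections.append((start / sample_rate, i / sample_rate))
            start = None                   # magnitude fell below the preset level
    if start is not None:                  # voice continues to the end of the source
        sections.append((start / sample_rate, len(samples) / sample_rate))
    return sections


# Toy waveform sampled once per second, preset magnitude 3.
print(split_voice_sections([0, 0, 5, 6, 0, 0, 7, 7, 7, 0], threshold=3, sample_rate=1))
# [(2.0, 4.0), (6.0, 9.0)]
```

Each returned pair would correspond to one section shape shown in the display step.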

The method of claim 1,
wherein, in the step of converting into the audio subtitles, the voice of the video is converted into the audio subtitles through the Audio to Text unit of the voice recognition server, based on the voice of the video.

(Deleted)

A method for providing subtitles of a video from a subtitle providing server using a voice recognition server and collective intelligence, the method comprising:
(a) receiving a link to the video from a user terminal;
(b) obtaining, by the subtitle providing server, a sound source of the video by accessing the video link;
(c) converting the sound source of the video into audio subtitles; and
(d) playing the video together with the audio subtitles on the user terminal,
wherein step (c) further comprises:
an analysis step of analyzing a sound source waveform of the sound source to classify the sound source into noise and voice; and a step of converting the sound source into the audio subtitles based on the analysis of the sound source waveform,
wherein the step of converting into the audio subtitles further comprises: a separation step of separating voice sections of the sound source based on the analysis of the sound source waveform; a display step of displaying the voice sections; and a step of converting the sound source into the audio subtitles based on the displayed voice sections, and
wherein, in the analysis step, based on the shape of a learned sound source waveform obtained through machine learning, the sound source of the video is not recognized as voice when the shape of the sound source waveform does not have a predetermined level of similarity to the shape of the learned sound source waveform, and is recognized as voice when the shape of the sound source waveform has the predetermined level of similarity to the shape of the learned sound source waveform; in the separation step, the sound source of the video is separated into voice of a plurality of sections, excluding a plurality of noises; and in the display step, the voice of the plurality of sections is displayed so as to be distinguished by a preset section shape.
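The similarity test in the claim above can be pictured with the following sketch. The claim does not name a similarity measure or a model, so everything specific here is an assumption: cosine similarity is used as the measure, `learned_shape` stands in for whatever waveform shape a trained model would supply, and `min_similarity` plays the role of the predetermined level. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length sample vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a silent frame has no shape to compare
    return dot / (norm_a * norm_b)


def is_voice(frame, learned_shape, min_similarity=0.8):
    """Recognize a frame as voice only if its shape is similar enough.

    `learned_shape` is a placeholder for a waveform shape obtained through
    machine learning; `min_similarity` is the assumed predetermined level.
    """
    if len(frame) != len(learned_shape):
        raise ValueError("frame and learned shape must have the same length")
    return cosine_similarity(frame, learned_shape) >= min_similarity


learned_shape = [0, 1, 2, 1, 0]                    # toy stand-in for a learned waveform
print(is_voice([0, 2, 4, 2, 0], learned_shape))    # same shape, louder: True
print(is_voice([1, -1, 1, -1, 1], learned_shape))  # unrelated shape: False
```

Because cosine similarity ignores overall scale, this variant recognizes quiet speech that a pure magnitude threshold would reject, which is one plausible reason for claiming the two tests separately.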

A method for providing subtitles of a video from a subtitle providing server using a voice recognition server and collective intelligence, the method comprising:
(a) receiving a link to the video from a user terminal;
(b) obtaining, by the subtitle providing server, a sound source of the video by accessing the video link;
(c) converting the sound source of the video into audio subtitles; and
(d) playing the video together with the audio subtitles on the user terminal,
wherein step (c) further comprises:
an analysis step of analyzing a sound source waveform of the sound source to classify the sound source into noise and voice; and a step of converting the sound source into the audio subtitles based on the analysis of the sound source waveform,
wherein the step of converting into the audio subtitles further comprises: a separation step of separating voice sections of the sound source based on the analysis of the sound source waveform; a display step of displaying the voice sections; and a step of converting the sound source into the audio subtitles based on the displayed voice sections, and
wherein the step of subtitling the voice of the video further comprises: transmitting the displayed voice of a plurality of sections to a plurality of terminals; when at least one of the plurality of terminals clicks a section shape obtained through the display step, receiving, by that terminal, the voice of the corresponding section and being allotted a certain time during which an audio subtitle for the voice of that section can be input; inputting, by the at least one terminal, the audio subtitle for the voice section during the allotted time; and subtitling the sound source of the video based on the input audio subtitle.
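The collective-intelligence step in the claim above (a terminal clicks a section shape, receives that section's voice, and is allotted a time window in which to enter the audio subtitle) can be sketched as a small bookkeeping class. This is only an illustration under assumptions: the class name `SectionAssignments`, its methods, the exclusive one-terminal-per-section policy, and the injectable `clock` are all hypothetical and are not stated in the claim.

```python
import time


class SectionAssignments:
    """Tracks which terminal may subtitle which section, and until when."""

    def __init__(self, section_ids, allotted_seconds, clock=time.monotonic):
        self._allotted = allotted_seconds
        self._clock = clock                            # injectable for testing
        self._claims = {sid: None for sid in section_ids}  # sid -> (terminal, deadline)
        self.subtitles = {}                            # sid -> accepted audio subtitle

    def claim(self, section_id, terminal_id):
        """Called when a terminal clicks a section shape."""
        if section_id in self.subtitles:
            return False                               # already subtitled
        now = self._clock()
        current = self._claims[section_id]
        if current is not None and current[1] > now:
            return current[0] == terminal_id           # window still held (assumed exclusive)
        self._claims[section_id] = (terminal_id, now + self._allotted)
        return True

    def submit(self, section_id, terminal_id, text):
        """Accept the subtitle only from the holder, within the allotted time."""
        current = self._claims[section_id]
        if current is None or current[0] != terminal_id or self._clock() > current[1]:
            return False
        self.subtitles[section_id] = text
        return True


# Usage with a controllable clock (seconds); section ids are illustrative.
now = [0.0]
board = SectionAssignments(["601a", "602a"], allotted_seconds=30, clock=lambda: now[0])
board.claim("601a", "terminal-A")
now[0] = 10.0
board.submit("601a", "terminal-A", "Hello, everyone.")
```

Once a window expires without a submission, the section becomes claimable again, so an abandoned section does not block the rest of the subtitle providers.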

A method for providing subtitles of a video using a voice recognition server, a translation server, and collective intelligence, the method comprising:
(a) receiving a link to the video and a translation language from a user terminal;
(b) obtaining, by a subtitle providing server, a sound source of the video by accessing the video link;
(c) converting the sound source of the video into audio subtitles;
(d) converting, based on the translation language, the audio subtitles into translated subtitles through at least one of the translation server and collective intelligence; and
(e) playing the video together with the audio subtitles and the translated subtitles on the user terminal,
wherein step (c) further comprises:
an analysis step of analyzing a sound source waveform of the sound source to classify the sound source into noise and voice; and a step of converting the sound source into the audio subtitles based on the analysis of the sound source waveform,
wherein the step of converting into the audio subtitles further comprises: a separation step of separating voice sections of the sound source based on the analysis of the sound source waveform; a display step of displaying the voice sections; and a step of converting the sound source into the audio subtitles based on the displayed voice sections, and
wherein, in the analysis step, based on the magnitude of the sound source waveform, the sound source of the video is not recognized as voice when the magnitude of the sound source waveform falls below a preset magnitude, and is recognized as voice when the magnitude of the sound source waveform rises to the preset magnitude or more; in the separation step, the sound source of the video is separated into voice of a plurality of sections, excluding a plurality of noises; and in the display step, the voice of the plurality of sections is displayed so as to be distinguished by a preset section shape.

A system for providing subtitles of a video using a voice recognition server, a translation server, and collective intelligence, the system comprising:
a user terminal including a first input unit through which a link to the video is input and a first display unit that accesses the video link and plays the video;
a subtitle provider terminal including a second input unit through which audio subtitles and translated subtitles of the video are input and a second display unit that displays the video and translation sections;
a voice recognition server that converts the voice of the video into audio subtitles;
a translation server that translates the audio subtitles into a language set by the user; and
a subtitle providing server including a communication unit that communicates with the user terminal, the subtitle provider terminal, the voice recognition server, and the translation server, an analysis unit that analyzes the sound source of the video as a sound source waveform, a sync unit that synchronizes subtitles with the video, a database unit in which a database is stored, and a storage unit that stores the converted audio subtitles and the translated subtitles,
wherein the subtitle providing server analyzes the sound source waveform of the sound source to classify the sound source into noise and voice, and the voice recognition server converts the sound source into the audio subtitles based on the analysis of the sound source waveform,
wherein the subtitle providing server separates voice sections of the sound source based on the analysis of the sound source waveform, the subtitle provider terminal displays the voice sections on the second display unit, and the voice recognition server converts the sound source into the audio subtitles based on the displayed voice sections,
wherein, by the subtitle providing server, based on the magnitude of the sound source waveform, the sound source of the video is not recognized as voice when the magnitude of the sound source waveform falls below a preset magnitude, and is recognized as voice when the magnitude of the sound source waveform rises to the preset magnitude or more, and
wherein the sound source of the video is separated by the subtitle providing server into voice of a plurality of sections, excluding a plurality of noises, and the voice of the plurality of sections is displayed by the subtitle provider terminal so as to be distinguished by a preset section shape.