KR102035596B1

KR102035596B1 - System and method for automatically generating virtual character's facial animation based on artificial intelligence

Info

Publication number: KR102035596B1
Application number: KR1020180165748A
Authority: KR
Inventors: 조수진; 박지원
Original assignee: 주식회사 데커드에이아이피
Priority date: 2018-05-25
Filing date: 2018-12-20
Publication date: 2019-10-23

Abstract

According to the present invention, a method for automatically generating facial animation of a virtual character based on artificial intelligence comprises the steps of: a processing unit classifying input text and transmitting the part that requires a TTS service to an external TTS service provider; the processing unit receiving TTS service data from the TTS service provider, extracting a voice waveform and voice waveform information from that data, and decomposing the voice waveform into the form required for generating facial animation of the virtual character; a data unit collecting position information of designated points on a face pronouncing a specific language, converting it into 3D data, and classifying the 3D data according to preset features to generate a mouth-shape (lip-sync) data set; the data unit generating a personalized voice data set based on voice waveform data extracted from collected voice data of a specific person; an operation unit receiving the voice waveform and voice waveform information extracted by the processing unit and the lip-sync data set generated by the data unit, and performing the mathematical operations for a lip-sync animation; the operation unit synchronizing voice and animation in the course of converting the voice waveform information into voice data; and a forming unit substituting the values processed by the operation unit into 3D modeling data through a game engine to generate a final result including the lip-sync animation and voice. The invention thereby provides natural mouth motion that matches Korean pronunciation, so that the character feels like a real person.

Description

System and method for automatically generating virtual character's facial animation based on artificial intelligence

The present invention relates to a system and method for automatically generating animation and, more specifically, to an artificial-intelligence-based system and method for automatically generating facial animation of a virtual character, which uses cloud-based voice data to produce natural mouth-shape animation matching Korean pronunciation so that the character feels like a real person.

The face is a very important element in interaction and communication between people, but it is a very complex object to define and represent in engineering terms. Research on detecting, tracking, recognizing, modeling, synthesizing, and rendering faces is being actively pursued in fields such as computer graphics, computer vision, and computer animation.

In particular, facial animation aims to express facial movements and expressions realistically. Because it must convey both the anatomical structure of the face and its subtle expressions realistically, it is regarded as the most difficult area within computer animation.

Nevertheless, technical demand for facial animation is increasing, because in the digital content field it can be used for real-time facial animation, character animation, creation of conversational photorealistic avatars, video communication, generation of video animation of people in photographs, human interfaces, and the like.

Facial animation requires facial motion representation and facial motion synthesis techniques that draw on computer graphics and on computer vision techniques including image processing. It also requires artificial intelligence techniques such as machine learning for text-to-speech (TTS) conversion and speech synthesis.

Meanwhile, Korean Patent Laid-Open Publication No. 10-2010-0041586 (Patent Document 1) discloses an "apparatus and method for forming a voice-based face character". The voice-based face character forming method according to that document comprises: dividing a face character shape into a plurality of regions using a plurality of key models of the face character shape; performing voice parameterization, in which voice samples are analyzed to extract information on at least one parameter for recognizing pronunciation and emotion; when a voice is input, extracting information for each of the at least one parameter from the voice in frame units; and synthesizing a face character image for each of the divided face regions in frame units based on the per-parameter information.

Patent Document 1 may be able to generate three-dimensional face character expressions quickly from a user's voice alone and to provide voice-driven character face animation online in real time. However, the pronunciation data it applies is academic data derived for English, and it is forced onto Korean without modification or supplementation. As a result, the character appears unnatural and unlike a real person, which gives viewers a sense of incongruity (discomfort).

Korean Patent Laid-Open Publication No. 10-2010-0041586 (published April 22, 2010)

The present invention was devised to remedy the problems of the prior art described above. Its object is to provide an artificial-intelligence-based system and method for automatically generating facial animation of a virtual character which, by using cloud-based voice data to provide natural mouth-shape animation matching Korean pronunciation, can give the impression of a real person.

To achieve the above object, a system for automatically generating facial animation of an artificial-intelligence-based virtual character according to the present invention is characterized by comprising:

a processing unit that classifies text input by a user or an external source, transmits the portion for which a TTS (Text-to-Speech) service is to be requested to an external TTS service provider, extracts a voice waveform and voice waveform information from the TTS service data provided by the TTS service provider, and decomposes the voice waveform into the form required for generating facial animation of the virtual character;

a data unit that generates a mouth-shape (lip-sync) data set, in which position information of designated points on a face pronouncing a specific language is converted into 3D data and the resulting values are classified according to preset features, and a personalized voice data set based on voice waveform data extracted from collected voice data of a specific person;

an operation unit that is electrically connected to the processing unit and to the data unit, receives the voice waveform and voice waveform information extracted by the processing unit and the mouth-shape (lip-sync) data set generated by the data unit, performs the mathematical operations for a lip-sync animation, and synchronizes voice and animation in the course of converting the voice waveform information into voice data; and

a forming unit that is electrically connected to the operation unit and, through a game engine, substitutes the values processed by the operation unit into 3D modeling data to generate a final result including the lip-sync animation and voice.

Here, the processing unit may include an input unit that receives text provided by a user or an external source, and a voice and voice information extraction unit that classifies the input text, transmits the portion for which a TTS service is to be requested to the external TTS service provider, and extracts the voice waveform and voice waveform information from the TTS service data provided by the TTS service provider.

In addition, the processing unit may parse the content of the text provided by the user or external source and pass content related to emotion or action to the forming unit, so that an animation suited to the situation can be produced.

In addition, the TTS (Text-to-Speech) service provider may include a cloud speech synthesis system.

In this case, the cloud speech synthesis system may include Google TTS, IBM Watson, AWS (Amazon Web Services) Polly, Naver Clova, and the like.

In addition, the operation unit may include a lip-sync animation operation unit that receives the voice waveform and voice waveform information extracted by the processing unit and the mouth-shape (lip-sync) data set generated by the data unit, and performs the mathematical operations for the lip-sync animation.

In this case, the lip-sync animation operation unit may additionally receive the voice data set generated by the data unit when performing the mathematical operations for the lip-sync animation.

In addition, the forming unit may include a final result forming unit that substitutes the values processed by the operation unit into the 3D modeling data to produce a rendered final result including the lip-sync animation and voice.

In addition, to achieve the above object, a method for automatically generating facial animation of an artificial-intelligence-based virtual character according to the present invention,

which is based on a system for automatically generating facial animation of an artificial-intelligence-based virtual character that includes a processing unit, a data unit, an operation unit, and a forming unit, is characterized by comprising the steps of:

a) the processing unit classifying text input by a user or an external source and transmitting the portion for which a TTS (Text-to-Speech) service is to be requested to an external TTS service provider;

b) the processing unit receiving TTS service data from the TTS service provider, extracting a voice waveform and voice waveform information from the TTS service data, and decomposing the voice waveform into the form required for generating facial animation of the virtual character;

c) the data unit collecting position information of designated points on a face pronouncing a specific language, converting it into 3D data, and classifying the resulting values according to preset features to generate a mouth-shape (lip-sync) data set;

d) the data unit generating a personalized voice data set based on voice waveform data extracted from collected voice data of a specific person;

e) the operation unit receiving the voice waveform and voice waveform information extracted by the processing unit and the mouth-shape (lip-sync) data set generated by the data unit, and performing the mathematical operations for a lip-sync animation;

f) the operation unit synchronizing voice and animation in the course of converting the voice waveform information into voice data; and

g) the forming unit, through a game engine, substituting the values processed by the operation unit into 3D modeling data to generate a final result including the lip-sync animation and voice.

Here, the processing unit may receive, through its internal input unit, text provided by a user or an external source and may, through its voice and voice information extraction unit, classify the text received from the input unit, transmit the portion for which a TTS service is to be requested to the external TTS service provider, and extract the voice waveform and voice waveform information from the TTS service data provided by the TTS service provider.

In addition, the operation unit may, through its internal lip-sync animation operation unit, receive the voice waveform and voice waveform information extracted by the processing unit and the mouth-shape (lip-sync) data set generated by the data unit, and perform the mathematical operations for the lip-sync animation.

In this case, the lip-sync animation operation unit may additionally receive the voice data set generated by the data unit when performing the mathematical operations for the lip-sync animation.

In addition, the forming unit may, through its internal final result forming unit, substitute the values processed by the operation unit into the 3D modeling data to produce a rendered final result including the lip-sync animation and voice.

According to the present invention as described above, cloud-based voice data is used to provide natural mouth-shape animation matching Korean pronunciation, which has the advantage of giving the impression of a real person.

FIG. 1 schematically shows the configuration of a system for automatically generating facial animation of an artificial-intelligence-based virtual character according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the execution of a method for automatically generating facial animation of an artificial-intelligence-based virtual character according to an embodiment of the present invention.
FIG. 3 shows the vowel and silence tiers assigned coordinates using the vowel quadrilateral, with each vowel pronunciation placed in a two-dimensional coordinate system.
FIG. 4 schematically shows, in relation to the time at which each pronunciation occurs, the time within which the weight must travel from "ㅣ (I)" to "ㅗ (O)".
FIG. 5 schematically shows the process of transforming the shape for each pronunciation according to the change in time using linear interpolation.
FIG. 6 shows the application of the ease-in and ease-out methods to produce natural motion when applying weight values in the Blend Space.
FIG. 7 is a table of the existing IPA-based morph target list.
FIG. 8 shows a table in which the mouth- and jaw-related entries are changed in converting the existing IPA-based morph target list into a face-shape-based morph target list.
FIG. 9 is a table showing the IPA-based morph targets reorganized into face-shape-based morph targets.
FIG. 10 shows an example of a final result including lip-sync animation and voice.
FIG. 11 shows another example of a final result including lip-sync animation and voice.

Terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings. Rather, on the principle that an inventor may appropriately define terms in order to describe the invention in the best way, they should be interpreted with meanings and concepts consistent with the technical idea of the present invention.

Throughout the specification, when a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them. In addition, terms such as "... unit", "... device", "module", and "apparatus" used in the specification denote a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 schematically shows the configuration of a system for automatically generating facial animation of an artificial-intelligence-based virtual character according to an embodiment of the present invention.

Referring to FIG. 1, the system 100 for automatically generating facial animation of an artificial-intelligence-based virtual character according to the present invention includes a processing unit 110, a data unit 120, an operation unit 130, and a forming unit 140.

The processing unit 110 classifies text input by a user or an external source, transmits the portion for which a TTS (Text-to-Speech) service is to be requested to an external TTS service provider (not shown), extracts a voice waveform and voice waveform information from the TTS service data provided by the TTS service provider, and decomposes the voice waveform into the form required for generating the facial animation of the virtual character. The processing unit 110 may include an input unit 111 that receives text provided by a user or an external source (for example, the dialogue that the virtual character will speak), and a voice and voice information extraction unit 112 that classifies the input text, transmits the portion for which a TTS service is to be requested to the external TTS service provider, and extracts the voice waveform and voice waveform information from the TTS service data provided by the TTS service provider. The input unit 111 may receive the text (dialogue) in a specified JSON format so that paragraph classification and other animation information (facial expression and action) can be received together. An example of the JSON format used is as follows:

{"Speak": [{"Context": "안녕하세요. Deckard AIP의 비아입니다. 만나서 반갑습니다.",
            "Emotion": "Smile",
            "Body": "Hello"}]}

(The "Context" value reads: "Hello. I am Via of Deckard AIP. Nice to meet you.")

In addition, the voice and voice information extraction unit 112 classifies the received dialogue and selects the text to be requested from the external TTS service provider and the animation to be output. It decomposes the JSON-format dialogue received through the input unit 111, parses the "Context" field of the JSON, and transmits its content to the external TTS service provider.

In addition, the processing unit 110 (more specifically, the voice and voice information extraction unit 112) may parse the content of the text provided by the user or external source and pass content related to emotion or action to the forming unit 140, so that an animation suited to the situation can be produced. The voice and voice information extraction unit 112 also downloads the voice file and the voice information file provided by the external TTS service provider, parses the voice information file into an easy-to-use form, and passes it to the operation unit 130.
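The routing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_dialogue` and the lower-case keys of the returned cue dictionaries are hypothetical, and only the field names "Speak", "Context", "Emotion", and "Body" come from the example format given in the text.

```python
import json


def split_dialogue(raw_json):
    """Split JSON-format dialogue into TTS text and animation cues.

    Hypothetical helper: "Context" values are collected for the TTS request,
    while "Emotion" and "Body" values are collected for the forming unit.
    """
    doc = json.loads(raw_json)
    tts_texts = []
    animation_cues = []
    for entry in doc.get("Speak", []):
        tts_texts.append(entry["Context"])
        animation_cues.append({
            "emotion": entry.get("Emotion"),  # facial expression cue
            "body": entry.get("Body"),        # body action cue
        })
    return tts_texts, animation_cues


raw = '{"Speak": [{"Context": "안녕하세요.", "Emotion": "Smile", "Body": "Hello"}]}'
texts, cues = split_dialogue(raw)
```

Here `texts` would be sent to the TTS service provider and `cues` to the forming unit 140; how those transfers are actually performed is not specified in this section.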

In addition, the TTS (Text-to-Speech) service provider may include a cloud speech synthesis system. The cloud speech synthesis system may include Google TTS, IBM Watson, AWS (Amazon Web Services) Polly (that is, the TTS service of AWS), Naver Clova, and the like.

The processing unit 110 described above is now explained in a little more detail.

As described above, the processing unit 110 transmits text data to an external TTS service provider (for example, a cloud speech synthesis system), and the cloud speech synthesis system returns two kinds of information for the received text: a voice waveform (the data that will be output as the actual voice) and information about the voice waveform (phonetic symbols for pronunciation and duration information). The voice waveform is played back as the final voice for the animation, and the information about the waveform is sorted into timings and the viseme phonetic symbols corresponding to those timings. The phonetic symbols are divided into vowels and consonants.
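A minimal sketch of this classification step follows. The input format is an assumption: one JSON record per line with `time` (milliseconds), `type`, and `value` fields, loosely modelled on the speech-mark style of output that some cloud TTS services provide. The vowel symbol set, the `sil` silence symbol, and the function name are hypothetical and are not taken from the patent.

```python
import json

# Hypothetical symbol sets; a real service defines its own viseme alphabet.
VOWEL_VISEMES = {"a", "e", "i", "o", "u"}
SILENCE_VISEME = "sil"


def parse_viseme_marks(lines):
    """Turn line-delimited viseme records into timed, classified events."""
    marks = [json.loads(line) for line in lines if line.strip()]
    visemes = [m for m in marks if m.get("type") == "viseme"]
    events = []
    for i, mark in enumerate(visemes):
        # Duration is the gap to the next viseme; unknown for the last one.
        if i + 1 < len(visemes):
            duration = visemes[i + 1]["time"] - mark["time"]
        else:
            duration = None
        symbol = mark["value"]
        if symbol == SILENCE_VISEME:
            kind = "silence"
        elif symbol in VOWEL_VISEMES:
            kind = "vowel"
        else:
            kind = "consonant"
        events.append({"start_ms": mark["time"], "duration_ms": duration,
                       "symbol": symbol, "kind": kind})
    return events


sample = [
    '{"time": 0, "type": "viseme", "value": "p"}',
    '{"time": 120, "type": "viseme", "value": "a"}',
    '{"time": 300, "type": "viseme", "value": "sil"}',
]
events = parse_viseme_marks(sample)
```

Each event carries the timing and the vowel/consonant class that the operation unit 130 would need in order to schedule mouth shapes against the audio.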

The data unit 120 generates a mouth-shape (lip-sync) data set 121, in which position information of designated points on a face pronouncing a specific language is converted into 3D data and the resulting values are classified according to preset features, and a personalized voice data set 122 based on voice waveform data extracted from collected voice data of a specific person. The mouth-shape (lip-sync) data set 121 and the voice data set 122 are described further below.

<Lip-sync data set>

The mouth-shape (lip-sync) data set 121 makes use of the IPA (International Phonetic Alphabet). Because the IPA is a set of approximations made for the sake of practicality, it cannot reflect special continuities, and as a result each language has words that are difficult to express with its symbols. Therefore, to represent Korean, the main target language of the present invention, accurately, pronunciations that do not exist in Korean (for example, F, Z, and R) are removed, and phonetic symbols for vowels pronounced distinctively in Korean (for example, the 11 diphthongs such as ㅑ and ㅕ) are added to the table. This produces a comparison analysis table of the intersection of the IPA and Korean, and using this table in the Korean pronunciation recording and analysis stages allows data that fits Korean as closely as possible to be extracted.
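The table construction described above amounts to a filter followed by an extension, which can be sketched as follows. The symbol lists are a small illustrative subset, not the patent's actual table, and the function name and the "source" labels are hypothetical.

```python
def build_comparison_table(ipa_symbols, absent_in_korean, korean_additions):
    """Drop symbols with no Korean counterpart, then add Korean-specific ones."""
    kept = [s for s in ipa_symbols if s not in absent_in_korean]
    table = [{"symbol": s, "source": "IPA"} for s in kept]
    for s in korean_additions:
        if s not in kept:  # avoid listing a symbol twice
            table.append({"symbol": s, "source": "Korean-specific"})
    return table


ipa_subset = ["a", "i", "o", "m", "p", "f", "z", "r"]  # illustrative only
absent = {"f", "z", "r"}                               # named in the text
diphthongs = ["ㅑ", "ㅕ"]                               # two of the 11 mentioned
table = build_comparison_table(ipa_subset, absent, diphthongs)
symbols = [row["symbol"] for row in table]
```

With these inputs, `symbols` keeps the five shared sounds and gains the two diphthongs, while F, Z, and R are gone.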

The mouth-shape (lip-sync) data set 121 also makes use of recorded face data. The 3D face tracking technology of Apple ARKit, which uses a depth camera and is adopted in the present invention, is linked to the action units of FACS (Facial Action Coding System) and enables a parameterization that was not possible with the FACS action units alone. Through this parameterization, the FACS action unit numbers involved in each expression and the corresponding action unit values can be obtained, and the resulting groups of values are bundled into per-pronunciation groups that can later be used to generate animation.

In addition, tests on various distribution groups that differ in gender, age, and so on may be repeated, and the variable values of each unit analyzed to extract average values. The action units extracted for each pronunciation and the corresponding per-action-unit variable values are grouped to form a Korean viseme table. Per-group averages, and weights for later grouping of characters, are also obtained. Consequently, even when the language changes over time, the approach can be applied without additional work simply by collecting data from each period.
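The averaging step can be illustrated with a short sketch. The data layout (a list of `(viseme, {unit_name: value})` recordings), the function name, and the sample numbers are assumptions made for illustration; `jawOpen` and `mouthFunnel` are used here only as example unit names.

```python
from collections import defaultdict
from statistics import mean


def build_viseme_table(recordings):
    """Average each unit's value per viseme across all recorded speakers."""
    collected = defaultdict(lambda: defaultdict(list))
    for viseme, units in recordings:
        for unit, value in units.items():
            collected[viseme][unit].append(value)
    return {
        viseme: {unit: mean(values) for unit, values in units.items()}
        for viseme, units in collected.items()
    }


# Made-up sample values from two speakers for ㅏ and one speaker for ㅗ.
recordings = [
    ("ㅏ", {"jawOpen": 0.5, "mouthFunnel": 0.25}),
    ("ㅏ", {"jawOpen": 0.75, "mouthFunnel": 0.5}),
    ("ㅗ", {"jawOpen": 0.25, "mouthFunnel": 0.75}),
]
viseme_table = build_viseme_table(recordings)
```

Adding recordings from another period or demographic group only extends the input list, which matches the text's point that new data can be incorporated without changing the procedure.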

For the mouth-shape (lip-sync) data set 121, 3D modeling is performed on the basis of the data extracted through the IPA-based comparison analysis table, and a character set is constructed.

<Voice data set>

When a high-quality lip-sync animation targeting a specific person is desired, a personalization step that adds that person's characteristic way of speaking is required. For this, values unique to the person, rather than averages over many people, must be used during data acquisition; accordingly, the present invention extracts a large amount of data from a single person. The extracted data is then converted, through waveform analysis, into data describing the person's habits at specific waveforms and specific pronunciations. The data set values obtained in this process are passed to the TTS service provider (for example, a cloud speech synthesis system), which returns a voice file to which these values have been applied.

Referring again to FIG. 1, the operation unit 130 is electrically connected to each of the processing unit 110 and the data unit 120. It receives the voice waveform and voice waveform information extracted by the processing unit 110 and the mouth-shape (lip-sync) data set generated by the data unit 120, performs the mathematical operations for the lip-sync animation, and synchronizes the voice with the animation in the process of converting the voice waveform information into voice data. The operation unit 130 may include a lip-sync animation operation unit 131 that receives the voice waveform and voice waveform information extracted by the processing unit 110 and the mouth-shape (lip-sync) data set generated by the data unit 120 and performs the mathematical operations for the lip-sync animation. The lip-sync animation operation unit 131 may additionally receive the voice data set 122 generated by the data unit 120 when performing these operations.

The forming unit 140 is electrically connected to the operation unit 130 and, through a game engine, substitutes the values processed by the operation unit 130 into 3D modeling data to generate a final result that includes the lip-sync animation and voice. The forming unit 140 may include a final result forming unit 141 that substitutes the values processed by the operation unit 130 into the 3D modeling data and produces a rendered final result including the lip-sync animation and voice.

The operation unit 130 and the forming unit 140 are described in more detail below.

<Operation unit>

The present invention does not use the TTS service values provided by AWS Polly (the TTS service of Amazon Web Services) as they are, but uses values modified to fit Korean. To generate an animation that expresses natural pronunciation (hereinafter, "pronunciation animation"), the animation is organized into three layers: a vowel-and-silence layer, a consonant layer, and a diphthong layer. The main layer of the pronunciation animation is the vowel-and-silence layer, which is mapped to coordinates using the vowel quadrilateral, as shown in FIG. 3. That is, each vowel pronunciation is placed in a two-dimensional coordinate system. Each coordinate takes a value in the range 0 to 1. The data assigned to the coordinates are values extracted as averages for a specific group.

To drive the animation, the time values classified from the voice waveform information are used as the basis for moving between vowels. The destination of each vowel movement is explained with the example below.

Ex) "니코시아" (Nicosia)

The vowels to be pronounced in "니코시아" are "ㅣ(I)", "ㅗ(O)", "ㅣ(I)", and "ㅏ(A)". The time at which each pronunciation occurs is 0.230 s for "ㅣ(I)", 0.132 s for "ㅗ(O)", 0.259 s for "ㅣ(I)", and 0.169 s for "ㅏ(A)". As shown in FIG. 4, the time within which the weight must travel from "ㅣ(I)" to "ㅗ(O)" is 0.132 s. The limit is set to 0.132 s because lip sync holds only if the mouth shape is visually complete at the moment the sound of a given pronunciation is actually heard in the TTS-generated voice waveform. Since the time for the "ㅗ(O)" pronunciation is 0.132 s, the mouth shape must be complete within 0.132 s.

The weight between "ㅣ(I)" and "ㅗ(O)" changes as time passes, and the animation value is computed from that weight.
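The timing in the "니코시아" example can be sketched as a small keyframe calculation. The sketch assumes that each quoted time is the duration of the transition into that vowel, accumulated from an initial silence, and that the weight moves linearly; both are illustrative assumptions, and the function names are hypothetical.

```python
# Vowel sequence for "니코시아" with the per-vowel times quoted in the text (seconds).
VOWELS = [("ㅣ", 0.230), ("ㅗ", 0.132), ("ㅣ", 0.259), ("ㅏ", 0.169)]

def build_keyframes(vowels, start="silence"):
    """Return (arrival_time, vowel) pairs; each mouth shape must be complete at its arrival time."""
    keys, t = [(0.0, start)], 0.0
    for vowel, duration in vowels:
        t += duration
        keys.append((round(t, 3), vowel))
    return keys

def blend_at(keys, t):
    """Return (from_vowel, to_vowel, weight) at time t; weight runs linearly from 0 to 1."""
    for (t0, v0), (t1, v1) in zip(keys, keys[1:]):
        if t0 <= t <= t1:
            return v0, v1, (t - t0) / (t1 - t0)
    # Past the last keyframe: hold the final vowel.
    return keys[-1][1], keys[-1][1], 1.0

keys = build_keyframes(VOWELS)
```

Under these assumptions, `blend_at(keys, 0.296)` falls halfway through the 0.132 s window between "ㅣ" and "ㅗ".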

Consonant sounds such as "ㄴ", "ㅋ", and "ㅅ" are handled as additive animation: when a consonant's timing occurs, it moves quickly along a one-dimensional parameter, which makes the result look natural.

For diphthongs (for example, "ㅑ"), the one-dimensional parameter associated with IPA [j] is held until the vowel currently being pronounced ends, which realizes the diphthong.
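The two one-dimensional parameters described above might be sketched as follows. The triangular pulse shape and the 0.06 s pulse length are hypothetical choices made only for illustration; the text specifies only that the consonant parameter moves quickly and that the [j] parameter is held until the current vowel ends.

```python
def consonant_pulse(t, onset, duration=0.06):
    """Triangular 0 -> 1 -> 0 pulse on a 1D consonant parameter, starting at `onset` (seconds)."""
    x = (t - onset) / duration
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return 1.0 - abs(2.0 * x - 1.0)

def glide_j(t, vowel_start, vowel_end):
    """Hold the [j] parameter at 1.0 for the whole duration of the current vowel."""
    return 1.0 if vowel_start <= t < vowel_end else 0.0

def additive(base_value, extra_value):
    """Additive layering: the extra layer is added on top of the base vowel value, clamped to 1.0."""
    return min(1.0, base_value + extra_value)
```

Because both parameters return 0.0 outside their window, adding them to the vowel layer leaves the vowel animation unchanged whenever no consonant or diphthong is active.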

Some further explanation of the operation unit follows.

The lip-sync animation result can be developed through the following five methods of operation processing in the operation unit.

* Method 1: move every pronunciation to its target, 1:1

- This applies morphing technology: each pronunciation is built as a Morph Target and the shape is transformed between targets (complete morph).

- Because the end-point value of the shape transformation is fixed for each pronunciation, linear interpolation is used to apply the shape transformation in proportion to elapsed time.

- For example, to output the pronunciation "각도기" (protractor),

the linear interpolation formula f(p) = d2·f(p1) + d1·f(p2) is used, as follows. This is described with reference to FIG. 5.

[(start: silence) → '각']

morph for '각' ⇒ linear interpolation from 0 to 100

['각' → '도']

morph for '각' ⇒ linear interpolation from 100 to 0

morph for '도' ⇒ linear interpolation from 0 to 100

['기' → (end: silence)]

morph for '기' ⇒ linear interpolation from 100 to 0
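The crossfade listed above can be written out as a short sketch. It assumes that d1 and d2 in the formula are the normalized distances from p to p1 and to p2 (so that d1 + d2 = 1), which is the usual reading of linear interpolation; the function names are hypothetical.

```python
def lerp(f_p1, f_p2, p1, p2, p):
    """Linear interpolation f(p) = d2*f(p1) + d1*f(p2), with d1 and d2 normalized so d1 + d2 = 1."""
    d1 = (p - p1) / (p2 - p1)
    d2 = (p2 - p) / (p2 - p1)
    return d2 * f_p1 + d1 * f_p2

def crossfade(prev, nxt, progress):
    """Morph weights (0..100) while moving from syllable `prev` to `nxt`; None stands for silence."""
    weights = {}
    if prev is not None:
        weights[prev] = lerp(100.0, 0.0, 0.0, 1.0, progress)  # outgoing morph: 100 -> 0
    if nxt is not None:
        weights[nxt] = lerp(0.0, 100.0, 0.0, 1.0, progress)   # incoming morph: 0 -> 100
    return weights
```

For instance, a quarter of the way from '각' to '도' the outgoing morph is at 75 and the incoming morph at 25, and the two always sum to 100 during a syllable-to-syllable transition.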

Method 1 has the following two problems.

1) Consonants produce inappropriate values, causing the values to jump. In addition, overly fast speed changes and the uniform blending of morphs under linear interpolation look unnatural.

2) A large number of Morph Targets must be produced, which also incurs a loss in production data.

* Method 2

Method 2 addresses the problems of Method 1 by making targets only for vowels, which minimizes the assimilation of vowel morphs.

- To minimize consonant assimilation, the amount of data used in Method 1 is reduced to a minimum.

- Targets are produced only for 'ㅏ', 'ㅔ', 'ㅣ', 'ㅗ', 'ㅜ', and silence. The computation is the same as in Method 1.

- Reducing consonant assimilation allows more natural pronunciation.

- However, accurate articulation is not possible, because silence and the consonants that move the jaw joint are not represented.

* Method 3: vowel quadrilateral + specific consonants

- The vowel quadrilateral is used as the coordinate table that drives the animation. (The vowel quadrilateral and examples from the primary data are used.)

- This coordinate table is implemented as a BlendSpace.

- For this purpose, each vowel morph is positioned according to the vowel quadrilateral.

For example, in the vowel quadrilateral the pronunciation 'ㅏ' has coordinates (0.45, 0.1) and 'ㅗ' has (0.9, 0.6).

The coordinate values may be adjusted to suit the personality and feel of the designed character.

- In a coordinate system built this way, the blend weights change according to the BlendSpace computation.

- In addition, to make pronunciation more accurate, the animation is divided into layers for computation.

Vowel-and-silence layer → driven by the vowel quadrilateral

Consonant layer → operates additively, only in specific cases

Diphthong layer → operates additively, only in specific cases
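The idea of driving vowel morphs from a point in the vowel quadrilateral can be illustrated with the sketch below. It uses inverse-distance weighting as a simple stand-in and is not the BlendSpace algorithm of any particular game engine. Only the 'ㅏ' and 'ㅗ' coordinates come from the text; the other coordinates are hypothetical placeholders.

```python
import math

# 'ㅏ' and 'ㅗ' use the coordinates quoted in the text; the others are hypothetical placeholders.
VOWEL_COORDS = {
    "ㅏ": (0.45, 0.1),
    "ㅗ": (0.9, 0.6),
    "ㅣ": (0.1, 0.9),        # hypothetical
    "silence": (0.5, 0.5),   # hypothetical
}

def move(src, dst, t):
    """Point on the straight line from vowel `src` to vowel `dst` at progress t in [0, 1]."""
    (x0, y0), (x1, y1) = VOWEL_COORDS[src], VOWEL_COORDS[dst]
    return (x0 + (x1 - x0) * t, y0 + (y1 - y0) * t)

def blend_weights(point, eps=1e-9):
    """Inverse-distance weight of every vowel morph at `point`, normalized to sum to 1."""
    dists = {v: math.dist(point, c) for v, c in VOWEL_COORDS.items()}
    for v, d in dists.items():
        if d < eps:  # the point sits on a vowel: that morph gets the full weight
            return {u: 1.0 if u == v else 0.0 for u in dists}
    inv = {v: 1.0 / d for v, d in dists.items()}
    total = sum(inv.values())
    return {v: w / total for v, w in inv.items()}
```

Adjusting a character's feel, as the text allows, would then amount to editing the coordinate table rather than the blending code.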

* Method 3-1: adding target weights

- When the weights (hereinafter, "blend weights") are changed on the BlendSpace and applied to each layer, a simple linear interpolation formula is not used. Instead, a different interpolation formula is applied depending on the situation, which gives a more natural result.

- In the linear method, the rate of change over time is constant.

As shown in FIG. 6(A), the linear method is uniform motion and therefore does not look natural, so it is used only for testing.

- To make the application of blend weights on the BlendSpace look natural, the Ease-in method of FIG. 6(B) and the Ease-out method of FIG. 6(C) are actively used.

The vowel-and-silence layer uses the Ease-in method of (B), while the consonant and diphthong layers, which must be pronounced quickly and then disappear, use the Ease-out method of (C) so that their values change abruptly. Ease-In-Out is not applied, for the sake of naturalness: with Ease-In-Out the actual change happens in too short a time, so the animation appears not to match the change.
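The per-layer choice of interpolation can be sketched as below. The quadratic curves are an assumption made for illustration; the text names only the Ease-in and Ease-out shapes of FIG. 6 and does not give their formulas. The layer keys are hypothetical names.

```python
def linear(t):
    """Constant rate of change; per the text, suitable only for testing."""
    return t

def ease_in(t):
    """Slow start, fast finish (quadratic, assumed)."""
    return t * t

def ease_out(t):
    """Fast start, slow finish (quadratic, assumed)."""
    return 1.0 - (1.0 - t) ** 2

# Layer-to-curve assignment as described in the text.
LAYER_EASING = {
    "vowel_silence": ease_in,
    "consonant": ease_out,
    "diphthong": ease_out,
}

def layer_weight(layer, elapsed, duration):
    """Blend weight in [0, 1] for `layer` after `elapsed` seconds of a `duration`-second move."""
    t = min(max(elapsed / duration, 0.0), 1.0)
    return LAYER_EASING[layer](t)
```

Halfway through a move, the vowel layer has covered a quarter of its weight change while a consonant layer has already covered three quarters, which matches the "change abruptly" behavior the text asks of consonants.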

* Method 3-2: adding the data analyzed in the data unit

- Because mouth shape can differ from person to person even for the same pronunciation, targets are not defined as per-pronunciation shapes but as expressible facial movements. For example, the target for the 'ㅏ' pronunciation becomes JawDrop.

- The morph target list is changed. That is, the existing IPA-based morph target list shown in [Table 1] of FIG. 7 is replaced by a table in which the mouth- and jaw-related entries of the face-shape-based morph target list shown in [Table 2] of FIG. 8 are applied.

The method of reconstructing the "Viseme" with respect to the existing IPA-based morph targets (Table 1) and the face-shape-based morph targets (Table 2) is briefly described below.

The morph target data uses the data organized in the data unit. For example, to pronounce 'ㅏ', several morph targets can be combined (e.g., JawDrop (60%) + mouthShrugUpper (30%)); this is hereinafter called a combined morph.

For example, to express "No. 2, a (front unrounded near-open vowel)" in [Table 1] of FIG. 7, the values of "JawOpen" and "MouthShrugUpper" in [Table 2] of FIG. 8 are mixed in an appropriate ratio. This is summarized in the table of FIG. 9.
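A combined morph can be sketched as a lookup from each viseme to a recipe of face-shape morph targets. The 'ㅏ' recipe uses the percentages quoted in the text; the 'ㅗ' recipe, the target name `mouthFunnel`, and the clamping rule are hypothetical.

```python
# The 'ㅏ' recipe uses the percentages quoted in the text; the 'ㅗ' recipe is a hypothetical placeholder.
RECIPES = {
    "ㅏ": {"JawDrop": 0.60, "mouthShrugUpper": 0.30},
    "ㅗ": {"JawDrop": 0.25, "mouthFunnel": 0.70},  # hypothetical
}

def compose(viseme_weights):
    """Turn per-viseme weights (0..1) into per-morph-target weights, clamped to 1.0."""
    out = {}
    for viseme, weight in viseme_weights.items():
        for target, share in RECIPES[viseme].items():
            out[target] = min(1.0, out.get(target, 0.0) + weight * share)
    return out
```

With this layout, personalizing a character means editing the recipe table, while the blending logic that produces the viseme weights stays the same.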

<Forming unit>

The final result forming unit 141 of the forming unit 140 sets up, in a graphics production tool (for example, Maya or Max), morph targets to which the operation results of the operation unit 130 can be applied, and plays the staged scene, the lip-sync animation, and the voice signal simultaneously to give the user the rendered final result. For effective presentation, the final result forming unit 141 also controls playback of the expression and action animations received from the input unit 111, so that they can be played together with the pronunciation or at a desired point in time.

The following describes the method for automatically generating facial animation of an artificial intelligence-based virtual character, which is based on the system of the present invention having the configuration and technical features described above.

FIG. 2 is a flowchart showing the execution process of the method for automatically generating facial animation of an artificial intelligence-based virtual character according to an embodiment of the present invention.

Referring to FIG. 2, the method according to the present invention is based on the system 100 for automatically generating facial animation of an artificial intelligence-based virtual character, which includes the processing unit 110, the data unit 120, the operation unit 130, and the forming unit 140 described above. First, the processing unit 110 classifies the text entered by a user or an external source (for example, the lines the virtual character will speak) and transmits the part for which TTS (Text-to-Speech) service is requested to an external TTS service provider (not shown) (step S201). The external TTS service provider then provides TTS service data corresponding to the received part.

The processing unit 110 receives the TTS service data from the TTS service provider and extracts the voice waveform and voice waveform information from it. Based on the extracted voice waveform and voice waveform information, it decomposes the voice waveform into the form required for generating the facial animation of the virtual character (step S202).

Here, the processing unit 110 may receive the text provided by the user or external source through its internal input unit 111. Through its voice and voice information extraction unit 112, it may classify the text received from the input unit 111, transmit the part for which TTS service is requested to the external TTS service provider, and extract the voice waveform and voice waveform information from the TTS service data provided by the TTS service provider.

The processing unit 110 may also parse the content of the text provided by the user or external source and pass the content related to emotions or actions to the forming unit 140, so that an animation suited to the situation can be produced.
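The claims state that the dialogue is received in a JSon format whose "Context" content is sent to the TTS service, alongside expression and action information. A rough sketch of that split is shown below. Only the "Context" key comes from the document; the "Expression" and "Action" keys, the list-of-paragraphs layout, and the sample lines are hypothetical.

```python
import json

# Hypothetical dialogue payload; only the "Context" key is named in the document.
raw = """
[
  {"Context": "안녕하세요.", "Expression": "smile", "Action": "wave"},
  {"Context": "니코시아에 오신 것을 환영합니다.", "Expression": "neutral", "Action": null}
]
"""

def split_dialogue(raw_json):
    """Separate the text to request from the TTS provider from the animation cues."""
    tts_requests, animation_cues = [], []
    for index, paragraph in enumerate(json.loads(raw_json)):
        tts_requests.append(paragraph["Context"])
        animation_cues.append({
            "paragraph": index,
            "expression": paragraph.get("Expression"),
            "action": paragraph.get("Action"),
        })
    return tts_requests, animation_cues

tts_requests, animation_cues = split_dialogue(raw)
```

In this sketch the first list would go to the TTS service provider and the second to the forming unit; the actual transport to either is outside its scope.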

The TTS (Text-to-Speech) service provider may include a cloud speech synthesis system, such as Google TTS, IBM Watson, AWS (Amazon Web Service) Polly, or Naver Clova.

Meanwhile, the data unit 120 collects position information of designated points on a face pronouncing a specific language, converts it into 3D data, and classifies the resulting values according to preset features to generate the mouth-shape (lip-sync) data set 121 (step S203).

The data unit 120 also collects the voice data of a specific person and generates a personalized voice data set 122 based on the voice waveform data extracted from it (step S204).

Once the processing unit 110 has extracted the voice waveform and voice waveform information from the TTS service data, and the data unit 120 has generated the mouth-shape (lip-sync) data set 121 and the voice data set 122, the operation unit 130 receives the voice waveform and voice waveform information extracted by the processing unit 110 and the mouth-shape (lip-sync) data set 121 generated by the data unit 120, and performs the mathematical operations for the lip-sync animation (step S205). The operation unit 130 may perform these operations through its internal lip-sync animation operation unit 131, which receives the same inputs. The lip-sync animation operation unit 131 may additionally receive the voice data set 122 generated by the data unit 120 when performing the operations.

The operation unit 130 also synchronizes the voice with the animation in the process of converting the voice waveform information into voice data (step S206).

When the operation unit 130 has completed the mathematical operations for the lip-sync animation, the forming unit 140, through a game engine, substitutes the values computed by the operation unit 130 into 3D modeling data and generates a final result 142 that includes the lip-sync animation and voice, as shown in FIGS. 10 and 11 (step S207). The forming unit 140 may produce (output) the rendered final result 142, including the lip-sync animation and voice, through its internal final result forming unit 141. The final result 142 of FIG. 10 is connected to a conversational chatbot so that the character appears to speak when an answer is received. The final result 142 of FIG. 11 is configured so that the talking character can be deployed as a stand-alone executable or on the web.

As described above, the system and method for automatically generating facial animation of an artificial intelligence-based virtual character according to the present invention use cloud-based voice data to provide a natural mouth-shape animation that shows Korean pronunciation, and thus have the advantage of giving the impression of a real person.

The present invention has been described in detail above through preferred embodiments, but it is not limited to them. It is obvious to those of ordinary skill in the art that the invention can be modified and applied in various ways without departing from its technical idea. Accordingly, the true scope of protection of the present invention should be interpreted by the following claims, and all technical ideas within the equivalent scope should be construed as falling within the scope of rights of the present invention.

110: processing unit 111: input unit
112: voice and voice information extraction unit 120: data unit
121: lip-sync data set 122: voice data set
130: operation unit 131: lip-sync animation operation unit
140: forming unit 141: final result forming unit
142: final result

Claims

A system for automatically generating facial animation of an artificial intelligence-based virtual character, comprising:
a processing unit that classifies text entered by a user or an external source, transmits the part for which TTS (Text-to-Speech) service is requested to an external TTS service provider, extracts a voice waveform and voice waveform information from TTS service data provided by the TTS service provider, and decomposes the voice waveform into the form required for generating the facial animation of the virtual character;
a data unit that generates a mouth-shape (lip-sync) data set, in which values obtained by converting position information of designated points on a face pronouncing a specific language into 3D data are classified according to preset features, and a personalized voice data set based on voice waveform data extracted from collected voice data of a specific person;
an operation unit that is electrically connected to each of the processing unit and the data unit, receives the voice waveform and voice waveform information extracted by the processing unit and the mouth-shape (lip-sync) data set generated by the data unit, performs the mathematical operations for a lip-sync animation, and synchronizes the voice with the animation in the process of converting the voice waveform information into voice data; and
a forming unit that is electrically connected to the operation unit and, through a game engine, substitutes the values processed by the operation unit into 3D modeling data to generate a final result including the lip-sync animation and voice,
wherein the processing unit includes an input unit that receives the text provided by the user or the external source, and a voice and voice information extraction unit that classifies the received text, transmits the part for which TTS service is requested to the external TTS service provider, and extracts the voice waveform and voice waveform information from the TTS service data provided by the TTS service provider,
wherein the input unit is configured to receive the text (dialogue) in a specified JSon format so that paragraph classification and the expression and action information of the animation can be received at the same time, and
wherein the voice and voice information extraction unit classifies the received dialogue to select the text to request from the external TTS service provider and the animation to output, decomposes the JSon-format dialogue entered through the input unit, parses the "Context" content of the JSon and transmits it to the external TTS service provider, downloads the voice file and voice information file provided by the external TTS service provider, and parses the voice information file into an easy-to-use form and passes it to the operation unit.

(Deleted)

The system of claim 1,
wherein the processing unit parses the content of the text provided by the user or the external source and passes the content related to emotions or actions to the forming unit so that an animation suited to the situation can be produced.

(Deleted)

The system of claim 1,
wherein the operation unit includes a lip-sync animation operation unit that receives the voice waveform and voice waveform information extracted by the processing unit and the mouth-shape (lip-sync) data set generated by the data unit and performs the mathematical operations for the lip-sync animation.

The system of claim 6,
wherein the lip-sync animation operation unit additionally receives the voice data set generated by the data unit and performs the mathematical operations for the lip-sync animation.

The system of claim 1,
wherein the forming unit includes a final result forming unit that substitutes the values processed by the operation unit into the 3D modeling data and produces a rendered final result including the lip-sync animation and voice.

A method for automatically generating facial animation of an artificial intelligence-based virtual character, based on a system for automatically generating facial animation of an artificial intelligence-based virtual character that includes a processing unit, a data unit, an operation unit, and a forming unit, the method comprising:
a) classifying, by the processing unit, text input by a user or an external source, and transmitting the portion for which a TTS (Text-to-Speech) service is to be requested to an external TTS service provider;
b) receiving, by the processing unit, TTS service data provided by the TTS service provider, extracting a voice waveform and voice waveform information from the TTS service data, and decomposing the voice waveform into the form required for generating the facial animation of the virtual character;
c) generating, by the data unit, a mouth-shape (lip-sync) data set by collecting position information of designated points on a face pronouncing a specific language, converting it into 3D data, and classifying the converted values according to preset features;
d) generating, by the data unit, a personalized voice data set based on voice waveform data extracted from collected voice data of a specific person;
e) receiving, by the operation unit, the voice waveform and voice waveform information extracted by the processing unit and the mouth-shape (lip-sync) data set generated by the data unit, and performing a mathematical operation of a lip-sync animation;
f) synchronizing, by the operation unit, the voice and the animation in the course of converting the voice waveform information into voice data; and
g) generating, by the forming unit, a final result including the lip-sync animation and voice by substituting the values processed by the operation unit into 3D modeling data through a game engine,
wherein the processing unit receives, through an internal input unit, text provided by a user or an external source, classifies the text received from the input unit through a voice and voice information extraction unit, transmits the TTS service request target portion to the external TTS service provider, and extracts the voice waveform and voice waveform information from the TTS service data provided by the TTS service provider,
wherein the input unit receives the text (dialogue) in a specified JSON format so that paragraph classification and the animation's expression and behavior information can be received at the same time, and
wherein the voice and voice information extraction unit classifies the received dialogue to select the text to be requested from the external TTS service provider and the animation to be output, decomposes the JSON-format dialogue received through the input unit, parses the content of "Context" in the JSON and transmits it to the external TTS service provider, downloads the voice file and voice information file provided by the external TTS service provider, and parses the voice information file into an easy-to-use form and passes it to the operation unit.
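
As a rough illustration of the input handling recited above, the following Python sketch parses a JSON dialogue entry, extracts the "Context" text intended for the TTS request, and sets aside the expression and behavior fields for the animation side. Only the "Context" key comes from the claim. The "Paragraph", "Expression", and "Behavior" keys, the `DialogueLine` type, and the `split_dialogue` function are hypothetical names chosen for this sketch, and the actual request to a TTS service provider is omitted.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DialogueLine:
    """One parsed dialogue entry (hypothetical structure for this sketch)."""
    paragraph: Optional[int]   # paragraph classification, if present
    tts_text: str              # text to send to the external TTS provider
    expression: Optional[str]  # facial expression hint for the animation
    behavior: Optional[str]    # action hint for the animation


def split_dialogue(raw_json: str) -> List[DialogueLine]:
    """Parse JSON dialogue and separate TTS text from animation hints.

    "Context" is the key named in the claim; every other key is assumed.
    """
    entries = json.loads(raw_json)
    if isinstance(entries, dict):  # accept a single entry as well as a list
        entries = [entries]

    lines = []
    for entry in entries:
        text = entry.get("Context", "").strip()
        if not text:
            continue  # nothing to request from the TTS provider
        lines.append(DialogueLine(
            paragraph=entry.get("Paragraph"),
            tts_text=text,
            expression=entry.get("Expression"),
            behavior=entry.get("Behavior"),
        ))
    return lines


sample = """
[
  {"Paragraph": 1, "Context": "Hello.", "Expression": "smile", "Behavior": "wave"},
  {"Paragraph": 2, "Context": "  ", "Expression": "neutral"}
]
"""
parsed = split_dialogue(sample)
print(parsed[0].tts_text, parsed[0].expression, parsed[0].behavior)
```

Entries whose "Context" is empty are skipped here because they carry no text to synthesize. Whether such entries should still trigger an animation is a design choice the claim leaves open.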

(Deleted)

The method of claim 9,
wherein the processing unit parses the content of the text provided by the user or an external source and passes content related to emotion or action to the forming unit so that an animation suited to the situation can be produced.

(Deleted)

The method of claim 9,
wherein the operation unit, through an internal lip-sync animation operation unit, receives the voice waveform and voice waveform information extracted by the processing unit and the mouth-shape (lip-sync) data set generated by the data unit, and performs the mathematical operation of the lip-sync animation.
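
The claims do not specify the mathematical operation itself. One plausible reading, sketched below in Python under stated assumptions, is keyframe interpolation: each timed speech unit from the voice waveform information is mapped to a mouth shape from the lip-sync data set, and the 3D positions of the designated face points are linearly interpolated between neighbouring shapes. The `mouth_shapes` table, its phoneme labels and coordinates, the `(phoneme, start_time)` timing format, and all function names are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

Point3D = Tuple[float, float, float]

# Hypothetical lip-sync data set: phoneme label -> positions of tracked face points.
mouth_shapes: Dict[str, List[Point3D]] = {
    "sil": [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],   # closed mouth
    "a":   [(0.0, 1.0, 0.0), (0.0, -1.0, 0.0)],  # wide open
    "m":   [(0.0, 0.1, 0.0), (0.0, -0.1, 0.0)],  # lips nearly together
}


def lerp_points(a: List[Point3D], b: List[Point3D], t: float) -> List[Point3D]:
    """Linearly interpolate every tracked point between two mouth shapes."""
    return [
        tuple(pa + (pb - pa) * t for pa, pb in zip(p, q))
        for p, q in zip(a, b)
    ]


def mouth_at(time_s: float, timings: List[Tuple[str, float]]) -> List[Point3D]:
    """Return the mouth shape at time_s.

    timings is a list of (phoneme, start_time) pairs sorted by start_time,
    standing in for the parsed voice information file.
    """
    starts = [start for _, start in timings]
    i = bisect_right(starts, time_s) - 1
    if i < 0:                      # before the first phoneme
        return list(mouth_shapes[timings[0][0]])
    if i >= len(timings) - 1:      # at or after the last phoneme
        return list(mouth_shapes[timings[-1][0]])
    (p0, t0), (p1, t1) = timings[i], timings[i + 1]
    t = (time_s - t0) / (t1 - t0)  # t1 > t0 is guaranteed by bisect_right
    return lerp_points(mouth_shapes[p0], mouth_shapes[p1], t)


def sample_frames(timings: List[Tuple[str, float]], duration_s: float,
                  fps: int = 30) -> List[List[Point3D]]:
    """Sample one mouth shape per animation frame, on the audio's time base."""
    n_frames = round(duration_s * fps) + 1
    return [mouth_at(k / fps, timings) for k in range(n_frames)]


timings = [("sil", 0.0), ("a", 0.2), ("m", 0.4)]
frames = sample_frames(timings, duration_s=1.0, fps=10)
print(len(frames), frames[1])
```

Because the frames are sampled on the same time base as the phoneme start times, playing frame `k` at `k / fps` seconds keeps the mouth movement aligned with the audio, which is one simple way to achieve the voice-animation synchronization the method calls for.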

The method of claim 14,
wherein the lip-sync animation operation unit additionally receives the voice data set generated by the data unit and performs the mathematical operation of the lip-sync animation.

The method of claim 9,
wherein the forming unit, through an internal final result forming unit, substitutes the values processed by the operation unit into the 3D modeling data to produce a rendered final result including lip-sync animation and voice.