KR102134860B1 - Artificial Intelligence speaker and method for activating action based on non-verbal element

Info

Publication number: KR102134860B1
Application number: KR1020190116698A
Authority: KR
Inventors: 서형국; 이경한; 가에턴 게레로; 최재영
Original assignee: (주)제노임펙트
Priority date: 2019-09-23
Filing date: 2019-09-23
Publication date: 2020-08-27

Abstract

The present invention relates to an artificial intelligence speaker and a method for activating its operation based on a non-verbal element. The speaker includes: a motion recognition camera that recognizes a user's motion in three dimensions to obtain and provide a head rotation quaternion; a motion analysis unit that generates an eye contact event when the head rotation quaternion falls within a preset reference angle range; a microphone that receives a user voice signal; a voice command processing unit that applies natural language processing to the user voice signal to obtain a voice command; an operation mode setting unit that enters a request processing mode when an eye contact event occurs during an activation mode, and re-enters the activation mode if no user voice signal is received for a set time or longer during the request processing mode; a service execution unit that calls and executes a service corresponding to the voice command obtained during the request processing mode; and a speaker that reproduces, in real time, a sound corresponding to the service execution result of the service execution unit.

Description

Artificial Intelligence speaker and method for activating action based on non-verbal element

The present invention relates to an artificial intelligence speaker that detects a user's eye contact based on the user's head rotation information and uses it as operation activation information, and to a non-verbal element-based operation activation method for the speaker.

With recent advances in deep neural networks and speech recognition technology, artificial intelligence speakers have been commercialized and the market for them is growing rapidly. As input devices based on speech recognition, artificial intelligence speakers have given rise to various new forms of interaction. Such a voice-based interface lets users remotely obtain desired information or control devices with ease, but it shows limitations in some situations.

Because the user can currently interact with an artificial intelligence speaker only by voice, the user must call the name assigned to the speaker in order to activate it. Only after the speaker has been activated can the user convey a request.

For example, to play music, the user must (1) call the speaker's name and wait for the speaker to be activated, and then (2) once the speaker is activated, speak the request to play music.

This two-step interaction takes unnecessary time. In addition, because activation is decided from sound alone, without any understanding of context, unintended activation may occur.

Solving this requires an intelligent environment designed with the user in mind. Attempts to consider visual information and sound signals together for this purpose have existed for a long time, and there have also been attempts to achieve more accurate input activation by installing a camera in an artificial intelligence speaker system.

However, most studies have focused on improving the performance of voice-based human-computer interaction in terms of input accuracy and security.

Rather, the design of when an artificial intelligence speaker activates is tied to a culture in which people perceive gaps and overlaps in conversation as negative signals. That is, an artificial intelligence speaker that converses more naturally and socially must not interrupt the user while the user is speaking (no overlap) and, at the same time, must minimize the time between the end of the user's speech and the start of its own reply (silence).

To this end, the artificial intelligence speaker must determine precisely when the other party has finished speaking and when it should begin speaking itself. This is as important for the usability of an artificial intelligence speaker system as correctly determining the meaning of what the user said.

However, little research has been done on the appropriate timing of such input activation or on its effect on the quality of the conversation between the artificial intelligence speaker and the user.

Korean Registered Patent No. 10-1970731 (registration date: 2019.04.15)

To solve the above problems, the present invention provides an artificial intelligence speaker whose operation is activated by detecting eye contact between the user and the camera based on the user's head rotation information in camera coordinates, and a non-verbal element-based operation activation method for the speaker.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood from the following description by those of ordinary skill in the art to which the present invention pertains.

As a means for solving the above problem, one embodiment of the present invention provides an artificial intelligence speaker including: a motion recognition camera that recognizes a user's motion in three dimensions to obtain and provide a head rotation quaternion; a motion analysis unit that generates an eye contact event when the head rotation quaternion falls within a preset reference angle range; a microphone that receives a user voice signal; a voice command processing unit that applies natural language processing to the user voice signal to obtain a voice command; an operation mode setting unit that enters a request processing mode when an eye contact event occurs during an activation mode, and re-enters the activation mode when no user voice signal is received for a set time or longer during the request processing mode; a service execution unit that calls and executes a service corresponding to the voice command obtained during the request processing mode; and a speaker that reproduces, in real time, a sound corresponding to the service execution result of the service execution unit.

The motion analysis unit collects and averages head rotation quaternions over a preset time and then determines whether an eye contact event occurs based on the average value.

The voice command processing unit further includes a function of generating a call event when a user voice signal corresponding to a preset call command is input during the activation mode.

The operation mode setting unit further includes a function of additionally taking the call event into account when determining whether to enter the request processing mode.

The motion analysis unit further includes a function of presetting a motion gesture for generating a call event, comparing the head rotation quaternion against the motion gesture, and additionally determining whether a call event occurs.

As a means for solving the above problem, another embodiment of the present invention provides a method for activating the operation of an artificial intelligence speaker, the method including: when an activation mode is set, recognizing a user's motion in three dimensions through a motion recognition camera to obtain a head rotation quaternion; entering a request processing mode when the head rotation quaternion falls within a preset reference angle range; and, upon entering the request processing mode, receiving and analyzing the user's voice through a microphone to call and execute the service the user requires.

In the step of entering the request processing mode, head rotation quaternions are collected and averaged over a preset time and then compared with the preset reference angle range to determine whether to enter the request processing mode.

By activating the artificial intelligence speaker through the non-verbal element of user eye contact, the present invention achieves faster responsiveness than voice-based activation.

In addition, by changing the operation mode using a combination of eye contact recognition and speech recognition, the invention can support more effective and more accurate user interaction.

FIGS. 1 and 2 are diagrams illustrating an artificial intelligence speaker according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining a non-verbal element-based operation activation method of an artificial intelligence speaker according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining a non-verbal element-based operation activation method of an artificial intelligence speaker according to another embodiment of the present invention.

The following merely illustrates the principles of the present invention. Those skilled in the art will therefore be able to devise various devices that, although not explicitly described or shown herein, embody the principles of the invention and fall within its concept and scope. Furthermore, all conditional terms and embodiments listed herein are, in principle, expressly intended only to aid understanding of the concept of the invention, and should be understood as not being limited to the specifically listed embodiments and conditions.

In addition, all detailed descriptions that recite the principles, aspects, and embodiments of the present invention, as well as specific embodiments, should be understood as intended to include their structural and functional equivalents. Such equivalents should be understood to include not only currently known equivalents but also equivalents developed in the future, that is, all elements invented to perform the same function regardless of structure.

Thus, for example, the block diagrams herein should be understood as representing conceptual views of exemplary circuits that embody the principles of the invention. Similarly, all flowcharts, state transition diagrams, pseudocode, and the like should be understood as representing various processes that can be substantially represented in a computer-readable medium and performed by a computer or processor, whether or not the computer or processor is explicitly shown.

The functions of the various elements shown in the drawings, including functional blocks labeled as processors or similar concepts, may be provided by dedicated hardware as well as by hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, a single shared processor, or a plurality of individual processors, some of which may be shared.

Furthermore, explicit use of terms such as processor, control, or similar concepts should not be construed as referring exclusively to hardware capable of executing software, and should be understood to implicitly include, without limitation, digital signal processor (DSP) hardware, and ROM, RAM, and non-volatile memory for storing software. Other well-known, commonly used hardware may also be included.

In the claims of this specification, an element expressed as a means for performing a function described in the detailed description is intended to encompass any way of performing that function, including, for example, a combination of circuit elements that performs the function, or software in any form, including firmware or microcode, combined with appropriate circuitry for executing that software to perform the function. Because the invention defined by such claims combines the functions provided by the variously recited means in the manner the claims require, any means capable of providing those functions should be understood as equivalent to those identified in this specification.

The above objects, features, and advantages will become clearer from the following detailed description taken in conjunction with the accompanying drawings, so that those of ordinary skill in the art to which the present invention pertains will be able to readily practice its technical idea. In describing the present invention, detailed descriptions of related known technologies are omitted where they are judged likely to obscure the gist of the invention unnecessarily.

FIG. 1 is a diagram illustrating an artificial intelligence speaker according to an embodiment of the present invention.

Referring to FIG. 1, the present invention includes a motion recognition camera 110, a motion analysis unit 120, a microphone 130, a voice processing unit 140, an operation mode setting unit 150, a service execution unit 160, a speaker 170, and the like.

The motion recognition camera 110 recognizes the user's motion in three dimensions and, in particular, obtains and provides in three dimensions the rotation quaternion of the head among the parts of the user's body.

The camera can be implemented as a depth camera that, using an ordinary camera and an infrared camera, outputs a color map, a depth map, an infrared map, and the like, and from these tracks and provides the user's face information and a quaternion indicating how far the head is rotated in camera coordinates. A representative example of a depth camera is the Kinect camera.

The motion analysis unit 120 determines whether an eye contact event occurs based on the head rotation quaternion extracted from the motion recognition camera 110.

For reference, if a three-dimensional face model (object) is created and placed in a three-dimensional virtual environment, and the z-axis of the model is then rotated according to the head rotation quaternion, a result such as that shown in FIG. 2 is obtained. When the user looks straight at the camera, from whatever position, the head appears unrotated from the camera's point of view. Accordingly, the rotation quaternion tracked while the user looks straight at the camera represents almost no rotation.

Using this property, the present invention assumes that eye contact has been made between the user and the camera when the angle between the z-axis unit vector and the vector obtained by rotating that unit vector by the user's head rotation quaternion is at or below a certain threshold. The threshold may be 10 degrees, but this value can be changed.
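The angle test described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the camera reports a unit quaternion in `(w, x, y, z)` order, and the function names `rotate_by_quaternion`, `gaze_angle_deg`, and `is_eye_contact` are hypothetical. Only the 10-degree default comes from the description.

```python
import math

def rotate_by_quaternion(q, v):
    """Rotate 3-vector v by quaternion q = (w, x, y, z); q is assumed to be unit length."""
    w, x, y, z = q
    vx, vy, vz = v
    # t = 2 * (q_vec x v)
    tx = 2.0 * (y * vz - z * vy)
    ty = 2.0 * (z * vx - x * vz)
    tz = 2.0 * (x * vy - y * vx)
    # v' = v + w * t + (q_vec x t)
    return (
        vx + w * tx + (y * tz - z * ty),
        vy + w * ty + (z * tx - x * tz),
        vz + w * tz + (x * ty - y * tx),
    )

def gaze_angle_deg(q):
    """Angle in degrees between the z unit vector and that vector rotated by q."""
    rx, ry, rz = rotate_by_quaternion(q, (0.0, 0.0, 1.0))
    norm = math.sqrt(rx * rx + ry * ry + rz * rz)  # guards against slight denormalization
    cos_theta = max(-1.0, min(1.0, rz / norm))     # clamp for numerical safety
    return math.degrees(math.acos(cos_theta))

def is_eye_contact(q, threshold_deg=10.0):
    """True when the head direction is within threshold_deg of the camera's z-axis."""
    return gaze_angle_deg(q) <= threshold_deg
```

Note that a rotation purely about the z-axis (head roll) leaves the z unit vector unchanged, so under this test it does not affect the result.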

It was also found that the values are very unstable when the camera tracks the direction of the user's head. The present invention therefore considers that performing eye contact detection on every frame could give unstable results and, to compensate, places the angles between the head direction vector and the z-direction unit vector in a buffer and performs detection based on their average. A buffer size of 20 is preferable, but this value can be changed. That is, the head rotation quaternions are collected and averaged over a predetermined time, and whether an eye contact event occurs is determined by checking whether the average falls within a preset angle.
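The buffered averaging can be sketched as below. The class name `EyeContactDetector` is hypothetical, and waiting until the buffer is full before reporting an event is an assumption of this sketch; the description only gives the buffer size of 20 and the averaging step.

```python
from collections import deque

class EyeContactDetector:
    """Smooths per-frame head angles with a fixed-size buffer before thresholding."""

    def __init__(self, buffer_size=20, threshold_deg=10.0):
        self.angles = deque(maxlen=buffer_size)  # oldest angle is dropped automatically
        self.threshold_deg = threshold_deg

    def update(self, angle_deg):
        """Add one per-frame angle; return True if the buffered mean is within the threshold."""
        self.angles.append(angle_deg)
        if len(self.angles) < self.angles.maxlen:
            return False  # assumption: no decision until the buffer has filled
        mean_angle = sum(self.angles) / len(self.angles)
        return mean_angle <= self.threshold_deg
```

With a frame rate of, say, 30 frames per second, a buffer of 20 corresponds to roughly two thirds of a second of sustained gaze before an event fires; the frame rate is not stated in the description.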

The microphone 130 receives a user voice signal.

The voice processing unit 140 is equipped with a voice processing algorithm, through which it applies natural language processing to the user voice signal to obtain and output a voice command.

In addition, the voice processing unit 140 predefines a call command for changing the operation mode. When the user speaks the call command, the unit responds by generating and outputting a call event.
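As a rough illustration of the call event, the sketch below checks an already transcribed utterance against a predefined list of call commands. The speech-to-text step is out of scope here, and the command string `"hey speaker"` and the function name `detect_call_event` are hypothetical placeholders rather than anything specified in the description.

```python
# Hypothetical call commands; the description does not name any specific phrase.
CALL_COMMANDS = ("hey speaker",)

def detect_call_event(transcript, call_commands=CALL_COMMANDS):
    """Return True when the transcribed utterance contains a predefined call command."""
    # Lowercase and collapse whitespace so spacing and case differences do not matter.
    normalized = " ".join(transcript.lower().split())
    return any(command in normalized for command in call_commands)
```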

The operation mode setting unit 150 has two operation modes: an activation mode and a request processing mode. When at least one of an eye contact event and a call event occurs during the activation mode, it enters the request processing mode; when no user voice signal is received for a set time or longer during the request processing mode, it re-enters the activation mode.
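The two-mode behavior can be sketched as a small state machine. The names `Mode` and `OperationModeSetter`, the explicit `now` timestamps, and the 5-second default timeout are assumptions made for illustration; the description only says the timeout is a set time.

```python
from enum import Enum

class Mode(Enum):
    ACTIVATION = "activation"                    # waiting for eye contact or a call command
    REQUEST_PROCESSING = "request_processing"    # listening for a voice command

class OperationModeSetter:
    def __init__(self, silence_timeout_s=5.0):   # timeout value is an assumption
        self.mode = Mode.ACTIVATION
        self.silence_timeout_s = silence_timeout_s
        self._last_voice_time = None

    def on_trigger(self, now):
        """Handle an eye contact event or a call event."""
        if self.mode is Mode.ACTIVATION:
            self.mode = Mode.REQUEST_PROCESSING
            self._last_voice_time = now          # start the silence timer

    def on_voice(self, now):
        """Handle receipt of a user voice signal."""
        if self.mode is Mode.REQUEST_PROCESSING:
            self._last_voice_time = now          # restart the silence timer

    def tick(self, now):
        """Fall back to the activation mode after prolonged silence; return the current mode."""
        if (self.mode is Mode.REQUEST_PROCESSING
                and now - self._last_voice_time >= self.silence_timeout_s):
            self.mode = Mode.ACTIVATION
        return self.mode
```

Passing timestamps in explicitly, rather than reading a clock inside the class, keeps the sketch deterministic and easy to test.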

The service execution unit 160 is selectively activated only in the request processing mode. Using an artificial intelligence algorithm, it calls and executes the service corresponding to the voice command obtained through the voice processing unit 140. The service may be implemented and provided by the device alone or, where necessary, through cooperation with an external server connected over the Internet.

The speaker 170 reproduces, in real time, a sound corresponding to the service execution result of the service execution unit 160 so that the user can perceive it audibly.

FIG. 3 is a diagram for explaining a non-verbal element-based operation activation method of an artificial intelligence speaker according to an embodiment of the present invention.

When the artificial intelligence speaker starts running, the operation mode setting unit 150 first sets the activation mode (S1).

The motion recognition camera 110 then captures the user located in front of it and obtains a head rotation quaternion that reflects where the user's head is facing (S2).

If the obtained head rotation quaternion falls within the preset reference angle (S3), the motion analysis unit 120 determines that the user is looking at the artificial intelligence speaker (in particular, the motion recognition camera) in order to use it, and generates an eye contact event (S4).

The operation mode setting unit 150 then enters the request processing mode and activates the service execution unit 160 (S5).

In this state, when the user requests a specific service by voice (S6), the service execution unit 160 calls and executes the corresponding service and then announces the service execution result by voice through the speaker 170 (S7).

On the other hand, if the request processing mode has been entered but no user voice is input before the preset time elapses (S8), the speaker returns to the activation mode of step S1 (S1).

In this way, the present invention detects a non-verbal element such as the user's eye contact with the camera and thereby allows entry into the activation mode in which the user's voice can be recognized and processed.

Meanwhile, if the motion recognition camera is located near a place where the user's gaze often rests, unintended activation may occur. For example, if the motion recognition camera is installed directly in front of a TV, the artificial intelligence speaker may mistakenly conclude that the user is looking at it while the user is watching TV.

Accordingly, the present invention activates the artificial intelligence speaker only when the user speaks the call command while looking at the motion recognition camera, so that the accuracy of user interaction can be maximized.

FIG. 4 is a diagram for explaining a non-verbal element-based operation activation method of an artificial intelligence speaker according to another embodiment of the present invention.

First, when the activation mode is set (S1), a head rotation quaternion is obtained through the motion recognition camera 110 (S2), and the motion analysis unit 120 checks whether an eye contact event has occurred (S4).

At the same time, the user's voice is acquired through the microphone 130 (S11) and analyzed by the voice processing unit 140 to check whether the user utters the call command for generating a call event (S12, S13).

Only when the eye contact event and the call event occur simultaneously through steps S4 and S13 does the operation mode setting unit 150 enter the request processing mode, so that the service corresponding to the user's speech recognition result can be called and executed.
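The combined condition can be sketched as below. The description says only that the two events must occur simultaneously; treating "simultaneously" as "within a short time window" is an assumption of this sketch, as are the class name `CombinedTrigger` and the 1-second default window.

```python
class CombinedTrigger:
    """Fires only when an eye contact event and a call event fall within window_s of each other."""

    def __init__(self, window_s=1.0):  # window length is an assumption
        self.window_s = window_s
        self._last_eye = None          # time of the most recent eye contact event
        self._last_call = None         # time of the most recent call event

    def on_eye_contact(self, now):
        self._last_eye = now
        return self._both_recent()

    def on_call(self, now):
        self._last_call = now
        return self._both_recent()

    def _both_recent(self):
        if self._last_eye is None or self._last_call is None:
            return False
        return abs(self._last_eye - self._last_call) <= self.window_s
```

A `True` return value would be the point at which the operation mode setting unit switches to the request processing mode.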

That is, the present invention combines the call command method with the eye contact method so that the artificial intelligence speaker can be activated after the user's intention has been determined more accurately.

The present invention also allows the artificial intelligence speaker to be activated based on motion gestures that the user makes in various ways, instead of eye contact.

That is, a motion gesture for generating a call event is predefined through the motion analysis unit 120, and when it is confirmed during the activation mode that the user has made the corresponding motion, a call event can be generated immediately.

For example, a motion of nodding the head twice in succession, or a motion of repeatedly turning the head left and right, is set as the motion gesture corresponding to a call event, and the motion recognition camera 110 is used to monitor repeatedly whether the user makes that motion. When the user makes the preset motion gesture during the activation mode, a call event is generated in response, so that the artificial intelligence speaker is activated and operates in the request processing mode.
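One way to recognize the double-nod example is sketched below. The description does not say how gestures are matched, so everything here is an assumption: the input is a sequence of head pitch angles in degrees already derived from the head rotation quaternions, negative pitch means the head is tilted down, and the two thresholds and the function names are illustrative.

```python
def count_nods(pitch_deg, down_threshold=-15.0, up_threshold=-5.0):
    """Count dip-and-return cycles in a sequence of head pitch angles (degrees).

    Assumes negative pitch means the head is tilted downward. Using two
    thresholds (hysteresis) keeps small jitter from being counted as a nod.
    """
    nods = 0
    head_down = False
    for pitch in pitch_deg:
        if not head_down and pitch <= down_threshold:
            head_down = True       # head has dipped far enough
        elif head_down and pitch >= up_threshold:
            head_down = False      # head has come back up: one nod completed
            nods += 1
    return nods

def is_double_nod(pitch_deg):
    """True when the sequence contains at least two completed nods."""
    return count_nods(pitch_deg) >= 2
```

The left-right head turn mentioned in the text could be handled the same way by applying the counting logic to yaw instead of pitch.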

The method according to the present invention described above may be written as a program to be executed on a computer and stored in a computer-readable recording medium. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, and also include media implemented in the form of a carrier wave (for example, transmission over the Internet).

The computer-readable recording medium may be distributed over computer systems connected by a network, so that computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the method can be readily inferred by programmers in the technical field to which the present invention pertains.

Although preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described. Various modifications can of course be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical spirit or outlook of the present invention.

Claims

An artificial intelligence speaker comprising:
a motion recognition camera that recognizes a user's motion in three dimensions to obtain and provide a head rotation quaternion;
a motion analysis unit that generates an eye contact event when the head rotation quaternion falls within a preset reference angle range;
a microphone that receives a user voice signal;
a voice command processing unit that predefines at least one call command and then obtains a voice command by natural language processing of the user voice signal, and immediately generates a call event when the voice command is the call command;
an operation mode setting unit that enters a request processing mode when the eye contact event and the call event occur simultaneously during an activation mode, and re-enters the activation mode when a state in which no user voice signal is received persists for a set time or longer during the request processing mode;
a service execution unit that calls and executes a service corresponding to the voice command obtained during the request processing mode; and
a speaker that reproduces, in real time, sound corresponding to a service execution result of the service execution unit,
wherein the motion analysis unit collects and averages head rotation quaternions over a preset time and then determines, based on the average value, whether to generate the eye contact event.

(Deleted)

The artificial intelligence speaker of claim 1, wherein the motion analysis unit further includes a function of presetting a motion gesture for generating a call event, comparing and analyzing the head rotation quaternion against the motion gesture, and then additionally determining whether to generate a call event.

A method for activating operation of an artificial intelligence speaker based on a non-verbal element, the method comprising:
when an activation mode is set, recognizing a user's motion in three dimensions through a motion recognition camera to obtain a head rotation quaternion;
generating an eye contact event when the head rotation quaternion falls within a preset reference angle range;
receiving and analyzing a user's voice through a microphone to obtain a voice command, and generating a call event when the voice command is a predefined call command; and
entering a request processing mode when the eye contact event and the call event occur simultaneously during the activation mode and then calling and executing a service corresponding to the voice command, and re-entering the activation mode when no voice command is obtained for a preset time or longer during the request processing mode,
wherein the step of generating the call event collects and averages head rotation quaternions over a preset time and then determines, based on the average value, whether to generate the eye contact event.

(Deleted)