KR20200114019A

KR20200114019A - The method and apparatus for identifying speaker based on pitch information

Info

Publication number: KR20200114019A
Application number: KR1020190035068A
Authority: KR
Inventors: 송유중; 김우중
Original assignee: 주식회사 공훈
Priority date: 2019-03-27
Filing date: 2019-03-27
Publication date: 2020-10-07

Abstract

According to one embodiment of the present invention, a method for identifying a speaker based on pitch information of a voice may comprise the steps of: extracting a pitch section from voice data obtained through an utterance from a user; performing voice activity detection (VAD) on the extracted pitch section; extracting feature information of the voice from the performed VAD; generating a speech recognition model for speaker identification based on the extracted feature information; and identifying a speaker based on the generated speech recognition model.

Description

Method and apparatus for identifying a speaker based on pitch information of a voice

The present invention relates to a method and apparatus for identifying a speaker based on pitch information of a voice. More specifically, it relates to a method and apparatus that extract only the pitch sections from a user's voice data before performing voice activity detection (VAD), and use only the extracted pitch sections as the input to the VAD, in order to increase the accuracy of speaker verification.

In existing speaker verification systems, the point at which the voice activity detection (VAD) section is extracted falls into one of two cases. In the first, VAD is applied to the sound source itself; in the second, VAD is executed after voice feature extraction. Executing VAD has the advantage that unvoiced and voiced sections are distinguished and noise is attenuated, so that voice features with somewhat lower variance can be extracted. Techniques such as VAD have long been widely used in the field of speech recognition and have also been used in some speaker verification applications.

1. Korean Patent Registration No. 10-1831078 (registered February 13, 2018)

The present invention uses VAD to improve speaker verification performance, but extracts only the pitch sections before the VAD procedure and uses them as the input to the VAD. It thereby aims to provide a method and apparatus that improve the response speed of a speaker verification system and further increase the accuracy of speaker identification compared with the prior art.

As an embodiment of the present invention, a method and apparatus for identifying a speaker based on pitch information of a voice may be provided.

A method for identifying a speaker based on pitch information of a voice according to an embodiment of the present invention may include: extracting a pitch section from voice data obtained through an utterance from a user; performing voice activity detection (VAD) on the extracted pitch section; extracting feature information of the voice from the performed VAD; generating a speech recognition model for speaker identification based on the extracted feature information; and identifying a speaker based on the generated speech recognition model.

Before the VAD is performed, only the pitch sections may be extracted from the voice data, and only the extracted pitch sections may be used as the input to the VAD.

In the step of extracting feature information of the voice from the performed VAD, a feature vector unique to the voice may be extracted.

In the step of identifying a speaker based on the generated speech recognition model, a pitch section is extracted from voice data obtained from a user's utterance, VAD is performed on the extracted pitch section, and the feature information then extracted is compared with the feature information contained in the speech recognition model, so that the speaker can be identified according to a preset reference value.

The preset reference value is a threshold set in advance. If the result of comparing the feature information is greater than or equal to the threshold, the speaker may be identified as the user; if the result is less than the threshold, the speaker may be identified as someone other than the user.

An apparatus for identifying a speaker based on pitch information of a voice according to an embodiment of the present invention may include: a pitch information extraction unit that extracts a pitch section from voice data obtained through an utterance from a user; a VAD execution unit that performs voice activity detection (VAD) on the extracted pitch section; a feature information extraction unit that extracts feature information of the voice from the performed VAD; a speech recognition model generation unit that generates a speech recognition model for speaker identification based on the extracted feature information; and a speaker identification unit that identifies a speaker based on the generated speech recognition model.

Before the VAD execution unit performs VAD, only the pitch sections may be extracted from the voice data by the pitch information extraction unit, and only the extracted pitch sections may be provided as the input to the VAD.

Meanwhile, as an embodiment of the present invention, a computer-readable recording medium on which a program for executing the above-described method on a computer is recorded may be provided.

With the method and apparatus for identifying a speaker based on pitch information of a voice according to an embodiment of the present invention, only the pitch sections are extracted for speaker verification, so the amount of computation required for feature extraction can be significantly reduced compared with the prior art.

In addition, according to an embodiment of the present invention, extracting only the pitch sections for speaker verification reduces the variance of the feature values, which can make speaker verification more accurate than before.

According to an embodiment of the present invention, a fast response time can be achieved in speaker verification.

FIG. 1 shows a process of using VAD in an existing user recognition system.
FIG. 2 shows a speaker identification process based on pitch information of a voice according to an embodiment of the present invention.
FIG. 3 is a flowchart showing a speaker identification method based on pitch information of a voice according to an embodiment of the present invention.
FIG. 4 is a block diagram showing a speaker identification apparatus based on pitch information of a voice according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can easily practice it. The present invention may, however, be implemented in various different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted in order to describe the present invention clearly, and similar reference numerals are assigned to similar parts throughout the specification.

The terms used in this specification are briefly explained first, and the present invention is then described in detail.

The terms used in the present invention were selected, as far as possible, from general terms in wide current use while taking into account their function in the present invention; these may vary depending on the intention of those working in the field, precedent, the emergence of new technologies, and the like. In certain cases there are also terms arbitrarily chosen by the applicant, in which case their meaning is described in detail in the corresponding part of the description. Therefore, the terms used in the present invention should be defined based on their meaning and the overall content of the present invention, not simply on their names.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "... unit" and "module" used in the specification denote a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software. Furthermore, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is connected "with another element in between."

Hereinafter, the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 shows a process of using VAD in an existing user recognition system, and FIG. 2 shows a speaker identification process based on pitch information of a voice according to an embodiment of the present invention. FIG. 3 is a flowchart showing a speaker identification method based on pitch information of a voice according to an embodiment of the present invention, and FIG. 4 is a block diagram showing a speaker identification apparatus based on pitch information of a voice according to an embodiment of the present invention.

In existing speaker verification systems, the point at which the voice activity detection (VAD) section is extracted falls into one of two cases. In the first, VAD is applied to the sound source itself; in the second, as in FIG. 1, VAD is executed after voice feature extraction. Executing VAD has the advantage that unvoiced and voiced sections are distinguished and noise is attenuated, so that voice features with somewhat lower variance can be extracted. However, in the conventional approach of FIG. 1, feature extraction on the voice takes a considerable amount of time and the VAD procedure must then be performed on its output, so the response speed for speech recognition is not fast. In other words, the speaker can be identified only after a considerable time has elapsed since the speaker uttered a word, sentence, or the like. Moreover, because the amount of data to be processed for speaker identification is considerable, the slow response described above is unavoidable.

In contrast, when using VAD to improve speaker verification performance, the present invention extracts only the pitch sections before the VAD procedure and uses only the extracted pitch sections as the input to the VAD, with the aim of effectively improving the response speed of the speaker verification system. Because the amount of data to be processed for speaker identification (e.g., the amount of computation required for feature extraction) can be reduced, the time required for speaker identification can be significantly shortened. Furthermore, extracting only the pitch sections for speaker verification markedly reduces the variance of the voice feature values, so the accuracy of speaker verification can be greatly improved compared with the prior art.

As shown in FIG. 2, according to an embodiment of the present invention, only the pitch sections may be extracted before the VAD procedure, and only the extracted pitch sections may be used as the input to the VAD. Information on the pitch sections may be extracted from voice data, such as a word uttered by user A (10), and feature information may be extracted after VAD is performed based only on the extracted pitch-related information. From the extracted feature information, a speech recognition model may be generated for later identification of the speaker (e.g., user A) by voice, and a speaker may be identified based on this generated speech recognition model. For example, as in FIG. 2, when a different user, user B (20), speaks, pitch-related information is extracted from the voice data of that utterance, VAD is performed based only on the extracted pitch-related information, and feature information of user B's (20) voice may then be extracted. This extracted feature information may be used in the speaker identification process through comparison with the previously built speech recognition model. In other words, the feature information contained in the speech recognition model and the feature information of user B's (20) voice are compared with each other, and depending on their similarity it may be determined whether the speaker is user A (10) or another user (e.g., user B (20)). A more detailed description is given below with reference to FIG. 3.

Referring to FIG. 3, a method for identifying a speaker based on pitch information of a voice according to an embodiment of the present invention may include: extracting a pitch section from voice data obtained through an utterance from a user (S100); performing voice activity detection (VAD) on the extracted pitch section (S200); extracting feature information of the voice from the performed VAD (S300); generating a speech recognition model for speaker identification based on the extracted feature information (S400); and identifying a speaker based on the generated speech recognition model (S500).

According to an embodiment of the present invention, before the VAD is performed (S200), only the pitch sections may be extracted from the voice data, and only the extracted pitch sections may be used as the input to the VAD.
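The ordering of steps S100 and S200 can be illustrated with a minimal Python sketch. The specification does not name a pitch detector or a VAD algorithm, so everything concrete here is an assumption made for illustration: the normalized-autocorrelation periodicity test, the energy-based VAD, the function names, and all threshold values are hypothetical.

```python
import math

def frame_signal(signal, frame_len):
    """Split a list of samples into non-overlapping frames of frame_len samples."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def is_pitched(frame, sample_rate, fmin=60.0, fmax=400.0, min_corr=0.5):
    """Assumed pitch test: strong normalized autocorrelation at a lag in fmin..fmax."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return False  # silent frame: no periodicity to measure
    lo = int(sample_rate / fmax)
    hi = min(int(sample_rate / fmin), len(frame) - 1)
    best = 0.0
    for lag in range(lo, hi + 1):
        corr = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        best = max(best, corr / energy)
    return best >= min_corr

def energy_vad(frames, min_energy=1e-3):
    """Assumed VAD: keep frames whose mean energy reaches min_energy."""
    return [f for f in frames if sum(x * x for x in f) / len(f) >= min_energy]

def pitch_then_vad(signal, sample_rate, frame_len):
    """S100 then S200: only pitched frames are handed to the VAD."""
    frames = frame_signal(signal, frame_len)
    pitched = [f for f in frames if is_pitched(f, sample_rate)]
    return energy_vad(pitched)

# Usage: silence, a 200 Hz tone (two frames), then silence, sampled at 8 kHz.
sr, n = 8000, 240
tone = [math.sin(2 * math.pi * 200 * t / sr) for t in range(2 * n)]
signal = [0.0] * n + tone + [0.0] * n
voiced_frames = pitch_then_vad(signal, sr, n)
```

In this sketch the VAD only ever sees the frames that passed the pitch test, which is the point of the ordering: later stages operate on less data than if VAD or feature extraction ran over the whole recording.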

In the step of extracting feature information of the voice from the performed VAD, a feature vector unique to the voice may be extracted. As the technique for extracting voice features, at least one of linear predictive coefficients (LPC), cepstrum, mel-frequency cepstral coefficients (MFCC), and filter bank energy (energy per frequency band) may be used.
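As a rough illustration of the last option, energy per frequency band, the sketch below computes log energies over equal-width bands of a frame's power spectrum. It is a simplification under stated assumptions: the bands are uniform rather than mel-spaced, a naive DFT is used to stay dependency-free, and the function names and the band count are hypothetical rather than taken from the specification.

```python
import cmath
import math

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0 .. len(frame)//2 - 1."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def band_energies(frame, num_bands=4):
    """Log energy in num_bands equal-width frequency bands (assumed uniform spacing)."""
    spectrum = power_spectrum(frame)
    width = len(spectrum) // num_bands
    return [math.log(sum(spectrum[b * width:(b + 1) * width]) + 1e-12)
            for b in range(num_bands)]

# Usage: a 1 kHz tone at 8 kHz sampling lands in the second of four bands.
frame = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(64)]
features = band_energies(frame)
```

A production system would more likely use an FFT and a mel filter bank; the feature vector per frame has the same shape either way, one value per band.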

In the speech recognition model according to an embodiment of the present invention, the feature information extracted from the sound may be organized per user or per uttered word and built in the form of a database (DB) (600), and this database may take the form of an M × N matrix (where M and N are natural numbers).
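One way to picture such an M × N database is sketched below. The specification does not say what M and N index, so this sketch assumes M is the number of enrolled users and N is the feature dimension, with each row holding the mean of a user's per-frame feature vectors. The class name `SpeakerModelDB` and its methods are hypothetical.

```python
class SpeakerModelDB:
    """Hypothetical store: one row per enrolled user, one column per feature dimension."""

    def __init__(self, num_features):
        self.num_features = num_features  # N
        self.user_ids = []                # row labels, length M
        self.matrix = []                  # M rows of N floats

    def enroll(self, user_id, feature_vectors):
        """Average a user's per-frame feature vectors into a single model row."""
        if not feature_vectors:
            raise ValueError("at least one feature vector is required")
        for vec in feature_vectors:
            if len(vec) != self.num_features:
                raise ValueError("feature vector has the wrong dimension")
        count = len(feature_vectors)
        row = [sum(vec[j] for vec in feature_vectors) / count
               for j in range(self.num_features)]
        if user_id in self.user_ids:
            self.matrix[self.user_ids.index(user_id)] = row  # re-enrollment overwrites
        else:
            self.user_ids.append(user_id)
            self.matrix.append(row)

    def row_for(self, user_id):
        """Return the stored model row for an enrolled user."""
        return self.matrix[self.user_ids.index(user_id)]

    def shape(self):
        """Return (M, N)."""
        return (len(self.matrix), self.num_features)

# Usage: two users, three-dimensional features.
db = SpeakerModelDB(num_features=3)
db.enroll("user_a", [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
db.enroll("user_b", [[0.0, 0.0, 1.0]])
```

Organizing the rows per uttered word instead of per user, which the text also allows, would only change what the row labels mean.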

In the step of identifying a speaker based on the generated speech recognition model (S500), a pitch section is extracted from voice data obtained from a user's utterance, VAD is performed on the extracted pitch section, and the feature information then extracted is compared with the feature information contained in the speech recognition model, so that the speaker can be identified according to a preset reference value (e.g., a threshold).

According to an embodiment of the present invention, the preset reference value is a threshold that is set in advance to different values depending on the required strength of speaker identification (e.g., accuracy), the purpose, and the like. If the result of comparing the feature information is greater than or equal to the threshold, the speaker may be identified as the previously registered (verified) user (e.g., user A (10)); if the result is less than the threshold, the speaker may be identified as someone other than that user (e.g., user B (20)).
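The decision rule can be sketched as follows. The specification does not define how the feature information is compared, so cosine similarity is used here purely as an assumed similarity measure; the function names and the threshold value of 0.8 are likewise hypothetical. Only the "greater than or equal to the threshold" rule comes from the text.

```python
import math

def cosine_similarity(a, b):
    """Assumed comparison measure between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def is_enrolled_user(test_features, enrolled_features, threshold=0.8):
    """Accept when the comparison result is >= threshold, otherwise reject."""
    return cosine_similarity(test_features, enrolled_features) >= threshold

# Usage: a matching vector is accepted, an orthogonal one is rejected.
enrolled = [1.0, 2.0, 3.0]
accepted = is_enrolled_user([1.0, 2.0, 3.0], enrolled)
rejected = is_enrolled_user([3.0, 0.0, -1.0], enrolled)
```

Raising the threshold makes the check stricter (fewer false accepts, more false rejects), which matches the text's remark that the threshold is set according to the required identification strength and purpose.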

A speaker identification apparatus (1000) based on pitch information of a voice according to an embodiment of the present invention may include: a pitch information extraction unit (100) that extracts a pitch section from voice data obtained through an utterance from a user; a VAD execution unit (200) that performs voice activity detection (VAD) on the extracted pitch section; a feature information extraction unit (300) that extracts feature information of the voice from the performed VAD; a speech recognition model generation unit (400) that generates a speech recognition model for speaker identification based on the extracted feature information; and a speaker identification unit (500) that identifies a speaker based on the generated speech recognition model. In addition, according to an embodiment of the present invention, per-user voice feature information, speech recognition models, and the like may be stored and managed in a database (600).

Before the VAD execution unit (200) performs VAD, only the pitch sections may be extracted from the voice data by the pitch information extraction unit (100), and only the extracted pitch sections may be provided as the input to the VAD.

The description of the method given above also applies to the apparatus according to an embodiment of the present invention. Accordingly, for the apparatus, description of content identical to that of the method is omitted.

Meanwhile, the above-described method can be written as a program executable on a computer and can be implemented on a general-purpose digital computer that runs the program using a computer-readable medium. In addition, the structure of the data used in the above-described method may be recorded on a computer-readable medium by various means. A recording medium that stores an executable computer program or code for performing the various methods of the present invention should not be understood to include transitory objects such as carrier waves or signals. The computer-readable medium may include storage media such as magnetic storage media (e.g., ROM, floppy disk, hard disk) and optically readable media (e.g., CD-ROM, DVD).

The above description of the present invention is for illustration, and those of ordinary skill in the art to which the present invention pertains will understand that it can easily be modified into other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not limiting. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present invention is indicated by the claims below rather than by the above detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be interpreted as falling within the scope of the present invention.

Claims

A speaker identification method based on pitch information of a voice, the method comprising:
extracting a pitch section from voice data obtained through an utterance from a user;
performing voice activity detection (VAD) on the extracted pitch section;
extracting feature information of the voice from the performed VAD;
generating a speech recognition model for speaker identification based on the extracted feature information; and
identifying a speaker based on the generated speech recognition model.

The method of claim 1,
wherein, before the VAD is performed, only the pitch section is extracted from the voice data, and only the extracted pitch section is used as an input to the VAD.

The method of claim 1 or 2,
wherein, in the step of extracting feature information of the voice from the performed VAD, a feature vector unique to the voice is extracted.

The method of claim 3,
wherein, in the step of identifying a speaker based on the generated speech recognition model, a pitch section is extracted from voice data obtained from a user's utterance, VAD is performed on the extracted pitch section, and the feature information then extracted is compared with the feature information contained in the speech recognition model to identify the speaker according to a preset reference value.

The method of claim 4,
wherein the preset reference value is a threshold set in advance,
the speaker is identified as the user if the result of comparing the feature information is greater than or equal to the threshold, and
the speaker is identified as someone other than the user if the result of the comparison is less than the threshold.

A speaker identification apparatus based on pitch information of a voice, the apparatus comprising:
a pitch information extraction unit that extracts a pitch section from voice data obtained through an utterance from a user;
a VAD execution unit that performs voice activity detection (VAD) on the extracted pitch section;
a feature information extraction unit that extracts feature information of the voice from the performed VAD;
a speech recognition model generation unit that generates a speech recognition model for speaker identification based on the extracted feature information; and
a speaker identification unit that identifies a speaker based on the generated speech recognition model.

The apparatus of claim 6,
wherein, before the VAD execution unit performs VAD, only the pitch section is extracted from the voice data by the pitch information extraction unit, and only the extracted pitch section is provided as an input to the VAD.

A computer-readable recording medium on which a program for implementing the method of claim 1 is recorded.