KR101804787B1 - Method and Apparatus for Speaker Recognition Using Voice Quality Feature

Info

Publication number: KR101804787B1
Application number: KR1020160124770A
Authority: KR
Inventors: 김경화; 김형순; 이가희
Original assignee: 대한민국 (Republic of Korea)
Priority date: 2016-09-28
Filing date: 2016-09-28
Publication date: 2017-12-06

Abstract

Embodiments of the present invention provide a speaker recognition apparatus and method that extract a voice quality feature parameter for each voice frame, calculate a statistical feature value from the voice quality feature parameters, and select valid voice frames from among the voice frames based on the statistical feature value. Speaker recognition is thereby performed on voice sections in which the influence of background noise is minimized, which improves the reliability of speaker recognition in applications such as verifying a user's identity by voice and comparing an unknown voice with a suspect's voice in a criminal investigation.

Description

Speaker Recognition Apparatus and Method Using Voice Quality Features {Method and Apparatus for Speaker Recognition Using Voice Quality Feature}

This embodiment relates to an apparatus and a method for recognizing a speaker.

The content of this section merely provides background information on the present embodiment and does not constitute prior art.

Speaker recognition is a technology for determining whose voice a voice signal is. It is broadly divided into speaker identification and speaker verification. Speaker identification selects, from among a plurality of speakers registered in advance, the speaker who corresponds to the input voice signal. Speaker verification determines whether the input voice signal matches a registered speaker.

Speaker recognition achieves a high level of accuracy in laboratory environments but relatively low accuracy in real environments, because the voice signal is distorted by background noise. Here, background noise means external sound that is mixed into the utterance, for example sounds produced by surrounding objects or the voices of other people.

To reduce the influence of background noise in speaker recognition, one approach distinguishes voice sections from non-voice sections (that is, background-noise sections) during voice detection and performs speaker recognition only on the voice sections. However, the voice signal is distorted by background noise even within the voice sections.

To address the problems caused by background noise in voice sections, methods have been proposed that attenuate the background noise in the voice signal or that extract signal features from the background noise to compensate the speaker model, but these methods are limited in how effectively they overcome background noise.

There is therefore a need to minimize the influence of background noise, and a speaker recognition apparatus and method that solve these problems have not yet been realized.

The inventors recognized that background noise distorts the voice signal not only in non-voice sections but also in voice sections, and therefore seek to select, from within the voice sections, only the voice frames that are valid for speaker recognition.

To this end, the present invention builds on the observation that voice quality feature parameters, which are used in speech medicine as diagnostic measures of voice disorders, can also serve as a measure of how useful noisy speech is for speaker recognition, and applies them to selecting the voice frames that are valid for speaker recognition.

The main object of the embodiments of the present invention is to recognize a speaker from voice sections in which the influence of background noise is minimized, by extracting a voice quality feature for each voice frame, calculating a statistical feature value from the voice quality feature parameters, and selecting valid voice frames from among the voice frames based on the statistical feature value.

Other objects of the present invention that are not stated explicitly may additionally be considered to the extent that they can be readily inferred from the following detailed description and its effects.

According to one aspect of the present embodiment, there is provided a voice section selection device including: a voice quality feature parameter generation unit that extracts a voice quality feature from each of a plurality of voice frames to generate a plurality of voice quality feature parameters; a statistical feature value calculation unit that calculates a statistical feature value from the plurality of voice quality feature parameters; and a valid voice frame selection unit that selects one or more valid voice frames from among the plurality of voice frames based on the statistical feature value.

According to another aspect of the present embodiment, there is provided a speaker recognition apparatus including: a voice acquisition unit that acquires a voice signal from a speaker and generates a plurality of voice frames from the voice signal; a voice section selection device that extracts a voice quality feature from each of the plurality of voice frames to generate a plurality of voice quality feature parameters, calculates a statistical feature value from the plurality of voice quality feature parameters, and selects one or more valid voice frames from among the plurality of voice frames based on the statistical feature value; and a speaker recognition unit that recognizes the speaker from the one or more valid voice frames.

According to another aspect of the present embodiment, there is provided a speaker recognition method by which a speaker recognition apparatus recognizes a speaker, the method including: acquiring a voice signal from a speaker and generating a plurality of voice frames from the voice signal; extracting a voice quality feature from each of the plurality of voice frames to generate a plurality of voice quality feature parameters; calculating a statistical feature value from the plurality of voice quality feature parameters; selecting one or more valid voice frames from among the plurality of voice frames based on the statistical feature value; and recognizing the speaker from the one or more valid voice frames.

As described above, according to the embodiments of the present invention, a voice quality feature parameter is extracted for each voice frame, a statistical feature value is calculated from the voice quality feature parameters, and valid voice frames are selected from among the voice frames based on the statistical feature value, so that the speaker can be recognized from voice sections in which the influence of background noise is minimized.

Even effects that are not explicitly mentioned here, namely the effects described in the following specification that are expected from the technical features of the present invention, and their provisional effects, are treated as if they were described in the specification of the present invention.

FIG. 1 is a block diagram illustrating a speaker recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating the operation of a speaker recognition apparatus according to an embodiment of the present invention.

In the following description of the present invention, detailed descriptions of related known functions are omitted where they are obvious to those skilled in the art and would unnecessarily obscure the gist of the invention. Some embodiments of the present invention are described in detail with reference to exemplary drawings.

FIG. 1 is a block diagram illustrating a speaker recognition apparatus according to an embodiment of the present invention. As shown in FIG. 1, the speaker recognition apparatus 10 includes a voice acquisition unit 100, a voice section selection device 200, and a speaker recognition unit 300. The speaker recognition apparatus 10 may omit some of the components illustrated by way of example in FIG. 1 or may additionally include other components.

The speaker recognition apparatus 10 acquires a voice signal from a speaker, selects the valid voice sections from the acquired voice signal, and recognizes the speaker from those valid voice sections.

The voice acquisition unit 100 acquires a voice signal from a speaker and generates a plurality of voice frames from the voice signal. It converts an analog voice signal input through a microphone into digital form, or takes a digital voice signal already stored in memory, and generates voice frames having a preset frame length. Here, "microphone" covers the recording means provided in electronic devices such as voice recorders, mobile phones, smartphones, and personal computers. The memory is a storage means and may be implemented as random access memory (RAM), a magnetic disk, flash memory, static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), or the like, but is not limited to these.

A voice frame is a segment obtained by dividing (tokenizing) the voice signal into pieces of fixed length. The frame length is set in advance and can be changed. It may be a value estimated from the voice signal throughput, a statistically derived value, or a value given by a technical standard.
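The framing step can be illustrated with a minimal sketch. The function name `split_into_frames` and the optional `hop_length` parameter (which allows overlapping frames) are assumptions introduced here for illustration; the text only states that frames have a preset, changeable length.

```python
import numpy as np

def split_into_frames(signal, frame_length, hop_length=None):
    """Cut a 1-D signal into fixed-length frames (one frame per row).

    hop_length is a hypothetical extra parameter; by default frames do not overlap.
    Trailing samples that do not fill a whole frame are dropped.
    """
    signal = np.asarray(signal, dtype=float)
    if hop_length is None:
        hop_length = frame_length
    if frame_length <= 0 or hop_length <= 0:
        raise ValueError("frame_length and hop_length must be positive")
    if len(signal) < frame_length:
        return np.empty((0, frame_length))
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    starts = np.arange(n_frames) * hop_length
    return np.stack([signal[s:s + frame_length] for s in starts])
```

For example, a 10-sample signal with `frame_length=4` and `hop_length=2` yields four frames, the last one covering samples 6 to 9.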

The voice section selection device 200 extracts a voice quality feature parameter from each of the plurality of voice frames to generate a plurality of voice quality feature parameters, calculates a statistical feature value from the plurality of voice quality feature parameters, and selects one or more valid voice frames from among the plurality of voice frames based on the statistical feature value.

The speaker recognition unit 300 determines the speaker from the one or more valid voice frames. It recognizes the speaker using a Gaussian mixture model-universal background model (GMM-UBM), a support vector machine (SVM), joint factor analysis (JFA), i-vectors, a deep neural network (DNN), or a combination of these. The speaker recognition unit 300 may choose among GMM-UBM, SVM, JFA, i-vector, DNN, or a combination of these based on the voice quality feature parameters and the statistical feature value. The technical details of GMM-UBM, SVM, JFA, i-vector, and DNN can be readily inferred by those skilled in the art to which this embodiment belongs.

Because the speaker recognition apparatus 10 extracts a voice quality feature parameter for each voice frame, calculates a statistical feature value from the voice quality feature parameters, and selects valid voice frames from among the voice frames based on the statistical feature value, it can recognize the speaker from voice sections in which the influence of background noise is minimized. This improves the reliability of speaker recognition in applications such as verifying a user's identity by voice or comparing an unknown voice with a suspect's voice in a criminal investigation.

Referring again to FIG. 1, the voice section selection device 200 includes all or some of a voice quality feature parameter generation unit 210, a statistical feature value calculation unit 220, and a valid voice frame selection unit 230.

The voice quality feature parameter generation unit 210 extracts a voice quality feature from each of the plurality of voice frames to generate a plurality of voice quality feature parameters.

Voice quality feature parameters include, but are not limited to, jitter (frequency perturbation), shimmer (amplitude perturbation), the noise-to-harmonic ratio (NHR), and cepstral peak prominence (CPP). Here, the cepstrum is obtained by Fourier-transforming a time-domain signal, taking the logarithm of the resulting spectrum, and applying an inverse Fourier transform. Cepstral peak prominence expresses, on a decibel (dB) scale, how much higher the cepstral peak is than a straight line that approximates the cepstrum by linear regression.

Among the voice quality feature parameters, jitter, shimmer, and the noise-to-harmonic ratio (NHR) rely on accurate estimation of the fundamental period or fundamental frequency of the voice signal. Their measured values are therefore strongly affected when the accuracy of fundamental frequency estimation drops for reasons such as problems in the phonation process or background noise. In contrast, cepstral peak prominence does not explicitly use the fundamental frequency of the voice. It is obtained simply as the distance between the cepstral peak and the linear regression line, and is therefore robust to fundamental frequency estimation errors. The inventors intend to exclude or minimize background noise by extracting a voice quality feature parameter for each voice frame using cepstral peak prominence.

The voice quality feature parameter generation unit 210 may extract a voice quality feature parameter for each voice frame using cepstral peak prominence. In other words, the voice quality feature parameter is extracted using the distance between the cepstral peak and the linear regression line, where the cepstrum is obtained by Fourier-transforming a time-domain signal, taking the logarithm of the resulting spectrum, and applying an inverse Fourier transform, and the linear regression line is a straight line that approximates the shape of the cepstrum curve by linear regression.
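The CPP computation described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the patent's implementation: the Hann window, the pitch search range (`f0_min`, `f0_max`), and fitting the regression line only over that search range are all choices made here for the example.

```python
import numpy as np

def cepstral_peak_prominence(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Return a CPP-style value (dB) for one voice frame.

    f0_min and f0_max bound the quefrency search range; they are
    illustrative defaults, not values taken from the patent.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Fourier transform -> log-magnitude spectrum (dB) -> inverse Fourier transform
    spectrum = np.fft.fft(frame * np.hanning(n))
    log_spectrum_db = 20.0 * np.log10(np.abs(spectrum) + 1e-12)
    cepstrum = np.real(np.fft.ifft(log_spectrum_db))

    # Quefrency (seconds) of each cepstral bin and the bins covering plausible pitch periods
    quefrency = np.arange(n) / sample_rate
    lo = int(sample_rate / f0_max)
    hi = min(int(np.ceil(sample_rate / f0_min)), n // 2 - 1)
    if hi - lo < 2:
        raise ValueError("frame too short for the requested pitch range")

    # Cepstral peak inside the search range
    peak = lo + int(np.argmax(cepstrum[lo:hi + 1]))
    # Straight line fitted to the cepstrum by least-squares linear regression
    slope, intercept = np.polyfit(quefrency[lo:hi + 1], cepstrum[lo:hi + 1], 1)
    # Height of the peak above the regression line at the same quefrency
    return float(cepstrum[peak] - (slope * quefrency[peak] + intercept))
```

A strongly periodic frame (for instance a pulse train) produces a pronounced cepstral peak and therefore a larger value than a frame of white noise, which is the property the frame selection relies on.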

N voice frames (N being a natural number) are formed from the voice signal, and N voice quality feature parameters are generated from these N voice frames. For example, if the voice quality feature parameter of the n-th frame (n being a natural number) is denoted P(n), then P(1), P(2), …, P(N) are generated.

The statistical feature value calculation unit 220 calculates a statistical feature value from the distribution characteristics of the plurality of voice quality feature parameters. That is, it calculates the statistical feature value from the distribution of the N voice quality feature parameters P(1), P(2), …, P(N) obtained from the N voice frames. The statistical feature value may be calculated from at least one of the maximum, the minimum, the median, the mean, and the variance of the plurality of voice quality feature parameters.

The statistical feature value calculation unit 220 may select two or more elements from among the maximum, the minimum, the median, the mean, and the variance of the plurality of voice quality feature parameters, and calculate the statistical feature value by applying a weight to at least one of the selected elements.
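A small sketch of this calculation is given below. The text does not say how the selected statistics are combined, so the weighted sum used here is an assumption, and the names `summarize` and `weighted_feature` are hypothetical.

```python
import statistics

def summarize(params):
    """Distribution statistics of the per-frame parameters P(1), ..., P(N)."""
    return {
        "max": max(params),
        "min": min(params),
        "median": statistics.median(params),
        "mean": statistics.fmean(params),
        "variance": statistics.pvariance(params),
    }

def weighted_feature(stats, weights):
    """Combine two or more selected statistics into one value.

    weights maps a statistic name to its weight. A weighted sum is an
    assumed combination rule, chosen only for illustration.
    """
    if len(weights) < 2:
        raise ValueError("select at least two statistics")
    return sum(w * stats[name] for name, w in weights.items())
```

With parameters `[1.0, 2.0, 3.0, 4.0, 10.0]`, selecting the maximum and minimum with equal weights of 0.5 gives 5.5.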

The valid voice frame selection unit 230 selects one or more valid voice frames from among the plurality of voice frames based on the statistical feature value. For example, the threshold parameter value may be expressed as in Equation 1.

Referring to Equation 1, TH is an exemplary threshold parameter value, Pmax is the maximum of P(n), Pmin is the minimum of P(n), and α is a weight. The weight may be a value estimated from the voice data throughput, a statistically derived value, or a value given by a technical standard. For example, the weight may be a constant between 0 and 1, but is not limited to this.
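The rendered form of Equation 1 does not appear in this text. Purely as an illustration, one form that is consistent with the symbols defined above places the threshold between the minimum and the maximum according to the weight. This interpolation is an assumption, not the patent's confirmed equation:

```latex
% Assumed illustrative form; the patent's actual Equation 1 is not reproduced in this text.
TH = P_{\min} + \alpha \,\bigl(P_{\max} - P_{\min}\bigr), \qquad 0 \le \alpha \le 1

% Worked example with P_{\min} = 2\ \text{dB},\; P_{\max} = 12\ \text{dB},\; \alpha = 0.3:
TH = 2 + 0.3 \times (12 - 2) = 5\ \text{dB}
```

Under this assumed form, α = 0 places the threshold at the minimum and α = 1 places it at the maximum.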

As an example that does not use the maximum and minimum values of Equation 1, the values P(n) (n = 1, 2, 3, …, N) may be modeled with a Gaussian mixture model (GMM) in the form of a weighted sum of two normal distributions, and the point at which the two weighted normal distributions intersect may be set as the threshold parameter value.
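The GMM-based alternative can be sketched with a small expectation-maximization (EM) fit. The initialization, the fixed iteration count, and the grid search for the crossing point are assumptions made for this example, and the function names are hypothetical. The sketch also assumes the parameter values are not all identical.

```python
import numpy as np

def fit_two_gaussians(values, n_iter=200):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    # Crude initialisation: one component per half of the data
    mean = np.array([x[x <= median].mean(), x[x > median].mean()])
    var = np.full(2, x.var() + 1e-6)
    weight = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value
        dens = weight * np.exp(-0.5 * (x[:, None] - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: re-estimate weights, means, and variances
        n_k = resp.sum(axis=0) + 1e-12
        weight = n_k / n_k.sum()
        mean = (resp * x[:, None]).sum(axis=0) / n_k
        var = (resp * (x[:, None] - mean) ** 2).sum(axis=0) / n_k + 1e-6
    return weight, mean, var

def intersection_threshold(weight, mean, var, n_grid=10001):
    """Approximate the point between the two means where the weighted densities meet."""
    grid = np.linspace(mean.min(), mean.max(), n_grid)
    log_dens = (np.log(weight) - 0.5 * np.log(2 * np.pi * var)
                - 0.5 * (grid[:, None] - mean) ** 2 / var)
    gap = np.abs(log_dens[:, 0] - log_dens[:, 1])
    return float(grid[np.argmin(gap)])
```

If the per-frame values form two well-separated clusters (noisy frames and clean frames), the returned threshold falls between the two cluster centers. When the two weighted densities do not cross between the means, the grid search returns the closest approach instead.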

Beyond the selection methods illustrated above, the threshold parameter value may be set in various ways in consideration of the speaker's utterance pattern, the characteristics of the voice data, and so on. It is not limited to these examples.

The valid voice frame selection unit 230 calculates a threshold parameter value from the statistical feature value and selects the voice frames whose voice quality feature parameter is greater than the threshold parameter value.
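The selection rule itself is a simple comparison, sketched below. The function name `select_valid_frames` is hypothetical, and the threshold is passed in so that any of the threshold-setting methods can be used with it.

```python
def select_valid_frames(params, threshold):
    """Return the indices of frames whose parameter P(n) is strictly greater than the threshold."""
    return [i for i, p in enumerate(params) if p > threshold]
```

For example, with per-frame values `[3.0, 9.5, 7.2, 4.1, 8.8]` and a threshold of 7.0, frames 1, 2, and 4 are kept.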

Because the voice section selection device 200 extracts a voice quality feature parameter for each voice frame, calculates a statistical feature value from the voice quality feature parameters, and selects valid voice frames from among the voice frames based on the statistical feature value, the speaker can be recognized from voice sections in which the influence of background noise is minimized.

FIG. 2 is a flowchart illustrating the operation of a speaker recognition apparatus according to an embodiment of the present invention.

In step S210, the speaker recognition apparatus 10 acquires a voice signal from a speaker and generates a plurality of voice frames from the voice signal. The speaker recognition apparatus 10 may convert an analog voice signal input through a microphone into digital form, or take a digital voice signal already stored in memory, and generate voice frames having a preset frame length.

In step S220, the speaker recognition apparatus 10 extracts a voice quality feature from each of the plurality of voice frames to generate a plurality of voice quality feature parameters. As one example of a voice quality feature parameter, the speaker recognition apparatus 10 may generate a cepstral peak prominence (CPP) parameter. CPP is extracted using the distance between the cepstral peak and the linear regression line, where the cepstrum is obtained by Fourier-transforming a time-domain signal, taking the logarithm of the resulting spectrum, and applying an inverse Fourier transform, and the linear regression line is a straight line that approximates the shape of the cepstrum curve by linear regression.

In step S230, the speaker recognition apparatus 10 calculates a statistical feature value from the plurality of voice quality feature parameters. The speaker recognition apparatus 10 may select two or more elements from among the maximum, the minimum, the median, the mean, and the variance of the plurality of voice quality feature parameters, and calculate the statistical feature value by applying a weight to at least one of the selected elements.

In step S240, the speaker recognition apparatus 10 selects one or more valid voice frames from among the plurality of voice frames based on the statistical feature value. The speaker recognition apparatus 10 may calculate a threshold parameter value from the statistical feature value and select the voice frames whose voice quality feature parameter is greater than the threshold parameter value.

In step S250, the speaker recognition apparatus 10 recognizes the speaker from the one or more valid voice frames. The speaker recognition apparatus 10 may recognize the speaker by selectively using GMM-UBM, SVM, JFA, i-vector, DNN, or a combination of these, based on the voice quality feature parameters and the statistical feature value.
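Steps S210 through S250 can be strung together in a compact sketch. Everything here is illustrative: the simplified CPP computation, the threshold rule interpolating between the minimum and maximum with weight `alpha` (an assumed form, since Equation 1 does not appear in this text), and the `recognize` callback, which stands in for whichever recognizer (GMM-UBM, SVM, JFA, i-vector, DNN) is used. The signal is assumed to contain at least one full frame.

```python
import numpy as np

def frame_signal(signal, frame_length):
    """S210: cut the signal into non-overlapping frames, dropping the remainder."""
    n_frames = len(signal) // frame_length
    return np.reshape(signal[:n_frames * frame_length], (n_frames, frame_length))

def cpp(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """S220: simplified cepstral peak prominence of one frame (assumed variant)."""
    n = len(frame)
    log_spec = 20.0 * np.log10(np.abs(np.fft.fft(frame * np.hanning(n))) + 1e-12)
    cep = np.real(np.fft.ifft(log_spec))
    lo = int(sample_rate / f0_max)
    hi = min(int(np.ceil(sample_rate / f0_min)), n // 2 - 1)
    q = np.arange(n) / sample_rate
    peak = lo + int(np.argmax(cep[lo:hi + 1]))
    slope, intercept = np.polyfit(q[lo:hi + 1], cep[lo:hi + 1], 1)
    return cep[peak] - (slope * q[peak] + intercept)

def run_pipeline(signal, sample_rate, recognize, frame_length=1024, alpha=0.5):
    """Run S210 to S250 and return (indices of valid frames, recognizer output)."""
    frames = frame_signal(np.asarray(signal, dtype=float), frame_length)  # S210
    params = np.array([cpp(f, sample_rate) for f in frames])             # S220
    p_min, p_max = params.min(), params.max()                            # S230
    threshold = p_min + alpha * (p_max - p_min)  # assumed threshold rule
    valid = np.flatnonzero(params > threshold)                           # S240
    return valid, recognize(frames[valid])                               # S250
```

Passing the recognizer as a callback keeps the frame selection independent of the recognition back end, which matches the way the text separates the voice section selection device from the speaker recognition unit.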

Through these steps, a voice quality feature parameter is extracted for each voice frame, a statistical feature value is calculated from the voice quality feature parameters, and valid voice frames are selected from among the voice frames based on the statistical feature value. The speaker can thus be recognized from voice sections in which the influence of background noise is minimized, which improves the reliability of speaker recognition in applications such as verifying a user's identity by voice or comparing an unknown voice with a suspect's voice in a criminal investigation.

Although FIG. 2 describes the steps as being executed sequentially, this is only an example. Those skilled in the art may apply various modifications without departing from the essential characteristics of the embodiments of the present invention, such as changing the order described in FIG. 2, executing one or more steps in parallel, or adding other steps.

The apparatus according to the present embodiments may refer to various apparatuses that include all or some of a memory for storing data used to execute a program, a microprocessor for executing the program to perform computation and issue commands, and the like. The apparatus may be implemented in a logic circuit by hardware, firmware, software, or a combination of these, and may also be implemented using a general-purpose or special-purpose computer. The apparatus may be implemented using a hardwired device, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), or the like. The apparatus may also be implemented as a system on chip (SoC) including one or more processors and controllers.

The operation of the apparatus according to the present embodiments may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. A computer-readable medium is any medium that participates in providing instructions to a processor for execution. The computer-readable medium may include program instructions, data files, data structures, or a combination of these. Examples include magnetic media, optical recording media, and memory. The computer program may also be distributed over networked computer systems so that the computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the present embodiment can be readily inferred by programmers in the technical field to which the embodiment belongs.

The present embodiments are intended to explain the technical idea of the embodiment, and the scope of that technical idea is not limited by them. The scope of protection of the present embodiment should be interpreted according to the claims below, and all technical ideas within a scope equivalent to the claims should be interpreted as falling within the scope of rights of the present embodiment.

10: Speaker recognition apparatus 100: Voice acquisition unit
200: Voice section selection apparatus 210: Voice quality feature parameter generation unit
220: Statistical feature value calculation unit 230: Valid voice frame selection unit
300: Speaker recognition unit

Claims

A voice section selection apparatus comprising:
a voice quality feature parameter generation unit that extracts a voice quality feature from each of a plurality of voice frames to generate a plurality of voice quality feature parameters;
a statistical feature value calculation unit that calculates a statistical feature value from the plurality of voice quality feature parameters; and
a valid voice frame selection unit that selects one or more valid voice frames from among the plurality of voice frames based on the statistical feature value,
wherein the voice quality feature parameter is extracted using the distance between a cepstrum peak and a linear regression line, the cepstrum is obtained by taking the logarithm of a spectrum obtained by Fourier transforming a time-domain signal and then applying an inverse Fourier transform to the result, and the linear regression line is a straight line that approximates the shape of the cepstrum curve by linear regression.
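The per-frame feature described in this claim (the distance between the cepstrum peak and a regression line fitted to the cepstrum) could be sketched roughly as follows. This is only an illustration, not the patented implementation: the function name, the Hamming window, the small constant added before the logarithm, and the pitch search range `f0_min`/`f0_max` are all assumptions that do not appear in the claim.

```python
import numpy as np

def cepstral_peak_prominence(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Distance between the cepstrum peak and a regression line (illustrative)."""
    frame = np.asarray(frame, dtype=float)
    # Assumption: a Hamming window is applied before the Fourier transform.
    windowed = frame * np.hamming(len(frame))
    # Fourier transform -> log magnitude spectrum -> inverse Fourier transform.
    log_spectrum = np.log(np.abs(np.fft.rfft(windowed)) + 1e-12)
    cepstrum = np.fft.irfft(log_spectrum)
    quefrency = np.arange(len(cepstrum)) / sample_rate  # in seconds
    # Assumption: search only quefrencies that correspond to plausible pitch.
    lo = int(sample_rate / f0_max)
    hi = min(int(sample_rate / f0_min), len(cepstrum) // 2)
    region = cepstrum[lo:hi]
    q = quefrency[lo:hi]
    # Straight line approximating the cepstrum curve by linear regression.
    slope, intercept = np.polyfit(q, region, 1)
    peak = int(np.argmax(region))
    return float(region[peak] - (slope * q[peak] + intercept))
```

A strongly periodic frame tends to give a larger value than a noise-like frame, which is what makes the quantity usable for selecting voiced frames.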

(Deleted)

A voice section selection apparatus comprising:
a voice quality feature parameter generation unit that extracts a voice quality feature from each of a plurality of voice frames to generate a plurality of voice quality feature parameters;
a statistical feature value calculation unit that calculates a statistical feature value from the plurality of voice quality feature parameters; and
a valid voice frame selection unit that selects one or more valid voice frames from among the plurality of voice frames based on the statistical feature value,
wherein the statistical feature value calculation unit selects two or more elements from among the maximum value, the minimum value, the median value, the mean value, and the variance of the plurality of voice quality feature parameters, and calculates the statistical feature value by applying a weight to at least one of the selected two or more elements.
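One way to read the weighted statistic in this claim is as a weighted combination of two or more summary statistics of the per-frame parameters. The sketch below assumes a simple weighted sum; the claim does not specify how the selected elements are combined, so the function name, the dictionary of weights, and the combination rule are illustrative assumptions.

```python
import numpy as np

def statistical_feature(params, weights):
    """Weighted sum of selected summary statistics (illustrative combination rule)."""
    params = np.asarray(params, dtype=float)
    stats = {
        "max": params.max(),
        "min": params.min(),
        "median": float(np.median(params)),
        "mean": params.mean(),
        "var": params.var(),  # population variance; an assumption
    }
    # The claim requires at least two of the five elements to be selected.
    if len(weights) < 2:
        raise ValueError("select at least two statistics")
    return float(sum(w * stats[name] for name, w in weights.items()))
```

For example, `statistical_feature(values, {"max": 0.5, "min": 0.5})` gives the midpoint between the largest and smallest parameter values.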

The voice section selection apparatus according to claim 1,
wherein the valid voice frame selection unit calculates a threshold parameter value from the statistical feature value and selects voice frames whose voice quality feature parameter is larger than the threshold parameter value.
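The selection step amounts to keeping the frames whose parameter exceeds the threshold. A minimal sketch is given below, assuming the threshold has already been derived from the statistical feature value; the function name and the array layout (one frame per row) are assumptions.

```python
import numpy as np

def select_valid_frames(frames, params, threshold):
    """Keep frames whose voice quality feature parameter exceeds the threshold."""
    frames = np.asarray(frames)
    params = np.asarray(params, dtype=float)
    keep = params > threshold  # strictly larger, as in the claim
    return frames[keep], keep
```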

A voice section selection apparatus comprising:
a voice quality feature parameter generation unit that extracts a voice quality feature from each of a plurality of voice frames to generate a plurality of voice quality feature parameters;
a statistical feature value calculation unit that calculates a statistical feature value from the plurality of voice quality feature parameters; and
a valid voice frame selection unit that selects one or more valid voice frames from among the plurality of voice frames based on the statistical feature value,
wherein the valid voice frame selection unit obtains a weighted sum for each of two normal distributions of the plurality of voice quality feature parameters, models the two weighted-sum normal distributions, and sets the position where the two weighted-sum normal distributions meet as the threshold parameter value.
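If the two weighted normal distributions are taken to be the two components of a mixture with known weights, means, and standard deviations, the point where they meet can be found by equating the two weighted densities and solving the resulting quadratic. The sketch below assumes the component parameters have already been estimated (for example by fitting a two-component mixture, which is not shown). The function names and the choice to return a crossing point lying between the two means are assumptions.

```python
import math

def weighted_normal_pdf(x, weight, mean, std):
    """Weighted normal density w * N(x; mean, std^2)."""
    return weight * math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def intersection_threshold(w1, mu1, s1, w2, mu2, s2):
    """Point between the means where the two weighted densities are equal."""
    # Equating the log densities gives a*x^2 + b*x + c = 0.
    a = 1.0 / (2.0 * s2 ** 2) - 1.0 / (2.0 * s1 ** 2)
    b = mu1 / s1 ** 2 - mu2 / s2 ** 2
    c = (mu2 ** 2 / (2.0 * s2 ** 2) - mu1 ** 2 / (2.0 * s1 ** 2)
         + math.log((w1 * s2) / (w2 * s1)))
    if abs(a) < 1e-12:  # equal variances: the equation is linear
        if abs(b) < 1e-12:
            raise ValueError("the two components do not cross at a single point")
        return -c / b
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        raise ValueError("the two components do not intersect")
    roots = [(-b + sign * math.sqrt(disc)) / (2.0 * a) for sign in (1.0, -1.0)]
    lo, hi = min(mu1, mu2), max(mu1, mu2)
    between = [r for r in roots if lo <= r <= hi]
    if not between:
        raise ValueError("no intersection between the two means")
    return between[0]
```

With equal weights and equal variances the result is simply the midpoint of the two means.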

A speaker recognition apparatus comprising:
a voice acquisition unit that acquires a voice signal from a speaker and generates a plurality of voice frames from the voice signal;
a voice section selection apparatus that extracts a voice quality feature from each of the plurality of voice frames to generate a plurality of voice quality feature parameters, calculates a statistical feature value from the plurality of voice quality feature parameters, and selects one or more valid voice frames from among the plurality of voice frames based on the statistical feature value; and
a speaker recognition unit that recognizes the speaker with respect to the one or more valid voice frames,
wherein the voice section selection apparatus selects two or more elements from among the maximum value, the minimum value, the median value, the mean value, and the variance of the plurality of voice quality feature parameters, and calculates the statistical feature value by applying a weight to at least one of the selected two or more elements.

The speaker recognition apparatus according to claim 7,
wherein the voice acquisition unit converts an analog voice signal input through a microphone into a digital signal, or generates voice frames having a preset frame length from a digital voice signal stored in advance in a memory.
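Generating frames of a preset length from a digital voice signal could look like the sketch below. The claim specifies only a preset frame length; the hop length (and therefore any overlap between frames) and the decision to drop a trailing partial frame are assumptions, as is the function name.

```python
import numpy as np

def make_frames(signal, frame_length, hop_length):
    """Split a 1-D signal into fixed-length frames, one frame per row."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_length:
        return np.empty((0, frame_length))
    # Assumption: a trailing partial frame is discarded.
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    idx = (np.arange(frame_length)[None, :]
           + hop_length * np.arange(n_frames)[:, None])
    return signal[idx]
```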

The speaker recognition apparatus according to claim 7,
wherein the speaker recognition unit recognizes the speaker, based on the voice quality feature parameter and the statistical feature value, by selectively using a Gaussian Mixture Model-Universal Background Model (GMM-UBM), a Support Vector Machine (SVM), Joint Factor Analysis (JFA), i-Vector, DNN, or a combination thereof.

A speaker recognition method by which a speaker recognition apparatus recognizes a speaker, the method comprising:
acquiring a voice signal from a speaker and generating a plurality of voice frames from the voice signal;
extracting a voice quality feature from each of the plurality of voice frames to generate a plurality of voice quality feature parameters;
calculating a statistical feature value from the plurality of voice quality feature parameters;
selecting one or more valid voice frames from among the plurality of voice frames based on the statistical feature value; and
recognizing the speaker with respect to the one or more valid voice frames,
wherein the calculating of the statistical feature value comprises selecting two or more elements from among the maximum value, the minimum value, the median value, the mean value, and the variance of the plurality of voice quality feature parameters, and calculating the statistical feature value by applying a weight to at least one of the selected two or more elements.