KR20100086729A

KR20100086729A - Method for producing feature vectors used in the discrimination of audio information, and method and apparatus for classifying adult movies by using the feature vectors

Info

Publication number: KR20100086729A
Application number: KR1020090006112A
Authority: KR
Inventors: 이용주; 김봉완
Original assignee: 원광대학교산학협력단
Priority date: 2009-01-23
Filing date: 2009-01-23
Publication date: 2010-08-02
Also published as: KR101040906B1

Abstract

PURPOSE: To provide an apparatus that discriminates obscene multimedia content at high speed, in which a feature vector is generated quickly from only part of the audio signal, using MCME alone or together with MFCC (Mel-Frequency Cepstral Coefficients). CONSTITUTION: A feature vector extractor (100) extracts a feature vector from part of the audio signal of the multimedia content. An obscenity discrimination unit (200) determines whether the multimedia content is obscene using the feature vector and an acoustic model (300). The acoustic model includes an obscene audio signal model and a normal audio signal model. The obscenity discrimination unit makes its decision based on probabilistic similarity to these models.

Description

Method for producing feature vectors used in the discrimination of audio information, and method and apparatus for classifying adult movies by using the feature vectors

The present invention relates to a method for generating feature vectors used to discriminate the audio signal contained in multimedia content such as speech, music, and obscene video. More specifically, it relates to a feature vector generation method for audio signal discrimination of multimedia content that uses Mel-frequency Cepstrum Modulation Energy (hereinafter "MCME") alone, or MCME together with Mel-Frequency Cepstral Coefficients (hereinafter "MFCC"), and that generates the feature vectors at high speed by using only some segments of the audio signal.

The present invention also relates to a method and apparatus for discriminating obscene multimedia content, which extract feature vectors from the audio signal contained in multimedia content and determine whether the content is obscene. The method and apparatus use feature vectors built from MCME alone, or from MCME together with MFCC, and generated at high speed from only some segments of the audio signal.

Recently, as large volumes of multimedia content have been published and distributed over the Internet, cases in which obscene content is exposed to unprotected adolescents have been increasing.

Conventionally, to prevent such cases, obscene content has been discriminated either by using keywords selected from file names and the like, or by analyzing the video images of the content.

However, the keyword-based method has difficulty coping with evasion techniques such as renaming files. The image-analysis method involves many analysis variables, such as background color, lighting, color distribution, white balance, and skin color, and the level of each variable varies widely, so accurate discrimination is difficult and analysis takes a long time.

Accordingly, in the paper "Discrimination of Obscene Video Based on Audio Signals" (대한음성학회지 : 말소리 (Journal of the Korean Society of Phonetic Sciences: Malsori), no. 63, pp. 139-151, September 2007), the inventors of the present invention proposed a method of determining obscenity by analyzing the audio signal contained in multimedia content.

The method of that paper is based on the observation that the audio signals in speech-oriented content, music-oriented content, and obscene-video content have different characteristics. Specifically, in speech, the continuous articulation of consonants and vowels makes the spectral envelope of the signal change faster than in other signals; in music, even fast music such as rock and roll does not change its spectral envelope as quickly as speech; and in obscene video, similar acoustic features repeat very distinctly at a regular period.

That is, to discriminate the audio signal contained in multimedia content, the signal must be analyzed in terms of not only acoustic features such as the sexual vocalizations and moaning that frequently occur in obscene video, but also how those features change per unit time.

Accordingly, in addition to MFCC, a feature vector that reflects the acoustic characteristics present at a given instant, the paper newly proposed MCME as a feature vector that reflects change per unit time, and proposed a method of discriminating the audio signal of multimedia content with a feature vector consisting of MCME or of MCME + MFCC.

Furthermore, the paper reported experiments in which feature vectors such as MFCC, MCME, and MCME + MFCC were extracted from multimedia content, and found that, as expected, the MCME and MCME + MFCC feature vectors discriminated the content very effectively.

However, the method of the paper analyzes the entire audio signal of the multimedia content under examination. It offers no way to retain the same discrimination performance with less computation and faster analysis, so an improved method is needed.

Accordingly, an object of the present invention is to generate feature vectors at high speed, using MCME alone or MCME together with MFCC as the feature vector for discriminating the audio signal contained in multimedia content such as speech, music, and obscene video, while using only some segments of the audio signal.

Another object of the present invention is to provide a method and apparatus for discriminating obscene multimedia content that extract feature vectors from the audio signal contained in multimedia content and determine whether the content is obscene, using feature vectors built from MCME alone or from MCME together with MFCC and generated at high speed from only some segments of the audio signal.

To achieve the above technical objects, the feature vector generation method for audio signal discrimination of multimedia content according to the present invention generates Mel-frequency Cepstrum Modulation Energy (MCME) using the following equation, for use as a feature vector that discriminates the characteristics of the audio signal of multimedia content, and is characterized in that the MCME is generated using only some segments of the audio signal.

Here, n is the frame index, q is the modulation frequency index, C[n,l] is the MFCC component at index l of the n-th frame, L is the order of the MFCC feature vector, P is the size of the Fourier transform applied to the MFCC feature vectors, and E(n) is the sum of the squares of the audio signal values contained in the n-th frame.

The feature vector generation method of the present invention for discriminating the audio signal contained in multimedia content generates feature vectors using MCME alone or MCME together with MFCC, and, because it uses only some segments of the audio signal, can generate the feature vectors at high speed.

The method and apparatus of the present invention for discriminating obscene multimedia content use feature vectors generated from MCME alone or from MCME together with MFCC, and, because those feature vectors are generated at high speed from only some segments of the audio signal, can determine obscenity at high speed.

Before the description with reference to the drawings, note that matters not needed to convey the gist of the present invention, that is, well-known configurations that a person of ordinary skill in the art could obviously add, are not illustrated or described in detail.

First, MFCC is a feature vector that reflects the characteristics of human hearing and represents the acoustic characteristics of an audio signal at a given instant.

MFCC is obtained by applying a Fourier transform to the audio signal to obtain its spectrum, applying a bank of triangular filters spaced on the mel scale to that spectrum and summing the magnitudes in each band, taking the logarithm of the filter bank outputs, and then applying a discrete cosine transform.
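The MFCC pipeline just described (spectrum, mel-spaced triangular filter bank, logarithm, discrete cosine transform) can be sketched as follows. This is a minimal illustration rather than the inventors' implementation: the function names, the mel conversion constants (a common textbook choice), the 20 filters, and the 12 output coefficients are all assumptions, windowing is omitted, and a naive DFT stands in for an FFT so that the example needs only the standard library.

```python
import cmath
import math


def hz_to_mel(hz):
    # Common textbook mel mapping (an assumption; the source gives no formula).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (a real system would use an FFT)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    # Convert edge frequencies to (fractional) spectrum bin positions.
    edges = [hz * (num_bins - 1) / (sample_rate / 2.0) for hz in edges_hz]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(num_bins):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate, num_filters=20, num_ceps=12):
    spectrum = magnitude_spectrum(frame)
    bank = mel_filterbank(num_filters, len(spectrum), sample_rate)
    # Sum of magnitudes in each band, then log (a small floor avoids log(0)).
    log_bands = [math.log(max(sum(w * s for w, s in zip(row, spectrum)), 1e-10))
                 for row in bank]
    # DCT-II of the log filter bank outputs; keep coefficients 1..num_ceps.
    return [sum(lb * math.cos(math.pi * c * (m + 0.5) / num_filters)
                for m, lb in enumerate(log_bands))
            for c in range(1, num_ceps + 1)]
```

Calling `mfcc(frame, 8000)` on a 200-sample frame returns a list of 12 coefficients under these assumed defaults.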

Because MFCC is commonly used in the field of speech recognition, a more detailed description is omitted.

MCME was newly proposed by the inventors of the present invention in the paper cited above, "Discrimination of Obscene Video Based on Audio Signals" (대한음성학회지 : 말소리, no. 63, pp. 139-151, September 2007). It is an energy value obtained by performing a Fourier transform in the MFCC domain.

That is, whereas MFCC reflects the acoustic characteristics of the audio signal at a particular instant, MCME expresses how those acoustic characteristics change over a wider time span.

MCME is defined by [Equation 1] below.

Here, n is the frame index.

q is the modulation frequency index: a low q corresponds to little change over time, and a high q corresponds to rapid change over time.

C[n,l] is the MFCC component at index l of the n-th frame, L is the order of the MFCC feature vector, and P is the size of the Fourier transform applied to the MFCC feature vectors.

E(n) is the energy of the n-th frame, obtained by squaring each audio signal value contained in the n-th frame and summing the results.
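The symbol definitions above can be made concrete with a short sketch. `frame_energy` follows the stated definition of E(n) directly. `mcme_sketch`, by contrast, is only an assumed form and not [Equation 1] itself, which is not reproduced in this text: it takes a P-point DFT of each MFCC coefficient along the frame axis, sums the squared magnitudes over the L coefficients, and divides by the summed log energy of the P frames. The normalisation and the range of q returned are both assumptions made for illustration.

```python
import cmath
import math


def frame_energy(frame):
    """E(n): sum of the squared sample values of one frame."""
    return sum(x * x for x in frame)


def mcme_sketch(mfcc_frames, log_energies, P):
    """Assumed form for illustration, NOT the patent's Equation 1.

    mfcc_frames:  P lists of L MFCC values, i.e. C[n+p][l] for p = 0..P-1.
    log_energies: P log frame energies used as the normaliser (assumption).
    Returns one value per modulation index q = 0..P//2.
    """
    L = len(mfcc_frames[0])
    norm = sum(log_energies)
    result = []
    for q in range(P // 2 + 1):
        power = 0.0
        for l in range(L):
            # DFT along the frame (time) axis of the l-th MFCC trajectory.
            X = sum(mfcc_frames[p][l] * cmath.exp(-2j * math.pi * p * q / P)
                    for p in range(P))
            power += abs(X) ** 2
        result.append(power / norm)
    return result
```

With this form, an MFCC trajectory that oscillates slowly across frames concentrates its energy at a low q, and one that oscillates quickly concentrates it at a high q, which matches the interpretation of q given above.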

FIG. 1 shows a feature vector generation method according to an embodiment of the present invention.

First, in step S110, an audio signal long enough to obtain one MCME is acquired from the multimedia content.

This rests on the following observation. In obscene multimedia content, similar acoustic features repeat distinctly at a regular period, and the playback length is fairly long, from at least a few minutes to several tens of minutes. There is therefore no need to extract feature vectors from the audio signal of the entire content: extracting feature vectors at regular intervals that are sufficient for the obscene acoustic features to show, and judging obscenity on that basis, improves analysis speed without degrading performance.

In one embodiment, the length (Length) needed to obtain one MCME is calculated by [Equation 2] below.

Length = P * A + (W - A)

Here, P is the DFT (Discrete Fourier Transform) size used for the Fourier transform that derives the MCME from the MFCC feature vectors, A is the advance (hop) of the frame window used to compute MFCC, and W is the size of that frame window.

As an example of computing Length: if the Hamming window used to compute MFCC has size W = 25 msec, the window advances by A = 10 msec per frame, and a P = 32-point FFT (Fast Fourier Transform) is performed to obtain the MCME, then the length needed for one MCME is 335 msec (= 32 * 10 msec + (25 msec - 10 msec)).
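The arithmetic of this example can be checked with a one-line helper. The function name is hypothetical; the expression is the one shown in the parenthesised calculation, P * A + (W - A), which equals the span covered by P overlapping frames, (P - 1) * A + W.

```python
def mcme_segment_length_ms(P, A_ms, W_ms):
    """Audio length needed for one MCME: P frames of size W_ms advanced by A_ms."""
    return P * A_ms + (W_ms - A_ms)


print(mcme_segment_length_ms(32, 10, 25))  # 335
```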

In step S120, sampling frequency normalization is performed only on the audio segment acquired in step S110.

The sampling frequency of the audio signal extracted from multimedia content can vary, for example 44.1 kHz, 48 kHz, 22 kHz, 11 kHz, 32 kHz, or 16 kHz. If feature vectors are extracted directly from audio signals with such varied sampling frequencies, discrimination performance may degrade when the sampling frequency differs from that of the audio used to train the normal audio signal model and the obscene audio signal model (sampling frequency mismatch).

Therefore, in step S120, if the sampling frequency of the audio signal in the multimedia content under examination is higher than the sampling frequency used to train the normal and obscene audio signal models, the signal is down-sampled; if it is lower, the signal is up-sampled. Either way, the signal is matched to the sampling frequency used in training.
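A minimal sketch of this normalization step is shown below. The source does not say which resampling algorithm is used, so plain linear interpolation is assumed here purely for illustration. It omits the low-pass filtering that a practical down-sampler needs to avoid aliasing, and the function name is hypothetical.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation (illustrative stand-in, no anti-alias filter).

    Down-samples when src_rate > dst_rate and up-samples when src_rate < dst_rate,
    so the output always has the rate the models were trained on (dst_rate).
    """
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # position in the source signal
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]
        out.append(samples[j] * (1.0 - frac) + nxt * frac)
    return out
```

For example, a 480-sample segment at 48 kHz becomes 160 samples at 16 kHz, so only the short segment selected in S110 has to be resampled rather than the whole track.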

In step S130, the MFCC and the log energy needed to obtain one MCME are extracted. As [Equation 1] shows, obtaining the MCME requires the values corresponding to its numerator and denominator, namely the MFCC and the log energy.

The MFCC is also needed later, when it is combined with the MCME to form a new feature vector.

The MFCC is extracted by a method that is conventional in this technical field, so a detailed description is omitted.

The log energy serves to bring out how the acoustic characteristics change, and appears in the denominator of [Equation 1].

Preferably, the extracted MFCC and log energy values are stored in separate memory for later use in the MCME calculation or the MFCC + MCME calculation.

In step S140, one MCME feature vector is extracted using the MFCC and log energy obtained in step S130.

Preferably, the extracted MCME feature vector is stored in separate memory for later use in the MFCC + MCME calculation.

In step S150, it is checked whether the current position is the end of the entire audio signal contained in the multimedia content.

In step S160, if step S150 determines that the end of the entire audio signal has not been reached, the audio signal is skipped by a predetermined interval, measured from the end of the audio segment acquired in step S110, and the process returns to step S110.
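The acquire, check, and skip loop formed by steps S110, S150, and S160 can be sketched as a function that lists where each analysed segment starts. The function name and the millisecond units are assumptions, and the values in the usage example (a 335 msec segment followed by a 1265 msec skip) are hypothetical, chosen only so that a segment starts every 1.6 seconds.

```python
def segment_starts(total_ms, segment_ms, skip_ms):
    """Start offsets of the analysed segments.

    Each segment is segment_ms long (S110); after it, skip_ms of audio is
    skipped (S160); the loop stops when no full segment remains (S150).
    """
    starts = []
    pos = 0
    while pos + segment_ms <= total_ms:
        starts.append(pos)
        pos += segment_ms + skip_ms
    return starts


print(segment_starts(5000, 335, 1265))  # [0, 1600, 3200]
```

Only the audio inside these segments goes through resampling and feature extraction; everything in between is never touched.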

Experimentally, when the audio signal of multimedia content is discriminated using only the MCME feature vector, extracting the MCME at regular intervals is far faster than extracting it over the entire audio signal.

Specifically, extracting one MCME feature vector every 1.6 seconds discriminated the audio signal 4.8 times faster with no loss of performance at all, and extracting one MCME feature vector every 6.4 seconds reduced discrimination performance by 0.18% but improved speed 19.1-fold.

This is because the signal processing steps, such as sampling frequency normalization, MFCC and log energy extraction, MCME feature vector extraction, and CMS (Cepstral Mean Subtraction), are performed only at regular intervals rather than over the entire audio signal, which greatly reduces computation and time.
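The reported speed-ups are close to what one would expect if processing time scaled with the amount of audio actually analysed. The sketch below assumes the 335 msec segment length from the earlier worked example; the source does not state which segment length the experiment used, so this is a consistency check under that assumption rather than a derivation, and the function name is hypothetical.

```python
def expected_speedup(interval_ms, segment_ms):
    """Ratio of total audio to analysed audio when one segment of
    segment_ms is processed every interval_ms."""
    return interval_ms / segment_ms


print(round(expected_speedup(1600, 335), 1))  # 4.8
print(round(expected_speedup(6400, 335), 1))  # 19.1
```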

Preferably, the skip interval is adjusted so that the segments containing obscene sound in the content are adequately represented and discrimination performance is not impaired.

When steps S110 to S150 are repeated after step S160, the newly extracted MFCC, log energy, and MCME are appended to the values obtained earlier; for MFCC and MCME in particular, the order of the vector grows and a new feature vector is formed.

Steps S170 and S180 are performed only when the MCME + MFCC feature vector is generated, so they may be skipped when MCME alone is used as the feature vector for discriminating the audio signal.

In step S170, if step S150 determines that the end of the entire audio signal has been reached, CMS (Cepstral Mean Subtraction) is applied to the extracted MFCC feature vectors.

That is, CMS is applied to the MFCC feature vectors that were extracted in step S130 and stored in separate memory: the overall mean of each coefficient of the MFCC feature vectors is computed, and this per-coefficient mean is then subtracted from the MFCC feature vectors.

CMS, also called CMN (Cepstral Mean Normalization), is a widely used technique for compensating for changes in acoustic characteristics caused by the channel (the transmission channel, the microphone, and so on).
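The per-coefficient mean subtraction described for step S170 can be sketched in a few lines. The function name and the list-of-lists representation are assumptions made for illustration.

```python
def cepstral_mean_subtraction(mfcc_vectors):
    """Subtract, for each coefficient index, its mean over all stored vectors."""
    n = len(mfcc_vectors)
    dims = len(mfcc_vectors[0])
    means = [sum(v[d] for v in mfcc_vectors) / n for d in range(dims)]
    return [[v[d] - means[d] for d in range(dims)] for v in mfcc_vectors]


print(cepstral_mean_subtraction([[1.0, 10.0], [3.0, 30.0]]))
# [[-1.0, -10.0], [1.0, 10.0]]
```

Because the mean is taken over all stored vectors, this step can only run once the end of the audio signal has been reached, which is why it follows the S150 check.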

Multimedia content is generally produced in varied environments with varied equipment, so channel characteristics differ from one item of content to another, and applying CMS to compensate for this can improve discrimination performance.

However, CMS may not be applied, depending on the characteristics of the content and the required discrimination performance, so step S170 is performed as an optional step.

Step S180 generates the MCME + MFCC feature vector and, as noted above, is not performed when the audio signal is discriminated with MCME alone.

The MFCC feature vector extracted in step S130 and stored in separate memory and the MCME feature vector extracted in step S140 and stored in separate memory are combined to generate the MCME + MFCC feature vector.

For example, if the MCME is of order 15 and the MFCC is of order 12, an MCME + MFCC feature vector of order 27 is generated.
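Since the orders add (15 + 12 = 27), the combination in step S180 reads as a concatenation of the two vectors. A minimal sketch under that reading follows; the function name and the placeholder values are hypothetical.

```python
def combine_features(mcme, mfcc):
    """Concatenate an MCME vector and an MFCC vector into one feature vector."""
    return list(mcme) + list(mfcc)


combined = combine_features([0.0] * 15, [0.0] * 12)
print(len(combined))  # 27
```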

FIG. 2 shows an apparatus for discriminating obscene multimedia content using feature vectors according to an embodiment of the present invention.

The feature vector extractor (100) is the component that extracts feature vectors for obscenity discrimination using some segments of the audio signal of the multimedia content. It obtains the MCME or MCME + MFCC feature vector using the method shown in FIG. 1.

The obscenity discrimination unit (200) is the component that determines whether the multimedia content is obscene, using the feature vector extracted by the feature vector extractor (100) and the acoustic model (300).

The acoustic model (300) may include an obscene audio signal model and a normal audio signal model. The obscene audio signal model stores statistical information on acoustic features that frequently occur in obscene content, such as sexual vocalizations, moaning, and contact sounds, and on how those features change over time. It is trained in advance on MCME and MCME + MFCC feature vectors extracted from the audio signals of obscene content.

The normal audio signal model is trained on MCME and MCME + MFCC feature vectors extracted from a variety of audio signals such as sports, news, music, and speech.

Using the feature vector extracted by the feature vector extractor (100), the obscenity discrimination unit (200) computes the probabilistic similarity to the normal audio signal model and to the obscene audio signal model in the acoustic model (300), and classifies the content as obscene if the probability under the obscene audio signal model is higher.
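The decision rule, compare the likelihood under the two models and pick the higher, can be sketched as below. The source does not fix the model type at this point, so a single diagonal-covariance Gaussian per class is assumed as the simplest stand-in. The function names, the `(mean, variance)` model representation, and the treatment of the input as a sequence of independently scored vectors are all assumptions.

```python
import math


def diag_gaussian_loglik(x, mean, var):
    """Log-likelihood of vector x under a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def is_obscene(feature_vectors, obscene_model, normal_model):
    """Each model is a hypothetical (mean, variance) pair.

    Log-likelihoods are summed over all feature vectors; the content is
    labelled obscene when the obscene model scores higher.
    """
    score_obscene = sum(diag_gaussian_loglik(x, *obscene_model)
                        for x in feature_vectors)
    score_normal = sum(diag_gaussian_loglik(x, *normal_model)
                       for x in feature_vectors)
    return score_obscene > score_normal
```

A trained system would replace the single Gaussian with a richer model fitted to real data, but the comparison step keeps the same shape.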

Here, the obscenity discrimination unit (200) may use an HMM (Hidden Markov Model), GMM (Gaussian Mixture Model), or SVM (Support Vector Machine), all widely used in the field of speech recognition, as well as artificial neural networks, genetic algorithms, and the like.

The description given above with reference to FIG. 1 and FIG. 2 covers only the main points of the present invention. Since various designs are possible within its technical scope, the present invention is evidently not limited to FIG. 1 and FIG. 2.

FIG. 1 shows a feature vector generation method according to an embodiment of the present invention.

FIG. 2 shows an apparatus for discriminating obscene multimedia content using feature vectors according to an embodiment of the present invention.

<Explanation of reference numerals for the main parts of the drawings>

100 : feature vector extractor
200 : obscenity discrimination unit
300 : acoustic model

Claims

A feature vector generation method for audio signal discrimination of multimedia content, in which Mel-frequency Cepstrum Modulation Energy (MCME) is generated using the following equation for use as a feature vector that discriminates the characteristics of the audio signal of the multimedia content,

characterized in that the MCME is generated using some segments of the audio signal.

The feature vector generation method for audio signal discrimination of multimedia content according to claim 1, wherein generating using some segments of the audio signal comprises:

acquiring an audio signal long enough to obtain one Mel-frequency Cepstrum Modulation Energy;

generating the Mel-frequency Cepstrum Modulation Energy;

determining whether the end of the entire audio signal has been reached; and

terminating if the end of the entire audio signal has been reached, and otherwise skipping the audio signal by a predetermined interval and returning to the audio signal acquisition step.

For use as a feature vector that discriminates the characteristics of the audio signal of multimedia content, Mel-frequency Cepstrum Modulation Energy is generated using the equation of claim 1 and combined with Mel-Frequency Cepstral Coefficients (MFCC) to generate the feature vector,

The feature vector generation method for audio signal discrimination of multimedia content according to claim 3, wherein generating using some segments of the audio signal comprises:

combining the Mel-frequency Cepstrum Modulation Energy and the Mel-Frequency Cepstral Coefficients if the end of the entire audio signal has been reached; and

skipping the audio signal by a predetermined interval and returning to the audio signal acquisition step if the end of the entire audio signal has not been reached.

if the end of the entire audio signal has been reached, applying Cepstral Mean Subtraction (CMS) to the Mel-Frequency Cepstral Coefficients and then combining them with the Mel-frequency Cepstrum Modulation Energy;

A method for discriminating obscene multimedia content, comprising:

extracting a feature vector from the audio signal of multimedia content; and

determining whether the multimedia content is obscene using the feature vector and an acoustic model that includes a normal audio signal model and an obscene audio signal model,

wherein the step of extracting the feature vector uses the method of any one of claims 1 to 5.

An apparatus for discriminating obscene multimedia content, comprising:

a feature vector extractor;

an obscenity discrimination unit; and

an acoustic model,

wherein the feature vector extractor extracts a feature vector from the audio signal of multimedia content using the method of any one of claims 1 to 5, the obscenity discrimination unit determines whether the multimedia content is obscene using the feature vector extracted by the feature vector extractor and the acoustic model, and the acoustic model includes a normal audio signal model and an obscene audio signal model.