KR100870870B1

KR100870870B1 - High quality time-scaling and pitch-scaling of audio signals

Info

Publication number: KR100870870B1
Application number: KR1020037013446A
Authority: KR
Inventors: Brett Graham Crockett
Original assignee: Dolby Laboratories Licensing Corporation
Priority date: 2001-04-13
Filing date: 2002-02-12
Publication date: 2008-11-27
Also published as: KR20030085597A; AU2002248431B2; MY141496A

Abstract

In one alternative, an audio signal is analyzed using multiple psychoacoustic criteria to identify a region of the signal in which time scaling and/or pitch shifting processing would be inaudible or minimally audible, and the signal is time scaled and/or pitch shifted within that region. In another alternative, the signal is divided into auditory events, and the signal is time scaled and/or pitch shifted within an auditory event. In a further alternative, the signal is divided into auditory events, and the auditory events are analyzed using a psychoacoustic criterion to identify those auditory events in which time scaling and/or pitch shifting processing of the signal would be inaudible or minimally audible. Further alternatives provide for multiple channels of audio.

Description

HIGH QUALITY TIME-SCALING AND PITCH-SCALING OF AUDIO SIGNALS

The present invention relates to the field of psychoacoustic processing of audio signals. In particular, the invention relates to aspects of where and/or how to perform time scaling and/or pitch scaling (pitch shifting) of audio signals. The processing is particularly applicable to audio signals represented by samples, such as digital audio signals. The invention also relates to aspects of dividing audio into "auditory events," each of which tends to be perceived as separate.

Time scaling refers to altering the time evolution or duration of an audio signal without altering its spectral content (perceived timbre) or perceived pitch (pitch being a characteristic associated with periodic audio signals). Pitch scaling refers to modifying the spectral content or perceived pitch of an audio signal without affecting its time evolution or duration. Time scaling and pitch scaling are dual (complementary) methods of one another. For example, the pitch of a digitized audio signal may be scaled up by 5% without affecting its duration by first increasing the duration of the signal by 5% through time scaling and then reading out the samples at a 5% higher sample rate (e.g., by resampling), which restores the original duration. The resulting signal has the same duration as the original signal but a modified pitch or spectral character. As discussed further below, resampling may be applied, but it is not an essential step unless it is needed to maintain a constant output sampling rate or to keep the input and output sampling rates the same.
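The 5% example above can be checked with a few lines of arithmetic. The following is a minimal sketch only; the function name and the 48 kHz sample rate are illustrative assumptions, not part of the patent.

```python
def pitch_scale_via_time_scale(n_samples, sample_rate_hz, pitch_factor):
    """Return (duration_s, pitch_ratio) after time scaling and faster readout.

    Hypothetical helper: it only does the bookkeeping of the two steps,
    it does not process any audio.
    """
    # Step 1: time scaling lengthens the signal by pitch_factor (pitch unchanged).
    stretched_samples = n_samples * pitch_factor
    # Step 2: the stretched samples are read out pitch_factor times faster.
    readout_rate_hz = sample_rate_hz * pitch_factor
    duration_s = stretched_samples / readout_rate_hz
    pitch_ratio = readout_rate_hz / sample_rate_hz
    return duration_s, pitch_ratio


# One second of audio at an assumed 48 kHz, pitch raised by 5%.
duration_s, pitch_ratio = pitch_scale_via_time_scale(48000, 48000.0, 1.05)
```

The duration comes back as the original one second while every frequency is multiplied by the readout factor, which is the duality the paragraph describes.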

There are many uses for high quality methods that provide independent control of the time and pitch characteristics of an audio signal. This is particularly true for high fidelity, multichannel audio that may contain wide-ranging content, from simple tone signals to voice signals and complex musical passages. Uses for time and pitch scaling include audio/video broadcast, audio/video editing and synchronization, and multi-track audio recording and mixing. In audio/video broadcast and editing environments it may be necessary to play back the video at a rate different from that of the source material, resulting in a pitch-scaled version of the accompanying audio signal. Pitch scaling the audio can maintain synchronization between the audio and video while preserving the timbre and pitch of the original source material. In multi-track audio or audio/video editing, new material may be required to match the time-constrained duration of an audio or video piece. Time scaling the audio can fit the new piece of audio to the required duration without modifying the timbre and pitch of the source audio.

According to an aspect of the present invention, a method for time scaling and/or pitch shifting an audio signal is provided. The signal is analyzed using multiple psychoacoustic criteria to identify a region of the audio signal in which time scaling and/or pitch shifting processing of the audio signal would be inaudible or minimally audible, and the signal is time scaled and/or pitch shifted within that region.

According to another aspect of the present invention, a method for time scaling and/or pitch shifting multiple channels of audio signals is provided. Each of the channels of audio signals is analyzed using at least one psychoacoustic criterion to identify regions in the channels in which time scaling and/or pitch shifting processing of the audio signals would be inaudible or minimally audible, and all of the multiple channels of audio signals are time scaled and/or pitch shifted during a time segment that lies within an identified region in at least one of the channels of audio signals.

According to another aspect of the present invention, a method for time scaling and/or pitch shifting an audio signal is provided in which the audio signal is divided into auditory events and the signal is time scaled and/or pitch shifted within an auditory event.

According to another aspect of the present invention, a method for time scaling and/or pitch shifting a plurality of audio signal channels is provided in which the audio signal in each channel is divided into auditory events. Combined auditory events are determined, each having a boundary wherever an auditory event boundary occurs in any of the audio signal channels. All of the audio signal channels are time scaled and/or pitch shifted within a combined auditory event, so that the time scaling and/or pitch shifting falls within an auditory event in each channel.
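One way to read the combined-event rule is as a union of the per-channel boundary positions. The sketch below assumes that boundaries are given as sample indices; the function names and the list-of-lists layout are hypothetical.

```python
def combined_event_boundaries(per_channel_boundaries):
    """Union of auditory event boundary positions over all channels.

    per_channel_boundaries: one list of sample indices per channel
    (an assumed representation, not specified by the patent text).
    """
    combined = set()
    for boundaries in per_channel_boundaries:
        combined.update(boundaries)
    return sorted(combined)


def combined_events(per_channel_boundaries):
    """Return the combined events as (start, end) sample index pairs."""
    b = combined_event_boundaries(per_channel_boundaries)
    return list(zip(b[:-1], b[1:]))


# Two channels whose events break at different places.
events = combined_events([[0, 512, 2048], [0, 1024, 2048]])
```

Because every per-channel boundary is also a combined boundary, each combined event lies entirely inside one auditory event of every channel.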

According to another aspect of the present invention, a method for time scaling and/or pitch shifting an audio signal is provided in which the signal is divided into auditory events, and the auditory events are analyzed using a psychoacoustic criterion to identify those auditory events in which time scaling and/or pitch shifting processing of the audio signal would be inaudible or minimally audible. Time scaling and/or pitch shifting processing is performed within an auditory event identified as one in which such processing would be inaudible or minimally audible.

According to yet another aspect of the present invention, a method for time scaling and/or pitch shifting multiple channels of audio signals is provided in which the audio signal in each channel is divided into auditory events. The auditory events are analyzed using at least one psychoacoustic criterion to identify those auditory events in which time scaling and/or pitch shifting processing of the audio signal would be inaudible or minimally audible. Combined auditory events are determined, each having a boundary wherever an auditory event boundary occurs in the audio signal of any channel. Time scaling and/or pitch shifting processing is performed within a combined auditory event identified as one in which such processing of the multiple channels of audio signals would be inaudible or minimally audible.

According to yet another aspect of the invention, analyzing the audio signal using multiple psychoacoustic criteria includes analyzing the audio signal to identify a region of the audio signal in which the audio satisfies at least one criterion of a group of psychoacoustic criteria.

According to yet another aspect of the invention, the psychoacoustic criteria include one or more of the following: (1) the identified region of the audio signal is substantially premasked or postmasked as the result of a transient, (2) the identified region of the audio signal is substantially inaudible, (3) the identified region of the audio signal is predominantly at high frequencies, and (4) the identified region of the audio signal is a quieter portion of a segment of the audio signal in which a portion or portions of the segment preceding and/or following the region are louder. Some basic principles of psychoacoustic masking are discussed below.

An aspect of the invention is that the group of psychoacoustic criteria may be arranged in order of increasing audibility of the artifacts resulting from time scaling and/or pitch scaling processing (i.e., a hierarchy of criteria). According to another aspect of the invention, a region is identified when the highest-ranking psychoacoustic criterion (i.e., the criterion leading to the least audible artifacts) is satisfied. Alternatively, even if a criterion is satisfied, other criteria may be examined in order to identify one or more other regions of the audio that satisfy a criterion. The latter approach may be useful in the multichannel case for determining the position of all possible regions satisfying any of the criteria, including those further down the hierarchy, so that there are more possible common splice points among the multiple channels.
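The two selection strategies can be sketched as follows. The criterion labels and their ranking are assumptions made for illustration (they reuse the order in which the criteria are listed above); the patent text here does not fix names or an exact ranking.

```python
# Hypothetical labels, ordered from least audible artifacts to most audible.
CRITERIA_HIERARCHY = [
    "masked_by_transient",
    "inaudible",
    "high_frequency",
    "quieter_portion",
]


def best_criterion(satisfied):
    """Return the highest-ranking satisfied criterion, or None if none is met."""
    for name in CRITERIA_HIERARCHY:
        if name in satisfied:
            return name
    return None


def all_criteria(satisfied):
    """Return every satisfied criterion in hierarchy order (multichannel case)."""
    return [name for name in CRITERIA_HIERARCHY if name in satisfied]


chosen = best_criterion({"high_frequency", "inaudible"})
candidates = all_criteria({"high_frequency", "inaudible"})
```

Stopping at the first hit suits a single channel, while collecting every hit keeps more candidate regions available when a splice point common to several channels has to be found.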

Although aspects of the invention may employ other types of time scaling and/or pitch shifting processing (for example, the process disclosed in U.S. Patent No. 6,266,003 B1, which is incorporated herein by reference in its entirety), aspects of the invention preferably employ a type of time scaling and/or pitch shifting processing in which:

a splice point is selected in a region of the audio signal, thereby defining a leading segment of the audio signal that leads the splice point in time,

an end point spaced from the splice point is selected, thereby defining a trailing segment of the audio signal that trails the end point in time, and a target segment of the audio signal between the splice point and the end point,

the leading and trailing segments are joined at the splice point, thereby shortening the time period of the audio signal (in the case of a digital audio signal represented by samples, decreasing the number of audio signal samples) by omitting the target segment when the end point is later in time (has a higher sample number) than the splice point, or lengthening the time period (increasing the number of samples) by repeating the target segment when the end point is earlier in time (has a lower sample number) than the splice point, and

the joined leading and trailing segments are read out at a rate that yields the desired time scaling and/or pitch shifting.
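The joining step can be illustrated on a plain list of samples. This is a minimal sketch, assuming sample indices for the splice point and end point and a hard join with no crossfade; the function name is hypothetical.

```python
def splice(samples, splice_point, end_point):
    """Join the leading segment to the trailing segment.

    end_point > splice_point: samples[splice_point:end_point] is omitted
    (data compression).
    end_point < splice_point: samples[end_point:splice_point] appears twice
    (data expansion).
    """
    leading = samples[:splice_point]   # audio that leads the splice point
    trailing = samples[end_point:]     # audio from the end point onward
    return leading + trailing


signal = list(range(10))
shorter = splice(signal, 4, 7)   # target segment [4, 5, 6] omitted
longer = splice(signal, 7, 4)    # target segment [4, 5, 6] repeated
```

In both directions there is exactly one join, at the position where `leading` ends and `trailing` begins.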

The joined leading and trailing segments may be read out at a rate such that:

a time duration the same as the original time duration results in pitch shifting the audio signal,

a time duration decreased by the same proportion as the relative reduction in the number of samples, in the case of omitting the target segment, results in time compressing the audio signal,

a time duration increased by the same proportion as the relative increase in the number of samples, in the case of repeating the target segment, results in time expanding the audio signal,

a time duration decreased by a proportion different from the relative reduction in the number of samples results in time compressing and pitch shifting the audio signal, or

a time duration increased by a proportion different from the relative increase in the number of samples results in time expanding and pitch shifting the audio signal.

Whether a target segment is omitted (data compression) or repeated (data expansion), there is only one splice point and one splice. In the case of omitting the target segment, the splice is where the splice point and the end point of the omitted target segment are joined together. In the case of repeating the target segment, there is still only a single splice: it is where the end of the first rendition of the target segment (the splice point) meets the start of the second rendition of the target segment (the end point). When the number of audio samples is reduced (data compression), for criteria other than premasking or postmasking, it is desirable that the end point lie within the identified region (in addition to the splice point, which should always lie within the identified region). For compression in which the splice point is premasked or postmasked by a transient, the end point need not lie within the identified region. For other cases (except when processing takes place within an auditory event, as described below), it is preferred that the end point lie within the identified region so that nothing that might be audible is omitted or repeated. When the number of audio samples is increased (data expansion), the end point in the original audio preferably lies within the identified region of the audio signal. As described below, the possible splice point locations have an earliest and a latest time, and the possible end point locations have an earliest and a latest time. When the audio is represented by samples within a block of data in a buffer memory, the possible splice point locations have minimum and maximum locations within the block, which represent the earliest and latest possible splice point times, respectively, and the end point likewise has minimum and maximum locations within the block, which represent the earliest and latest possible end point times, respectively.

In processing multichannel audio, it is desirable to maintain the relative amplitude and phase relationships among the channels so as not to disturb directional cues. Thus, if a target segment of audio in one channel is to be omitted or repeated, the corresponding segments (having the same sample indices) in the other channels should also be omitted or repeated. It is therefore necessary to find a target segment substantially common to all channels that permits an inaudible splice in every channel.
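A minimal sketch of the multichannel constraint follows. It assumes one identified region per channel, given as a (start, end) pair of sample indices, and a hard join without crossfading; both function names are hypothetical.

```python
def common_region(regions):
    """Intersect one (start, end) identified region per channel.

    Returns the overlap shared by all channels, or None when there is none.
    """
    start = max(r[0] for r in regions)
    end = min(r[1] for r in regions)
    return (start, end) if start < end else None


def splice_all_channels(channels, splice_point, end_point):
    """Apply the same splice, at the same sample indices, to every channel."""
    return [ch[:splice_point] + ch[end_point:] for ch in channels]


shared = common_region([(100, 400), (250, 600)])
left = list(range(10))
right = [10 * x for x in range(10)]
out_left, out_right = splice_all_channels([left, right], 4, 7)
```

Using identical indices in every channel keeps the channels sample-aligned after the edit, which is what preserves their relative phase.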

Definitions

Throughout this document, the term "data compression" refers to reducing the number of samples by omitting a segment, leading to time compression, and the term "data expansion" refers to increasing the number of samples by repeating a segment, leading to time expansion. An audio "region," "segment," and "portion" each refer to a representation of a finite continuous portion of the audio from a single channel that lies conceptually between any two moments in time. Such a region, segment, or portion may be represented by samples having consecutive sample numbers or indices. "Identified region" refers to a region, segment, or portion of audio identified by psychoacoustic criteria and within which the splice point, and usually the end point, will lie. "Correlation processing region" refers to a region, segment, or portion of audio over which correlation is performed in the search for an end point, or for a splice point and an end point. "Psychoacoustic criteria" may include criteria based on time domain masking, frequency domain masking, and/or other psychoacoustic factors. As noted above, the "target segment" is the portion of audio that is removed, in the case of data compression, or repeated, in the case of data expansion.

Masking

Aspects of the present invention take advantage of human hearing and, in particular, the psychoacoustic phenomenon known as masking. Some simplified masking concepts may be appreciated by reference to FIG. 1 and the following discussion. The solid line 10 in FIG. 1 shows the sound pressure level at which a sound, such as a sine wave or a narrow band of noise, is just audible, that is, the threshold of hearing. Sounds at levels above the curve are audible; those below it are not. This threshold is strongly dependent on frequency. One can hear a much softer sound at 4 kHz than at 50 Hz or 15 kHz. At 25 kHz, the threshold is off the scale: no matter how loud the sound is, one cannot hear it.
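The frequency dependence of the threshold can be made concrete with a closed-form approximation. The sketch below uses Terhardt's widely quoted approximation of the threshold in quiet; it is not taken from the patent and is not the curve of FIG. 1, so it should be read only as an illustration of the general shape.

```python
import math


def threshold_in_quiet_db_spl(freq_hz):
    """Approximate threshold of hearing in dB SPL (Terhardt's formula).

    Swapped-in textbook approximation, intended for roughly 20 Hz to 20 kHz;
    it is an assumption of this example, not the patent's own curve.
    """
    f_khz = freq_hz / 1000.0
    return (
        3.64 * f_khz ** -0.8
        - 6.5 * math.exp(-0.6 * (f_khz - 3.3) ** 2)
        + 1e-3 * f_khz ** 4
    )


low = threshold_in_quiet_db_spl(50.0)
mid = threshold_in_quiet_db_spl(4000.0)
high = threshold_in_quiet_db_spl(15000.0)
```

The value near 4 kHz comes out far below the values at 50 Hz and 15 kHz, matching the statement that the ear is most sensitive in that range.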

Consider the threshold in the presence of a relatively loud signal at one frequency, say the 500 Hz sine wave at 12. The modified threshold 14 rises dramatically in the immediate neighborhood of 500 Hz, modestly somewhat further away in frequency, and not at all in remote parts of the audible range.

This rise in the threshold is called masking. In the presence of the loud 500 Hz sine wave signal (the "masking signal" or "masker"), signals under this threshold, which may be referred to as the "masking threshold," are hidden, or masked, by the loud signal. Further away, other signals may rise somewhat in level above the no-signal threshold yet still lie below the new masked threshold and thus be inaudible. However, in remote parts of the spectrum in which the no-signal threshold is unchanged, any sound that was audible without the 500 Hz masker remains just as audible with it. Thus, masking does not depend on the mere presence of one or more masking signals; it depends on where they lie in the spectrum. Some musical passages, for example, contain many spectral components distributed across the audible frequency range and therefore give a masked threshold curve that is raised everywhere relative to the no-signal threshold curve. Other musical passages consist, for example, of relatively loud sounds from a solo instrument whose spectral components are confined to a small part of the spectrum, giving a masked curve more like the sine wave masker example of FIG. 1.

Masking also has a temporal aspect that depends on the time relationship between the masker(s) and the masked signal(s). Some masking signals provide masking essentially only while the masking signal is present ("simultaneous masking"). Other masking signals provide masking not only while the masker occurs but also earlier in time ("backward masking" or "premasking") and later in time ("forward masking" or "postmasking"). A "transient," that is, a sudden, brief, and significant increase in signal level, may exhibit all three "types" of masking: backward masking, simultaneous masking, and forward masking, whereas a steady-state or quasi-steady-state signal may exhibit only simultaneous masking. In the context of the present invention, advantage should not be taken of the simultaneous masking resulting from a transient, because it is undesirable to disturb a transient by placing a splice coincident or nearly coincident with it.

Audio transient data has long been known to provide both forward and backward temporal masking. Transient audio material "masks" audible material both before and after the transient, so that the audio directly preceding and following it is not perceptible to a listener (simultaneous masking by a transient is not used, in order to avoid repeating or disrupting the transient). Premasking has been measured and is relatively short, lasting only a few milliseconds (msec), while postmasking can last longer than 50 msec. Both pre- and post-transient masking may be employed in connection with aspects of the present invention, although postmasking is generally more useful because of its longer duration.
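Given a detected transient, the premasked and postmasked spans can be expressed as sample ranges. In this sketch the 5 msec and 50 msec durations are assumed placeholder values chosen to be consistent with "a few milliseconds" and "longer than 50 msec"; the function name and parameters are hypothetical.

```python
def masked_regions(transient_start, transient_end, sample_rate_hz,
                   pre_ms=5.0, post_ms=50.0):
    """Return ((pre_start, pre_end), (post_start, post_end)) in samples.

    The transient itself is excluded, since simultaneous masking by the
    transient is not used. pre_ms and post_ms are illustrative assumptions.
    """
    pre_len = int(sample_rate_hz * pre_ms / 1000.0)
    post_len = int(sample_rate_hz * post_ms / 1000.0)
    premasked = (max(0, transient_start - pre_len), transient_start)
    postmasked = (transient_end, transient_end + post_len)
    return premasked, postmasked


# A transient occupying samples 1000 to 1100 at an assumed 48 kHz.
pre, post = masked_regions(1000, 1100, 48000)
```

The postmasked span is about ten times longer than the premasked one, which is why it usually offers more room for a splice.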

One aspect of the present invention is transient detection. In a practical implementation described below, subblocks (portions of a block of audio samples) are examined. A measure of their magnitudes is compared with a smoothed moving average representing the magnitude of the signal up to that point. The operation may be performed separately for the whole audio spectrum and for high frequencies only, to ensure that high-frequency transients are not diluted by the presence of larger low-frequency signals and therefore missed. Alternatively, any suitable known method of detecting transients may be employed.
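The subblock comparison can be sketched as below. The peak magnitude as the measure, the exponential smoothing, and the values of `subblock_size`, `ratio`, `smoothing`, and `floor` are all assumptions for illustration; only the full-band pass is shown, and the separate high-frequency pass would need a high-pass filter in front of the same loop.

```python
def detect_transient_subblocks(samples, subblock_size=64, ratio=4.0,
                               smoothing=0.9, floor=1e-4):
    """Return indices of subblocks whose peak jumps above the running average."""
    flagged = []
    average = None
    last_start = len(samples) - subblock_size
    for index, start in enumerate(range(0, last_start + 1, subblock_size)):
        peak = max(abs(x) for x in samples[start:start + subblock_size])
        # Compare with the smoothed average of the signal up to this point.
        if average is not None and peak > ratio * max(average, floor):
            flagged.append(index)
        # Update the smoothed moving average with this subblock's peak.
        if average is None:
            average = peak
        else:
            average = smoothing * average + (1.0 - smoothing) * peak
    return flagged


# Quiet audio, one loud subblock, then quiet audio again.
test_signal = [0.01] * 256 + [1.0] * 64 + [0.01] * 64
hits = detect_transient_subblocks(test_signal)
```

The `floor` keeps a burst that follows digital silence from being missed when the running average is zero.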

A splice may create a disturbance that results in artifacts having spectral components that decay with time. The spectrum (and amplitude) of the splicing artifacts depends on: (1) the spectra of the signals being spliced (as discussed further below, it is recognized that the artifacts potentially have a spectrum different from that of the signals being spliced), (2) the extent to which the waveforms match when joined together at the splice point (avoidance of discontinuities), and (3) the shape and duration of the crossfade where the waveforms are joined together at the splice point. Crossfading in accordance with aspects of the invention is described further below. Correlation techniques that assist in matching the waveforms where they are joined are also described below. According to an aspect of the present invention, it is desirable for the splicing artifacts to be masked, inaudible, or minimally audible. The psychoacoustic criteria contemplated by aspects of the present invention include criteria that should result in the artifacts being masked, inaudible, or minimally audible. Inaudibility or minimal audibility may be considered types of masking. Masking requires that the artifacts be constrained in time and frequency so as to lie below the masking threshold of the masking signal(s) (or, in the absence of a masking signal, below the no-signal threshold of audibility, which may be considered a form of masking). The duration of the artifacts is well defined, being, to a first approximation, essentially the length (time duration) of the crossfade. The slower the crossfade, the narrower the spectrum of the artifacts but the longer their duration.
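A crossfade replaces the hard join with a short weighted blend of the two waveforms. The following sketch assumes a linear fade centered on the splice; the fade shape, the centering, and the function name are illustrative assumptions rather than the crossfade the patent goes on to describe.

```python
def crossfade_splice(samples, splice_point, end_point, fade_len):
    """Join audio near splice_point to audio near end_point with a linear fade.

    The fade window of 2 * (fade_len // 2) samples is centered on both points,
    so the output length changes by exactly (splice_point - end_point).
    """
    half = fade_len // 2
    n = 2 * half
    lo = min(splice_point, end_point) - half
    hi = max(splice_point, end_point) + half
    if half < 1 or lo < 0 or hi > len(samples):
        raise ValueError("crossfade window does not fit inside the signal")
    faded = []
    for i in range(n):
        w = (i + 1) / (n + 1)                        # weight of incoming audio
        outgoing = samples[splice_point - half + i]  # fading out
        incoming = samples[end_point - half + i]     # fading in
        faded.append((1.0 - w) * outgoing + w * incoming)
    return samples[:splice_point - half] + faded + samples[end_point + half:]


ramp = [float(i) for i in range(100)]
compressed = crossfade_splice(ramp, 40, 60, 10)   # 20 samples removed
expanded = crossfade_splice(ramp, 60, 40, 10)     # 20 samples repeated
```

A larger `fade_len` spreads the transition over more samples, which corresponds to the trade-off stated above: a narrower artifact spectrum at the cost of a longer artifact.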

접합을 가청되지 않게 또는 최소로 가청되게 하는 것에 관한 몇가지 일반 원리들은 상승 신호 레벨들의 연속체를 고려함으로써 이해될 것이다. 적은 마스킹을 제공하거나 또는 어떠한 마스킹도 제공하지 않는 하위-레벨 신호들을 접합시키는 경우를 고려해보자. 잘 실행된 접합(즉, 최소 불연속성으로 잘 일치된 파형들)은 진폭에서 다소 낮은, 아마도 가청 임계치 아래의 가공물을 도입시키므로, 어떠한 마스킹 신호도 요구되지 않는다. 레벨들이 상승됨에 따라, 신호들은 마스킹 신호로서 작용하기 시작하여, 가청 임계치를 상승시킨다. 가공물들은 또한 진폭에서 상승하므로, 그것들은, 가청 임계치가 또한 상승되었던 경우를 제외하고는 비신호 임계치 위에 있다(도 1과 관련하여 상기되어 있음).Some general principles regarding making the junction inaudible or minimally audible will be understood by considering a continuum of rising signal levels. Consider the case of splicing low-level signals that provide less masking or no masking. Well performed junctions (ie, waveforms that are well matched with minimal discontinuity) introduce a workpiece that is somewhat lower in amplitude, perhaps below the audible threshold, so no masking signal is required. As the levels rise, the signals begin to act as masking signals, raising the audible threshold. The workpieces also rise in amplitude, so they are above the non-signal threshold, except when the audible threshold has also been raised (as described above with respect to FIG. 1).

이상적으로는, 본 발명의 양태에 따라, 가공물들을 마스킹하는 과도현상에 대해, 가공물들은 과도현상의 백워드 마스킹 또는 포워드 마스킹 일시적 영역을 발생시키며 모든 가공물의 스펙트럼 구성요소의 진폭은 매 순간 때를 맞춰 과도현상의 마스킹 임계치 아래에 있다. 그러나, 실제 구현에서, 가공물들의 모든 스펙트럼 구성요소들이 시간의 매 순간 마스킹되는 것은 아니다.Ideally, according to aspects of the present invention, for transient masking of the workpieces, the workpieces generate transient masking or forward masking transient regions of the transients and the amplitudes of the spectral components of all the workpieces in time. It is below the masking threshold of the transient. However, in practical implementations, not all spectral components of the workpieces are masked every moment of time.

Ideally, in accordance with an aspect of the present invention, for a steady-state or quasi-steady-state signal to mask the artifacts, the artifacts occur at the same time as the masking signal (simultaneous masking), and every spectral component is below the masking threshold of the steady-state signal at every instant in time.

There is a further possibility in accordance with another aspect of the present invention, namely that the amplitude of the spectral components of the artifacts is below the no-signal threshold of human audibility. In this case no masking signal is needed, although such inaudibility may be regarded as masking of the artifacts.

In principle, with sufficient processing power and/or processing time, it is possible to predict the time and spectral characteristics of the artifacts from the signals being spliced in order to determine whether the artifacts will be masked or inaudible. However, to save processing power and time, useful results may be obtained by considering the magnitude of the signals being spliced in the vicinity of the splice point (particularly within the crossfade) or, in the case of a steady-state or quasi-steady-state, predominantly high-frequency identified region in the signal, by considering the frequency content of the signals being spliced without regard to magnitude.

The magnitudes of the artifacts resulting from a splice are in general smaller than or similar to those of the signals being spliced. However, it is not, in general, practical to predict the spectrum of the artifacts. If a splice point is within a region of the audio signal below the threshold of human audibility, the resulting artifacts, although smaller or comparable in magnitude, may still be above the threshold of human audibility, because they may contain frequencies to which the ear is more sensitive (frequencies that have a lower threshold). Hence, in assessing audibility, it is preferable to compare signal amplitudes with a fixed level, namely the threshold of hearing at the ear's most sensitive frequency (around 4 kHz), rather than with the true frequency-dependent threshold of hearing. This conservative approach ensures that the processing artifacts will be below the actual threshold of hearing wherever they appear in the spectrum. In this case the length of the crossfade should not affect audibility, but it may be desirable to use a relatively short crossfade in order to allow the most room for data compression or expansion.
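As an illustration only, the conservative comparison described above might be sketched as follows in Python. The function names, the use of the block's peak sample as the "signal amplitude", and the -70 dB (relative to full scale) figure standing in for the threshold of hearing near 4 kHz are all assumptions made for this sketch; the real level depends on playback calibration and is not given in the text.

```python
import math


def peak_level_db(samples, full_scale=1.0):
    """Peak level of a block in dB relative to full scale (hypothetical helper)."""
    peak = max(abs(s) for s in samples)
    if peak == 0.0:
        return float("-inf")
    return 20.0 * math.log10(peak / full_scale)


def below_fixed_threshold(samples, threshold_db=-70.0):
    """Conservative test: compare the block's peak with ONE fixed level.

    threshold_db is an assumed stand-in for the hearing threshold at the
    ear's most sensitive frequency (about 4 kHz); it is not a value from
    the source. No frequency-dependent threshold curve is consulted.
    """
    return peak_level_db(samples) < threshold_db
```

Because the fixed level is the lowest point of the hearing-threshold curve, a region that passes this test is below the threshold at every other frequency as well, which is why the spectrum of the artifacts need not be predicted.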

The human ear lacks sensitivity to discontinuities in predominantly high-frequency waveforms (for example, a high-frequency click resulting from a high-frequency waveform discontinuity is more likely to be masked or inaudible than a low-frequency click). In the case of high-frequency waveforms, the components of the artifacts will also be predominantly high frequency and will be masked regardless of the signal magnitudes at the splice point (because of the steady-state or quasi-steady-state nature of the identified region, the magnitudes at the splice point will be similar to those of the signals in the identified region that act as maskers). This may be considered a case of simultaneous masking. In this case, although the length of the crossfade probably does not affect the audibility of the artifacts, it may be desirable to use a relatively short crossfade in order to allow the most room for data compression or expansion processing.

If the splice point is within a region of the audio signal identified as being masked by a transient (i.e., by either premasking or postmasking), the magnitude of each of the signals being spliced, taking into account the applied crossfading characteristics, including the crossfade length, determines whether a particular splice point will be masked by the transient. The amount of masking provided by a transient decays with time. Thus, in the case of premasking or postmasking by a transient, it is preferable to use a relatively short crossfade, which produces a greater disturbance but one that lasts for a shorter time and is more likely to fall within the time duration of the premasking or postmasking.
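The role of the crossfade length can be made concrete with a small Python sketch. It is not the patent's implementation: the function name is hypothetical, and a Hann-shaped (raised-cosine) fade is assumed, in line with the Hanning-window crossfade mentioned in the description of FIG. 10. The number of samples passed in is the crossfade length, so a shorter list gives a shorter disturbance.

```python
import math


def hann_crossfade(outgoing, incoming):
    """Blend two equal-length sample lists with complementary raised-cosine gains.

    outgoing fades from ~1 to ~0 while incoming fades from ~0 to ~1.
    The two gains always sum to 1, so identical inputs pass through unchanged.
    """
    n = len(outgoing)
    if n == 0 or n != len(incoming):
        raise ValueError("crossfade inputs must be non-empty and of equal length")
    mixed = []
    for i in range(n):
        # Gain sampled at the centre of each sample period (assumed convention).
        fade_in = 0.5 * (1.0 - math.cos(math.pi * (i + 0.5) / n))
        mixed.append((1.0 - fade_in) * outgoing[i] + fade_in * incoming[i])
    return mixed
```

In a splice, `outgoing` would be the samples just after the splice point and `incoming` the samples just after the end point, so the output moves smoothly from one passage to the other.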

When the splice point is within a region of the audio signal that is not premasked or postmasked as a result of a transient, an aspect of the present invention is to choose the quietest sub-segment of the audio signal within a segment of the audio signal (in practice, the segment may be a block of samples in a buffer memory). In this case, the magnitude of each of the signals being spliced, taking into account the applied crossfading characteristics, including the crossfade length, determines the extent to which the artifacts caused by the splicing disturbance will be audible. If the level of the sub-segment is low, the level of the artifact components will also be low. Depending on the level and spectrum of the low-level sub-segment, there may be some simultaneous masking. In addition, the higher-level portions of the audio surrounding the low-level sub-segment may also provide some temporal premasking or postmasking, raising the threshold during the crossfade. The artifacts may not always be inaudible, but they will be less audible than if the splice had been performed in the louder regions. Such audibility may be minimized by using a longer crossfade length and by matching the waveforms well at the splice point. However, a long crossfade limits the length and position of the target segment, because it effectively lengthens the passage of audio that is altered and forces the splice points and/or end points farther from the ends of a block (in the practical case in which the audio samples are divided into blocks). Thus, the maximum crossfade length is a compromise.
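The selection of the quietest sub-segment within a block can be sketched in Python as a sliding-window energy search. This is a minimal sketch under stated assumptions: the function name, the use of mean-square level as the measure of "quietness", and the fixed window length are illustrative choices, not details taken from the text.

```python
def quietest_subsegment(block, sub_len):
    """Return (start_index, mean_square) of the lowest-energy window in block.

    block is a list of samples; sub_len is the sub-segment length in samples.
    The window energy is updated incrementally as the window slides by one.
    """
    if not 0 < sub_len <= len(block):
        raise ValueError("sub_len must be between 1 and len(block)")
    energy = sum(s * s for s in block[:sub_len])
    best_start, best_energy = 0, energy
    for start in range(1, len(block) - sub_len + 1):
        # Add the sample entering the window, drop the one leaving it.
        energy += block[start + sub_len - 1] ** 2 - block[start - 1] ** 2
        if energy < best_energy:
            best_start, best_energy = start, energy
    return best_start, best_energy / sub_len
```

For a block whose middle portion is lower in level than its beginning and end, the returned start index falls inside that middle portion, which is where a splice would produce the lowest-level artifacts.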

Auditory Scene Analysis

Although employing psychoacoustic analysis is useful in reducing undesirable audible artifacts in a process that provides time and/or pitch scaling, reductions in undesirable audible artifacts may also be achieved by dividing audio into time segments, referred to as "events" or "auditory events", each of which tends to be perceived as separate, and by performing the time-scaling and/or pitch-scaling processing within the events. The division of sounds into units perceived as separate is sometimes referred to as "auditory event analysis" or "auditory scene analysis" (ASA). Although psychoacoustic analysis and auditory scene analysis may be employed independently as aids in reducing undesirable artifacts in a time and/or pitch scaling process, they may advantageously be employed in conjunction with each other.

Providing time and/or pitch scaling in conjunction with (1) psychoacoustic analysis, (2) auditory scene analysis, and (3) psychoacoustic and auditory scene analysis in conjunction with each other are all aspects of the present invention. Further aspects of the present invention include the use of psychoacoustic analysis and/or auditory scene analysis as part of time and/or pitch scaling of types other than those in which segments of audio are deleted or repeated. For example, the processes for time scaling and/or pitch modification of audio signals disclosed in U.S. Pat. No. 6,266,003 may be improved by applying that publication's processing techniques only to audio segments that satisfy one or more of the psychoacoustic criteria described herein and/or only to audio segments each of which does not exceed an auditory event.

An extensive discussion of auditory scene analysis is set forth by Albert S. Bregman in his book "Auditory Scene Analysis - The Perceptual Organization of Sound" (Massachusetts Institute of Technology, 1991, Fourth printing, 2001, Second MIT Press paperback edition). In addition, U.S. Pat. No. 6,002,776 to Bhadkamkar et al., Dec. 14, 1999, cites publications dating back to 1976 as "prior art work related to sound separation by auditory scene analysis." However, the Bhadkamkar et al. patent discourages the practical use of auditory scene analysis, concluding that "techniques involving auditory scene analysis, although interesting from a scientific point of view as models of human auditory processing, are currently far too computationally demanding and specialized to be considered practical techniques for sound separation until fundamental progress is made."

In accordance with aspects of the present invention, a computationally efficient process is provided for dividing audio into temporal segments, or "auditory events", that tend to be perceived as separate.

Bregman notes in one passage that "we hear discrete units when the sound changes abruptly in timbre, pitch, loudness, or (to a lesser extent) location in space" (Auditory Scene Analysis - The Perceptual Organization of Sound, supra at page 469). Bregman also discusses the perception of multiple simultaneous sound streams when, for example, they are separated in frequency.

In order to detect changes in timbre and pitch and certain changes in amplitude, the auditory event detection process according to an aspect of the present invention detects changes in spectral composition with respect to time. When applied to a multichannel sound arrangement in which the channels represent directions in space, the process according to an aspect of the present invention also detects auditory events that result from changes in spatial location with respect to time. Optionally, according to a further aspect of the present invention, the process may also detect changes in amplitude with respect to time that would not be detected by detecting changes in spectral composition with respect to time. Performing time-scaling and/or pitch-scaling within an auditory event is likely to lead to fewer audible artifacts, because the audio within an event is reasonably constant, is perceived to be reasonably constant, or is an audio entity unto itself (for example, a note played by an instrument).

In its least computationally demanding implementation, the process divides audio into time segments by analyzing the entire frequency band (full-bandwidth audio) or substantially the entire frequency band (in practical implementations, band-limiting filtering at the ends of the spectrum is often employed) and by giving the greatest weight to the loudest audio signal components. This approach takes advantage of a psychoacoustic phenomenon in which, at smaller time scales (20 msec and less), the ear tends to focus on a single auditory event at a given time. This implies that while multiple events may be occurring at the same time, one component tends to be perceptually most prominent and may be processed individually as though it were the only event taking place. Taking advantage of this effect also allows the auditory event detection to scale with the complexity of the audio being processed. For example, if the input audio signal being processed is a solo instrument, the auditory events that are identified will likely be the individual notes being played. Similarly, for an input voice signal, the individual components of speech, the vowels and consonants for example, will likely be identified as individual audio elements. As the complexity of the audio increases, such as music with a drumbeat or with multiple instruments and voice, the auditory event detection identifies the most prominent (i.e., the loudest) audio element at any given moment. Alternatively, the "most prominent" audio element may be determined by taking the hearing threshold and frequency response into consideration.

Optionally, according to further aspects of the present invention, at the expense of greater computational complexity, the process may also take into consideration changes in spectral composition with respect to time in discrete frequency bands (fixed or dynamically determined bands, or both fixed and dynamically determined bands) rather than in the full bandwidth. This alternative approach takes into account more than one audio stream in different frequency bands rather than assuming that only a single stream is perceptible at a particular time.

A simple and computationally efficient process according to the present invention for segmenting audio has been found useful for identifying auditory events and, when employed with time and/or pitch modification techniques, for reducing audible artifacts.

The auditory event detection process of the present invention may be implemented by dividing a time-domain audio waveform into time intervals or blocks and then converting the data in each block to the frequency domain, using either a filter bank or a time-frequency transformation such as the FFT. The amplitude of the spectral content of each block may be normalized in order to eliminate or reduce the effect of amplitude changes. Each resulting frequency-domain representation provides an indication of the spectral content (amplitude as a function of frequency) of the audio in the particular block. The spectral content of successive blocks is compared, and a change greater than a threshold may be taken to indicate the temporal start or temporal end of an auditory event.
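The block-by-block comparison just described can be sketched in Python as follows. This is an illustrative sketch, not the patented implementation: a naive DFT written with the standard library stands in for an FFT, and the block length of 64 samples, the sum-of-absolute-differences measure, and the threshold of 0.5 are hypothetical values chosen only to make the example concrete.

```python
import cmath
import math


def magnitude_spectrum(block):
    """Magnitudes of DFT bins 0..N/2 (naive O(N^2) DFT standing in for an FFT)."""
    n = len(block)
    return [abs(sum(block[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def normalized_profile(block):
    """Spectrum scaled so its largest bin is 1, removing overall level changes."""
    mags = magnitude_spectrum(block)
    peak = max(mags)
    return [m / peak for m in mags] if peak > 0.0 else [0.0] * len(mags)


def spectral_event_boundaries(samples, block_len=64, threshold=0.5):
    """Indices of blocks whose normalized spectrum differs from the previous
    block's by more than threshold (assumed measure: sum of absolute
    bin-by-bin differences)."""
    profiles = [normalized_profile(samples[i:i + block_len])
                for i in range(0, len(samples) - block_len + 1, block_len)]
    boundaries = []
    for b in range(1, len(profiles)):
        diff = sum(abs(x - y) for x, y in zip(profiles[b], profiles[b - 1]))
        if diff > threshold:
            boundaries.append(b)
    return boundaries
```

With this sketch, a tone that changes frequency between two blocks yields a boundary at the block where the change occurs, whereas a tone that only drops in level yields none, because the normalization removes the level change before the comparison.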

In order to minimize computational complexity, only a single band of frequencies of the time-domain audio waveform may be processed, preferably either the entire frequency band of the spectrum (which may be about 50 Hz to 15 kHz in the case of an average-quality music system) or substantially the entire frequency band (for example, a band-defining filter may exclude the high-frequency and low-frequency extremes).

The degree to which the frequency-domain data needs to be normalized gives an indication of amplitude. Hence, if a change in this degree exceeds a predetermined threshold, that too may be taken to indicate an event boundary. Event start and end points resulting from spectral changes and from amplitude changes may be ORed together so that event boundaries resulting from either type of change are identified.
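A hedged Python sketch of the amplitude-based boundaries and of the OR operation follows. The text uses the degree of spectral normalization as the amplitude indicator; here, as a simplifying assumption, the peak sample level of each block in dB is used as a proxy for that degree. The function names, the 12 dB threshold, and the -120 dB floor for silent blocks are all hypothetical.

```python
import math


def block_level_db(block):
    """Peak level of a block in dB: an assumed proxy for the normalization degree."""
    peak = max(abs(s) for s in block)
    return 20.0 * math.log10(peak) if peak > 0.0 else -120.0  # assumed floor


def amplitude_event_boundaries(samples, block_len=64, threshold_db=12.0):
    """Indices of blocks whose level changes by more than threshold_db
    relative to the previous block."""
    levels = [block_level_db(samples[i:i + block_len])
              for i in range(0, len(samples) - block_len + 1, block_len)]
    return [b for b in range(1, len(levels))
            if abs(levels[b] - levels[b - 1]) > threshold_db]


def merge_boundaries(*boundary_lists):
    """OR several boundary lists together: a block is a boundary if any list says so."""
    return sorted(set().union(*boundary_lists))
```

A boundary list obtained from spectral changes and one obtained from `amplitude_event_boundaries` can then be passed together to `merge_boundaries` to obtain the boundaries caused by either type of change.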

In practice, the temporal start and end boundaries of each auditory event necessarily coincide with boundaries of the blocks into which the time-domain audio waveform is divided. There is a trade-off between real-time processing requirements (larger blocks require less processing overhead) and the resolution of event location (smaller blocks provide more detailed information about the location of auditory events).

In the case of multiple audio channels, each representing a direction in space, each channel may be processed independently, and the resulting event boundaries for all channels may then be ORed together. Thus, for example, an auditory event that abruptly switches direction will likely result in an "end of event" boundary in one channel and a "start of event" boundary in another channel. When ORed together, two events will be identified. Thus, the auditory event detection process of the present invention is capable of detecting auditory events based on spectral (timbre and pitch), amplitude, and directional changes.
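The combination across channels can be pictured with per-block boolean flags, one list per channel, as in the Python sketch below. The flag representation and the function name are assumptions made for illustration; the text specifies only that the boundaries of all channels are ORed together.

```python
def combine_channel_boundaries(per_channel_flags):
    """OR per-block boundary flags across channels.

    per_channel_flags holds one list of booleans per channel, where
    flags[b] is True if block b starts or ends an event in that channel.
    All channels must cover the same number of blocks.
    """
    if not per_channel_flags:
        return []
    n_blocks = len(per_channel_flags[0])
    if any(len(flags) != n_blocks for flags in per_channel_flags):
        raise ValueError("all channels must have the same number of blocks")
    # zip(*...) walks the channels block by block; any() is the logical OR.
    return [any(column) for column in zip(*per_channel_flags)]
```

For a sound that jumps from one channel to another, the "end of event" flag in the first channel and the "start of event" flag in the second fall on the same block, so the combined list marks a single boundary that separates two events.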

As a further option, but at the expense of greater computational complexity, instead of processing the spectral content of the time-domain waveform in a single band of frequencies, the spectrum of the time-domain waveform may be divided into two or more frequency bands prior to frequency-domain conversion. Each of the frequency bands may then be converted to the frequency domain and processed as though it were an independent channel in the manner described above. The resulting event boundaries may then be ORed together to define the event boundaries for that channel. The multiple frequency bands may be fixed, adaptive, or a combination of fixed and adaptive. For example, tracking filter techniques employed in audio noise reduction and other arts may be used to define adaptive frequency bands (for example, dominant simultaneous sine waves at 800 Hz and 2 kHz could result in two adaptively determined bands centered on those two frequencies).

Other techniques for providing auditory scene analysis may be employed to identify auditory events in various aspects of the present invention.

In the practical embodiments set forth herein, audio is divided into fixed-length sample blocks. However, the principles of the various aspects of the invention require neither that the audio be arranged into sample blocks nor, if it is, that the blocks be of constant length (blocks may be of variable length, each of which is essentially the length of an auditory event). When the audio is divided into blocks, a further aspect of the invention, in both single-channel and multichannel environments, is not to process certain blocks.

Other aspects of the invention will be appreciated and understood as the detailed description of the invention is read and understood.

FIG. 1 is an idealized plot of the human hearing threshold in the absence of sound (solid line) and in the presence of a 500 Hz sine wave (dashed line). The horizontal scale is frequency in hertz (Hz) and the vertical scale is in decibels (dB) with respect to 20 μPa.

FIGS. 2A and 2B are schematic conceptual representations illustrating the concept of data compression by removing a target segment. The horizontal axis represents time.

FIGS. 2C and 2D are schematic conceptual representations illustrating the concept of data expansion by repeating a target segment. The horizontal axis represents time.

FIG. 3A is a schematic conceptual representation of a block of audio data represented by samples, showing the minimum splice point location and the maximum splice point location in the case of data compression. The horizontal axis is samples and represents time. The vertical axis is normalized amplitude.

FIG. 3B is a schematic conceptual representation of a block of audio data represented by samples, showing the minimum splice point location and the maximum splice point location in the case of data expansion. The horizontal axis is samples and represents time. The vertical axis is normalized amplitude.

FIG. 4 is a schematic conceptual representation of a block of audio data represented by samples, showing the splice point, the minimum end point location, the maximum end point location, the correlation processing region, and the maximum processing point location. The horizontal axis is samples and represents time. The vertical axis is normalized amplitude.

FIG. 5 is a flow chart setting forth a time-scaling and pitch-scaling process according to an aspect of the present invention in which psychoacoustic analysis is performed.

FIG. 6 is a flow chart showing details of the psychoacoustic analysis step 206 of FIG. 5.

FIG. 7 is a flow chart showing details of the transient detection substep of the transient analysis step.

FIG. 8 is a schematic conceptual representation of a block of data samples in a transient analysis buffer. The horizontal axis is samples in the block.

FIG. 9 is a schematic conceptual representation showing an audio block analysis example in which a 450 Hz sine wave has a middle portion 6 dB lower in level than its beginning and ending sections in the block. The horizontal axis is samples representing time and the vertical axis is normalized amplitude.

FIG. 10 is a schematic conceptual representation of how crossfading may be implemented, showing an example of data segment splicing using a nonlinear crossfade shaped in accordance with a Hanning window. The horizontal axis represents time and the vertical axis represents amplitude.

FIG. 11 is a flow chart showing details of the multichannel splice point selection step 210 of FIG. 5.

FIG. 12 is a series of idealized waveforms in four audio channels representing blocks of audio data samples, showing an identified region in each channel, each satisfying a different criterion, and showing the overlap of the identified regions in which a common multichannel splice point may be located. The horizontal axis is samples and represents time. The vertical axis is normalized amplitude.

FIG. 13 shows the time-domain information of a highly periodic portion of an exemplary speech signal. An example of well-chosen splice and end points that maximize the similarity of the data on either side of the discarded data segment is shown. The horizontal axis is samples representing time and the vertical axis is amplitude.

FIG. 14 is an idealized depiction of waveforms, showing the instantaneous phase of a speech signal, in radians, superimposed over the time-domain signal x(n). The horizontal axis is samples and the vertical axis is both normalized amplitude and phase (in radians).

FIG. 15 is a flow chart showing details of the correlation step 214 of FIG. 5. FIG. 15 includes idealized waveforms showing the results of phase correlations in each of five audio channels and the results of time-domain correlations in each of the five channels. The waveforms represent blocks of audio data samples. The horizontal axis is samples representing time and the vertical axis is normalized amplitude.

FIG. 16 is a schematic conceptual representation that has aspects of a block diagram and a flow chart and that also includes an idealized waveform showing an example of additive weighted correlation analysis processing. The horizontal axis of the waveform is samples representing time and the vertical axis is normalized amplitude.

FIG. 17 is a flow chart setting forth a time-scaling and pitch-scaling process according to an aspect of the present invention in which both psychoacoustic analysis and auditory scene analysis are performed.

FIG. 18 is a flow chart showing details of the auditory scene analysis step 706 of the process of FIG. 17.

FIG. 19 is a schematic conceptual representation of a general method of calculating spectral profiles.

FIG. 20 is a series of idealized waveforms in two audio channels, showing the auditory events in each channel and the combined auditory events across the two channels.

FIG. 21 is a flow chart showing details of the psychoacoustic analysis step 708 of the process of FIG. 17.

FIG. 22 is a schematic conceptual representation of a block of data samples in a transient analysis buffer. The horizontal axis is samples in the block.

FIG. 23 is an idealized waveform of a single channel of orchestral music illustrating auditory events and psychoacoustic criteria.

FIG. 24 is a series of idealized waveforms in four audio channels, illustrating auditory events, psychoacoustic criteria, and the ranking of combined auditory events.

FIG. 25 shows one combined auditory event of FIG. 24 in greater detail.

FIG. 26 is an idealized waveform of a single channel, illustrating examples of auditory events of low psychoacoustic quality ranking that may be skipped.

FIG. 27 is a schematic conceptual representation, including an idealized waveform in a single channel, illustrating an initial step in selecting splice point and end point locations for a single channel of audio in accordance with an alternative aspect of the invention.

FIG. 28 is similar to FIG. 27 except that it shows the splice point region Tc shifted by N samples.

FIG. 29 is a schematic conceptual representation showing an example of multiple correlation calculations when the splice point region is advanced consecutively by Tc samples. The three processing steps are superimposed on the plot of the audio data block. The processing illustrated in FIG. 29 results in three correlation functions, each with a maximum value, as shown in FIGS. 30A-C respectively.

FIG. 30A is an idealized correlation function for the case of the first splice point region Tc location shown in FIG. 29.

FIG. 30B is an idealized correlation function for the case of the second splice point region Tc location shown in FIG. 29.

FIG. 30C is an idealized correlation function for the case of the third splice point region Tc location shown in FIG. 29.

FIG. 31 is an idealized audio waveform having three combined auditory event regions, showing an example in which a target segment of 363 samples is selected in the first combined event region.

FIGS. 2A and 2B schematically illustrate the concept of data compression by removing a target segment, while FIGS. 2C and 2D schematically illustrate the concept of data expansion by repeating a target segment. In practice, the data compression and data expansion processes are applied to data in one or more buffer memories, the data being samples representing an audio signal.

Although the identified regions in FIGS. 2A through 2D satisfy the criterion that they are postmasked as the result of a signal transient, the principles underlying the examples of FIGS. 2A through 2D also apply to identified regions that satisfy other psychoacoustic criteria, including the three others mentioned above.

Referring to FIG. 2A, which illustrates data compression, audio 102 has a transient 104 that causes a portion of the audio 102 to be a psychoacoustically postmasked region 106 constituting the "identified region". The audio is analyzed and a splice point 108 is chosen within the identified region 106. As explained further below in connection with FIGS. 3A and 3B, if the audio is represented by a block of data in a buffer, there is within the block a minimum or earliest splice point location (i.e., if the data is represented by samples, it has a low sample number or index) and a maximum or latest splice point location (i.e., if the data is represented by samples, it has a high sample number or index). The splice point is selected within the range of possible splice point locations from the minimum splice point location to the maximum splice point location, and its exact location is not critical, although in most cases it is desirable to place the splice point at or near the minimum or earliest splice point location in order to maximize the size of the target segment. A default splice point location, a short time (such as 5 ms) after the start of the identified region, may be used. An alternative method that may provide a more optimized splice point location is described below.

Analysis of the audio continues and an end point 110 is chosen. In one alternative, the analysis includes an autocorrelation of the audio 102 in a region 112 extending forward (toward higher sample numbers or indices) from the splice point 108 up to a maximum processing point location 115. In practice, the maximum end point location is earlier (has a lower sample number or index) than the maximum processing point by a time (or a time-equivalent number of samples) equal to half the crossfade time, as explained further below. In addition, as explained further below, the autocorrelation process seeks a correlation maximum between a minimum end point location 116 and the maximum end point location 114, and may employ time-domain correlation or both time-domain correlation and phase correlation. A way of determining the maximum and minimum end point locations is described below. For time compression, the end point 110, determined by the autocorrelation, is at a time subsequent to the splice point 108 (i.e., if the audio is represented by samples, it has a higher sample number or index). The splice point 108 defines a leading segment 118 of the audio that leads the splice point (i.e., if the data is represented by samples, it has lower sample numbers or indices than the splice point). The end point 110 defines a trailing segment 120 that trails the end point (i.e., if the data is represented by samples, it has higher sample numbers or indices than the end point). The splice point 108 and the end point 110 define the two ends of a segment of the audio, namely the target segment 122.
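The end point search described above can be sketched as follows. This is a minimal illustration, assuming a plain normalized time-domain correlation over a short window that starts at the splice point. The function name `find_end_point`, the `window` parameter, and the normalization are assumptions made for the example and are not taken from the text, which also allows phase correlation.

```python
import math

def find_end_point(audio, splice, min_offset, max_offset, window):
    """Search [splice + min_offset, splice + max_offset] for the end point whose
    following samples best match the samples following the splice point."""
    ref = audio[splice:splice + window]
    ref_energy = sum(x * x for x in ref)
    best_index, best_score = None, -math.inf
    for offset in range(min_offset, max_offset + 1):
        start = splice + offset
        cand = audio[start:start + window]
        if len(cand) < window:
            break  # candidate window would run past the end of the block
        denom = math.sqrt(ref_energy * sum(x * x for x in cand))
        if denom == 0.0:
            continue  # silent window: correlation is undefined
        score = sum(a * b for a, b in zip(ref, cand)) / denom
        if score > best_score:
            best_index, best_score = start, score
    return best_index, best_score


# Usage: a sinusoid with a 50-sample period; the best match lies one period later.
audio = [math.sin(2 * math.pi * n / 50) for n in range(400)]
end_point, score = find_end_point(audio, splice=20, min_offset=30,
                                  max_offset=80, window=40)
```

Restricting the search to the range between the minimum and maximum offsets mirrors the minimum and maximum end point locations in the text; the particular offsets used here are arbitrary.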

For data compression, the target segment is removed, and in FIG. 2B the leading segment is joined, butted, or spliced together with the trailing segment at the splice point, preferably using crossfading, with the splice point remaining within the identified region 106. The crossfaded splice "point" may therefore be characterized as a splice "region". Components of splicing artifacts remain principally within the crossfade, which lies within the identified region 106, minimizing the audibility of the data compression. In FIG. 2B, the compressed data is identified by reference numeral 102'.

Throughout the various figures, the same reference numerals are applied to like elements, while reference numerals with prime marks are used to designate related but modified elements.

Referring to FIG. 2C, which illustrates data expansion, audio 124 has a transient 126 that causes a portion of the audio 124 to be a psychoacoustically postmasked region 128 constituting the "identified region". In the case of data expansion, the audio is analyzed and a splice point 130 is likewise chosen within the identified region. As explained further below, if the audio is represented by a block of data in a buffer, there is a minimum splice point location and a maximum splice point location within the block. The audio is analyzed both forward (toward higher sample numbers or indices, if the data is represented by samples) and backward (toward lower sample numbers or indices, if the data is represented by samples) from the splice point in order to locate an end point. This forward and backward search is performed to find data before the splice point that is most like the data at and after the splice point and is therefore appropriate for copying and repetition. More specifically, the forward search runs from the splice point 130 up to a first maximum processing point location 132, and the backward search runs from the splice point 130 back to a second maximum processing point location 134. The two maximum processing point locations may be, but need not be, spaced by the same number of samples from the splice point 130. As explained further below, the two signal segments, from the splice point to the maximum search point location and to the maximum end point location respectively, are cross-correlated in order to find a correlation maximum. The cross-correlation may employ time-domain correlation or both time-domain correlation and phase correlation. In practice, the maximum end point location 135 is later (has a higher sample number or index) than the second maximum processing point 134 by a time (or a time-equivalent number of samples) equal to half the crossfade time, as explained further below.

Contrary to the data compression case of FIGS. 2A and 2B, the end point 136, determined by the cross-correlation, is at a time preceding the splice point 130 (i.e., if the audio is represented by samples, it has a lower sample number or index). The splice point 130 defines a leading segment 138 of the audio that leads the splice point (i.e., if the audio is represented by samples, it has lower sample numbers or indices than the splice point). The end point 136 defines a trailing segment 140 that trails the end point (i.e., if the audio is represented by samples, it has higher sample numbers or indices than the end point). The splice point 130 and the end point 136 define the two ends of a segment of the audio, namely the target segment 142. Thus, the definitions of splice point, end point, leading segment, trailing segment, and target segment are the same for data compression and for data expansion. However, in the data expansion case the target segment is part of both the leading segment and the trailing segment (hence it is repeated), whereas in the data compression case the target segment is part of neither (hence it is deleted).

In FIG. 2D, the leading segment is joined with the target segment at the splice point, preferably using crossfading (not shown in this figure), so that the target segment is repeated in the resulting audio 124'. In the case of data expansion, the end point 136 should be within the identified region 128 of the original audio (thus placing all of the target segment in the original audio within the identified region). The first rendition 142' of the target segment (the part that is a portion of the leading segment) and the splice point 130 remain within the masked region 128. The second rendition 142" of the target segment (the part that is a portion of the trailing segment) lies after the splice point 130 and may, but need not, extend outside the masked region 128. Such an extension outside the masked region has no audible effect, however, because the target segment is continuous with the trailing segment both in the original audio and in the time-expanded version.
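Because the splice point, end point, leading segment, and trailing segment are defined identically for compression and expansion, one routine can illustrate both cases. The following is a minimal sketch, assuming a linear crossfade centered on the splice point; the function name `splice_block` and the linear fade shape are assumptions made for the example, not the crossfade described later in the text.

```python
def splice_block(audio, splice_point, end_point, crossfade=0):
    """Join the leading segment (samples before splice_point) to the trailing
    segment (samples from end_point on).

    end_point > splice_point removes the target segment (compression);
    end_point < splice_point repeats it (expansion).
    """
    half = crossfade // 2
    lo, hi = min(splice_point, end_point), max(splice_point, end_point)
    if lo - half < 0 or hi + half > len(audio):
        raise ValueError("crossfade region does not fit inside the block")
    out = list(audio[:splice_point - half])
    for i in range(2 * half):
        w = (i + 0.5) / (2 * half)  # linear fade-in weight for the trailing side
        out.append((1.0 - w) * audio[splice_point - half + i]
                   + w * audio[end_point - half + i])
    out.extend(audio[end_point + half:])
    return out


ramp = [float(n) for n in range(100)]
compressed = splice_block(ramp, splice_point=40, end_point=60)  # samples 40..59 removed
expanded = splice_block(ramp, splice_point=60, end_point=40)    # samples 40..59 repeated
```

With `crossfade=0` the segments are simply butted together; a non-zero value blends the audio around the splice point into the audio around the end point, and the output length changes by the target segment length in either case.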

Preferably, the target segment should not include a transient, in order to avoid omitting the transient in the case of data compression or repeating it in the case of data expansion. Hence, the splice point and end point should be on the same side of the transient, so that both are earlier than the transient (i.e., if the audio is represented by samples, they have lower sample numbers or indices) or both are later than it (i.e., if the audio is represented by samples, they have higher sample numbers or indices).

Another aspect of the present invention is that the audibility of a splice may be further reduced by the choice of crossfade shape and by varying the shape and duration of the crossfade in response to the audio signal. Further details of crossfading are set forth below in connection with FIG. 10 and its description. In practice, the crossfade time may slightly affect the placement of the extreme locations of the splice point and end point, as explained further below.

FIGS. 3A and 3B set forth examples of determining the minimum and maximum splice point locations within a block of samples representing the input audio, for compression (FIG. 3A) and for expansion (FIG. 3B). The minimum (earliest) splice point location has a lower sample number or index than the maximum (latest) splice point location. The minimum and maximum locations of the splice points with respect to the ends of the block, for data compression and data expansion, are related in various ways to the length of the crossfade used in splicing and to the maximum length of the correlation processing region. Determination of the maximum length of the correlation processing region is explained further in connection with FIG. 4. For time scale compression, the correlation processing region is the region of audio data after the splice point that is used in autocorrelation processing to identify an appropriate end point. For time scale expansion, there are two correlation processing regions, which may be, but need not be, of equal length, one before and one after the splice point. They define the two regions used in cross-correlation processing to determine an appropriate end point.

Every block of audio data has a minimum splice point location and a maximum splice point location. As shown in FIG. 3A, the minimum splice point location with respect to the end of the block, representing the earliest time in the case of compression, is limited by half the length of the crossfade, because the audio data around the splice point is crossfaded with the audio data around the end point. Similarly, for time scale compression, the maximum splice point location with respect to the end of the block, representing the latest time in the case of compression, is limited by the maximum correlation processing length (the maximum end point location is "earlier" than the end of the maximum processing length by half the crossfade length).
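One possible reading of these compression limits, expressed in samples, is sketched below. The function names and the example values (a 440-sample crossfade and a 1323-sample correlation processing length) are assumptions chosen for illustration; the text gives the relationships only qualitatively, so this is an interpretation rather than the definitive rule.

```python
def compression_splice_limits(block_size, crossfade, max_processing):
    """Return (min_splice, max_splice) sample indices for time scale compression.

    Assumed reading: the splice point needs half a crossfade of audio before it,
    and the whole correlation processing region must fit after it in the block.
    """
    min_splice = crossfade // 2
    max_splice = block_size - max_processing
    if min_splice > max_splice:
        raise ValueError("block too short for these settings")
    return min_splice, max_splice


def max_end_point(splice, crossfade, max_processing):
    """Latest end point: half a crossfade before the maximum processing point."""
    return splice + max_processing - crossfade // 2


limits = compression_splice_limits(block_size=4096, crossfade=440, max_processing=1323)
latest_end = max_end_point(limits[1], crossfade=440, max_processing=1323)
```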

FIG. 3B outlines the determination of the minimum and maximum splice point locations for time scale expansion. The minimum splice point location with respect to the end of the block, representing the earliest time for time scale expansion, is related to the maximum length of the correlation processing region in a manner similar to the determination of the maximum splice point for time scale compression (the minimum end point location is "later" than the end of the maximum correlation processing length by half the crossfade length). The maximum splice point location with respect to the end of the block, representing the latest time for time scale expansion, is related only to the maximum correlation processing length. This is because, for time scale expansion, the data following the splice point is used only for correlation processing, and an end point is never located after the maximum splice point location.

Although FIGS. 3A and 3B are described with respect to a block of input data, the same principles apply to setting the maximum and minimum end points with respect to any subset of the input data (i.e., a group of successive samples) that is processed separately, including an auditory event, as discussed further below.

As shown in FIG. 4, for the case of time scale compression, the region used for correlation processing is located after the splice point. The splice point and the maximum processing point location define the length of the correlation processing region. The locations of the splice point and the maximum processing point shown in FIG. 4 are arbitrary examples. The minimum end point location indicates the minimum sample number or index value after the splice point at which the end point may be located. Similarly, the maximum end point location indicates the maximum sample number or index value after the splice point at which the end point may be located. The maximum end point location is "earlier" than the maximum processing point location by half the crossfade length. Once the splice point has been selected, the minimum and maximum end point locations control the amount of data that may be used for the target segment, and they may be assigned default values (usable values are 7.5 msec and 25 msec, respectively). Alternatively, the minimum and maximum end point locations may be variable, changing dynamically depending on the audio content and/or the desired amount of time scaling (the minimum end point may vary based on the desired time scale rate). For example, for a signal whose predominant frequency component is 50 Hz and which is sampled at 44.1 kHz, a single period of the audio waveform is approximately 882 samples (or 20 msec) long. This indicates that the maximum end point location should yield a target segment long enough to contain at least one cycle of the audio data. In any case, the maximum processing point can be no later than the end of the processing block (4096 samples in this example) or, when auditory events are taken into account as explained below, no later than the end of the auditory event. Similarly, if the minimum end point location is chosen to be 7.5 msec after the splice point and the audio being processed contains a signal for which the selected end point generally lies near the minimum end point location, then the maximum percentage of time scaling depends on the length of each input data block. For example, if the input data block size is 4096 samples (or about 93 msec at a 44.1 kHz sample rate), then a minimum target segment length of 7.5 msec results in a maximum time scale rate of 7.5/93 = 8% if the minimum end point location is selected. The minimum end point location for time scale compression may be set to 7.5 msec (331 samples at 44.1 kHz) for rate changes below 7%, and otherwise set equal to:

Minimum end point location = ((time_scale_rate - 1.0) * block size);

where time_scale_rate is greater than 1.0 for time scale compression (1.10 = a 10% increase in playback rate), and the block size is currently 4096 samples at 44.1 kHz. These examples show the benefit of allowing the minimum and maximum end point locations to vary depending on the audio content and the desired time scale percentage. In any case, the minimum end point should not be so large, or so near the maximum end point, as to unduly limit the search region.
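The arithmetic in this passage can be checked with a short sketch. The function name `min_end_point_location` and the treatment of a change of exactly 7% are assumptions made for the example; the default values (7.5 msec, 4096 samples, 44.1 kHz) come from the text.

```python
def min_end_point_location(time_scale_rate, block_size=4096, sample_rate=44100):
    """Minimum end point location, in samples after the splice point,
    for time scale compression (time_scale_rate > 1.0)."""
    change = time_scale_rate - 1.0
    if change < 0.07:
        # Default of 7.5 msec, converted to samples.
        return round(0.0075 * sample_rate)
    return round(change * block_size)


block_ms = 1000.0 * 4096 / 44100      # duration of one block in msec
max_rate_percent = 100.0 * 7.5 / block_ms  # scaling available from a 7.5 msec segment
period_samples = 44100 / 50           # one cycle of a 50 Hz component

small_change = min_end_point_location(1.05)  # 5% faster: default applies
large_change = min_end_point_location(1.10)  # 10% faster: formula applies
```

A block lasts about 92.9 msec, so removing 7.5 msec per block gives roughly 8% compression, which matches the figure quoted above.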

A further aspect of the invention is that, in order to further reduce the possibility of an audible splice, a comparison technique may be employed to match the signal waveforms at the splice point and the end point, lessening the need to rely on masking or inaudibility. A matching technique that constitutes a further aspect of the invention seeks to match both the amplitude and the phase of the waveforms that are joined at the splice. This in turn may involve correlation, as mentioned above, which is also an aspect of the invention. The correlation may include compensation for the variation of the ear's sensitivity with frequency.

As described in connection with FIGS. 2A-2D, the data compression or expansion technique employed in aspects of the present invention deletes or repeats sections of audio. In the first alternative mentioned above, the splice point location is selected using general, predefined system parameters that are based on the length of the crossfade or the desired distance of the splice point location from signal components such as transients, and/or by taking other signal conditions into account. A more detailed analysis of the audio (e.g., correlation) is then performed around the somewhat arbitrary splice point to determine the end point.

In accordance with a second alternative, the splice point and end point locations are selected in a more signal-dependent manner. Windowed data around each of a series of trial splice point locations is correlated against data in a correlation processing region to select an associated trial end point location. The trial splice point location having the strongest correlation among all the trial splice point locations is selected as the final splice point, and the trial end point is located substantially at the position of strongest correlation. Although, in principle, the spacing between trial splice points may be as small as one sample, the trial splice points may be spaced more widely to reduce processing complexity. The width of the crossfade region is a suitable increment for trial splice points, as described below. This alternative method of choosing splice point and end point locations applies both to data compression and to data expansion processing. Although this alternative is described in more detail below in connection with an aspect of the invention that employs auditory scene analysis, it may also be employed with the first described embodiment of the invention, which employs psychoacoustic analysis.
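The second alternative can be sketched as a nested search over trial splice points and candidate end points. This is a minimal illustration for the compression case, assuming an unweighted rectangular window and normalized time-domain correlation; the names `correlation` and `best_trial_splice`, the window length, and the synthetic test signal are all assumptions made for the example.

```python
import math
import random

def correlation(a, b):
    """Normalized correlation of two equal-length windows (0.0 if either is silent)."""
    denom = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0


def best_trial_splice(audio, step, window, min_offset, max_offset):
    """Try splice points every `step` samples; for each, search the correlation
    region for the best end point. Return (splice, end, score) of the best pair."""
    best = (None, None, -math.inf)
    last_trial = len(audio) - max_offset - window
    for trial in range(0, last_trial + 1, step):
        ref = audio[trial:trial + window]
        for offset in range(min_offset, max_offset + 1):
            start = trial + offset
            score = correlation(ref, audio[start:start + window])
            if score > best[2]:
                best = (trial, start, score)
    return best


# Usage: 200 samples of noise followed by a sinusoid with a 50-sample period.
rng = random.Random(0)
audio = [rng.uniform(-1.0, 1.0) for _ in range(200)]
audio += [math.sin(2 * math.pi * n / 50) for n in range(400)]
splice, end, score = best_trial_splice(audio, step=40, window=40,
                                       min_offset=30, max_offset=80)
```

Here `step` plays the role of the trial splice point increment; setting it to the crossfade width follows the suggestion in the text, while a step of one sample would give the exhaustive search at higher cost.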

Psychoacoustic Analysis Embodiment

A flowchart setting forth a single-channel or multichannel time-scaling and/or pitch-scaling process according to aspects of the invention involving psychoacoustic analysis is shown in FIG. 5. A flowchart setting forth a single-channel or multichannel time-scaling and/or pitch-scaling process according to aspects of the invention involving both psychoacoustic analysis and auditory event analysis is shown in FIG. 17, which is described below. Other aspects of the invention form portions or variations of the FIG. 5 and FIG. 17 processes. The processes may be used to perform real-time pitch scaling and non-real-time pitch and time scaling. A low-latency time-scaling process cannot operate effectively in real time, because it would have to buffer the input audio signal in order to play it at a different rate, resulting in buffer underflow or overflow: the buffer would be emptied at a rate different from the rate at which input data is received.

Input Data 202 (FIG. 5)

Referring to FIG. 5, the first step, decision step 202 ("Input data?"), determines whether digitized input audio data is available for data compression or data expansion processing. The source of the data may be a computer file or a block of input data, which may be stored in a real-time input buffer. If data is available, step 204 ("Get N samples for each channel") accumulates data blocks of N time-synchronous samples, representing time-concurrent segments, into one block for each of the input channels that is to be data compression or data expansion processed (the number of channels being one or more). The number of input data samples N used by the process may be fixed at any reasonable number of samples, thereby dividing the input data into blocks. In principle, the processed audio may be digital or analog and need not be divided into blocks.
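Steps 202 and 204 amount to cutting every channel into time-aligned blocks of N samples. The following is a minimal sketch, assuming the channels are already in memory as Python lists; the generator name `get_blocks` and the choice to leave a final partial block unprocessed are assumptions made for the example.

```python
def get_blocks(channels, n=4096):
    """Yield one time-aligned block per step: a list holding n samples per channel.

    Samples left over after the last full block are not yielded here; a real-time
    system would keep them in the input buffer until more data arrives.
    """
    length = min(len(ch) for ch in channels)
    for start in range(0, length - n + 1, n):
        yield [ch[start:start + n] for ch in channels]


# Usage: two channels of 10000 samples produce two full blocks of 4096 samples each.
left = [float(i) for i in range(10000)]
right = [-float(i) for i in range(10000)]
blocks = list(get_blocks([left, right]))
```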

FIG. 5 is discussed in connection with a practical embodiment of aspects of the invention in which the input data for each audio channel is data compression or data expansion processed in blocks of 4096 samples, which corresponds to about 93 msec of input audio at a sampling rate of 44.1 kHz. It will be understood that the aspects of the invention are not limited to such an embodiment. As noted above, the principles of the various aspects of the invention do not require arranging the audio into sample blocks, nor, if blocks are used, do they require blocks of constant length. However, to minimize complexity, a fixed block length of 4096 samples is useful for three primary reasons. First, it provides latency low enough to be acceptable for real-time processing applications. Second, it is a power-of-two number of samples, which is useful for fast Fourier transform (FFT) analysis. Third, it provides a suitably large window size for performing useful psychoacoustic analysis of the input signal.

In the following discussion, the input signal is assumed to be data with amplitude values in the range [-1, +1].

Psychoacoustic Analysis 206 (FIG. 5)

Following input data blocking, psychoacoustic analysis 206 ("Perform psychoacoustic analysis on each block of input data") is performed on the block of input data for each channel. In the case of multiple channels, the psychoacoustic analysis 206 and subsequent steps may be performed in parallel for all channels or sequentially, channel by channel (while providing appropriate storage of each channel's data and of the analysis of each). Although parallel processing requires greater processing power, it may be preferred for real-time applications. The description of FIG. 5 assumes that the channels are processed in parallel.

Further details of step 206 are shown in FIG. 6. Analysis 206 identifies one or more regions in the block of data for each channel that satisfy a psychoacoustic criterion (or, for some signal conditions, it may identify no such regions in a block), and it also determines a potential or provisional splice point location within each identified region. If there is only one channel, subsequent step 210 ("Select common splice point") is skipped and the provisional splice point location from one of the regions identified in step 206 is used (preferably, the "best" region in the block is chosen in accordance with a hierarchy of criteria). For the multichannel case, step 210 re-examines the identified regions, identifies common overlapping regions, and chooses the best common splice point location in such common overlapping regions. That splice point may be, but is not necessarily, a provisional splice point location identified in psychoacoustic analysis step 206.

The use of psychoacoustic analysis to minimize audible artifacts in the time and/or pitch scaling of audio is an aspect of the present invention. Psychoacoustic analysis may include applying one or more of the four criteria described above, or other psychoacoustic criteria, that identify segments of audio in which artifacts arising from splicing waveforms, or from otherwise performing time and/or pitch scaling, would be suppressed or minimized.

In the FIG. 5 process described herein, there may be multiple psychoacoustically identified regions in a block, each having a provisional splice point. Nevertheless, in an alternative embodiment, a maximum of one psychoacoustically identified region in each block of input data is selected for data compression or data expansion processing in the single-channel case, and a maximum of one overlap of psychoacoustically identified regions in each set of time-concurrent blocks of input data (one block per channel) is selected for data compression or expansion processing in the multichannel case. Preferably, the psychoacoustically "best" identified region or overlap of identified regions (for example, in accordance with a hierarchy such as the one described herein) is selected when there are multiple identified regions, or multiple overlaps of identified regions, in the block or blocks of input data, respectively.

Alternatively, more than one identified region or overlap of identified regions may be selected for processing in each block or set of time-concurrent blocks of input data, respectively. In that case, those selected are preferably the psychoacoustically best ones (for example, according to a hierarchy such as the one described herein), or, alternatively, every identified event may be selected.

Instead of placing a provisional splice point in every identified region, in the single-channel case the splice point (in this case it is not "provisional"; it is the actual splice point) may be placed in an identified region after the region is selected for processing. In the multichannel case, provisional splice points may be placed in identified regions only after the regions are determined to be overlapping.

In principle, the identification of provisional splice points is unnecessary when there are multiple channels, inasmuch as it is preferred to select a common splice point in an overlapping region, and that common splice point is typically different from each of the provisional splice points in the individual channels. However, as an implementation detail, the identification of provisional splice points is useful because it permits operation with either a single channel, which requires a provisional splice point (it becomes the actual splice point), or multiple channels, in which case the provisional splice points may be ignored.

FIG. 6 is a flowchart of the operation of psychoacoustic analysis process 206 of FIG. 5. Psychoacoustic analysis process 206 is composed of five general processing substeps. The first four are psychoacoustic criteria analysis substeps arranged in a hierarchy, such that an audio region satisfying the first substep or first criterion has the greatest likelihood that a splice (or other time-shifting or pitch-shifting processing) within the region will be inaudible or minimally audible, with subsequent criteria having progressively less likelihood that a splice within the region will be inaudible or minimally audible.

The psychoacoustic criteria analysis of each substep may employ a psychoacoustic subblock having a size that is 1/64 of the size of the input data block. In this example, the psychoacoustic subblocks are approximately 1.5 msec (or 64 samples at 44.1 kHz), as shown in FIG. 8. The size of the psychoacoustic subblocks need not be 1.5 msec; this size was chosen for a practical implementation because it provides a good trade-off between real-time processing requirements (larger subblock sizes require less psychoacoustic processing overhead) and the resolution of the segments satisfying a psychoacoustic criterion (smaller subblocks provide more detailed information on the location of such segments). In principle, the psychoacoustic subblock size need not be the same for each type of psychoacoustic criteria analysis, but in practical embodiments this is preferred for ease of implementation.
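
By way of illustration only, the block and subblock sizes stated above may be checked with the following short Python sketch. The constant and variable names are illustrative assumptions and do not appear in the specification.

```python
# Sizes taken from the text: 4096-sample blocks at 44.1 kHz, 64 subblocks per block.
SAMPLE_RATE_HZ = 44_100
BLOCK_SAMPLES = 4096
SUBBLOCKS_PER_BLOCK = 64

# Each subblock is 1/64 of the block.
subblock_samples = BLOCK_SAMPLES // SUBBLOCKS_PER_BLOCK

# Durations in milliseconds.
block_ms = 1000.0 * BLOCK_SAMPLES / SAMPLE_RATE_HZ
subblock_ms = 1000.0 * subblock_samples / SAMPLE_RATE_HZ

print(f"{subblock_samples} samples per subblock")
print(f"block is about {block_ms:.1f} msec, subblock about {subblock_ms:.2f} msec")
```

The subblock duration of roughly 1.45 msec is what the text rounds to "approximately 1.5 msec".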

Transient Detection 206-1 (FIG. 6)

Process 206-1 analyzes the data block for each channel and determines the location of audio signal transients, if any. The temporal transient information is used in the masking analysis and in selecting the location of a provisional splice point (the final substep in the psychoacoustic analysis process of this example). As discussed above, transients are known to introduce temporal masking (hiding audio information both before and after the occurrence of the transient).

As shown in the flowchart of FIG. 7, the first sub-substep 206-1a ("High-pass filter input full-bandwidth audio") in transient detection substep 206-1 is to filter the input data block (treating the block contents as a time function). The input block data is high-pass filtered, for example with a second-order IIR high-pass filter having a 3 dB cutoff frequency of approximately 8 kHz. The cutoff frequency and filter characteristics are not critical. The filtered data, along with the original unfiltered data, is then used in the transient analysis. The use of both full-bandwidth and high-pass filtered data enhances the ability to identify transients even in complex material, such as music. The "full-bandwidth" data may itself be band-limited, for example by filtering out very high and very low frequencies. The data may also be high-pass filtered by one or more additional filters having other cutoff frequencies. High-frequency transient components of a signal may have amplitudes well below those of stronger low-frequency components and yet still be audible to a listener. Filtering the input data isolates the high-frequency transients and makes them easier to identify.
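
As a non-limiting sketch, a second-order IIR high-pass filter of the kind mentioned above might be written in Python as follows. The coefficient formulas are the widely used "Audio EQ Cookbook" biquad design with Q = 1/sqrt(2) (a Butterworth response, which places the 3 dB point at the cutoff); that design choice and all function names are assumptions, since the specification states that the filter characteristics are not critical.

```python
import math


def highpass_biquad_coeffs(cutoff_hz, sample_rate_hz, q=1 / math.sqrt(2)):
    """Return normalized (b, a) coefficients of a second-order high-pass filter."""
    w0 = 2 * math.pi * cutoff_hz / sample_rate_hz
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b = [(1 + cos_w0) / 2 / a0, -(1 + cos_w0) / a0, (1 + cos_w0) / 2 / a0]
    a = [1.0, -2 * cos_w0 / a0, (1 - alpha) / a0]
    return b, a


def biquad_filter(samples, b, a):
    """Apply the filter to a block, treating the block contents as a time function."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        # Direct form I difference equation.
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


# Hypothetical usage: produce the filtered block used alongside the unfiltered one.
b, a = highpass_biquad_coeffs(8000.0, 44100)
full_band_block = [math.sin(2 * math.pi * 450 * n / 44100) for n in range(4096)]
filtered_block = biquad_filter(full_band_block, b, a)
```

Both `full_band_block` and `filtered_block` would then be passed to the subsequent peak-location stage.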

In the next sub-substep 206-1b ("Locate maximum absolute value samples in full-bandwidth and filtered audio subblocks"), both the full-range and the filtered input blocks are processed in subblocks of approximately 1.5 msec (or 64 samples at 44.1 kHz), as illustrated in FIG. 8, in order to locate the maximum absolute value sample in each full-bandwidth and filtered audio subblock.

The third sub-substep 206-1c of transient detection substep 206-1 ("Smooth full-bandwidth and filtered peak data with low-pass filter") is to perform low-pass filtering or leaky averaging of the maximum absolute data values contained in each 64-sample subblock (treating the data values as a time function). This processing smooths the maximum absolute data and provides a general indication of the average peak values in the input block, against which the actual subblock maximum absolute data values can be compared.

The fourth sub-substep 206-1d of transient detection processing 206-1 ("Compare scaled peak absolute value of each full-bandwidth and filtered subblock to smoothed data") compares the peak in each subblock to the corresponding number in the array of smoothed, moving-average peak values to determine whether a transient exists. While a number of methods exist for comparing these two measures, the approach set out below allows the comparison to be tuned by means of a scaling factor that has been set to perform optimally, as determined by analyzing a wide range of audio signals.

In decision sub-substep 206-1e ("Scaled data > smoothed moving-average peak value?"), the peak value in the kth subblock is multiplied by a scaling value and compared to the kth value of the computed smoothed, moving-average peak values. If a subblock's scaled peak value is greater than the moving-average value, a transient is flagged as being present. The presence and location of the transient within the subblock are stored for follow-on processing. This operation is performed on both the unfiltered and the filtered data. A subblock flagged as a transient, or a string of contiguous subblocks flagged as a transient, indicates the presence and location of a transient. This information is employed in other portions of the process to indicate, for example, where premasking and postmasking are provided by the transient and where data compression or expansion should be avoided in order to keep from disturbing the transient (see, for example, substep 310 of FIG. 6).
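
The peak-location, smoothing, and comparison sub-substeps described above may be sketched in Python as follows. This is an illustrative sketch only: the leak coefficient (0.9), the scaling value (0.5), the choice to initialize the leaky average with the first peak, and all function names are assumptions, as the specification gives no numerical values for them.

```python
def subblock_peaks(block, subblock_len=64):
    """Maximum absolute sample value in each subblock (sub-substep 206-1b)."""
    return [
        max(abs(s) for s in block[i:i + subblock_len])
        for i in range(0, len(block), subblock_len)
    ]


def leaky_average(values, leak=0.9):
    """Leaky averaging of the peak values (sub-substep 206-1c).

    The state starts at the first peak (an assumption) so that the first
    subblock is not flagged merely because the average starts at zero.
    """
    state = values[0] if values else 0.0
    smoothed = []
    for v in values:
        state = leak * state + (1.0 - leak) * v
        smoothed.append(state)
    return smoothed


def flag_transients(peaks, scale=0.5, leak=0.9):
    """Flag subblock k when scale * peak[k] exceeds the smoothed value at k."""
    smoothed = leaky_average(peaks, leak)
    return [scale * p > s for p, s in zip(peaks, smoothed)]


# Hypothetical usage: a quiet block with one loud subblock at index 32.
peaks = [0.05] * 32 + [0.9] + [0.05] * 31
flagged = [k for k, f in enumerate(flag_transients(peaks)) if f]
print(flagged)
```

In keeping with the text, the same functions would be run once on the unfiltered block and once on the high-pass filtered block.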

Following transient detection, several corrective checks are made in sub-substep 206-1f ("Perform corrective checks to cancel transients") to determine whether the transient flag for a 64-sample subblock should be cancelled (reset from TRUE to FALSE). These checks are performed to reduce false transient detections. First, if either the full-range or the high-frequency peak value falls below a minimum peak value, the transient is cancelled (to eliminate low-level transients that would provide little or no temporal masking). Second, if the peak in a subblock triggers a transient but is not significantly larger than that of the previous subblock, which also triggered a transient flag, the transient in the current subblock is cancelled. This reduces smearing of the information on the location of a transient. For each audio channel, the number of transients and their locations are stored for later use in the psychoacoustic analysis step.
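
The two corrective checks may be illustrated by the following Python sketch. The minimum peak value (0.01), the ratio used to decide whether a peak is "significantly larger" (1.5), and the function name are hypothetical, since the specification does not state them.

```python
def cancel_false_transients(flags, peaks, min_peak=0.01, growth_ratio=1.5):
    """Return a copy of flags with false transient detections reset to False."""
    cleaned = list(flags)
    for k, flagged in enumerate(flags):
        if not flagged:
            continue
        if peaks[k] < min_peak:
            # Check 1: a low-level transient provides little or no temporal masking.
            cleaned[k] = False
        elif k > 0 and flags[k - 1] and peaks[k] < growth_ratio * peaks[k - 1]:
            # Check 2: not significantly larger than the previous flagged subblock,
            # so keeping it would only smear the transient location.
            cleaned[k] = False
    return cleaned


# Hypothetical usage with five subblocks.
flags = [False, True, True, True, False]
peaks = [0.02, 0.5, 0.55, 0.005, 0.02]
print(cancel_false_transients(flags, peaks))
```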

The invention is not limited to the particular transient detection just described. Other suitable transient detection schemes may be employed.

Hearing Threshold Analysis 206-2 (FIG. 6)

Referring again to FIG. 6, the second substep 206-2 of the psychoacoustic analysis process, the hearing threshold analysis, determines the location and duration of audio segments that have low enough signal strength that they can be expected to be at or below the hearing threshold. As discussed above, these audio segments are of interest because artifacts introduced by time scaling and pitch shifting are less likely to be audible in such regions.

As discussed above, the threshold of hearing is a function of frequency (low and high frequencies are less audible than middle frequencies). To minimize processing for real-time applications, the hearing threshold model used for the analysis may assume a uniform threshold of hearing (in which the threshold of hearing in the most sensitive range of frequencies is applied to all frequencies). This conservative assumption makes allowance for a listener turning up the playback volume louder than is assumed by the hearing sensitivity curve, and it reduces the requirement of performing frequency-dependent processing on the input data prior to low-energy processing.

The hearing threshold analysis step processes unfiltered audio, also processes the input in subblocks of approximately 1.5 msec (64 samples for 44.1 kHz input data), and uses the same smoothed, moving-average calculation described above. Following this calculation, the smoothed, moving-average value for each subblock is compared to a threshold value to determine whether the subblock is flagged as an inaudible subblock. The location and duration of each below-hearing-threshold segment in the input block are stored for later use in this analysis step. A string of contiguous flagged subblocks of sufficient length may constitute an identified region satisfying the below-hearing-threshold psychoacoustic criterion. A minimum length (time period) may be set in order to ensure that the identified region is long enough to be a useful location for a splice point, or for both a splice point and an end point. If only one region is to be identified in the input block, it is useful to identify only the longest contiguous string of flagged subblocks.
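
The flagging of below-threshold subblocks and the search for the longest contiguous string of flagged subblocks may be sketched as follows. The threshold value used in the example and the function names are assumptions made for illustration; the specification does not give a numerical threshold here.

```python
def below_threshold_flags(smoothed_peaks, threshold):
    """Flag each subblock whose smoothed moving-average value is under the threshold."""
    return [value < threshold for value in smoothed_peaks]


def longest_flagged_run(flags, min_subblocks=1):
    """Return (start_index, length) of the longest contiguous flagged run.

    Runs shorter than min_subblocks are ignored; None means no usable region.
    """
    best = None
    start = None
    # A trailing False closes a run that reaches the end of the block.
    for k, flagged in enumerate(list(flags) + [False]):
        if flagged and start is None:
            start = k
        elif not flagged and start is not None:
            length = k - start
            if length >= min_subblocks and (best is None or length > best[1]):
                best = (start, length)
            start = None
    return best


# Hypothetical usage: nine subblocks, with an assumed threshold of 1e-4.
smoothed = [0.2, 1e-5, 2e-5, 0.3, 1e-5, 1e-5, 1e-5, 1e-5, 0.5]
flags = below_threshold_flags(smoothed, threshold=1e-4)
print(longest_flagged_run(flags))
```

The same run-finding helper could equally serve the high-frequency analysis, which also looks for a sufficiently long string of flagged subblocks.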

High-Frequency Analysis 206-3 (FIG. 6)

The third substep 206-3, the high-frequency analysis step, determines the location and length of audio segments that contain predominantly high-frequency audio content. High-frequency segments, above approximately 10-12 kHz, are of interest in the psychoacoustic analysis because the threshold of hearing in quiet increases rapidly above approximately 10-12 kHz and because the ear is less sensitive to discontinuities in a predominantly high-frequency waveform than to discontinuities in a predominantly low-frequency waveform. While there are many methods available for determining whether an audio signal consists mostly of high-frequency energy, the method described here provides good detection results and minimizes computational requirements. Nevertheless, other methods may be employed. The method described does not categorize a region as high-frequency if it contains both strong low-frequency content and high-frequency content. This is because low-frequency content is more likely to generate audible artifacts when data compression or data expansion processing is applied.

The high-frequency analysis step also processes the input block in 64-sample subblocks, and it uses the zero-crossing information of each subblock to determine whether the subblock contains predominantly high-frequency data. The zero-crossing threshold (i.e., how many zero crossings must exist in a subblock before it is labeled a high-frequency audio subblock) is set such that it corresponds to a frequency in the range of approximately 10 to 12 kHz. In other words, a subblock is flagged as containing high-frequency audio content if it contains at least the number of zero crossings corresponding to a signal in the range of about 10 to 12 kHz (a 10 kHz signal has 29 zero crossings in a 64-sample subblock at a 44.1 kHz sampling frequency). As in the case of the hearing threshold analysis, a string of contiguous flagged subblocks of sufficient length may constitute an identified region satisfying the high-frequency content psychoacoustic criterion. A minimum length (time period) may be set in order to ensure that the identified region is long enough to be a useful location for a splice point, or for both a splice point and an end point. If only one region is to be identified in the input block, it is useful to identify only the longest contiguous string of flagged subblocks.
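
The zero-crossing test may be illustrated by the following Python sketch, which also reproduces the figure of 29 zero crossings for a 10 kHz signal (a sinusoid crosses zero twice per period, so the expected count is 2 x 10000 x 64 / 44100, or about 29.02). The function names and the use of pure sine test signals are assumptions for illustration.

```python
import math


def count_zero_crossings(subblock):
    """Count sign changes between consecutive samples (zero is treated as positive)."""
    return sum(1 for a, b in zip(subblock, subblock[1:]) if (a >= 0) != (b >= 0))


def min_crossings_for(freq_hz, subblock_len=64, sample_rate_hz=44100):
    """Zero crossings expected from a sinusoid of freq_hz within one subblock."""
    return round(2 * freq_hz * subblock_len / sample_rate_hz)


def is_high_frequency(subblock, threshold=min_crossings_for(10_000.0)):
    """Flag the subblock when it has at least the threshold number of crossings."""
    return count_zero_crossings(subblock) >= threshold


def sine_subblock(freq_hz, subblock_len=64, sample_rate_hz=44100):
    """Hypothetical test signal: one subblock of a unit-amplitude sine."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(subblock_len)]


print(min_crossings_for(10_000.0))
print(is_high_frequency(sine_subblock(12_000.0)), is_high_frequency(sine_subblock(1_000.0)))
```

A plain zero-crossing count is attractive here because strong low-frequency content lowers the count, so a subblock containing both strong low-frequency and high-frequency content tends not to be flagged, consistent with the behavior described above.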

Audio Level Analysis 206-4 (FIG. 6)

The fourth substep 206-4 of the psychoacoustic analysis process, the audio data block level analysis, analyzes the input data block and determines the location of the audio segments of lowest signal strength (amplitude) in the data block. The audio level analysis information is used if the current input block contains no psychoacoustic masking events that can be exploited during processing (for example, if the input is a steady-state signal that contains no transients and no audio segments below the hearing threshold). In this case, the time-scaling processing preferably favors the lowest-level or quietest segments of the input block's audio (if there are any such segments), on the rationale that lower-level segments of audio result in low-level or inaudible splicing artifacts. A simple example using a 450 Hz tone (sine wave) is shown in FIG. 9. The tonal signal shown in FIG. 9 contains no transients, no below-hearing-threshold content, and no high-frequency content. However, the middle portion of the signal is 6 dB lower in level than the beginning and ending sections of the signal in the block. It is believed that focusing attention on the quieter middle section, rather than on the louder end sections, minimizes audible data compression or data expansion processing artifacts.

While the input audio block could be separated into any number of audio level segments of varying lengths, it has been found suitable to divide the block into three equal parts, so that the audio data block level analysis is performed over the first, second, and final thirds of the signal in each block, seeking one part, or two contiguous parts, that are quieter than the remaining part(s). Alternatively, in a manner analogous to the subblock analysis of blocks for the below-hearing-threshold and high-frequency criteria, the subblocks may be ranked according to their peak level, with the longest contiguous string of the quietest subblocks constituting the quietest portion of the block. In either case, this substep provides as an output an identified region satisfying the quietest-region psychoacoustic criterion. Except in an unusual signal condition, such as a constant-amplitude signal throughout the block under analysis, this last psychoacoustic analysis, general audio level, will always provide a "last resort" identified region. As in the case of the substeps just described, a minimum length (time period) may be set in order to ensure that the identified region is long enough to be a useful location for a splice point, or for both a splice point and an end point.
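
A simplified Python sketch of the division into thirds is given below. It returns only the single quietest third, judged by peak absolute level; the possibility of selecting two contiguous quiet thirds is omitted for brevity, and the function name and the use of peak level as the measure are assumptions.

```python
def quietest_third(block):
    """Return (index, (start, end)) of the third of the block with the lowest peak."""
    n = len(block)
    bounds = [0, n // 3, 2 * n // 3, n]
    peaks = [
        max(abs(s) for s in block[bounds[i]:bounds[i + 1]])
        for i in range(3)
    ]
    index = min(range(3), key=peaks.__getitem__)
    return index, (bounds[index], bounds[index + 1])


# Hypothetical usage modeled on FIG. 9: the middle section is about 6 dB lower
# (half the amplitude, since 20*log10(0.5) is about -6.02 dB).
block = [0.5] * 1365 + [0.25] * 1365 + [0.5] * 1366
print(quietest_third(block))
```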

Setting Provisional Splice Point and Crossfade Parameters 206-5 (FIG. 6)

The final substep 206-5 ("Set provisional splice point and crossfade parameters") in the psychoacoustic analysis process of FIG. 6 uses the information gathered from the previous steps to select the psychoacoustically best identified region in the input block and to set the splice point and the crossfade length within that identified region.

Setting Crossfade Parameters

As mentioned above, crossfading is used to minimize audible artifacts. FIG. 10 illustrates conceptually how crossfading is applied. The resulting crossfade straddles the splice point where the waveforms are joined together. In FIG. 10, the dashed line starting before the splice point shows a nonlinear downward fade from a maximum to a minimum amplitude applied to the signal waveform, being half way down at the splice point. The fade across the splice point extends from time t1 to t2. The dashed line starting before the end point shows a complementary nonlinear upward fade from a minimum to a maximum amplitude applied to the signal waveform, being half way up at the end point. The fade across the end point extends from time t3 to t4. The fade up and the fade down are symmetrical and sum to unity (Hanning and Kaiser-Bessel windows have that property; thus, if the crossfades are shaped in the manner of such windows, this requirement will be satisfied). The time duration from t1 to t2 is the same as that from t3 to t4. In this time compression example, it is desired to discard the data between the splice point and the end point (shown hatched). This is accomplished by discarding the data between the sample representing t2 and the sample representing t3. The splice point and end point are then (conceptually) placed on top of each other so that the data from t1 to t2 and from t3 to t4 sum together, resulting in a crossfade composed of the complementary fade-up and fade-down characteristics.
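
The time compression splice with complementary fades described above may be sketched in Python as follows. This is a minimal sketch under stated assumptions: the fades are raised-cosine (Hanning-shaped) curves, the splice point and end point are taken as the centers of the intervals t1 to t2 and t3 to t4, and the function names are hypothetical.

```python
import math


def hann_fades(n):
    """Return (fade_down, fade_up), complementary curves that sum to one."""
    down = [0.5 * (1.0 + math.cos(math.pi * (i + 0.5) / n)) for i in range(n)]
    up = [1.0 - d for d in down]
    return down, up


def compress_with_crossfade(block, splice, end, fade_len):
    """Remove (end - splice) samples, crossfading across the join."""
    half = fade_len // 2
    t1, t3 = splice - half, end - half
    t2, t4 = t1 + fade_len, t3 + fade_len
    if not (0 <= t1 and t2 <= t3 and t4 <= len(block)):
        raise ValueError("crossfade regions must fit in the block without overlapping")
    down, up = hann_fades(fade_len)
    # Data from t1..t2 fades out while data from t3..t4 fades in; t2..t3 is discarded.
    mixed = [block[t1 + i] * down[i] + block[t3 + i] * up[i] for i in range(fade_len)]
    return block[:t1] + mixed + block[t4:]


# Hypothetical usage: a constant signal stays constant because the fades sum to one.
out = compress_with_crossfade([1.0] * 1000, splice=300, end=600, fade_len=100)
print(len(out))
```

In this sketch the output is shorter than the input by exactly the distance between the splice point and the end point, which is the amount of audio removed by the time compression.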

In general, longer crossfades mask the audible artifacts of a splice better than shorter crossfades. However, the length of a crossfade is limited by the fixed size of the input data block. Longer crossfades also reduce the amount of data that can be used for time scaling processing. This is because the crossfades are limited by the block boundaries (and/or by the auditory event boundaries, when auditory events are taken into consideration), and data before and after the current data block (and/or the current auditory event, when auditory events are taken into consideration) may not be available for use in data compression or data expansion processing and crossfading. However, the masking properties of transients can be used to shorten the length of the crossfade, because some or all of the audible artifacts resulting from a shorter crossfade are masked by the transient.

Although the crossfade length may be varied in response to audio content, a suitable default crossfade length is 10 msec, because it introduces minimal audible splicing artifacts for a wide range of material. Transient postmasking and premasking may allow the crossfade length to be set somewhat shorter, for example 5 msec. However, when auditory events are taken into account, crossfades longer than 10 msec may be employed under certain conditions.

Setting Provisional Splice Point

If there are one or more transients, as determined by substep 206-1 of FIG. 6, the provisional splice point preferably is located in the block within the temporal masking region before or after the transient, depending on the transient location in the block and on whether time expansion or compression processing is being performed, so as to avoid repeating or smearing the transient (i.e., preferably, no portion of the transient should be within the crossfade). The transient information is also used to determine the crossfade length. If more than one transient is present, such that there is more than one usable temporal masking region, the best masking region (taking into account, for example, its location in the block, its length, and its strength) may be chosen as the identified region in which the provisional splice point is placed.

If no signal transients are present, substep 206-5, which sets the provisional splice point and crossfade parameters, analyzes the results of the hearing-threshold segment, high-frequency and audio-level analyses of substeps 206-2, 206-3 and 206-4 in search of a psychoacoustically identified region in which to locate a provisional splice point. If one or more low-level segments at or below the hearing threshold exist, a provisional splice point is set within one such segment or within the best such segment (taking into account, for example, its location within the block and its length). If no below-hearing-threshold segments are present, the step searches for high-frequency segments in the data block and sets a provisional splice point within one such segment or the best such segment, taking into account, for example, its location within the block and its length. If no high-frequency segments are found, the step then searches for any low-level audio segments and sets a provisional splice point within one such segment or the best such segment (taking into account, for example, its location within the block and its length). Consequently, there is only one identified region in which a provisional splice point is located in each input block. As noted above, in rare cases there are no segments in a block that satisfy a psychoacoustic criterion, in which case there is no provisional splice point in the block.
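The fallback order described above can be sketched as a simple priority search. This is only an illustration: the criterion labels, the dictionary layout and the reduction of "best" to "longest" are assumptions made for the example, not details taken from the patent.

```python
# Ranking of psychoacoustic criteria, best first. The labels are illustrative
# names for the criteria in the text, not identifiers from the patent.
HIERARCHY = ["transient_masking", "below_hearing_threshold",
             "high_frequency", "low_level"]

def pick_identified_region(regions):
    """Return the best region of the highest-ranking criterion present.

    `regions` is a list of dicts with keys "criterion", "start", "length"
    (sample units). Returns None when no region meets any criterion, in
    which case the block gets no provisional splice point.
    """
    for criterion in HIERARCHY:
        candidates = [r for r in regions if r["criterion"] == criterion]
        if candidates:
            # "Best" is reduced here to "longest"; the text also allows
            # position in the block (and masking strength) to be weighed.
            return max(candidates, key=lambda r: r["length"])
    return None

# Hypothetical block with no transient and no below-threshold segment.
block_regions = [
    {"criterion": "low_level", "start": 0, "length": 900},
    {"criterion": "high_frequency", "start": 1200, "length": 400},
    {"criterion": "high_frequency", "start": 2500, "length": 700},
]
chosen = pick_identified_region(block_regions)
```

Here the longer of the two high-frequency segments is chosen, and the low-level segment is never consulted because a higher-ranking criterion was satisfied.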

Alternatively, as mentioned above prior to the discussion of the psychoacoustic analysis, instead of selecting only one region in each input block that satisfies a psychoacoustic criterion and (optionally) locating a provisional splice point in that identified region, more than one region that satisfies a psychoacoustic criterion may be selected and (optionally) a provisional splice point located in each of them. There are several ways to accomplish this. For example, even if a region is identified that satisfies one of the higher-ranking psychoacoustic criteria and a provisional splice point is (optionally) located in it, one or more additional identified regions in the particular input block having a lower ranking in the psychoacoustic hierarchy may be chosen and a provisional splice point located in each of them. Another way is that, if multiple regions satisfying the same psychoacoustic criterion are found in a particular block, more than one of those regions may be selected (and a provisional splice point located in each), provided that each such additional identified region is usable (taking into account, for example, its length and position in the block). Yet another way is to select every identified region, whether or not there are other identified regions in that subblock and regardless of which psychoacoustic criterion the identified region satisfies, and, optionally, to locate a provisional splice point in each. Multiple identified regions in each block are useful for finding a common splice point among multiple channels, as described further below.

Thus, the psychoacoustic analysis process of FIG. 6 (step 206 of FIG. 5) identifies regions within the input block according to the psychoacoustic criteria and, within each of those regions, (optionally) locates a provisional splice point. The process also provides an identification of the criterion used to identify each provisional splice point (whether, for example, masking as a result of a transient, hearing threshold, high frequency, or lowest audio level) and the number and locations of the transients in each input block, all of which are useful in determining a common splice point when there are multiple channels and for other purposes, as described further below.

Selecting a Common Multichannel Splice Point 210 (FIG. 5)

As mentioned above, the psychoacoustic analysis process of FIG. 6 is applied to every channel's input block. Referring again to FIG. 5, if more than one audio channel is being processed, as determined by decision step 208 ("Number of channels > 1?"), the provisional splice points, if placed as an option in step 206, are unlikely to coincide across the multiple channels (for example, some or all channels may contain audio content unrelated to the other channels). The next step 210 ("Select common splice point") uses the information provided by psychoacoustic analysis step 206 to identify overlapping identified regions in the multiple channels, so that a common splice point can be selected in each of the time-concurrent blocks across the multiple channels.

Although, as an alternative, a common splice point, such as the best overall splice point, may be selected from among the one or more provisional splice points in each channel optionally determined by step 206 of FIG. 5, it is preferred to choose a potentially more optimal common splice point within the identified regions that overlap across the channels; that splice point may differ from all of the provisional splice points determined by step 206 of FIG. 5.

In effect, the identified regions of each channel are ANDed together to yield a common overlapped segment. Note that in some cases there is no common overlapped segment, and in others, when the alternative of identifying more than one psychoacoustic region in a block is employed, there may be more than one common overlapped segment. The identified regions of the different channels need not coincide precisely; it is sufficient that they overlap, so that a common splice point location can be chosen among the channels that lies within an identified region in every channel. The multichannel splice processing selection step selects only a common splice point for the channels and does not modify or alter the position or content of the data itself.
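The ANDing of identified regions amounts to intersecting sample ranges across channels. The following is a minimal sketch under assumed conventions: regions are half-open `(start, end)` sample ranges, each channel's list is sorted, and the block length of 4096 samples and the region boundaries are hypothetical values chosen for the example.

```python
def intersect(a, b):
    """Intersect two sorted lists of half-open (start, end) sample ranges."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever range finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def common_overlap(channels):
    """AND the identified regions of every channel together."""
    common = channels[0]
    for regions in channels[1:]:
        common = intersect(common, regions)
    return common

# Hypothetical identified regions for four channels of one 4096-sample block.
channels = [
    [(1500, 4096)],   # channel 1: postmasking after a transient
    [(2048, 4096)],   # channel 2: quieter second half of the block
    [(1800, 4096)],   # channel 3: postmasking after a transient
    [(0, 3300)],      # channel 4: below the hearing threshold
]
overlap = common_overlap(channels)   # [(2048, 3300)]
```

An empty result corresponds to the case in which no common overlapped segment exists; a result with several ranges corresponds to the case of more than one common overlapped segment.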

A ranking of the overlapped regions, for example according to the hierarchy of psychoacoustic criteria, may be used to choose one or more best overlapped regions for processing when there are multiple overlapped regions. Although the identified regions of different channels need not result from the same psychoacoustic criterion, the distribution of criterion types among the channels affects the quality of the overlapped region (the highest quality being the one in which processing is least audible). The quality of an overlapped region is ranked by taking into account the psychoacoustic criterion satisfied in each channel. For example, an overlapped region in which the identified region in every channel satisfies the "postmasking as a result of a transient" criterion is ranked highest. An overlapped region in which every channel but one satisfies the "postmasking as a result of a transient" criterion and the remaining channel satisfies the "below hearing threshold" criterion is ranked next. The details of the ranking scheme are not critical.

Alternatively, a common region across multiple channels may be selected for processing even if the psychoacoustically identified regions overlap in only some, but not all, of the channels. In that case, the failure to satisfy a psychoacoustic criterion in one or more channels should be likely to cause only minimally objectionable audible artifacts. For example, cross-channel masking may mean that some channels need not have a common overlapping identified region: a masking signal in another channel may make it acceptable to perform a splice in a region in which a splice would not be acceptable if the channel were heard in isolation.

A further variation on selecting a common splice point is to select the provisional splice point of one of the channels as the common splice point, based on determining which of the individual provisional splice points would cause the least objectionable artifacts if it were the common splice point.

Skipping

As part of step 210 (FIG. 5), the ranking of an overlapped region may also be used to determine whether processing within a particular overlapped region should be skipped. For example, an overlapped region in which all of the identified regions satisfy only the lowest-ranking criterion, the "quietest portion" criterion, may be skipped. In certain cases it is not possible to identify a common overlap of identified regions among the channels for a particular set of time-concurrent input blocks, in which case a skip flag is set for that set of blocks as part of step 210. There are also other factors for setting a skip flag. For example, if there are multiple transients in one or more channels, so that there is insufficient space for data compression or data expansion processing without deleting or repeating a transient, or if there is otherwise insufficient space for processing, a skip flag is set.

It is preferred that a common splice point (and a common end point) among the time-concurrent blocks be selected when deleting or repeating audio segments, in order to maintain phase alignment among the multiple channels. This is particularly important for two-channel processing, for which psychoacoustic studies suggest that shifts in the stereo image can be perceived with as little as 10 µs (microseconds) of difference between the two channels, which corresponds to less than one sample at a sampling rate of 44.1 kHz. Phase alignment is also important in the case of surround-encoded material. The phase relationship of surround-encoded stereo channels should be maintained or the decoded signal will be degraded.
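The "less than one sample" figure follows directly from the two numbers given above. At a 44.1 kHz sampling rate, the sample period and the 10 µs interchannel difference compare as:

$$
T_s = \frac{1}{44\,100\ \text{Hz}} \approx 22.7\ \mu\text{s},
\qquad
\frac{10\ \mu\text{s}}{T_s} = 10 \times 10^{-6}\ \text{s} \times 44\,100\ \text{s}^{-1} \approx 0.44\ \text{samples}.
$$

A misalignment of even a single sample between two channels therefore already exceeds the 10 µs difference cited above.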

Nevertheless, in some cases it may be feasible to process multichannel data such that not all channels are perfectly sample-aligned (i.e., to process the channels with unaligned, independent splice point and end point locations for at least some of the channels). For example, it may be useful to align the splice points and end points of the L, C and R (left, center and right) channels (for cinema or DVD signals) and to process separately aligned LS and RS (left surround and right surround) channels. Information can be shared among the processing steps of the FIG. 5 process so that slight phase discrepancies in processing can be adjusted on a block-to-block basis to minimize the differences.

Example of Multichannel Splice Point Selection

FIG. 11 shows details of the multichannel splice point selection analysis step 210 of FIG. 5. The first processing step 210-1 ("Analyze the block for each channel to locate psychoacoustically identified regions") analyzes the input block for each channel to locate the regions that were identified using psychoacoustic analysis, as described above. Processing step 210-2 ("Group overlapping identified regions") groups the overlapping portions of the identified regions (it ANDs together the identified regions across the channels). Next, processing step 210-3 ("Choose common splice point based on prioritized overlapping identified regions...") chooses a common splice point among the channels. In the case of multiple overlapping identified regions, the hierarchy of the criteria associated with each of the overlapping identified regions may be used to rank the overlaps of identified regions, preferably in accordance with the psychoacoustic hierarchy, as mentioned above. Cross-channel masking effects may also be taken into account in ranking multiple overlaps of identified regions. Step 210-3 also takes into account whether there are multiple transients in each channel, the proximity of the transients to one another, and whether time compression or expansion is being performed. The type of processing (compression or expansion) is also important in that it determines whether the end point is located before or after the splice point (as explained in connection with FIGS. 2A-D).

FIG. 12 shows an example of selecting a common multichannel splice point for the case of time scale compression, using the regions identified in the psychoacoustic processing of the individual channels as being appropriate for performing data compression or data expansion processing. Channels 1 and 3 in FIG. 12 both contain transients that provide a significant amount of temporal postmasking, as shown in the diagram. The audio in Channel 2 in FIG. 12 contains a quieter portion that may be exploited for data compression or data expansion processing, located in roughly the second half of the audio block for Channel 2. The audio in Channel 4 contains a portion that is below the hearing threshold, located in roughly the first 3300 samples of the data block. The legend at the bottom of FIG. 12 shows the overlapping identified region, which provides a good overall region in which data compression or data expansion processing can be performed in each of the channels with minimal audibility. The common splice point is preferably located after the start of the common overlapping portion, as shown in FIG. 12, to prevent the crossfade from transitioning between identified regions and to maximize the size of the potential target segment.

Setting End Point Locations

Referring again to FIG. 11, once a common splice point has been identified in step 210-3, processing step 210-4 ("Set minimum and maximum end point locations...") sets the minimum and maximum end point locations according to the time scaling rate (i.e., the desired rate of data compression or expansion) and so as to keep the correlation processing region within the overlapping portion of the identified regions. Alternatively, instead of taking the time scaling rate and the identified region size into account prior to correlation, before the target segment length is known, the minimum and maximum end point locations may be determined by default values, such as the respective 7.5 and 25 msec values mentioned above. Step 210-4 outputs the common multichannel splice point for all channels (illustrated in FIG. 12) along with the minimum and maximum end point locations. Step 210-4 also outputs the crossfade parameter information provided by substep 206-5 (FIG. 6) of step 206 (FIG. 5). The maximum end point location is important when multiple inter-channel or cross-channel transients are present. The splice point is set so that data compression or data expansion processing occurs between transients. In setting the end point location correctly (and thus, ultimately, the target segment length, which is determined by the splice point location, end point location and crossfade length), it is necessary to consider other transients in connection with the data compression or data expansion processing in the same or other channels.
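As a rough illustration of the default-value alternative, the sketch below converts the 7.5 and 25 msec defaults into sample offsets from the splice point and clips the maximum to the end of the common overlapping region. It assumes the compression case (end point after the splice point) and a 44.1 kHz sampling rate; the function name, the clipping rule and the `None` return for "no room" are assumptions of the example rather than details from the patent.

```python
def end_point_limits(splice, overlap_end, sample_rate_hz=44100,
                     min_ms=7.5, max_ms=25.0):
    """Return (min_end, max_end) in samples, or None if there is no room.

    Compression case only: the end point lies after the splice point.
    The maximum is clipped so that the search stays inside the overlap.
    """
    min_end = splice + round(min_ms * sample_rate_hz / 1000.0)
    max_end = min(splice + round(max_ms * sample_rate_hz / 1000.0),
                  overlap_end)
    if max_end < min_end:
        return None   # a caller might set a skip flag in this situation
    return min_end, max_end

# Hypothetical splice point 2100 inside a common overlap ending at 3300.
limits = end_point_limits(2100, 3300)       # (2431, 3202)
no_room = end_point_limits(3200, 3300)      # None
```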

Block Processing Decision 212 (FIG. 5)

Referring again to FIG. 5, the next step in processing is the input block processing decision step 212 ("Skip based on complexity?"). This step checks whether the processing skip flag has been set by step 210. If so, the current block of data is not processed.

Correlation Processing 214 (FIG. 5)

If it is decided that the current input data block is to be processed, then, as shown in correlation step 214 of FIG. 5, two types of correlation processing are provided with respect to each such data block. Correlation processing of the data block's time-domain information is provided by substeps 214-1 ("Weighting") and 214-2 ("Correlation processing of each block's time-domain data"). Correlation processing of the input signal's phase information is provided by substeps 214-3 ("Compute phase of each block") and 214-4 ("Correlation processing of each block's phase data"). Using the combined phase and time-domain information of the input data block provides higher-quality time scaling results, for signals ranging from speech to complex music, than using time-domain information alone. Alternatively, only the time-domain information may be processed and used if reduced performance is deemed acceptable. Details of the correlation processing are set forth below, following an explanation of some underlying principles.

As discussed above and illustrated in FIGS. 2A-D, time scaling according to aspects of the present invention works by discarding or repeating segments of the input blocks. If, in accordance with a first alternative embodiment, the splice point and end point locations are chosen such that, for a given splice point, the end point maximally maintains signal periodicity, audible artifacts will be reduced. An example of well-chosen splice and end point processing locations that maximize periodicity is presented in FIG. 13. The signal shown in FIG. 13 is the time-domain information of a highly periodic portion of a speech signal.

Once a splice point is determined, a method for determining an appropriate end point location is needed. In doing so, it is desirable to weight the audio in a manner that has some relationship to human hearing and then to perform correlation. The correlation of a signal's time-domain amplitude data provides an easy-to-use estimate of the signal's periodicity, which is useful in selecting an end point location. Although the weighting and correlation can be accomplished in the time domain, it is computationally efficient to do so in the frequency domain. A Fast Fourier Transform (FFT) can be used to compute efficiently an estimate of a signal's power spectrum, which is related to the Fourier transform of the signal's correlation. See, for example, "Correlation and Autocorrelation Using the FFT" in Numerical Recipes in C, The Art of Scientific Computing by William H. Press et al., Cambridge University Press, New York.

An appropriate end point location is determined using the correlation data of the input data block's phase and time-domain information. For time compression, the autocorrelation of the audio between the splice point location and the maximum processing point is used (see FIGS. 2A, 3A, 4). The autocorrelation is used because it provides a measure of the periodicity of the data and helps determine how to remove an integral number of cycles of the predominant frequency component of the audio. For time expansion, the cross correlation of the data before and after the splice point location is computed to evaluate the periodicity of the data to be repeated to increase the duration of the audio (see FIGS. 2C, 3B, 4).

The correlation (autocorrelation for time compression or cross correlation for time expansion) is computed beginning at the splice point and terminating at either the maximum processing length as returned by previous processes (where the maximum processing length is the maximum end point location plus half the crossfade length, if there is a crossfade after the end point) or a global maximum processing length (a default maximum processing length).

The frequency-weighted correlation of the time-domain data is computed in substep 214-1 for each input channel data block. The frequency weighting is done to focus the correlation processing on the most sensitive frequency ranges of human hearing and is in lieu of filtering the time-domain data prior to correlation processing. While a number of different weighted loudness curves are available, one suitable curve is a modified B-weighted loudness curve. The modified curve is the standard B-weighted curve computed using the following equation:

The lower frequency components (approximately 97 Hz and below) are set equal to 0.5.
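A sketch of such a modified weighting curve is given below. It assumes the commonly published B-weighting magnitude response with corner frequencies of 20.6 Hz, 158.5 Hz and 12194 Hz; those constants and the function names are assumptions of this example and are not taken from the text. With these constants the unnormalized response falls to about 0.5 near 97 Hz, which is consistent with the cutoff and floor value stated above.

```python
import math

def b_weight(f):
    """Unnormalized B-weighting magnitude at frequency f in Hz.

    Uses the commonly published corner frequencies (an assumption here).
    """
    f2 = f * f
    return (12194.0 ** 2 * f ** 3) / (
        (f2 + 20.6 ** 2) * math.sqrt(f2 + 158.5 ** 2) * (f2 + 12194.0 ** 2))

def modified_b_weight(f, cutoff_hz=97.0, floor=0.5):
    """B-weighting with components at or below the cutoff set to `floor`."""
    return floor if f <= cutoff_hz else b_weight(f)

# Weights for a few example frequencies in Hz.
weights = {f: modified_b_weight(f) for f in (20.0, 97.0, 1000.0, 5000.0)}
```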

Low-frequency signal components, even though they may be inaudible, can generate audible high-frequency artifacts when spliced. Hence, it is desirable to give greater weight to low-frequency components than is given by the standard, unmodified B-weighting curve.

Following the weighting, in process 214-2, the time-domain correlation is computed as follows:

1) form an L-point sequence (L being a power of 2) by augmenting x(n) with zeros,

2) compute the L-point FFT of x(n),

3) multiply the complex FFT result by its own conjugate, and

4) compute the L-point inverse FFT.

where x(n) is the digitized time-domain data contained in the input data block representing the audio samples in the correlation processing region, n denotes the sample number or index, and the length L is a power of two greater than the number of samples in that processing.
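The four steps above can be sketched in plain Python as follows. The small recursive `fft` is a stand-in for a library FFT, and padding to at least twice the number of samples is a choice made in this example so that the circular correlation produced by the FFT equals the linear one for every lag; the text only requires L to be a power of two greater than the number of samples.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(X):
    """Inverse FFT via the conjugation identity."""
    n = len(X)
    return [v.conjugate() / n for v in fft([u.conjugate() for u in X])]

def autocorrelation(x):
    """Zero-pad, FFT, multiply by the conjugate, inverse FFT."""
    L = 1
    while L < 2 * len(x):            # 2x padding keeps lags from wrapping
        L *= 2
    padded = list(x) + [0.0] * (L - len(x))      # step 1
    X = fft(padded)                              # step 2
    power = [v * v.conjugate() for v in X]       # step 3
    r = ifft(power)                              # step 4
    return [v.real for v in r[:len(x)]]          # lags 0 .. len(x) - 1

# A sinusoid with a 50-sample period: the strongest lag between 25 and 75
# samples is one full period.
x = [math.sin(2 * math.pi * n / 50) for n in range(400)]
r = autocorrelation(x)
best_lag = max(range(25, 76), key=lambda k: r[k])
```

Restricting the search to a lag window, as in the last line, mirrors the search between minimum and maximum end point locations.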

As mentioned above, weighting and correlation may be accomplished efficiently by multiplying the signals to be correlated, in the frequency domain, by a weighted loudness curve. In that case, an FFT is applied before the weighting and correlation, the weighting is applied during the correlation, and the inverse FFT is then applied. Whether performed in the time domain or the frequency domain, the correlation is then stored for processing by the next step.

As shown in FIG. 5, the instantaneous phase of each input channel's data block is computed in substep 214-3, where the instantaneous phase is defined as

phase(n) = arctan(imag(analytic(x(n))) / real(analytic(x(n))))

where x(n) is the digitized time-domain data contained in the input data block representing the audio samples in the correlation processing region and n denotes the sample number or index.

The function analytic() represents the complex analytic version of x(n). The analytic signal can be created by taking the Hilbert transform of x(n) and forming a complex signal whose real part is x(n) and whose imaginary part is the Hilbert transform of x(n). In this implementation, the analytic signal is computed by taking the FFT of the input signal x(n), zeroing out the negative frequency components of the frequency-domain signal and then performing the inverse FFT. The result is the complex analytic signal. The phase of x(n) is computed by taking the arctangent of the imaginary part of the analytic signal divided by the real part of the analytic signal. The instantaneous phase of the analytic signal of x(n) is used because it contains important information about the local behavior of the signal, which aids in the analysis of the periodicity of x(n).

FIG. 14 shows the instantaneous phase of a speech signal, in radians, superimposed on the time-domain signal, x(n). An explanation of "instantaneous phase" is set forth in section 6.4.1 ("Analog Modulated Signals") of Digital and Analog Communication System by Sam Shanmugam, John Wiley & Sons, New York (1979), pages 278-280. By taking into consideration both phase and time-domain characteristics, additional information is obtained that improves the ability to match waveforms at the splice point. Minimizing phase distortion at the splice point tends to reduce undesirable artifacts.

The time-domain signal x(n) is related to the instantaneous phase of the analytic signal of x(n) as follows:

negative-going zero crossing of x(n) = +π/2 in phase

positive-going zero crossing of x(n) = -π/2 in phase

local maximum of x(n) = 0 in phase

local minimum of x(n) = ±π in phase

These mappings, as well as the intermediate points, provide information that is independent of the amplitude of x(n). Following the calculation of the phase for each channel's data, the correlation of the phase information for each channel is computed in step 214-4 and stored for later processing.
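The FFT-based construction of the analytic signal and the phase mappings listed above can be checked with a short sketch. The recursive `fft` is again a stand-in for a library routine. Doubling the positive-frequency bins, which keeps the real part equal to x(n), and using `atan2` to resolve the quadrant are conventional choices assumed in this example; the text itself only mentions zeroing the negative frequencies and taking an arctangent.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(X):
    """Inverse FFT via the conjugation identity."""
    n = len(X)
    return [v.conjugate() / n for v in fft([u.conjugate() for u in X])]

def analytic(x):
    """Analytic signal of x (length a power of two, at least 2)."""
    n = len(x)
    X = fft(x)
    for k in range(1, n):
        if k == n // 2:
            continue                  # leave the Nyquist bin unchanged
        # Double positive frequencies, zero negative ones; DC is untouched.
        X[k] = 2 * X[k] if k < n // 2 else 0j
    return ifft(X)

def instantaneous_phase(x):
    """Phase in radians, in (-pi, pi], of the analytic signal of x."""
    return [math.atan2(z.imag, z.real) for z in analytic(x)]

# One cosine period every 64 samples: maximum at n = 0, negative-going zero
# crossing at n = 16, minimum at n = 32, positive-going crossing at n = 48.
x = [math.cos(2 * math.pi * n / 64) for n in range(256)]
phase = instantaneous_phase(x)
```

Scaling `x` by any positive constant leaves `phase` unchanged, which is the amplitude independence noted above.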

Multiple Correlation Processing 216 (FIGS. 5, 15, 16)

Once the phase and time-domain correlations have been computed for each input channel's data block, the correlation-processing step 216 of FIG. 5 ("Process multiple correlations to determine crossfade location"), shown in more detail in FIG. 15, processes them. FIG. 15 shows the phase and time-domain correlations for five (left, center, right, left surround and right surround) input channels containing music. The correlation processing step, shown conceptually in FIG. 16, accepts the phase and time-domain correlations for each channel as inputs, multiplies each by a weighting value and then sums them to form a single correlation function that represents all inputs of the time-domain and phase correlation information of all the input channels. In other words, the FIG. 16 arrangement may be considered a super-correlation function that sums together the ten different correlations to yield a single correlation. The waveform of FIG. 16 shows a maximum correlation value, constituting a desirable common end point, at about sample 500, which is between the minimum and maximum end point locations. The splice point is at sample 0 in this example. The weighting values are chosen to allow specific channels or correlation types (time-domain versus phase, for example) to have a dominant role in the overall multichannel analysis. A very simple, but usable, weighting function is a measure of the relative loudness among the channels. Such a weighting minimizes the contribution of signals that are so low in level that they can be ignored. Other weighting functions are also possible. For example, greater weight may be given to transients. The purpose of the "super correlation", combined with the weighting of the individual correlations, is to seek as good a common end point as possible. Because the multiple channels may be different waveforms, there is no one ideal solution, nor is there one ideal technique for seeking a common end point. An alternative process for seeking an optimized pair of splice point and end point locations is described below.

The weighted sum of the correlations provides useful insight into the overall periodic nature of the input blocks for all channels. The resulting overall correlation is searched in the correlation processing region between the splice point and the maximum correlation processing location to determine the maximum value of the correlation.
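The weighted summation and the search for its maximum can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the toy correlation values, the weights and the search limits `min_end` and `max_end` are all assumptions made for the example.

```python
def super_correlation(correlations, weights):
    """Weighted sum of equal-length correlation functions (one weight each)."""
    if len(correlations) != len(weights):
        raise ValueError("one weight per correlation is required")
    length = len(correlations[0])
    if any(len(c) != length for c in correlations):
        raise ValueError("correlations must share one length")
    return [
        sum(w * c[i] for c, w in zip(correlations, weights))
        for i in range(length)
    ]


def best_common_end_point(combined, min_end, max_end):
    """Index of the largest combined value within [min_end, max_end]."""
    return max(range(min_end, max_end + 1), key=lambda i: combined[i])


# Toy data (assumed): two channels, each with a time-domain and a phase correlation.
correlations = [
    [0.1, 0.2, 0.9, 0.3, 0.1],  # left, time-domain
    [0.0, 0.3, 0.8, 0.2, 0.1],  # left, phase
    [0.9, 0.1, 0.2, 0.1, 0.0],  # right, time-domain (very quiet channel)
    [0.8, 0.2, 0.1, 0.1, 0.0],  # right, phase
]
weights = [1.0, 1.0, 0.05, 0.05]  # loudness-style weights, assumed values
combined = super_correlation(correlations, weights)
end_point = best_common_end_point(combined, min_end=1, max_end=4)
```

With these assumed weights the quiet right channel contributes almost nothing, so the common end point follows the peak of the louder left channel, which is the effect the loudness weighting is meant to have.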

Process Blocks Decision Step 218 (FIG. 5)

Returning to the description of FIG. 5, the block processing decision step 218 ("Process blocks?") compares how much the data has been time scaled with the requested amount of time scaling. For example, in the case of compression, the decision step keeps a cumulative tally of how much compression has been performed relative to the desired compression ratio. The output time scaling factor varies from block to block, fluctuating slightly around the requested time scaling factor (it may be more or less than the desired amount at any given time). If only one common overlap region is allowed in each time-coincident ("current") block (a set of input data blocks representing time-coincident audio segments, one block for each channel), the block processing decision step compares the requested time scaling factor with the output time scaling factor and decides whether to process the current input data block. The decision is based on the length of the target segment in the common overlap region, if any, in the current block. For example, if a time scaling factor of 110% is requested and the output scaling factor is below the requested scaling factor, the current input blocks are processed. Otherwise the current blocks are skipped. If more than one common overlap region is allowed in a time-coincident set of input data blocks, the block processing decision step may decide to process one overlap region, to process more than one overlap region, or to skip the current blocks. Alternatively, other criteria for processing or skipping may be used. For example, instead of basing the decision to skip the current block on whether the accumulated expansion or compression already exceeds the desired degree, the decision may be based on whether processing the current block would move the accumulated expansion or compression toward the desired degree, even if the result after processing is still in error in the opposite direction.
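The cumulative tracking behind this decision can be sketched as below. It is a simplified illustration under stated assumptions: the class and method names are hypothetical, a processed block is assumed to grow or shrink by exactly the target segment length, and only the single-overlap-region case is modeled.

```python
class ScalingTracker:
    """Tracks the achieved (output) time scaling factor against the requested one."""

    def __init__(self, requested_factor):
        self.requested_factor = requested_factor
        self.samples_in = 0
        self.samples_out = 0

    def output_factor(self):
        # Before any data is seen, treat the output as unscaled.
        if self.samples_in == 0:
            return 1.0
        return self.samples_out / self.samples_in

    def should_process(self):
        achieved = self.output_factor()
        if self.requested_factor >= 1.0:
            return achieved < self.requested_factor  # expansion is lagging
        return achieved > self.requested_factor      # compression is lagging

    def account(self, block_len, target_len, processed):
        # Assumption: processing repeats (expansion) or omits (compression)
        # exactly one target segment of target_len samples.
        self.samples_in += block_len
        if not processed:
            self.samples_out += block_len
        elif self.requested_factor >= 1.0:
            self.samples_out += block_len + target_len
        else:
            self.samples_out += block_len - target_len


tracker = ScalingTracker(requested_factor=1.10)
decisions = []
for _ in range(6):
    process = tracker.should_process()
    decisions.append(process)
    tracker.account(block_len=4096, target_len=800, processed=process)
```

In this run the tracker alternates between processing and skipping, so the achieved factor hovers slightly above and below the requested 110%, as the paragraph describes.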

Crossfade Processing 220 (FIG. 5)

Following the determination of the splice point and end point locations and the decision as to whether to process the block, each channel's data block is processed by the crossfade block step 220 of FIG. 5 ("Crossfade the block for each channel"). This step accepts each channel's data block, the common splice point, the common end point and the crossfade information.

Referring again to FIG. 10, a crossfade of suitable shape is applied to the input data and the two segments are spliced together, omitting (as in FIG. 10) or repeating the target segment. The length of the crossfade is preferably a maximum of 10 msec, but it may be shorter depending on the crossfade parameters determined in the previous analysis steps. However, when auditory events are taken into account, longer crossfades may be used under certain conditions, as discussed below. Non-linear crossfades, for example in accordance with the shape of half a Hanning window, may produce less audible artifacts than linear (straight-line) crossfades, particularly for simple single-frequency signals such as tones and tone sweeps, because a Hanning window does not have the slope discontinuities of a straight-line crossfade. Other shapes, such as that of a Kaiser-Bessel window, may also provide satisfactory results, provided that the rising and falling crossfades cross at 50% and sum to unity over the whole of the crossfade duration.
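A half-Hanning crossfade that crosses at 50% and sums to unity can be sketched as follows. The function names are hypothetical, and centering the crossfade on the splice point and end point is an assumption made for this example rather than a detail confirmed by the text.

```python
import math


def half_hanning_fades(length):
    """Return (fade_out, fade_in) gains shaped like half a Hanning window."""
    if length < 2:
        raise ValueError("crossfade needs at least two samples")
    fade_in = [
        0.5 * (1.0 - math.cos(math.pi * n / (length - 1))) for n in range(length)
    ]
    fade_out = [1.0 - g for g in fade_in]  # guarantees the pair sums to unity
    return fade_out, fade_in


def splice_omit(block, splice, end, fade_len):
    """Omit the target segment block[splice:end], crossfading across the join."""
    half = fade_len // 2
    if not (half <= splice < end and end + (fade_len - half) <= len(block)):
        raise ValueError("crossfade regions must lie inside the block")
    fade_out, fade_in = half_hanning_fades(fade_len)
    leading = block[splice - half : splice - half + fade_len]  # around the splice point
    trailing = block[end - half : end - half + fade_len]       # around the end point
    mixed = [
        x * g_out + y * g_in
        for x, y, g_out, g_in in zip(leading, trailing, fade_out, fade_in)
    ]
    return block[: splice - half] + mixed + block[end - half + fade_len :]


constant = [1.0] * 100
shortened = splice_omit(constant, splice=30, end=60, fade_len=10)
```

Because the two gain curves sum to unity, a constant input stays constant through the join, and the output is shorter than the input by exactly the length of the omitted target segment. At a 44.1 kHz sampling rate, the 10 msec maximum mentioned above corresponds to a `fade_len` of 441 samples.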

Pitch Scaling Processing 222 (FIG. 5)

Following the crossfade processing, decision step 222 of FIG. 5 ("Pitch scale?") is checked to determine whether pitch shifting (scaling) is to be performed. As discussed above, time scaling cannot be done in real time because of buffer underflow or overflow. However, pitch scaling can be performed in real time because of the operation of the "resampling" step 224 ("Resample all data blocks"). The resampling step reads out the samples at a different rate. In a digital implementation with a fixed output clock, this is accomplished by resampling. Thus, the resampling step 224 resamples the time-scaled input signal, resulting in a pitch-scaled signal that has the same time evolution or duration as the input signal but altered spectral information. For real-time implementations, the resampling may be performed with dedicated hardware sample-rate converters to reduce the computation in a DSP implementation. Note that resampling is required only if it is desired to maintain a constant output sampling rate or to keep the input and output sampling rates the same. In a digital system, a constant output sampling rate or equal input and output sampling rates are normally required. However, if the output of interest is converted to the analog domain, a varying output sampling rate is of no concern. Thus, resampling is not a necessary part of any aspect of the present invention.
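The idea of restoring the original duration by resampling can be illustrated with a very crude resampler. This is only a sketch under loud assumptions: the function name is hypothetical, and plain linear interpolation without any anti-aliasing filter stands in for a proper sample-rate converter, which a real implementation would need.

```python
def resample_linear(samples, out_len):
    """Resample to out_len points by linear interpolation (no anti-alias filter)."""
    if out_len < 2 or len(samples) < 2:
        raise ValueError("need at least two input and two output samples")
    step = (len(samples) - 1) / (out_len - 1)
    out = []
    for n in range(out_len):
        pos = n * step
        i = min(int(pos), len(samples) - 2)  # keep i + 1 inside the input
        frac = pos - i
        out.append(samples[i] * (1.0 - frac) + samples[i + 1] * frac)
    return out


ramp = [float(i) for i in range(11)]
stretched = resample_linear(ramp, 21)
```

Under this scheme, a block that has been time-expanded by 10% and is then resampled back to its original number of samples keeps the original duration while its spectral content moves up by roughly 10%.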

Following the pitch scaling decision and possible resampling, all processed input data blocks are output in step 226 ("Output processed data blocks"), either to a file for non-real-time operation or to an output data block for real-time operation. The process then checks for additional input data and continues processing.

Psychoacoustic Analysis and Auditory Scene Analysis Embodiment

An embodiment of a multichannel time and/or pitch scaling process employing both psychoacoustic analysis and auditory scene analysis in accordance with aspects of the present invention is shown in FIG. 17. Although the process is described for a case in which the input signals are one or more channels of digital audio represented by samples, and in which consecutive samples in each channel are divided into blocks of 4096 samples, in principle the processed audio may be digital or analog and need not be divided into blocks.

Referring to FIG. 17, the first step, decision step 702 ("Input data?"), determines whether digitized input audio data is available for data compression or data expansion processing. The source of the data may be a computer file or a block of input data, which may be stored in a real-time input buffer, for example. If data is available, data blocks of N time-synchronous samples, representing time-coincident segments, are accumulated by step 704 ("Get N samples for each channel"), one block for each of the input channels to be data compression or data expansion processed (the number of channels being greater than or equal to 1). The number of input data samples, N, used by the process may be fixed at any reasonable number of samples, thereby dividing the input data into blocks. In principle, the processed audio may be digital or analog and need not be divided into blocks.

FIG. 17 is discussed in connection with a practical embodiment of aspects of the invention in which the input data for each audio channel is data compression or data expansion processed in blocks of 4096 samples, which corresponds to about 93 msec of input audio at a sampling rate of 44.1 kHz. It will be understood that the aspects of the invention are not limited to such a practical embodiment. As noted above, the principles of the various aspects of the invention do not require arranging the audio into sample blocks, nor, if blocks are used, that the blocks be of constant length. However, to minimize complexity, a fixed block length of 4096 samples (or some other power-of-two number of samples) is useful for three primary reasons. First, it provides latency low enough to be acceptable for real-time processing applications. Second, it is a power-of-two number of samples, which is useful for fast Fourier transform (FFT) analysis. Third, it provides a suitably large window size for performing useful auditory scene and psychoacoustic analyses of the input signal.

In the following discussion, the input signals are assumed to be data with amplitude values in the range [-1, +1].

Auditory Scene Analysis 706 (FIG. 17)

Following audio input data blocking, the contents of each channel's data block are divided into auditory events, each of which tends to be perceived as separate ("Perform auditory scene analysis on the block for each channel") (step 706). In the case of multiple channels, the auditory scene analysis 706 and subsequent steps may be performed in parallel for all channels or sequentially, channel by channel (while providing appropriate storage of each channel's data and of the analysis of each). Although parallel processing requires greater processing power, it may be preferred for real-time applications. The description of FIG. 17 assumes that the channels are processed in parallel.

Auditory scene analysis may be accomplished by the auditory scene analysis (ASA) process discussed above. Although one suitable process for performing auditory scene analysis is described herein, the invention contemplates that other useful techniques for performing ASA may be employed. Because an auditory event tends to be perceived as reasonably constant, the auditory scene analysis results provide important information that is useful in performing high quality time and pitch scaling and in reducing the introduction of audible processing artifacts. By identifying auditory events and subsequently processing them individually, audible artifacts that may be introduced by the time and pitch scaling processing can be greatly reduced.

FIG. 18 outlines a process in accordance with techniques of the present invention that may be used in the auditory scene analysis step of FIG. 17. The ASA step consists of three general processing substeps. The first substep 706-1 ("Calculate spectral profile of input audio block") takes the N-sample input block, divides it into subblocks and calculates a spectral profile or spectral content for each of the subblocks. Thus, the first substep calculates the spectral content of successive time segments of the audio signal. In the practical embodiment described below, the ASA subblock size is one-eighth (e.g., 512 samples) of the input data block size (e.g., 4096 samples). In the second substep 706-2, the differences in spectral content from subblock to subblock are determined ("Perform spectral profile difference measurements"). Thus, the second substep calculates the difference in spectral content between successive time segments of the audio signal. In the third substep 706-3 ("Identify location of auditory event boundaries"), when the spectral difference between one spectral-profile subblock and the next is greater than a threshold, the subblock boundary is taken to be an auditory event boundary. Thus, the third substep sets an auditory event boundary between successive time segments when the difference in spectral profile content between such successive time segments exceeds a threshold. As discussed above, a change in spectral content is believed to be a powerful indicator of the beginning or end of a perceived auditory event.

In this embodiment, the auditory event boundaries define auditory events whose length is an integer number of spectral profile subblocks, with a minimum length of one spectral profile subblock (512 samples in this example). In principle, event boundaries need not be so limited. Note also that the input block size limits the maximum length of an auditory event unless the input block size is variable (as an alternative to the practical embodiment discussed herein, the input block size may vary, for example, so as to be essentially the size of an auditory event).

FIG. 19 outlines one method of calculating the time-varying spectral profiles. In FIG. 19, overlapping segments of the audio are windowed and used to compute spectral profiles of the input audio. Overlap gives finer resolution as to the location of auditory events and also makes it less likely that an event, such as a transient, will be missed. However, as time resolution increases, frequency resolution decreases. Overlap also increases computational complexity. Thus, overlap is omitted in the practical example set forth below.

The following variables may be used to compute the spectral profile of the input block:

N = number of samples in the input audio block

M = number of windowed samples used to compute the spectral profile

P = number of samples of spectral computation overlap

Q = number of spectral windows/regions computed

In general, any integer values may be used for the variables above. However, the implementation will be more efficient if M is set equal to a power of 2 so that standard FFTs may be used for the spectral profile calculations. In addition, if N, M and P are chosen such that Q is an integer, under-running or over-running the audio at the end of the N-sample block is avoided. In a practical embodiment of the auditory scene analysis process, the listed parameters may be set as follows:

N = 4096 samples (or 93 msec at 44.1 kHz)

M = 512 samples (or 12 msec at 44.1 kHz)

P = 0 samples (no overlap)

Q = 8 blocks

The values listed above were determined experimentally and were found to identify the location and duration of auditory events with sufficient accuracy for the purposes of time scaling and pitch shifting. However, setting the value of P to 256 samples (50% overlap) has been found useful in identifying some hard-to-find events. While many different types of windows may be used to minimize spectral artifacts due to windowing, the window used in the spectral profile calculations is an M-point Hanning, Kaiser-Bessel or other suitable, preferably non-rectangular, window. The values indicated above and the Hanning window type were selected after extensive experimental analysis, as they have been shown to provide excellent results across a wide range of audio material. Non-rectangular windowing is preferred for processing audio signals with predominantly low-frequency content. Rectangular windowing produces spectral artifacts that may cause incorrect detection of events.
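The relationship among N, M, P and Q and the quoted durations can be checked with a short calculation. The formula Q = (N - P) / (M - P) used here is an assumption (it simply counts windows of length M advanced by a hop of M - P samples); the text itself only gives the resulting values, and the helper names are hypothetical.

```python
SAMPLE_RATE_HZ = 44100


def window_count(n, m, p):
    """Number of M-sample windows, overlapping by P samples, that fit N samples."""
    hop = m - p
    if hop <= 0:
        raise ValueError("overlap P must be smaller than window length M")
    count, remainder = divmod(n - p, hop)
    if remainder:
        raise ValueError("N, M and P do not give an integer Q")
    return count


def msec(samples, rate=SAMPLE_RATE_HZ):
    """Duration of a number of samples in milliseconds."""
    return 1000.0 * samples / rate


q_no_overlap = window_count(4096, 512, 0)      # practical embodiment, P = 0
q_half_overlap = window_count(4096, 512, 256)  # 50% overlap variant
block_msec = msec(4096)
subblock_msec = msec(512)
```

Under this assumed formula the practical values give Q = 8 without overlap, and the 50% overlap variant would give 15 windows per block, which illustrates the added computation that overlap implies.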

In substep 706-1, the spectrum of each M-sample subblock may be computed by windowing the data with an M-point Hanning, Kaiser-Bessel or other suitable window, converting to the frequency domain using an M-point fast Fourier transform, and calculating the magnitude of the FFT coefficients. The resulting data is normalized so that the largest magnitude is set to unity, and the normalized array of M numbers is converted to the log domain. The array need not be converted to the log domain, but the conversion simplifies the calculation of the difference measure in substep 706-2. Furthermore, the log domain more closely matches the logarithmic amplitude nature of the human auditory system. The resulting log domain values range from minus infinity to zero. In a practical embodiment, a lower limit may be imposed on the range of values; the limit may be fixed, for example at -60 dB, or be frequency-dependent to reflect the lower audibility of quiet sounds at low and very high frequencies. (Note that it would be possible to reduce the size of the array to M/2, because the FFT represents negative as well as positive frequencies.)
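The steps of substep 706-1 can be sketched in a few lines. This is an illustrative sketch, not the patented implementation: the function names are hypothetical, a direct DFT stands in for the FFT so that the example needs only the standard library, and the use of 20·log10 to express magnitudes in dB is an assumption.

```python
import cmath
import math

FLOOR_DB = -60.0  # fixed lower limit mentioned in the text


def hanning(m):
    """Symmetric M-point Hanning window."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (m - 1))) for n in range(m)]


def dft_magnitudes(x):
    """Magnitudes of a direct DFT; an FFT yields the same values faster."""
    m = len(x)
    return [
        abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / m) for n in range(m)))
        for k in range(m)
    ]


def spectral_profile(subblock, floor_db=FLOOR_DB):
    """Windowed, peak-normalized log-magnitude spectrum of one subblock, in dB."""
    window = hanning(len(subblock))
    mags = dft_magnitudes([s * w for s, w in zip(subblock, window)])
    peak = max(mags)
    if peak == 0.0:
        return [floor_db] * len(mags)  # silent subblock: everything at the floor
    profile = []
    for mag in mags:
        ratio = mag / peak  # largest magnitude becomes unity
        db = 20.0 * math.log10(ratio) if ratio > 0.0 else floor_db
        profile.append(max(db, floor_db))  # impose the lower limit
    return profile


# Small example: a sine that falls exactly on bin 4 of a 32-point subblock.
tone = [math.sin(2.0 * math.pi * 4 * n / 32) for n in range(32)]
profile = spectral_profile(tone)
```

The example uses a 32-sample subblock only to keep the direct DFT cheap; with M = 512 a real FFT routine would normally be used instead.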

Substep 706-2 calculates a measure of the difference between the spectra of adjacent subblocks. For each subblock, each of the M (log) spectral coefficients from substep 706-1 is subtracted from the corresponding coefficient for the preceding subblock, and the magnitude of the difference is calculated. These M differences are then summed to one number. Hence, for the whole audio signal, the result is an array of Q positive numbers; the greater the number, the more a subblock differs in spectrum from the preceding subblock. This difference measure may also be expressed as an average difference per spectral coefficient by dividing the difference measure by the number of spectral coefficients used in the sum (in this case, M coefficients).
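The difference measure of substep 706-2 amounts to a sum of absolute differences between adjacent profiles, which can be sketched as follows. The function name and the optional `previous_profile` argument (a profile carried over from the preceding input block) are assumptions made for this illustration.

```python
def difference_measures(profiles, previous_profile=None):
    """Sum of absolute coefficient differences between adjacent spectral profiles."""
    measures = []
    prev = previous_profile
    for profile in profiles:
        if prev is None:
            measures.append(0.0)  # nothing to compare the first subblock with
        else:
            measures.append(sum(abs(a - b) for a, b in zip(profile, prev)))
        prev = profile
    return measures


# Toy log-domain profiles (dB) with three coefficients each.
profiles = [
    [0.0, -10.0, -20.0],
    [0.0, -10.0, -20.0],   # unchanged spectrum
    [-30.0, 0.0, -60.0],   # clearly different spectrum
]
measures = difference_measures(profiles)
carried = difference_measures(profiles, previous_profile=[-5.0, -10.0, -20.0])
```

Dividing each measure by the number of coefficients would give the per-coefficient average mentioned above.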

Substep 706-3 identifies the locations of auditory event boundaries by applying a threshold to the array of difference measures from substep 706-2. When a difference measure exceeds the threshold, the change in spectrum is deemed sufficient to signal a new event, and the subblock number of the change is recorded as an event boundary. For the values of M, N, P and Q given above and for log domain values (in substep 706-2) expressed in dB, the threshold may be set to 2500 if the whole magnitude FFT (including the mirrored part) is compared, or to 1250 if half the FFT is compared (as noted above, the FFT represents negative as well as positive frequencies; for the magnitude of the FFT, one is the mirror image of the other). This value was chosen experimentally and provides good auditory event boundary detection. The parameter value may be changed to reduce (increase the threshold) or increase (decrease the threshold) the detection of events.
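The thresholding of substep 706-3 can be sketched as below. The function names and the toy difference values are assumptions; treating the start and end of the input block as event edges is also an assumption made so that the example can list complete segments.

```python
def event_boundaries(measures, threshold):
    """Subblock numbers whose difference measure exceeds the threshold."""
    return [i for i, d in enumerate(measures) if d > threshold]


def events_in_samples(boundaries, q, m):
    """(start, end) sample ranges of the events delimited by the boundaries.

    Assumes the block start (subblock 0) and block end (subblock q) are edges.
    """
    edges = sorted(set([0] + list(boundaries) + [q]))
    return [(a * m, b * m) for a, b in zip(edges, edges[1:])]


# Toy difference measures for Q = 8 subblocks of M = 512 samples.
measures = [0.0, 120.0, 3100.0, 40.0, 10.0, 2600.0, 300.0, 5.0]
boundaries = event_boundaries(measures, threshold=2500.0)
events = events_in_samples(boundaries, q=8, m=512)
```

Raising the threshold in this sketch drops boundaries and merges events, while lowering it adds boundaries, which mirrors the tuning behavior described above.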

The details of this practical embodiment are not critical. Other ways may be employed to calculate the spectral content of successive time segments of the audio signal, to calculate the differences between successive time segments, and to set auditory event boundaries at the respective boundaries between successive time segments when the difference in spectral profile content between such successive time segments exceeds a threshold.

The outputs of the auditory scene analysis process of step 706 of FIG. 17 are the locations of the auditory event boundaries, the number of auditory events detected in the input block, and the last, or Lth, spectral profile block computed for the N-point input block. As stated earlier, the auditory analysis process is performed once for each channel's input data block. As described in more detail below in connection with step 710, if more than one audio channel is being processed, the auditory event information may be combined (creating "combined auditory event" segments) to create a total auditory event overview for all channels. This facilitates phase-synchronous multichannel processing. In this way, the multiple audio channels can be thought of as multiple individual audio "tracks" that are mixed together to create a single complex audio scene. An example of event detection processing for two channels is shown in FIG. 20 and described below.

Psychoacoustic Analysis of Auditory Events 708 (FIG. 17)

Referring again to FIG. 17, following input data blocking and auditory scene analysis, psychoacoustic analysis is performed in each input data block for each auditory event ("Perform psychoacoustic analysis on each event of each block") (step 708). In general, the psychoacoustic characteristics remain substantially uniform in an audio channel over the length or time period of an auditory event, because the audio within an auditory event is perceived to be reasonably constant. Thus, even though the audio information is examined more finely in the psychoacoustic analysis process, which looks at 64-sample subblocks in the practical example disclosed herein, than in the auditory event detection process, which looks at 512-sample subblocks, the psychoacoustic analysis process generally finds only one predominant psychoacoustic condition throughout an auditory event and tags the event accordingly. The psychoacoustic analysis performed as part of the process of FIG. 17 differs from that performed as part of the process of FIG. 5 in that it is applied to each auditory event within an input block rather than to an entire input block.

In general, psychoacoustic analysis of auditory events provides two important pieces of information. First, it identifies which of the input signal's events, if processed, are most likely to produce audible artifacts. Second, it identifies which portions of the input signal can be used advantageously to mask the processing that is performed. FIG. 21 sets forth a process, similar to the process of FIG. 6 described above, that is used in the psychoacoustic analysis process. The psychoacoustic analysis process is composed of four general processing substeps. As mentioned above, each of the psychoacoustic processing substeps employs a psychoacoustic subblock whose size is one-eighth of the spectral-profile subblock (or one sixty-fourth of the input block). Thus, in this example, the psychoacoustic subblocks are approximately 1.5 msec (64 samples at 44.1 kHz), as shown in FIG. 22. Although the actual size of the psychoacoustic subblocks is not constrained to 1.5 msec and may have a different value, this size was chosen for the practical implementation because it provides a good trade-off between real-time processing requirements (larger subblock sizes require less psychoacoustic processing overhead) and resolution of transient location (smaller subblocks provide more detailed information on the location of transients). In principle, the psychoacoustic subblock size need not be the same for every type of psychoacoustic analysis, but in practical embodiments this is preferred for ease of implementation.
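The subblock arithmetic stated above can be checked with a short calculation. The following is only an illustrative sketch: the constant and function names are hypothetical, and the 4096-sample input block and 44.1 kHz sampling rate are taken from the practical example described in this text.

```python
SAMPLE_RATE_HZ = 44_100    # sampling rate of the practical example
INPUT_BLOCK = 4096         # input block size in samples
SPECTRAL_SUBBLOCK = 512    # spectral-profile (ASA) subblock size in samples


def psychoacoustic_subblock_size(spectral_subblock: int = SPECTRAL_SUBBLOCK) -> int:
    """One-eighth of the spectral-profile subblock, as stated in the text."""
    return spectral_subblock // 8


def duration_msec(samples: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Duration of a run of samples in milliseconds."""
    return 1000.0 * samples / sample_rate_hz


size = psychoacoustic_subblock_size()
# 64 samples per subblock, 64 subblocks per input block, about 1.45 msec each
print(size, INPUT_BLOCK // size, round(duration_msec(size), 2))
```

The exact duration is slightly under the quoted "approximately 1.5 msec", which is consistent with the text's use of an approximate figure.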

Transient Detection 708-1 (FIG. 21)

Referring to FIG. 21, the first substep 708-1 ("Perform transient detection/masking analysis") analyzes each auditory event segment in the input block of each audio channel to determine whether each such segment contains a transient. This is necessary even though the spectral-change aspect of the ASA process inherently takes transients into account and may have identified an audio segment containing a transient as an auditory event (inasmuch as transients cause spectral changes), because the spectrum-based ASA process described herein does not identify an auditory event by whether or not it contains a transient. The resulting temporal transient information is used in the masking analysis and helps in the placement of the provisional or common splice point location. As discussed above, it is well known that transients introduce temporal masking (hiding audio information both before and after the occurrence of the transient). An auditory event segment in a particular block is tagged as a transient whether or not the transient occupies the entire length or time period of the event. The transient detection process in the psychoacoustic analysis step is essentially the same as the transient detection process described above, except that it analyzes only the segment of an input block that constitutes an auditory event. Thus, reference may be made to the process flowchart of FIG. 8, described above, for details of the transient detection process.

Hearing Threshold Analysis 708-2 (FIG. 21)

Referring to FIG. 21, the second substep 708-2 of the psychoacoustic analysis process, the "Perform hearing threshold analysis" substep, analyzes each auditory event segment in the input block of each audio channel to determine whether each such segment is predominantly of low enough signal strength that it can be considered to be at or below the hearing threshold. As mentioned above, an auditory event tends to be perceived as reasonably constant throughout its length or time period, subject, of course, to possible variations near its boundaries due to the granularity of the spectral-profile subblock size (for example, the audio may change its character at a point other than precisely at a possible event boundary). The hearing threshold analysis process in the psychoacoustic analysis step is essentially the same as the hearing threshold analysis process described above (see, for example, the description of substep 206-2 of FIG. 6), except that it analyzes only the segments of an input block that constitute an auditory event, so reference may be made to the prior description. Such auditory events are of interest because the artifacts introduced by time scaling and pitch shifting them are less likely to be audible in such regions.

High-Frequency Analysis 708-3 (FIG. 21)

The third substep 708-3 (FIG. 21) ("Perform high-frequency analysis") analyzes each auditory event segment in the input block of each audio channel to determine whether each such segment predominantly contains high-frequency audio content. High-frequency segments are of interest in the psychoacoustic analysis because the hearing threshold in quiet increases rapidly above approximately 10-12 kHz, and because the ear is less sensitive to discontinuities in a predominantly high-frequency waveform than to discontinuities in a predominantly low-frequency waveform. Although there are many possible methods for determining whether an audio signal consists mostly of high-frequency energy, the method described above in connection with substep 206-3 of FIG. 6 provides good detection results, minimizes computational requirements, and may be applied to the analysis of segments constituting auditory events.

Audio Level Analysis 708-4 (FIG. 21)

The fourth substep 708-4 (FIG. 21) of the psychoacoustic analysis process, the "Perform general audio block level analysis" substep, analyzes each auditory event segment in the input block of each audio channel to compute a measure of the signal strength of the event. Such information is used if the event does not have any of the above psychoacoustic characteristics that can be exploited during processing. In this case, the data compression or expansion processing favors the lowest-level or quietest auditory events in an input data block, based on the rationale that lower-level segments of audio generate low-level processing artifacts that are less likely to be audible. A simple example using a single channel of orchestral music is shown in FIG. 23. The spectral changes that occur as new notes are played trigger the new events 2 and 3 at samples 2048 and 2560, respectively. The orchestral signal shown in FIG. 23 contains no transients, no content below the hearing threshold, and no predominantly high-frequency content. However, the first auditory event of the signal is lower in level than the second and third events of the block. Audible processing artifacts are minimized by choosing such a quieter event for data expansion or compression processing rather than the louder, subsequent events.

To compute the general level of an auditory event, substep 708-4 takes the data within the event divided into 64-sample subblocks, finds the magnitude of the greatest sample in each subblock, and takes the average of those greatest magnitudes over the number of 64-sample subblocks in the event. The general audio level of each event is stored for later comparison.
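The level measure just described (the mean of the per-subblock peak magnitudes) can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name is hypothetical, and the handling of a trailing partial subblock is an assumption that does not arise when event lengths are multiples of 512 samples.

```python
from typing import Sequence


def general_audio_level(event: Sequence[float], subblock: int = 64) -> float:
    """Average, over all subblocks in the event, of each subblock's peak magnitude."""
    peaks = [
        max(abs(x) for x in event[start:start + subblock])
        for start in range(0, len(event), subblock)
    ]
    if not peaks:
        raise ValueError("event contains no samples")
    return sum(peaks) / len(peaks)


# Two 64-sample subblocks with peak magnitudes 0.1 and 0.5
event = [0.1] * 64 + [-0.5] + [0.0] * 63
level = general_audio_level(event)
```

Storing one such value per event allows the quietest event in a block to be found later by a simple comparison.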

Determining Combined Auditory Events and Setting a Common Splice Point 710 (FIG. 17)

As shown in FIG. 17, following auditory scene analysis and psychoacoustic analysis of each segment constituting an auditory event in each block, the next step 710 in the processing ("Determine combined auditory events and set common splice point") determines the boundaries of combined auditory events in concurrent blocks across all channels (combined auditory events are described in more detail below in connection with FIG. 20), determines a common splice point in the concurrent blocks across all channels for one or more combined auditory event segments in each set of concurrent blocks, and ranks the psychoacoustic quality of the auditory events in the combined auditory event segments. Such a ranking is based on the hierarchy of psychoacoustic criteria set forth above. When a single channel is being processed, the auditory events in that channel are treated in the same manner as the combined auditory events of multiple channels in this description.

The setting of one or more common splice points is done in the manner described above in connection with the description of FIG. 5, except that combined auditory events are taken into account rather than a common overlap of identified regions. Thus, for example, a common splice point is typically set early in the combined auditory event period in the case of compression and late in the combined auditory event period in the case of expansion. For example, a default time of 5 msec after the start of a combined auditory event is employed.

The psychoacoustic quality of the combined auditory event segments in each channel is taken into account in order to determine whether data compression or expansion processing should occur within a particular combined auditory event. In principle, the psychoacoustic quality determination may be performed either after a common splice point has been set in each combined event segment or before a common splice point is set in each combined event segment (in the latter case, no common splice point need be set for a combined event whose psychoacoustic quality ranking is so negative that it is skipped based on complexity).

The psychoacoustic quality ranking of a combined event is based on the psychoacoustic characteristics of the audio in the various channels during the combined event time segment (a combined event in which every channel is masked by a transient has the highest psychoacoustic quality ranking, while a combined event in which none of the channels satisfies any psychoacoustic criterion has the lowest psychoacoustic quality ranking). For example, the hierarchy of psychoacoustic criteria described above may be employed. The relative psychoacoustic quality rankings of the combined events are then used in connection with a first decision step described further below (step 712), which takes into account the complexity of the combined event segment in the various channels. A complex segment is one in which performing data compression or expansion is likely to cause audible artifacts. For example, a complex segment is one in which at least one of the channels does not satisfy any psychoacoustic criterion (as described above) or contains a transient (as mentioned above, it is undesirable to alter a transient). At the extreme of complexity, for example, every channel fails to satisfy a psychoacoustic criterion or contains a transient. A second decision step described below (step 718) takes into account the length of the target segment (which is affected by the length of the combined event segment). In the case of a single channel, the event is ranked according to the psychoacoustic criteria to determine whether it should be skipped.

Combined auditory events may be better understood by reference to FIG. 20, which shows the auditory scene analysis results for a two-channel audio signal. FIG. 20 shows concurrent blocks of audio data in two channels. ASA processing of the audio in the first channel, the top waveform of FIG. 20, identifies auditory event boundaries at samples that are multiples of the spectral-profile subblock size, at samples 1024 and 1536 in this example. The lower waveform of FIG. 20 is the second channel, and ASA processing likewise identifies event boundaries at samples that are multiples of the spectral-profile subblock size, at samples 1024, 2048 and 3072 in this example. A combined auditory event analysis for both channels results in combined auditory event segments with boundaries at samples 1024, 1536, 2048 and 3072 (the auditory event boundaries of every channel are "ORed" together). It will be appreciated that, in practice, the accuracy of the auditory event boundaries depends on the spectral-profile subblock size (N is 512 samples in this practical embodiment), because event boundaries can occur only at subblock boundaries. Nevertheless, a subblock size of 512 samples has been found to determine auditory event boundaries with sufficient accuracy to provide satisfactory results.
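The "ORing" of per-channel event boundaries amounts to taking the union of the boundary sample positions. The sketch below illustrates this with the FIG. 20 values quoted above. The function names are hypothetical, and the 4096-sample block length is assumed from the practical implementation described elsewhere in this text.

```python
from typing import Iterable, List, Tuple


def combined_event_boundaries(per_channel: Iterable[Iterable[int]]) -> List[int]:
    """Union ("OR") of the event boundaries found in each channel, in ascending order."""
    return sorted(set().union(*per_channel))


def combined_event_segments(boundaries: List[int], block_len: int = 4096) -> List[Tuple[int, int]]:
    """Split a block into (start, end) segments at the combined boundaries."""
    edges = [0] + [b for b in boundaries if 0 < b < block_len] + [block_len]
    return list(zip(edges[:-1], edges[1:]))


channel_1 = [1024, 1536]
channel_2 = [1024, 2048, 3072]
boundaries = combined_event_boundaries([channel_1, channel_2])
segments = combined_event_segments(boundaries)
print(boundaries)     # [1024, 1536, 2048, 3072]
print(len(segments))  # 5
```

Because every boundary from every channel is retained, each resulting segment lies wholly inside a single auditory event in each channel.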

Referring to FIG. 20, if only the single channel of audio containing a transient, in the upper part of the figure, were being processed, then three individual auditory events would be available for data compression or expansion processing. These events are (1) the quiet portion of audio before the transient, (2) the transient event, and (3) the echo/sustain portion of the audio transient. Similarly, if only the speech signal represented in the lower part of the figure were being processed, then four individual auditory events would be available for data compression or expansion processing. These events are the predominantly high-frequency sibilance event, the event in which the sibilance evolves or "morphs" into the vowel, the first half of the vowel, and the second half of the vowel.

FIG. 20 also shows the combined event boundaries when the auditory event data is shared across the concurrent data blocks of the two channels. Such event segmentation provides five combined event regions in which data compression or expansion processing can occur (the event boundaries are ORed together). Processing within a combined auditory event segment ensures that the processing occurs within an auditory event in every channel. Note that, depending on the method of data compression or expansion used and the content of the audio data, it may be most appropriate to process the data in the two channels within only one combined event or within only some of the combined events (rather than all of them). Although the combined auditory event boundaries result from ORing the event boundaries of all the audio channels, they are used to define segments for data compression or expansion processing that is performed independently on the data in each concurrent input channel block. Thus, if only a single combined event is chosen for processing, the data for each audio channel is processed within the length or time segment of that combined event. For example, in FIG. 20, if the desired overall time scaling amount is 10%, then the least amount of audible artifacts may be introduced if only combined event region four is processed in each channel and the number of samples in combined event region four is changed sufficiently that the length of the entire N samples changes by 0.10*N samples. However, it is also possible to distribute the processing and to process each of the combined events such that the total change in length among all the combined events sums to 0.10*N samples. How many and which of the combined events are chosen for processing is determined in step 718, described below.
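The 0.10*N length-change budget and its possible distribution over several combined events can be illustrated numerically. This sketch is purely hypothetical: the text does not say how the change would be divided, so the proportional-to-length policy, the function names, and the 4096-sample block length are all assumptions made for illustration.

```python
from typing import List, Tuple


def length_change_budget(block_len: int, scale_fraction: float) -> int:
    """Total number of samples to add or remove for the requested time scaling."""
    return round(block_len * scale_fraction)


def distribute_budget(segments: List[Tuple[int, int]], budget: int) -> List[int]:
    """Split the budget across segments in proportion to their lengths (assumed policy)."""
    total = sum(end - start for start, end in segments)
    shares = [budget * (end - start) // total for start, end in segments]
    shares[-1] += budget - sum(shares)  # give the rounding remainder to the last segment
    return shares


# Combined event regions of FIG. 20, assuming a 4096-sample block
segments = [(0, 1024), (1024, 1536), (1536, 2048), (2048, 3072), (3072, 4096)]
budget = length_change_budget(4096, 0.10)
shares = distribute_budget(segments, budget)
```

Processing only one region corresponds to assigning the whole budget to that region instead of calling `distribute_budget`.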

FIG. 24 shows an example of a four-channel input signal. Channels 1 and 4 each contain three auditory events, and channels 2 and 3 each contain two auditory events. The combined auditory event boundaries for the concurrent data blocks across all four channels are located at sample numbers 512, 1024, 1536, 2560 and 3072, as indicated at the bottom of FIG. 24. This implies that all six combined auditory events may be processed across the four channels. However, some of the combined auditory events may have such a relatively low psychoacoustic ranking (that is, they may be too complex) or may be so short that it is not desirable to process within them. In the example of FIG. 24, the most desirable combined auditory event for processing is combined event region 4, and the next most desirable is combined event region 6. The other three combined event regions are all of minimum size. Moreover, combined event region 2 contains a transient in channel 1. As noted above, it is best to avoid processing during a transient. Combined event region 4 is desirable because it is the longest and the psychoacoustic characteristics of each of its channels are satisfactory: it has transient post-masking in channel 1, channel 4 is below the hearing threshold, and channels 2 and 3 are at a relatively low level.

The maximum correlation processing length and the crossfade length limit the maximum amount of audio that can be removed or repeated within a combined auditory event time segment. The maximum correlation processing length is limited to the length of the combined auditory event time segment or a predetermined value, whichever is less. The maximum correlation processing length should be such that the data compression or expansion processing stays within the starting and ending boundaries of the event. Failure to do so causes "smearing" or "blurring" of the event boundaries.

FIG. 25 shows details of the four-channel data compression processing example of FIG. 24, using the fourth combined auditory event time segment of the channels as the segment to be processed. In this example, channel 1 contains a single transient in combined event 2. For this example, the splice point location is selected to be sample 1757, located in the largest combined auditory event following the transient at sample 650 in audio channel 1. This splice point location was chosen by placing it 5 msec (half the length of the crossfade, or 221 samples at 44.1 kHz) after the earlier combined event boundary, so as to avoid smearing the event boundary during crossfading. Placing the splice point location in this segment also takes advantage of the post-masking provided by the transient in combined event 2. In the example shown in FIG. 25, the maximum processing length takes into account the location of the combined, multichannel auditory event boundary at sample 2560, which should be avoided during processing and crossfading. As part of step 710, the maximum processing length is set to 582 samples. This value is computed, assuming a 5 msec half crossfade length (221 samples at 44.1 kHz), as follows:

Maximum processing length = Event boundary - Crossfade length - Processing splice point location

582 = 2560 - 221 - 1757
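The arithmetic of this example can be reproduced with a few lines of code. This is a sketch under stated assumptions: the function names are hypothetical, rounding the 220.5-sample half crossfade up to 221 is assumed because it matches the figure quoted in the text, and the earlier combined event boundary is taken to be sample 1536 (the listed boundary that lies 221 samples before the splice point at sample 1757).

```python
import math


def half_crossfade_samples(half_crossfade_msec: float = 5.0, sample_rate_hz: int = 44_100) -> int:
    """Half crossfade length in samples, rounded up (assumed rounding)."""
    return math.ceil(half_crossfade_msec * sample_rate_hz / 1000.0)


def splice_point(event_start: int, half_xfade: int) -> int:
    """Place the splice point half a crossfade after the earlier event boundary."""
    return event_start + half_xfade


def max_processing_length(event_end: int, half_xfade: int, splice: int) -> int:
    """Event boundary minus half crossfade length minus splice point location."""
    return event_end - half_xfade - splice


half_xfade = half_crossfade_samples()            # 221
splice = splice_point(1536, half_xfade)          # 1757
limit = max_processing_length(2560, half_xfade, splice)
print(half_xfade, splice, limit)                 # 221 1757 582
```

The 221-sample margin on each side keeps the crossfade from reaching either event boundary.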

The output of step 710 is the boundaries of each combined auditory event, a common splice point in the concurrent data blocks across the channels for each combined auditory event, the psychoacoustic quality ranking of each combined auditory event, crossfade parameter information, and the maximum processing length across the channels for each combined auditory event.

As explained above, a combined auditory event having a low psychoacoustic quality ranking indicates that no data compression or expansion should take place in that segment across the audio channels. For example, as shown in FIG. 26, which considers only a single channel, the audio in events 3 and 4, each 512 samples long, contains predominantly low-frequency content, which is not appropriate for data compression or expansion processing (there is not enough periodicity of the predominant frequencies to be useful). Such events are assigned a low psychoacoustic quality ranking and are skipped.

Skip Based on Complexity 712 (FIG. 17)

Thus, step 712 ("Skip based on complexity?") sets a skip flag when the psychoacoustic quality ranking is low (indicating high complexity). Making this complexity decision before rather than after the correlation processing of step 714, described below, avoids performing needless correlation processing. Step 718, described below, makes a further decision as to whether the audio across the various channels should be processed during a particular combined auditory event segment. Step 718 takes into consideration the length of the target segment in the combined auditory event with respect to the current processing length requirements. The length of the target segment is not known until the common end point is determined in correlation step 714, described below.
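A skip decision of this kind might be sketched as below, using the example definition of a complex segment given earlier (at least one channel satisfies no psychoacoustic criterion, or contains a transient). The class, its field names, and the inclusion of low level as a satisfying criterion are hypothetical choices made for illustration. The text describes a ranking against a hierarchy of criteria, which this simplified boolean test does not reproduce.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ChannelTags:
    """Psychoacoustic tags for one channel within a combined event (hypothetical structure)."""
    contains_transient: bool = False
    masked_by_transient: bool = False
    below_hearing_threshold: bool = False
    high_frequency: bool = False
    low_level: bool = False

    def meets_any_criterion(self) -> bool:
        return (self.masked_by_transient or self.below_hearing_threshold
                or self.high_frequency or self.low_level)


def skip_based_on_complexity(channels: Iterable[ChannelTags]) -> bool:
    """Set the skip flag if any channel contains a transient or satisfies no criterion."""
    return any(ch.contains_transient or not ch.meets_any_criterion() for ch in channels)


# Combined event region 4 of FIG. 24, as characterized in the text
region_4 = [
    ChannelTags(masked_by_transient=True),      # channel 1: transient post-masking
    ChannelTags(low_level=True),                # channel 2
    ChannelTags(low_level=True),                # channel 3
    ChannelTags(below_hearing_threshold=True),  # channel 4
]
# Combined event region 2: channel 1 contains the transient itself
region_2 = [ChannelTags(contains_transient=True)] + [ChannelTags(low_level=True)] * 3
```

Evaluating this flag before any correlation is computed mirrors the ordering of steps 712 and 714 described above.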

Correlation Processing

For each common splice point, an appropriate common end point is needed in order to determine a target segment. If it is decided (step 712) that the input data for the current combined auditory event segment is to be processed, then, as shown in FIG. 17, two types of correlation processing (step 714) take place, consisting of correlation processing of the time-domain data (steps 714-1 and 714-2) and correlation processing of the input signals' phase information (steps 714-3 and 714-4). It is believed that using the combined phase and time-domain information of the input data provides a higher-quality time scaling result for signals ranging from speech to complex music than using time-domain information alone. The details of processing step 714, including its substeps 714-1, 2, 3 and 4, and of the multiple correlation step 716 are essentially the same as described above in connection with step 214 (and its substeps 214-1, 2, 3 and 4) and step 216, except that in steps 714 and 716 the processing is of combined auditory event segments rather than psychoacoustically identified regions.

Alternative Splice Point and End Point Selection Process

As mentioned above, aspects of the invention contemplate an alternative method for selecting a splice point location and a companion end point location. The processes described above choose a splice point somewhat arbitrarily and then choose an end point based on average periodicity (essentially, one degree of freedom). The alternative method described next instead chooses a splice point/end point pair based on the goal of providing the best possible crossfade with minimal audible artifacts through the splice point (two degrees of freedom).

FIG. 27 shows a first step in selecting splice point and end point locations for a single channel of audio, in accordance with the alternative aspect of the invention. In FIG. 27, the signal is composed of three auditory events. Psychoacoustic analysis of the events reveals that event 2 contains a transient that provides temporal masking, predominantly post-masking, which extends into event 3. Event 3 is also the largest event, thereby providing the longest processing region. In order to determine the optimal splice point location, a region of data Tc ("time of crossfade") samples long is correlated against the data in the processing region. The splice point of interest is located in the middle of the Tc splice point region.

The cross-correlation of the splice point region and the processing region results in a correlation measure that is used to determine the best end point (in a manner similar to the first alternative method): the best end point for a particular splice point is determined by finding the maximum correlation value within the calculated correlation function. In accordance with this second alternative method, an optimized splice point/end point pair is determined by correlating a series of trial splice points against the correlation processing regions adjacent to the trial splice points.

As shown in FIGS. 30A-C, this best end point preferably is after a minimum end point. The minimum end point is set so that a minimum number of samples is always processed (added or removed). The best end point preferably is at or before a maximum end point. As shown in FIG. 28, the maximum end point is no closer than half the crossfade length to the end of the event segment being processed. As mentioned above, in the practical implementation described, no auditory event may extend beyond the end of the input block. This is the case for event 3 in FIG. 28, which is limited to the end of the 4096-sample input block.

The value of the correlation function at its maximum between the minimum and maximum end points indicates how similar the splice point is to the optimum end point for that particular splice point. In order to optimize the splice point/end point pair (rather than merely optimizing the end point for a particular splice point), a series of correlations is computed by choosing other Tc-sample splice point regions, each located N samples to the right of the previous region, and recomputing the correlation function as shown in FIG. 28.

The minimum value of N is one sample. However, choosing N to be one sample greatly increases the number of correlations that must be computed, which considerably hinders real-time implementation. As a simplification, N can be set to a larger number of samples, such as Tc samples, the length of the crossfade. This still gives good results and reduces the processing required. FIG. 29 shows conceptually an example of the multiple correlation calculations required when the splice point region is successively advanced by Tc samples. The three processing steps are superimposed on the plot of the audio data block. The processing shown in FIG. 29 results in three correlation functions, each with a maximum value, as shown in FIGS. 30A-C respectively.

As shown in FIG. 30B, the maximum correlation value comes from the second splice point iteration. This means that the second splice point should be used, together with the associated maximum determined by the correlation, which gives the distance from the splice point to the end point.

In performing the correlation, conceptually, the Tc samples are slid to the right, index number by index number, and the corresponding sample values in Tc and in the processing region are multiplied together. The Tc samples are windowed around the trial splice point, with a rectangular window in this example. A window shape that gives more emphasis to the trial splice point and less emphasis to the region spaced from it may provide better results. Initially (no slide, no overlap), the correlation function is zero by definition. It rises and falls until it finally drops to zero again when the sliding has gone so far that there is again no overlap. In practical implementations, FFTs are used to compute the correlations. The correlation functions shown in FIGS. 30A-C are limited to ±1. These values are not the result of any normalization. Normalizing the correlations would discard the relative weighting between the correlations used to choose the best splice point and end point. When determining the best splice point, the un-normalized maximum correlation values between the minimum and maximum processing point locations are compared. The largest of these maximum correlation values indicates the best combination of splice point and end point.
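The trial splice point search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the patent: the function and argument names are hypothetical, the correlation region is assumed to start at the trial splice point (so that the lag of a peak equals the splice-to-end distance), and a direct sum is used where a practical implementation would use FFTs.

```python
def cross_correlation(window, region):
    """Un-normalized correlation of `window` slid across `region`, one value per lag."""
    span = len(region) - len(window) + 1
    return [
        sum(w * region[lag + i] for i, w in enumerate(window))
        for lag in range(span)
    ]


def best_splice_and_end(samples, first_splice, tc, step, trials, min_lag, max_lag):
    """Return (splice_index, end_index, peak) for the largest un-normalized peak.

    Trial splice regions are `tc` samples long (rectangular window) and are
    spaced `step` samples apart. Only lags in [min_lag, max_lag] are accepted
    as end point candidates. All parameter names are illustrative assumptions.
    """
    best = None
    for k in range(trials):
        splice = first_splice + k * step
        window = samples[splice:splice + tc]
        # Assumption: the region searched starts at the trial splice point.
        region = samples[splice:splice + max_lag + tc]
        if len(window) < tc or len(region) < max_lag + tc:
            break  # not enough audio left for this trial
        corr = cross_correlation(window, region)
        for lag in range(min_lag, max_lag + 1):
            # Peaks are compared without normalization, as the text requires.
            if best is None or corr[lag] > best[2]:
                best = (splice, splice + lag, corr[lag])
    return best
```

Setting `step` equal to `tc` corresponds to the simplification in which the splice point region advances by one crossfade length per trial.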

This alternative method of locating the splice point and end point has been described for the case of data compression, in which the end point is after the splice point. However, it is equally applicable to data expansion. For data expansion there are two alternatives. According to the first alternative, an optimized splice point/end point pair is determined as explained above, and the identities of the splice point and end point are then reversed, so that the splice point becomes the end point and the end point becomes the splice point. According to the second alternative, the region around the trial splice points is correlated "backward" rather than "forward" in order to determine an optimized end point/splice point pair in which the end point is "earlier" than the splice point.

Multichannel processing is performed in a manner similar to that described above. After the auditory event regions are combined, the correlations from each channel are combined for each splice point evaluation step, and the combined correlations are used to determine the maximum value and thus the best splice point/end point pair.
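As a rough sketch of the multichannel case, the per-channel correlation functions for one trial splice point can be combined lag by lag and the peak taken from the combined function. The text does not say how the correlations are combined; plain summation is an assumption here, and the function names are hypothetical.

```python
def combine_correlations(per_channel):
    """Combine same-lag correlation values across channels.

    Assumption: the channels are combined by simple summation.
    """
    return [sum(values) for values in zip(*per_channel)]


def best_common_lag(per_channel, min_lag, max_lag):
    """Return (lag, combined_value) of the largest combined correlation in range."""
    combined = combine_correlations(per_channel)
    lag = max(range(min_lag, max_lag + 1), key=lambda i: combined[i])
    return lag, combined[lag]
```

Because one lag is chosen from the combined function, every channel receives the same splice point and end point, which keeps the channels aligned.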

A further reduction in processing can be obtained by decimating the time domain data by a factor of M. This reduces the computational intensity by a factor of ten but yields only a coarse end point (within M samples). Fine tuning can be accomplished after the coarse, decimated processing by performing another correlation, using all of the undecimated audio, to find the best end point to a resolution of one sample.
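The coarse-then-fine search can be illustrated with the sketch below. It is a simplification under stated assumptions: decimation is modelled as plain subsampling (every M-th sample and every M-th lag) with no anti-aliasing filter, the correlation is a direct sum, and all names are hypothetical.

```python
def correlate_at(samples, splice, lag, tc, stride=1):
    """Un-normalized correlation of the splice region with the region `lag` samples later."""
    return sum(
        samples[splice + i] * samples[splice + lag + i]
        for i in range(0, tc, stride)
    )


def coarse_to_fine_end_point(samples, splice, tc, min_lag, max_lag, m):
    """Find an end point with a decimated search followed by a full-rate refinement."""
    # Coarse pass: test every m-th lag using every m-th sample (assumed decimation).
    coarse = max(
        range(min_lag, max_lag + 1, m),
        key=lambda lag: correlate_at(samples, splice, lag, tc, m),
    )
    # Fine pass: test every lag within m samples of the coarse result at full rate.
    low = max(min_lag, coarse - m + 1)
    high = min(max_lag, coarse + m - 1)
    fine = max(
        range(low, high + 1),
        key=lambda lag: correlate_at(samples, splice, lag, tc),
    )
    return splice + fine
```

The coarse pass narrows the search to a neighbourhood of M samples, and the fine pass restores one-sample resolution at a small additional cost.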

A further alternative is to correlate a windowed region around the trial splice point locations against a windowed region around trial end point locations, instead of against a larger un-windowed correlation region. Although cross-correlating a windowed trial splice point region with an un-windowed correlation region is not computationally intensive (such a correlation may be performed in the time domain before conversion to the frequency domain for the remaining correlation computations), cross-correlating two windowed regions in the time domain is computationally demanding.

Although this alternative splice point/end point selection process has been described in the context of an embodiment in which the audio signals are divided into auditory events, the principles of the alternative process are equally applicable to other environments, including the process of FIG. 5. In the FIG. 5 environment, the splice point and end point lie within a psychoacoustically identified region, or an overlap of identified regions, rather than within an auditory event or a combined auditory event.

Event Processing Decision

Returning to the description of FIG. 17, the next step in processing is the event block processing decision step 718 ("Process combined event?"). Because the time scaling process uses the periodicity of the time domain information, or of the time domain and phase information, to process the audio signal data, the output time scaling factor is not linear over time and varies slightly around the requested input time scaling factor. Among other functions, the event processing decision compares how much the preceding data has been time scaled with the requested amount of time scaling. If processing up to the time of this combined auditory event segment exceeds the desired amount of time scaling, the combined auditory event segment is skipped (that is, not processed). However, if the amount of time scaling performed up to this time is below the desired amount, the combined auditory event segment is processed.

Where the combined auditory event segment is to be processed (per step 712), the event processing decision step compares the requested time scaling factor with the output time scaling factor that would be achieved by processing the current combined auditory event segment. The decision step then decides whether to process the current combined auditory event segment of the input data block. Note that the actual processing is applied to a target segment, which is contained within the combined auditory event segment. An example of how this works at the event level for an input block is shown in FIG. 31.

FIG. 31 shows an example in which the overall input block length is 4096 samples. The audio in this block contains three auditory events (or combined auditory events, in which case the figure shows only one of multiple channels), which are 1536, 1024 and 1536 samples long, respectively. As indicated in FIG. 17, each auditory event or combined auditory event is processed individually, so the 1536-sample auditory event at the beginning of the block is processed first. In this example, the splice point and correlation analysis has found that, starting at splice point sample 500, the process can remove or repeat 363 samples of audio (the target segment) with minimal audible artifacts. This provides a time scaling factor of

363 samples / 4096 samples = 8.86%

for the current 4096-sample input block. If these 363 samples of available processing, combined with the processing provided by subsequent auditory events or combined auditory event segments, are greater than or equal to the desired amount of time scaling processing, then processing the first auditory event or combined auditory event segment should be sufficient, and the remaining auditory events or combined auditory event segments in the block are skipped. However, if the 363 samples processed in the first auditory event are not enough to meet the desired amount of time scaling, the second and third events are also considered for processing.
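The block-level decision can be sketched as a simple running tally. This is only an illustration: the greedy in-order policy, the function name, and the target segment lengths of 200 and 300 samples for the second and third events are hypothetical; only the 363-sample segment and the 4096-sample block come from the example above.

```python
def plan_block(target_lengths, block_length, desired_fraction):
    """Process events in order until the achieved time scaling meets the desired amount.

    target_lengths: samples each event could add or remove (its target segment length).
    Returns the indices of the events processed and the achieved fraction.
    """
    processed = []
    changed = 0
    for index, length in enumerate(target_lengths):
        if changed / block_length >= desired_fraction:
            break  # enough scaling achieved; the remaining events are skipped
        processed.append(index)
        changed += length
    return processed, changed / block_length


# First event offers 363 samples; the other two values are made-up placeholders.
events, achieved = plan_block([363, 200, 300], 4096, 0.05)
print(events, round(achieved * 100, 2))  # [0] 8.86
```

With a requested scaling of 5%, the first event alone is sufficient and the other two are skipped; with a requested scaling of 10%, the second event would also be processed.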

Splice and Crossfade Processing 720 (FIG. 5)

Following the determination of the splice point and end point, each combined auditory event that was not rejected by step 712 or step 718 is processed by the "Splice and Crossfade" step 720 (FIG. 17). This step receives each event or combined event data segment, the splice point location, the processing end point and the crossfade parameters. Step 720 operates generally in the manner of step 218 of the process of FIG. 5, described above, except that it acts on auditory events or combined auditory events and the crossfade may be longer.
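For the data compression case, the splice and crossfade operation can be sketched as below. The sketch makes several assumptions that are not taken from the text: the crossfade is linear, it starts at the splice point rather than being centred on it, and the function and argument names are hypothetical.

```python
def splice_with_crossfade(samples, splice, end, fade):
    """Remove the target segment samples[splice:end] and crossfade across the join.

    The audio at the splice point fades out while the audio at the end point
    fades in, over `fade` samples (assumed linear crossfade).
    """
    if not (0 <= splice < end and end + fade <= len(samples)):
        raise ValueError("splice, end and fade must fit inside the segment")
    blended = []
    for i in range(fade):
        gain_in = (i + 1) / (fade + 1)  # rises toward 1 across the crossfade
        blended.append(
            (1 - gain_in) * samples[splice + i] + gain_in * samples[end + i]
        )
    # Output is shorter than the input by the target segment length (end - splice).
    return samples[:splice] + blended + samples[end + fade:]
```

When the end point is one period after the splice point in a periodic signal, the two crossfaded regions are identical and the join leaves the waveform unchanged, which is why the correlation search favours such pairs.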

The crossfade parameter information is affected not only by the presence of a transient event, which allows shorter crossfades to be used, but also by the overall length of the combined auditory event in which the common splice point location is placed. In the practical implementation, the crossfade length is scaled in proportion to the size of the auditory event or combined auditory event segment in which data compression or expansion processing is to take place. As explained above, in the practical embodiment the smallest auditory event allowed is 512 points, and event sizes increase in 512-sample increments up to a maximum equal to the input block size of 4096 samples. The crossfade length is set to 10 msec for the smallest (512-point) auditory event. The crossfade length increases in proportion to the size of the auditory event, up to a maximum of 30 to 35 msec. Such scaling is useful because, as discussed previously, longer crossfades tend to mask artifacts but also cause problems when the audio is changing rapidly. Because the auditory events bound the elements that make up the audio, the crossfading can take advantage of the fact that the audio is predominantly stationary within an auditory event, so longer crossfades can be used without introducing audible artifacts. Although the block sizes and crossfade times mentioned above have been found to provide useful results, they are not critical to the invention.
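One way to realise the proportional scaling of crossfade length is a linear interpolation between the two stated end points. The linear form, the 30 msec ceiling (the lower end of the stated 30 to 35 msec range), the 44.1 kHz sample rate and the function names are all assumptions made for this sketch.

```python
def crossfade_length_ms(event_size, min_size=512, max_size=4096,
                        min_ms=10.0, max_ms=30.0):
    """Crossfade length in msec, scaled linearly with the auditory event size."""
    size = min(max(event_size, min_size), max_size)  # clamp to the allowed range
    return min_ms + (max_ms - min_ms) * (size - min_size) / (max_size - min_size)


def crossfade_length_samples(event_size, sample_rate=44100):
    """Crossfade length in samples at an assumed sample rate."""
    return round(crossfade_length_ms(event_size) * sample_rate / 1000)
```

Under these assumptions, a 512-sample event gets a 10 msec crossfade and a full 4096-sample event gets 30 msec.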

Pitch Scaling Processing 722 (FIG. 5)

Following the splice/crossfade processing of the combined auditory events, decision step 722 ("Pitch scale?") is checked to determine whether pitch shifting is to be performed. As discussed previously, time scaling cannot be performed in real time because of block underflow or overflow. Pitch scaling can be performed in real time because of the resampling step 724 ("Resample all data blocks"). The resampling step resamples the time-scaled input signal, producing a pitch-scaled signal that has the same time evolution as the input signal but altered spectral information. For real-time implementations, the resampling may be performed by a dedicated hardware sample-rate converter to reduce the computational requirements.
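The effect of the resampling step can be illustrated with a very simple resampler that stretches a time-scaled block back to the original block length, so that the duration is restored and the pitch changes instead. Linear interpolation is an assumption chosen for brevity; a practical system would use a band-limited sample-rate converter, and the function name is hypothetical.

```python
def resample_linear(samples, out_length):
    """Resample `samples` to `out_length` points using linear interpolation."""
    if out_length < 2 or len(samples) < 2:
        raise ValueError("need at least two input and two output samples")
    step = (len(samples) - 1) / (out_length - 1)
    out = []
    for n in range(out_length):
        pos = n * step
        i = min(int(pos), len(samples) - 2)  # index of the left neighbour
        frac = pos - i
        out.append((1 - frac) * samples[i] + frac * samples[i + 1])
    return out
```

For example, a 4096-sample block that was compressed to 3733 samples by removing a 363-sample target segment can be resampled back to 4096 samples, which lowers the pitch while keeping the original duration.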

Following the pitch scaling decision and possible resampling, all processed input data blocks are output either to a file, for non-real-time operation, or to an output data buffer, for real-time operation ("Output processed data blocks") (step 726). The process flow then checks for additional input data ("Input data?") and continues processing.

It should be understood that implementations of other variations and modifications of the invention and its various aspects will be apparent to those skilled in the art, and that the invention is not limited to the specific embodiments described. The present invention is therefore contemplated to cover any and all modifications, variations, or equivalents that fall within the spirit and scope of the basic underlying principles disclosed and claimed herein.

The present invention and its various aspects may be implemented as software functions executed on a digital signal processor, a programmed general-purpose digital computer, and/or a special-purpose digital computer. Interfaces between analog and digital signal streams may be implemented in appropriate hardware, and the functions in software and/or firmware.

Claims

A method of time scaling or pitch shifting an audio signal, comprising:

dividing the audio signal into auditory events; and

time scaling or pitch shifting processing within an auditory event,

wherein dividing the audio signal into auditory events comprises:

calculating the spectral content of successive time segments of the audio signal;

calculating the difference in spectral content between successive time segments of the audio signal; and

setting an auditory event boundary at the boundary between successive time segments when the difference in spectral content between the successive time segments exceeds a threshold.

The method of claim 1, wherein the time scaling or pitch shifting processing comprises:

selecting a splice point and an end point within the auditory event;

deleting a portion of the audio signal beginning at the splice point, or repeating a portion of the audio signal ending at the splice point; and

reading out the resulting audio signal at a rate that yields a desired time scaling or pitch shifting;

selecting a splice point within the auditory event, thereby defining a leading segment of the audio signal that leads the splice point;

selecting an end point within the auditory event, spaced from the splice point, thereby defining a trailing segment of the audio signal that trails the end point and a target segment of the audio signal between the splice point and the end point;

joining the leading and trailing segments at the splice point, thereby decreasing the number of audio signal samples by omitting the target segment when the end point has a higher sample number than the splice point, or increasing the number of samples by repeating the target segment when the end point has a lower sample number than the splice point; and

reading out the joined leading and trailing segments at a rate that yields a desired time scaling or pitch shifting;

The method of claim 3, wherein the joined leading and trailing segments are read out at a rate that:

pitch shifts the audio signal with a time duration the same as the original time duration;

when the target segment is omitted, time compresses the audio signal with a time duration reduced in the same proportion as the relative change in the reduction in the number of samples;

when the target segment is repeated, time expands the audio signal with a time duration increased in the same proportion as the relative change in the increase in the number of samples;

time compresses and pitch shifts the audio signal with a time duration reduced in a proportion different from the relative change in the reduction in the number of samples; or

time expands and pitch shifts the audio signal with a time duration increased in a proportion different from the relative change in the increase in the number of samples.

The method of claim 3, wherein joining the leading and trailing segments at the splice point includes crossfading the leading and trailing segments.

The method of claim 3, wherein, when the number of audio signal samples is decreased by omitting the target segment, the end point is selected by autocorrelating a segment of the audio trailing the splice point.

The method of claim 3, wherein, when the number of audio signal samples is increased by repeating the target segment, the end point is selected by cross-correlating a segment of the audio leading the splice point with a segment of the audio trailing the splice point.

The method of claim 3, wherein, when the number of audio signal samples is decreased by omitting the target segment or increased by repeating the target segment, the splice point location and the end point location are selected by:

correlating a window of audio samples around each of a series of trial splice point locations against a region of audio samples adjacent to the respective trial splice point location; and

determining the trial splice point location that results in the strongest correlation, designating that trial splice point location as the splice point location, and setting the end point location at the location of the strongest correlation.

The method of claim 8, wherein the window is a rectangular window.

The method of claim 9, wherein the window has the width of the crossfade.

The method of claim 8, wherein the series of trial splice point locations are spaced apart by one or more audio samples.

The method of claim 11, wherein the series of trial splice point locations are spaced apart by the width of the window.

The method of claim 12, wherein the window has the width of the crossfade.

The method of claim 8, wherein, when the number of audio samples is decreased by omitting the target segment, the region of audio samples adjacent to each of the series of trial splice point locations follows the respective trial splice point location, whereby the splice point precedes the end point.

The method of claim 8, wherein, when the number of audio samples is increased by repeating the target segment, the region of audio samples adjacent to each of the series of trial splice point locations follows the respective trial splice point location and the identities of the splice point and end point are reversed, whereby the end point precedes the splice point.

The method of claim 8, wherein, when the number of audio samples is increased by repeating the target segment, the region of audio samples adjacent to each of the series of trial splice point locations precedes the respective trial splice point location, whereby the end point precedes the splice point.

The method of claim 3, wherein, when the number of audio signal samples is decreased by omitting the target segment or increased by repeating the target segment, the splice point location and the end point location are selected by:

correlating a window of audio samples around each of a series of trial splice point locations against a region of audio samples adjacent to the respective trial splice point location, the audio samples being decimated by a factor of M;

determining the trial splice point location that results in the strongest correlation and designating that trial splice point location as the decimated splice point;

correlating a window of undecimated audio samples around each of a second series of trial splice point locations, within M samples of the decimated splice point, against a region of undecimated audio samples adjacent to the respective trial splice point location of the second series; and

determining the trial splice point location in the second series that results in the strongest correlation, designating that trial splice point location as the splice point, and setting the end point location at the location of the strongest correlation.

A method of time scaling or pitch shifting a plurality of audio signal channels, comprising:

dividing the audio signal in each channel into auditory events;

determining combined auditory events, each having a boundary where an auditory event boundary occurs in an audio signal channel; and

time scaling or pitch shifting processing all of the audio signal channels within a combined auditory event, such that the processing is within an auditory event or a portion of an auditory event in each channel,

wherein dividing the audio signal in each channel into auditory events comprises:

calculating the spectral content of successive time segments of the audio signal in each channel;

calculating the difference in spectral content between successive time segments of the audio signal in each channel; and

setting an auditory event boundary in the audio signal of each channel at the boundary between successive time segments when the difference in spectral content between the successive time segments exceeds a threshold.

The method of claim 18, wherein the time scaling or pitch shifting processing comprises:

selecting a common splice point within a combined auditory event among the channels of audio signals, whereby the splice points resulting from the one or more common splice points in each of the multiple channels of audio signals are aligned with one another;

deleting a portion of each channel of audio signals beginning at the common splice point, or repeating a portion of each channel of audio signals ending at the common splice point; and

reading out the resulting channels of audio signals at a rate that yields a desired time scaling or pitch shifting processing for the multiple channels of audio;

selecting a common splice point within a combined auditory event among the channels of audio signals, whereby the splice points resulting from the common splice point in the multiple channels of audio signals are aligned with one another, each splice point defining a leading segment of the audio signal that leads the splice point;

selecting a common end point within the auditory event, spaced from the common splice point, whereby the end points resulting from the common end point in each of the multiple channels of audio signals are aligned with one another, thereby defining a trailing segment of the audio signal that trails the end point and a target segment of the audio signal between the splice point and the end point;

joining the leading and trailing segments at the end point in each of the channels of audio signals, thereby decreasing the number of audio signal samples by omitting the target segment when the end point has a higher sample number than the splice point, or increasing the number of audio signal samples by repeating the target segment when the end point has a lower sample number than the splice point; and

reading out the joined leading and trailing segments in each of the channels of audio signals at a rate that yields a desired time scaling or pitch shifting for the multiple channels of audio;

The method of claim 20, wherein said joined leading and trailing segments are read out at a rate that:

when the target segment is omitted, time compresses the audio signal into a time period reduced in the same proportion as the relative decrease in the number of samples,

when the target segment is repeated, time expands the audio signal into a time period increased in the same proportion as the relative increase in the number of samples,

time compresses and pitch shifts the audio signal into a time period reduced in a proportion different from the relative decrease in the number of samples, or

time expands and pitch shifts the audio signal into a time period increased in a proportion different from the relative increase in the number of samples.
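The relationship between sample count, read-out rate, duration, and pitch recited above can be illustrated with a little arithmetic. The following is a minimal sketch, not part of the claims; the function name, parameter names, and numeric values are assumptions chosen for illustration.

```python
def playback_effect(n_before, n_after, rate_before, rate_after):
    """Return (duration_ratio, pitch_ratio) for edited audio read at a new rate.

    n_before / n_after are sample counts before and after omitting or
    repeating target segments; rate_before / rate_after are the original
    and read-out sample rates. All names are illustrative assumptions.
    """
    duration_ratio = (n_after / rate_after) / (n_before / rate_before)
    pitch_ratio = rate_after / rate_before
    return duration_ratio, pitch_ratio


# Omit 10% of the samples and read at the original rate:
# the duration shrinks in the same proportion and the pitch is unchanged.
same_rate = playback_effect(48000, 43200, 48000, 48000)

# Omit 10% of the samples and read at a proportionally lower rate:
# the duration is restored and the pitch is lowered instead.
slower_rate = playback_effect(48000, 43200, 48000, 43200)
```

Reading at any intermediate rate would change both the duration and the pitch, which corresponds to the "different proportion" alternatives in the claim.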

A method for time scaling or pitch shifting an audio signal, comprising:

dividing said audio signal into auditory events;

analyzing said auditory events using a psychoacoustic criterion to identify those auditory events in which time scaling or pitch shifting processing of the audio signal would be inaudible or minimally audible; and

time scaling or pitch shifting processing within an auditory event identified as one in which the time scaling or pitch shifting processing of the audio signal would be inaudible or minimally audible;

calculating the spectral content of successive time segments of said audio signal;

calculating the difference in spectral content between successive time segments of said audio signal; and

setting an auditory event boundary at the boundary between successive time segments when the difference in spectral content between those time segments exceeds a threshold;
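The three spectral steps above (compute the spectrum of each time segment, take the difference between successive segments, compare it with a threshold) can be sketched as follows. This is a hedged illustration only: the naive DFT, the peak normalization, the sum-of-absolute-differences measure, the block size, and the threshold value are all assumptions, as are the function names.

```python
import cmath
import math


def magnitude_spectrum(block):
    """Magnitude of a naive DFT for bins 0..N/2 (adequate for short blocks)."""
    n = len(block)
    return [
        abs(sum(block[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def event_boundaries(signal, block_size, threshold):
    """Return sample indices where successive block spectra differ by more than threshold."""
    spectra = []
    for start in range(0, len(signal) - block_size + 1, block_size):
        mag = magnitude_spectrum(signal[start:start + block_size])
        peak = max(mag) or 1.0  # assumption: normalize so level changes alone matter less
        spectra.append([m / peak for m in mag])

    boundaries = []
    for i in range(1, len(spectra)):
        diff = sum(abs(a - b) for a, b in zip(spectra[i], spectra[i - 1]))
        if diff > threshold:
            boundaries.append(i * block_size)  # boundary between block i-1 and block i
    return boundaries


# Example: a tone that changes frequency after four blocks.
block = 32
signal = [math.sin(2 * math.pi * 2 * t / block) for t in range(4 * block)]
signal += [math.sin(2 * math.pi * 8 * t / block) for t in range(4 * block)]
found = event_boundaries(signal, block, threshold=0.5)
```

In this example the only boundary reported is at the sample where the tone changes, since blocks within each tone have essentially identical normalized spectra.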

The method of claim 22, wherein said psychoacoustic criterion is one or more of a group of psychoacoustic criteria.

The method of claim 23, wherein said psychoacoustic criteria include one or more of the following:

the identified region of said audio signal is premasked or postmasked as the result of a transient;

the identified region of said audio signal is inaudible;

the identified region of said audio signal has more high-frequency content than low-frequency content; and

the identified region of said audio signal is a portion of a segment of the audio signal that is quieter than the portion of the segment preceding or following the identified region.

The method of claim 22, wherein said time scaling or pitch shifting processing comprises:

selecting a splice point in said auditory event;

reading out the resulting audio signal at a rate that yields a desired time scaling or pitch shifting;

selecting a splice point in said auditory event, thereby defining a leading segment of the audio signal that leads the splice point;

selecting an end point in said auditory event spaced from said splice point, thereby defining a trailing segment of the audio signal that trails the end point and a target segment of the audio signal between the splice point and the end point;

joining said leading and trailing segments at said splice point, thereby decreasing the number of audio signal samples by omitting the target segment when the end point has a higher sample number than said splice point, or increasing the number of audio signal samples by repeating the target segment when the end point has a lower sample number than said splice point; and

The method of claim 26, wherein said joined leading and trailing segments are

read out at a rate that time expands and pitch shifts the audio signal into a time period increased in a proportion different from the relative increase in the number of samples.

The method of claim 26, wherein joining said leading and trailing segments at the splice point includes crossfading the leading and trailing segments.
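The joining step with a crossfade can be sketched as below. This is a minimal illustration under stated assumptions: the linear fade shape, the placement of the fade region immediately after the splice point and the end point, and the function and parameter names are not taken from the claims.

```python
def splice_with_crossfade(samples, splice_point, end_point, fade_len):
    """Join the audio leading splice_point to the audio trailing end_point.

    If end_point > splice_point, the target segment between them is omitted
    (fewer samples). If end_point < splice_point, it is repeated (more samples).
    Over fade_len samples the audio after the splice point fades out while
    the audio after the end point fades in (linear fade: an assumption).
    """
    n = len(samples)
    if not (0 <= splice_point and splice_point + fade_len <= n):
        raise ValueError("splice point and fade region must lie inside the signal")
    if not (0 <= end_point and end_point + fade_len <= n):
        raise ValueError("end point and fade region must lie inside the signal")

    blended = []
    for i in range(fade_len):
        w = (i + 1) / (fade_len + 1)  # fade-in weight, strictly between 0 and 1
        blended.append((1 - w) * samples[splice_point + i] + w * samples[end_point + i])

    return samples[:splice_point] + blended + samples[end_point + fade_len:]


# With no crossfade the effect on the sample order is easy to see.
shorter = splice_with_crossfade(list(range(10)), 3, 6, 0)  # omits samples 3..5
longer = splice_with_crossfade(list(range(10)), 6, 3, 0)   # repeats samples 3..5
```

The output length is always the input length minus `end_point - splice_point`, so the same routine covers both the omit case and the repeat case.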

The method of claim 26, wherein, when the number of audio signal samples is decreased by omitting the target segment, said end point is selected by autocorrelating a segment of audio that trails the splice point.
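One way to picture correlation-based end point selection for the omit case is to compare a short window of audio starting at the splice point with equally long windows at later positions, and to take the best-matching position as the end point. The sketch below assumes a normalized correlation score and a bounded lag search; these choices, and all names, are illustrative assumptions rather than the claimed procedure.

```python
import math


def normalized_correlation(a, b):
    """Correlation of two equal-length windows, scaled to the range [-1, 1]."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0


def best_end_point(samples, splice_point, window, min_lag, max_lag):
    """Return the position after splice_point whose window best matches the splice window."""
    ref = samples[splice_point:splice_point + window]
    best_lag, best_score = None, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        cand = samples[splice_point + lag:splice_point + lag + window]
        if len(cand) < window:
            break  # ran past the end of the signal
        score = normalized_correlation(ref, cand)
        if score > best_score:
            best_lag, best_score = lag, score
    if best_lag is None:
        raise ValueError("no candidate end point fits inside the signal")
    return splice_point + best_lag


# Example: a sine wave with a 20-sample period; the best match is one period later.
wave = [math.sin(2 * math.pi * t / 20) for t in range(200)]
end_point = best_end_point(wave, splice_point=30, window=16, min_lag=5, max_lag=35)
```

Choosing an end point where the waveform matches the audio at the splice point keeps the joined signal continuous, which is what makes the omitted segment hard to hear.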

The method of claim 26, wherein, when the number of audio signal samples is increased by repeating the target segment, said end point is selected by cross-correlating a segment of audio that leads the splice point with a segment of audio that trails the splice point.

The method of claim 26, wherein, when the number of audio signal samples is decreased by omitting the target segment or increased by repeating the target segment, the splice point location and the end point location are determined by

correlating a window of audio samples around each of a series of trial splice point locations against a region of audio samples adjacent to each of the series of trial splice point locations, and

The method of claim 31, wherein said window is a rectangular window.

The method of claim 32, wherein said window has the width of the crossfade.

The method of claim 31, wherein the series of trial splice point locations are spaced apart by one or more audio samples.

The method of claim 34, wherein the series of trial splice point locations are spaced apart by the width of said window.

The method of claim 35, wherein said window has the width of the crossfade.

The method of claim 31, wherein, when the number of audio signal samples is decreased by omitting the target segment, the region of audio samples adjacent to each of the series of trial splice point locations follows each trial splice point location, whereby the splice point precedes the end point.

The method of claim 31, wherein, when the number of audio samples is increased by repeating the target segment, the region of audio samples adjacent to each of the series of trial splice point locations follows each trial splice point location and the identities of the splice point and the end point are reversed, whereby the end point precedes the splice point.

The method of claim 31, wherein, when the number of audio samples is increased by repeating the target segment, the region of audio samples adjacent to each of the series of trial splice point locations precedes each trial splice point location, whereby the end point precedes the splice point.

The method of claim 26, wherein, when the number of audio signal samples is decreased by omitting the target segment or increased by repeating the target segment, the splice point location and the end point location are determined by

correlating a window of audio samples around each of a series of trial splice point locations against a region of audio samples adjacent to each of the series of trial splice point locations, the audio samples being decimated by a factor of M,

correlating a window of undecimated audio samples around each of a second series of trial splice point locations, lying within M samples of the decimated splice point, against a region of undecimated audio samples adjacent to each of the second series of trial splice point locations, and

A method for time scaling or pitch shifting multiple channels of audio signals, comprising:

dividing the audio signal in each channel into auditory events;

analyzing said auditory events using a psychoacoustic criterion to identify those auditory events in which time scaling or pitch shifting processing of the audio signal would be inaudible or minimally audible;

determining combined auditory events, each having a boundary where an auditory event boundary occurs in the audio signal of a channel; and

time scaling or pitch shifting processing within an auditory event identified as one in which time scaling or pitch shifting processing in the multiple channels of audio signals would be inaudible or minimally audible;

calculating the difference in spectral content between successive time segments of the audio signal in each channel; and

The method of claim 41, wherein a combined auditory event is identified as one in which time scaling or pitch shifting processing of the multiple channels of audio would be inaudible or minimally audible, based on the psychoacoustic characteristics of the audio in each of the multiple channels during the time segment of the combined auditory event.

The method of claim 42, wherein a psychoacoustic quality ranking of the combined auditory event is determined by applying a hierarchy of psychoacoustic criteria to the audio in each of the channels during the combined auditory event.
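The combined auditory events used in the multichannel claims above have a boundary wherever a boundary occurs in a channel. One plausible reading, sketched here as an assumption, is that the combined boundaries are the union of the per-channel boundaries; the function names and the sample values are hypothetical.

```python
def combined_event_boundaries(per_channel_boundaries):
    """Merge per-channel boundary positions into one sorted list without duplicates."""
    merged = set()
    for boundaries in per_channel_boundaries:
        merged.update(boundaries)
    return sorted(merged)


def as_events(boundaries):
    """Turn consecutive boundaries into (start, end) sample ranges."""
    return list(zip(boundaries, boundaries[1:]))


# Example: two channels whose events change at different times.
left = [0, 512, 2048]
right = [0, 1024, 2048]
combined = combined_event_boundaries([left, right])
events = as_events(combined)
```

Each resulting range is a stretch of time during which no channel crosses an event boundary, so a common splice point and end point chosen inside one range stay within a single auditory event in every channel.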

The method of claim 41, wherein said time scaling or pitch shifting processing comprises:

selecting a common splice point within an identified combined auditory event across the channels of audio signals, so that the splice points resulting from said common splice point in each of the multiple channels of audio signals are aligned with one another;

reading out the resulting channels of audio signals at a rate that yields a desired time scaling or pitch shifting for the multiple channels of audio;

selecting a common splice point within an identified combined auditory event across the channels of audio signals, so that the splice points resulting from said common splice point in each of the multiple channels of audio signals are aligned with one another, each splice point defining a leading segment of the audio signal that leads the splice point;

selecting, within said combined auditory event, a common end point spaced from said common splice point, so that the end points resulting from said common end point in each of the multiple channels of audio signals are aligned with one another, thereby defining a trailing segment of the audio signal that trails the end point and a target segment of the audio signal between the splice point and the end point;

joining said leading and trailing segments at said splice point in each of the channels of audio signals, thereby decreasing the number of audio signal samples by omitting the target segment when the end point has a higher sample number than said splice point, or increasing the number of samples by repeating the target segment when the end point has a lower sample number than said splice point; and

reading out the joined leading and trailing segments in each of the channels of audio signals at a rate that yields a desired time scaling or pitch shifting for the multiple channels of audio;

the identified region of said audio signal is a portion of a segment of the audio signal that is quieter than both the portion of the segment preceding the identified region and the portion of the segment following the identified region;

(Deleted)