KR20140128567A

KR20140128567A - Audio signal processing method

Info

Publication number: KR20140128567A
Application number: KR1020130047062A
Authority: KR
Inventors: 송명석; 오현오
Original assignee: 인텔렉추얼디스커버리 주식회사
Priority date: 2013-04-27
Filing date: 2013-04-27
Publication date: 2014-11-06
Also published as: KR102148217B1

Abstract

According to an aspect of the present invention, an audio signal processing method comprises the steps of: receiving user position information and first position information of a screen; receiving a sound source signal and sound source position information; receiving second position information of the screen, wherein the second position is changed from the first position; obtaining a corrected value of the sound source position information by using the sound source position information and the second position information; generating a filter coefficient using the corrected value of the sound source position information; and generating a rendered channel signal using the generated filter coefficient and the sound source signal.

Description

Position-based audio signal processing method {Audio signal processing method}

The present invention relates to a method and apparatus for processing an object audio signal and, more particularly, to a method and apparatus for encoding and decoding an object audio signal or rendering it in three-dimensional space.

3D audio collectively refers to a series of signal processing, transmission, encoding, and reproduction technologies that provide immersive sound in a literally three-dimensional space by adding another axis (dimension) in the height direction to the horizontal-plane (2D) sound scene provided by conventional surround audio. In particular, providing 3D audio widely requires rendering techniques that either use more speakers than before or, even with fewer speakers, form a sound image at a virtual position where no speaker exists.

3D audio is expected to become the audio solution for upcoming ultra-high-definition TVs (UHDTV), and to find a wide range of applications, including sound in vehicles, which are evolving into high-quality infotainment spaces, as well as theater sound, personal 3DTV, tablets, smartphones, and cloud games.

MPEG-H can be regarded as a multimedia codec standard for the large-display environment of UHDTV. 22 channels are being considered as sound with the presence and immersion suited to a large display. At the same time, whereas the viewing angle in the existing HDTV environment was around 30 degrees, the viewing angle reaches about 100 degrees when a UHDTV is installed at the same viewing distance, and this gives rise to a new problem called audio-visual (AV) misalignment.

AV misalignment refers to the situation in which it becomes difficult to achieve spatial alignment between an object on the screen and its corresponding audio object, because the front speakers are spaced farther apart as the viewing angle widens, the viewer's position changes, and objects on the screen move over a wider range. This raises the new problem of having to match lip-synchronization spatially.

The layout first proposed by NHK strictly specifies the positions of the speakers and the TV screen, and the positions of object sound sources are calculated relative to them. This way of determining the relative positions of the TV screen, the speakers, and the object sound sources is called the screen-centric coordinate convention. FIG. 20 shows a usage example in which, in accordance with the NHK specification, the height of the user's ears exactly matches the middle layer. FIG. 20 shows the actual user (2050) on the right of the screen, the object image (2040) on the screen, and the 3D sound source (2060) to be localized on the left of the screen. Among the upper layer (210), middle layer (220), and lower layer (230) that make up the speaker layout, the user is located at the middle layer (220). In this case, the position of the object reproduced by the 3D audio sound field coincides with the object position intended by the content provider.

In a real usage environment, however, the ear (or eye) height of a typical user is likely not to match the height of the mid layer (220) of the 22.2-channel layout, in which case the sound image is not localized at the intended position. FIG. 21 illustrates this problem with the screen-centric layout. Assuming an off-sweet-spot situation in which the height of the user's ears (2110) is lower than the mid layer, the object position (2120) reproduced by the 3D audio sound field ends up higher than the object position (160) intended by the content provider, so synchronization with the image fails.

Moreover, it is well known that the positions of the TV screen and speakers are typically highly irregular, depending on the structure of the living room and the arrangement of furniture. The more the device layout assumed when the content was designed differs from that of the usage environment, the more complicated it becomes to specify the irregular variables, and as a result, as shown in FIG. 22, a synchronized presentation of image and 3D audio becomes nearly impossible. In other words, reproducing 3D audio synchronized with the image requires a rendering technique that takes the various variables (that is, the positions of the sound source, screen, speakers, and user) into account.

To deliver the sound scene intended by the content creator even when devices are placed at irregular positions, it is necessary to know the speaker environment of each user's playback setup, which differs from user to user, and a rendering technique is needed to compensate for the deviation from the positions defined by the standard. Such a technique for raising the quality of 3D audio to its best is called flexible rendering.

FIG. 23 shows a problem that ordinary flexible rendering can run into for an irregular TV screen position, together with an example of the correction method according to the present invention. According to FIG. 23, when the screen height changes by an amount h (2340), ordinary rendering compensates by raising the object sound source by the same amount h (2340) for playback. This is not an accurate correction, however: for the sound to feel integrated with the picture the user sees, the object sound source must actually be rendered at the height h' (2350). In addition, because the position of an object sound source is generally defined in a spherical coordinate system, simply adding h is not sufficient. The present invention is not limited to changes in screen height, but covers changes in the relative position (front-back, left-right, up-down) of every device needed for AV alignment.
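The gap between h and h' can be illustrated with a similar-triangles model. The sketch below is an assumption made for illustration, not the correction defined by the invention: it supposes that the sound source should stay on the line of sight from the user's ear through the shifted on-screen image, so that a source at horizontal distance `d_obj` behind a screen at distance `d_screen` moves by h' = h * d_obj / d_screen. The function names and the coordinate convention (x forward, y left, z up, user at the origin) are hypothetical.

```python
import math

def corrected_height_shift(h, d_screen, d_obj):
    """Height shift h' that keeps a source at horizontal distance d_obj on the
    line of sight through an on-screen image raised by h (assumed model)."""
    return h * d_obj / d_screen

def to_spherical(x, y, z):
    """Cartesian (x forward, y left, z up) -> (azimuth_deg, elevation_deg, radius)."""
    r = math.sqrt(x * x + y * y + z * z)
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.asin(z / r)) if r > 0 else 0.0
    return azimuth, elevation, r

# Assumed example: screen 2 m away and raised by 0.3 m; the object was
# intended to sit 4 m away, straight ahead at ear height.
h = 0.3
h_prime = corrected_height_shift(h, d_screen=2.0, d_obj=4.0)

naive = to_spherical(4.0, 0.0, h)            # simply adding h to the height
corrected = to_spherical(4.0, 0.0, h_prime)  # depth-scaled shift h'
screen_elevation = math.degrees(math.atan2(h, 2.0))  # where the image now appears
```

Under these assumptions the naive shift leaves the source at a lower elevation than the on-screen image, while the depth-scaled shift gives the same elevation angle as the image, which is also why the correction is easier to reason about in spherical coordinates.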

Regarding the mismatch between the sound scene and the playback environment, in contrast to NHK's screen-centric coordinate method, which fixes the positions of the speakers and the screen, Auro3D has proposed an angle-centric method. Auro3D's angle-centric method sets coordinates relative to the sound scene and allows the speaker positions to be arbitrary. It is therefore particularly useful in environments where, for example, speaker positions differ from theater to theater. Since the AV alignment correction method necessarily varies with the coordinate convention, a 3D audio standard needs to state which coordinate reference was used when the content was generated. A 3D audio standard may also need to provide tools for coping with environments that have different speaker layouts.

According to one aspect of the present invention, there may be provided an audio signal processing method comprising: receiving user position information and first position information of a screen; receiving an object sound source signal and object sound source position information; receiving second position information of the screen, the second position information being information on a position changed from the first position; obtaining a corrected value of the object sound source position information using the object sound source position information and the second position information; generating a filter coefficient using the corrected value of the object sound source position information; and generating a rendered channel signal using the generated filter coefficient and the object sound source signal.

The present invention provides a way to overcome the limitation of existing 3D audio technology, whose performance degrades markedly when the various variables (that is, the positions of the sound source, screen, speakers, and user) change. According to the present invention, the correction process makes it possible to reproduce a high-quality audio signal for the user.

FIG. 1 is a diagram for explaining viewing angles according to image size at the same viewing distance.
FIG. 2 is a speaker layout diagram of 22.2ch as an example of multi-channel audio.
FIG. 3 is a conceptual diagram showing the position of each sound object in the listening space where a listener listens to 3D audio.
FIG. 4 is an exemplary diagram in which object signal groups are formed for the objects shown in FIG. 3 using the grouping method according to the present invention.
FIG. 5 is a block diagram of an embodiment of an encoder for object audio signals according to the present invention.
FIG. 6 is an exemplary block diagram of a decoding apparatus according to an embodiment of the present invention.
FIG. 7 shows an embodiment of a bitstream generated by encoding with the encoding method according to the present invention.
FIG. 8 is a block diagram showing an embodiment of an object and channel signal decoding system according to the present invention.
FIG. 9 is a block diagram of another form of object and channel signal decoding system according to the present invention.
FIG. 10 shows an embodiment of a decoding system according to the present invention.
FIG. 11 is a diagram for explaining masking thresholds for multiple object signals according to the present invention.
FIG. 12 shows an embodiment of an encoder that calculates masking thresholds for multiple object signals according to the present invention.
FIG. 13 is a diagram for explaining, for a 5.1-channel setup, placement according to the ITU-R recommendation and placement at arbitrary positions.
FIG. 14 shows the structure of an embodiment in which a decoder for an object bitstream according to the present invention is connected to a flexible rendering system using it.
FIG. 15 shows the structure of another embodiment implementing decoding and rendering of an object bitstream according to the present invention.
FIG. 16 is a diagram showing a structure that determines and transmits a transmission plan between the decoder and the renderer.
FIG. 17 is a conceptual diagram for explaining the concept of reproducing, using the surrounding channels, the front speakers that are absent because of the display in a 22.2-channel system.
FIG. 18 shows an embodiment of a processing method for placing a sound source at the position of an absent speaker according to the present invention.
FIG. 19 shows an embodiment of mapping the signals generated in each band to speakers placed around the TV.
FIG. 20 shows a usage example in which the system is used in accordance with the specification.
FIG. 21 shows an example of the problem caused by a change in user position.
FIG. 22 shows an example of the problem caused by a change in speaker position.
FIG. 23 shows a problem that flexible rendering can run into for an irregular TV screen position, and an example of the correction method according to the present invention.
FIG. 24 is a figure showing the change in object sound source position according to a change in screen height.
FIG. 25 is a figure showing the change in object sound source position according to a change in the left-right position of the screen.
FIG. 26 is a conceptual diagram of the correction method proposed by the present invention.
FIG. 27 is a flowchart of the present invention.
FIG. 28 shows a method of changing the position of an object sound source according to a change in the front-back position of the screen according to the present invention.
FIG. 29 is a diagram showing the relationship between products in which an audio signal processing apparatus according to an embodiment of the present invention is implemented.

The embodiments described in this specification are intended to explain the idea of the present invention clearly to those of ordinary skill in the art to which the invention pertains. The present invention is therefore not limited to the embodiments described herein, and its scope should be construed to include modifications and variations that do not depart from the idea of the invention.

The terms used in this specification and the accompanying drawings are intended to make the invention easy to explain, and the shapes shown in the drawings are exaggerated where necessary to aid understanding. The present invention is therefore not limited by the terms used in this specification or by the accompanying drawings.

Where a detailed description of a known configuration or function related to the present invention is judged likely to obscure the gist of the invention, that description is omitted as necessary.

In the present invention, the following terms may be interpreted according to the following criteria, and terms not listed may also be interpreted in the same spirit. Coding may be interpreted as encoding or decoding depending on context, and information is a term that covers values, parameters, coefficients, elements, and the like; its meaning may be interpreted differently depending on context, but the present invention is not limited thereto.

According to one aspect of the present invention, there may be provided an audio signal processing method comprising: receiving user position information and first position information of a screen; receiving an object sound source signal and object sound source position information; receiving second position information of the screen, the second position information being information on a position changed from the first position; obtaining a corrected value of the object sound source position information using the object sound source position information and the second position information; generating a filter coefficient using the corrected value of the object sound source position information; and generating a rendered channel signal using the generated filter coefficient and the object sound source signal.
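The order of the claimed steps can be sketched as a small pipeline. Everything concrete here is an assumption chosen to keep the example short: the correction rule (shift the object azimuth by the change in the screen-centre azimuth seen from the user), the use of constant-power stereo panning gains as a stand-in for the filter coefficients, and all function and variable names. The specification itself mentions HRTF information as one form the filter coefficients may take.

```python
import math

def correct_azimuth(obj_az_deg, user, screen_old, screen_new):
    """Shift the object azimuth by the change in screen-centre azimuth as seen
    from the user (hypothetical rule; 2D, x forward, y left, positive = left)."""
    def azimuth(p):
        return math.degrees(math.atan2(p[1] - user[1], p[0] - user[0]))
    return obj_az_deg + azimuth(screen_new) - azimuth(screen_old)

def pan_gains(az_deg, spread_deg=30.0):
    """Constant-power gains for speakers at +/- spread_deg (stand-in for the
    filter coefficients of the method)."""
    a = max(-spread_deg, min(spread_deg, az_deg))
    theta = (a + spread_deg) / (2 * spread_deg) * (math.pi / 2)  # 0 = full right
    return math.sin(theta), math.cos(theta)  # (left gain, right gain)

def render(signal, gains):
    """Apply the coefficients to the object signal to get channel signals."""
    left, right = gains
    return [left * s for s in signal], [right * s for s in signal]

# Steps 1-3: user position, first and second screen positions, object input.
user = (0.0, 0.0)
screen_first = (2.0, 0.0)
screen_second = (2.0, 0.5)          # screen moved 0.5 m to the left
obj_signal = [1.0, 0.5, -0.25]
obj_azimuth = 0.0

# Step 4: corrected object position; step 5: coefficients; step 6: rendering.
corrected = correct_azimuth(obj_azimuth, user, screen_first, screen_second)
gains = pan_gains(corrected)
left_ch, right_ch = render(obj_signal, gains)
```

With the screen moved to the left, the corrected azimuth becomes positive and the left gain exceeds the right gain, while the total power of the two gains stays at one.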

The audio signal processing method may further be characterized in that the step of receiving the second position information of the screen is performed based on a change in the installation environment of the screen or on a user input.

The audio signal processing method may further be characterized in that the step of receiving the second position information of the screen tracks the position information of the screen using device-to-device communication and receives the second position information based on the tracked position information.

The audio signal processing method may further be characterized in that, in the method of claim 1, the step of outputting the corrected object sound source position information generates depth information of the object using the screen position information and the user position information, and generates corrected object sound source position information that is corrected with respect to the user.

The audio signal processing method may further be characterized in that the step of generating the filter coefficient generates HRTF (Head Related Transfer Function) information using the corrected value of the object sound source position information.

Hereinafter, a method and apparatus for processing an object audio signal according to embodiments of the present invention are described.

FIG. 1 is a diagram for explaining viewing angles according to image size (e.g., UHDTV and HDTV) at the same viewing distance. As display manufacturing technology advances, image sizes are growing in response to consumer demand. As shown in FIG. 1, a UHDTV image (7680*4320 pixels, 110) is about 16 times larger than an HDTV image (1920*1080 pixels, 120). When an HDTV is mounted on the living-room wall and the viewer sits on the sofa at a given viewing distance, the viewing angle may be about 30 degrees. When a UHDTV is installed at the same viewing distance, the viewing angle reaches about 100 degrees. When such a high-quality, high-resolution large screen is installed, it may be desirable to provide sound with a correspondingly high sense of presence and immersion. One or two surround channel speakers may not be enough to give viewers an environment nearly identical to being on location. A multi-channel audio environment with more speakers and channels may therefore be required.
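The jump from about 30 degrees to about 100 degrees can be checked with basic trigonometry. The sketch below assumes a flat screen, a viewer centred in front of it, and the same pixel pitch for both displays, so that the UHDTV image is four times as wide as the HDTV image (7680 / 1920) and sixteen times as large in area; the function names are hypothetical.

```python
import math

def viewing_angle_deg(width, distance):
    """Horizontal angle subtended by a flat screen seen from its centre axis."""
    return math.degrees(2 * math.atan(width / (2 * distance)))

def distance_for_angle(width, angle_deg):
    """Viewing distance at which a screen of the given width subtends angle_deg."""
    return width / (2 * math.tan(math.radians(angle_deg) / 2))

hd_width = 1.0                                  # arbitrary unit
distance = distance_for_angle(hd_width, 30.0)   # HDTV seen at 30 degrees
uhd_width = hd_width * 7680 / 1920              # same pixel pitch: 4x wider
uhd_angle = viewing_angle_deg(uhd_width, distance)
```

Under these assumptions `uhd_angle` comes out at roughly 94 degrees, which is consistent with the "about 100 degrees" stated in the text.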

As described above, beyond the home theater environment, applications may include personal 3D TV, smartphone TV, 22.2-channel audio programs, automobiles, 3D video, telepresence rooms, and cloud-based gaming.

FIG. 2 shows a 22.2ch speaker layout as an example of multi-channel audio. 22.2ch is one example of a multi-channel environment for enhancing the sound field, and the present invention is not limited to a particular number of channels or a particular speaker layout. Referring to FIG. 2, a total of nine channels may be provided in the top layer (210): three speakers at the front, three at the middle position, and three at the surround position. In the middle layer (220), five speakers may be placed at the front, two at the middle position, and three at the surround position. Of the five front speakers, the three at the center may be contained within the TV screen. In the bottom layer (230), a total of three channels may be installed at the front, along with two LFE channels (240).
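The per-layer counts above add up to the "22.2" in the name. A minimal sketch of the layout as a data structure follows; the dictionary keys and the helper name are illustrative choices, not identifiers from the specification.

```python
# Speaker counts per layer and zone, as described for FIG. 2.
LAYOUT_22_2 = {
    "top":    {"front": 3, "middle": 3, "surround": 3},   # layer 210
    "middle": {"front": 5, "middle": 2, "surround": 3},   # layer 220
    "bottom": {"front": 3},                               # layer 230
}
LFE_CHANNELS = 2                                          # channels 240

def count_speakers(layout):
    """Total number of full-range speakers across all layers."""
    return sum(sum(zones.values()) for zones in layout.values())

main_channels = count_speakers(LAYOUT_22_2)
label = f"{main_channels}.{LFE_CHANNELS}"
```

Here `main_channels` is 9 + 10 + 3 = 22, and together with the two LFE channels the label reads "22.2".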

Transmitting and reproducing multi-channel signals of up to several tens of channels in this way may require a large amount of computation. A high compression ratio may also be required given the communication environment. Furthermore, few ordinary households have a multi-channel (e.g., 22.2ch) speaker setup, while many listeners have a 2ch or 5.1ch setup. If the signal sent in common to all users encodes each of the multiple channels, it must be converted back to 2ch or 5.1ch for playback, which is inefficient in terms of communication; and because the 22.2ch PCM signal must be stored, it can also be inefficient in terms of memory management.
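A rough calculation shows the scale of the inefficiency. The sample rate (48 kHz) and bit depth (24 bits) below are assumptions chosen for illustration; the specification does not state them.

```python
def pcm_bits_per_second(channels, sample_rate_hz=48_000, bits_per_sample=24):
    """Raw PCM data rate; sample rate and bit depth are assumed values."""
    return channels * sample_rate_hz * bits_per_sample

full_rate = pcm_bits_per_second(22 + 2)    # 22 main channels + 2 LFE channels
stereo_rate = pcm_bits_per_second(2)       # what a 2ch listener can reproduce
ratio = full_rate // stereo_rate
```

Under these assumptions the uncompressed 22.2ch signal needs 27,648,000 bits per second, twelve times the rate of the stereo signal that a 2ch listener actually plays back.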

FIG. 3 is a conceptual diagram showing the positions of the sound objects (120) that make up a three-dimensional sound scene in the listening space (130) where a listener (110) listens to 3D audio. Referring to FIG. 3, each object (120) is drawn as a point source for ease of illustration, but besides point sources there may also be plane-wave sound sources and ambient sound sources (lingering sound spread over all directions that conveys the space of the sound scene).

FIG. 4 shows object signal groups (410, 420) formed for the objects illustrated in FIG. 3 using the grouping method according to the present invention. According to the present invention, when encoding or otherwise processing object signals, object signal groups are formed and the objects are encoded or processed in units of groups. Encoding here includes both discrete coding of each object as an individual signal and parametric coding of object signals. In particular, according to the present invention, the downmix signal for parametric coding of object signals, and the parameter information of the objects corresponding to the downmix, are generated per group of objects. In conventional SAOC coding, for example, all objects making up the sound scene are represented by one downmix signal (the downmix signal may be mono (1 channel) or stereo (2 channels), but is referred to as one downmix signal for convenience) and the corresponding object parameter information. However, if, as in the scenarios considered in the present invention, 20 or more objects, and as many as 200 or 500, are represented by a single downmix and its corresponding parameters, upmixing and rendering at the desired sound quality is practically impossible. The present invention therefore groups the objects to be encoded and generates a downmix per group. When each object is downmixed within its group, a downmix gain may be applied, and the per-object downmix gain that was applied is included as side information in the bitstream for each group. Meanwhile, for coding efficiency or effective control of the overall gain, a global gain applied in common to all groups and an object group gain applied only to the objects of each group may be used; these are encoded, included in the bitstream, and transmitted to the receiver.
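Per-group downmixing with per-object downmix gains can be sketched as follows. This is a minimal mono illustration under assumed names and sample values; the parameter extraction and the bitstream format of the actual encoder are not shown.

```python
def downmix_group(objects, gains):
    """Mono downmix of one group: sample-wise sum of gain-weighted objects."""
    length = len(objects[0])
    return [sum(g * obj[i] for obj, g in zip(objects, gains)) for i in range(length)]

# Two hypothetical groups, named after the reference numerals of FIG. 4.
groups = {
    "group_410": {"objects": [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]], "gains": [0.5, 1.0]},
    "group_420": {"objects": [[0.0, 2.0, 0.0]], "gains": [0.25]},
}

# One downmix per group instead of one downmix for the whole scene.
downmixes = {
    name: downmix_group(g["objects"], g["gains"]) for name, g in groups.items()
}

# The gains that were applied travel with each group as side information.
side_info = {name: g["gains"] for name, g in groups.items()}
```

Keeping the applied gains next to each group's downmix mirrors the text's point that the decoder needs them as side information to restore the objects.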

The first way to form groups is to group nearby objects together, taking into account the position of each object in the sound scene. The object groups (410, 420) in FIG. 4 are an example formed this way. The aim is to keep the listener (110) from hearing, as far as possible, the crosstalk distortion that arises between objects because parametric coding is imperfect, and the distortion that arises when rendering moves objects to another position or changes their level. Distortion affecting objects at the same position is relatively likely to be masked and thus inaudible to the listener. For the same reason, even with discrete coding, grouping objects at spatially similar positions can bring benefits such as sharing side information.
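Position-based grouping can be sketched with a simple greedy rule. The rule, the distance threshold, and the coordinates below are assumptions for illustration; the specification does not prescribe a particular clustering algorithm.

```python
import math

def group_by_position(positions, max_distance):
    """Greedy grouping (assumed rule): each object joins the first group whose
    first member lies within max_distance, otherwise it starts a new group."""
    groups = []
    for idx, pos in enumerate(positions):
        for group in groups:
            anchor = positions[group[0]]
            if math.dist(pos, anchor) <= max_distance:
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups

# Hypothetical object positions (x, y, z) in metres around the listener.
positions = [
    (1.0, 0.0, 0.0),
    (1.1, 0.1, 0.0),
    (-2.0, 1.0, 0.5),
    (-2.1, 0.9, 0.5),
]
position_groups = group_by_position(positions, max_distance=0.5)
```

With these values the first two objects form one group and the last two another, analogous to groups 410 and 420 in FIG. 4.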

FIG. 5 is a block diagram of an embodiment of an encoder for object audio signals that includes the object grouping (550) and downmix (520, 540) methods according to the present invention. A downmix is performed for each group, and in the process the parameters needed to restore the downmixed objects are generated (520, 540). The downmix signals generated for each group are further encoded by a waveform encoder (560), such as AAC or MP3, that encodes per-channel waveforms. This is commonly called the core codec. Encoding that exploits coupling between the downmix signals may also be applied. The signals produced by each encoder are combined into a single bitstream by a multiplexer (570) and transmitted. The bitstream generated by the downmix & parameter encoders (520, 540) and the waveform encoder (560) can thus be regarded as encoding the component objects of a single sound scene. In addition, object signals belonging to different object groups within the generated bitstream are encoded with the same time frame and are therefore played back in the same time period. The grouping information generated by the object grouping unit may be encoded and delivered to the receiver.

FIG. 6 is a block diagram of an embodiment that decodes a signal encoded and transmitted in this way. Decoding is the inverse of encoding: the multiple downmix signals produced by waveform decoding (620) are each input, together with their corresponding parameters, to an upmixer & parameter decoder. Because there are multiple downmixes, multiple parameter decodings are required.

If the transmitted bitstream contains a global gain and object group gains, applying them restores the object signals to their normal level. These gain values can also be controlled during rendering or transcoding: the global gain adjusts the level of the whole signal, and the object group gain adjusts the level of each group. For example, when objects are grouped per playback speaker, the gain adjustments needed for the flexible rendering described later can be implemented easily by adjusting the object group gains.
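A minimal sketch of how the two gain levels might be applied is given below. It assumes that both gains are linear multipliers, which the description does not state, and `apply_gains` is a hypothetical helper name.

```python
def apply_gains(group_signals, group_gains, global_gain):
    """Scale every sample of each group by its group gain and by the global gain.
    Both gains are assumed to be linear factors (an assumption of this sketch)."""
    return [
        [global_gain * gain * sample for sample in signal]
        for signal, gain in zip(group_signals, group_gains)
    ]

decoded = [[1.0, -1.0], [0.5, 0.25]]  # two decoded object groups
restored = apply_gains(decoded, group_gains=[0.5, 1.0], global_gain=2.0)
```

Changing a single entry of `group_gains` at render time alters one group without touching the others, which is the per-group control the paragraph describes.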

Although the multiple parameter encoders or decoders are drawn as operating in parallel for convenience of explanation, a single system may also encode or decode the multiple object groups sequentially.

Another method of forming object groups is to group objects that have low correlation with one another. This takes into account a property of parametric coding: highly correlated objects are difficult to separate from a downmix. In this case, an encoding method is also possible in which parameters such as the downmix gain are adjusted during downmixing so that the grouped objects become even less correlated. The parameters used in this way are preferably transmitted so that they can be used to restore the signals at decoding.

Another method of forming object groups is to group objects that are highly correlated with one another. Highly correlated objects are difficult to separate using parameters, but this method raises compression efficiency in applications where such separation is rarely needed. A core codec needs more bits for a complex signal with a varied spectrum, so bundling highly correlated objects and coding them with a single core codec gives high coding efficiency.

Another method of forming object groups is to decide the grouping by determining whether objects mask one another. For example, if object A masks object B and the two signals are included in one downmix and encoded with the core codec, object B may be dropped during encoding. In that case, obtaining object B from the parameters at the decoder yields large distortion. Objects A and B in such a relationship are therefore preferably placed in separate downmixes. Conversely, in applications where objects A and B are in a masking relationship but need not be rendered separately, or at least where the masked object needs no separate processing, it is preferable to include objects A and B in one downmix. The selection method may therefore differ by application. For example, if a particular object is masked out, or is at least very weak, in the preferred sound scene during encoding, it can be removed from the object list and merged into the object that acts as its masker, or the two objects can be combined and represented as one object.

Another method of forming object groups is to separate out sources that are not point source objects, such as plane wave source objects or ambient source objects, and group them separately. Because their characteristics differ from those of point sources, such sources require different compression coding methods or parameters and are therefore preferably separated and processed on their own.

The object information decoded for each group is restored to the original objects through object degrouping, with reference to the transmitted grouping information.

FIG. 7 shows an embodiment of a bitstream generated by the encoding method according to the present invention. Referring to FIG. 7, the main bitstream 700, which carries the encoded channel or object data, is arranged in order of channel groups (720, 730, 740) or object groups (750, 760, 770). The header also contains the channel group position information CHG_POS_INFO (711) and the object group position information OBJ_POS_INFO (712), which give the position of each group within the bitstream; by referring to them, the data of a desired group can be decoded first without decoding the bitstream sequentially. The decoder therefore generally decodes group by group, starting with the data that arrives first, but it may change the decoding order freely for other policies or reasons. FIG. 7 also illustrates a sub-bitstream 701 which, separately from the main bitstream 700, carries the metadata (703, 704) for each channel or object together with key decoding-related information. The sub-bitstream may be transmitted intermittently while the main bitstream is being transmitted, or through a separate transmission channel.
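The random access that the header position information enables can be sketched as follows. The byte layout here (a group count followed by one offset/length pair per group) is purely hypothetical; the description does not define the syntax of CHG_POS_INFO or OBJ_POS_INFO, and `pack_groups` and `read_group` are names invented for the example.

```python
import struct

def pack_groups(payloads):
    """Build a stream: group count, then (offset, length) per group, then payloads."""
    header_size = 4 + 8 * len(payloads)
    header = struct.pack("<I", len(payloads))
    offset = header_size
    for payload in payloads:
        header += struct.pack("<II", offset, len(payload))
        offset += len(payload)
    return header + b"".join(payloads)

def read_group(stream, index):
    """Fetch one group's payload from the header entries alone, skipping other groups."""
    (count,) = struct.unpack_from("<I", stream, 0)
    if not 0 <= index < count:
        raise IndexError("no such group")
    offset, length = struct.unpack_from("<II", stream, 4 + 8 * index)
    return stream[offset:offset + length]

# Placeholder payloads standing in for two channel groups and one object group.
stream = pack_groups([b"CHG0", b"CHG1-data", b"OBJ0"])
last_group = read_group(stream, 2)
```

Because `read_group` never parses the payloads that precede the requested one, a decoder could reorder or prioritize groups as described above.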

(Method of allocating bits to each object group)

When a downmix is generated for each of multiple groups and independent parametric object coding is performed for each group, the number of bits used in each group may differ. The criteria for allocating bits to a group may include the number of objects in the group, the number of effective objects after accounting for masking among the objects in the group, a position-dependent weight reflecting human spatial resolution, the sound pressure level of the objects, the correlation between objects, and the importance of the objects in the sound scene. For example, with three spatial object groups A, B, and C containing 3, 2, and 1 object signals respectively, the bits may be allocated as 3a1(n-x), 2a2(n-y), and a3n. Here x and y denote how many fewer bits may be allocated because of masking effects between and within the objects of each group, and a1, a2, and a3 may be determined per group by the various factors mentioned above.
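The allocation example above can be worked through numerically. In this sketch the base budget n, the masking savings x and y, and the weights a1, a2, a3 are arbitrary values chosen for illustration; `group_bits` is a hypothetical helper.

```python
def group_bits(num_objects, weight, base_bits, masking_saving=0.0):
    """Bits for one group: object count * group weight * (base bits - masking saving)."""
    return num_objects * weight * (base_bits - masking_saving)

n = 64.0  # assumed per-object base budget (arbitrary units)
bits_a = group_bits(3, 1.0, n, masking_saving=8.0)  # 3 * a1 * (n - x)
bits_b = group_bits(2, 1.5, n, masking_saving=4.0)  # 2 * a2 * (n - y)
bits_c = group_bits(1, 0.5, n)                      # a3 * n
```

With these values, group B receives more bits than group A despite having fewer objects, because its weight is higher and its masking saving is smaller.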

(Encoding position information of main and sub objects within an object group)

For object information, it is desirable to have a means of delivering, through metadata, mix information recommended by the producer according to the producer's intent, or proposed by other users, as object position and size information. In the present invention this is called preset information for convenience. With preset position information, particularly for dynamic objects whose position varies over time, the amount of information to be transmitted is considerable. For example, transmitting position information that changes every frame for 1000 objects amounts to a very large volume of data. The position information of objects should therefore also be transmitted efficiently. To this end, the present invention encodes position information efficiently using the definitions of a main object and a sub object.

A main object is an object whose position information is expressed as absolute coordinates in three-dimensional space. A sub object is an object whose position in three-dimensional space is expressed as a value relative to a main object. A sub object therefore needs to know which main object it corresponds to. When grouping is performed, and particularly when it is based on spatial position, this can be implemented by designating one main object in a group and treating the rest as sub objects. If there is no grouping for encoding, or if using it is not advantageous for encoding sub object position information, a separate set may be formed for position information encoding. For relative expression of sub object positions to be more advantageous than absolute expression, the objects belonging to a group or set are preferably located within a limited range in space.
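The main/sub object representation can be illustrated with a short sketch. Integer coordinates (for example, centimeters) and the helper names `encode_group_positions` and `decode_group_positions` are assumptions made here; the description does not fix a coordinate system or data format.

```python
def encode_group_positions(positions, main_index=0):
    """Return the main object's absolute position and each sub object's offset from it."""
    main = positions[main_index]
    subs = [
        tuple(c - m for c, m in zip(pos, main))
        for i, pos in enumerate(positions)
        if i != main_index
    ]
    return main, subs

def decode_group_positions(main, subs, main_index=0):
    """Rebuild absolute positions from the main position and the sub offsets."""
    restored = [tuple(m + d for m, d in zip(main, sub)) for sub in subs]
    restored.insert(main_index, main)
    return restored

# Hypothetical positions of three objects in one spatial group.
group = [(100, 200, 50), (103, 198, 50), (97, 205, 52)]
main, subs = encode_group_positions(group)
```

When the group is spatially compact, the offsets in `subs` are small numbers that need fewer bits than absolute coordinates, which is why the paragraph prefers groups confined to a limited range.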

In another position information encoding method according to the present invention, the position is expressed relative to a fixed speaker position instead of relative to a main object. For example, the relative position of an object is expressed with respect to the specified position values of the 22-channel speakers. The number and positions of the speakers used as references may be based on the values set in the current content.

In another embodiment according to the present invention, the position information is expressed as an absolute or relative value and then quantized, and the quantization step varies with the absolute position. For example, listeners are known to discriminate positions far better near the front than at the sides or rear, so the quantization step is preferably set so that the resolution toward the front is higher than toward the sides. Likewise, human resolution in azimuth is higher than in elevation, so the azimuth is preferably quantized more finely than the elevation.
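One way to realize a position-dependent quantization step is a table of reconstruction levels whose spacing widens away from the front. The sketch below does this for azimuth only; the region boundaries (30 and 110 degrees) and the step sizes (1, 5, and 10 degrees) are assumptions chosen for illustration, not values from the description.

```python
def build_azimuth_levels():
    """Reconstruction levels in degrees over [-180, 180]: 1-degree spacing in front,
    5 degrees at the sides, 10 degrees at the rear (all spacings are assumed)."""
    levels = []
    for a in range(-180, 181):
        if abs(a) <= 30:
            step = 1
        elif abs(a) <= 110:
            step = 5
        else:
            step = 10
        if a % step == 0:
            levels.append(a)
    return levels

LEVELS = build_azimuth_levels()

def quantize(azimuth_deg):
    """Return the index of the nearest reconstruction level."""
    return min(range(len(LEVELS)), key=lambda i: abs(LEVELS[i] - azimuth_deg))

def dequantize(index):
    """Return the azimuth in degrees for a transmitted index."""
    return LEVELS[index]
```

Transmitting the table index rather than a raw value keeps the decoder unambiguous, because both sides derive the same level table.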

In another embodiment according to the present invention, for a dynamic object whose position varies over time, the position may be expressed relative to the object's own previous position instead of relative to a main object or another reference point. For dynamic objects, flag information is therefore preferably transmitted with the position information to indicate whether the reference is the temporally previous position or a spatially neighboring reference point.
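The reference flag could work as in the following sketch. The flag values and the rule for choosing the reference (whichever yields the smaller offset) are assumptions of this example; the description only requires that the flag identify which reference was used.

```python
REF_PREVIOUS, REF_NEIGHBOR = 0, 1  # hypothetical flag values

def encode_position(current, previous, neighbor):
    """Pick the reference giving the smaller total offset; return (flag, offset)."""
    d_prev = tuple(c - p for c, p in zip(current, previous))
    d_nb = tuple(c - n for c, n in zip(current, neighbor))
    cost_prev = sum(abs(v) for v in d_prev)
    cost_nb = sum(abs(v) for v in d_nb)
    if cost_prev <= cost_nb:
        return REF_PREVIOUS, d_prev
    return REF_NEIGHBOR, d_nb

def decode_position(flag, offset, previous, neighbor):
    """Add the offset back to whichever reference the flag names."""
    ref = previous if flag == REF_PREVIOUS else neighbor
    return tuple(r + o for r, o in zip(ref, offset))

# A slowly moving object: its previous position is the cheaper reference.
flag, offset = encode_position((10, 20, 5), previous=(9, 20, 5), neighbor=(0, 0, 0))
```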

(Overall decoder architecture)

FIG. 8 is a block diagram of an embodiment of an object and channel signal decoding system according to the present invention. The system can receive an object signal 801, a channel signal 802, or a combination of object and channel signals, and each object or channel signal may be waveform-coded (801, 802) or parametrically coded (803, 804). The decoding system is broadly divided into a 3DA decoding unit 860 and a 3DA rendering unit 870, and any external system or solution may be used as the 3DA rendering unit 870. The 3DA decoding unit 860 and the 3DA rendering unit 870 therefore preferably provide a standardized interface that is easily compatible with external systems.

FIG. 9 is a block diagram of another form of object and channel signal decoding system according to the present invention. Likewise, this system can receive an object signal 901, a channel signal 902, or a combination of object and channel signals, and each object or channel signal may be waveform-coded (901, 902) or parametrically coded (903, 904). Compared with the system of FIG. 8, the differences are that the separate discrete object decoder 810 and discrete channel decoder 820, and the separate parametric channel decoder 840 and parametric object decoder 830, are merged into a single discrete decoder 910 and a single parametric decoder 920 respectively, and that a 3DA rendering unit 940 and a renderer interface unit 930 providing a convenient, standardized interface are added. The renderer interface unit 930 receives user environment information, the renderer version, and the like from the 3DA renderer 940, which may be internal or external, and delivers channel or object signals in a compatible form together with the metadata needed to reproduce them and display related information. The 3DA renderer interface 930 may include the order control unit 1630 described later.

The parametric decoder 920 needs a downmix signal to generate object or channel signals; the required downmix signal is decoded by the discrete decoder 910 and supplied as input. The encoder corresponding to the object and channel signal decoding system may be of various types, and any encoder that can generate at least one of the bitstreams (801, 802, 803, 804, 901, 902, 903, 904) of the forms shown in FIGS. 8 and 9 can be regarded as compatible. According to the present invention, the decoding systems of FIGS. 8 and 9 are also designed to ensure compatibility with legacy systems and bitstreams. For example, when a discrete channel bitstream encoded with AAC is input, it can be decoded by the discrete (channel) decoder and sent to the 3DA renderer. An MPS (MPEG Surround) bitstream is sent together with a downmix signal; the downmix, which was AAC-encoded after downmixing, is decoded by the discrete (channel) decoder and passed to the parametric channel decoder, which then operates like an MPEG Surround decoder. A bitstream encoded with SAOC (Spatial Audio Object Coding) is handled in the same way. In the system of FIG. 8, SAOC operates as a transcoder, as in the prior art, and the result is then rendered to channels through MPEG Surround. For this purpose, the SAOC transcoder preferably receives the playback channel environment information and generates and transmits a channel signal optimized for it. A conventional SAOC bitstream can thus be received and decoded while rendering is tailored to the user or the playback environment. In the system of FIG. 9, when an SAOC bitstream is input, it is converted directly into channels or into discrete objects suitable for rendering, instead of being transcoded into an MPS bitstream. This requires less computation than the transcoding structure and is also advantageous for sound quality. In FIG. 9 the output of the object decoder is labeled only as channel, but it may also be passed to the renderer interface as discrete object signals. In addition, although shown only in FIG. 9, when a residual signal is included in the parametric bitstream, in the case of FIG. 8 as well, it is decoded by the discrete decoder.

(Discrete, parametric combination, and residual coding for channels)

FIG. 10 is a diagram showing the configuration of an encoder and a decoder according to another embodiment of the present invention.

FIG. 10 shows a structure for scalable coding for cases where decoders have different speaker setups.

The encoder includes a downmixing unit 210, and the decoder includes a demultiplexing unit 220 and one or more of a first decoding unit 230 through a third decoding unit 250.

The downmixing unit 210 generates a downmix signal (DMX) by downmixing the multichannel input signal (CH_N). In this process it generates one or more of an upmix parameter (UP) and an upmix residual (UR). It then multiplexes the downmix signal (DMX) and the upmix parameter (UP) (and the upmix residual (UR)) to generate one or more bitstreams, which are transmitted to the decoder.

Here, the upmix parameter (UP) is the parameter needed to upmix one or more channels into two or more channels, and may include spatial parameters and the inter-channel phase difference (IPD).

The upmix residual (UR) is the residual signal, that is, the difference between the original input signal (CH_N) and the restored signal. The restored signal may be the signal upmixed by applying the upmix parameter (UP) to the downmix (DMX), or it may be a signal in which the channels not downmixed by the downmixing unit 210 are encoded discretely.

The demultiplexing unit 220 of the decoder extracts the downmix signal (DMX) and the upmix parameter (UP) from the one or more bitstreams, and may further extract the upmix residual (UR). The residual signal may be encoded in a manner similar to the discrete coding of the downmix signal. Accordingly, in the systems of FIG. 8 or FIG. 9, the residual signal is decoded by the discrete (channel) decoder.

Depending on its speaker setup, a decoder may selectively include one (or more) of the first decoding unit 230 through the third decoding unit 250. The loudspeaker setup varies with the type of device (smartphone, stereo TV, 5.1ch home theater, 22.2ch home theater, and so on). If, despite this variety, the bitstream and decoder for generating a multichannel signal such as 22.2ch were not selective, the full 22.2ch signal would have to be restored and then downmixed again to suit the speaker playback environment. In that case the computation required for restoration and downmixing is very high, and delay may also occur.

According to this embodiment of the present invention, however, each device selectively includes one (or more) of the first through third decoders according to its setup, which removes this disadvantage.

The first decoder 230 decodes only the downmix signal (DMX) and does not increase the number of channels. It outputs a mono channel signal if the downmix is mono and a stereo signal if the downmix is stereo. It may suit devices with one or two speaker channels, such as headphone-equipped devices, smartphones, and TVs.

The second decoder 240 receives the downmix signal (DMX) and the upmix parameter (UP) and generates parametric M channels (PM) from them. The number of channels increases compared with the first decoder, but if the upmix parameter (UP) contains only the parameters for upmixing to a total of M channels, the decoder reproduces M channels, fewer than the original number of channels (N). For example, the original signal input to the encoder may be a 22.2ch signal, and the M channels may be 5.1ch, 7.1ch, and so on.

The third decoder 250 receives the upmix residual (UR) in addition to the downmix signal (DMX) and the upmix parameter (UP). Whereas the second decoder generates M parametric channels, the third decoder additionally applies the upmix residual (UR) and can thereby output the restored signal of N channels.

Each device selectively includes one or more of the first through third decoders and selectively parses the upmix parameter (UP) and the upmix residual (UR) from the bitstream, directly generating the signal that suits its speaker setup and thereby reducing complexity and computation.
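The selective parsing described above might be organized as in this sketch, which maps a device's speaker count to a decoder and to the bitstream elements it needs. The selection rule and the function name `select_decoder` are assumptions; channel counts include the LFE channels (5.1ch counted as 6, 22.2ch as 24).

```python
def select_decoder(speaker_channels, downmix_channels, m_channels, n_channels):
    """Return (decoder name, bitstream elements to parse) for a device.
    The threshold rule is an assumption of this sketch. n_channels is listed for
    completeness; any setup beyond M channels falls through to the third decoder."""
    if speaker_channels <= downmix_channels:
        return "first", {"DMX"}              # downmix only, no channel increase
    if speaker_channels <= m_channels:
        return "second", {"DMX", "UP"}       # parametric upmix to M channels
    return "third", {"DMX", "UP", "UR"}      # residual added, N channels restored

# Stereo downmix, 5.1ch parametric layer, 22.2ch original.
phone = select_decoder(2, downmix_channels=2, m_channels=6, n_channels=24)
home_theater = select_decoder(6, downmix_channels=2, m_channels=6, n_channels=24)
full_setup = select_decoder(24, downmix_channels=2, m_channels=6, n_channels=24)
```

A device that selects the first decoder can skip UP and UR entirely, which is where the saving in parsing and computation comes from.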

(Object waveform coding considering masking)

The object waveform encoder according to the present invention allocates bits in consideration of each object's position in the sound scene. (Hereinafter, a waveform encoder refers to one that encodes channel or object audio signals so that each channel or object can be decoded independently; as the counterpart of parametric encoding/decoding, this is also called discrete encoding/decoding.) This exploits the psychoacoustic BMLD (Binaural Masking Level Difference) phenomenon and the characteristics of object signal coding.

The BMLD phenomenon can be explained using MS (Mid-Side) stereo coding, as used in existing audio coding methods. BMLD means that psychoacoustic masking is effective when the masker, which produces the masking, and the maskee, which is masked, lie in the same spatial direction. When the two channels of a stereo audio signal are very highly correlated and equal in level, the sound image forms at the center between the two speakers; when they are uncorrelated, an independent sound comes from each speaker and an image forms at each speaker. If each channel of a maximally correlated input is encoded independently (dual mono), the quantization noise in the two channels is uncorrelated, so the audio signal forms its image at the center while the quantization noise forms separate images at each speaker. The quantization noise, which should be the maskee, is then not masked because of the spatial mismatch and is heard as distortion. To solve this problem, sum-difference coding generates the sum of the two channel signals (the mid signal) and their difference (the difference, or side, signal), runs the psychoacoustic model on these signals, and quantizes them accordingly, so that the resulting quantization noise lies at the same position as the sound image.
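The sum and difference signals mentioned above can be sketched in a few lines. The factor of 1/2 is one common scaling convention and is an assumption here; the psychoacoustic model and quantizer are omitted.

```python
def ms_encode(left, right):
    """Form mid (scaled sum) and side (scaled difference) signals."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Recover left and right from mid and side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

# Fully correlated, equal-level channels: all of the signal lands in mid.
mid, side = ms_encode([1.0, 0.5, -0.25], [1.0, 0.5, -0.25])
```

For such an input the side signal is zero, so quantization noise is introduced only in the mid signal and appears at the center, where the sound image is.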

In conventional channel coding, each channel is mapped to a playback speaker, and because the speaker positions are fixed and apart from one another, masking between channels could not be considered. When each object is encoded independently, however, whether an object is masked can vary with the positions of the objects in the sound scene. It is therefore preferable to determine whether the object currently being encoded is masked by other objects and to allocate bits accordingly.

FIG. 11 shows the signals of object 1 (1110) and object 2 (1120), the masking thresholds obtainable from these signals, and the masking threshold (1130) of the signal combining object 1 and object 2. If object 1 and object 2 are regarded as being at the same position relative to the listener, or at least within a range where BMLD causes no problem, the region masked for the listener by these signals is as shown in 1130, so the S2 signal contained in object 1 is completely masked and inaudible. It is therefore preferable to take the masking threshold of object 2 into account when encoding object 1. Because masking thresholds combine additively, the joint threshold can be obtained by adding the masking thresholds of object 1 and object 2. Alternatively, because computing a masking threshold is itself computationally expensive, it is also preferable to compute a single masking threshold from the signal obtained by summing object 1 and object 2 in advance, and to use it to encode object 1 and object 2 individually. FIG. 12 shows an embodiment of an encoder that computes a masking threshold for multiple object signals according to the present invention.
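The effect of adding thresholds can be illustrated with per-band values. The band powers and thresholds below are invented numbers in the power domain, and the helper names are hypothetical; a real encoder would obtain the thresholds from a psychoacoustic model, which is not shown.

```python
def combined_threshold(m1, m2):
    """Add two per-band masking thresholds (power domain assumed)."""
    return [a + b for a, b in zip(m1, m2)]

def audible_bands(band_power, threshold):
    """Indices of bands whose power exceeds the masking threshold."""
    return [i for i, (p, t) in enumerate(zip(band_power, threshold)) if p > t]

obj1_power = [10.0, 0.5, 4.0]  # hypothetical band powers of object 1
m1 = [1.0, 0.2, 0.5]           # threshold from object 1 alone
m2 = [0.1, 2.0, 0.3]           # threshold from object 2 alone

alone = audible_bands(obj1_power, m1)
joint = audible_bands(obj1_power, combined_threshold(m1, m2))
```

Band 1 of object 1 is audible against its own threshold but falls under the joint threshold, so it plays the role of the S2 component that needs no bits when the two objects share a position.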

In another masking threshold calculation method according to the present invention, when the positions of the two object signals do not coincide exactly in terms of audible angle, the masking level may be attenuated according to how far apart the two objects are in space, instead of simply adding the two masking thresholds. That is, letting M1(f) be the masking threshold for object 1 and M2(f) that for object 2, the final joint masking thresholds M1'(f) and M2'(f) used to encode each object are generated so as to satisfy the following relationship.

Here, A(f) is an attenuation factor generated from the spatial positions of the two objects, the distance between them, the attributes of the two objects, and so on, and lies in the range 0.0 ≤ A(f) ≤ 1.0.
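As an illustration only, one relationship consistent with this description is M1'(f) = M1(f) + A(f)·M2(f) and M2'(f) = A(f)·M1(f) + M2(f). This form is an assumption made for the sketch, as are the function name and the per-band list representation. With A(f) = 1 it reduces to the plain sum of thresholds, and with A(f) = 0 each object keeps its own threshold.

```python
def joint_masking_thresholds(m1, m2, a):
    """Return (M1', M2') per frequency band.

    m1, m2: masking thresholds of objects 1 and 2 (linear power per band).
    a:      attenuation factor A(f) per band, each value in [0.0, 1.0].
    Assumed form (an illustration, not the patent's stated equation):
        M1'(f) = M1(f) + A(f) * M2(f)
        M2'(f) = A(f) * M1(f) + M2(f)
    """
    if not (len(m1) == len(m2) == len(a)):
        raise ValueError("all inputs must cover the same bands")
    if any(not 0.0 <= af <= 1.0 for af in a):
        raise ValueError("A(f) must lie in [0.0, 1.0]")
    m1_joint = [x + af * y for x, y, af in zip(m1, m2, a)]
    m2_joint = [af * x + y for x, y, af in zip(m1, m2, a)]
    return m1_joint, m2_joint
```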

Human directional resolution deteriorates from the front toward the sides, and deteriorates further toward the rear. The absolute position of the objects can therefore serve as another factor in determining A(f).

In another embodiment according to the present invention, one of the two objects uses only its own masking threshold, and only the other object takes the masking threshold of its counterpart into account. These are called the independent object and the dependent object, respectively. Because an object that uses only its own masking threshold is encoded at high quality regardless of the other object, its sound quality can be preserved even when rendering spatially separates it from that object. If object 1 is the independent object and object 2 is the dependent object, the masking thresholds can be expressed as follows.

Whether each object is an independent object or a dependent object is preferably conveyed to the decoder and the renderer as side information for that object.
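A minimal sketch of the independent/dependent scheme follows. It assumes that the independent object keeps M1(f) unchanged while the dependent object uses A(f)·M1(f) + M2(f). That form, the function name, and the `side_info` dictionary are hypothetical and chosen only for illustration.

```python
def independent_dependent_thresholds(m_independent, m_dependent, a):
    """Return (thresholds_independent, thresholds_dependent, side_info).

    Assumed form (an illustration, not the patent's stated equation):
        independent object: M1'(f) = M1(f)
        dependent object:   M2'(f) = A(f) * M1(f) + M2(f)
    """
    if not (len(m_independent) == len(m_dependent) == len(a)):
        raise ValueError("all inputs must cover the same bands")
    kept = list(m_independent)  # the independent object ignores its counterpart
    coupled = [af * x + y for x, y, af in zip(m_independent, m_dependent, a)]
    # Hypothetical side information telling the decoder/renderer which is which.
    side_info = {"object1": "independent", "object2": "dependent"}
    return kept, coupled, side_info
```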

In another embodiment according to the present invention, when two objects are sufficiently close to each other in space, it is also possible to merge the signals themselves into a single object for processing, rather than merely combining their masking thresholds.

In another embodiment according to the present invention, particularly when parametric coding is performed, it is desirable to merge two signals into a single object in consideration of the correlation between the two signals and their spatial positions.

(Transcoding feature)

In another embodiment according to the present invention, when transcoding a bitstream containing coupled objects, in particular when transcoding to a lower bit rate, the number of objects may have to be reduced to cut the data size; that is, several objects may have to be downmixed and represented as one object. In that case, it is desirable to represent the coupled objects as a single object.

The description above of encoding through inter-object coupling used the coupling of only two objects as an example for convenience of explanation, but coupling of more than two objects can be implemented in a similar manner.

(Need for flexible rendering)

Among the technologies required for 3D audio, flexible rendering is one of the important problems that must be solved to maximize 3D audio quality. It is well known that the positions of 5.1-channel speakers are highly irregular, depending on the structure of the living room and the arrangement of furniture. Even with speakers at such irregular positions, the sound scene intended by the content creator must be delivered. This requires knowing the speaker environment of each user's playback setup, which differs from user to user, together with a rendering technique that compensates for the deviation from the positions specified by the standard. In other words, the codec's role does not end with decoding the transmitted bitstream according to the decoding method; a series of techniques is also required for optimizing and transforming the decoded result to suit the user's playback environment.

FIG. 13 shows, for a 5.1-channel setup, the layout according to the ITU-R recommendation (gray, 1310) and speakers placed at arbitrary positions (red, 1320). In a real living room, both the direction angle and the distance may differ from the ITU-R recommendation in this way. (Although not shown in the figure, the speaker heights may also differ.) If the original channel signals are reproduced as they are at such altered speaker positions, it is difficult to provide the ideal 3D sound scene.

(Flexible rendering)

With amplitude panning, which determines the direction of a sound source between two speakers based on signal level, or with VBAP (Vector-Based Amplitude Panning), which is widely used to determine the direction of a sound source using three speakers in three-dimensional space, flexible rendering can be implemented relatively easily for object signals transmitted on a per-object basis. This is one of the advantages of transmitting object signals instead of channels.
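To make the panning step concrete, the following is a minimal sketch of the two-speaker (2D) form of VBAP. The source direction is written as a weighted sum of the two speaker direction vectors, and the weights are normalized to constant power. The function name, the degree-based azimuth convention, and the power normalization are assumptions of the sketch, and the three-speaker 3D case solves the analogous 3×3 system.

```python
import math


def vbap_2d(source_deg, spk1_deg, spk2_deg):
    """Return gains (g1, g2) that pan a source between two speakers.

    Angles are azimuths in degrees. Solves p = g1*l1 + g2*l2 for the unit
    vectors p (source), l1 and l2 (speakers), then normalizes g1^2 + g2^2 = 1.
    """
    def unit(deg):
        rad = math.radians(deg)
        return math.cos(rad), math.sin(rad)

    px, py = unit(source_deg)
    ax, ay = unit(spk1_deg)
    bx, by = unit(spk2_deg)
    det = ax * by - ay * bx
    if abs(det) < 1e-9:
        raise ValueError("speaker directions must not be collinear")
    g1 = (px * by - py * bx) / det  # Cramer's rule
    g2 = (ax * py - ay * px) / det
    if g1 < -1e-9 or g2 < -1e-9:
        raise ValueError("source lies outside the arc spanned by the speakers")
    g1, g2 = max(g1, 0.0), max(g2, 0.0)
    norm = math.hypot(g1, g2)
    return g1 / norm, g2 / norm
```

Because only the speaker angles enter the computation, the same object signal can be re-panned for whatever angles the listener's room actually has.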

(Object decoding and rendering structure)

FIG. 14 shows the structures (1400, 1401) of two embodiments in which a decoder for an object bitstream according to the present invention is connected to a flexible rendering system. As described above, objects have the advantage that they can easily be positioned as sound sources to match a desired sound scene. Here, the mix unit (Mix, 1420) receives position information expressed as a mixing matrix and first converts the objects into channel signals. That is, the position information of the sound scene is expressed as information relative to the speakers corresponding to the output channels. If the number and positions of the actual speakers do not match the prescribed positions, a further rendering step using the corresponding position information (Speaker Config) is needed. As described below, rendering a channel signal again into another form of channel signal is more difficult to implement than rendering an object directly to the final channels.

FIG. 15 shows the structure of another embodiment implementing decoding and rendering of an object bitstream according to the present invention. In contrast to FIG. 14, flexible rendering (1510) suited to the final speaker environment is implemented directly, together with decoding from the bitstream. That is, instead of passing through two stages, namely mixing to regular channels based on the mixing matrix and then rendering from those regular channels to the flexible speaker layout, a single rendering matrix or set of rendering parameters is generated from the mixing matrix and the speaker position information (1520) and is used to render the object signals directly to the target speakers.
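The single-matrix idea can be sketched as follows, under the simplifying assumption that both stages are linear gain matrices. In that case, the direct rendering matrix is the product of the channel-to-speaker matrix and the object-to-channel mixing matrix. The function names and matrix layouts are hypothetical, and the text leaves open how the rendering matrix is actually derived from the speaker position information.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def combined_rendering_matrix(channel_to_speaker, object_to_channel):
    """Fold 'objects -> regular channels -> actual speakers' into one matrix."""
    return matmul(channel_to_speaker, object_to_channel)


def apply_matrix(matrix, samples):
    """Apply a gain matrix to one sample per input, giving one per output."""
    return [sum(g * x for g, x in zip(row, samples)) for row in matrix]
```

Rendering with the combined matrix gives the same speaker feeds as running the two stages in sequence, without materializing the intermediate regular-channel signals.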

(Flexible rendering after mixing into channels)

Meanwhile, when a channel signal is transmitted as input and the position of the speaker corresponding to that channel has been changed to an arbitrary position, this is difficult to handle with the same panning technique used for objects, and a separate channel mapping process is required. A further problem is that, because the processes and solutions required for rendering object signals and channel signals differ in this way, distortion due to spatial mismatch is likely to occur when object signals and channel signals are transmitted together and a sound scene mixing the two is to be produced. To solve this problem, in another embodiment according to the present invention, flexible rendering is not performed separately on the objects; instead, the objects are first mixed into the channel signals, and flexible rendering is then performed on the channel signals. Rendering using HRTFs is preferably implemented in the same way.

(Decoder-side downmix: parameter transmission or automatic generation)

In downmix rendering, that is, when multichannel content is reproduced through a smaller number of output channels, it has so far been common to use an M-N downmix matrix (M being the number of input channels and N the number of output channels). For example, when 5.1-channel content is reproduced in stereo, the downmix is performed according to a given formula. However, such a downmix implementation raises a computational problem: even though the user's playback speaker environment is only 5.1 channels, the entire bitstream corresponding to the transmitted 22.2 channels must be decoded. If all 22.2 channel signals must be decoded even to generate a stereo signal for playback on a portable device, the computational burden is very high and a large amount of memory is wasted on storing the decoded 22.2-channel audio signals.
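As a concrete instance of an M-N downmix matrix, the sketch below folds 5.1 (M = 6) into stereo (N = 2). The text does not specify the coefficients, so the 1/√2 weights on the center and surround channels, the omission of the LFE, and the channel order are assumptions that follow a commonly used convention.

```python
import math

C = 1.0 / math.sqrt(2.0)  # about -3 dB, assumed weight for center and surrounds

# Assumed input order: L, R, C, LFE, Ls, Rs. One row per output channel.
DOWNMIX_5_1_TO_STEREO = [
    [1.0, 0.0, C, 0.0, C, 0.0],  # left output
    [0.0, 1.0, C, 0.0, 0.0, C],  # right output
]


def downmix(matrix, frame):
    """Apply an N x M downmix matrix to one frame of M input samples."""
    if any(len(row) != len(frame) for row in matrix):
        raise ValueError("matrix width must equal the number of input channels")
    return [sum(g * x for g, x in zip(row, frame)) for row in matrix]
```

The matrix multiply itself is cheap. The cost the text points to is that every one of the M input channels must be fully decoded and buffered before it can be applied.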

(Transcoding as an alternative to downmix)

As an alternative, one can consider converting the huge 22.2-channel original bitstream, through efficient transcoding, into a bitstream with a number of channels suited to the target device or target playback space. For example, for 22.2-channel content stored on a cloud server, a scenario can be implemented in which playback environment information is received from the client terminal and the content is converted accordingly and transmitted.

(Decoding order or downmix order; order control unit)

Meanwhile, in a scenario where the decoder and the renderer are separate, it may be necessary, for example, to decode 50 object signals together with a 22.2-channel audio signal and pass them to the renderer. Because the transferred audio signals are fully decoded, high-data-rate signals, a very large bandwidth is required between the decoder and the renderer. It is therefore undesirable to transfer so much data at once, and an effective transfer plan should be established. The decoder should then determine its decoding order accordingly and transfer the signals in that order. FIG. 16 is a block diagram showing a structure that determines a transfer plan between the decoder and the renderer and transfers signals accordingly.

The order control unit (1630) receives the side information and metadata obtained by decoding the bitstream, and receives the playback environment, rendering information, and the like from the renderer (1620). It determines the decoding order as well as the order and unit in which decoded signals are transferred to the renderer (1620), and passes the resulting control information back to the decoder (1610) and the renderer (1620). For example, if the renderer (1620) has been instructed to remove a particular object completely, that object need not be transferred to the renderer (1620), nor does it need to be decoded. As another example, when particular objects are rendered only to a particular channel, the transfer bandwidth is reduced if those objects are downmixed in advance into the transferred channel instead of being transferred separately. In yet another embodiment, if the sound scene is grouped spatially and the signals needed for rendering are transferred together group by group, the amount of signal that must wait unnecessarily in the renderer's internal buffer can be minimized. In addition, the amount of data that can be accepted at one time may differ from renderer to renderer; this information can also be reported to the order control unit (1630) so that the decoder (1610) can determine its decoding timing and transfer volume accordingly.

Furthermore, the decoding control performed by the order control unit (1630) can be conveyed to the encoding side so that the encoding process is controlled as well. That is, unnecessary signals can be excluded at encoding time, or the grouping of objects and channels can be determined.
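Two of the decisions attributed to the order control unit, skipping removed objects and pre-mixing objects that are rendered to a single channel, can be sketched as a small planning function. The function name, argument names, and return structure are hypothetical; the text describes the behaviour but does not define an interface.

```python
def plan_transfer(object_ids, removed, single_channel_target):
    """Decide what to decode and how to hand it to the renderer.

    object_ids:            object identifiers in bitstream order.
    removed:               set of objects the renderer will discard.
    single_channel_target: dict mapping an object to the one channel it is
                           rendered to (objects absent from it go separately).
    Returns (decode_list, premix, separate).
    """
    decode_list, premix, separate = [], {}, []
    for oid in object_ids:
        if oid in removed:
            continue  # neither decoded nor transferred
        decode_list.append(oid)
        channel = single_channel_target.get(oid)
        if channel is not None:
            # Downmix into the channel before transfer to save bandwidth.
            premix.setdefault(channel, []).append(oid)
        else:
            separate.append(oid)
    return decode_list, premix, separate
```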

(Voice highway)

Meanwhile, the bitstream may include an object corresponding to speech for two-way communication. Unlike other content, two-way communication is very sensitive to time delay, so when such an object or channel signal is received it must be transferred to the renderer with priority. Such an object or channel signal can be marked with a separate flag or the like. Unlike other objects/channels, a priority-transfer object is independent, in terms of presentation time, of the other object/channel signals contained in the same frame.
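A minimal sketch of priority transfer follows, assuming each signal in a frame carries a boolean priority flag. The tuple layout and function name are hypothetical. A stable sort moves flagged signals to the front while leaving the relative order of the rest unchanged.

```python
def transfer_order(signals):
    """Order a frame's signals for transfer to the renderer.

    signals: list of (name, is_priority) tuples in bitstream order.
    Flagged signals go first; Python's sort is stable, so the original
    order is kept within each group.
    """
    return sorted(signals, key=lambda signal: not signal[1])
```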

(AV alignment and phantom center)

One of the new problems that arises with UHDTV, that is, ultra-high-definition TV, is the situation commonly called the near field. Considering the viewing distance in a typical user environment (a living room), the distance from the playback speakers to the listener becomes shorter than the distance between the speakers, so each speaker behaves as a point source. In addition, with no speaker in the center because of the wide, large screen, a high-quality 3D audio service is possible only if the spatial resolution of sound objects synchronized to the video is very high.

At the conventional viewing angle of about 30 degrees, stereo speakers placed on the left and right are not in a near-field situation and are sufficient to provide a sound scene that matches the movement of objects on the screen (for example, a car moving from left to right). In a UHDTV environment where the viewing angle reaches 100 degrees, however, additional resolution covering the top and bottom of the screen is needed in addition to left-right resolution. For example, when two characters appear on screen, on current HDTVs it is not perceived as a serious loss of realism even if both voices seem to come from the center; at UHDTV sizes, the mismatch between the picture and the corresponding sound will be perceived as a new form of distortion.

One solution is the 22.2-channel speaker configuration. FIG. 2 is an example of a 22.2-channel layout. According to FIG. 2, a total of 11 speakers are placed at the front, greatly increasing the horizontal and vertical spatial resolution at the front. Five speakers are placed in the middle layer, which was previously covered by three speakers. Three speakers are added in the upper layer and three in the lower layer so that the height of sounds can also be handled adequately. With such a layout the spatial resolution at the front is higher than before, which is correspondingly advantageous for alignment with the video signal. However, in current TVs using display elements such as LCD and OLED, the display occupies the positions where speakers should be. That is, unless the display itself produces sound or is acoustically transparent, sound matched to the position of each object on the screen must be provided using speakers located outside the display area. In FIG. 2, at least the speakers corresponding to FLc, FC, and FRc are at positions that overlap the display.

FIG. 17 is a conceptual diagram illustrating how, in a 22.2-channel system, the front speakers that are absent because of the display are reproduced using their surrounding channels. To compensate for the absence of FLc, FC, and FRc, additional speakers may also be placed around the top and bottom edges of the display, as indicated by the dotted circles. According to FIG. 17, there may be seven surrounding channels available for generating FLc. Using these seven speakers, the sound corresponding to the absent speaker position can be reproduced on the principle of generating a virtual source.

Techniques and properties such as VBAP or the Haas effect (precedence effect) can be used to generate a virtual source with the surrounding speakers. Alternatively, different panning techniques may be applied depending on the frequency band. Azimuth change and height adjustment using HRTFs may also be considered. For example, when BtFC is used to substitute for FC, this can be implemented by applying an HRTF with an elevating property and adding the FC channel signal to BtFC. What HRTF observations show is that, to adjust the perceived height of a sound, the position of a specific null in the high-frequency band (which differs from person to person) must be controlled. To generalize across listeners whose nulls differ, height adjustment can instead be implemented by boosting or cutting the high-frequency band broadly. The drawback of this approach is that the filter introduces distortion into the signal.

The processing method for placing a sound source at an absent speaker position according to the present invention is shown in FIG. 18. According to FIG. 18, the channel signal corresponding to the phantom speaker position is used as the input signal, and the input signal passes through a subband filter unit (1810) that divides it into three bands. The method may also be implemented without a speaker array, in which case the signal may be divided into two bands instead of three, or divided into three bands with the upper two bands processed differently. The first band is the low-frequency band, which is relatively insensitive to position but is preferably reproduced through a large speaker, so it is a signal that can be reproduced through a woofer or subwoofer. A time delay (1820) is added to the first-band signal to exploit the precedence effect. This time delay is not intended to compensate for the filter delay arising in the processing of the other bands; it is an additional delay that makes the first band play later than the other band signals, that is, it provides the precedence effect.

The second band is the signal to be reproduced through the speakers around the phantom speaker (speakers placed on the bezel of the TV display and its surroundings). It is divided among at least two speakers for reproduction, and coefficients for applying a panning algorithm (1830) such as VBAP are generated and applied. The panning effect is therefore improved only when the number and positions (relative to the phantom speaker) of the speakers reproducing the second-band output are provided accurately. Besides VBAP panning, a filter that takes HRTFs into account may be applied, or different phase filters or time-delay filters may be applied to provide a time-panning effect. A further advantage of applying HRTFs band by band in this way is that the range of signal distortion caused by the HRTF can be confined to the band being processed.

The third band is used to generate the signal reproduced with a speaker array, when one is present, and an array signal processing technique (1840) for sound source virtualization through at least three speakers can be applied. Alternatively, coefficients generated through WFS (Wave Field Synthesis) can be applied. The third band and the second band may in fact be the same band.
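The routing of the first two bands can be sketched as follows, assuming the subband split has already been done elsewhere and that the panning gains for the bezel speakers are supplied by the caller. The function names and the sample-count delay parameter are hypothetical, and the array processing for the third band is omitted.

```python
def delay(signal, n_samples):
    """Delay a signal by prepending n_samples zeros."""
    if n_samples < 0:
        raise ValueError("delay must be non-negative")
    return [0.0] * n_samples + list(signal)


def route_phantom_bands(low_band, mid_band, mid_gains, extra_delay_samples):
    """Route the phantom-speaker bands to physical speakers.

    low_band:  first-band samples, sent to the woofer with extra delay so the
               other bands arrive first (precedence effect).
    mid_band:  second-band samples, panned to the speakers around the display.
    mid_gains: one panning gain per surrounding speaker (at least two).
    """
    if len(mid_gains) < 2:
        raise ValueError("the second band needs at least two speakers")
    woofer_feed = delay(low_band, extra_delay_samples)
    bezel_feeds = [[g * x for x in mid_band] for g in mid_gains]
    return woofer_feed, bezel_feeds
```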

FIG. 19 shows an embodiment in which the signals generated in each band are mapped to the speakers placed around the TV. According to FIG. 19, the number and positions of the speakers corresponding to the second and third bands must be relatively precisely defined, and this position information is preferably provided to the processing system of FIG. 18.

The present invention relates to a 3D audio-visual (AV) alignment technique that maintains consistency with the video picture even when the speakers and the screen are placed irregularly. That is, one aspect of the present invention addresses the problem of spatial lip synchronization. In defining the position of a sound source, the user's position is designated as the reference point, and the relative positions of the speakers, the screen, and the object sound sources are determined with respect to it. The present invention includes an encoding technique that determines the spatial information of the object sound sources intended in content production, and a flexible rendering technique that compensates for irregular speaker/screen placement. The localization of each object sound source is corrected and rendered in consideration of the relative positions of the user, the speakers, and the screen.

In correcting the various mismatches that occur in the usage environment, the proposed invention concerns a rendering technique that assigns the output of each object sound source to the individual speakers. It takes as input the object header information, which includes the temporal/spatial position information of the object, the position information describing the mismatch between the screen and the speakers, and the position/rotation information of the user's head, and determines the gain and delay values that correct the localization of the object sound source.

FIG. 24 illustrates the change in the position of an object sound source as the screen height changes. Here, O (2410) is the position of the user, p (2420) is the position of the screen before the change (or the position of the object image displayed on the screen at its position before the change), and s (2430) is the position of the object sound source before the change. The position of the object sound source s is defined in a spherical coordinate system by the elevation angle θ (2440) and the distance r (2450). The distance g (2460) from the user to the screen and the depth d (2470) from the screen to the object sound source are obtained through Equation 3 from the system inputs, namely the user's environment information and the object sound source position information.

If the position of the screen changes from p (2420) to p' (2421), the position of the object sound source must change from s (2430) to s' (2431) to maintain the user's immersion. The changed position s' (2431) is determined by the changed elevation angle θ' (2441) and the changed distance r' (2451). The changed elevation angle is determined by Equation 4.

The changed distance is determined by Equation 5.

Here, p', g, and d are parameters that are either given or can be derived from the given information, so the changed sound source position s' is updated immediately as the screen height changes.
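The update can be illustrated with one possible geometry, which is an assumption of this sketch and not a reproduction of Equations 3 to 5. It assumes the screen is a vertical plane at horizontal distance g from the user, the object image appears at height h' on that plane after the change, and the sound source stays on the ray from the user through the image while keeping its horizontal depth d behind the screen. All names are hypothetical.

```python
import math


def updated_source_position(screen_distance, depth, new_image_height):
    """Return (elevation_rad, distance) of the source after the screen moves.

    Assumed geometry (illustrative only):
        theta' = atan(h' / g)
        r'     = (g + d) / cos(theta')
    where g = screen_distance, d = depth, h' = new_image_height.
    """
    if screen_distance <= 0.0 or depth < 0.0:
        raise ValueError("screen_distance must be > 0 and depth >= 0")
    elevation = math.atan2(new_image_height, screen_distance)
    distance = (screen_distance + depth) / math.cos(elevation)
    return elevation, distance
```

Under these assumptions, the new position depends only on g, d, and the new image height, so it can be recomputed whenever the screen position is reported.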

FIG. 25 illustrates how the position of an object sound source changes when the screen moves left or right. As in FIG. 24, O (2510) is the position of the user, p (2520) is the position of the screen before the change (or the position of the object image displayed on the screen at its pre-change position), and s (2530) is the position of the object sound source before the change. The position of the object sound source s (2530) is defined in spherical coordinates by the horizontal angle (2540) and the distance r (2550). The distance g (2560) from the user to the screen and the depth d (2570) from the screen to the object sound source are determined by Equation 6, or can be obtained through Equation 3, from the system inputs, namely the user's environment information and the object sound source position information.

If the position of the screen changes from p (2520) to p' (2521), the position of the object sound source must change from s (2530) to s' (2531) to preserve the user's sense of immersion. The changed source position s' (2531) is determined by the changed horizontal angle (2541) and the changed distance r' (2551).

That is, the present invention immediately updates the changed source position s' (2531) using the previously given spatial information parameters and the position information of the screen.

Besides the change in screen height shown in FIG. 24 and the left-right change shown in FIG. 25, a forward or backward change in the screen position is another use case. In other words, the screen may move closer to the user or farther away. FIG. 28 illustrates a method of changing the position of the object sound source according to a forward or backward change in the screen position. Here, O (2810) is the position of the user, g (2820) is the distance between the user and the screen (2821) before the change (or the distance between the user and the object image displayed on the screen before the change), d (2830) is the distance by which the screen has moved toward the user, and r (2840) is the distance between the user and the position at which the object sound source (2850) should be localized after the screen position has changed.

When the screen (2821) moves closer by d (2830), the object sound source must be localized at a distance r (2840) in front of the user so that the distance of the object sound source reproduced through the speakers matches that of the object image intended to be displayed on the screen. That is, the localization position r (2840) of the object sound source is determined by the amount of change d (2830) in the screen position. The range of change d (2830) is not limited to positive values: the screen may move away from the user (d positive) or closer to the user (d negative), and a positive or negative r may be calculated depending on d.

To this end, the present invention assumes two virtual speakers (2861 and 2862) and uses them to localize the object sound source (2850) at a distance r (2840) from the user. Ordinarily, localizing an object sound source at a desired position is done by generating the HRTFs from that position to the user's two ears. However, HRTFs are generally known to be effective for producing left-right angular changes of a source but inadequate for controlling its distance or depth.

To adjust the distance of the sound source, the present invention assumes two virtual speakers and selectively uses part of the HRTFs from them to the user's two ears. Specifically, when generating the object sound source (2850) at the distance r (2840), the HRTF from the object sound source (2850) to the user's right ear is calculated using the HRTF (2871) from the left virtual speaker (2861) to the user's right ear. Likewise, the HRTF from the object sound source (2850) to the user's left ear is calculated using the HRTF (2872) from the right virtual speaker (2862) to the user's left ear. The delay of the HRTFs from the object sound source (2850) to the user's two ears is then compensated based on the distances between the object sound source and the two ears (2881 and 2882) and the distances between the virtual speakers and the two ears (2891 and 2892). Finally, the delay-compensated HRTFs are convolved with the object sound source signal to generate the object sound source at a position that matches the object image. In addition, because the present invention localizes the source by creating virtual speakers, a cross-talk canceling technique may additionally be used to offset the adverse effect of the acoustic channels (cross-talk) between the speakers that actually output the sound and the user's two ears.
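A minimal sketch of the delay compensation and convolution steps, under stated assumptions. The text does not say how the delay is compensated; here the borrowed head-related impulse response (the time-domain form of the HRTF) is shifted by the travel-time difference between the object-to-ear path and the virtual-speaker-to-ear path, rounded to whole samples. The speed of sound, all function and parameter names, and the whole-sample shift are illustrative assumptions.

```python
SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def compensate_delay(hrir, d_speaker_ear, d_object_ear, fs):
    """Shift a virtual-speaker impulse response to match the object path delay.

    hrir: impulse response from a virtual speaker to one ear (list of floats).
    d_speaker_ear, d_object_ear: path lengths in metres.
    fs: sampling rate in Hz.
    """
    shift = round((d_object_ear - d_speaker_ear) / SPEED_OF_SOUND * fs)
    if shift >= 0:
        # Object path is longer than the speaker path: add leading silence.
        return [0.0] * shift + list(hrir)
    # Object path is shorter: drop leading samples (assumed to be pre-delay).
    return list(hrir[-shift:])


def convolve(x, h):
    """Direct-form linear convolution of two finite sequences."""
    if not x or not h:
        return []
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def render_object(signal, hrir_2871, hrir_2872, d_obj_left, d_obj_right,
                  d_rspk_left, d_lspk_right, fs):
    """Return (left_ear, right_ear) signals for the object source.

    d_obj_left / d_obj_right: object-to-ear distances (2881 / 2882).
    d_rspk_left: right virtual speaker to left ear distance.
    d_lspk_right: left virtual speaker to right ear distance.
    """
    # Right ear borrows the left virtual speaker's response (2871).
    h_right = compensate_delay(hrir_2871, d_lspk_right, d_obj_right, fs)
    # Left ear borrows the right virtual speaker's response (2872).
    h_left = compensate_delay(hrir_2872, d_rspk_left, d_obj_left, fs)
    return convolve(signal, h_left), convolve(signal, h_right)
```

A practical renderer would more likely use fractional-delay filtering and FFT-based convolution; the direct forms are used here only to keep the sketch self-contained.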

In adjusting AV mismatch, the present invention is implemented as a sequential execution or combination of an initial calibration unit (matrix initialization), a user position calibration unit (matrix update), and an object position calibration unit (object position information correction). FIG. 26 is an overall structural diagram showing the relationship among these blocks.

Among the various sources of mismatch in a usage environment, the arrangement of the screen and speakers rarely changes much once installed, whereas the positions of the object sound source and the user vary considerably over time. An efficient system is therefore achieved by dividing the processing into modules whose update rates differ according to how quickly their inputs change.

(Initial calibration unit (matrix initialization))

The initial calibration unit (2610) initializes the flexible rendering matrix with respect to an arbitrary (or user-specified) sweet spot, in accordance with the arrangement of the screen and speakers in the installation environment. Once initialized, the rendering matrix can continue to be used unmodified as long as the installation environment does not change, so the initial calibration process does not run unless the user explicitly triggers it. When the user chooses to run the initial calibration, the positions of the installed devices may be entered directly by the user (using a UI device) or measured by various methods (for example, automatic position tracking using inter-device communication).

(User position calibration unit (matrix update))

The user position calibration unit (2620) corrects the flexible rendering matrix according to the position of the user, which may change from time to time. That is, according to how far the user (more precisely, the position and orientation of the user's ears) has moved, it updates the rendering matrix that was originally initialized with respect to the sweet spot by the initial calibration unit (2610). The user's position may be measured by various techniques (for example, head tracking, or position reporting via a remote control); alternatively, the user may enter the information directly or select one of several viewing positions designated by the renderer.

(Object position calibration unit (object position information correction))

The object position calibration unit (2630) uses the previously measured positions of the screen and the user to update the position information of the object sound source so that, as perceived by the user, the picture and the sound image coincide (lip-synchronization). Whereas the initial calibration unit and the user position calibration unit directly determine the constant values of the flexible rendering matrix, the object position calibration unit corrects the object sound source position information that is fed, together with the object sound source signal, into the existing flexible rendering matrix.

For the proposed method to modify the object sound source position information, the depth of the object relative to the screen (how far behind or in front of it the object lies) must be determined when the content is created and included in the object position information. Alternatively, the depth of the object can be derived from the existing object sound source position information and the screen position information. The object position calibration unit considers this decoded depth information together with the distance between the screen and the user, calculates the angular position of the object relative to the user, and modifies the object sound source position information accordingly. The modified object position information is passed to the flexible rendering stage together with the rendering matrix update information calculated earlier (by the initial calibration unit and the user position calibration unit) and is used to produce the final speaker channel signals.

In summary, the proposed invention concerns a rendering technique that assigns the output of an object sound source to each speaker. It takes as input the object header (position) information containing the temporal and spatial position of the object, position information describing the mismatch between the screen and the speakers, and the position and rotation of the user's head, and determines the gain and delay values that correct the localization of the object sound source.
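As one concrete illustration of how gain and delay values can follow from positions, the sketch below applies a common distance-compensation rule for a listener who is away from the sweet spot: nearer speakers are delayed and attenuated so that all speakers appear equidistant. This rule is a generic technique chosen for illustration and is not taken from this document; the function name `align_speakers`, the 1/distance level model, and the speed of sound are assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def align_speakers(speakers, listener):
    """Return per-speaker (gains, delays) that time- and level-align speakers.

    speakers: list of coordinate tuples in metres.
    listener: coordinate tuple in metres.
    Delays are in seconds; gains are linear and at most 1.0.
    """
    dists = [math.dist(s, listener) for s in speakers]
    far = max(dists)
    if far == 0.0:
        raise ValueError("listener coincides with every speaker")
    # Delay nearer speakers so all wavefronts arrive together.
    delays = [(far - d) / SPEED_OF_SOUND for d in dists]
    # Attenuate nearer speakers, assuming level falls off as 1/distance.
    gains = [d / far for d in dists]
    return gains, delays


# Listener 1 m from the left speaker and 3 m from the right one.
gains, delays = align_speakers([(-1.0, 0.0), (3.0, 0.0)], (0.0, 0.0))
```

The farthest speaker always gets unit gain and zero delay, so the correction never boosts a channel.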

FIG. 27 is a flowchart of the present invention. When the system is installed, the initial calibration unit (2610) runs for the initial usage environment and initializes the flexible rendering matrix with respect to an arbitrary (or user-specified) sweet spot, in accordance with the arrangement of the screen and speakers. Once executed, the initial calibration process does not run again unless the user explicitly triggers it; it runs only when the user chooses to run it or when the position of a speaker or the screen has changed.

The user position calibration unit (2620) corrects the flexible rendering matrix according to the position of the user, which may change from time to time. The object position calibration unit (2630) then uses the previously measured positions of the screen and the user to update the position information of the object sound source so that, as perceived by the user, the picture and the sound image coincide (lip-synchronization).

The object sound source is then reproduced by applying the flexible rendering process (2640), with the rendering matrix modified by the initial calibration unit and the user position calibration unit, to the object position information modified by the object position calibration unit. The process described above is repeated until the signal ends.
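The control flow of FIG. 27 can be sketched as a loop in which each stage runs only as often as its inputs change. This is an illustrative skeleton rather than the actual implementation: the frame dictionary layout, the function names, and the stand-in "matrix" (a plain dict) are all assumptions, and the matrix computation, position correction, and rendering are left as stubs.

```python
def correct_object_position(obj_pos, screen_pos, user_pos):
    """Stub for the object position calibration unit (2630)."""
    return obj_pos  # a real system would apply the screen/user geometry here


def flexible_render(matrix, obj_pos, samples):
    """Stub for the flexible renderer (2640)."""
    return samples  # a real system would produce speaker channel signals


def run_pipeline(frames, layout):
    """Process frames and return the list of calibration steps that ran.

    frames: iterable of dicts with keys 'user', 'object', 'samples' and an
            optional 'layout' key that signals a changed installation.
    layout: initial screen/speaker layout (hypothetical dict with a
            'screen' key).
    """
    steps = []
    matrix = None
    last_user = None
    for frame in frames:
        new_layout = frame.get("layout", layout)
        if matrix is None or new_layout != layout:
            # Initial calibration (2610): first frame or changed installation.
            layout = new_layout
            matrix = {"layout": layout, "user": None}
            last_user = None
            steps.append("init")
        if frame["user"] != last_user:
            # User position calibration (2620): only when the user has moved.
            matrix["user"] = frame["user"]
            last_user = frame["user"]
            steps.append("update")
        # Object position calibration (2630): every frame.
        pos = correct_object_position(frame["object"], layout["screen"],
                                      frame["user"])
        steps.append("correct")
        flexible_render(matrix, pos, frame["samples"])
    return steps
```

For example, three frames in which the user moves only before the third frame trigger one initialization, two matrix updates, and three object corrections.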

FIG. 29 shows the relationship among products in which an audio signal processing apparatus according to an embodiment of the present invention is implemented. Referring to FIG. 29, the wired/wireless communication unit (310) receives a bitstream through a wired or wireless communication scheme. Specifically, the wired/wireless communication unit (310) may include at least one of a wired communication unit (310A), an infrared communication unit (310B), a Bluetooth unit (310C), and a wireless LAN communication unit (310D).

The user authentication unit (320) receives user information and performs user authentication. It may include at least one of a fingerprint recognition unit (320A), an iris recognition unit (320B), a face recognition unit (320C), and a voice recognition unit (320D), which respectively receive fingerprint, iris, facial contour, and voice information, convert it into user information, and authenticate the user by determining whether the user information matches previously registered user data.

The input unit (330) is an input device through which the user enters various commands and may include at least one of a keypad unit (330A), a touchpad unit (330B), and a remote control unit (330C), although the present invention is not limited thereto.

The signal coding unit (340) encodes or decodes the audio signal and/or video signal received through the wired/wireless communication unit (310) and outputs a time-domain audio signal. It includes an audio signal processing apparatus (345), which corresponds to the embodiments of the present invention described above (that is, the decoder (600) according to one embodiment and the encoder and decoder (1400) according to another embodiment). The audio signal processing apparatus (345) and the signal coding unit that includes it may be implemented by one or more processors.

The control unit (350) receives input signals from the input devices and controls all processes of the signal coding unit (340) and the output unit (360). The output unit (360) outputs the output signals generated by the signal coding unit (340) and may include a speaker unit (360A) and a display unit (360B). An audio output signal is output through the speaker, and a video output signal is output through the display.

The audio signal processing method according to the present invention may be implemented as a program to be executed on a computer and stored on a computer-readable recording medium, and multimedia data having a data structure according to the present invention may also be stored on a computer-readable recording medium. The computer-readable recording medium includes every kind of storage device that stores data readable by a computer system. Examples include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, as well as media implemented in the form of a carrier wave (for example, transmission over the Internet). In addition, a bitstream generated by the encoding method may be stored on a computer-readable recording medium or transmitted over a wired/wireless communication network.

As described above, although the present invention has been described with reference to limited embodiments and drawings, it is not limited thereto, and those of ordinary skill in the art to which the invention pertains may make various modifications and variations within the technical idea of the invention and the scope of equivalents of the claims set forth below.

210: Upper layer
220: Middle layer
230: Bottom layer
240: LFE channel
310: Wired/wireless communication device
310A: Wired communication device
310B: Infrared device
310C: Bluetooth device
310D: Wireless LAN device
320: User authentication device
320A: Fingerprint recognition
320B: Iris recognition device
320C: Face recognition device
320D: Voice recognition device
330: Input device
330A: Keypad
330B: Touchpad
330C: Remote control
340: Signal encoding device
345: Audio signal processing device
350: Control device
360: Output device
360A: Speaker
360B: Display
410: First object signal group
420: Second object signal group
430: Listener
520: Downmixer & parameter encoder 1
550: Object grouping unit
560: Waveform encoder
620: Waveform decoder
630: Upmixer & parameter decoder 1
670: Object degrouping unit
860: 3DA decoding unit
870: 3DA rendering unit
1110: Masking threshold curve of object 1
1120: Masking threshold curve of object 2
1130: Masking threshold curve of object 3
1230: Psychoacoustic model
1310: 5.1-channel loudspeakers placed according to the ITU-R recommendation
1320: 5.1-channel loudspeakers placed at arbitrary positions
1610: 3D audio decoder
1620: 3D audio renderer
1630: Sequence control unit
1810: Subband filter unit
1820: Time delay filter unit
1830: Panning algorithm
1840: Speaker array control unit
2040: Object image on the screen
2050: User located at the sweet spot
2060: Position of the object sound source intended by the content provider
2110: User at an off-sweet-spot position
2120: Position of the object sound source mislocalized by existing rendering techniques
2210: Middle layer changed by a change in speaker positions
2220: User located at the sweet spot of the changed middle layer
2230: Position of the object sound source mislocalized by existing rendering techniques
2310: Position of the user
2320: Position of the screen before the change
2321: Object image on the screen before the change
2330: Position of the changed screen
2331: Object image on the changed screen
2340: Amount of change in screen height
2350: Amount of change in the object sound source position matched to the changed object image position
2360: Position of the object sound source before the screen position change
2370: Position of the object sound source corrected by existing rendering techniques
2380: Object sound source position matched to the changed object image position
2410: Position of the user
2420: Position of the screen before the change
2430: Position of the object sound source before the change
2431: Position of the object sound source after the change
2440: Elevation angle of the object sound source before the change
2441: Elevation angle of the object sound source after the change
2450: Distance of the object sound source before the change
2451: Distance of the object sound source after the change
2460: Distance between the user and the screen
2470: Depth from the screen to the object sound source
2510: Position of the user
2520: Position of the screen before the change
2530: Position of the object sound source before the change
2531: Position of the object sound source after the change
2540: Angle of the object sound source before the change
2541: Angle of the object sound source after the change
2550: Distance of the object sound source before the change
2551: Distance of the object sound source after the change
2560: Distance between the user and the screen
2570: Depth from the screen to the object sound source
2610: Initial calibration unit
2620: User position calibration unit
2630: Object position calibration unit
2640: Flexible renderer
2810: Position of the user
2820: Distance between the screen and the user before the change
2821: Screen
2830: Distance by which the screen has moved toward the user
2840: Distance between the user and the position where the object sound source should be localized after the screen position changes
2850: Object sound source
2861: Left virtual speaker
2862: Right virtual speaker
2871: HRTF between the left virtual speaker and the user's right ear
2872: HRTF between the right virtual speaker and the user's left ear
2881: Distance between the object sound source and the user's left ear
2882: Distance between the object sound source and the user's right ear

Claims

An audio signal processing method, comprising:
receiving user position information and first position information of a screen;
receiving an object sound source signal and object sound source position information;
receiving second position information of the screen, the second position information being information on a position changed from the first position;
obtaining a modified value of the object sound source position information using the object sound source position information and the second position information;
generating filter coefficients using the modified value of the object sound source position information; and
generating a rendered channel signal using the generated filter coefficients and the object sound source signal.

The method according to claim 1,
wherein the receiving of the second position information of the screen is performed based on a change in the installation environment of the screen or on a user input.

The method according to claim 1,
wherein the receiving of the second position information of the screen comprises tracking the position information of the screen using inter-device communication and receiving the second position information based on the tracked position information.

The method according to claim 1, (It is unclear what this means: is the depth information generated using the first position information of the screen and then modified, or is it generated directly using the second position information?) wherein the outputting of the modified object sound source position information comprises generating depth information of the object using the screen position information and the user position information, and generating modified object sound source position information corrected with respect to the user.

The method according to claim 1,
wherein the generating of the filter coefficients comprises generating HRTF (Head Related Transfer Function) information using the modified value of the object sound source position information.