KR20080031408A

KR20080031408A - Processing method and device with video temporal up-conversion

Info

Publication number: KR20080031408A
Application number: KR1020087003479A
Authority: KR
Inventors: 함 벨트
Original assignee: 코닌클리케 필립스 일렉트로닉스 엔.브이.
Priority date: 2005-07-13
Filing date: 2006-07-07
Publication date: 2008-04-08
Also published as: JP2009501476A; US20100060783A1; RU2008105303A; WO2007007257A1; EP1905243A1; CN101223786A

Abstract

The present invention provides an improved method and device for visual enhancement of a digital image in video applications. In particular, the invention is concerned with a multi-modal scene analysis for finding faces or people, followed by the visual emphasis of one or more participants on the screen, or of the person speaking among a group of participants, to achieve improved perceived quality and situational awareness during a video conference call. The analysis is performed by means of a segmenting module (22), which defines at least one region of interest (ROI) and one region of no interest (RONI).

Description

Processing method and device with video temporal up-conversion

The present invention relates to visual communication systems and, more particularly, to a method and device for providing temporal up-conversion in video telephony systems to improve the quality of visual images.

In general, video quality is a key characteristic for the general acceptance of video telephony applications. It is very important that video telephony systems convey the situation at the other side to end users as accurately as possible, so as to enhance the user's situational awareness and thereby the perceived quality of the video call.

Although video conferencing systems have attracted considerable attention since they were first introduced many years ago, they have not become widely used, and a broad breakthrough of these systems has not yet taken place. This has generally been due to insufficient available communication bandwidth, which leads to unacceptably poor video and audio transmission quality, such as low resolution, blocky images, and long delays.

However, recent technological innovations capable of providing sufficient communication bandwidth are becoming more widely available to a growing number of end users. In addition, the availability of powerful computing systems, such as PCs and mobile devices with integrated displays, cameras, microphones, and speakers, is increasing rapidly. For these reasons, a breakthrough and higher quality may be expected in the use and application of consumer video conferencing systems, as the audio and video quality of video conferencing solutions becomes one of the most important distinguishing factors in this demanding market.

Generally speaking, many conventional algorithms and techniques have been proposed and implemented to improve video conferencing images. For example, various efficient video encoding techniques have been applied to improve video encoding efficiency. In particular, such proposals (see, e.g., S. Daly et al., "Face-Based Visually-Optimized Image Sequence Coding", 0-8186-8821-1/98, pages 443-447, IEEE) aim at improving video encoding efficiency based on the selection of a region of interest (ROI) and a region of no interest (RONI). More specifically, the proposed encoding is performed in such a way that most bits are allocated to the ROI and fewer bits are allocated to the RONI. Consequently, the overall bit-rate remains constant but, after decoding, the image quality of the ROI is higher than that of the RONI. Other proposals, such as US 2004/0070666 A1 to Bober et al., primarily suggest applying smart zooming techniques before video encoding, so that a person in the camera's field of view is zoomed in on by digital means and irrelevant background image portions are not transmitted. In other words, this method transmits an image by coding only the selected regions of interest of each captured image.

However, the conventional techniques described above are often unsatisfactory for a number of reasons. No processing or analysis is performed on the captured images to counteract the adverse effects that transmission in video communication systems has on image quality. Furthermore, improved coding schemes, although they may give acceptable results, cannot be applied independently across the board to all coding schemes, and such techniques require that particular video encoding and decoding techniques be implemented first. In addition, none of these techniques appropriately addresses the problems of low situational awareness and poor perceived quality of a video teleconferencing call.

Accordingly, it is an object of the present invention to provide a new and improved method and device that address the above-mentioned problems and that handle image quality enhancement efficiently while being cost-effective and simple to implement.

To this end, the invention relates to a method of processing video images, comprising the steps of: detecting at least one person in an image of a video application; estimating the motion associated with the detected person in the image; segmenting the image into at least one region of interest and at least one region of no interest, wherein the region of interest includes the detected person in the image; and applying temporal frame processing to a video signal including the image, using a higher frame rate in the region of interest than that applied in the region of no interest.

One or more of the following features may also be included.

In one aspect of the invention, the temporal frame processing includes temporal frame up-conversion processing applied to the region of interest. In another aspect, the temporal frame processing includes temporal frame down-conversion processing applied to the region of no interest.

In yet another aspect, the method also includes combining the output information from the temporal frame up-conversion processing step with the output information from the temporal frame down-conversion processing step to generate an enhanced output image. Furthermore, the visual image quality enhancement steps can be performed either at the transmitting end or at the receiving end of the video signal associated with the image.

Moreover, the step of detecting the person identified in the image of the video application may include detecting lip activity in the image, as well as detecting audio speech activity in the image. Also, the step of applying temporal frame up-conversion processing to the region of interest may be carried out only when lip activity and/or audio speech activity has been detected.

In other aspects, the method also includes segmenting the image into at least a first region of interest and a second region of interest, selecting the first region of interest for temporal frame up-conversion processing that increases its frame rate, and leaving the frame rate of the second region of interest untouched.

The invention also relates to a device configured to process video images, wherein the device includes: a detecting module configured to detect at least one person in an image of a video application; a motion estimation module configured to estimate the motion associated with the detected person in the image; a segmenting module configured to segment the image into at least one region of interest and at least one region of no interest, wherein the region of interest includes the detected person in the image; and at least one processing module configured to apply temporal frame processing to a video signal including the image, using a higher frame rate in the region of interest than that applied in the region of no interest.

Other features of the method and device are further recited in the dependent claims.

Embodiments may have one or more of the following advantages.

The invention advantageously enhances the visual perception of video conferencing systems for the relevant image portions, and increases the level of situational awareness by making the visual images associated with the participants or persons who are speaking clear relative to the rest of the image.

Further, the invention can be applied at the transmitting end, which results in higher video compression efficiency: relatively more bits are allocated to the enhanced region of interest (ROI) and relatively fewer bits to the region of no interest (RONI), so that, for the same bit-rate, important and relevant video data such as facial expressions are transmitted better.

Additionally, the method and device of the invention can be applied independently of any coding scheme that may be used in video telephony implementations. The invention requires neither video encoding nor decoding. Also, the method can be applied at the camera side in video telephony for an improved camera signal, or at the display side for an improved display signal. Therefore, the invention can be applied at both the transmitting and receiving ends.

As another advantage, the identification process for face detection can be made more robust and fail-proof by combining various face detection techniques or modalities, such as a lip activity detector and/or audio localization algorithms. As yet another advantage, computations can be saved because motion-compensated interpolation is applied only in the ROI.

Therefore, with the implementation of the invention, video quality is greatly enhanced, leading to better acceptance of video telephony applications by increasing the persons' situational awareness and thereby the perceived quality of the video call. More specifically, the invention is able to transmit higher-quality facial expressions, for improved intelligibility of the images and for conveying different types of facial emotions and expressions. Increasing this type of situational awareness in today's group video conferencing applications is tantamount to increased usage and reliability, especially when, for example, participants or persons in a conference call are not familiar with the other participants.

These and other aspects of the invention will become apparent from, and will be elucidated with reference to, the embodiments described below, the drawings, and the claims.

FIG. 1 shows a schematic functional block diagram of one embodiment of an improved method for image quality enhancement according to the invention.

FIG. 2 shows a flowchart of one embodiment of the improved method for image quality enhancement in accordance with FIG. 1.

FIG. 3 is a flowchart of another embodiment of the improved method for image quality enhancement according to the invention.

FIG. 4 is a flowchart of another embodiment of the improved method for image quality enhancement according to the invention.

FIG. 5 is a flowchart of another embodiment of the improved method for image quality enhancement according to the invention.

FIG. 6 is a schematic functional block diagram of another embodiment of the improved method for image quality enhancement according to the invention.

FIG. 7 is a schematic functional block diagram for image quality enhancement shown for a multi-person video conferencing session, in accordance with the invention.

FIG. 8 is another schematic functional block diagram for image quality enhancement shown for a multi-person video conferencing session, in accordance with the invention.

FIG. 9 is a flowchart illustrating the method steps used in one embodiment of the improved method for image quality enhancement, in accordance with FIG. 8.

FIG. 10 shows a typical image taken from a video application, as an exemplary case.

FIG. 11 shows the implementation of a face tracking mechanism, in accordance with the invention.

FIG. 12 illustrates the application of an ROI/RONI segmentation process.

FIG. 13 illustrates ROI/RONI segmentation based on a head-and-shoulder model.

FIG. 14 illustrates frame rate conversion, in accordance with an embodiment of the invention.

FIG. 15 illustrates an optimization technique implemented in the boundary areas between the ROI and RONI areas.

The invention deals with, for example, enhancing the perception of persons in the images of a video telephony system, and with enhancing the situational awareness of a video teleconferencing session.

Referring to FIG. 1, the essential features of the invention are explained with regard to applying image quality enhancement to, for example, a one-person video conferencing session. At the transmitting end, the "video in" signal 10 (V_in) is the camera input, i.e., the recorded camera signal. The "video out" signal 12 (V_out), on the other hand, is the signal that will be coded and transmitted. At the receiving end, by contrast, signal 10 is the received and decoded signal, and signal 12 is sent to the display for the end users.

To implement the invention, an image segmentation technique needs to be applied for the selection of an ROI containing the participant of the conference call. To this end, a face tracking module 14 can be used to find information 20 on the position and size of a face in an image. Various face detection algorithms are well known in the art. For example, to find the face of a person in an image, a skin color detection algorithm, or a combination of skin color detection with elliptical object boundary searching, can be used. Alternatively, additional methods that identify a face by searching for critical features in the image may be used. Therefore, many available robust methods for finding and applying efficient object classifiers may be integrated in the invention.

Subsequent to identifying the face of a participant in the image, a motion estimation module 16 is used to calculate motion vector fields 18. Thereafter, using the information 20 on face position and size, ROI/RONI segmentation is performed around the participant by the ROI/RONI segmenting module 22, for example using a simple head-and-shoulder model. Alternatively, an ROI may be tracked using motion detection (not motion estimation) on a block-by-block basis. In other words, an object is formed by grouping the blocks in which motion has been detected, the ROI being the object with the most moving blocks. Additionally, methods using motion detection reduce the computational complexity of the image processing.
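The block-wise motion detection alternative described above can be illustrated with a minimal sketch. It is not the patent's implementation: the function names, the block size, the threshold, and the use of a simple bounding box to group moving blocks are all assumptions made for illustration. Frames are plain lists of rows of luminance values.

```python
def detect_moving_blocks(prev, curr, block=2, threshold=10):
    """Return the set of (block_row, block_col) indices whose mean absolute
    frame difference exceeds the threshold (motion detection, not estimation)."""
    rows, cols = len(curr), len(curr[0])
    moving = set()
    for by in range(0, rows, block):
        for bx in range(0, cols, block):
            diffs = [abs(curr[y][x] - prev[y][x])
                     for y in range(by, min(by + block, rows))
                     for x in range(bx, min(bx + block, cols))]
            if sum(diffs) / len(diffs) > threshold:
                moving.add((by // block, bx // block))
    return moving


def bounding_roi(moving):
    """Group moving blocks into one ROI, given as (top, left, bottom, right)
    in block units. Hypothetical grouping rule: a plain bounding box."""
    if not moving:
        return None
    ys = [b[0] for b in moving]
    xs = [b[1] for b in moving]
    return (min(ys), min(xs), max(ys), max(xs))
```

Because only a per-block difference is computed, no motion vectors are needed for this step, which is consistent with the complexity saving mentioned in the text.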

Next, ROI/RONI processing takes place. For the ROI segment 24, the pixels within the ROI segment 24 are visually emphasized by a temporal frame rate up-conversion module 26 for visual enhancement. This is combined, for the RONI segment 28, with a temporal frame down-conversion module 30 for the remaining image portions, which are to be de-emphasized. Then, the ROI and RONI processed outputs are combined in a recombining module 32 to form the "output" signal 12 (V_out). Using the ROI/RONI processing, the ROI segment 24 is visually improved and brought to the foreground as more important relative to the less relevant RONI segment 28.
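As a rough sketch of the ROI/RONI processing chain (modules 26, 30, and 32), the following toy example doubles the output frame rate inside the ROI and holds RONI pixels between refreshes. Several simplifications are assumptions, not taken from the patent: a plain average of neighboring frames stands in for motion-compensated interpolation, frame repetition stands in for down-conversion, and the names `blend`, `recombine`, `process`, and `roni_hold` are hypothetical.

```python
def blend(a, b):
    """Midpoint of two frames. A plain average is used here as a stand-in
    for motion-compensated interpolation."""
    return [[(pa + pb) / 2 for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


def recombine(roi_src, roni_src, mask):
    """Take ROI pixels (mask True) from roi_src and RONI pixels from roni_src."""
    return [[r if m else n for r, n, m in zip(rr, nr, mr)]
            for rr, nr, mr in zip(roi_src, roni_src, mask)]


def process(frames, mask, roni_hold=2):
    """Emit an output sequence at twice the input rate. ROI pixels change in
    every output frame; RONI pixels are refreshed only every roni_hold inputs."""
    out = []
    held = frames[0]
    for i, frame in enumerate(frames):
        if i % roni_hold == 0:
            held = frame  # RONI content is updated only at these instants
        out.append(recombine(frame, held, mask))
        if i + 1 < len(frames):
            # Interpolated in-between frame, used for the ROI only
            out.append(recombine(blend(frame, frames[i + 1]), held, mask))
    return out
```

With a one-row, two-pixel image whose first pixel is ROI and second pixel is RONI, three input frames yield five output frames in which the ROI pixel advances smoothly while the RONI pixel stays frozen until its next refresh.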

Referring now to FIG. 2, a flowchart 40 illustrates the basic steps of the invention described in FIG. 1. In a first "input" step 42, the video signal is, for example, the camera input, i.e., the recorded camera signal. Next, a face detection step 44 is performed in the face tracking module 14 (shown in FIG. 1), using any of a number of existing algorithms. Moreover, a motion estimation step 46 is carried out to generate (48) the motion vectors that are later needed to up-convert the ROI or down-convert the RONI, respectively.

If a face has been detected in step 44, an ROI/RONI segmentation step 50 is performed, resulting in a generating step 52 for the ROI segment and a generating step 54 for the RONI segment. The ROI segment then undergoes a motion-compensated frame up-conversion step 56 using the motion vectors generated in step 48. Similarly, the RONI segment undergoes a frame down-conversion step 58. Subsequently, the processed ROI and RONI segments are combined in a combining step 60 to produce an output signal in step 62. If, on the other hand, no face has been detected in the face detection step 44, then step 64 (test "conversion down?") decides whether the image is to be subjected to down-conversion processing; if so, a down-conversion step 66 is performed. If the image is to be left untouched, the method simply proceeds to step 62 (direct connection) without step 66 and produces an unprocessed output signal.
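The branching of flowchart 40 can be summarized in a short sketch that returns the sequence of step numbers visited. The step numbers come from the description; the function name, its parameters, and the linear ordering of steps 46 and 48 relative to step 44 are assumptions made for illustration.

```python
def fig2_steps(face_detected, down_convert_without_face):
    """Return the step numbers of flowchart 40 visited for one image."""
    # Input, face detection, motion estimation, motion vector generation
    steps = [42, 44, 46, 48]
    if face_detected:
        # Segmentation, ROI/RONI generation, up-/down-conversion, combining
        steps += [50, 52, 54, 56, 58, 60]
    elif down_convert_without_face:
        # "Conversion down?" test answered yes: down-convert the whole image
        steps += [64, 66]
    else:
        # "Conversion down?" test answered no: direct connection to output
        steps += [64]
    steps.append(62)  # output signal
    return steps
```

For example, `fig2_steps(False, False)` visits the test in step 64 and then goes straight to the output step 62, matching the unprocessed path in the text.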

Referring now to FIGS. 3 to 5, additional optimizations to the method steps of FIG. 2 are provided. Depending on whether or not the participant of the video teleconference is speaking, the ROI up-conversion process can be modified and optimized. In FIG. 3, a flowchart 70 shows the same steps as the flowchart 40 described in FIG. 2, with an additional lip detection step 71 subsequent to the face detection step 44. In other words, to identify who is speaking, lip activity detection can be applied to the video image, so that speech activity is measured using lip activity detection in the image sequence. For example, lip activity can be measured using conventional techniques for automatic lip reading or a variety of video lip activity detection algorithms. Thus, the addition of step 71 for lip activity detection mechanisms, which can be used at both the transmitting and receiving ends, makes the face tracking or detection step 44 more robust when combined with other modalities. In this way, the aim is to visually support the occurrence of speech activity by giving the ROI segment an increased frame rate only when the person or participant is speaking.

FIG. 3 also shows that the ROI up-conversion step 56 is carried out only when the lip detection step 71 is positive (Y). If no lip activity is detected, the flowchart 70 proceeds to the conversion-down step 64, which ultimately leads to step 62 of generating the video-out signal.

Referring now to FIG. 4, additional modalities are implemented in a flowchart 80. Since the face tracking or detection step 44 cannot always guarantee error-free face detection, it may identify a face where no real person is present. However, by combining the techniques of face tracking and detection with modalities such as lip activity (FIG. 3) and audio localization algorithms, the face tracking step 44 can be made more robust. Therefore, FIG. 4 adds an optimization that uses an audio-in step 81 followed by an audio detection step 82, which operate in parallel with the video-in step 42 and the face detection step 44.

In other words, when audio is available because a person is speaking, a speech activity detector can be used. For example, a speech activity detector based on the detection of dynamic events in the audio signal, combined with a pitch detector, may be used. At the transmitting end, that is, in the audio-in step 81, the "audio in" signal is the microphone input. At the receiving end, the "audio in" signal is the received and decoded audio. Therefore, for increased certainty of audio activity detection, combined audio/video speech activity detection is performed by a logical AND on the individual detector outputs.
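The logical AND of the individual detector outputs can be sketched as follows. The two detectors shown are deliberately crude stand-ins and are assumptions, not the detectors named in the text: a short-term energy threshold replaces the dynamic-event and pitch-based speech detector, and a mean absolute difference over a mouth region replaces a real lip activity algorithm. All names and thresholds are hypothetical.

```python
def audio_speech_flag(samples, energy_threshold=0.01):
    """Stand-in audio detector: mean squared amplitude above a threshold."""
    energy = sum(s * s for s in samples) / len(samples)
    return energy > energy_threshold


def lip_activity_flag(prev_mouth, curr_mouth, diff_threshold=5.0):
    """Stand-in lip detector: mean absolute change of mouth-region pixels."""
    diffs = [abs(c - p) for p, c in zip(prev_mouth, curr_mouth)]
    return sum(diffs) / len(diffs) > diff_threshold


def speech_active(samples, prev_mouth, curr_mouth):
    """Combined audio/video speech activity: logical AND of both detectors."""
    return (audio_speech_flag(samples)
            and lip_activity_flag(prev_mouth, curr_mouth))
```

Requiring both flags trades some sensitivity for fewer false alarms: background noise without lip movement, or lip movement without sound, does not trigger up-conversion.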

Similarly, FIG. 4 shows that the ROI up-conversion step 56 in the flowchart 80 is carried out only when the audio detection step 82 has positively detected an audio signal. If an audio signal has been detected then, following the positive detection of a face, the ROI/RONI segmentation step 50 is performed, followed by the ROI up-conversion step 56. However, if no audio speech has been detected, the flowchart 80 proceeds to the conversion-down step 64, which ultimately leads to step 62 of generating the video-out signal.

Referring to FIG. 5, a flowchart 90 illustrates the combination of implementing the audio speech activity and video lip activity detection processes. Thus, FIG. 3 and FIG. 4 in combination result in the flowchart 90, which provides a very robust means of identifying or detecting the person or participant of interest and of correctly analyzing the ROI.

Further, FIG. 6 shows a schematic functional block diagram of the flowchart for image quality enhancement applied to a one-person video conferencing session, implementing both the audio speech detection and video lip activity detection steps. Similar to the functional features illustrated in FIG. 1, at the transmitting end the input signal 10 (V_in) is the camera input, i.e., the recorded camera signal. Along the same lines, the "audio in" input signal 11 (A_in) is input, and an audio algorithm module 13 is applied to detect whether any speech signal is present. At the same time, a lip activity detection module 15 analyzes the video-in signal to determine whether there is any lip activity in the received signal. Consequently, if the audio algorithm module 13 produces a true-or-false speech activity flag 17 that turns out to be true, the ROI up-convert module 26, upon receiving the ROI segment 24, performs a frame rate up-conversion for the ROI segment 24. Likewise, if the lip activity detection module 15 finds the true-or-false lip activity flag 19 to be true, the module 26, upon receiving the ROI segment 24, performs a frame rate up-conversion for the ROI segment 24.

Referring now to FIG. 7, if multiple microphones are available at the transmitting end, a very robust and efficient method of finding the location of a speaking person can be implemented. That is, the combination of audio and video algorithms is very powerful for enhancing the detection and identification of persons, especially for identifying multiple persons or participants who are speaking. It can be applied when multi-sensor audio data (rather than mono audio) is available, especially at the transmitting end. Alternatively, to make the system still more robust and to identify precisely those persons who are speaking, lip activity detection in the video can be applied, which is possible at both the transmitting and receiving ends.

In FIG. 7, a schematic functional block diagram for image quality improvement is shown for a multiple-person video telephony conference session. When multiple persons or participants are present at the transmitting end, the face tracking module 14 may find more than one face, say N faces in total (x N). For each of the N faces detected by the face tracking module 14, i.e., for each of the N face positions and sizes, a multiple-person ROI/RONI segmentation module 22N (22-1, 22-2, ..., 22-N) generates an ROI segment and a RONI segment, again based, for example, on the head and shoulder model.

In the case where two ROIs are detected, the ROI selection module 23 selects the ROIs that must be processed for image quality improvement. The selection is based on the results of the audio algorithm module 13, which outputs the speech activity flag 17 and the locations (x, y coordinates) of the sound source or sources (connection 21 carries the (x, y) locations of the sound sources), and on the result of the lip activity detection module 15, namely the lip activity flag 19. In other words, in multi-microphone conferencing systems, multiple audio inputs are available at the receiving side. By applying the lip activity algorithms in conjunction with the audio algorithms, the direction and location (x, y coordinates) from which the speech or audio originates can then also be determined. This information can be used to target the intended ROI, namely the participant who is currently speaking in the image.

In this way, when two or more ROIs are detected by the face tracking module 14, the ROI selection module 23 selects the ROI associated with the person who is speaking, so that this person receives the most visual emphasis, while the remaining persons or participants of the teleconferencing session receive only slight emphasis against the RONI background.
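One way to picture the ROI selection is the sketch below. It is only an illustration under stated assumptions: the `FaceROI` record, the nearest-distance rule, and the preference for ROIs with lip activity are hypothetical choices made for this example; the text only states that the selection uses the speech activity flag, the sound-source coordinates, and the lip activity flag.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional, Tuple


@dataclass
class FaceROI:
    label: str          # hypothetical identifier for the participant
    cx: float           # face centre, x coordinate in pixels
    cy: float           # face centre, y coordinate in pixels
    lip_active: bool    # per-face lip activity flag


def select_speaker_roi(rois: List[FaceROI],
                       speech_flag: bool,
                       sound_xy: Optional[Tuple[float, float]] = None
                       ) -> Optional[FaceROI]:
    """Pick the ROI most likely to contain the current speaker."""
    if not rois:
        return None
    moving = [r for r in rois if r.lip_active]
    if not speech_flag and not moving:
        return None                      # nobody appears to be speaking
    candidates = moving or list(rois)    # prefer faces with lip activity
    if sound_xy is None:
        return candidates[0]
    sx, sy = sound_xy
    # Choose the candidate whose face centre is closest to the sound source.
    return min(candidates, key=lambda r: hypot(r.cx - sx, r.cy - sy))
```

The selected ROI would then be the one passed on for temporal up-conversion, with the remaining ROIs left at their original rate.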

Thereafter, the separate ROI and RONI segments undergo the image processing steps: frame rate up-conversion of the ROI by the ROI up-convert module 26, and frame rate down-conversion of the RONI by the RONI down-convert module 30, both using the information output by the motion estimation module 16. Moreover, the ROI segment may include all of the persons detected by the face tracking module 14. Assuming that persons farther away from the speaker are not participating in the video teleconferencing call, the ROI may include only those detected faces or persons that are close enough, as judged by inspecting the detected face size, i.e., those whose face size exceeds a certain percentage of the image size. Alternatively, the ROI segment may include only the person who is speaking, or the person who spoke last when no one has spoken since.
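The face-size criterion can be illustrated with a short sketch. The face representation as `(x, y, width, height)` boxes, the use of box area as the size measure, and the threshold `min_fraction=0.02` are assumptions made for this example; the text only says that the face size must exceed a certain percentage of the image size.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) of a detected face


def faces_in_roi(faces: List[Box],
                 image_w: int,
                 image_h: int,
                 min_fraction: float = 0.02) -> List[Box]:
    """Keep only faces whose box area is at least `min_fraction` of the image."""
    image_area = image_w * image_h
    return [f for f in faces if (f[2] * f[3]) / image_area >= min_fraction]
```

For a 176 x 144 image, a 40 x 50 face box covers about 7.9% of the image and is kept, while an 8 x 10 box covers about 0.3% and is dropped under this assumed threshold.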

Referring now to FIG. 8, another schematic functional block diagram for image quality improvement is shown for a multiple-person video conferencing session. The ROI selection module 23 selects two ROIs. This can arise because two ROIs are distinguished: the first ROI segment 24-1 is associated with the participant or person who is speaking, and the second ROI segment 24-2 is associated with the remaining detected participants. As shown, the first ROI segment 24-1 is temporally up-converted by the ROI_1 up-convert module 26-1, while the second ROI segment 24-2 is left untouched. As in the previous cases of FIGS. 5 and 6, the RONI segment 28 may also be temporally down-converted by the RONI down-convert module 30.

Referring to FIG. 9, flowchart 100 shows the steps used in one of the embodiments of the method for image quality improvement, as described above with reference to FIG. 8. In effect, flowchart 100 shows the basic steps followed by the various modules shown in FIG. 8 and also described with reference to FIGS. 2 to 5. In the first "video in" step 42, a video signal is input to the camera and becomes the recorded camera signal. This is followed by a face detection step 44 and an ROI/RONI segmentation step 50, resulting in N generation steps 52 for the ROI segments and a generation step 54 for the RONI segment. The generation steps 52 for the ROI segments comprise step 52a for the ROI_1 segment, step 52b for the ROI_2 segment, and so on up to step 52N for the ROI_N segment.

Next, a lip detection step 71 is performed after the face detection step 44 and the ROI/RONI segmentation step 50. As also shown in FIG. 8, if the lip detection step 71 is positive (Y), the ROI/RONI selection step 102 is performed. In a similar manner, the "audio in" step 81 is followed by an audio detection step 82, which operates concurrently with the video-in step 42, the face detection step 44, and the lip detection step 71, providing a more robust mechanism and process for accurately detecting the ROI areas of interest. The resulting information is used in the ROI/RONI selection step 102.

Subsequently, the ROI/RONI selection step 102 produces a selected ROI segment 104, which undergoes the frame up-convert step 56. The ROI/RONI selection 102 also produces the other ROI segments 106; for these, if the decision in step 64 on whether to subject the image to down-conversion is positive, the down-conversion step 66 is performed. If, on the other hand, the image is to be left untouched, the flow simply proceeds to step 60, which combines the temporally up-converted ROI image produced by step 56 with the RONI image produced by steps 54 and 66, eventually arriving at the unprocessed "video-out" signal in step 66.

Referring now to FIGS. 10 to 15, the techniques and methods used to achieve the image quality improvement are described. For example, the processes of motion estimation, face tracking and detection, ROI/RONI segmentation, and ROI/RONI temporal conversion processing are described in more detail.

Referring to FIGS. 10 to 12, an image 110 is shown, taken, for example, from a sequence shot with a web camera. For example, image 110 may have a resolution of 176 x 144 or 320 x 240 pixels and a frame rate between 7.5 Hz and 15 Hz, as is typical of today's mobile applications.

Motion estimation

Image 110 may be subdivided into blocks of 8 x 8 luminance values. For motion estimation, a 3D recursive search method may be used, for example. The result is a two-dimensional motion vector for each of the 8 x 8 blocks. This motion vector is denoted by $\vec{D}(\vec{x}, n)$, where the two-dimensional vector $\vec{x}$ contains the spatial x- and y-coordinates of the 8 x 8 block and n is the time index. The motion vector field is valid at a certain time instance between two original input frames. To make the motion vector field valid at another time instance between the two original input frames, motion vector retiming may be performed.
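To make the idea of one motion vector per 8 x 8 block concrete, the sketch below estimates a vector for a single block. Note that it uses a plain exhaustive block-matching search with a sum of absolute differences (SAD) cost, which is a simpler substitute chosen for illustration and is not the 3D recursive search named in the text. The function names, the `search` range, and the sign convention (the vector points from the block in the current frame to its best match in the previous frame) are assumptions of this example.

```python
from typing import List, Tuple

Frame = List[List[int]]  # luminance values, indexed as frame[y][x]


def sad(prev: Frame, curr: Frame, bx: int, by: int,
        dx: int, dy: int, size: int) -> int:
    """Sum of absolute differences between a block in `curr` and a shifted block in `prev`."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(curr[by + y][bx + x] - prev[by + y + dy][bx + x + dx])
    return total


def estimate_block_motion(prev: Frame, curr: Frame, bx: int, by: int,
                          size: int = 8, search: int = 2) -> Tuple[int, int]:
    """Return the (dx, dy) displacement into `prev` that best matches the block at (bx, by)."""
    h, w = len(prev), len(prev[0])
    best = (0, 0)
    best_cost = sad(prev, curr, bx, by, 0, 0, size)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates whose block would fall outside the previous frame.
            if not (0 <= bx + dx and bx + dx + size <= w and
                    0 <= by + dy and by + dy + size <= h):
                continue
            cost = sad(prev, curr, bx, by, dx, dy, size)
            if cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best
```

Running this over every 8 x 8 block position would give a motion vector field of the kind described above, although a recursive search is far cheaper than this exhaustive scan.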

Face detection

Referring now to FIG. 11, a face tracking mechanism is used to track the faces of persons 112 and 114. The face tracking mechanism finds the faces by finding the skin colors of persons 112 and 114 (the faces are shown darkened). Thus, a skin detector technique may be used. Ellipses 120 and 122 indicate the faces of persons 112 and 114 that have been found and identified. Alternatively, face detection is performed on the basis of trained classifiers, such as those disclosed in P. Viola and M. Jones, "Robust Real-Time Object Detection," Proceedings of the Second International Workshop on Statistical and Computational Theories of Vision - Modeling, Learning, Computing, and Sampling, Vancouver, Canada, July 13, 2001. Classifier-based methods have the advantage of being more robust against changing lighting conditions. In addition, detection may be restricted to faces that are near the faces already found. The face of person 118 is not found because the head is too small compared to the size of image 110. Therefore, person 118 is correctly assumed (in this case) not to be participating in any video conference call.

As mentioned previously, the robustness of the face tracking mechanism can be improved when it is combined with information from a video lip activity detector, which is usable at both the transmitting and receiving ends, and/or with an audio source tracker, which requires multiple microphone channels and is implemented at the transmitting end. Using a combination of these techniques, non-faces that are mistakenly found by the face tracking mechanism can be appropriately rejected.

ROI and RONI segmentation

Referring to FIG. 12, an ROI/RONI segmentation process is applied to image 110. Following the face detection process, the ROI/RONI segmentation process is applied to each detected face in image 110, based on a head and shoulder model. A head and shoulder contour 124 of person 112, enclosing the head and body of person 112, is identified and separated. The size of this coarse head and shoulder contour 124 is not critical, but it should be large enough to ensure that the body of person 112 lies entirely within contour 124. Thereafter, temporal up-conversion is applied only to the pixels of this ROI, which is the area within the head and shoulder contour 124.

ROI and RONI frame rate conversion

The ROI/RONI frame rate conversion uses the motion estimation process, based on the motion vectors of the original image.

Referring now to FIG. 13, the ROI/RONI segmentation based on the head and shoulder model, as described with reference to FIG. 12, is shown in three diagrams 130A to 130C, for example for the original input images or pictures 132A (at t = (n-1)T) and 132B (at t = nT). For the interpolated picture 134 (t = (n-α)T; diagram 130B), a pixel at a given location belongs to the ROI when the pixel at the same location in the preceding original input picture 132A belongs to the ROI of that picture, or the pixel at the same location in the following original input picture 132B belongs to the ROI of that picture, or both. In other words, the ROI region 138B of the interpolated picture 134 includes both the ROI region 138A and the ROI region 138C of the preceding and following original input pictures 132A and 132B, respectively.
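The membership rule for the interpolated picture amounts to a per-pixel logical OR of the two ROI masks. A minimal sketch, assuming each mask is a list of rows of booleans with identical dimensions (the function name is hypothetical):

```python
from typing import List

Mask = List[List[bool]]  # mask[y][x] is True when the pixel belongs to the ROI


def interpolated_roi_mask(mask_prev: Mask, mask_next: Mask) -> Mask:
    """ROI of the interpolated picture: the union of the preceding and following ROI masks."""
    return [[a or b for a, b in zip(row_prev, row_next)]
            for row_prev, row_next in zip(mask_prev, mask_next)]
```

A pixel is therefore treated as RONI in the interpolated picture only if it is RONI in both neighbouring original pictures.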

As for the RONI region 140, in the interpolated picture 134 the pixels belonging to the RONI region 140 are simply copied from the preceding original input picture 132A, whereas the pixels of the ROI are interpolated with motion compensation.

This is further illustrated with reference to FIG. 14, where T denotes the frame period of the sequence and n denotes the integer frame index. The parameter α (0 < α < 1) gives the relative timing of the interpolated image 134A between the two original input images 132A and 132B (in this case, α = 1/2 may be used, for example).

In FIG. 14, for the interpolated picture 134A (and similarly for the interpolated picture 134B), the pixel blocks labeled "p" and "q", for example, lie in the RONI region 140, and the pixels of these blocks are copied from the same location in the preceding original picture. For the interpolated picture 134A, the pixel values in the ROI region 138 are computed as a motion-compensated average of one or more following and preceding original input pictures (132A and 132B). FIG. 14 shows a two-frame interpolation. f(a, b, α) represents the motion-compensated interpolation result. Different methods of motion-compensated interpolation may be used. FIG. 14 thus shows a frame rate conversion technique in which the pixels in the ROI region 138 are obtained by motion-compensated interpolation and the pixels in the RONI region 140 are obtained by frame repetition.

Furthermore, when the background of the image or picture is stationary, the transition boundaries between the ROI and RONI regions are not visible in the resulting output image, because the background pixels within the ROI region are interpolated with zero motion vectors. However, when the background moves, as is often the case with digital cameras (e.g., unsteady hand movements), the boundaries between the ROI and RONI regions become visible, because the background pixels within the ROI region are computed with motion compensation whereas the background pixels in the RONI region are copied from the preceding input frame.

Referring now to FIG. 15, when the background is not stationary, an optimization technique can be implemented to improve the image quality in the boundary areas between the ROI and RONI regions, as shown in diagrams 150A and 150B.

In particular, FIG. 15 shows the implementation of the motion vector field estimated at t = (n - α)T with ROI/RONI segmentation. Diagram 150A shows the original situation, in which there is movement in the background of the RONI region 140. The two-dimensional motion vectors in the RONI region 140 are denoted by lowercase letters (a, b, c, d, e, f, g, h, k, l), and the motion vectors in the ROI region 138 are denoted by uppercase letters (A, B, C, D, E, F, G, H). Diagram 150B shows the optimized situation, in which the ROI 138 is extended with linearly interpolated motion vectors in order to mitigate the visibility of the ROI/RONI boundary 152B once the background starts to move.
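A linear ramp of motion vectors across the extension blocks might look like the sketch below. The text does not specify which two vectors the interpolation runs between, so both end points are left as parameters; the function name, the `steps` parameter, and the even spacing are assumptions of this example.

```python
from typing import List, Tuple

Vector = Tuple[float, float]


def ramp_motion_vectors(inner: Vector, outer: Vector, steps: int) -> List[Vector]:
    """Linearly interpolate `steps` vectors strictly between `inner` and `outer`.

    `inner` is the vector at the ROI edge and `outer` the vector assumed
    on the RONI side; the caller chooses both.
    """
    vectors = []
    for i in range(1, steps + 1):
        w = i / (steps + 1)               # weight grows towards the RONI side
        vectors.append(((1 - w) * inner[0] + w * outer[0],
                        (1 - w) * inner[1] + w * outer[1]))
    return vectors
```

For example, ramping from (4, 0) to (0, 0) over three extension blocks gives the vectors (3, 0), (2, 0), and (1, 0), so the motion changes gradually instead of jumping at the boundary.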

As shown in FIG. 15, the perceptible visibility of the boundary region 152B can be mitigated by extending the ROI region 138 on the block grid (diagram 150B), making a gradual motion vector transition, and applying motion-compensated interpolation to the pixels in the extension area as well. To further soften the transition when there is motion in the background, a blurring filter (for example, [1 2 1]/4) can be applied both horizontally and vertically to the pixels in the ROI extension area 154.
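The [1 2 1]/4 blurring filter can be applied as two one-dimensional passes, first along rows and then along columns, restricted to the pixels of the extension area. This is a minimal sketch: the function name, the boolean `mask` marking the extension pixels, and the edge handling (repeating the border pixel) are assumptions of this example.

```python
from typing import List

Frame = List[List[float]]


def blur_121(frame: Frame, mask: List[List[bool]]) -> Frame:
    """Apply the [1 2 1]/4 kernel horizontally, then vertically, where mask is True."""
    h, w = len(frame), len(frame[0])
    # Horizontal pass: read from the input, write into a copy.
    tmp = [row[:] for row in frame]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                left = frame[y][max(x - 1, 0)]
                right = frame[y][min(x + 1, w - 1)]
                tmp[y][x] = (left + 2 * frame[y][x] + right) / 4
    # Vertical pass: read from the horizontal result, write into a second copy.
    out = [row[:] for row in tmp]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                up = tmp[max(y - 1, 0)][x]
                down = tmp[min(y + 1, h - 1)][x]
                out[y][x] = (up + 2 * tmp[y][x] + down) / 4
    return out
```

Pixels outside the mask are left unchanged, so the filter softens only the transition zone and not the ROI interior or the copied RONI background.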

While there has been illustrated and described what are presently considered to be the preferred embodiments of the present invention, those skilled in the art will understand that various other modifications may be made, and equivalents may be substituted, without departing from the true scope of the invention.

In particular, although the foregoing description relates mostly to video conferencing, the described image quality improvement method can be applied to any type of video application, such as those implemented on mobile telephony devices and platforms, home office platforms such as a PC, and the like.

Additionally, many advanced video processing modifications may be made to adapt a particular situation to the teachings of the present invention without departing from the central inventive concept described herein. Furthermore, an embodiment of the present invention may not include all of the features described above. Therefore, it is intended that the present invention not be limited to the particular embodiments disclosed, but that it include all embodiments falling within the scope of the appended claims and their equivalents.

Claims

A method of processing video images, the method comprising:

detecting (44) at least one person in an image of a video application;

estimating (46) the motion associated with the at least one detected person in the image;

segmenting (50) the image into at least one region of interest and at least one region of no interest, wherein the at least one region of interest includes the at least one detected person in the image; and

applying temporal frame processing to a video signal including the image, using a higher frame rate in the at least one region of interest than that applied to the at least one region of no interest.

The method of claim 1, wherein the temporal frame processing comprises temporal frame up-conversion processing (56) applied to the at least one region of interest.

The method of claim 1 or 2, wherein the temporal frame processing comprises temporal frame down-conversion processing (58) applied to the at least one region of no interest.

The method of claim 3, further comprising combining (60) output information from the temporal frame up-conversion processing step with output information from the temporal frame down-conversion processing step to generate (62) an enhanced output image.

The method of any one of claims 1 to 4, wherein the visual image quality improvement steps are performed at either the transmitting end or the receiving end of the video signal associated with the image.

The method of any one of claims 1 to 5, wherein the step of detecting the at least one person identified in the image of the video application comprises detecting (71) lip activity in the image.

The method of any one of claims 1 to 6, wherein the step of detecting the at least one person identified in the image of the video application comprises detecting (82) audio speech activity in the image.

The method of claim 6 or 7, wherein the step of applying temporal frame up-conversion processing to the region of interest is carried out only when lip activity and/or audio speech activity has been detected.

The method of any one of claims 1 to 8, wherein the method further comprises:

segmenting (50) the image into at least a first region of interest and a second region of interest;

selecting (102) the first region of interest for applying the temporal frame up-conversion processing by increasing the frame rate; and

leaving the frame rate of the second region of interest untouched.

The method of any one of claims 1 to 9, wherein the step of applying temporal frame up-conversion processing to the region of interest comprises increasing the frame rate of the pixels associated with the region of interest.

The method of any one of claims 1 to 10, further comprising extending the region of interest on a block grid (150B) of the image, and carrying out a gradual motion vector transition by applying motion compensated interpolation to the pixels of the extended region of interest (154).

The method of claim 11, further comprising mitigating the boundary area (152) by applying a blurring filter vertically and horizontally to the pixels of the extended region of interest (154).

A device configured to process video images, the device comprising:

a detection module (14) configured to detect at least one person in an image of a video application;

a motion estimation module (16) configured to estimate the motion associated with the at least one detected person in the image;

a segmentation module (22) configured to segment the image into at least one region of interest and at least one region of no interest, wherein the at least one region of interest includes the at least one detected person in the image; and

at least one processing module configured to apply temporal frame processing to a video signal including the image, using a higher frame rate in the at least one region of interest than that applied to the at least one region of no interest.

The device of claim 13, wherein the processing module comprises a region of interest up-convert module (26) configured to apply temporal frame up-conversion processing to the at least one region of interest.

The device of claim 13 or 14, wherein the processing module comprises a region of no interest down-convert module (30) configured to apply temporal frame down-conversion processing to the at least one region of no interest.

The device of claim 15, further comprising a combining module (32) configured to combine output information derived from the region of interest up-convert module with output information derived from the region of no interest down-convert module.

The device of any one of claims 1 to 16, further comprising a lip activity detection module (15).

The device of any one of claims 1 to 17, further comprising an audio speech activity module (13).

The device of any one of claims 1 to 18, further comprising a region of interest selection module (23) configured to select a first region of interest for temporal frame up-conversion.

A computer-readable medium associated with the device of any one of claims 13 to 19 and having a sequence of instructions stored thereon which, when executed by a microprocessor of the device, causes the processor to:

detect (44) at least one person in an image of a video application;

estimate (46) the motion associated with the at least one detected person in the image;

segment (50) the image into at least one region of interest and at least one region of no interest, wherein the at least one region of interest includes the at least one detected person in the image; and

apply temporal frame processing to a video signal including the image, using a higher frame rate in the at least one region of interest than that applied to the at least one region of no interest.