KR20240087477A - Electronic device for adjusting volume for each narrator, operating method thereof, and storage medium

Info

Publication number: KR20240087477A
Application number: KR1020220180935A
Authority: KR
Inventors: 신재영
Original assignee: 삼성전자주식회사
Priority date: 2022-12-12
Filing date: 2022-12-21
Publication date: 2024-06-19

Abstract

According to one embodiment, the electronic device 101 may include a display 360 and at least one processor 320. The at least one processor may be configured to identify a plurality of voice signals included in content. The at least one processor may be configured to identify a plurality of speakers in a video included in the content. The at least one processor may be configured to match the voice signals to the speakers, respectively. The at least one processor may be configured to receive a first user input on a volume control object associated with a first speaker among the speakers in the video displayed on the display. The at least one processor may be configured to, in response to the first user input, adjust the volume for the first speaker and maintain the volume for a second speaker among the plurality of speakers.

Description

Electronic device for adjusting volume for each speaker, operating method thereof, and storage medium {ELECTRONIC DEVICE FOR ADJUSTING VOLUME FOR EACH NARRATOR, OPERATING METHOD THEREOF, AND STORAGE MEDIUM}

An embodiment disclosed in this document relates to an electronic device for adjusting volume for each speaker, an operating method thereof, and a storage medium.

As the services and additional functions provided through electronic devices such as smartphones continue to increase, various applications that can run on electronic devices are being developed. The hardware and/or software of electronic devices is also continuously advancing.

For example, there are camera applications that capture images (e.g., videos) using a camera mounted on an electronic device, and the number of users who capture and view images with such applications is increasing. When a user shoots a video (e.g., video recording or audio recording) with the camera of an electronic device in various environments, sound (or audio) is recorded along with the image, but the state of the recording cannot be checked until the captured video is played back. Therefore, providing a user interface that can control the volume level while improving the sound quality of the video would increase user convenience.

According to one embodiment, the electronic device 101 may include a display 360 and at least one processor 320. The at least one processor may be configured to identify a plurality of voice signals included in content. The at least one processor may be configured to identify a plurality of speakers in a video included in the content. The at least one processor may be configured to match the voice signals to the speakers, respectively. The at least one processor may be configured to receive a first user input on a volume control object associated with a first speaker among the speakers in the video displayed on the display. The at least one processor may be configured to, in response to the first user input, adjust the volume for the first speaker and maintain the volume for a second speaker among the plurality of speakers.

According to one embodiment, a method for adjusting volume for each speaker in an electronic device may include an operation of identifying a plurality of voice signals included in content. The method may include an operation of identifying a plurality of speakers in a video included in the content. The method may include an operation of matching the voice signals to the speakers, respectively. The method may include an operation of receiving a first user input on a volume control object associated with a first speaker among the speakers in the video displayed on a display of the electronic device. The method may include an operation of, in response to the first user input, adjusting the volume for the first speaker and maintaining the volume for a second speaker among the plurality of speakers.
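The operations of the method can be illustrated with a minimal Python sketch. It assumes the voice signals have already been separated into one mono track per speaker and that the speakers have already been detected in the video; the names `SpeakerTrack`, `match_voices_to_speakers`, `set_speaker_volume`, and `mix` are hypothetical and do not come from this document. Matching is done by list position purely for illustration, since the passage above does not specify how the matching is performed.

```python
from dataclasses import dataclass


@dataclass
class SpeakerTrack:
    """One separated voice signal matched to a speaker seen in the video."""
    speaker_id: str
    samples: list      # mono samples as floats in [-1.0, 1.0]
    gain: float = 1.0  # linear gain; 1.0 leaves the volume unchanged


def match_voices_to_speakers(voice_signals, speaker_ids):
    """Pair each voice signal with a speaker (assumption: same order in both lists)."""
    if len(voice_signals) != len(speaker_ids):
        raise ValueError("expected exactly one voice signal per speaker")
    return {
        sid: SpeakerTrack(sid, list(signal))
        for sid, signal in zip(speaker_ids, voice_signals)
    }


def set_speaker_volume(tracks, speaker_id, gain):
    """Apply a user's volume input to one speaker; all other gains stay as they are."""
    if gain < 0:
        raise ValueError("gain must be non-negative")
    tracks[speaker_id].gain = gain


def mix(tracks):
    """Sum the gain-scaled tracks sample by sample and clip to [-1.0, 1.0]."""
    length = max((len(t.samples) for t in tracks.values()), default=0)
    mixed = [0.0] * length
    for track in tracks.values():
        for i, sample in enumerate(track.samples):
            mixed[i] += sample * track.gain
    return [max(-1.0, min(1.0, value)) for value in mixed]


# Two speakers; the user halves speaker A's volume and leaves speaker B alone.
tracks = match_voices_to_speakers([[0.5, 0.5], [0.25, 0.25]], ["A", "B"])
set_speaker_volume(tracks, "A", 0.5)
output = mix(tracks)
```

Keeping a separate gain per matched track is what lets the first speaker's volume change while the second speaker's volume is maintained; setting a gain to 0.0 would correspond to muting that speaker.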

According to one embodiment, in a non-volatile storage medium storing instructions, the instructions are configured to, when executed by at least one processor 320 of the electronic device 101, cause the electronic device to perform at least one operation. The at least one operation may include: an operation of identifying a plurality of voice signals included in content; an operation of identifying a plurality of speakers in a video included in the content; an operation of matching the voice signals to the speakers, respectively; an operation of receiving a first user input on a volume control object associated with a first speaker among the speakers in the video displayed on a display of the electronic device; and an operation of, in response to the first user input, adjusting the volume for the first speaker and maintaining the volume for a second speaker among the plurality of speakers.

FIG. 1 is a block diagram of an electronic device in a network environment according to an embodiment.
FIG. 2 is a diagram illustrating a method of separating voice signals within content according to an embodiment.
FIG. 3 is an internal block diagram of an electronic device according to an embodiment.
FIG. 4 is a flowchart of the operation of an electronic device for adjusting volume for each speaker according to an embodiment.
FIG. 5 is a flowchart illustrating a method of editing volume for each speaker in an electronic device according to an embodiment.
FIG. 6 is an example diagram of a method of editing volume according to speaker selection according to an embodiment.
FIG. 7 is a diagram illustrating an example of displaying voice information for each of a selected speaker and an unselected speaker according to an embodiment.
FIG. 8A is an example diagram before volume adjustment for a selected speaker according to an embodiment.
FIG. 8B is an example diagram of muting a selected speaker according to an embodiment.
FIG. 8C is an example diagram before volume adjustment for another selected speaker according to an embodiment.
FIG. 9 is an example diagram illustrating a method of setting the duration of an adjusted volume within the voice section of a selected speaker according to an embodiment.
FIG. 10 is a diagram illustrating a method of setting the duration of a volume according to an embodiment.
FIG. 11 is an example diagram illustrating a method of adjusting the volume by separating voice sections of the same speaker according to an embodiment.
FIG. 12 is a diagram illustrating a method of setting the volume according to separation of voice sections of the same speaker according to an embodiment.
FIG. 13 is an example diagram illustrating a method of automatically adjusting a volume at or above a threshold according to an embodiment.
FIG. 14 is an example diagram illustrating a method of automatically adjusting a volume corresponding to a loud noise according to an embodiment.
FIG. 15 is a diagram illustrating an example of speaker tracking and voice information display according to speaker recognition according to an embodiment.
FIG. 16 is a diagram illustrating an example of displaying voice information according to selection of a speaker to be edited according to an embodiment.
FIG. 17 is an example diagram illustrating a method of adjusting the volume for a selected speaker according to an embodiment.
FIG. 18 is an operation flowchart for volume adjustment using a voice section indicator for a selected speaker according to an embodiment.
FIG. 19 is an example diagram illustrating a method of adjusting the volume using a voice section indicator for a selected speaker according to an embodiment.
FIG. 20 is an example diagram illustrating a method of adjusting the volume using a video frame indicator according to an embodiment.
FIG. 21 is an example diagram illustrating the display of voice information according to sequential selection of speakers according to an embodiment.
In relation to the description of the drawings, identical or similar reference numerals may be used for identical or similar components.

FIG. 1 is a block diagram of an electronic device 101 in a network environment 100, according to one embodiment. Referring to FIG. 1, in the network environment 100, the electronic device 101 may communicate with the electronic device 102 through a first network 198 (e.g., a short-range wireless communication network), or may communicate with at least one of the electronic device 104 or the server 108 through a second network 199 (e.g., a long-range wireless communication network). According to one embodiment, the electronic device 101 may communicate with the electronic device 104 through the server 108. According to one embodiment, the electronic device 101 may include a processor 120, a memory 130, an input module 150, a sound output module 155, a display module 160, an audio module 170, a sensor module 176, an interface 177, a connection terminal 178, a haptic module 179, a camera module 180, a power management module 188, a battery 189, a communication module 190, a subscriber identification module 196, or an antenna module 197. In some embodiments, at least one of these components (e.g., the connection terminal 178) may be omitted from the electronic device 101, or one or more other components may be added. In some embodiments, some of these components (e.g., the sensor module 176, the camera module 180, or the antenna module 197) may be integrated into one component (e.g., the display module 160).

The processor 120 may, for example, execute software (e.g., the program 140) to control at least one other component (e.g., a hardware or software component) of the electronic device 101 connected to the processor 120, and may perform various data processing or computation. According to one embodiment, as at least part of the data processing or computation, the processor 120 may store commands or data received from another component (e.g., the sensor module 176 or the communication module 190) in the volatile memory 132, process the commands or data stored in the volatile memory 132, and store the resulting data in the non-volatile memory 134. According to one embodiment, the processor 120 may include a main processor 121 (e.g., a central processing unit or an application processor) or an auxiliary processor 123 (e.g., a graphics processing unit, a neural processing unit (NPU), an image signal processor, a sensor hub processor, or a communication processor) that can operate independently of, or together with, the main processor 121. For example, when the electronic device 101 includes the main processor 121 and the auxiliary processor 123, the auxiliary processor 123 may be configured to use less power than the main processor 121 or to be specialized for a designated function. The auxiliary processor 123 may be implemented separately from the main processor 121 or as part of it.

The auxiliary processor 123 may, for example, control at least some of the functions or states related to at least one of the components of the electronic device 101 (e.g., the display module 160, the sensor module 176, or the communication module 190), either on behalf of the main processor 121 while the main processor 121 is in an inactive (e.g., sleep) state, or together with the main processor 121 while the main processor 121 is in an active (e.g., application execution) state. According to one embodiment, the auxiliary processor 123 (e.g., an image signal processor or a communication processor) may be implemented as part of another functionally related component (e.g., the camera module 180 or the communication module 190). According to one embodiment, the auxiliary processor 123 (e.g., a neural processing unit) may include a hardware structure specialized for processing an artificial intelligence model. The artificial intelligence model may be created through machine learning. Such learning may be performed, for example, in the electronic device 101 itself, on which the artificial intelligence model runs, or through a separate server (e.g., the server 108). Learning algorithms may include, for example, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning, but are not limited to these examples. The artificial intelligence model may include a plurality of artificial neural network layers. The artificial neural network may be one of a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), deep Q-networks, or a combination of two or more of the above, but is not limited to these examples. In addition to the hardware structure, the artificial intelligence model may additionally or alternatively include a software structure.

The memory 130 may store various data used by at least one component (e.g., the processor 120 or the sensor module 176) of the electronic device 101. The data may include, for example, software (e.g., the program 140) and input data or output data for commands related thereto. The memory 130 may include the volatile memory 132 or the non-volatile memory 134.

The program 140 may be stored as software in the memory 130 and may include, for example, an operating system 142, middleware 144, or an application 146.

The input module 150 may receive, from outside the electronic device 101 (e.g., from a user), commands or data to be used by a component of the electronic device 101 (e.g., the processor 120). The input module 150 may include, for example, a microphone, a mouse, a keyboard, a key (e.g., a button), or a digital pen (e.g., a stylus pen).

The sound output module 155 may output sound signals to the outside of the electronic device 101. The sound output module 155 may include, for example, a speaker or a receiver. The speaker may be used for general purposes such as multimedia playback or recording playback. The receiver may be used to receive incoming calls. According to one embodiment, the receiver may be implemented separately from the speaker or as part of it.

The display module 160 may visually provide information to the outside of the electronic device 101 (e.g., to a user). The display module 160 may include, for example, a display, a hologram device, or a projector, and a control circuit for controlling the corresponding device. According to one embodiment, the display module 160 may include a touch sensor configured to detect a touch, or a pressure sensor configured to measure the intensity of the force generated by the touch.

The audio module 170 may convert sound into an electrical signal or, conversely, convert an electrical signal into sound. According to one embodiment, the audio module 170 may acquire sound through the input module 150, or may output sound through the sound output module 155 or through an external electronic device (e.g., the electronic device 102, such as a speaker or headphones) connected directly or wirelessly to the electronic device 101.

The sensor module 176 may detect an operating state (e.g., power or temperature) of the electronic device 101 or an external environmental state (e.g., a user state), and may generate an electrical signal or data value corresponding to the detected state. According to one embodiment, the sensor module 176 may include, for example, a gesture sensor, a gyro sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a grip sensor, a proximity sensor, a color sensor, an infrared (IR) sensor, a biometric sensor, a temperature sensor, a humidity sensor, or an illuminance sensor.

The interface 177 may support one or more designated protocols that can be used to connect the electronic device 101 directly or wirelessly to an external electronic device (e.g., the electronic device 102). According to one embodiment, the interface 177 may include, for example, a high definition multimedia interface (HDMI), a universal serial bus (USB) interface, an SD card interface, or an audio interface.

The connection terminal 178 may include a connector through which the electronic device 101 can be physically connected to an external electronic device (e.g., the electronic device 102). According to one embodiment, the connection terminal 178 may include, for example, an HDMI connector, a USB connector, an SD card connector, or an audio connector (e.g., a headphone connector).

The haptic module 179 may convert an electrical signal into a mechanical stimulus (e.g., vibration or movement) or an electrical stimulus that the user can perceive through the sense of touch or kinesthesia. According to one embodiment, the haptic module 179 may include, for example, a motor, a piezoelectric element, or an electrical stimulation device.

The camera module 180 may capture still images and videos. According to one embodiment, the camera module 180 may include one or more lenses, image sensors, image signal processors, or flashes.

The power management module 188 may manage the power supplied to the electronic device 101. According to one embodiment, the power management module 188 may be implemented, for example, as at least part of a power management integrated circuit (PMIC).

The battery 189 may supply power to at least one component of the electronic device 101. According to one embodiment, the battery 189 may include, for example, a non-rechargeable primary cell, a rechargeable secondary cell, or a fuel cell.

The communication module 190 may support establishing a direct (e.g., wired) communication channel or a wireless communication channel between the electronic device 101 and an external electronic device (e.g., the electronic device 102, the electronic device 104, or the server 108), and performing communication through the established channel. The communication module 190 may include one or more communication processors that operate independently of the processor 120 (e.g., an application processor) and support direct (e.g., wired) communication or wireless communication. According to one embodiment, the communication module 190 may include a wireless communication module 192 (e.g., a cellular communication module, a short-range wireless communication module, or a global navigation satellite system (GNSS) communication module) or a wired communication module 194 (e.g., a local area network (LAN) communication module or a power line communication module). A corresponding one of these communication modules may communicate with the external electronic device 104 through the first network 198 (e.g., a short-range communication network such as Bluetooth, wireless fidelity (WiFi) direct, or infrared data association (IrDA)) or the second network 199 (e.g., a long-range communication network such as a legacy cellular network, a 5G network, a next-generation communication network, the Internet, or a computer network (e.g., a LAN or WAN)). These various types of communication modules may be integrated into one component (e.g., a single chip) or implemented as a plurality of separate components (e.g., multiple chips). The wireless communication module 192 may identify or authenticate the electronic device 101 within a communication network such as the first network 198 or the second network 199 using subscriber information (e.g., an international mobile subscriber identity (IMSI)) stored in the subscriber identification module 196.

The wireless communication module 192 may support a 5G network, which follows the 4G network, and next-generation communication technology, for example, new radio (NR) access technology. The NR access technology may support high-speed transmission of high-capacity data (enhanced mobile broadband (eMBB)), minimization of terminal power and access by a large number of terminals (massive machine type communications (mMTC)), or high reliability and low latency (ultra-reliable and low-latency communications (URLLC)). The wireless communication module 192 may support a high frequency band (e.g., the mmWave band), for example, to achieve a high data rate. The wireless communication module 192 may support various technologies for securing performance in a high frequency band, for example, beamforming, massive multiple-input and multiple-output (massive MIMO), full dimensional MIMO (FD-MIMO), an array antenna, analog beamforming, or a large scale antenna. The wireless communication module 192 may support various requirements specified for the electronic device 101, an external electronic device (e.g., the electronic device 104), or a network system (e.g., the second network 199). According to one embodiment, the wireless communication module 192 may support a peak data rate for realizing eMBB (e.g., 20 Gbps or more), loss coverage for realizing mMTC (e.g., 164 dB or less), or U-plane latency for realizing URLLC (e.g., 0.5 ms or less for each of downlink (DL) and uplink (UL), or 1 ms or less round trip).

The antenna module 197 may transmit a signal or power to the outside (e.g., an external electronic device) or receive a signal or power from the outside. According to one embodiment, the antenna module 197 may include an antenna including a radiator made of a conductor or a conductive pattern formed on a substrate (e.g., a PCB). According to one embodiment, the antenna module 197 may include a plurality of antennas (e.g., an array antenna). In this case, at least one antenna suitable for the communication method used in a communication network, such as the first network 198 or the second network 199, may be selected from the plurality of antennas by, for example, the communication module 190. A signal or power may be transmitted or received between the communication module 190 and an external electronic device through the selected at least one antenna. According to some embodiments, a component other than the radiator (e.g., a radio frequency integrated circuit (RFIC)) may additionally be formed as part of the antenna module 197.

According to various embodiments, the antenna module 197 may form an mmWave antenna module. According to one embodiment, the mmWave antenna module may include a printed circuit board; an RFIC disposed on or adjacent to a first surface (e.g., the bottom surface) of the printed circuit board and capable of supporting a designated high-frequency band (e.g., the mmWave band); and a plurality of antennas (e.g., an array antenna) disposed on or adjacent to a second surface (e.g., the top surface or a side surface) of the printed circuit board and capable of transmitting or receiving signals in the designated high-frequency band.

At least some of the above components may be connected to each other through an inter-peripheral communication method (e.g., a bus, general purpose input and output (GPIO), serial peripheral interface (SPI), or mobile industry processor interface (MIPI)) and may exchange signals (e.g., commands or data) with each other.

According to one embodiment, commands or data may be transmitted or received between the electronic device 101 and the external electronic device 104 through the server 108 connected to the second network 199. Each of the external electronic devices 102 and 104 may be a device of the same type as, or a different type from, the electronic device 101. According to one embodiment, all or some of the operations executed in the electronic device 101 may be executed in one or more of the external electronic devices 102, 104, or 108. For example, when the electronic device 101 needs to perform a function or service automatically, or in response to a request from a user or another device, the electronic device 101 may, instead of or in addition to executing the function or service itself, request one or more external electronic devices to perform at least part of the function or service. The one or more external electronic devices that receive the request may execute at least part of the requested function or service, or an additional function or service related to the request, and deliver the result of the execution to the electronic device 101. The electronic device 101 may provide the result, as is or after additional processing, as at least part of a response to the request. For this purpose, for example, cloud computing, distributed computing, mobile edge computing (MEC), or client-server computing technology may be used. The electronic device 101 may provide an ultra-low-latency service using, for example, distributed computing or mobile edge computing. In another embodiment, the external electronic device 104 may include an Internet of Things (IoT) device. The server 108 may be an intelligent server using machine learning and/or a neural network. According to one embodiment, the external electronic device 104 or the server 108 may be included in the second network 199. The electronic device 101 may be applied to intelligent services (e.g., smart home, smart city, smart car, or healthcare) based on 5G communication technology and IoT-related technology.

In the following detailed description, for configurations that can be readily understood from the preceding embodiments, the same reference numerals may be assigned or omitted in the drawings, and detailed descriptions thereof may also be omitted. The electronic device 101 according to an embodiment disclosed in this document may be implemented by selectively combining configurations of different embodiments, and a configuration of one embodiment may be replaced by a configuration of another embodiment. For example, note that the present invention is not limited to any specific drawing or embodiment.

Before describing the present invention, a method of separating voice signals in content, as used in the present invention, will be described with reference to FIG. 2. FIG. 2 is a diagram illustrating a method of separating voice signals in content according to an embodiment.

The types of content provided by the electronic device 101 are becoming more diverse. The content may be, for example, video content captured by a user, or video content such as a drama or movie downloaded or streamed from at least one server (e.g., a content provider). In the following description, content includes visual video and auditory sound, and may be, for example, content including video in which a person appears together with that person's voice. Here, video may refer to a moving picture, and voice may also be described as audio or sound.

Referring to FIG. 2, when the electronic device 101 plays content, voice signals 205 containing the voices of a plurality of speakers included in the content may be output through a loudspeaker. The voice signals 205 may be separated from one another based on a sound separation technique 210. The electronic device 101 may identify a speaker by extracting voice features from each of the separated voice signals and match each signal to the identified speaker (220).

Specifically, the electronic device 101 may identify (or separate) each of a plurality of voice signals in the content, identify speakers through face recognition in the video of the content, and match the identified voice signals to the speakers identified through face recognition, respectively. In this way, the electronic device 101 may separate the voice signals in the content and, for example, match each separated voice signal to a speaker in the video through learning such as deep learning. According to one embodiment, by separating the voice signal of each speaker, the electronic device 101 may perform an operation of adjusting the volume for each speaker in the video (230).
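As an illustration of why separation enables per-speaker volume control (230), the following is a minimal sketch, not the implementation described in this document. It assumes the separation step has already produced one sample list ("stem") per speaker; the function name `mix_with_speaker_gains`, the speaker keys, and the gain convention (1.0 means unchanged) are all hypothetical.

```python
from typing import Dict, List


def mix_with_speaker_gains(
    stems: Dict[str, List[float]], gains: Dict[str, float]
) -> List[float]:
    """Re-mix separated per-speaker stems, scaling each by its own gain.

    Speakers without an entry in `gains` keep gain 1.0, i.e. their
    volume is maintained while another speaker's volume is adjusted.
    """
    length = max((len(samples) for samples in stems.values()), default=0)
    mixed = [0.0] * length
    for speaker, samples in stems.items():
        gain = gains.get(speaker, 1.0)  # assumed convention: 1.0 = unchanged
        for i, sample in enumerate(samples):
            mixed[i] += gain * sample
    return mixed


# Hypothetical stems for two matched speakers.
stems = {
    "speaker_1": [0.5, 0.5, 0.5],
    "speaker_2": [0.25, 0.25, 0.25],
}
# Halve speaker_1 only; speaker_2 is left untouched.
print(mix_with_speaker_gains(stems, {"speaker_1": 0.5}))
```

Only the gain of the selected speaker changes, so the other speakers' contributions to the mix are unaffected.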

An electronic device according to an embodiment controls the volume of a plurality of speakers in video content and, more specifically, may provide a user interface through which the volume level of each speaker in the video can be controlled individually. Because the electronic device provides a user interface for controlling the volume of each speaker, the user can easily adjust the volume of a desired speaker in the video to a desired level for a desired section, which may increase user convenience.

FIG. 3 is an internal block diagram of an electronic device according to an embodiment.

Referring to FIG. 3, the electronic device 101 may include a processor 320, a memory 330, a display 360, and/or a speaker 355. The electronic device 101 of FIG. 3 may be the electronic device 101 of FIG. 1. Not all of the components shown in FIG. 3 are essential components of the electronic device 101, and the electronic device 101 may be implemented with more or fewer components than those shown in FIG. 3.

Referring to FIG. 3, the memory 330 may store a control program for controlling the electronic device 101, a UI related to an application provided by the manufacturer or downloaded from outside, images for providing the UI, user information, documents, databases, or related data. For example, the memory 330 may store content including video and voice. The content may be content captured through execution of a camera application (e.g., a program or function), or content such as a drama or movie downloaded or streamed from at least one server (e.g., a content provider). In addition, so that the volume of a speaker in captured video acquired in real time through the camera application can be edited, the memory 330 may at least temporarily store at least part of the acquired video for subsequent video processing. For example, at least part of the temporarily stored video may be used to control the volume of each of a plurality of speakers, as in a video conference. Accordingly, while video conference video is displayed, the speakers in the video may be identified, and the voice signal of each speaker may be matched to that speaker and displayed through a visualization process. The electronic device can therefore perform volume control for each speaker not only in stored content but also in video captured in real time.

When a user request for editing the volume of each speaker in the video is received, the display 360 may display a user interface related to volume editing under the control of the processor 320. For example, when the user selects an item (or menu) for content editing (e.g., per-speaker volume editing) displayed on the content playback screen, or when a separate app for content editing is executed, the display 360 may display the user interface related to the content editing.

According to one embodiment, during content editing, the display 360 may display a user interface including objects representing volume control for each of the speakers in the video. An object is a graphic element related to volume control and may include, for example, an indicator.

The speaker 355 may receive an electrical signal from the processor 320, generate sound, and output the sound to the outside. The speaker 355 may output a sound signal resulting from content playback to the outside of the electronic device 101.

According to one embodiment, the processor 320 may play content including a subject (e.g., a speaker) and may support editing the volume of the content for each speaker. Content editing may be provided to the user in a web-based or app-based manner. In the web-based manner, the user uses the electronic device 101 to visit a web page provided by a server that supports content editing and is provided with the editing function there. In the app-based manner, the function for content editing is provided through an application installed and run on the electronic device 101.

According to one embodiment, the processor 320 may extract video frames from the content and display the extracted video frames in preview form along a timeline. For example, during content editing, the processor 320 may display the video being played on one portion of the display 360 and display the video frames of that video in preview form on the remaining portion. According to one embodiment, the processor 320 may provide a function that allows the user to select a desired time point or section among the video frames displayed along the timeline. For example, while the video frames are displayed along the timeline, the user may select a desired time point on the timeline, and in response, the processor 320 may adjust the volume of a desired speaker from the selected time point.

In addition, the processor 320 may provide a user interface through which the volume of a speaker chosen by the user can be adjusted for the selected time point or section. For example, the processor 320 may display a region representing each of the speakers in the video so that the user can select a desired speaker among the plurality of speakers in the video displayed on the display 360. When any one of the regions is selected, the processor 320 may display an object for adjusting (or controlling) the volume of the selected speaker on at least part of the display 360. When the user selects the object, the processor 320 may apply the adjusted volume as the volume of the desired speaker at the selected time point or section. In addition, according to the selection of the volume adjustment object, the processor 320 may adjust the volume of the selected speaker among the speakers in the video while maintaining the volume of the remaining speakers. Likewise, when another speaker in the video is selected, the processor 320 may adjust the volume of that other speaker while maintaining the volume of the speakers that are not selected. Conversely, in response to an input that deselects a selected region, the processor 320 may reset the adjusted volume of the selected speaker to the original volume, for example, a default volume. As described above, according to one embodiment, the user can easily and conveniently adjust the volume of a desired speaker in the video. Examples of various forms of user interface for adjusting the volume of each speaker in the video will be described later.
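The select, adjust, and deselect behavior described above can be sketched as a small state model. This is an illustrative sketch only: the class `SpeakerVolumeEditor`, its method names, and the value chosen for `DEFAULT_VOLUME` are assumptions and do not appear in this document.

```python
from dataclasses import dataclass
from typing import Dict, Optional

DEFAULT_VOLUME = 1.0  # assumed default level; the document gives no value


@dataclass
class SpeakerVolumeEditor:
    """Tracks one volume per speaker plus the currently selected speaker."""

    volumes: Dict[str, float]
    selected: Optional[str] = None

    def select(self, speaker: str) -> None:
        """Select the region of one speaker (shows its volume control object)."""
        if speaker not in self.volumes:
            raise KeyError(f"unknown speaker: {speaker}")
        self.selected = speaker

    def adjust(self, volume: float) -> None:
        """Adjust only the selected speaker; all other volumes are maintained."""
        if self.selected is None:
            raise ValueError("no speaker selected")
        self.volumes[self.selected] = volume

    def deselect(self) -> None:
        """Deselecting resets the selected speaker to the default volume."""
        if self.selected is not None:
            self.volumes[self.selected] = DEFAULT_VOLUME
            self.selected = None


editor = SpeakerVolumeEditor({"speaker_1": 1.0, "speaker_2": 1.0})
editor.select("speaker_1")
editor.adjust(0.3)
print(editor.volumes)  # speaker_2 is unchanged
editor.deselect()
print(editor.volumes)  # speaker_1 is back at the default
```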

According to one embodiment, the at least one processor may be configured to, in response to a second user input on a volume control object related to the second speaker, maintain the volume of the first speaker and adjust the volume of the second speaker among the plurality of speakers.

According to one embodiment, the at least one processor may be configured to display a region representing each of the speakers in the video and, in response to an input selecting any one of the regions, display a volume control object for adjusting the volume of the speaker corresponding to that region on at least part of the display.

According to one embodiment, the at least one processor may be configured to, in response to an input deselecting any one of the regions representing the respective speakers in the video, reset the adjusted volume of the first speaker to a default volume.

According to one embodiment, the at least one processor may be configured to display frames of the video along a timeline on at least part of the display and adjust the volume of the first speaker from a time point selected on the timeline.

According to one embodiment, the at least one processor may be configured to display, on the timeline, a voice section indicator representing at least part of the voice section of the first speaker and, based on an input dragging the voice section indicator, set a volume adjustment section for the first speaker.

According to one embodiment, the at least one processor may be configured to identify, among the frames of the video displayed along the timeline, a video frame section in which the first speaker appears, and adjust the volume of the first speaker for that video frame section.

According to one embodiment, the at least one processor may be configured to, when the input dragging the voice section indicator on the timeline is released, identify whether the time point indicated by the voice section indicator falls outside the video frame section in which the first speaker appears and, if it does, set the volume adjustment section for the first speaker to extend up to the last frame of that video frame section.
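The release handling above amounts to clamping the end of the dragged section to the last frame in which the first speaker appears. The sketch below assumes frames are addressed by integer index; the function name `volume_adjust_section` and the clamping of the start point are assumptions added for completeness (the text only describes the end point).

```python
from typing import Tuple


def volume_adjust_section(
    drag_start: int, drag_release: int, speaker_first: int, speaker_last: int
) -> Tuple[int, int]:
    """Return the (start, end) frame range used as the volume adjustment section.

    speaker_first/speaker_last bound the video frame section in which the
    speaker appears (hypothetical parameter names).
    """
    if speaker_first > speaker_last:
        raise ValueError("invalid speaker frame section")
    # Assumption: the start is also kept inside the speaker's section.
    start = max(drag_start, speaker_first)
    # Per the text: a release point past the section ends at its last frame.
    end = min(drag_release, speaker_last)
    if start > end:
        raise ValueError("drag does not overlap the speaker's section")
    return (start, end)


# Speaker appears in frames 100..200; the drag is released at frame 260.
print(volume_adjust_section(120, 260, 100, 200))
```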

According to one embodiment, the at least one processor may be configured to, in response to an input splitting the video frame section in which the first speaker appears, adjust the volume of the first speaker for at least part of that video frame section and maintain the volume of the first speaker for the remainder of that video frame section.
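A minimal sketch of the split behavior follows. It assumes a section is represented as a `(first_frame, last_frame, volume)` tuple and that the part before the split point receives the new volume; both the representation and the name `split_and_adjust` are hypothetical.

```python
from typing import List, Tuple

Section = Tuple[int, int, float]  # (first_frame, last_frame, volume)


def split_and_adjust(
    section: Section, split_frame: int, new_volume: float
) -> List[Section]:
    """Split one speaker section in two and adjust only the first part.

    The second part starts at split_frame and keeps the original volume.
    """
    first, last, volume = section
    if not first < split_frame <= last:
        raise ValueError("split_frame must fall inside the section")
    return [
        (first, split_frame - 1, new_volume),  # adjusted part
        (split_frame, last, volume),           # maintained part
    ]


# Frames 100..200 at volume 1.0, split at frame 150, first half lowered.
print(split_and_adjust((100, 200, 1.0), 150, 0.4))
```

The same speaker can therefore carry different volumes in different sections, which corresponds to adjusting one speaker's voice per section.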

According to one embodiment, the at least one processor may be configured to, in response to selection of an automatic volume adjustment function, compare the volume of each of the speakers with a threshold and adjust any volume exceeding the threshold to a designated volume.
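The automatic adjustment can be sketched as a single pass over the per-speaker volumes. The function name `auto_adjust_volumes` and the numeric values are illustrative assumptions; the document does not specify the threshold or the designated volume.

```python
from typing import Dict


def auto_adjust_volumes(
    volumes: Dict[str, float], threshold: float, designated: float
) -> Dict[str, float]:
    """Set every volume that exceeds the threshold to the designated volume.

    Volumes at or below the threshold are returned unchanged.
    """
    return {
        speaker: designated if volume > threshold else volume
        for speaker, volume in volumes.items()
    }


# Hypothetical levels: only speaker_1 exceeds the threshold of 0.8.
print(auto_adjust_volumes({"speaker_1": 0.9, "speaker_2": 0.4}, 0.8, 0.6))
```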

FIG. 4 is an operation flowchart of an electronic device for adjusting volume for each speaker according to an embodiment. Referring to FIG. 4, the operating method may include operations 405 to 425. Each operation of the operating method of FIG. 4 may be performed by an electronic device (e.g., the electronic device 101 of FIGS. 1 and 3) or by at least one processor of the electronic device (e.g., the processor 120 of FIG. 1 or the processor 320 of FIG. 3). In one embodiment, at least one of operations 405 to 425 may be omitted, the order of some operations may be changed, or another operation may be added.

According to one embodiment, the content selected by the user may be content stored in the electronic device 101 or content located on an external server rather than in the electronic device 101. According to one embodiment, the electronic device 101 may enter a content editing mode in response to a content editing request. For example, when an input on an execution icon (e.g., an object, graphic element, menu, button, or shortcut image) (not shown) representing an application for content editing, or a selection input on an edit menu on the content playback screen, is received, the electronic device 101 may identify that an editing request has been made and may execute the content editing application or enter the content editing mode. Upon execution of the content editing application, or in the content editing mode, the electronic device 101 may provide a function for changing the speaker volume setting on the video playback screen (or preview screen).

In operation 405, the electronic device 101 may identify a plurality of voice signals included in the content. For example, as described with reference to FIG. 2, the electronic device 101 may extract voice features through a voice separation technique and accordingly separate the plurality of voice signals from the content.

In operation 410, the electronic device 101 may identify a plurality of speakers in the video included in the content. For example, the electronic device 101 may perform speaker recognition using deep learning.

In operation 415, the electronic device 101 may match the voice signals to the speakers, respectively. For example, the electronic device 101 may visually distinguish the voice matched to each face recognized in the video and visually display an item related to volume control for each speaker.

In operation 420, the electronic device 101 may receive a first user input on a volume control object related to a first speaker among the speakers in the video displayed on the display of the electronic device. In operation 425, in response to the first user input, the electronic device 101 may adjust the volume of the first speaker and maintain the volume of a second speaker among the plurality of speakers.

According to one embodiment, in response to a second user input on a volume control object related to the second speaker, the electronic device 101 may maintain the volume of the first speaker and adjust the volume of the second speaker among the plurality of speakers.

According to one embodiment, the electronic device 101 may display a region representing each of the speakers in the video and, in response to an input selecting any one of the regions, display a volume control object for adjusting the volume of the speaker corresponding to that region on at least part of the display.

According to one embodiment, in response to an input deselecting any one of the regions representing the respective speakers in the video, the electronic device 101 may reset the adjusted volume of the first speaker to a default volume.

According to one embodiment, the electronic device 101 may display frames of the video along a timeline on at least part of the display and adjust the volume of the first speaker from a time point selected on the timeline.

According to one embodiment, the electronic device 101 may display, on the timeline, a voice section indicator representing at least part of the voice section of the first speaker and, based on an input dragging the voice section indicator, set a volume adjustment section for the first speaker.

According to one embodiment, the electronic device 101 may identify, among the frames of the video displayed along the timeline, a video frame section in which the first speaker appears, and adjust the volume of the first speaker for that video frame section.

According to one embodiment, in response to an input splitting the video frame section in which the first speaker appears, the electronic device 101 may adjust the volume of the first speaker for at least part of that video frame section and maintain the volume of the first speaker for the remainder of that video frame section.

According to one embodiment, in response to selection of an automatic volume adjustment function, the electronic device 101 may compare the volume of each of the speakers with a threshold and adjust any volume exceeding the threshold to a designated volume.

FIG. 5 is a flowchart illustrating a method of editing the volume for each speaker in an electronic device according to one embodiment.

Referring to FIG. 5, in operation 505, the electronic device 101 may separate the voices in the video by speaker. In operation 510, the electronic device 101 may recognize persons (or faces) in the video. In operation 515, the electronic device 101 may match each separated voice with a person in the video. In operation 520, the electronic device 101 may adjust the volume for each speaker. According to one embodiment, the method of adjusting the volume for each speaker may include at least one of: a method of displaying voice information for each speaker (520A), a method of displaying volume information and duration for each speaker (520B), a method of adjusting the duration of the volume editing information adjusted per speaker and per section (520C), and a method of adjusting the voice information of the same speaker to a different volume for each section (520D). In operation 525, the electronic device 101 may identify whether editing has ended and, as long as editing has not ended, may return to operation 520 and perform the editing operations described above.
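
The core effect of operation 520, changing one speaker's volume while leaving the others untouched, can be sketched as a per-speaker gain applied before the separated tracks are summed again. This is only a minimal illustration: the function name `mix_with_speaker_gains`, the track layout, and the sample values are assumptions, and the separation itself (operation 505) is taken as already done.

```python
from typing import Dict, List


def mix_with_speaker_gains(
    tracks: Dict[str, List[float]],
    gains: Dict[str, float],
) -> List[float]:
    """Sum separated per-speaker tracks after scaling each by its own gain.

    Speakers missing from `gains` keep a gain of 1.0 (volume maintained).
    """
    length = max((len(samples) for samples in tracks.values()), default=0)
    mixed = [0.0] * length
    for speaker, samples in tracks.items():
        gain = gains.get(speaker, 1.0)
        for i, sample in enumerate(samples):
            mixed[i] += gain * sample
    return mixed


# Hypothetical separated tracks for speakers A and B (assumed output of operation 505).
tracks = {"A": [0.2, 0.4, -0.2], "B": [0.1, -0.1, 0.3]}
# Mute speaker A only; speaker B is left at its original volume.
muted_a = mix_with_speaker_gains(tracks, {"A": 0.0})
```

With speaker A's gain set to 0.0, the mix reduces to speaker B's track, which corresponds to adjusting the first speaker while maintaining the second.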

In one embodiment, the method 520A of displaying voice information for each speaker may be a method of displaying, on the video preview screen, the voice information of each recognized speaker in a visually distinct manner. For example, because speech rate and volume differ from speaker to speaker, objects (or graphic elements or symbols) representing the current volume of each speaker may be rendered in different colors or signal waveform sizes to show visually which speaker is speaking at which volume. Examples of a user interface for the method 520A are described with reference to FIGS. 6 and 7.

FIG. 6 illustrates an example of a method of editing volume according to speaker selection, according to one embodiment.

This document assumes that the subjects are people. Therefore, when a video contains multiple subjects, it is necessary to determine whose volume among them is to be adjusted. In one embodiment, the subject may be designated by the user.

The electronic device 101 may detect face areas in the video and present the detected face areas as subject candidates. As shown in FIGS. 6(a) to 6(c), guide boxes may be displayed to indicate the subject candidates, and each guide box may be an area containing the face of a subject. By selecting a guide box, the user can designate the subject (e.g., the desired speaker) whose volume is to be controlled.

The description above takes as an example the case where areas containing subjects' faces are determined by face area detection over the entire video. Alternatively, after the user designates a part of the video, that is, a specific area of the video, as a subject area, face area detection may be performed within the designated subject area and the detected face areas may be presented as subject candidates. In addition, after face detection is performed over the entire video and the detected face areas are designated as subject candidates, the face areas corresponding to the subject candidates may be tracked.

Referring to FIG. 6(a), the electronic device 101 may display, in the video preview screen, the speakers (A, B) matched with the speakers' voices. For example, when speakers A and B are identified through face recognition and voice separation, the voice information of each of speakers A and B may be displayed visually using an object (or graphic element) such as 600a. If speaker A is selected in the video preview screen as in FIG. 6(b), the voice information of speaker A may be displayed visually using an object (or graphic element) such as 600b. Likewise, if speaker B is selected in the video preview screen as in FIG. 6(c), the voice information of speaker B may be displayed visually using an object (or graphic element) such as 600c.

FIG. 7 illustrates an example of displaying voice information for a selected speaker and an unselected speaker according to one embodiment. Enlarged views of the objects 600b and 600c may appear as in FIGS. 7(a) and 7(b), respectively.

In FIG. 7(a), the voice information of the selected speaker A may be displayed as a bar-shaped object (or graphic element) 705 adjacent to the video frames 700 displayed along the timeline. The bar-shaped object 705 may include a graphic element 710 that represents the voice information of the selected speaker A, such as speech rate or volume, as a signal waveform. In one embodiment, the bar-shaped object 705 may be referred to as a voice section indicator, which represents at least a portion of the voice section of the selected speaker A on the timeline. The voice information of the unselected speaker B may also be displayed for the sections of the video frames 700 in which speaker B appears. If the selected speaker A and the unselected speaker B appear and speak at the same time within the video frames 700, the voice information of both may be displayed together. In that case, the voice information of the selected speaker A may be displayed using the bar-shaped object 705 so that it stands out visually from that of the unselected speaker B. The size (or length) of the bar-shaped object 705 in FIG. 7(a) may correspond to the length of the section in which the selected speaker A appears within the video frames. Here, the section in which the selected speaker A appears may mean a section in which the face-recognized speaker A is present in the video frames and is speaking.
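
The definition above, a section in which the recognized face is present and the speaker is also speaking, amounts to intersecting two sets of time intervals. The following is a minimal sketch under the assumption that face tracking and voice separation each yield sorted, non-overlapping intervals in seconds; the function name and the sample intervals are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def intersect_intervals(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the overlap of two sorted, non-overlapping interval lists."""
    result: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical inputs: when speaker A's face is visible, and when A's voice is active.
face_visible = [(10.0, 40.0), (50.0, 70.0)]
speaking = [(5.0, 20.0), (30.0, 55.0)]
appearance = intersect_intervals(face_visible, speaking)
```

Each resulting interval would determine the position and length of one bar-shaped indicator on the timeline.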

Similarly, in FIG. 7(b), the voice information of the selected speaker B may be displayed as a bar-shaped object (or graphic element) 725 adjacent to the video frames 720 displayed along the timeline. The voice section of the selected speaker B may be displayed using the bar-shaped object 725 so as to correspond to the section in which speaker B appears within the video frames 720. The bar-shaped object 725 may include a graphic element 730 that represents the voice information of the selected speaker B, such as speech rate or volume, as a signal waveform.

Although one embodiment has been described using the example where the voice information of the selected speaker is represented by a bar-shaped object and a signal waveform, the form of the object (graphic element or symbol) used to represent that voice information is not limited thereto.

In one embodiment, the method 520B of displaying volume information and duration for each speaker may be a method of visually displaying, when a plurality of speakers appear on the video preview screen, the section in which the selected speaker's voice is present together with the volume information adjusted for the selected speaker. Examples of a user interface for the method 520B are described with reference to FIGS. 8A to 8C. FIG. 8A illustrates the state before the volume is adjusted for a selected speaker according to one embodiment, FIG. 8B illustrates the state when the selected speaker is muted according to one embodiment, and FIG. 8C illustrates the state before the volume is adjusted for another selected speaker according to one embodiment.

Referring to FIG. 8A, the electronic device 101 may display guide boxes (or areas) indicating the faces detected through face recognition, that is, the subject candidates corresponding to speakers, at the positions of the respective faces in the video. A user selection of the guide box for speaker B may serve to call up the volume control bar for the selected speaker B. For example, when the user selects the guide box for speaker B, a bar-shaped object 810 for adjusting the volume may be displayed. After the guide box is selected, an object 805 for deselecting speaker B may be displayed together with the bar-shaped object 810 for adjusting the volume. In addition, in response to the selection of speaker B's guide box, the section (or time) in which speaker B's voice information is present may be displayed visually.

For example, the user can raise or lower the volume by touching the bar-shaped object 810 for adjusting the volume and dragging it left or right, and can end the volume adjustment for speaker B by selecting the object 805 for deselection. If an input deselecting speaker B is received, the electronic device 101 may reset the volume adjusted for speaker B to the original volume, for example, a default volume.
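
The drag-to-adjust and reset-on-deselect behavior can be sketched as a small state object. The class name, the 0 to 100 level range, the default of 100, and the linear mapping from bar position to level are all assumptions made for illustration; the source only states that dragging changes the volume and that deselection restores the original (e.g., default) volume.

```python
from dataclasses import dataclass

MAX_VOLUME = 100      # assumed upper bound of the volume scale
DEFAULT_VOLUME = 100  # assumed default; the source only says "default volume"


@dataclass
class SpeakerVolumeControl:
    speaker: str
    volume: int = DEFAULT_VOLUME

    def drag_to(self, fraction: float) -> None:
        """Map a position on the bar (0.0 = left end, 1.0 = right end) to a level."""
        clamped = min(max(fraction, 0.0), 1.0)
        self.volume = round(clamped * MAX_VOLUME)

    def deselect(self) -> None:
        """Deselecting the speaker resets the adjusted volume to the default."""
        self.volume = DEFAULT_VOLUME


control = SpeakerVolumeControl("B")
control.drag_to(0.25)   # user drags the bar toward the left
quarter = control.volume
control.deselect()      # user taps the deselect object
```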

According to one embodiment, an object 815 indicating the volume level may be displayed in association with the object 805 for adjusting the volume. As the volume is raised or lowered using the object 805, the height of the edited (or adjusted) volume level, shown relative to the original volume level included in the object 815, may change so that the user can intuitively perceive the adjustment.

For example, as shown in FIG. 8B, when the voice of speaker B is muted using the object 805 for adjusting the volume, the object 815 may be displayed with the height of the edited (or adjusted) volume level lowered relative to the original volume level. In addition, the voice section of the selected speaker B may be displayed using a bar-shaped object 820 so as to correspond to the section in which speaker B appears within the video frames. Because the voice of speaker B is muted, a signal waveform 830 with no change in amplitude may be displayed in the object 820 representing the voice section. For example, whereas a signal waveform whose amplitude varies with the speaker's voice is displayed before muting, a waveform with no change in amplitude, such as a dotted line, may be displayed after muting.

According to one embodiment, a function may be provided for setting a duration that indicates how long the volume adjustment for the selected speaker B is to last. For example, as shown in FIG. 8A, the voice information of the selected speaker B may be represented using a bar-shaped object 820 and a signal waveform 825 so as to correspond to the section in which speaker B appears. If the user selects the guide box for speaker A, a bar-shaped object 840 may be displayed, as shown in FIG. 8C, so that the section (or time) in which speaker A's voice information is present corresponds to the section in which speaker A appears within the video frames. According to one embodiment, the object 820 displayed when speaker B is selected in FIG. 8A and the object 840 displayed when speaker A is selected in FIG. 8C may each be displayed so as to visually indicate its association with the selected speaker. For example, the objects (or graphic elements or symbols) for speaker A and speaker B may be rendered in different colors so that they are visually distinguishable.

Although one embodiment has been described using the example where the amplitudes of the signal waveforms 825 and 830 differ before the volume for the selected speaker B is changed in FIG. 8A and after it is changed in FIG. 8B, the values of various graphic elements, such as the color or opacity of the object 820 representing the voice section, may be changed instead, and the manner of visual display is not limited thereto. For example, the form of the object (graphic element or symbol) that lets the user intuitively perceive the volume adjustment state is not limited thereto.

In one embodiment, the method 520C of adjusting the duration of the volume editing information adjusted per speaker and per section may be a method of displaying the volume adjustment section (e.g., duration) for the selected speaker using the timeline on which the video frames are displayed. For example, the user can set the duration so that the volume adjusted for the selected speaker is maintained for a desired section. Examples of a user interface for the method 520C are described with reference to FIGS. 9 and 10. FIG. 9 illustrates a method of setting the duration of the adjusted volume within the voice section of a selected speaker according to one embodiment, and FIG. 10 is a diagram for explaining a method of setting the duration of the volume according to one embodiment.

When speaker B is selected as in FIG. 9(a), the electronic device 101 may identify, within the video frames displayed along the timeline, the video frame section that includes speaker B. The electronic device 101 may provide a function for adjusting the volume for speaker B over the video frame section in which speaker B is included (or appears). To this end, the electronic device 101 may indicate the start of the section in which speaker B appears within the video frames using a video frame indicator 900. For example, assuming that the section in which speaker B appears within the video frames runs from a first time point (e.g., 0:30) to a second time point (e.g., 1:02), muting speaker B's voice using the volume control bar may mute the entire duration from the first time point (e.g., 0:30) to the second time point (e.g., 1:02). For example, 1010 in FIG. 10 illustrates a comparison between the original voice information of the selected speaker B and the muted voice information. Looking at the volume adjustment section 1015 from the first time point (e.g., 0:30) to the second time point (e.g., 1:02), as shown at 1010 in FIG. 10, when the voice of the selected speaker B is muted within the volume adjustment section 1015, voice information with no signal waveform output may be displayed, in contrast to the original voice information.

According to one embodiment, the section in which the volume is adjusted, within the section in which speaker B appears, may be changed according to a user setting. For example, in response to a user input that adjusts the length of the bar-shaped object representing the voice section as shown in FIG. 9(b), the bar-shaped object may become shorter as shown in FIG. 9(c), and speaker B's volume adjustment section may shrink accordingly. For example, when the length of the bar-shaped object representing the voice section is adjusted, the start time changes from the first time point (e.g., 0:30) to a third time point (e.g., 0:46), as shown in FIG. 9(c), so that the duration from the third time point (e.g., 0:46) to the second time point (e.g., 1:02) is muted.

For example, as shown at 1020 in FIG. 10, when the start time is changed from the first time point (e.g., 0:30) to the third time point (e.g., 0:46) and the voice of the selected speaker B is muted from the third time point (e.g., 0:46) to the second time point (e.g., 1:02), the voice in the section 1025 from the first time point (e.g., 0:30) to the third time point (e.g., 0:46) may be restored to its original volume, as indicated by a signal waveform identical to that of the original voice information. In contrast, for the section 1030 from the third time point (e.g., 0:46) to the second time point (e.g., 1:02), voice information with no signal waveform output may be displayed, unlike the original voice information.
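
The effect of moving the start of the adjustment section can be sketched as a time-bounded gain. The `VolumeEdit` type, the `move_start` helper, and the choice of whole seconds with an exclusive end are assumptions for illustration; only the example time points (0:30, 0:46, 1:02) come from the description.

```python
from dataclasses import dataclass, replace


def parse_timestamp(text: str) -> int:
    """Convert an 'M:SS' label such as '1:02' to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


@dataclass(frozen=True)
class VolumeEdit:
    start: int   # seconds, inclusive
    end: int     # seconds, exclusive
    gain: float  # 0.0 mutes, 1.0 keeps the original volume

    def gain_at(self, t: int) -> float:
        """Gain applied at time t; outside the edit the original volume is kept."""
        return self.gain if self.start <= t < self.end else 1.0


def move_start(edit: VolumeEdit, new_start: int) -> VolumeEdit:
    """Return a copy of the edit with its start moved, keeping start < end."""
    if not new_start < edit.end:
        raise ValueError("start must precede end")
    return replace(edit, start=new_start)


# Mute over the whole appearance section, then drag the start to 0:46.
mute = VolumeEdit(parse_timestamp("0:30"), parse_timestamp("1:02"), gain=0.0)
trimmed = move_start(mute, parse_timestamp("0:46"))
```

After the trim, a time such as 0:40 falls outside the edit and plays at the original volume, while 0:50 remains muted.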

In one embodiment, the method 520D of adjusting the voice information of the same speaker to a different volume for each section may be a method of displaying the volume adjustment section (e.g., duration) for the speaker using the timeline on which the video frames are displayed. For example, the user can split the voice section of the same speaker and set a different volume for each resulting section. Examples of a user interface for the method 520D are described with reference to FIGS. 11 and 12. FIG. 11 illustrates a method of adjusting the volume by splitting the voice section of the same speaker according to one embodiment, and FIG. 12 illustrates a method of setting the volume after the voice section of the same speaker is split according to one embodiment.

FIG. 11(a) illustrates a state in which speaker A is selected. Assuming that the section in which speaker A appears within the video frames runs from a first time point (e.g., 0:15) to a second time point (e.g., 1:32), the voice information for speaker A may be split into sections. For example, the video frame indicator 1100 may be touched and dragged as shown in FIG. 11(a), and an object 1120 for splitting the section may be displayed as shown in FIG. 11(b). In response to selection of the object 1120 for splitting the section, the electronic device 101 may split the section in which speaker A appears (e.g., the section from the first time point (e.g., 0:15) to the second time point (e.g., 1:32)) at the video frame indicator 1100, which points to a third time point (e.g., 1:02).

With the section split, the user can raise or lower the volume by touching the bar-shaped object for adjusting the volume. If the volume is reduced from 100 to 0 using that object, the waveform within the object 1130, which represents the voice section with the volume level set to 100 in FIG. 11(c), may be displayed with no signal waveform, as in the object 1145 shown in FIG. 11(d), and the height of the edited (or adjusted) volume level, shown relative to the original volume level included in the object 1140 representing the volume level, may change. In addition, as shown in FIG. 11(e), the video frame indicator 1100 may be displayed to show that each speaker's voice information section 1150 is split at the third time point (e.g., 1:02).

Referring to FIG. 12, if the section of voice information for the same speaker before splitting (e.g., from 0:15 to 1:32) is denoted 1200, then after splitting it may be divided into a first section (e.g., from 0:15 to 1:02) and a second section (e.g., from 1:02 to 1:32). For example, the electronic device 101 may adjust the volume for the speaker over at least a portion of the video frame section including the speaker (e.g., the first section (e.g., from 0:15 to 1:02)), and may maintain the volume for the speaker over the remainder of the video frame section including the speaker (e.g., the second section (e.g., from 1:02 to 1:32)).

According to one embodiment, the first section 1205 (e.g., from 0:15 to 1:02) and the second section 1210 (e.g., from 1:02 to 1:32) are voice information sections for the same speaker, but may be set to have different volume levels and durations.
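
Splitting one speaker's voice section and giving the two parts different volumes can be sketched as follows. The `VoiceSection` type, the `split_section` helper, and the 0 to 100 volume scale are assumptions for illustration; the time points (0:15, 1:02, 1:32) are the examples given above, expressed in seconds.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class VoiceSection:
    start: int   # seconds
    end: int     # seconds
    volume: int  # assumed 0..100 scale, where 100 is the original level


def split_section(section: VoiceSection, at: int) -> Tuple[VoiceSection, VoiceSection]:
    """Split a section at `at`; both halves initially keep the same volume."""
    if not section.start < at < section.end:
        raise ValueError("split point must lie strictly inside the section")
    return replace(section, end=at), replace(section, start=at)


whole = VoiceSection(start=15, end=92, volume=100)  # 0:15 to 1:32
first, second = split_section(whole, at=62)         # split at 1:02
first = replace(first, volume=0)                    # adjust only the first section
```

The second section is never modified, so its volume is maintained while the first section is muted.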

FIG. 13 illustrates a method of automatically adjusting a volume at or above a threshold according to one embodiment.

According to one embodiment, the electronic device 101 may analyze a voice signal separated from the content and compare the magnitude of the voice signal, that is, its volume, with a threshold. If a voice whose volume is greater than the threshold is detected, the electronic device 101 may lower that volume so that it falls within a specified range.

In one embodiment, an automatic volume adjustment function may be provided, and whether the function is activated may be determined according to a user setting. For example, as shown in FIG. 13(a), the automatic volume adjustment function may be activated in response to a user selection of an object 1300 representing the function. In response to selection of the automatic volume adjustment function, the electronic device 101 may compare the volume for each of the speakers with a threshold and lower any volume exceeding the threshold to a specified volume. In addition, when one speaker A among the plurality of speakers is selected as shown in FIG. 13(a), the volume for the selected speaker A may be compared with the threshold, and if the volume of the selected speaker A exceeds the threshold in the signals 1315 within the voice information section, the excessive volume level may be flagged by changing a graphic element of the object 1310 representing the voice information section. If the object 1300 representing the automatic volume adjustment function is selected, the electronic device 101 may lower the volume of the selected speaker A that exceeds the threshold to the specified volume. As the automatic volume adjustment function is activated, an object 1330 may be displayed with the height of the edited (or adjusted) volume level lowered relative to the original volume level, as shown in FIG. 13(b), and the excessive volume level of FIG. 13(a) may be lowered to the specified volume level. For example, if the volume of a specific section is at or above a threshold (e.g., 90 dB), the electronic device 101 may lower the volume level to the section average (e.g., 76 dB).
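
The threshold comparison can be sketched on a list of per-window levels in dB. This is a simplified illustration, not the described device's algorithm: the function names, the windowing, the strict "greater than" comparison, and the use of a plain arithmetic mean of dB values as the "section average" are all assumptions. The sample levels are chosen so that the average comes out at 76 dB, matching the example figure above.

```python
from statistics import mean
from typing import List


def auto_adjust(levels_db: List[float], threshold_db: float = 90.0) -> List[float]:
    """Replace every level above the threshold with the section average."""
    target = mean(levels_db)  # assumed definition of the "section average"
    return [target if level > threshold_db else level for level in levels_db]


def db_to_linear_gain(delta_db: float) -> float:
    """Convert a level change in dB to an amplitude scale factor."""
    return 10 ** (delta_db / 20)


levels = [70.0, 72.0, 95.0, 67.0]   # hypothetical per-window levels for speaker A
adjusted = auto_adjust(levels)      # the 95 dB window is lowered to 76 dB
gain = db_to_linear_gain(adjusted[2] - levels[2])  # scale factor for that window
```

Lowering a window by 19 dB corresponds to multiplying its samples by roughly 0.11; windows at or below the threshold are left unchanged.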

FIG. 14 is an exemplary diagram illustrating a method of automatically adjusting a volume corresponding to a loud noise according to an embodiment.

Referring to FIG. 14(a), apart from cases where the speaker's volume exceeds the threshold, the signals 1415 within the voice information section may include a section in which a spike (or loud noise) occurs due to an ambient audio signal. In this case, the occurrence of the spike (or loud noise) may be warned of by changing a graphic element of the object 1410 representing the voice information section. For example, the electronic device 101 may indicate the section in which the spike (or loud noise) occurs using a video frame indicator, as shown in FIG. 14(b), and may indicate the automatically adjusted level using the object 1420.
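The text does not say how spike sections are located. One plausible sketch, under the assumption that a spike is simply a contiguous run of frames whose level exceeds a threshold, is shown below. The function name `find_spike_sections` and the frame-index representation are hypothetical.

```python
def find_spike_sections(levels_db, threshold_db=90.0):
    """Return (start, end) frame-index pairs, end exclusive, for each
    contiguous run of frames louder than the threshold.

    Assumption: a "spike" is any run above the threshold; the source does
    not define the detection rule.
    """
    sections = []
    start = None
    for i, level in enumerate(levels_db):
        if level > threshold_db:
            if start is None:
                start = i              # a new loud run begins here
        elif start is not None:
            sections.append((start, i))
            start = None
    if start is not None:              # run continues to the last frame
        sections.append((start, len(levels_db)))
    return sections


print(find_spike_sections([60, 95, 96, 61, 60, 99]))
```

Each returned pair could then be mapped to a position on the timeline, for example to place the video frame indicator at the start of the run.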

Although the above description uses the example of lowering a volume that exceeds a threshold for the selected speaker, the present invention may also be applied to a case where the volume of the selected speaker is below a specific threshold and is adjusted to an average volume level.

FIG. 15 is a diagram illustrating an example of speaker tracking and voice information display based on speaker recognition according to an embodiment.

In FIG. 15(a), in response to a selection of the object 1500 for entering a volume editing mode, the electronic device 101 may, as shown in FIG. 15(b), display a region indicating each speaker recognized through face or person recognition in the video and may track the recognized speaker. Here, the electronic device 101 may identify, through face detection, alignment, face recognition, and/or face tracking in the video included in the content, a region corresponding to a subject candidate (e.g., a speaker) in the video and a section of the video frames in which the subject candidate is present. In addition, the electronic device 101 may identify a plurality of voice signals included in the content, perform voice separation for the section in which the subject candidate is present, and match a voice to each speaker in the video.
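The matching step is not specified in detail. As a rough sketch only, the code below assumes that face tracking yields time sections (in seconds) during which each speaker is on screen, that voice separation yields time sections during which each separated voice is active, and that a voice is assigned to the speaker whose on-screen sections overlap it most. The overlap heuristic and all names are assumptions made for illustration.

```python
def overlap(a, b):
    """Length of the overlap between two (start, end) sections, in seconds."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def total_overlap(sections_a, sections_b):
    """Sum of pairwise overlaps between two lists of sections."""
    return sum(overlap(a, b) for a in sections_a for b in sections_b)


def match_voices_to_speakers(face_sections, voice_sections):
    """Map each voice id to the speaker id with the largest temporal overlap.

    Voices that overlap no speaker's on-screen sections are left unmatched.
    """
    matches = {}
    for voice_id, sections in voice_sections.items():
        best = max(face_sections,
                   key=lambda s: total_overlap(face_sections[s], sections),
                   default=None)
        if best is not None and total_overlap(face_sections[best], sections) > 0:
            matches[voice_id] = best
    return matches


faces = {"A": [(0, 30)], "B": [(46, 62)]}        # hypothetical on-screen sections
voices = {"v1": [(2, 28)], "v2": [(47, 60)]}     # hypothetical voice activity
print(match_voices_to_speakers(faces, voices))
```

A real system would more likely combine such timing cues with lip movement or voice characteristics; this sketch only shows the bookkeeping.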

Referring to FIG. 15(b), the electronic device 101 may display the speaker recognition result using regions (or guide boxes) indicating each recognized face, and may indicate that a voice is being output for each speaker using graphic elements of different colors within the voice information section 1510. For example, the color of the region (or guide box) indicating each speaker in FIG. 15(b) may match the color used for that speaker within the voice information section 1510. As shown in FIG. 15(c), when speaker B is selected, the color of the guide box indicating speaker B may be the same as the color of the bar-shaped object 1520 representing the voice information section. Likewise, as shown in FIG. 15(d), when speaker A is selected, the color of the guide box indicating speaker A may be the same as the color of the bar-shaped object 1530 representing the voice information section. As shown in FIGS. 15(b) to 15(c), even when the speaker who appears changes, the angle of view on the speaker changes, or the composition changes, the electronic device 101 may use face tracking to display the region (or guide box) indicating the speaker over the speaker's face.

FIG. 16 is a diagram illustrating an example of voice information display according to selection of a speaker to be edited, according to an embodiment. In FIG. 16(a), while regions (or guide boxes) indicating the respective speakers are displayed, when the user selects the region of speaker A, an object for deselecting speaker A and an object for adjusting the volume may be displayed, as shown in FIG. 16(c). On the other hand, in response to an input deselecting speaker B in FIG. 16(b), the electronic device 101 may remove the object for deselecting and the object for adjusting the volume so that they are no longer displayed, as shown in FIG. 16(a). In this case, when the speaker selection is released, the speaker's volume may be reset to a default volume.
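The select, adjust, and deselect behavior can be modeled with a small state holder. This is a sketch under stated assumptions: the class name `SpeakerVolumes`, the 0 to 100 level range, and a default level of 100 are illustrative and not taken from the source.

```python
DEFAULT_VOLUME = 100  # assumed default level on a 0-100 scale


class SpeakerVolumes:
    """Tracks a per-speaker volume level and the currently selected speaker."""

    def __init__(self, speaker_ids):
        self.volumes = {s: DEFAULT_VOLUME for s in speaker_ids}
        self.selected = None

    def select(self, speaker_id):
        if speaker_id not in self.volumes:
            raise KeyError(speaker_id)
        self.selected = speaker_id

    def set_volume(self, level):
        """Adjust only the selected speaker; all other speakers keep their level."""
        if self.selected is None:
            raise ValueError("no speaker selected")
        self.volumes[self.selected] = max(0, min(100, level))

    def deselect(self):
        """Release the selection and reset that speaker to the default volume."""
        if self.selected is not None:
            self.volumes[self.selected] = DEFAULT_VOLUME
            self.selected = None


state = SpeakerVolumes(["A", "B"])
state.select("A")
state.set_volume(40)
print(state.volumes)   # A is changed, B is untouched
state.deselect()
print(state.volumes)   # A is back at the default
```

Keeping the levels in one dictionary keyed by speaker makes it straightforward to adjust one speaker while the others are maintained.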

FIG. 17 is an exemplary diagram illustrating a method of adjusting the volume for a selected speaker according to an embodiment.

Referring to FIG. 17(a), when a speaker is selected, an object for adjusting the volume is displayed, and the volume level of the selected speaker may be adjusted by touching and then dragging the object. So that the user can visually follow the adjustment, an object representing the edited (or adjusted) volume level, whose height changes relative to the original volume level included in the object representing the volume level, may be displayed as shown in FIG. 17(b). In addition, when the object for adjusting the volume for the selected speaker is selected in FIG. 17(c) and the volume level is adjusted, an object with a raised volume level may be displayed as shown in FIG. 17(d). In FIGS. 17(a) to 17(d), a reset object for returning to the original volume level may be displayed.

FIG. 18 is a flowchart of operations for volume adjustment using a voice section indicator for a selected speaker according to an embodiment. To aid understanding, FIG. 18 is described with reference to FIG. 19. FIG. 19 is an exemplary diagram illustrating a method of adjusting the volume using a voice section indicator for a selected speaker according to an embodiment.

Referring to FIG. 18, in the volume editing mode, in operation 1805, the electronic device 101 may identify whether there is a drag input on the voice section indicator of the selected speaker.

In response to identifying the drag input on the voice section indicator of the selected speaker, in operation 1810, the electronic device 101 may display the video frames shown along the timeline and the volume adjustment section for the selected speaker based on the drag input.

As shown in FIG. 19(a), while one of the plurality of speakers is selected, the user may set a volume adjustment section for the selected speaker using the voice section indicator associated with that speaker. For example, assuming that the voice information section in which the selected speaker appears within the video frames runs from a first time point (e.g., 0:30) to a second time point (e.g., 1:02), the length of the voice section indicator may correspond to the length of the section from the first time point to the second time point.

If the left side of the voice section indicator in FIG. 19(a) is touched and dragged to the right, the start of the voice information section may change from the first time point (e.g., 0:30) to a third time point (e.g., 0:46). When the volume adjustment mutes the voice of the selected speaker, the electronic device 101 may mute the selected speaker's voice in the section from the third time point (e.g., 0:46) to the second time point (e.g., 1:02), as shown in FIG. 19(b).
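Muting a selected speaker over part of the timeline amounts to applying a gain of zero to that speaker's separated track between two time points. The sketch below assumes the track is a plain list of samples at a known sample rate; `apply_gain` and its parameters are hypothetical names.

```python
def apply_gain(samples, sample_rate, start_s, end_s, gain):
    """Return a copy of samples with `gain` applied from start_s to end_s
    (seconds, end exclusive). Samples outside that section are unchanged.
    """
    first = max(0, int(start_s * sample_rate))
    last = min(len(samples), int(end_s * sample_rate))
    return [s * gain if first <= i < last else s
            for i, s in enumerate(samples)]


# Toy track: one sample per second for 10 seconds, muted from 4 s to 8 s.
track = [1.0] * 10
print(apply_gain(track, 1, 4, 8, 0.0))
```

With the time points from the example, the call would use `start_s=46` and `end_s=62`, and only the selected speaker's track would be passed in, so the other speakers' tracks are unaffected.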

In operation 1815, the electronic device 101 may identify whether the input dragging the voice section indicator is released. In response to the release of the drag input, in operation 1820, the electronic device 101 may identify whether the time point indicated by the voice section indicator falls outside the video frame section that includes the selected speaker. If the indicated time point does not fall outside that video frame section, in operation 1825, the electronic device 101 may set the volume adjustment section of the selected speaker to extend up to the video frame at the time point indicated by the voice section indicator when the drag input is released.

On the other hand, if the time point indicated by the voice section indicator falls outside the video frame section that includes the selected speaker, in operation 1830, the electronic device 101 may warn that volume adjustment is not possible in that section. In operation 1835, the electronic device 101 may set the volume adjustment section of the selected speaker to extend up to the video frame closest to the time point indicated by the voice section indicator when the drag input is released. According to one embodiment, operation 1835 may be performed at the same time as the warning.
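Operations 1820 to 1835 can be read as a clamp with a warning flag. The following sketch assumes time points are expressed in seconds and that the section in which the selected speaker appears is a single (start, end) pair; the function name `resolve_drag_release` is hypothetical.

```python
def resolve_drag_release(released_at, speaker_section):
    """Decide where the volume adjustment section ends after a drag is released.

    Returns (time_point, warn). If the released position lies inside the
    speaker's section it is used as is (operation 1825). Otherwise the nearest
    boundary of the section is used and warn is True (operations 1830 and 1835).
    """
    start, end = speaker_section
    if start <= released_at <= end:
        return released_at, False
    clamped = min(max(released_at, start), end)
    return clamped, True


print(resolve_drag_release(46, (30, 62)))    # inside the section
print(resolve_drag_release(172, (30, 62)))   # 2:52 is past 1:02, so it is clamped
```

Returning the warning as a flag lets the caller show the warning and apply the clamped section in the same step, as in the embodiment where both happen together.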

If the right side of the voice section indicator in FIG. 19(b) is touched and dragged to the right, the end of the voice information section indicated by the voice section indicator may move past the second time point (e.g., 1:02). In FIG. 19(c), when the input dragging the voice section indicator 1900 moves to a fourth time point (e.g., 2:52), the electronic device 101 may identify whether the input has left the section in which the selected speaker appears within the video frames. If it has left that section, the voice section indicator may not be dragged any further. When the voice section indicator moves, the video frame indicator may move with it.

As shown in FIG. 19(d), when the time point indicated by the input dragging the voice section indicator 1900 is outside the section in which the selected speaker appears, the electronic device 101 may display the voice section indicator 1910 so that it points to the last time point (e.g., 1:02) of the video frames in which the selected speaker appears. In this way, the electronic device 101 may set the volume adjustment section of the selected speaker to extend up to the last frame of the video frames that include the selected speaker.

FIG. 20 is an exemplary diagram illustrating a method of adjusting the volume using a video frame indicator according to an embodiment.

Referring to FIG. 20(a), the electronic device 101 may indicate the start of the section in which speaker B appears within the video frames using the video frame indicator 2010. In FIG. 20(a), it is assumed that the section in which speaker B appears runs from a first time point (e.g., 0:46) to a second time point (e.g., 1:02). If the user mutes speaker B's voice using the volume control bar and then drags the video frame indicator 2010 to a third time point (e.g., 1:45), as shown in FIG. 20(b), the video frame indicator 2010 may be moved beyond the section in which speaker B appears. However, when the video frame indicator 2010 moves beyond that section, the volume adjustment section for speaker B may be displayed in an inactive state. In addition, the electronic device 101 may display a preview image 2020 corresponding to the video frame indicated by the video frame indicator 2010, as shown in FIG. 20(c). For example, a preview image 2020 corresponding to the second time point (e.g., 1:02) may be displayed. If the user selects speaker B within the preview image 2020, at least one of an object for adjusting the volume of speaker B, an object indicating the current volume level resulting from the adjustment, or the volume adjustment section of speaker B may be displayed, as shown in FIG. 20(d).

FIG. 21 is an exemplary diagram illustrating voice information display according to sequential selection of speakers according to an embodiment.

FIG. 21(a) illustrates a case in which speaker A and speaker B have been recognized and speaker B is selected as the target of volume adjustment. If speaker B's volume is adjusted to 0, the object 2105 representing the volume adjustment section may be displayed as a waveform with no change in amplitude, for example, as a dotted line. When speaker A is then selected, an object 2110 representing speaker A's volume adjustment section may be displayed, as shown in FIG. 21(b). If speaker A's volume level is lowered from 100 to 50 using the volume control object, the waveform within the object 2115 representing speaker A's volume adjustment section may also be displayed with a reduced amplitude, as shown in FIG. 21(c).
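The displayed waveform can be thought of as the original amplitudes scaled by the chosen level. The sketch below assumes a linear mapping in which level 100 leaves the waveform unchanged and level 0 flattens it; the linear mapping and the name `scale_waveform` are assumptions, since the source only describes the visual result.

```python
def scale_waveform(amplitudes, level, full_level=100):
    """Scale waveform amplitudes linearly by level / full_level.

    Level 0 produces a flat line, matching the dotted-line display for a
    speaker whose volume is set to 0.
    """
    if not 0 <= level <= full_level:
        raise ValueError("level must be between 0 and full_level")
    factor = level / full_level
    return [a * factor for a in amplitudes]


wave = [0.2, -0.8, 0.6]            # hypothetical normalized amplitudes
print(scale_waveform(wave, 50))    # amplitudes are halved
print(scale_waveform(wave, 0))     # flat waveform
```

The same factor could be applied to the audio samples themselves so that the drawn waveform and the audible level stay consistent.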

Meanwhile, when an area other than the regions (or guide boxes) indicating the speakers is selected, as in FIG. 21(c), the electronic device 101 may deactivate the selection of each of speakers A and B, as shown in FIG. 21(d). According to one embodiment, when volume adjustment for at least one of speaker A or speaker B has been completed, a graphic element (e.g., color or thickness) of at least part of the volume adjustment section corresponding to each speaker may be displayed differently to indicate visually that the volume of that section has been adjusted.

Electronic devices according to the various embodiments disclosed in this document may be devices of various types. The electronic devices may include, for example, a portable communication device (e.g., a smartphone), a computer device, a portable multimedia device, a portable medical device, a camera, a wearable device, or a home appliance. Electronic devices according to the embodiments of this document are not limited to the devices described above.

The various embodiments of this document and the terms used in them are not intended to limit the technical features described in this document to specific embodiments, and should be understood to include various modifications, equivalents, or alternatives of those embodiments. In the description of the drawings, similar reference numerals may be used for similar or related components. The singular form of a noun corresponding to an item may include one or more of the items unless the relevant context clearly indicates otherwise. In this document, each of the phrases "A or B", "at least one of A and B", "at least one of A or B", "A, B, or C", "at least one of A, B, and C", and "at least one of A, B, or C" may include any one of the items listed together in the corresponding phrase, or all possible combinations of them. Terms such as "first" and "second" may be used simply to distinguish one component from another and do not limit the components in other respects (e.g., importance or order). When one (e.g., first) component is referred to as "coupled" or "connected" to another (e.g., second) component, with or without the term "functionally" or "communicatively", it means that the one component may be connected to the other component directly (e.g., by wire), wirelessly, or through a third component.

The term "module" used in the various embodiments of this document may include a unit implemented in hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, component, or circuit. A module may be an integrally formed component, or a minimum unit of such a component or a part thereof, that performs one or more functions. For example, according to one embodiment, a module may be implemented in the form of an application-specific integrated circuit (ASIC).

The various embodiments of this document may be implemented as software (e.g., a program 140) including one or more instructions stored in a storage medium (e.g., an internal memory 136 or an external memory 138) readable by a machine (e.g., the electronic device 101). For example, a processor (e.g., a processor 120) of the machine (e.g., the electronic device 101) may call at least one of the one or more instructions stored in the storage medium and execute it. This enables the machine to be operated to perform at least one function according to the at least one instruction called. The one or more instructions may include code generated by a compiler or code executable by an interpreter. The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, "non-transitory" only means that the storage medium is a tangible device and does not include a signal (e.g., an electromagnetic wave); the term does not distinguish between a case where data is stored semi-permanently in the storage medium and a case where it is stored temporarily.

According to one embodiment, a method according to the various embodiments disclosed in this document may be provided as part of a computer program product. The computer program product may be traded as a commodity between a seller and a buyer. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., a compact disc read only memory (CD-ROM)), or may be distributed online (e.g., downloaded or uploaded) through an application store (e.g., Play Store™) or directly between two user devices (e.g., smartphones). In the case of online distribution, at least part of the computer program product may be at least temporarily stored in, or temporarily generated in, a machine-readable storage medium such as the memory of a manufacturer's server, an application store's server, or a relay server.

According to various embodiments, each of the components described above (e.g., a module or a program) may include a single entity or a plurality of entities, and some of the plurality of entities may be arranged separately in other components. According to various embodiments, one or more of the components or operations described above may be omitted, or one or more other components or operations may be added. Alternatively or additionally, a plurality of components (e.g., modules or programs) may be integrated into a single component. In this case, the integrated component may perform one or more functions of each of the plurality of components in the same or a similar manner as they were performed by the corresponding component before the integration. According to various embodiments, operations performed by a module, a program, or another component may be executed sequentially, in parallel, repeatedly, or heuristically; one or more of the operations may be executed in a different order or omitted; or one or more other operations may be added.

Claims

An electronic device (101) comprising:
a display (360); and
at least one processor (320),
wherein the at least one processor is configured to:
identify a plurality of voice signals included in content,
identify a plurality of speakers in a video included in the content,
match the voice signals with the speakers, respectively,
receive a first user input on a volume control object related to a first speaker among the speakers in the video displayed on the display, and
in response to the first user input, adjust a volume for the first speaker and maintain a volume for a second speaker among the plurality of speakers.

The electronic device of claim 1, wherein the at least one processor is configured to:
in response to a second user input on a volume control object related to the second speaker, maintain the volume for the first speaker and adjust the volume for the second speaker among the plurality of speakers.

The electronic device of claim 1 or 2, wherein the at least one processor is configured to:
display a region indicating each of the speakers in the video, and
in response to an input selecting one of the regions, display, on at least part of the display, a volume control object for adjusting the volume for the speaker corresponding to the selected region.

The electronic device of any one of claims 1 to 3, wherein the at least one processor is configured to:
in response to an input deselecting one of the regions indicating each of the speakers in the video, reset the adjusted volume for the first speaker to a default volume.

The electronic device of any one of claims 1 to 4, wherein the at least one processor is configured to:
display frames of the video along a timeline on at least part of the display, and
adjust the volume for the first speaker from a time point selected on the timeline.

The electronic device of any one of claims 1 to 5, wherein the at least one processor is configured to:
display, on the timeline, a voice section indicator indicating at least part of a voice section of the first speaker, and
set a volume adjustment section of the first speaker based on an input dragging the voice section indicator.

The electronic device of any one of claims 1 to 6, wherein the at least one processor is configured to:
identify, among the frames of the video displayed along the timeline, a video frame section that includes the first speaker, and
adjust the volume for the first speaker for the video frame section that includes the first speaker.

The electronic device of any one of claims 1 to 7, wherein the at least one processor is configured to:
when the input dragging the voice section indicator on the timeline is released, identify whether the time point indicated by the voice section indicator falls outside the video frame section that includes the first speaker, and
when the time point indicated by the voice section indicator falls outside the video frame section that includes the first speaker, set the volume adjustment section of the first speaker to extend up to the last frame of the video frames that include the first speaker.

제1항 내지 제8항 중 어느 한 항에 있어서, 상기 적어도 하나의 프로세서는,
상기 제1 화자가 포함된 영상 프레임 구간을 분리하는 입력에 대응하여, 상기 제1 화자가 포함된 영상 프레임 구간의 적어도 일부 구간에 대해 상기 제1 화자에 대한 음량을 조절하고,
상기 제1 화자가 포함된 영상 프레임 구간의 나머지 구간에 대해 상기 제1 화자에 대한 음량을 유지하도록 설정된, 전자 장치.
The electronic device of any one of claims 1 to 8, wherein the at least one processor is configured to:
in response to an input splitting the video frame section including the first speaker, adjust the volume for the first speaker for at least a partial section of the video frame section including the first speaker; and
maintain the volume for the first speaker for the remaining section of the video frame section including the first speaker.
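
One way to picture the split behavior is the sketch below. It is an illustration under stated assumptions, not the claimed implementation: the function `gains_after_split`, its parameters, the linear gain factor, and the default gain of 1.0 are all hypothetical.

```python
def gains_after_split(section_start: int,
                      section_end: int,
                      split_frame: int,
                      adjusted_gain: float,
                      adjust_first_part: bool = True,
                      default_gain: float = 1.0) -> dict[int, float]:
    """Map each frame of the inclusive section to the first speaker's gain.

    Assumption: the split input divides the section at split_frame. Frames
    before split_frame form the first part; the rest form the second part.
    Only the chosen part receives adjusted_gain; the other keeps default_gain.
    """
    gains: dict[int, float] = {}
    for frame in range(section_start, section_end + 1):
        in_first_part = frame < split_frame
        if in_first_part == adjust_first_part:
            gains[frame] = adjusted_gain   # part selected for adjustment
        else:
            gains[frame] = default_gain    # remaining part is left unchanged
    return gains
```

Calling `gains_after_split(0, 5, 3, 0.5)` would halve the first speaker's gain on frames 0 to 2 and leave frames 3 to 5 at the default.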

제1항 내지 제9항 중 어느 한 항에 있어서, 상기 적어도 하나의 프로세서는,
음량 자동 조절 기능에 대한 선택에 대응하여, 상기 화자들 각각에 대한 음량을 임계값과 비교하고,
상기 임계값을 초과하는 음량을 지정된 음량으로 조절하도록 설정된, 전자 장치.
The electronic device of any one of claims 1 to 9, wherein the at least one processor is configured to:
in response to selection of an automatic volume adjustment function, compare the volume for each of the speakers with a threshold; and
adjust a volume exceeding the threshold to a specified volume.
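
The automatic adjustment can be sketched as follows. This is a minimal illustration, assuming each speaker's volume is available as a single number; the function `auto_adjust_volumes` and the speaker labels are hypothetical and do not come from the source.

```python
def auto_adjust_volumes(volumes: dict[str, float],
                        threshold: float,
                        specified_volume: float) -> dict[str, float]:
    """Return per-speaker volumes after the automatic adjustment.

    Each speaker's volume is compared with the threshold; a volume that
    exceeds it is replaced by specified_volume, and all others are kept.
    """
    return {
        speaker: (specified_volume if volume > threshold else volume)
        for speaker, volume in volumes.items()
    }
```

With a threshold of 0.8 and a specified volume of 0.6, a speaker at 0.9 would be lowered to 0.6, while a speaker at 0.4 would be unchanged.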

전자 장치에서 화자별 음량 조절을 위한 방법에 있어서,
컨텐트에 포함된 복수의 음성 신호들을 식별하는 동작;
상기 컨텐트에 포함된 영상 내의 복수의 화자들을 식별하는 동작;
상기 음성 신호들과 상기 화자들을 각각 매칭하는 동작;
상기 전자 장치의 디스플레이 상에 표시된 상기 영상 내의 화자들 중 제1 화자와 관련된 음량 제어 객체에 대한 제1 사용자 입력을 수신하는 동작; 및
상기 제1 사용자 입력에 대응하여, 상기 제1 화자에 대한 음량을 조절하고, 상기 복수의 화자들 중 제2 화자에 대한 음량을 유지하는 동작을 포함하는, 화자별 음량 조절을 위한 방법.
A method for adjusting volume for each speaker in an electronic device, the method comprising:
identifying a plurality of voice signals included in content;
identifying a plurality of speakers in an image included in the content;
matching the voice signals to the speakers, respectively;
receiving a first user input on a volume control object related to a first speaker among the speakers in the image displayed on a display of the electronic device; and
in response to the first user input, adjusting a volume for the first speaker and maintaining a volume for a second speaker among the plurality of speakers.
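
The final step of this method, adjusting one speaker's volume while maintaining another's, can be sketched as a per-speaker gain applied before mixing. The sketch assumes the voice signals have already been separated and matched to speakers as lists of samples; the function `mix_with_speaker_gains`, the speaker labels, and the linear gain model are hypothetical and are not stated in the source.

```python
def mix_with_speaker_gains(tracks: dict[str, list[float]],
                           gains: dict[str, float]) -> list[float]:
    """Mix per-speaker sample tracks into one signal.

    Assumption: tracks maps a speaker label to that speaker's separated
    samples. A speaker listed in gains is scaled by that gain; any other
    speaker keeps a gain of 1.0, so its volume is maintained.
    """
    length = max((len(samples) for samples in tracks.values()), default=0)
    mixed = [0.0] * length
    for speaker, samples in tracks.items():
        gain = gains.get(speaker, 1.0)
        for i, sample in enumerate(samples):
            mixed[i] += gain * sample
    return mixed
```

Here a user input on the first speaker's volume control object would change only that speaker's entry in `gains`, leaving the second speaker's contribution to the mix as it was.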

제11항에 있어서,
상기 제2 화자와 관련된 음량 제어 객체에 대한 제2 사용자 입력에 대응하여, 상기 제1 화자에 대한 음량을 유지하고, 상기 복수의 화자들 중 제2 화자에 대한 음량을 조절하는 동작을 더 포함하는, 화자별 음량 조절을 위한 방법.
The method of claim 11, further comprising:
in response to a second user input on a volume control object related to the second speaker, maintaining the volume for the first speaker and adjusting the volume for the second speaker among the plurality of speakers.

제11항 또는 제12항에 있어서,
상기 영상 내의 화자들 각각을 나타내는 영역을 표시하는 동작; 및
상기 영역들 중 어느 하나의 영역을 선택하는 입력에 대응하여, 상기 영역에 대응하는 화자에 대한 음량을 조절하기 위한 음량 제어 객체를 상기 디스플레이의 적어도 일부 상에 표시하는 동작을 더 포함하는, 화자별 음량 조절을 위한 방법.
The method of claim 11 or 12, further comprising:
displaying regions each representing one of the speakers in the image; and
in response to an input selecting one of the regions, displaying, on at least a portion of the display, a volume control object for adjusting the volume for the speaker corresponding to the selected region.

제11항 내지 제13항 중 어느 한 항에 있어서,
상기 영상 내의 화자들 각각을 나타내는 영역들 중 어느 하나의 영역에 대한 선택을 해제하는 입력에 대응하여, 상기 제1 화자에 대한 조절된 음량을 디폴트 음량으로 초기화하는 동작을 더 포함하는, 화자별 음량 조절을 위한 방법.
The method of any one of claims 11 to 13, further comprising:
in response to an input deselecting one of the regions each representing one of the speakers in the image, resetting the adjusted volume for the first speaker to a default volume.

제11항 내지 제14항 중 어느 한 항에 있어서,
상기 영상의 프레임들을 타임 라인에 따라 상기 디스플레이의 적어도 일부 상에 표시하는 동작; 및
상기 타임 라인에 대해 선택된 시점부터 상기 제1 화자에 대한 음량을 조절하는 동작을 더 포함하는, 화자별 음량 조절을 위한 방법.
The method of any one of claims 11 to 14, further comprising:
displaying frames of the image along a timeline on at least a portion of the display; and
adjusting the volume for the first speaker from a time point selected on the timeline.

제11항 내지 제15항 중 어느 한 항에 있어서,
상기 타임 라인 상에서 상기 제1 화자의 음성 구간의 적어도 일부를 나타내는 음성 구간 인디케이터를 표시하는 동작;
상기 음성 구간 인디케이터를 드래그하는 입력에 기반하여, 상기 제1 화자의 음량 조절 구간을 설정하는 동작을 더 포함하는, 화자별 음량 조절을 위한 방법.
The method of any one of claims 11 to 15, further comprising:
displaying, on the timeline, a voice section indicator indicating at least a portion of a voice section of the first speaker; and
setting a volume control section for the first speaker based on an input dragging the voice section indicator.

제11항 내지 제16항 중 어느 한 항에 있어서,
상기 타임 라인에 따라 표시되는 상기 영상의 프레임들 중 상기 제1 화자가 포함된 영상 프레임 구간을 식별하는 동작; 및
상기 제1 화자가 포함된 영상 프레임 구간에 대해 상기 제1 화자에 대한 음량을 조절하는 동작을 더 포함하는, 화자별 음량 조절을 위한 방법.
The method of any one of claims 11 to 16, further comprising:
identifying, among the frames of the image displayed along the timeline, a video frame section including the first speaker; and
adjusting the volume for the first speaker for the video frame section including the first speaker.

제11항 내지 제17항 중 어느 한 항에 있어서,
상기 제1 화자가 포함된 영상 프레임 구간을 분리하는 입력에 대응하여, 상기 제1 화자가 포함된 영상 프레임 구간의 적어도 일부 구간에 대해 상기 제1 화자에 대한 음량을 조절하는 동작; 및
상기 제1 화자가 포함된 영상 프레임 구간의 나머지 구간에 대해 상기 제1 화자에 대한 음량을 유지하는 동작을 더 포함하는, 화자별 음량 조절을 위한 방법.
The method of any one of claims 11 to 17, further comprising:
in response to an input splitting the video frame section including the first speaker, adjusting the volume for the first speaker for at least a partial section of the video frame section including the first speaker; and
maintaining the volume for the first speaker for the remaining section of the video frame section including the first speaker.

제11항 내지 제18항 중 어느 한 항에 있어서,
음량 자동 조절 기능에 대한 선택에 대응하여, 상기 화자들 각각에 대한 음량을 임계값과 비교하는 동작; 및
상기 임계값을 초과하는 음량을 지정된 음량으로 조절하는 동작을 더 포함하는, 화자별 음량 조절을 위한 방법.
The method of any one of claims 11 to 18, further comprising:
in response to selection of an automatic volume adjustment function, comparing the volume for each of the speakers with a threshold; and
adjusting a volume exceeding the threshold to a specified volume.

명령들을 저장하고 있는 비휘발성 저장 매체에 있어서, 상기 명령들은 전자 장치(101)의 적어도 하나의 프로세서(320)에 의하여 실행될 때에 상기 전자 장치로 하여금 적어도 하나의 동작을 수행하도록 설정된 것으로서, 상기 적어도 하나의 동작은,
컨텐트에 포함된 복수의 음성 신호들을 식별하는 동작;
상기 컨텐트에 포함된 영상 내의 복수의 화자들을 식별하는 동작;
상기 음성 신호들과 상기 화자들을 각각 매칭하는 동작;
상기 전자 장치의 디스플레이 상에 표시된 상기 영상 내의 화자들 중 제1 화자와 관련된 음량 제어 객체에 대한 제1 사용자 입력을 수신하는 동작; 및
상기 제1 사용자 입력에 대응하여, 상기 제1 화자에 대한 음량을 조절하고, 상기 복수의 화자들 중 제2 화자에 대한 음량을 유지하는 동작을 포함하는, 저장 매체.
A non-volatile storage medium storing instructions, wherein the instructions are configured to cause, when executed by at least one processor (320) of an electronic device (101), the electronic device to perform at least one operation, the at least one operation comprising:
identifying a plurality of voice signals included in content;
identifying a plurality of speakers in an image included in the content;
matching the voice signals to the speakers, respectively;
receiving a first user input on a volume control object related to a first speaker among the speakers in the image displayed on a display of the electronic device; and
in response to the first user input, adjusting a volume for the first speaker and maintaining a volume for a second speaker among the plurality of speakers.