JP2016018082A

JP2016018082A - Voice processing device and method, as well as imaging device

Info

Publication number: JP2016018082A
Application number: JP2014140862A
Authority: JP
Inventors: 文裕梶村; Fumihiro Kajimura
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2014-07-08
Filing date: 2014-07-08
Publication date: 2016-02-01

Abstract

PROBLEM TO BE SOLVED: To achieve effective noise removal while balancing computational load and sound quality. SOLUTION: An audio processing device acquires audio signals of a plurality of channels including a first channel and a second channel, and detects a phase difference between the channels. The device generates a first predicted audio signal for replacing the audio signal in a noise-mixed section, using the first-channel audio signal in at least one of the sections before and after the noise-mixed section of the acquired audio signal, and replaces the first-channel audio signal in the noise-mixed section with the first predicted audio signal. The device then generates a second predicted audio signal by correcting the first predicted audio signal on the basis of the phase difference, and replaces the second-channel audio signal in the noise-mixed section with the second predicted audio signal. SELECTED DRAWING: Figure 8

Description

The present invention relates to an audio processing technique.

Some recent digital cameras can shoot not only still images but also moving images with audio recording. During moving image shooting, the drive units of the imaging apparatus operate, for example to drive the focus lens in response to a change in the shooting state or to drive the diaphragm mechanism in response to a change in subject luminance. The operating sound of these drive units is mixed into the recorded audio as noise. Various methods for reducing such drive noise have been disclosed.

For example, Patent Document 1 proposes a technique that predicts noise-free audio from the audio signals before and after a drive noise generation section and replaces the data in that section with the predicted audio data.

Patent Document 1: JP 2008-053802 A

However, an apparatus that has a plurality of microphones and a multi-channel audio recording function faces the following problem. For example, to remove noise from stereo audio having an Lch and an Rch by prediction processing, the prediction must be performed on both the Lch and the Rch audio signals. Performing prediction separately on two or more channels imposes a large computational load, and that load grows in proportion to the number of microphones.

The present invention realizes effective noise removal while balancing computational load and sound quality.

According to one aspect of the present invention, there is provided an audio processing apparatus comprising: acquisition means for acquiring audio signals of a plurality of channels including a first channel and a second channel; phase difference detection means for detecting a phase difference between the audio signal of the first channel and the audio signal of the second channel; first replacement means for generating a first predicted audio signal for replacing the audio signal in a noise-mixed section, using the audio signal of the first channel in at least one of the sections before and after the noise-mixed section of the audio signal, and replacing the audio signal of the first channel in the noise-mixed section with the first predicted audio signal; and second replacement means for generating a second predicted audio signal by correcting the first predicted audio signal with the phase difference, and replacing the audio signal of the second channel in the noise-mixed section with the second predicted audio signal.

According to the present invention, effective noise removal can be performed while balancing computational load and sound quality.

FIG. 1 is an overall view of the imaging apparatus according to the first embodiment.
FIG. 2 is a block diagram of the imaging apparatus according to the first embodiment.
FIG. 3 is a detailed view of the microphone according to the first embodiment.
FIG. 4 is an explanatory diagram of the audio prediction processing method.
FIG. 5 is a schematic diagram of audio signals when the subject sound arrives from the front.
FIG. 6 is a schematic diagram of audio signals when the subject sound arrives from the front.
FIG. 7 is a flowchart of the audio recording operation in the first embodiment.
FIG. 8 is a schematic diagram of audio signals and correlation values showing the prediction processing operation in the first embodiment.
FIG. 9 is a flowchart of the audio recording operation in the second embodiment.

Preferred embodiments of the present invention are described in detail below with reference to the drawings. The present invention is not limited to the following embodiments, which merely show specific examples advantageous for carrying out the invention. Moreover, not all combinations of the features described in the following embodiments are essential to solving the problem addressed by the invention.

<First Embodiment>
FIG. 1 is a block diagram showing the configuration of an imaging apparatus 100 to which the audio processing apparatus of the present invention is applied. The imaging apparatus 100 is, for example, a digital single-lens reflex camera, and has a camera body 101 and a photographic lens 102 that can be attached to and detached from the camera body 101. The photographic lens 102 has, inside a lens barrel 103, an imaging optical system 104 with an optical axis 105. The imaging optical system 104 includes a lens driving unit 106 that drives a focus lens group, a camera shake correction lens unit, and a diaphragm mechanism, and a lens control unit 107 that controls the lens driving unit 106. The photographic lens 102 is electrically connected to the camera body 101 via a lens mount contact 108. A subject optical image entering from the front of the photographic lens 102 passes along the optical axis 105 into the camera body 101, is reflected by a main mirror 110, part of which is formed as a half mirror, and forms an image on a focusing screen 117. The user can view the optical image formed on the focusing screen 117 from an eyepiece window 112 through a pentaprism 111.

Meanwhile, a photometric sensor 116 detects the brightness of the optical image formed on the focusing screen 117. The subject optical image transmitted through the main mirror 110 is reflected by a sub-mirror 113, enters a focus detection unit 114, and is used for the focus detection calculation of the subject image. When a release button (not shown) on the camera body 101 is operated and a shooting start command is issued, the main mirror 110 and the sub-mirror 113 retract from the shooting optical path so that the subject optical image falls on an image sensor 118. The light incident on the focus detection unit 114, the photometric sensor 116, and the image sensor 118 is converted into electrical signals, which are sent to a camera control unit 119 and used to control the camera system.

During moving image shooting, the sound of the subject is input to a microphone 115, sent to the camera control unit 119, and recorded in synchronization with the subject optical image signal from the image sensor 118. The microphone 115 is configured to acquire audio signals of a plurality of channels. For example, as shown in FIG. 3, the microphone 115 forms a two-channel microphone array consisting of an Lch 115a as the first channel and an Rch 115b as the second channel. An accelerometer 120 is installed on the inner surface of the camera body 101 near the microphone 115. When the lens driving unit 106 drives a mechanism such as the focus lens group, the camera shake correction lens unit, or the diaphragm mechanism, the accelerometer 120 detects the vibration that propagates through the photographic lens 102 and the camera body 101. The camera control unit 119 analyzes the vibration detection result and calculates the noise-mixed section.

FIG. 2 is a block diagram illustrating the electrical control of the imaging apparatus 100. The imaging apparatus 100 has an imaging system, an image processing system, an audio processing system, a recording/reproducing system, and a control system. The imaging system includes the photographic lens 102 and the image sensor 118. The image processing system includes an A/D converter 131 and an image processing circuit 132. The audio processing system includes the microphone 115 and an audio signal processing circuit 137. The recording/reproducing system includes a recording processing circuit 133 and a memory 134. The control system includes the camera control unit 119, the focus detection unit 114, the photometric sensor 116, an operation detection unit 135, the lens control unit 107, and the lens driving unit 106. The lens driving unit 106 includes a focus lens driving unit 106a, a blur correction driving unit 106b, and an aperture driving unit 106c.

The imaging system is an optical processing system that forms an image of light from an object on the imaging surface of the image sensor 118 via the imaging optical system 104. During preliminary shooting operations such as aiming, part of the light beam is also guided to the focus detection unit 114 via a mirror provided on the main mirror 110. As described later, the control system adjusts the imaging optical system so that the image sensor 118 is exposed to an appropriate amount of object light and the subject image is formed in the vicinity of the image sensor 118.

The image processing circuit 132 is a signal processing circuit that processes the image signal, with the pixel count of the image sensor, received from the image sensor 118 via the A/D converter 131. Specifically, the image processing circuit 132 has a white balance circuit, a gamma correction circuit, an interpolation circuit that increases resolution by interpolation, and the like.

The audio processing system generates an audio signal for recording by having the audio signal processing circuit 137 apply appropriate processing to the signal from the microphone 115. The generated audio signal is recorded, linked to the image, by the recording processing unit described later.

The accelerometer 120 is connected to the camera control unit 119 via an accelerometer processing circuit 138. The acceleration signal of the vibration of the camera body 101 detected by the accelerometer 120 is amplified, high-pass filtered, and low-pass filtered in the accelerometer processing circuit 138 so that the target frequencies are extracted.

The recording processing circuit 133 outputs image signals to the memory 134, and generates and stores the image to be output to a display unit 136. The recording processing circuit 133 also compresses and records images, moving images, audio, and the like using a predetermined method.

The camera control unit 119 generates and outputs timing signals and the like for imaging. The focus detection unit 114 and the photometric sensor 116 detect the focus state of the imaging apparatus and the luminance of the subject, respectively. The lens control unit 107 adjusts the optical system by driving the lens appropriately in response to signals from the camera control unit 119.

The control system also controls the imaging system, the image processing system, and the recording/reproducing system in response to external operations. For example, when the operation detection unit 135 detects that a shutter release button (not shown) has been pressed, the control system controls the driving of the image sensor 118, the operation of the image processing circuit 132, the compression processing of the recording processing circuit 133, and so on. It also controls the state of each segment of the display unit 136, which displays information on an optical viewfinder, a liquid crystal monitor, or the like.

The adjustment operation of the optical system is described next. The focus detection unit 114 and the photometric sensor 116 are connected to the camera control unit 119, which determines an appropriate focus position and aperture position from their signals. The camera control unit 119 sends these to the lens control unit 107 as commands via the lens mount contact 108, and the lens control unit 107 controls the focus lens driving unit 106a and the aperture driving unit 106c accordingly. A camera shake detection sensor (not shown) is also connected to the lens control unit 107, and in the camera shake correction mode the lens control unit 107 controls the blur correction driving unit 106b based on the signal from that sensor. During moving image shooting, the main mirror 110 and the sub-mirror 113 are retracted from the optical path leading along the optical axis 105 to the image sensor 118, so the subject optical image does not reach the focus detection unit 114 or the photometric sensor 116. The camera control unit 119 therefore adjusts the focus state of the imaging optical system by a contrast-based focus detection method, the so-called hill-climbing method, using the drive amount of the focus lens driving unit 106a and the successive image information captured by the image sensor 118. It also calculates the luminance of the subject image from the image information captured by the image sensor 118 and adjusts the aperture state.

Next, a noise reduction method based on audio prediction is described with reference to FIG. 4. The noise reduction processing described here performs prediction processing that predicts the audio signal in the drive-noise-mixed period from the audio signals before and after that period.

FIG. 4 illustrates the prediction processing performed on a one-channel audio signal acquired from a single microphone. In FIGS. 4(a), 4(b), 4(d), 4(e), 4(f), and 4(g), the horizontal axis represents time and the vertical axis represents signal level. In FIG. 4(c), the vertical axis represents the correlation value and the horizontal axis represents the correlation value calculation position, synchronized with the audio signals of FIGS. 4(a) and 4(b). FIG. 4(a) shows a subject sound signal mixed with noise generated by driving the diaphragm. FIG. 4(b) shows the audio signal of the correlation value reference section used for pitch detection. FIG. 4(c) shows the correlation values obtained from the correlation value reference section and the correlation value calculation section, and is synchronized with the correlation value calculation section of FIG. 4(a). FIG. 4(d) shows the predicted audio signal generated for the noise-mixed section using the detected pitch. FIG. 4(e) shows the signal obtained by applying a triangular window function to the predicted audio signal of FIG. 4(d). In the following, relative to the noise-mixed section, earlier in time is called “front” and later in time is called “rear”.

In the prediction processing, the audio signal processing circuit 137 detects the noise-mixed section shown in FIG. 4(a) using, for example, the accelerometer 120, and discards the signal in that section. The noise-mixed section may instead be detected by analyzing the frequency content of the input audio for the characteristic frequency components of the drive noise. Alternatively, the section can be detected from the timing of drive commands to the lens driving unit; that is, the driving period of the lens driving unit can be treated as the noise-mixed section. Next, the audio signal processing circuit 137 detects the pitch from correlation values of the signal in the section in front of the noise-mixed section. As shown in FIG. 4(a), an audio signal is fairly repetitive over a short time span, so the prediction reproduces the signal by repeating the audio signal of the front section. When correlation values are calculated between the signal of the correlation value reference section and the signal of the correlation value calculation section in FIG. 4(a), the distance (time length) from immediately before the noise-mixed section to the position where the correlation value is maximal is the pitch period of the audio. However, the correlation value is trivially maximal where the correlation value reference section coincides in time with the correlation value calculation section, so the maximum is searched for within the correlation maximum search section shown in FIG. 4(b), which is separated from the noise removal section by the pitch threshold interval. Setting the pitch threshold interval to the reciprocal of the highest fundamental frequency contained in the target audio signal prevents a pitch shorter than the desired repetition pitch from being detected by mistake. For example, the fundamental frequency of Japanese speakers extends up to about 400 Hz, so the pitch threshold interval can be set to 2.5 msec.
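The pitch search described above can be sketched as follows. This is a minimal illustration rather than the circuit's actual implementation: the function name, its parameters, the use of a plain (unnormalized) dot product as the correlation value, and the synthetic test tone are all assumptions introduced here.

```python
import math


def detect_pitch_period(signal, noise_start, ref_len, max_lag,
                        sample_rate, max_f0_hz=400.0):
    """Estimate the pitch period (in samples) just before noise_start.

    All parameter names are hypothetical; they do not come from the patent.
    """
    # Pitch threshold interval: reciprocal of the highest expected
    # fundamental frequency (2.5 ms for 400 Hz), expressed in samples.
    min_lag = int(sample_rate / max_f0_hz)
    # Correlation value reference section: the last ref_len clean samples.
    ref = signal[noise_start - ref_len:noise_start]
    best_lag, best_corr = None, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        start = noise_start - ref_len - lag
        if start < 0:
            break  # not enough clean signal for this lag
        candidate = signal[start:start + ref_len]
        corr = sum(a * b for a, b in zip(ref, candidate))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag


# Synthetic check: an 80 Hz tone sampled at 8 kHz repeats every 100 samples.
tone = [math.sin(2 * math.pi * n / 100) for n in range(1200)]
period = detect_pitch_period(tone, noise_start=1000, ref_len=200,
                             max_lag=150, sample_rate=8000)
```

Starting the search at `min_lag` rather than at zero is what excludes the trivial maximum at zero shift. A normalized correlation would be a reasonable alternative when the signal level changes quickly.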

The audio signal processing circuit 137 generates a first signal by duplicating the first-channel audio signal in the section in front of the noise-mixed section. Similarly, it generates a second signal by duplicating the first-channel audio signal in the section to the rear of the noise-mixed section. It then generates the first predicted audio signal by combining the first and second signals. A specific example follows.

As shown in FIG. 4(d), the audio signal processing circuit 137 generates the first signal by repeatedly duplicating the audio signal of the detected pitch section up to the end of the prediction section (the noise-mixed section). This first signal is hereinafter called the “pre-window prediction signal from the front”. Next, as shown in FIG. 4(e), the circuit completes the forward prediction signal by applying a triangular window function to the pre-window prediction signal from the front. The prediction signal at this stage is hereinafter called the “windowed prediction signal from the front”. When the prediction section contains N+1 data points and the data point immediately after the start of prediction is n = 0, the window function wf(n) is given by wf(n) = (N - n)/N.
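A minimal sketch of this forward prediction, assuming the pitch period is already known in samples. The function name and parameters are hypothetical, and `length` corresponds to the N+1 data points of the prediction section.

```python
def forward_prediction(signal, noise_start, length, pitch):
    """Windowed forward prediction for a noisy section of `length` samples.

    The section is signal[noise_start:noise_start + length], so
    length = N + 1 in the notation of the text. Names are hypothetical.
    """
    if length < 2 or pitch < 1 or noise_start < pitch:
        raise ValueError("need length >= 2 and one clean pitch period in front")
    n_total = length - 1  # N
    # One pitch period of clean signal immediately in front of the section.
    cycle = signal[noise_start - pitch:noise_start]
    # Pre-window prediction: repeat the cycle up to the end of the section.
    pre_window = [cycle[n % pitch] for n in range(length)]
    # Triangular window wf(n) = (N - n) / N: 1 at the start, 0 at the end.
    return [x * (n_total - n) / n_total for n, x in enumerate(pre_window)]


# A constant input makes the window shape directly visible.
faded = forward_prediction([1.0] * 50, noise_start=20, length=11, pitch=5)
```

With the constant input, `faded` falls linearly from 1.0 at the first sample to 0.0 at the last, which is the wf(n) ramp on its own.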

In the same way, the audio signal processing circuit 137 detects the pitch in the section to the rear of the noise-mixed section and generates the second signal by repeatedly duplicating the audio signal of the detected pitch section back to the start of the prediction section. This second signal is called the “pre-window prediction signal from the rear”. Next, as shown in FIG. 4(f), the circuit creates the backward prediction signal (the windowed prediction signal from the rear) by applying to the pre-window prediction signal from the rear a triangular window function oriented opposite to that of FIG. 4(e). This window function wr(n) is symmetric to the one used for the prediction from the front and is given by wr(n) = n/N.

The audio signal processing circuit 137 generates the first predicted audio signal by adding the windowed prediction signal from the front and the windowed prediction signal from the rear. It then replaces the audio signal in the noise-mixed section with the first predicted audio signal, which reduces the noise in that section. FIG. 4(g) shows an example of the resulting waveform. In this way, the first predicted audio signal is generated by combining the first signal and the second signal with a crossfade. The crossfade lets the audio connect smoothly at the junction between the prediction signal from the front and the signal immediately after the noise-mixed section, and at the junction between the prediction signal from the rear and the signal immediately before the noise-mixed section.
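The crossfade and the replacement step can be sketched as below, under the assumption that both pre-window prediction signals are available as lists of equal length. The helper names are hypothetical, and the rearward repetition is one straightforward way to line the repeated cycle up with the rear edge of the section.

```python
def backward_pre_window(signal, noise_start, length, pitch):
    """Repeat the first clean pitch period after the section back to its start."""
    noise_end = noise_start + length  # first clean sample after the section
    cycle = signal[noise_end:noise_end + pitch]
    if len(cycle) < pitch:
        raise ValueError("need one clean pitch period to the rear of the section")
    # (n - length) is negative; Python's % still returns 0..pitch-1, which
    # aligns the repeated cycle with the rear edge of the section.
    return [cycle[(n - length) % pitch] for n in range(length)]


def crossfade_replace(signal, noise_start, pre_fwd, pre_bwd):
    """Return a copy of signal with the noisy section replaced by
    wf(n) * pre_fwd[n] + wr(n) * pre_bwd[n]."""
    length = len(pre_fwd)
    if length < 2 or len(pre_bwd) != length:
        raise ValueError("predictions must have equal length >= 2")
    n_total = length - 1  # N
    out = list(signal)
    for n in range(length):
        w_f = (n_total - n) / n_total  # wf(n) = (N - n) / N
        w_r = n / n_total              # wr(n) = n / N
        out[noise_start + n] = w_f * pre_fwd[n] + w_r * pre_bwd[n]
    return out


ramp = [float(i) for i in range(20)]
bwd = backward_pre_window(ramp, noise_start=5, length=4, pitch=3)
mixed = crossfade_replace([0.0] * 10, 2, [1.0] * 5, [3.0] * 5)
```

Because wf(n) + wr(n) = 1 for every n, the replaced section equals the forward prediction at its first sample and the backward prediction at its last sample, which is what keeps both junctions continuous.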

When two microphones are provided, as in the imaging apparatus of FIG. 1, noise reduction by the prediction processing described above is applied to the two-channel audio signals from the microphones. To reduce the computational load of the prediction processing in this case, one option is to apply the first predicted audio signal obtained from the audio signal of one channel directly as the predicted audio signal of the other channel.

FIG. 5 shows an example in which the first predicted audio signal obtained for the Lch audio signal is used as is as the predicted audio signal for the Rch noise-mixed section. FIG. 5(a) shows the Lch audio input signal and FIG. 5(b) shows the Rch audio input signal; in both, the signal in the noise-mixed section is omitted. FIG. 5(c) shows the Lch first predicted audio signal, and FIG. 5(d) shows the audio signal obtained by applying the first predicted audio signal of FIG. 5(c) as is as the predicted audio signal for the Rch noise-mixed section.

When the sound from the subject arrives from the direction of the median plane of the microphone array, as in FIG. 5, the Lch and Rch audio signals are almost identical and no problem arises. When the sound arrives from any other direction, however, a phase difference occurs between the Lch and the Rch. If the Lch first predicted audio signal is then used as is for the Rch noise-mixed section, the Rch audio signal becomes discontinuous at the boundaries of that section, which can sound unnatural. FIG. 6 shows the audio signals when the sound from the subject arrives from a direction other than the median plane of the microphone array. FIG. 6(a) shows the Lch audio input signal and FIG. 6(b) shows the Rch audio input signal; in both, the signal in the noise-mixed section is omitted. FIG. 6(c) shows the Lch first predicted audio signal, and FIG. 6(d) shows the first predicted audio signal of FIG. 6(c) applied as is as the predicted audio signal for the Rch noise-mixed section. Because there is a phase difference between the Lch and Rch audio input signals, using the Lch first predicted audio signal as is for the Rch makes the audio signal visibly discontinuous, and listening to it can produce a sense of incongruity.
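To make the size of this effect concrete, the sketch below estimates the inter-channel delay and the resulting step at a junction. It is not taken from the patent: it assumes a far-field plane wave in free air, an arrival angle measured from the median plane, a pure tone as the subject sound, and illustrative values for the microphone spacing and the speed of sound.

```python
import math


def interchannel_delay_samples(mic_spacing_m, angle_deg, sample_rate,
                               speed_of_sound=343.0):
    """Far-field delay between two microphones, in samples.

    Uses tau = d * sin(theta) / c with theta measured from the median plane.
    All names and default values are assumptions for illustration.
    """
    tau = mic_spacing_m * math.sin(math.radians(angle_deg)) / speed_of_sound
    return tau * sample_rate


def junction_steps(freq_hz, sample_rate, delay_samples, junction):
    """Compare the step at `junction` when an Lch sample is pasted into the
    Rch against the natural sample-to-sample step of the Rch itself."""
    def lch(n):
        return math.sin(2 * math.pi * freq_hz * n / sample_rate)

    def rch(n):
        return lch(n - delay_samples)  # Rch modelled as a delayed copy of Lch

    pasted = abs(lch(junction) - rch(junction - 1))
    natural = abs(rch(junction) - rch(junction - 1))
    return pasted, natural


delay = interchannel_delay_samples(0.0343, 90.0, 48000)
pasted, natural = junction_steps(1000.0, 48000, delay, 100)
```

With zero delay (sound from the median plane) the two steps are identical. With a delay of a few samples, the pasted step is several times larger than the natural one for this 1 kHz tone, which corresponds to the discontinuity illustrated in FIG. 6(d).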

As described above with reference to FIGS. 5 and 6, this method is effective when the sound from the subject arrives from the front of the microphone array (that is, from the direction of the median plane). When the sound arrives from any other direction, however, a phase difference occurs between the two audio signals. Applying the interpolation signal of the other channel as is therefore makes the audio signal discontinuous, and listening to it produces a sense of incongruity.

The noise reduction processing of this embodiment is described below with reference to FIGS. 7 and 8. FIG. 7 is a flowchart of the noise reduction processing of this embodiment, and FIG. 8 shows the audio signals and correlation values involved in the prediction processing of the Lch and Rch audio input signals.

When the start of recording is instructed and audio recording begins, the audio signal processing circuit 137 determines whether drive noise has been detected (S1001). This determination may be made by detecting that the camera control unit 119 has issued a lens drive command, or by actually detecting the drive noise with the accelerometer 120.

If no drive noise is detected in S1001, the process proceeds to S1007 and then returns to S1001, repeating until a recording stop instruction from the recording switch is detected. If drive noise is detected in S1001, the audio signal processing circuit 137 analyzes the output signal of the accelerometer 120 and calculates the noise-mixed section (S1002). The audio signal in the noise-mixed section is discarded.
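The text does not say how the accelerometer output is analyzed in S1002. One simple possibility, shown purely as an assumption, is to threshold the magnitude of the filtered acceleration signal and take the span from the first to the last sample above the threshold; the function name, the threshold, and the optional margin are all hypothetical.

```python
def noise_section(accel, threshold, margin=0):
    """Return (start, end) of the noisy section as a half-open sample range,
    or None when the acceleration never exceeds the threshold.

    Hypothetical sketch: the analysis method is not specified in the source.
    """
    hits = [i for i, a in enumerate(accel) if abs(a) > threshold]
    if not hits:
        return None
    # Optionally widen the section to cover onset and decay of the vibration.
    start = max(0, hits[0] - margin)
    end = min(len(accel), hits[-1] + 1 + margin)
    return start, end


accel = [0.0, 0.0, 0.1, 2.0, -3.0, 1.5, 0.05, 0.0]
section = noise_section(accel, threshold=1.0)
widened = noise_section(accel, threshold=1.0, margin=1)
```

A practical detector would probably smooth the signal or use an envelope first, so that zero crossings inside a burst of vibration do not matter when only short windows are inspected.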

FIG. 8A shows the Lch audio input signal and FIG. 8B shows the Rch audio input signal, in both cases after the signal in the noise-mixed section has been discarded. The horizontal axis represents time and the vertical axis represents the sound pressure level of the audio signal.

Once the noise-mixed section has been calculated in S1002, the audio signal processing circuit 137 calculates a first predicted audio signal for the noise-mixed section of the Lch audio input signal, as described with reference to FIG. 5 (S1003). As shown in FIG. 8C, the first predicted audio signal is calculated to be longer than the noise-mixed section by the surplus prediction sections: a section of predetermined length (a surplus prediction section) is added before and after the noise-mixed section for use in the Rch phase correction described later. The surplus prediction sections may be filled by continuing the calculated repetition pitch before and after the normal prediction section for the length of each surplus prediction section, or by using the actual Lch audio signal at the positions corresponding to the surplus prediction sections as it is.
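The second option for the surplus prediction sections (reusing the actual Lch samples on either side of the noise-mixed section) can be illustrated with a minimal sketch. The function name `extend_prediction` and its parameters are hypothetical, signals are assumed to be plain Python lists of samples, and the prediction for the normal prediction section is assumed to have been computed already by some other routine.

```python
def extend_prediction(lch, core_pred, noise_start, margin):
    """Attach surplus prediction sections to an Lch prediction (illustrative only).

    lch         : Lch samples (list); the noise-mixed section may hold any values
    core_pred   : predicted samples for the noise-mixed section (normal prediction section)
    noise_start : index of the first sample of the noise-mixed section
    margin      : length, in samples, of each surplus prediction section (assumed name)
    """
    noise_end = noise_start + len(core_pred)
    if noise_start - margin < 0 or noise_end + margin > len(lch):
        raise ValueError("not enough clean signal around the noise-mixed section")
    before = list(lch[noise_start - margin:noise_start])  # real samples just before
    after = list(lch[noise_end:noise_end + margin])       # real samples just after
    return before + list(core_pred) + after
```

The result has length `len(core_pred) + 2 * margin`, which leaves room to shift the prediction by up to `margin` samples in either direction when it is reused for Rch.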

Next, in S1004, the first predicted audio signal calculated in S1003 is written as the signal of the Lch noise-mixed section. The signal in the noise-mixed section is thus replaced with a signal predicted from the audio in the sections before and after it (first replacement process). At this point the surplus prediction sections are omitted and only the audio signal of the normal prediction section is written. This completes the noise reduction processing for the Lch audio signal.

Next, in S1005, the phase difference between the Lch and Rch audio signals is detected. The detection method is described with reference to FIG. 8. FIG. 8D shows the correlation values computed between the Lch correlation reference section shown in FIG. 8A and the Rch correlation reference section shown in FIG. 8B. Because FIG. 8D is drawn in synchronization with the audio signals of FIGS. 8A and 8B, the calculated correlation center position coincides with the right end of the correlation reference section. The correlation reference section is set to a predetermined section before or after the noise-mixed section. When there is no phase difference between the Lch and Rch audio signals, the correlation value peaks at the calculated correlation center position. When there is a phase difference, the distance from the calculated correlation center position to the position of the maximum correlation value is detected as the phase difference between Lch and Rch.
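A correlation-based lag search of this kind might look like the following sketch. It is not the circuit's actual implementation: the function name, the use of a normalized correlation score, and the sign convention (a positive lag means Rch is delayed, so that `rch[n + lag]` matches `lch[n]`) are all assumptions made for illustration.

```python
import math


def detect_lag(lch, rch, ref_start, ref_len, max_lag):
    """Return the lag, in samples, at which rch best matches the Lch reference window.

    The Lch correlation reference section is lch[ref_start:ref_start + ref_len].
    Candidate Rch windows are offset by -max_lag..+max_lag samples.
    A positive result means rch is delayed relative to lch.
    """
    ref = lch[ref_start:ref_start + ref_len]
    ref_energy = sum(v * v for v in ref)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        start = ref_start + lag
        if start < 0 or start + ref_len > len(rch):
            continue  # candidate window would fall outside the recording
        cand = rch[start:start + ref_len]
        cand_energy = sum(v * v for v in cand)
        if ref_energy == 0.0 or cand_energy == 0.0:
            continue  # correlation is undefined for a silent window
        # Normalized correlation (assumption): 1.0 only for a perfectly matching shape.
        score = sum(a * b for a, b in zip(ref, cand)) / math.sqrt(ref_energy * cand_energy)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

Normalizing the score keeps the comparison between candidate windows fair when their energies differ; a plain sum of products, as the text's "correlation value" may equally mean, would also work on well-behaved signals.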

Next, in S1006, a phase-corrected prediction signal (second predicted audio signal) is generated by shifting the first predicted audio signal calculated in S1003 by the number of samples corresponding to the phase difference detected in S1005, and is written as the signal of the Rch noise-mixed section. The Rch signal in the noise-mixed section is thus replaced with the second predicted audio signal obtained by phase-correcting the Lch first predicted audio signal (second replacement process). The portions of the prediction signal that fall outside the noise-mixed section are discarded before writing. FIGS. 8E and 8F illustrate how the second predicted audio signal is fitted into the noise-mixed section of the Rch audio signal. The second predicted audio signal in FIG. 8E is the first predicted audio signal of FIG. 8C shifted by the number of samples corresponding to the phase difference detected in S1005. The excess at both ends of the second predicted audio signal is discarded and the remainder is written as the audio signal of the Rch noise-mixed section, yielding the Rch audio signal shown in FIG. 8F.
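The shift-and-trim step of S1006 can be sketched as follows. The names are hypothetical, and the sketch assumes that the first predicted audio signal covers the noise-mixed section plus `margin` surplus samples on each side, and that `lag` follows the convention `rch[n + lag] ≈ lch[n]` (positive when Rch is delayed).

```python
def fill_second_channel(rch, first_pred, noise_start, noise_len, margin, lag):
    """Write a lag-shifted copy of the Lch prediction into the Rch noise-mixed section.

    first_pred[k] is assumed to correspond to Lch time (noise_start - margin + k).
    Returns a new list; the input rch is left unmodified.
    """
    if len(first_pred) != noise_len + 2 * margin:
        raise ValueError("first_pred must cover the noise section plus both margins")
    if abs(lag) > margin:
        raise ValueError("lag exceeds the surplus prediction margin")
    # Rch at time n equals Lch at time n - lag, so reading starts 'lag' samples earlier.
    offset = margin - lag
    out = list(rch)
    # Only noise_len samples are kept; the excess at both ends is discarded.
    out[noise_start:noise_start + noise_len] = first_pred[offset:offset + noise_len]
    return out
```

With `lag = 0` the slice starts exactly at `margin`, so the function degenerates to copying the normal prediction section unchanged.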

The process returns to S1001 and repeats until the operation detection unit 135 detects in S1007 that the recording switch has been turned OFF. When this is detected, the recording operation ends.

As described above, in this embodiment the first predicted audio signal for the noise-mixed section is calculated for Lch only. The phase difference between the Lch and Rch audio signals is then detected, the first predicted audio signal is corrected by that phase difference to generate the second predicted audio signal, and the second predicted audio signal is used as the predicted audio signal for the Rch noise-mixed section. This achieves high-quality noise reduction that reduces the computational load without impairing the stereo impression.

The surplus prediction sections in FIG. 8C and the Lch and Rch correlation reference sections in FIGS. 8A and 8B may be set to a length corresponding to the largest phase difference expected. The phase difference between the Lch and Rch audio signals is largest when the subject sound arrives along the extension of the line connecting the two microphones. Therefore, if the distance between the two microphones is Lm (mm) and the speed of sound in air is taken as 340000 (mm/sec), the surplus prediction sections and correlation reference sections need only be Lm/340000 (sec) long.
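To make the Lm/340000 (sec) bound concrete, the sketch below converts it to a whole number of samples. The microphone spacing of 20 mm and the 48 kHz sampling rate in the usage note are illustrative assumptions, not values taken from this document, and the function name is hypothetical.

```python
import math

SPEED_OF_SOUND_MM_PER_S = 340_000  # value used in the text


def max_lag_samples(mic_spacing_mm, sample_rate_hz):
    """Upper bound on the inter-channel delay, rounded up to whole samples."""
    max_delay_s = mic_spacing_mm / SPEED_OF_SOUND_MM_PER_S
    return math.ceil(max_delay_s * sample_rate_hz)
```

For example, `max_lag_samples(20, 48000)` gives 3: the delay is 20/340000 ≈ 58.8 µs, or about 2.82 sample periods at 48 kHz, which rounds up to 3 samples of surplus on each side.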

In the embodiment described above, the audio signals of the normal prediction section and the surplus prediction sections are obtained from the Lch audio signal. However, if the phase difference between Lch and Rch is detected first, the required amount of phase correction is known, so the surplus prediction signal needs to be calculated only for the portion actually required for the correction. Depending on the sequence of the audio signal processing circuit, this order can reduce the computational load further.

Although the embodiment described above uses a two-channel microphone array, a microphone array with three or more channels may also be used. In that case, the prediction signal calculation of S1003 is performed only for the audio signal of the first channel, and for each of the other channels the phase difference between its audio signal and that of the first channel is detected. The prediction signal of the first channel is then corrected by the respective phase difference and used for each channel. The more channels there are, the greater the reduction in computational load compared with calculating a prediction signal for every channel.

In the embodiment described above, the prediction signal calculation detects the pitch by taking correlations before and after the noise-mixed section and calculates the prediction signal for the noise-mixed section from it, but the calculation is not limited to this method. For example, linear prediction coefficients may be obtained by applying linear prediction to the audio signal in fixed sections immediately before and immediately after the noise-mixed section. The audio signal in the noise-mixed section can then be predicted sample by sample by multiplying the audio signal before and after the noise-mixed section by the obtained linear prediction coefficients. With linear prediction, the computational load of the prediction processing can become very high depending on the order of the linear prediction coefficients. It is therefore effective, as described above, to reduce the amount of computation by performing the prediction processing for the first channel only and using the corrected prediction result for the noise-mixed sections of the other channels.
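As a rough illustration of the linear prediction alternative, the sketch below fits prediction coefficients to the samples preceding a gap and extrapolates forward one sample at a time. It is only one possible realization: the least-squares (covariance-method) fit, the forward-only direction, and all function names are assumptions. The text also mentions using the section after the noise-mixed section, which would need a mirrored backward pass that is not shown here.

```python
import math


def lpc_coefficients(x, order):
    """Least-squares fit of x[n] ~ sum_k a[k] * x[n - 1 - k] over the given samples."""
    # Accumulate the normal equations R a = r.
    R = [[0.0] * order for _ in range(order)]
    r = [0.0] * order
    for n in range(order, len(x)):
        past = [x[n - 1 - k] for k in range(order)]
        for i in range(order):
            r[i] += past[i] * x[n]
            for j in range(order):
                R[i][j] += past[i] * past[j]
    # Solve by Gaussian elimination with partial pivoting.
    for col in range(order):
        pivot = max(range(col, order), key=lambda row: abs(R[row][col]))
        if R[pivot][col] == 0.0:
            raise ValueError("singular system; try a lower order")
        R[col], R[pivot] = R[pivot], R[col]
        r[col], r[pivot] = r[pivot], r[col]
        for row in range(col + 1, order):
            f = R[row][col] / R[col][col]
            for j in range(col, order):
                R[row][j] -= f * R[col][j]
            r[row] -= f * r[col]
    a = [0.0] * order
    for i in range(order - 1, -1, -1):
        a[i] = (r[i] - sum(R[i][j] * a[j] for j in range(i + 1, order))) / R[i][i]
    return a


def extrapolate(history, coeffs, count):
    """Predict 'count' further samples, feeding each prediction back as input."""
    buf = list(history)
    for _ in range(count):
        buf.append(sum(c * buf[-1 - k] for k, c in enumerate(coeffs)))
    return buf[len(history):]
```

Building the normal equations costs on the order of `order**2` multiplications per input sample, which is consistent with the text's remark that the load grows with the prediction order and is worth paying for one channel only.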

In this embodiment, the predicted audio signal is generated from the audio signals both before and after the noise-mixed section, but it may instead be generated from at least one of the section before and the section after the noise-mixed section.

The embodiment described above addresses the reduction of noise caused by driving the lens. However, other noise, such as operation noise produced when an operation member is operated or touch noise produced when the operator touches the camera body, can also be reduced, provided that the accelerometer 120 can sense it and the noise-mixed section can be detected.

Although the embodiment described above concerns noise reduction processing in an imaging apparatus, the invention is also applicable to other audio processing apparatuses that reduce noise in an acquired audio signal.

<Second Embodiment>
In the first embodiment, the predicted audio signal for the noise-mixed section is calculated for Lch only, the phase difference between Lch and Rch is detected, and the Lch prediction signal is phase-corrected and used for Rch. When the phase difference between Lch and Rch is sufficiently small, however, it has no audible effect. In the second embodiment, therefore, when the phase difference between Lch and Rch is smaller than a threshold, the prediction signal from Lch is used unchanged as the prediction signal for the Rch noise-mixed section, without phase correction. This is advantageous in computational load compared with always performing the phase correction.

The apparatus configuration is the same as in the first embodiment, so its description is omitted. FIG. 9 is a flowchart of the second embodiment. The flowchart starts when the moving image shooting switch is turned ON and the recording operation begins.

The operations in S2001 and S2002 are the same as those in S1001 and S1002 of the first embodiment, respectively, so their description is omitted.
In S2003, the phase difference between the Lch and Rch audio signals is detected in the same way as in S1005. Next, in S2004, it is determined whether the phase difference detected in S2003 is equal to or greater than a threshold Tfs. If it is, the process proceeds to S2005. In S2005 to S2007, as in S1003, S1004, and S1006, a predicted audio signal including the surplus prediction sections is calculated from the Lch audio signal, corrected by the detected phase difference, and used as the predicted audio signal for the Rch noise-mixed section.

If the detected phase difference is less than the threshold, the process proceeds to S2009, where the Lch predicted audio signal is calculated. Here only the normal prediction section is calculated, without the surplus prediction sections. Next, in S2010, the predicted audio signal calculated in S2009 is written, without phase correction, as the predicted audio signal for the noise-mixed sections of both Lch and Rch.
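The branch taken at S2004 can be summarized in a few lines. This is a sketch under assumptions: the phase difference is taken to be a signed lag in samples whose magnitude is compared with the threshold, and `plan_fill`, `threshold`, and `max_lag` are hypothetical names.

```python
def plan_fill(lag, threshold, max_lag):
    """Decide how the Rch noise-mixed section is filled (second-embodiment branch).

    Returns (margin, shift):
      margin - surplus samples to predict on each side of the noise-mixed section
      shift  - samples by which the Lch prediction is shifted before writing to Rch
    """
    if abs(lag) >= threshold:
        # S2005-S2007: predict with surplus sections, then phase-correct for Rch.
        return max_lag, lag
    # S2009-S2010: predict the normal section only and reuse it unchanged for Rch.
    return 0, 0
```

Returning a margin of 0 in the small-lag branch reflects the saving described in the text: neither the surplus prediction nor the shift is computed.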

The process then returns to S2001 and repeats until the operation detection unit 135 detects in S2008 that the recording switch has been turned OFF. When this is detected, the recording operation ends.

According to the second embodiment described above, when the phase difference between the Lch and Rch audio signals is less than the threshold, the Lch predicted audio signal is used unchanged, without phase correction, as the predicted audio signal for the Rch noise-mixed section. When the phase difference between the two audio signals is sufficiently small, using the predicted audio signal from Lch without phase correction has little audible effect, so the computational load of the phase correction can be eliminated.

(Other Embodiments)
The present invention can also be realized by processing in which a program that implements one or more functions of the embodiments described above is supplied to a system or apparatus via a network or a storage medium, and one or more processors in a computer of the system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

Description of symbols: 101: imaging apparatus, 102: camera body, 103: photographing lens, 104: imaging optical system, 106: lens drive unit, 107: lens control unit, 115: microphone, 119: camera control unit

Claims

An audio processing apparatus comprising:
acquisition means for acquiring audio signals of a plurality of channels including a first channel and a second channel;
phase difference detection means for detecting a phase difference between the audio signal of the first channel and the audio signal of the second channel;
first replacement means for generating, using the audio signal of the first channel in at least one of the sections before and after a noise-mixed section in the audio signals, a first predicted audio signal for replacing the audio signal in the noise-mixed section, and replacing the audio signal of the first channel in the noise-mixed section with the first predicted audio signal; and
second replacement means for generating a second predicted audio signal by correcting the first predicted audio signal by the phase difference, and replacing the audio signal of the second channel in the noise-mixed section with the second predicted audio signal.

The audio processing apparatus according to claim 1, wherein, when the phase difference is less than a threshold, the second replacement means generates the second predicted audio signal without correcting the first predicted audio signal.

The audio processing apparatus according to claim 1 or 2, wherein the first replacement means generates the first prediction signal for a span obtained by adding, before and after the noise-mixed section, a section of a length corresponding to the maximum phase difference between the audio signal of the first channel and the audio signal of the second channel.

The audio processing apparatus according to any one of claims 1 to 3, wherein the first replacement means includes:
means for generating a first signal by duplicating the audio signal of the first channel in a section preceding the noise-mixed section;
means for generating a second signal by duplicating the audio signal of the first channel in a section following the noise-mixed section; and
means for generating the first predicted audio signal by combining the first signal and the second signal.

The audio processing apparatus according to claim 4, wherein
the means for generating the first signal detects the pitch of the audio signal of the first channel in the preceding section and generates the first signal by repeatedly duplicating the detected pitch section, and
the means for generating the second signal detects the pitch of the audio signal of the first channel in the following section and generates the second signal by repeatedly duplicating the detected pitch section.

The audio processing apparatus according to any one of claims 1 to 5, wherein the phase difference detection means detects the phase difference based on a correlation value between the audio signal of the first channel and the audio signal of the second channel in a predetermined section set before or after the noise-mixed section.

An imaging apparatus comprising the audio processing apparatus according to any one of claims 1 to 6.

The imaging apparatus according to claim 7, wherein the noise-mixed section is a driving period of a lens drive unit of the imaging apparatus.

An audio processing method comprising:
an acquisition step of acquiring audio signals of a plurality of channels including a first channel and a second channel;
a phase difference detection step of detecting a phase difference between the audio signal of the first channel and the audio signal of the second channel;
a first replacement step of generating, using the audio signal of the first channel in at least one of the sections before and after a noise-mixed section in the audio signals, a first predicted audio signal for replacing the audio signal in the noise-mixed section, and replacing the audio signal of the first channel in the noise-mixed section with the first predicted audio signal; and
a second replacement step of generating a second predicted audio signal by correcting the first predicted audio signal by the phase difference, and replacing the audio signal of the second channel in the noise-mixed section with the second predicted audio signal.

The audio processing method according to claim 9, wherein, when the phase difference is less than a threshold, the second replacement step generates the second predicted audio signal without correcting the first predicted audio signal.

The audio processing method according to claim 9 or 10, wherein the first replacement step generates the first prediction signal for a span obtained by adding, before and after the noise-mixed section, a section of a length corresponding to the maximum phase difference between the audio signal of the first channel and the audio signal of the second channel.

The audio processing method according to any one of claims 9 to 11, wherein the first replacement step includes:
a step of generating a first signal by duplicating the audio signal of the first channel in a section preceding the noise-mixed section;
a step of generating a second signal by duplicating the audio signal of the first channel in a section following the noise-mixed section; and
a step of generating the first predicted audio signal by combining the first signal and the second signal.

The audio processing method according to claim 12, wherein
the step of generating the first signal detects the pitch of the audio signal of the first channel in the preceding section and generates the first signal by repeatedly duplicating the detected pitch section, and
the step of generating the second signal detects the pitch of the audio signal of the first channel in the following section and generates the second signal by repeatedly duplicating the detected pitch section.

The audio processing method according to any one of claims 9 to 13, wherein the phase difference detection step detects the phase difference based on a correlation value between the audio signal of the first channel and the audio signal of the second channel in a predetermined section set before or after the noise-mixed section.

A program for causing a computer to function as each of the means of the audio processing apparatus according to any one of claims 1 to 6.