JP2012165219A

JP2012165219A - Imaging apparatus

Info

Publication number: JP2012165219A
Application number: JP2011024535A
Authority: JP
Inventors: Koichi Washisu; 晃一鷲巣
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2011-02-08
Filing date: 2011-02-08
Publication date: 2012-08-30

Abstract

PROBLEM TO BE SOLVED: To reduce the focus sound that is mixed into the recorded audio when a still image is shot during moving image shooting. SOLUTION: When a still image is shot during moving image shooting, a camera CPU (16) causes a lens drive part (18) to drive the focusing lens of an imaging optical system (10) into focus according to the output of a focus detection part (14). When focus is achieved, a focus sound generation part (40) drives a speaker (42) to output a focus sound. During moving image shooting, a microphone (22) picks up the ambient sound together with the focus sound output by the speaker (42). A sound processing part (28) removes the focus sound superimposed section from the sound signal supplied by a gain adjustment part (24), fills that section with a signal predicted from the sections before and after it, and mixes in a pseudo focus sound signal from a pseudo focus sound generation part (30).

Description

The present invention relates to an imaging apparatus capable of shooting still images while recording a moving image with sound.

When a still image is shot with a camera, focus adjustment and exposure adjustment are normally performed when the photographer half-presses the release button, and the subject image is captured and recorded on a full press. Some cameras emit a focus sound when the taking lens reaches focus. In moving image shooting, by contrast, no focus sound is emitted, because focus adjustment runs continuously and the sound would be noise in the ambient audio captured at the same time.

In recent years, cameras that can shoot still images during moving image shooting have been commercialized. Half-pressing the release button for still image shooting during moving image shooting may start a focusing operation for the still image. This is because still image shooting may, for example, judge focus in one or more areas within the angle of view, or focus on a person (or the person's face) within the angle of view, so its focus control can differ from that used during moving image shooting.

When focus is adjusted for still image shooting, a means of notifying the photographer that the adjustment is complete is needed even during moving image shooting. If the photographer is notified by sound, however, that sound is mixed as noise into the audio being recorded with the moving image. The drive or movement sound of the focusing lens during focus adjustment is likewise mixed into the recorded audio as noise.

The microphone for capturing ambient sound is usually placed on the front of the imaging apparatus, near the taking lens. The microphone therefore readily picks up the operating sound of the taking lens during focus adjustment, as well as the operating sound of the diaphragm during exposure adjustment. Because it is close to the noise source, it not only captures the noise at a relatively high level but also picks up the vibration that accompanies the focus drive sound and the reverberation that the focus sound produces inside the camera.

Patent Document 1 describes erasing the shutter sound during moving image shooting. It provides a mode that erases the shutter sound and a mode that does not, and the user can select either one.

Patent Document 1: JP 2006-311412 A

When a recorded moving image is played back, it is convenient to know at what point a still image was shot. With the prior art, the only cue is the focus notification sound or focusing operation sound mixed into the audio. Such a sound is also noise that degrades the original audio, and it is neither clear nor easy to hear.

An object of the present invention is to provide an imaging apparatus that can indicate to the user the timing of still image shooting performed during moving image shooting, using a clear sound that does not impair the quality of the original audio.

An imaging apparatus according to the present invention shoots a still image during moving image shooting and comprises: an imaging optical system; imaging means for converting an optical image formed by the imaging optical system into an image signal; focus sound generating means for generating a focus sound when the imaging optical system reaches focus for shooting the still image; audio input means for capturing ambient sound including the focus sound; pseudo focus sound generating means for generating a pseudo focus sound signal; and audio processing means for removing the focus sound mixed into the input audio signal of the audio input means and synthesizing the pseudo focus sound signal.

According to the present invention, the focus sound or lens drive sound mixed into the subject audio is deleted and a high-quality pseudo sound is synthesized in its place, so that focusing or lens driving is represented by a high-quality sound on playback.

A schematic block diagram of one embodiment of the present invention.
A schematic block diagram of a circuit that generates a pseudo focus sound from the focus sound and stores it.
A timing chart illustrating how the audio processing unit deletes and interpolates the audio of a specific section.
A timing chart illustrating the focus sound reduction processing performed by the audio processing unit.
A timing chart illustrating the change in gain applied to the input audio signal.
A timing chart illustrating the predicted audio signal.
A timing chart illustrating pseudo focus sound synthesis.
A flowchart of subject audio capture in this embodiment.
An example of a playback screen.
A flowchart of the processing that deletes the focus sound mixed into the input audio and synthesizes the pseudo focus sound.
A timing chart illustrating another process that deletes the focus sound mixed into the input audio and synthesizes the pseudo focus sound.
A timing chart illustrating reduction of the lens drive sound.
A schematic block diagram of a circuit that reduces the lens drive sound.
Example spectral waveforms of a subject audio signal on which the lens drive sound is superimposed and of the pseudo lens drive sound.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is a schematic block diagram of a digital single-lens reflex camera that is one embodiment of the imaging apparatus according to the present invention.

The basic configuration and operation of the embodiment shown in FIG. 1 are as follows. The imaging optical system 10 forms an optical image of the subject on the image sensor 12. The focus detection unit 14 detects the degree of focus of the subject light from the imaging optical system 10 and supplies the result to the camera CPU 16. According to this output, the camera CPU 16 causes the lens driving unit 18 to drive the focusing lens of the imaging optical system 10 toward the in-focus position. This feedback control brings the imaging optical system 10 to the position at which the subject is in focus. The lens driving unit 18 also drives the zoom lens and the diaphragm of the imaging optical system 10 in accordance with instructions from the camera CPU 16.

The image sensor 12 converts the optical image formed by the imaging optical system 10 into an image signal and supplies it to the image processing unit 20. The image processing unit 20 comprises a moving image processing unit 20M, which processes the output image signal of the image sensor 12 as a moving image, and a still image processing unit 20S, which processes one frame of that signal as a still image. For example, the moving image processing unit 20M compression-encodes the output image signal in the MPEG or Motion JPEG format, and the still image processing unit 20S compression-encodes it in the JPEG format.

The microphone 22, which serves as the audio input means, picks up ambient sound and outputs an audio signal to the gain adjustment unit 24. The microphone 22 is elastically mounted on the back side of the camera housing with rubber or the like (not shown). The gain adjustment unit 24 adjusts the gain applied to the audio signal from the microphone 22. The gain adjustment control unit 26 controls the gain of the gain adjustment unit 24 in accordance with instructions from the camera CPU 16. The pseudo focus sound generation unit 30 generates a predetermined pseudo focus sound signal in accordance with instructions from the camera CPU 16. The audio processing unit 28 processes the audio signal from the gain adjustment unit 24, mixes in the pseudo focus sound signal from the pseudo focus sound generation unit 30, and encodes the result for recording.

The recording unit 32 receives encoded image signals (a compressed moving image signal and a compressed still image signal) from the image processing unit 20 and an encoded audio signal from the audio processing unit 28. It also receives from the camera CPU 16 the shooting conditions, such as the pixel count of the captured image, the shutter speed, the aperture value, and the frame rate, together with a signal indicating the focus timing for still image shooting. The recording unit 32 records this information from the image processing unit 20, the audio processing unit 28, and the camera CPU 16 on a recording medium (not shown), such as a semiconductor memory, a magnetic disk, or an optical disk.

The start/stop switch 34 is used to instruct the camera CPU 16 to start and stop moving image shooting. The release button 36 is used in still image shooting to instruct the camera CPU 16 to control focus and exposure on a half press and to execute shooting on a full press. With the moving image/still image mode switch 38, the user can instruct the camera CPU 16 to switch between the moving image mode and the still image mode. The two modes are described in detail later.

When the imaging optical system 10 reaches focus on the subject during still image shooting, the camera CPU 16 instructs the focus sound generation unit 40 to generate a focus sound signal. The focus sound generation unit 40 generates the signal as instructed and supplies it to the speaker 42, which outputs it as sound. The camera CPU 16 also has the focus sound generation unit 40 generate a focus sound signal for still image shooting during moving image shooting, but this focus sound differs from the one used when only a still image is shot. For example, the focus sound for standalone still image shooting is a double beep ("pi-pip!"), while the one for still image shooting during moving image shooting is a shorter sound, such as a single beep ("pip!"). As described in detail later, the focus sound for still image shooting during moving image shooting is deleted from the output audio of the microphone 22, and a shorter deletion period is preferable.

The camera CPU 16 supplies the audio processing unit 28 with control signals that govern its operation. For example, it supplies a focus timing gate signal indicating a short period that includes the moment the imaging optical system 10 reaches focus. This gate signal indicates, for example, the period during which the focus sound is superimposed on the output audio signal of the microphone 22 (in practice, of the gain adjustment unit 24), or a period containing it (hereinafter, the focus-sound superimposed section). Alternatively, the audio processing unit 28 may detect the focus-sound superimposed section autonomously from the output audio signal of the gain adjustment unit 24, without relying on the gate signal. In that case, no signal indicating the focus timing, such as the focus timing gate signal from the camera CPU 16, is needed.

The audio processing unit 28 deletes the focus sound from the speaker 42 that is mixed into the audio signal input through the microphone 22, and inserts a pseudo focus sound, as follows. It deletes the audio signal in the focus-sound superimposed section from the output audio signal of the microphone 22 (in practice, of the gain adjustment unit 24) and fills the section with a signal predicted from the audio before and after it. Because the fill signal is predicted from the surrounding audio, the focus-sound superimposed section is bridged more naturally.

In synchronization with the focusing of the imaging optical system 10, the camera CPU 16 also has the pseudo focus sound generation unit 30 generate a pseudo focus sound consisting of a predetermined tone. The audio processing unit 28 mixes this pseudo focus sound into the output audio signal of the gain adjustment unit 24 after the audio signal in the focus-sound superimposed section has been deleted. The focus sound output from the speaker 42 does not sound clean when picked up by the microphone 22. By first deleting the focus sound picked up by the microphone 22 and then superimposing an electrically generated pseudo focus sound, as in this embodiment, a high-quality focus sound can be recorded.

The pseudo focus sound signal generated by the pseudo focus sound generation unit 30 is written into that unit using, for example, the configuration shown in FIG. 2. The focus sound output from the speaker 42 is captured by the microphone 22 or by another microphone, and the filter 46 removes the unwanted frequency components. The output audio signal of the filter 46 is written into the pseudo focus sound generation unit 30 as the pseudo focus sound signal.

Instead of the pseudo focus sound signal output by the pseudo focus sound generation unit 30, the focus sound signal generated by the focus sound generation unit 40, or a filtered version of it, may be supplied to the audio processing unit 28 at the appropriate timing.

The processing that the audio processing unit 28 applies to the focus-sound superimposed section is described in detail below. First, the audio signal in the focus-sound superimposed section is discarded. Next, an audio prediction and interpolation unit included in the audio processing unit 28 takes the sections immediately before and after the focus-sound superimposed section as learning sections and predicts from them the audio signal that should occupy the focus-sound superimposed section. The unit then places the predicted audio signal in the focus-sound superimposed section.

The operation of the audio prediction and interpolation unit is described using, as an example, the derivation of linear prediction coefficients (the learning operation) and the prediction of the signal from those coefficients (the prediction operation).

In using linear prediction, the following linear combination is assumed between the current sample and a finite number (here, $p$) of adjacent sample values:

$$x_t + \alpha_1 x_{t-1} + \cdots + \alpha_p x_{t-p} = \varepsilon_t \tag{1}$$

In equation (1), $\varepsilon_t$ is a random variable with mean 0 and variance $\sigma^2$, uncorrelated from sample to sample.

Rearranging so that $x_t$ is predicted from past values gives

$$x_t = -\alpha_1 x_{t-1} - \cdots - \alpha_p x_{t-p} + \varepsilon_t \tag{2}$$

According to equation (2), if $\varepsilon_t$ is sufficiently small, the current value is expressed as a linear sum of its $p$ neighboring samples. Once $x_t$ has been obtained by this prediction, and provided the approximation is good enough, $x_{t+1}$ can likewise be obtained as a linear sum of its $p$ neighbors.

Thus, if $\varepsilon_t$ can be made sufficiently small, the signal can be obtained by predicting values one after another. We therefore consider finding the $\alpha_i$ that minimize $\varepsilon_t$. In this embodiment, the operation of finding the $\alpha_i$ that minimize $\varepsilon_t$ is called the learning operation.

It suffices to minimize $\sum \varepsilon_t^2$ over the learning section described above. With learning start time $t_0$ and end time $t_1$,

$$\sum_{t=t_0}^{t_1} \varepsilon_t^2 = \sum_{t=t_0}^{t_1} \Bigl( \sum_{i=0}^{p} \alpha_i x_{t-i} \Bigr)^2 = \sum_{t=t_0}^{t_1} \sum_{i=0}^{p} \sum_{j=0}^{p} \alpha_i \alpha_j x_{t-i} x_{t-j} \tag{3}$$

where $\alpha_0 = 1$. To simplify equation (3), let

$$c_{ij} = \sum_{t=t_0}^{t_1} x_{t-i} x_{t-j} \tag{4}$$

To determine the $\alpha_i$ that minimize equation (3), set the partial derivative of equation (3) with respect to each $\alpha_j$ ($j = 1, 2, \ldots, p$) to zero and solve. This gives

$$\frac{\partial}{\partial \alpha_j} \sum_{t=t_0}^{t_1} \varepsilon_t^2 = 2 \sum_{i=0}^{p} \alpha_i c_{ij} = 0 \qquad (j = 1, 2, \ldots, p) \tag{5}$$

Equation (5) shows that the $\alpha_i$ can be determined by solving $p$ simultaneous linear equations. The $c_{ij}$ in equation (5) can be computed from $x_{t-i}$ ($i = 1, 2, \ldots, p$). That is, the $\alpha_i$ can be obtained from equation (5).

When the $\alpha_i$ are determined according to equation (5), $\sum \varepsilon_t^2$ is minimized. From equation (2), the value of $x_t$ can then be approximated by

$$x_t \approx -\sum_{i=1}^{p} \alpha_i x_{t-i} \tag{6}$$

If this approximation is good enough, the right-hand side of equation (6) can be used as the prediction signal in place of $x_t$.

Similarly, an approximation of $x_{t+1}$ can be obtained from its $p-1$ neighboring samples and the sample just obtained by prediction.

By repeating this processing sample after sample, the audio signal for the prediction section (which here coincides with the focus-sound superimposed section) can be generated. In this embodiment, the operation of approximating the prediction section from the obtained $\alpha_i$ is called the prediction operation.
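The learning and prediction operations can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the circuit of the embodiment: the function names `learn_lpc` and `predict` are hypothetical, the normal equations sum_{i=1..p} alpha_i c_ij = -c_0j (the minimization condition with alpha_0 = 1) are solved by plain Gaussian elimination, and a learning section that makes the system singular (for example, pure silence) is not handled.

```python
def learn_lpc(x, p):
    """Learning operation: least-squares estimate of alpha_1..alpha_p over section x."""
    n = len(x)
    # c[i][j] = sum over t of x[t-i] * x[t-j], for t = p .. n-1
    c = [[sum(x[t - i] * x[t - j] for t in range(p, n)) for j in range(p + 1)]
         for i in range(p + 1)]
    # Augmented system: sum_{i=1..p} alpha_i * c[i][j] = -c[0][j], j = 1..p
    a = [[c[i][j] for i in range(1, p + 1)] + [-c[0][j]] for j in range(1, p + 1)]
    # Gauss-Jordan elimination with partial pivoting (assumes a non-singular system)
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(p):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[col])]
    return [a[k][p] / a[k][k] for k in range(p)]


def predict(history, alpha, count):
    """Prediction operation: extend history by count samples, x_t ~ -sum_i alpha_i x_{t-i}."""
    buf = list(history)
    for _ in range(count):
        buf.append(-sum(al * buf[-i] for i, al in enumerate(alpha, start=1)))
    return buf[len(history):]
```

For a pure sinusoid, which satisfies x_t = 2 cos(w) x_{t-1} - x_{t-2}, a second-order fit should return coefficients close to alpha_1 = -2 cos(w) and alpha_2 = 1, which gives a quick sanity check of the sketch.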

FIG. 3 is a schematic diagram showing the relationship between the sound pressure level of the audio from the subject or its surroundings (hereinafter, "subject audio"), the focus-sound superimposed section (prediction section), and the learning sections. The horizontal axis represents time, and the vertical axis represents the presence or absence of subject audio.

51a is the original subject audio, in which the focus sound (noise) is superimposed on the deletion section 52, that is, the focus-sound superimposed section. The audio processing unit 28 deletes the subject audio 51a within the deletion section 52.

51b shows the subject audio after the audio in the deletion section 52 has been deleted. The audio processing unit 28 repeats the prediction operation into the deletion section 52 from the learning section 53a, which precedes the deletion section 52 in time, and from the learning section 53b, which follows it, and embeds the predicted waveform in the deletion section 52.

51c shows the subject audio with the prediction signal obtained by the prediction operation embedded in the deletion section 52. The deletion section 52 thus becomes the prediction section 54 into which the prediction signal is embedded.

As described, the learning operation uses the signals before and after the prediction section. This exploits the property that an audio signal is relatively repetitive when viewed over a very short time span.

In the learning and prediction operations, the signals of the learning section 53a and the learning section 53b are processed independently. Generating the signal of the prediction section 54 from the learning operation on the learning section 53a is called prediction from the front, or forward prediction for short. Predicting the signal of the prediction section 54 from the learning operation on the learning section 53b is called prediction from the rear, or backward prediction for short. When the signal of the prediction section is calculated, the forward and backward predictions are weighted so that the forward-predicted value carries more weight closer to the learning section 53a and the backward-predicted value carries more weight closer to the learning section 53b.
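One way to realize this weighting is a linear cross-fade, sketched below. The text only requires that each prediction weigh more near its own learning section; the linear ramp, the function name `blend_predictions`, and the convention that both inputs are already in time order (a backward prediction computed on time-reversed samples would be reversed back first) are assumptions made for illustration.

```python
def blend_predictions(forward, backward):
    """Cross-fade two equal-length predictions of the same section.

    forward[k] and backward[k] both estimate sample k of the section, in time order.
    """
    n = len(forward)
    if n != len(backward):
        raise ValueError("both predictions must cover the same section")
    out = []
    for k in range(n):
        # Weight of the backward prediction grows toward the following learning section
        w_back = (k + 1) / (n + 1)
        out.append((1.0 - w_back) * forward[k] + w_back * backward[k])
    return out
```

With this ramp, neither prediction is ever weighted at exactly 0 or 1 inside the section, so both learning sections contribute to every filled sample.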

FIG. 4 is a timing chart showing the relationship between the actual focus sound, the deletion and prediction sections, and the pseudo focus sound in this embodiment. The horizontal axis represents time, and the vertical axis represents the sound pressure level of the subject audio. 61a to 61d indicate the sound pressure level of the subject audio.

The actual focus sound 60, as actually emitted, is superimposed on the subject audio 61a. As shown by the subject audio 61b, the subject audio in the deletion section 62, which contains the focus-sound superimposed section, is deleted. The deletion section 62 is generally wider than the section on which the actual focus sound 60 is superimposed.

The reason for deleting the subject audio over a section wider than the one on which the actual focus sound 60 is superimposed is as follows. The camera CPU 16 causes the lens driving unit 18 to bring the imaging optical system 10 to the in-focus point in accordance with the focus detection signal from the focus detection unit 14. At the moment the imaging optical system 10 reaches the in-focus point, the camera CPU 16 supplies a focus sound generation instruction signal to the focus sound generation unit 40 and supplies a focus timing gate signal indicating the deletion section 62 to the audio processing unit 28. The audio processing unit 28 deletes, from the audio signal supplied by the gain adjustment unit 24, the portion in the section indicated by the focus timing gate signal. The focus timing gate signal is set wider than the period of the focus sound generation instruction signal so that it covers the period during which the focus sound generated by the focus sound generation unit 40 and its reverberation enter the microphone 22.
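A minimal sketch of such a gate, working in sample counts, is given below. The names `deletion_gate` and `delete_section` and the lead and tail margins are hypothetical; the text only states that the gate must be wide enough to cover the focus sound and its reverberation.

```python
def deletion_gate(beep_start, beep_len, lead, tail):
    """Return (start, stop) sample indices of a gate wider than the beep itself.

    lead: samples of margin before the beep; tail: samples allowed for reverberation.
    """
    start = max(0, beep_start - lead)
    stop = beep_start + beep_len + tail
    return start, stop


def delete_section(audio, start, stop):
    """Return a copy of audio with samples start..stop-1 set to zero."""
    out = list(audio)
    for k in range(start, min(stop, len(out))):
        out[k] = 0.0
    return out
```

The zeroed section is what the prediction operation subsequently fills, so the gate width directly determines how many samples must be predicted.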

When the audio signal in the deletion section 62 is predicted, the operation of the gain adjustment unit 24 must be taken into account. The gain adjustment unit 24 raises the amplification gain to increase sensitivity when the subject audio is quiet, and lowers the amplification gain to prevent signal saturation when the subject audio is loud.

FIG. 5 shows an example of how the gain of the gain adjustment unit 24 changes. The horizontal axis represents time, and the vertical axis represents the sound pressure level of the output audio signal of the microphone 22 and the gain level of the gain adjustment unit 24. The actual focus sound 60 is superimposed on the subject audio 61a.

When the focus sound 60 has a high sound pressure level relative to the subject audio 61a, the gain adjustment unit 24 lowers the gain, as shown by the gain level 71, over the section in which the focus sound 60 is present, in order to prevent saturation of the audio signal. In general, once the loud sound has stopped, the gain adjustment unit 24 restores the gain gradually, as shown by the gain level 72. This is because restoring the gain abruptly makes the audio before and after the change discontinuous and sounds unnatural.

If the gain is changed gradually as with the gain level 72, however, the period 73 over which the gain is changing cannot be used as a learning section for generating the predicted audio. The camera CPU 16 therefore uses the gain adjustment control unit 26 to restore the gain of the gain adjustment unit 24 rapidly after the actual focus sound 60 ends, as shown by the gain level 74. Specifically, the gain adjustment control unit 26 temporarily shortens the time constant of the gain control feedback loop of the gain adjustment unit 24 in accordance with a signal from the camera CPU 16 indicating the end timing of the focus sound. With this temporary gain control, the gain level is stable overall, and the period after the focus sound can be used as a learning section.
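The effect of temporarily shortening the time constant can be illustrated with a toy first-order gain loop. Everything here is an assumption for illustration: the function name `gain_trace`, the per-sample loop coefficients standing in for time constants, and the loop structure itself are not taken from the embodiment, and the envelope values must be positive.

```python
def gain_trace(envelope, target, attack, slow_release, fast_release,
               fast_from=None, fast_len=0):
    """Track gain toward target/level with a first-order loop.

    attack is used when the gain must fall. When it must rise, fast_release is
    used for samples fast_from .. fast_from+fast_len-1 and slow_release otherwise.
    """
    gain = target / envelope[0]
    trace = []
    for n, level in enumerate(envelope):
        wanted = target / level
        if wanted < gain:
            coeff = attack
        elif fast_from is not None and fast_from <= n < fast_from + fast_len:
            coeff = fast_release      # temporarily shortened time constant
        else:
            coeff = slow_release      # normal, gradual recovery
        gain += coeff * (wanted - gain)
        trace.append(gain)
    return trace


# Quiet audio, a loud beep for 10 samples, then quiet audio again
envelope = [0.1] * 10 + [1.0] * 10 + [0.1] * 30
gradual = gain_trace(envelope, 0.1, 1.0, 0.05, 0.9)
boosted = gain_trace(envelope, 0.1, 1.0, 0.05, 0.9, fast_from=20, fast_len=5)
```

In this toy run, `boosted` is back near the pre-beep gain within the five-sample fast window, while `gradual` is still recovering, which is why the samples right after the beep become usable as a learning section only in the boosted case.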

In audio prediction, the predicted signal may diverge over time. This is because the errors in the prediction coefficients obtained by the calculation described above accumulate, and as a result the predicted audio becomes extremely large. This problem can be solved by adjusting the prediction coefficients. For example, the scale of every coefficient is changed uniformly so that the subject audio at time t becomes smaller than the previously obtained level. With this adjustment, each newly predicted audio signal value is smaller than the previous one, and the predicted audio signal eventually converges. The accuracy of the predicted audio signal drops slightly, but because the pseudo in-focus sound is mixed in afterwards, the loss of accuracy is not noticeable.
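A minimal sketch of this adjustment follows, assuming a linear predictor in which each new sample is a weighted sum of the most recent samples. The function name `extrapolate`, the `damping` parameter, and the example coefficients are hypothetical; the specification only states that the scale of every coefficient is changed uniformly.

```python
def extrapolate(history, coeffs, count, damping=1.0):
    """Predict `count` samples after `history` with a linear predictor.

    Each new sample is damping * sum(coeffs[i] * x[t - 1 - i]), so
    coeffs[0] weights the newest sample. A damping factor below 1.0
    scales every coefficient uniformly.
    """
    samples = list(history)
    scaled = [damping * c for c in coeffs]
    for _ in range(count):
        recent = samples[-1:-len(scaled) - 1:-1]  # newest sample first
        samples.append(sum(c * x for c, x in zip(scaled, recent)))
    return samples[len(history):]


# Coefficients with a small error that makes the raw prediction diverge.
history = [1.0, 1.1]
coeffs = [2.02, -1.0]
raw = extrapolate(history, coeffs, 200)
damped = extrapolate(history, coeffs, 200, damping=0.9)
```

For these example coefficients the undamped prediction grows without bound, while the uniformly scaled version decays toward zero. Uniform scaling is not guaranteed to stabilize every possible set of coefficients, so the factor would need to be chosen for the predictor actually in use.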

FIG. 6 is a schematic diagram illustrating the above processing. The horizontal axis represents time, and the vertical axis represents the sound pressure level. For the subject audio 61c, from which the audio in the in-focus sound superimposed section 65 has been deleted, learning and prediction are performed using the audio before and after that section, and the predicted audio is embedded in the deleted section. At this time, the coefficients are adjusted as described above so that the predicted audio converges to zero over time, as shown by predicted audio 71a and 71b.

As shown in FIG. 4, the pseudo in-focus sound generation unit 30 generates a predetermined sound stored in advance as the pseudo in-focus sound 64 within the prediction section 63 of the subject audio 61d, and the audio processing unit 28 mixes the pseudo in-focus sound 64 into the subject audio 61d. The section 66 in which the pseudo in-focus sound 64 is mixed is shorter than the in-focus sound superimposed section 65. This is to match the length of the actual in-focus sound.

The sound pressure level of the pseudo in-focus sound is adjusted according to the loudness of the subject audio 61d. That is, when the subject audio is loud, the pseudo in-focus sound is also made louder so that it can be heard clearly. Conversely, when the subject audio is quiet, the pseudo in-focus sound is also made quieter so that it does not stand out excessively. FIG. 7 is a schematic diagram showing the relationship between the sound pressure levels of the subject audio 61d and the pseudo in-focus sound. The horizontal axis represents time, and the vertical axis represents the sound pressure level. In the example shown in FIG. 7, the sound pressure levels of the pseudo in-focus sounds 64a and 64b are approximately twice those of the subject audio 61d and 61e.
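The level matching can be sketched as follows. Using the RMS value as the measure of loudness, the function names, and the default `ratio` of 2.0 (taken from the "approximately twice" of the FIG. 7 example) are assumptions for illustration only.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples (0.0 for an empty list)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def scale_pseudo_sound(template, subject, ratio=2.0):
    """Scale the stored beep so its RMS is `ratio` times the subject RMS."""
    reference = rms(template)
    if reference == 0.0:
        return list(template)  # a silent template cannot be scaled
    gain = ratio * rms(subject) / reference
    return [gain * s for s in template]


stored_beep = [1.0, -1.0, 1.0, -1.0]
quiet_subject = [0.1, -0.1, 0.1, -0.1]
scaled = scale_pseudo_sound(stored_beep, quiet_subject)
```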

FIG. 8 shows the operations of detecting and temporarily storing the subject audio in the present embodiment. A control program running on the camera CPU 16 controls each unit so as to implement the flowchart shown in FIG. 8. The flow shown in FIG. 8 starts when moving image shooting (or audio-only recording) starts.

In step S8001, the camera CPU 16 determines whether emission of the actual in-focus sound has ended. If it has ended, the process proceeds to step S8002; otherwise, the process proceeds to step S8003. As described above, because the actual in-focus sound is loud, the gain adjustment unit 24 lowers the amplification gain for the microphone 22 while the sound is being emitted. Consequently, the gain does not recover immediately after the actual in-focus sound ends, and the accuracy of backward prediction is reduced.

In step S8002, the camera CPU 16 instructs the gain adjustment control unit 26 to restore the gain of the gain adjustment unit 24 instantly, immediately after the actual in-focus sound ends. This allows the subject audio in the learning section that follows the actual in-focus sound to stabilize early, which improves the accuracy of backward prediction.

For sounds other than the actual in-focus sound, step S8002 is skipped, so the gain recovers slowly and the resulting subject audio signal sounds natural.

In step S8003 the subject audio is captured, and in step S8004 the focus state is captured. The focus state information includes the start and end timing of the actual in-focus sound and information on the in-focus area within the captured image. The start and end timing of the actual in-focus sound is used to delete and predict the actual in-focus sound superimposed section within the subject audio. The in-focus area information is used to display the in-focus area within the image during moving image playback, as shown in FIG. 9. FIG. 9 shows an example of a screen in which the in-focus area is superimposed on the moving image playback screen. During playback, at a point where rapid focusing was performed by a release operation or the like, the recorded in-focus area is superimposed in the frame 91 as the focus area 92, and the pseudo in-focus sound is emitted at the same time.

In step S8005, the captured audio signal and the focus state information are synchronized and temporarily stored in a buffer or the like.
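One way to keep the audio and the focus state synchronized is to store them together, frame by frame. The following is a minimal sketch under that assumption; the `Frame` and `CaptureBuffer` names, the frame granularity, and the fixed-length buffer are hypothetical and not taken from the specification.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Frame:
    samples: List[float]               # audio samples captured in step S8003
    beep_active: bool                  # focus state captured in step S8004
    focus_area: Optional[Tuple[int, int, int, int]] = None  # x, y, width, height


class CaptureBuffer:
    """Fixed-length buffer holding audio and focus state side by side."""

    def __init__(self, max_frames):
        self._frames = deque(maxlen=max_frames)

    def push(self, samples, beep_active, focus_area=None):
        self._frames.append(Frame(list(samples), beep_active, focus_area))

    def beep_sections(self):
        """Return (start, end) frame index pairs, end exclusive, where the
        actual in-focus sound was being emitted."""
        sections = []
        start = None
        for i, frame in enumerate(self._frames):
            if frame.beep_active and start is None:
                start = i
            elif not frame.beep_active and start is not None:
                sections.append((start, i))
                start = None
        if start is not None:
            sections.append((start, len(self._frames)))
        return sections
```

Because each frame carries its own flag, later processing can locate the sections to delete without having to detect the beep from the waveform itself.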

FIG. 10 shows a flowchart of the subject audio processing of the present embodiment. The audio processing unit 28 starts the flow shown in FIG. 10 after a predetermined delay from the start of audio recording or moving image shooting. This delay allows for the time needed to generate the predicted audio, namely the time required to calculate the prediction coefficients and perform backward prediction as described above (for example, the duration of the backward learning section 33b). Because the audio processing unit 28 stores the output audio signal of the gain adjustment unit 24 in a buffer and then applies backward prediction, a highly accurate predicted sound is obtained.

In step S10001, the audio processing unit 28 scans the subject audio temporarily stored in the buffer and determines, based on the focus state information and the like, whether the actual in-focus sound is superimposed. When an actual in-focus sound superimposed section is reached, the process proceeds to step S10002; otherwise, the process returns to step S10001 and waits in a loop.

In step S10002, the audio processing unit 28 deletes the signal in the actual in-focus sound superimposed section from the subject audio signal stored in the buffer.

In step S10003, the audio processing unit 28 predicts the subject audio in the actual in-focus sound superimposed section from the subject audio signals before and after that section, and inserts the predicted audio into the section.

In step S10004, the audio processing unit 28 determines whether the subject sound pressure level in the section preceding the actual in-focus sound superimposed section is higher than a predetermined value. If the subject sound pressure level is lower than the predetermined value, the process proceeds to step S10005; if it is higher, the process proceeds to step S10006.

In both steps S10005 and S10006, the audio processing unit 28 mixes the pseudo in-focus sound signal into the subject audio signal. In step S10006, however, a pseudo in-focus sound with a high sound pressure level is mixed in, whereas in step S10005 a pseudo in-focus sound with a low sound pressure level is mixed in. This prevents the pseudo in-focus sound from being buried in the subject audio.

In step S10007, the audio processing unit 28 outputs the subject audio signal with the pseudo in-focus sound mixed in to the recording unit 32, and the process returns to step S10001. The recording unit 32 records the audio signal from the audio processing unit 28 on a recording medium (not shown).
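The control flow of steps S10002 to S10006 can be sketched as below. The gap is filled here by plain linear interpolation between the neighbouring samples, which is only a stand-in for the learning-based prediction of the specification; the function name, the threshold, the two gains, and the `window` length are likewise assumptions for illustration.

```python
def process_section(audio, start, end, beep, threshold=0.1,
                    loud_gain=1.0, quiet_gain=0.3, window=4):
    """Replace audio[start:end] and mix in a level-matched pseudo beep."""
    out = list(audio)

    # S10002 and S10003: discard the superimposed section and fill it.
    # Stand-in fill: linear interpolation between the neighbouring samples.
    left = audio[start - 1] if start > 0 else 0.0
    right = audio[end] if end < len(audio) else 0.0
    span = end - start
    for k in range(span):
        out[start + k] = left + (right - left) * (k + 1) / (span + 1)

    # S10004: measure the subject level just before the section.
    before = audio[max(0, start - window):start]
    level = max((abs(s) for s in before), default=0.0)
    gain = loud_gain if level > threshold else quiet_gain

    # S10005 (quiet beep) or S10006 (loud beep): mix in the pseudo sound.
    for k, b in enumerate(beep):
        if start + k < len(out):
            out[start + k] += gain * b
    return out


loud_scene = [0.5] * 4 + [9.0] * 3 + [0.5] * 4
quiet_scene = [0.05] * 4 + [9.0] * 3 + [0.05] * 4
loud_out = process_section(loud_scene, 4, 7, beep=[1.0, 1.0])
quiet_out = process_section(quiet_scene, 4, 7, beep=[1.0, 1.0])
```

The returned list corresponds to the signal that would be handed to the recording unit 32 in step S10007.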

The audio signal in the actual in-focus sound superimposed section 65 is deleted and replaced with the predicted signal, and the shorter this deleted section 65 (deletion section 62) is, the less prediction error accumulates. The actual in-focus sound superimposed period should therefore be as short as possible. As a focus confirmation, however, the actual in-focus sound gives a better operating feel during shooting if it has a certain length. For this reason, even for the same still image shooting operation, a longer in-focus sound is used when a still image is captured on its own, with no moving image being shot, and a shorter in-focus sound is used when a still image is captured during moving image shooting. The pseudo in-focus sound embedded in the recorded audio signal is given the same length as the actual in-focus sound used for still-image-only shooting, which avoids an unnatural impression during playback. For example, the actual in-focus sound for still-image-only shooting is a double beep ("pi-pip!"), whereas for still image shooting during moving image shooting the actual in-focus sound is a single beep ("pip!") and the pseudo in-focus sound is the double beep ("pi-pip!").

When the user has selected the still image shooting mode with the moving image/still image mode selector switch 38, the camera CPU 16 causes the in-focus sound generation unit 40 to emit two short consecutive beeps ("pi-pip!") on focusing. When the moving image shooting mode is selected, the camera CPU 16 causes the in-focus sound generation unit 40 to emit a single short beep ("pip!") on focusing.

FIG. 11 is a timing chart illustrating the above operation. The horizontal axis represents time, and the vertical axis represents the sound pressure level. Reference numerals 111a to 111d each denote subject audio. The actual in-focus sound 110 that was actually emitted is superimposed on the subject audio 111a. Because this is the actual in-focus sound used during moving image shooting, the section in which it is emitted is shorter than that of the actual in-focus sound 60 shown in FIG. 4.

The audio processing unit 28 deletes the subject audio in the in-focus sound superimposed section; here, it deletes the subject audio in a section 112 that is wider than the section in which the actual in-focus sound 110 is present. This deletion section 112 can also be made shorter than the deletion section 62 shown in FIG. 4.

For the subject audio 111c, the audio processing unit 28 learns from the subject audio signals before and after the deletion section 112, predicts the subject audio in the deletion section, and fills the section with it. Because the deletion section 112 is short, error does not accumulate in the predicted audio used to fill it.

For the subject audio 111d, the audio processing unit 28 mixes the pseudo in-focus sound 114 into the subject audio 111d. The section 116 in which the pseudo in-focus sound 114 is mixed has the same length as the in-focus sound used for still image shooting and is longer than the deletion section 112. As a result, when the finally recorded subject audio is played back, a pseudo in-focus sound of the same length as in still image shooting is heard. With this configuration, moving image playback sounds natural and equivalent to the in-focus sound of still image shooting, while accumulation of error in the predicted signal is prevented.

The present embodiment can also reduce the lens driving sound produced during focusing. The lens driving sound consists of the meshing noise of the motor and its gears and the resulting vibration of the lens barrel. Because of resonance inside the camera housing, the sound captured by the microphone 22 is a noise that differs considerably from the lens driving sound as heard from outside.

As in the first embodiment, the subject audio in the section where the lens driving sound occurs may be deleted, the section filled with predicted audio, and a pseudo lens driving sound mixed in. However, because the lens driving sound lasts longer than the in-focus sound, error in the predicted audio accumulates more easily.

Instead of using audio prediction, the lens driving sound may be reduced by the following method. FIG. 12 shows an example timing chart. The horizontal axis represents time, and the vertical axis represents the sound pressure level. Reference numerals 121a to 121c each denote subject audio.

The lens driving sound 120 produced during focusing is superimposed on the subject audio 121a. The audio processing unit 28 applies lens driving sound reduction only to the section in which the lens driving sound is superimposed, producing the subject audio 121b. The details of the lens driving sound reduction performed by the audio processing unit 28 are described later. The section 125 in which the reduction is applied is made approximately the same as the section in which the lens driving sound 120 is superimposed.

The audio processing unit 28 mixes the pseudo lens driving sound 123 into the subject audio signal to generate the subject audio 121c. The pseudo lens driving sound is obtained, for example, by capturing the lens driving sound with the microphone 22 in an environment with no ambient sound, extracting the main frequency components from the output audio signal of the microphone 22, and synthesizing them into a sound that is pleasant to hear. Alternatively, a sound that is easy to hear may simply be generated artificially while checking how it sounds.

As in the first embodiment, the audio processing unit 28 mixes the pseudo lens driving sound 123 into the subject audio 121c over a section 126 that is wider than the lens driving sound superimposed section 125. Even if there are discontinuities at the joins before and after the section in which the lens driving sound was reduced, they are masked by the pseudo lens driving sound, so the subject audio sounds natural.

FIG. 13 is a block diagram of a configuration for reducing the lens driving sound superimposed on the subject audio.

The frequency conversion unit 130 converts the subject audio signal on which the lens driving sound is superimposed into the frequency domain by a Fourier transform. This yields, for example, a signal with the spectrum waveform 141a shown in FIG. 14(a).

Frequency component data for the pseudo lens driving sound is stored in the storage device 131. FIG. 14(b) shows the spectrum waveform 141b of the pseudo lens driving sound stored in the storage device 131.

The difference processing unit 132 subtracts, from each frequency component output by the frequency conversion unit 130 (FIG. 14(a)), the component of the same frequency read from the storage device 131 (FIG. 14(b)). In this way, the pseudo lens driving sound is subtracted from the actual lens driving sound superimposed on the subject audio, and the sound pressure of the actual lens driving sound is reduced.

The time axis conversion unit 133 applies an inverse Fourier transform to the output of the difference processing unit 132 to return it to a time-domain waveform.

Because the pseudo lens driving sound is subtracted after conversion into the frequency domain in this way, the phase of each frequency component does not need to be considered.
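The chain of units 130 to 133 can be sketched with a small magnitude-domain subtraction. This is an illustrative sketch only: the naive DFT, the function names, keeping the phase of the input signal, and flooring negative magnitudes at zero are all assumptions, since the specification states only that components of the same frequency are subtracted.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (frequency conversion unit 130)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse transform (time axis conversion unit 133), real part only."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def subtract_noise(frame, noise_magnitudes):
    """Subtract stored per-bin magnitudes (difference processing unit 132).

    The phase of the input is kept, and magnitudes are floored at zero so
    that over-subtraction cannot produce a negative magnitude.
    """
    cleaned = []
    for value, noise in zip(dft(frame), noise_magnitudes):
        magnitude = max(abs(value) - noise, 0.0)
        cleaned.append(cmath.rect(magnitude, cmath.phase(value)))
    return idft(cleaned)
```

As a usage example, the stored magnitudes could be computed once as `[abs(v) for v in dft(reference_noise)]` from a recording of the lens driving sound made without ambient sound, and then applied to each frame of the section in which the lens driving sound is superimposed. A naive DFT is used only to keep the sketch dependency-free; a real implementation would use an FFT.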

As one embodiment of the present invention, a subject audio recording system that superimposes a pseudo sound after eliminating or reducing a sound generated by the imaging apparatus has been described. The present invention is also widely applicable to digital still cameras with a moving image shooting function, digital video cameras, surveillance cameras, Web cameras, mobile phones, and the like.

Claims

An imaging apparatus that captures a still image during moving image shooting, comprising:
an imaging optical system;
imaging means for converting an optical image formed by the imaging optical system into an image signal;
in-focus sound generating means for generating an in-focus sound in accordance with focusing of the imaging optical system when the still image is captured;
audio input means for capturing ambient sound including the in-focus sound;
pseudo in-focus sound generating means for generating a pseudo in-focus sound signal; and
audio processing means for removing the in-focus sound mixed into an input audio signal of the audio input means and for synthesizing the pseudo in-focus sound signal.

The imaging apparatus according to claim 1, wherein the audio processing means comprises:
removing means for removing, from the input audio signal of the audio input means, the audio signal of an in-focus sound superimposed section in which the in-focus sound is superimposed;
complementing means for predicting the audio signal of the in-focus sound superimposed section from the audio signals of the sections temporally before and after the in-focus sound superimposed section, and for filling the in-focus sound superimposed section with the predicted signal; and
means for synthesizing the pseudo in-focus sound signal with the output of the complementing means.