JP2021173881A - Voice processing device and voice processing method

Info

Publication number: JP2021173881A
Application number: JP2020078052A
Authority: JP
Inventors: 信範工藤; Akinori Kudo
Original assignee: Alps Alpine Co Ltd
Current assignee: Alps Alpine Co Ltd
Priority date: 2020-04-27
Filing date: 2020-04-27
Publication date: 2021-11-01
Anticipated expiration: 2040-04-27
Also published as: JP7493875B2

Abstract

To provide a voice processing device and a voice processing method in which a function of executing beamforming processing can be implemented while suppressing an increase in cost. SOLUTION: A voice processing device 1 is provided with a beamforming processing part 22 which performs beamforming processing on a plurality of voice signals. When a user inputs voice, the device stops the voice output functions of a plurality of speakers and lets those speakers function as microphones, so that the voice signals output by the speakers functioning as microphones are input to the beamforming processing part 22. The plurality of voice signals are thus input to the voice processing device 1 by using speakers already present in the space where the voice processing device 1 is installed. SELECTED DRAWING: Figure 3

Description

The present invention relates to a voice processing device and a voice processing method, and is particularly suitable for a voice processing device and a voice processing method that process a user's speech picked up by a microphone.

Conventionally, there are voice processing devices that receive a voice signal of a user's speech picked up by a microphone and apply processing such as noise cancellation and echo cancellation to that signal. Some devices of this type receive voice signals from a plurality of microphones and apply beamforming processing to further improve the quality of the output voice signal. Patent Document 1 describes a technique in which, when a microphone fails, a speaker is used as a substitute for the microphone so that a hands-free phone system can continue to function.

Patent Document 1: Japanese Unexamined Patent Publication No. 2017-212489

Because beamforming processing requires a plurality of microphones, newly implementing a function of executing beamforming processing in a voice processing device requires an arrangement in which voice signals are input from a plurality of microphones. One conceivable approach is to add dedicated microphones and either connect them to the voice processing device or build them into it, but in that case the cost increases by the amount of the added microphones.

The present invention has been made to solve this problem, and its object is to make it possible to implement a function of executing beamforming processing in a voice processing device while suppressing an increase in cost.

To solve the above problem, the present invention provides, in a voice processing device installed in a predetermined space in which a plurality of speakers are arranged, a beamforming processing unit that performs beamforming processing on a plurality of voice signals. When the user performs voice input, the voice output function of the plurality of speakers is stopped, the plurality of speakers are made to function as microphones, and the voice signals from the speakers functioning as microphones are input to the beamforming processing unit.

According to the present invention configured as described above, a plurality of voice signals can be input to the voice processing device by using the plurality of speakers already present in the space where the device is installed, rather than by adding dedicated microphones. A function of executing beamforming processing can therefore be implemented in the voice processing device while suppressing an increase in cost.

FIG. 1 is a diagram showing an example of how a voice processing device according to an embodiment of the present invention is installed in a vehicle interior space.

FIG. 2 is a diagram showing a configuration example of a voice recognition system according to an embodiment of the present invention.

FIG. 3 is a diagram showing an example of the hardware configuration of a voice processing device according to an embodiment of the present invention.

FIG. 4 is a functional block diagram showing an example of the functions of the control unit of a voice processing device according to an embodiment of the present invention.

FIG. 5 is a flowchart showing an operation example of a voice processing device according to an embodiment of the present invention.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a diagram showing the voice processing device 1 according to the present embodiment installed in a vehicle interior space 2 (corresponding to the "predetermined space" in the claims) formed inside a vehicle. FIG. 1 schematically shows, in simplified form, the front seats (driver's seat 3 and passenger seat 4) and the area around the dashboard 5 in the vehicle interior space 2. As shown in FIG. 1, the voice processing device 1 is provided in the central portion of the dashboard 5. The installation position shown in FIG. 1 is only an example, and the voice processing device 1 can be installed at any position. The voice processing device 1 has a built-in microphone 6 that picks up sound. In FIG. 1, the built-in microphone 6 is drawn at an exaggerated size.

As shown in FIG. 1, a pair of tweeters 7R and 7L (corresponding to the "plurality of speakers" and the "vehicle-mounted speakers" in the claims) are provided at both ends of the dashboard 5. The tweeters 7R and 7L are speakers that output high-frequency sound, and are connected to the voice processing device 1. Although not shown in FIG. 1, the vehicle interior space 2 is also provided with speakers other than the tweeters 7R and 7L that output sound in the midrange and below (for example, full-range speakers, or a combination of full-range speakers and a subwoofer), and the voice processing device 1 and these speakers together constitute an in-vehicle audio system. Providing a pair of tweeters at both ends of the dashboard, as in the present embodiment, is currently a widespread practice in in-vehicle audio systems.

In the following description, the set of speakers connected to the voice processing device 1 (including the tweeters 7R and 7L) is referred to as the "speaker group". An occupant of the vehicle is simply referred to as a "user".

FIG. 2 is a diagram showing the configuration of a voice recognition system 9 that includes the voice processing device 1 according to the present embodiment. As shown in FIG. 2, the voice processing device 1 can access a network N that includes communication networks such as the Internet and a telephone network, and can communicate with a service providing server 10 via the network N. The service providing server 10 is a cloud server that provides a service related to voice recognition of voice collected by a client terminal. Hereinafter, the service provided by the service providing server 10 is referred to as the "voice recognition service". One form of the voice recognition service recognizes the voice collected by the client terminal, understands its content, and executes processing corresponding to that content. As an example, when the user utters a question to the client terminal, the service providing server 10 recognizes the voice, understands its content, generates an answer to the question, and causes the client terminal to output the answer as voice, thereby realizing a voice dialogue between the user and the client terminal.

The voice processing device 1 according to the present embodiment functions as a client terminal of the service providing server 10, and the user can use the voice recognition service via the voice processing device 1. To use the voice recognition service, the user utters a specific predetermined word called a wake word and then, following the wake word, utters wording that asks a question or makes a request (hereinafter referred to as a "request"). In the present embodiment, for convenience of explanation, it is assumed that the user always utters the wake word before uttering a request.

The voice processing device 1 generates processing request data in response to the user's utterance of the wake word and the request, and transmits it to the service providing server 10. The processing request data includes voice data (hereinafter referred to as "utterance voice data") containing the voice data corresponding to the wake word uttered by the user and the voice data corresponding to the request uttered by the user, together with control information data in which the necessary reference information about the utterance voice data is described in a predetermined format (for example, JSON).
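As a rough illustration of such processing request data, the following Python sketch bundles wake-word audio, request audio, and JSON control information into one structure. The source only states that the control information follows a predetermined format such as JSON; the function name `build_processing_request` and every field name here are hypothetical.

```python
import base64
import json


def build_processing_request(wake_word_pcm: bytes, request_pcm: bytes,
                             sample_rate_hz: int = 16000) -> dict:
    """Bundle utterance audio and control information into one request.

    All field names are hypothetical; they are not taken from the source.
    """
    # Utterance voice data: wake-word audio followed by request audio.
    utterance = wake_word_pcm + request_pcm
    # Control information: reference information about the utterance audio.
    control_info = {
        "sample_rate_hz": sample_rate_hz,
        "wake_word_bytes": len(wake_word_pcm),
        "request_bytes": len(request_pcm),
    }
    return {
        "control_info": json.dumps(control_info),
        "utterance_audio": base64.b64encode(utterance).decode("ascii"),
    }
```

Recording the byte length of the wake-word portion is one possible way to let a receiver separate the wake word from the request; the actual reference information is not specified in the source.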

The voice data corresponding to the request, which is included in the utterance voice data, is the target of voice recognition in the service providing server 10 and is therefore required to be of high quality. In view of this, the voice processing device 1 according to the present embodiment implements, for input voice, a function of executing beamforming processing in addition to functions of executing echo cancellation processing and noise cancellation processing. As is well known, for a device to execute beamforming processing, voice signals must be input to the device from a plurality of microphones, yet the voice processing device 1 according to the present embodiment has only the single built-in microphone 6 as its own microphone.

To implement a function of executing beamforming processing in a voice processing device 1 with such a configuration, one conceivable approach is to add dedicated microphones and either connect them to the voice processing device 1 or build them into it, but in that case the cost increases by the amount of the added microphones. In addition, if dedicated microphones are connected externally, they must be fixed at appropriate positions, which makes the installation work difficult. If dedicated microphones are built in, various problems must be solved, such as securing space for the microphones inside the housing, arranging them relative to the other electronic components in the housing, and addressing design constraints. In light of the above, the voice processing device 1 according to the present embodiment executes beamforming processing with the configuration and by the means described below. The configuration and processing of the voice processing device 1 are described in detail below.

FIG. 3 is a block diagram showing an example of the hardware configuration of the main part of the voice processing device 1. In FIG. 3, however, functional blocks representing the functions realized by the control unit 12 (described later) are drawn inside the block representing the control unit 12. As shown in FIG. 3, the voice processing device 1 includes a control unit 12 and a voice processing unit 13 as its hardware configuration.

The control unit 12 includes a DSP (Digital Signal Processor) and various circuits and electronic components associated with the DSP, and executes various processes by means of the DSP. However, the control unit 12 need not be based on a DSP and may instead include, for example, a general-purpose microprocessor or microcontroller. The voice processing unit 13 includes various circuits and electronic components related to voice processing. The control unit 12 and the voice processing unit 13 are drawn as separate blocks in FIG. 3 only for convenience of explanation; the circuits and electronic components that realize the control unit 12 and those that realize the voice processing unit 13 may of course be provided on a common substrate. The control unit 12 has a function of outputting a voice signal to the voice processing unit 13 to have sound emitted, and a function of receiving a voice signal based on the sound picked up by the voice processing unit 13 and executing the corresponding processing.

The voice processing device 1 has a normal mode and a beamforming mode as operation modes. The operation of the voice processing unit 13 in the normal mode and in the beamforming mode is described first, focusing on two points: how sound based on the voice signal output by the control unit 12 is emitted, and how a voice signal based on the picked-up sound is output to the control unit 12.

<Normal mode>
In the normal mode, when the control unit 12 outputs a digital voice signal to the D/A converter 14, the signal is converted from digital to analog by the D/A converter 14, its volume level is adjusted by the volume control 15, and it is amplified by the speaker amplifier 16. In the normal mode, the selector 17 is set to the speaker function state. In this state, the switch of the selector 17 connects the speaker amplifier 16 to the speaker group (including the tweeters 7R and 7L). The voice signal amplified by the speaker amplifier 16 is therefore output to the tweeters 7R and 7L via the selector 17, and the tweeters 7R and 7L emit sound based on that signal. In the speaker function state, the switch of the selector 17 also disconnects the tweeters 7R and 7L from the microphone amplifiers 18R and 18L.

In the normal mode, when the built-in microphone 6 picks up sound, a voice signal based on that sound is output from the built-in microphone 6 to the microphone amplifier 19, amplified by the microphone amplifier 19, converted from analog to digital by the A/D converter 20, and subjected to echo cancellation processing by the echo canceller 21. In the normal mode, the beamforming processing unit 22 is set to the off state. In the off state, the beamforming processing unit 22 passes the voice signal received from the echo canceller 21 in the preceding stage to the noise canceller 23 in the following stage without applying any signal processing. Therefore, in the normal mode, the voice signal that has undergone echo cancellation processing in the echo canceller 21 is output to the noise canceller 23 via the beamforming processing unit 22, undergoes noise cancellation processing in the noise canceller 23, and is output to the control unit 12.

<Beamforming mode>
In the beamforming mode, the selector 17 is set to the microphone function state. In this state, the switch of the selector 17 disconnects the speaker amplifier 16 from the speaker group (including the tweeters 7R and 7L), and the audio output from the speaker amplifier 16 to the speaker group is cut off. That is, the voice output function of the tweeters 7R and 7L (the plurality of speakers) is stopped. In the beamforming mode, the tweeters 7R and 7L are instead connected to the microphone amplifiers 18R and 18L by signal lines.
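The two states of the selector 17 can be pictured as a small state model. The Python sketch below is an illustration only: the class and attribute names are hypothetical, and the real selector is a hardware switch rather than software.

```python
from enum import Enum


class SelectorState(Enum):
    SPEAKER = "speaker function state"
    MICROPHONE = "microphone function state"


class Selector:
    """Toy model of selector 17: the tweeters connect to exactly one side."""

    def __init__(self) -> None:
        # Normal mode is assumed to be the initial mode.
        self.state = SelectorState.SPEAKER

    def set_mode(self, beamforming: bool) -> None:
        # Beamforming mode selects the microphone function state.
        self.state = (SelectorState.MICROPHONE if beamforming
                      else SelectorState.SPEAKER)

    @property
    def speaker_amp_connected(self) -> bool:
        # Speaker amplifier 16 drives the speaker group only in speaker state.
        return self.state is SelectorState.SPEAKER

    @property
    def mic_amps_connected(self) -> bool:
        # Microphone amplifiers 18R and 18L see the tweeters only in mic state.
        return self.state is SelectorState.MICROPHONE
```

Modelling the two connections as mutually exclusive reflects the description that each state connects one path and disconnects the other.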

When functioning as speakers, the tweeters 7R and 7L convert the voice signal input from the speaker amplifier 16 into vibration of a diaphragm and output it as sound. When they are not outputting sound, however, they can be made to function as microphones that pick up surrounding sound with the diaphragm and convert the vibration of the diaphragm into a voice signal. In particular, the tweeters 7R and 7L according to the present embodiment have been verified in advance to function effectively as microphones. In the beamforming mode, the tweeters 7R and 7L are connected to the microphone amplifiers 18R and 18L while their voice output function is stopped, so the tweeters 7R and 7L function as microphones, and the voice signals based on the sound they pick up are output to the microphone amplifiers 18R and 18L via the selector 17. The voice signals input to the microphone amplifiers 18R and 18L are amplified there, converted from analog to digital by the A/D converters 24R and 24L, subjected to echo cancellation processing by the echo cancellers 25R and 25L, and input to the beamforming processing unit 22.

Meanwhile, in the beamforming mode, the voice signal based on the sound picked up by the built-in microphone 6 is input to the beamforming processing unit 22 after amplification by the microphone amplifier 19, analog-to-digital conversion by the A/D converter 20, and echo cancellation processing by the echo canceller 21.

In the beamforming mode, the beamforming processing unit 22 is set to the on state. In the on state, the beamforming processing unit 22 executes beamforming processing based on the voice signals input from the echo canceller 21 and from the echo cancellers 25R and 25L. As is well known, beamforming processing maintains sensitivity in the direction toward the sound source (the direction from the built-in microphone 6 toward the sound source) while reducing sensitivity in other directions. Beamforming processing includes identifying the direction toward the sound source based on the levels of, and phase differences between, the signals detected by the individual microphones. Because the tweeter 7R and the tweeter 7L are spaced apart in the left-right direction, a phase difference and a level difference readily appear when the sound source is at different distances from the two tweeters, which makes the tweeters well suited as sources of voice signals for the beamforming processing unit 22.
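To make the role of the inter-channel phase difference concrete, the following Python sketch shows a generic delay-and-sum beamformer. This is a textbook technique used here as a stand-in, not the model used by the beamforming processing unit 22, whose design is not disclosed in this passage. The function names, the use of cross-correlation to estimate the lag, and the `max_lag` parameter are all assumptions.

```python
def estimate_lag(ref, sig, max_lag):
    """Return the lag in samples by which sig trails ref.

    The lag is chosen to maximise the cross-correlation of the two channels.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(ref)):
            j = i + lag
            if 0 <= j < len(sig):
                score += ref[i] * sig[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_and_sum(channels, max_lag=8):
    """Align every channel to the first one, then average the channels."""
    ref = channels[0]
    # The reference channel is aligned with itself by definition.
    lags = [0] + [estimate_lag(ref, ch, max_lag) for ch in channels[1:]]
    out = []
    for i in range(len(ref)):
        total, count = 0.0, 0
        for ch, lag in zip(channels, lags):
            j = i + lag
            if 0 <= j < len(ch):
                total += ch[j]
                count += 1
        out.append(total / count)
    return out
```

In this sketch the first channel would play the role of the built-in microphone 6 and the remaining channels the tweeters 7R and 7L. Sound arriving from the estimated direction adds coherently after alignment, while sound from other directions is attenuated by the averaging.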

The beamforming processing is executed appropriately according to a model designed on the basis of the results of tests and simulations performed in advance, which take into account the placement of the tweeters 7R and 7L, the positional relationship between the built-in microphone 6 and the tweeters 7R and 7L, the characteristics of the tweeters 7R and 7L and of the built-in microphone 6, and similar factors. The voice signal that has undergone beamforming processing in the beamforming processing unit 22 is subjected to noise cancellation processing by the noise canceller 23 and then output to the control unit 12. Because beamforming processing is applied to the picked-up sound when the operation mode is the beamforming mode, the voice signal output to the control unit 12 is in that respect of higher quality than in the normal mode.

FIG. 4 is a functional block diagram representing the functions of the main part of the control unit 12 as functional blocks. As shown in FIG. 4, the control unit 12 includes, as its functional configuration, a voice output unit 26, a content reproduction unit 27, a voice input unit 28, a detection unit 29, a voice recognition processing unit 30, and a switching unit 31. As described above, in the present embodiment the processing of the functional blocks 26 to 31 is executed by the DSP, but the functional blocks 26 to 31 are not limited to a DSP and can be realized by any hardware or by any combination of hardware and software. For example, the control unit 12 may include the CPU, RAM, ROM, and other components of a computer, with each of the functional blocks 26 to 31 executing its processing when the CPU reads a program stored in the ROM into the RAM and executes it. The operation of the voice processing device 1 is described below through a description of the processing of the functional blocks 26 to 31 of the control unit 12.

The voice output unit 26 outputs a voice signal to the D/A converter 14 and causes the voice processing unit 13 to emit sound based on that signal.

The content reproduction unit 27 reproduces content in accordance with the user's instructions. The content includes music and video (such as movies) recorded on a CD or DVD inserted in a content drive (not shown), music and video recorded in data stored in a storage area, and music and video stored in an external device connected to the voice processing device 1. The voice output unit 26 outputs a voice signal corresponding to the audio of the content reproduced by the content reproduction unit 27. Hereinafter, the audio corresponding to the content reproduced by the content reproduction unit 27 is referred to as "content audio".

The voice input unit 28 receives the voice signal from the noise canceller 23 and buffers it as voice data in a voice buffer (not shown). As a result, the voice buffer holds voice data based on the sound picked up by the voice processing unit 13 during a predetermined period going back from the present. Hereinafter, the set of voice data stored in the voice buffer is referred to as "input voice data".
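A buffer that always holds the most recent fixed period of audio can be sketched with a bounded double-ended queue. The class name `AudioBuffer` and its parameters are hypothetical, and the actual buffer implementation is not described in the source.

```python
from collections import deque


class AudioBuffer:
    """Keeps only the most recent `seconds` of samples."""

    def __init__(self, sample_rate_hz: int, seconds: float) -> None:
        # A deque with maxlen silently discards the oldest samples when full.
        self._samples = deque(maxlen=int(sample_rate_hz * seconds))

    def push(self, frame) -> None:
        """Append one frame of samples received from the noise canceller."""
        self._samples.extend(frame)

    def snapshot(self) -> list:
        """Return the buffered samples, oldest first."""
        return list(self._samples)
```

For example, a buffer created with a 4 Hz sample rate and a 1 second window keeps four samples, so pushing six samples leaves only the last four.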

The detection unit 29 detects when the user utters the wake word. More specifically, the detection unit 29 continuously analyzes the input voice data that the voice input unit 28 accumulates in the voice buffer, and continuously calculates the similarity between the voice waveform recorded in the input voice data and the pre-registered voice pattern of the wake word. When the similarity between the voice pattern of the wake word and the voice waveform of the input voice data reaches or exceeds a threshold, the detection unit 29 detects that the user has uttered the voice corresponding to the wake word.
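The threshold comparison can be illustrated with a deliberately simple Python sketch that slides a registered template over the buffered samples and compares the cosine similarity with a threshold. The source does not say how the similarity is computed; cosine similarity over raw samples, the function names, and the default threshold of 0.9 are assumptions chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sequences (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def detect_wake_word(buffered, template, threshold=0.9):
    """Return the start index of the first matching window, or None."""
    width = len(template)
    for start in range(len(buffered) - width + 1):
        window = buffered[start:start + width]
        # Detection fires when the similarity is at or above the threshold.
        if cosine_similarity(window, template) >= threshold:
            return start
    return None
```

A practical detector would normally compare acoustic features rather than raw samples, but the control flow (continuous scoring followed by a threshold test) would be the same.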

The fact that the user has uttered the wake word basically means that the user is about to make a request (voice input). The processing in which the detection unit 29 detects that the user has uttered the wake word therefore corresponds to the processing of "detecting that voice input is started by the user" in the claims. When the detection unit 29 detects that the user has uttered the wake word, it notifies the voice recognition processing unit 30 and the switching unit 31 to that effect. Hereinafter, this notification is referred to as the "start notification".

The detection unit 29 also detects when the user, having uttered the wake word and begun uttering a request, finishes uttering the request. More specifically, after detecting the user's utterance of the wake word, the detection unit 29 continuously analyzes the input voice data that the voice input unit 28 accumulates in the voice buffer. If the sound pressure level of the voice reaches or exceeds a predetermined value and then remains at or below the predetermined value for a certain time or longer, the detection unit 29 detects that the utterance of the request has ended. The user is assumed to begin uttering the request within a certain period after uttering the wake word and to stop speaking for a while once the request is finished, so the utterance of the request can be regarded as finished when the sound pressure level of the voice remains at or below the predetermined value for a certain time or longer.
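The end-of-request rule above can be sketched as a scan over per-frame sound pressure levels. In this Python sketch the function name, the frame-based representation, and the parameters `level_threshold` and `hold_frames` are hypothetical. A frame is treated as quiet when its level is strictly below the threshold, which is one possible reading of the single "predetermined value" in the text.

```python
def find_request_end(levels, level_threshold, hold_frames):
    """Return the frame index at which the end of the request is detected.

    levels: per-frame sound pressure levels observed after the wake word.
    Returns None if no end of request is found.
    """
    speech_seen = False
    quiet_run = 0
    for index, level in enumerate(levels):
        if level >= level_threshold:
            # The user is speaking: remember it and restart the quiet count.
            speech_seen = True
            quiet_run = 0
        elif speech_seen:
            # Silence counts only after speech has been observed.
            quiet_run += 1
            if quiet_run >= hold_frames:
                return index
    return None
```

Requiring speech before counting silence keeps the pause between the wake word and the request from being mistaken for the end of the request, and resetting the count on each loud frame tolerates short pauses within the request.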

Note that the process by which the detection unit 29 detects the end of the user's utterance of the request corresponds to the process of "detecting the end of voice input by the user" in the claims. On detecting the end of the user's utterance of the request, the detection unit 29 notifies the voice recognition processing unit 30 and the switching unit 31 to that effect. Hereinafter, this notification is referred to as the "end notification".

On receiving the start notification and then the end notification from the detection unit 29, the voice recognition processing unit 30 generates processing request data based on the voice data stored in the voice buffer. The voice recognition processing unit 30 transmits the generated processing request data to the service providing server 10 via the network N.

The service providing server 10 receives the processing request data, recognizes the content of the request based on the received data, and executes the processing corresponding to that content. For convenience of explanation, this embodiment assumes that the content of a request follows one of two patterns. The first is a pattern that requests control of a device (for example, an air conditioner) provided in the vehicle interior space 2 and connected to the voice processing device 1 (hereinafter the "device control pattern"); an example of a request in this pattern is "Turn on the air conditioner." The second is a pattern that requests a voice dialogue (hereinafter the "voice dialogue pattern"); an example of a request in this pattern is "What's the weather today?"

When the content of the request follows the device control pattern, the service providing server 10 generates device control data with which the voice processing device 1 controls the device, and returns it to the voice recognition processing unit 30. When the content of the request follows the voice dialogue pattern, the service providing server 10 generates voice output control data for causing the voice processing device 1 to output a voice with predetermined content corresponding to the request (hereinafter the "response voice"), and returns it to the voice recognition processing unit 30. The voice output control data includes the voice data of the response voice.

When the voice recognition processing unit 30 receives device control data from the service providing server 10, it controls the device based on the device control data. A detailed description of this processing is omitted. When the voice recognition processing unit 30 receives voice output control data from the service providing server 10, it controls the voice output unit 26 based on the voice output control data so that the response voice is output from the speaker group. As will become clear later, at the time the voice recognition processing unit 30 receives the voice output control data from the service providing server 10, the operation mode of the voice processing device 1 is the normal mode, so the response voice can be output without any problem. When the voice output unit 26 is reproducing content, the voice recognition processing unit 30 outputs the response voice superimposed on the content voice. Alternatively, while the response voice is being output, the reproduction of the content may be temporarily interrupted or the volume of the content voice may be lowered.
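The two response types handled by the voice recognition processing unit 30 can be illustrated with a small dispatch sketch. Everything named here is an assumption made for illustration: the classes `DeviceControlData` and `VoiceOutputControlData`, their fields, and the callbacks `control_device` and `play_response` do not appear in the description, which does not define the data formats.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class DeviceControlData:
    """Hypothetical shape of a device-control-pattern response."""
    device: str
    command: str


@dataclass
class VoiceOutputControlData:
    """Hypothetical shape of a voice-dialogue-pattern response."""
    response_audio: bytes  # voice data of the response voice


ServerResponse = Union[DeviceControlData, VoiceOutputControlData]


def handle_server_response(
    response: ServerResponse,
    control_device: Callable[[str, str], None],
    play_response: Callable[[bytes], None],
) -> str:
    """Route a server response to device control or to response-voice output."""
    if isinstance(response, DeviceControlData):
        control_device(response.device, response.command)
        return "device_control"
    if isinstance(response, VoiceOutputControlData):
        # Assumed to run in the normal mode, when the speakers can output sound.
        play_response(response.response_audio)
        return "voice_dialogue"
    raise TypeError(f"unsupported response type: {type(response).__name__}")


log = []
kind_a = handle_server_response(
    DeviceControlData("air_conditioner", "on"),
    control_device=lambda device, command: log.append((device, command)),
    play_response=lambda audio: log.append(audio),
)
kind_b = handle_server_response(
    VoiceOutputControlData(b"weather-response"),
    control_device=lambda device, command: log.append((device, command)),
    play_response=lambda audio: log.append(audio),
)
```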

When the switching unit 31 receives the start notification from the detection unit 29 while the operation mode is the normal mode, it switches the operation mode to the beamforming mode. In response to this switch, the switching unit 31 outputs a control signal to the selector 17 to switch the state of the selector 17 from the speaker function state to the microphone function state. The switching unit 31 further outputs a control signal to the beamforming processing unit 22 to switch the state of the beamforming processing unit 22 from the off state to the on state.

Conversely, when the switching unit 31 receives the end notification from the detection unit 29 while the operation mode is the beamforming mode, it switches the operation mode to the normal mode. In response to this switch, the switching unit 31 outputs a control signal to the selector 17 to switch the state of the selector 17 from the microphone function state to the speaker function state. That is, the switching unit 31 stops the tweeters 7R and 7L from functioning as microphones and cancels the suspension of their voice output function. The switching unit 31 further outputs a control signal to the beamforming processing unit 22 to switch the state of the beamforming processing unit 22 from the on state to the off state.
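The mode switching performed by the switching unit 31 amounts to a two-state machine, sketched below. The names `Mode`, `SelectorState`, and `SwitchingUnit` are hypothetical, and the control signals sent to the selector 17 and the beamforming processing unit 22 are modeled simply as attribute updates; the actual signaling is not specified in the description.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal mode"
    BEAMFORMING = "beamforming mode"


class SelectorState(Enum):
    SPEAKER = "speaker function state"
    MICROPHONE = "microphone function state"


class SwitchingUnit:
    """Hypothetical model of the switching unit 31."""

    def __init__(self) -> None:
        self.mode = Mode.NORMAL
        self.selector = SelectorState.SPEAKER  # stands in for selector 17
        self.beamformer_on = False             # stands in for unit 22

    def on_start_notification(self) -> None:
        """Start notification: act only when currently in the normal mode."""
        if self.mode is not Mode.NORMAL:
            return
        self.mode = Mode.BEAMFORMING
        self.selector = SelectorState.MICROPHONE  # tweeters act as microphones
        self.beamformer_on = True

    def on_end_notification(self) -> None:
        """End notification: act only when currently in the beamforming mode."""
        if self.mode is not Mode.BEAMFORMING:
            return
        self.mode = Mode.NORMAL
        self.selector = SelectorState.SPEAKER     # voice output is restored
        self.beamformer_on = False


unit = SwitchingUnit()
unit.on_start_notification()
state_during_request = (unit.mode, unit.selector, unit.beamformer_on)
unit.on_end_notification()
state_after_request = (unit.mode, unit.selector, unit.beamformer_on)
```

Guarding each handler on the current mode mirrors the conditions in the text: a start notification has an effect only in the normal mode, and an end notification only in the beamforming mode.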

Through the above processing, the voice recognition service is provided, for example, in the following manner. Suppose that the content reproduction unit 27 is reproducing content and the accompanying content voice is being output, and that the operation mode of the voice processing device 1 is the normal mode. Suppose further that, in this situation, the user wishes to control an in-vehicle device or to carry out a voice dialogue, and utters the wake word. Then, by the functions of the voice processing device 1, the operation mode shifts from the normal mode to the beamforming mode, the output of the content voice by the speaker group is stopped, and the voice processing unit 13 enters a state in which beamforming processing is applied to the input voice.

When the user then utters a request, the voice signal corresponding to the request is subjected to beamforming processing before being output to the control unit 12. This improves the quality of the voice signal corresponding to the request and, in turn, the quality of the voice data for the request that is transmitted to the service providing server 10, which ultimately improves the accuracy with which the service providing server 10 recognizes the request. In addition, because the speaker group does not emit the content voice while the request is being uttered, the quality of the voice signal corresponding to the request, and the effects that follow from it, are improved in this respect as well.
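The description does not state which beamforming algorithm the beamforming processing unit 22 uses. Purely as an illustration of why combining two channels can improve the signal, the following sketch uses delay-and-sum, a standard textbook technique swapped in here as an assumption. The function name `delay_and_sum`, the two-channel layout, and the integer sample delay are all hypothetical.

```python
from typing import List, Sequence


def delay_and_sum(
    left: Sequence[float],
    right: Sequence[float],
    delay_samples: int,
) -> List[float]:
    """Two-channel delay-and-sum (illustrative; not the patented method).

    `delay_samples` > 0 means the target sound reaches the right channel
    that many samples later than the left channel. The right channel is
    shifted to line up with the left one, and the two are averaged, so the
    target adds coherently while sound from other directions does not.
    """
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    n = len(left)
    output = []
    for i in range(n):
        j = i + delay_samples
        # Samples shifted in from outside the buffer are treated as silence.
        aligned_right = right[j] if 0 <= j < n else 0.0
        output.append(0.5 * (left[i] + aligned_right))
    return output


# The same pulse arrives two samples later on the right channel.
left_channel = [0.0, 1.0, 2.0, 3.0, 0.0, 0.0]
right_channel = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0]
steered = delay_and_sum(left_channel, right_channel, delay_samples=2)
unsteered = delay_and_sum(left_channel, right_channel, delay_samples=0)
```

With the correct delay the pulse is reconstructed at full amplitude in `steered`, whereas `unsteered` smears it at half amplitude across both arrival times.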

As soon as the user finishes uttering the request, the operation mode promptly shifts from the beamforming mode to the normal mode. As described above, in the normal mode the speaker group (including the tweeters 7R and 7L) is able to output voice, so when the request asks for a voice dialogue, the response voice to the request can be output without any problem. Moreover, the period during which the speaker group stops emitting the content voice is only the short time in which the user utters the request, so its effect on the user is extremely limited.

As described in detail above, in this embodiment the voice processing device 1, installed in the vehicle interior space 2 in which the tweeters 7R and 7L (a plurality of speakers) are arranged, is provided with the beamforming processing unit 22, which performs beamforming processing on a plurality of voice signals. When the user makes a request (voice input), the voice output function of the tweeters 7R and 7L is stopped, the tweeters 7R and 7L are made to function as microphones, and the voice signals output by the tweeters 7R and 7L functioning as microphones are input to the beamforming processing unit 22.

With this configuration, voice signals from a plurality of microphones can be input to the voice processing device 1 not by adding dedicated microphones but by using the tweeters 7R and 7L that already exist in the space where the voice processing device 1 is installed. The function of executing beamforming processing can therefore be implemented in the voice processing device 1 while suppressing an increase in cost.

Next, an operation example of the voice processing device 1 is described with reference to a flowchart. FIG. 5 is a flowchart showing a voice processing method performed by the voice processing device 1. As shown in FIG. 5, the detection unit 29 of the voice processing device 1 detects that voice input is started by the user (step SA1). As described above, in this embodiment the detection unit 29 detects that the user has uttered the wake word. Next, the switching unit 31 of the voice processing device 1 stops the voice output function of the plurality of speakers, causes the tweeters 7R and 7L to function as microphones, and causes voice signals to be input to the beamforming processing unit 22 from the tweeters 7R and 7L functioning as microphones (step SA2). As described above, in this embodiment the switching unit 31 shifts the operation mode to the beamforming mode.

Although one embodiment of the present invention has been described above, the embodiment is merely one concrete example of carrying out the present invention, and the technical scope of the present invention must not be interpreted restrictively on its basis. That is, the present invention can be carried out in various forms without departing from its gist or its main features.

For example, the above embodiment describes a case in which the present invention is applied to the voice processing device 1 of the voice recognition system 9, but the voice processing device to which the present invention is applied is not limited to a device that, as in this embodiment, provides a voice recognition service in cooperation with a server. The present invention is widely applicable to voice processing devices that are required to improve the quality of a voice signal based on picked-up voice by performing beamforming processing on it. As one example, the present invention can be applied to a voice processing device that constitutes a hands-free call system.

In the above embodiment, the voice processing device 1 is provided in the vehicle interior space 2, but the space in which the voice processing device 1 is provided is not limited to the vehicle interior space 2. The space need only be one in which speakers capable of functioning as microphones already exist. As one example, the voice processing device 1 may be provided in a room of an office or a house.

In the above embodiment, the detection unit 29 detects that voice input (a request) has started when the wake word is uttered. After detecting the utterance of the wake word, the detection unit 29 detects that the voice input (request) has ended when the sound pressure level of the voice remains at or below a predetermined value for a certain period of time or longer. However, the method by which the detection unit 29 detects the start and end of voice input is not limited to the method exemplified in the embodiment, and a method appropriate to the way the user performs voice input may be adopted. For example, in a system in which the user operates a predetermined switch at the start and end of voice input (or holds the switch down throughout the voice input), the detection unit 29 may detect the start and end of voice input based on the operation of the switch.

As another example, when the user's second and subsequent utterances in a voice dialogue between the user and the voice processing device 1 do not include the wake word, the detection unit 29 may execute the following processing. After the response voice is output by the voice processing device 1, the detection unit 29 may detect the start of the user's voice input on the assumption that the user will speak, and may thereafter detect that the voice input has ended when the sound pressure level of the voice remains at or below a predetermined value for a certain period of time or longer.

The hardware configuration of the voice processing unit 13 shown in FIG. 3 is merely an example, and the hardware configuration is of course not limited to what is illustrated. For example, the echo canceller 21 may be omitted, and the order in which the echo cancellation processing, the noise cancellation processing, and the beamforming processing are performed is not limited to the illustrated order.

In the above embodiment, the plurality of speakers connected to the voice processing device 1 and made to function as microphones are the tweeters 7R and 7L, but other speakers provided in the vehicle interior space 2 may be used instead, provided that they function effectively as microphones.

In the above embodiment, the switching unit 31 controls the selector 17 to stop the voice output from the speaker group from the time it receives the start notification from the detection unit 29 until it receives the end notification. During that period, the switching unit 31 may also cooperate with the content reproduction unit 27 to pause the reproduction of the content. This configuration prevents the content from advancing while the voice output is stopped.

In the above embodiment, the voice processing device 1 may execute part or all of the processing executed by the service providing server 10. Conversely, the service providing server 10 (or an external device other than the service providing server 10) may execute part or all of the processing executed by the voice processing device 1.

1 Voice processing device
2 Vehicle interior space (predetermined space)
5 Dashboard
7R, 7L Tweeters (plurality of speakers, in-vehicle speakers)
22 Beamforming processing unit
29 Detection unit
31 Switching unit

Claims

A voice processing device installed in a predetermined space in which a plurality of speakers are arranged, the voice processing device being provided with a beamforming processing unit that performs beamforming processing on a plurality of voice signals, and comprising:
a detection unit that detects that voice input is started by a user; and
a switching unit that, when the detection unit detects the start of voice input, stops a voice output function of the plurality of speakers, causes the plurality of speakers to function as microphones, and causes voice signals to be input to the beamforming processing unit from the plurality of speakers functioning as microphones.

The voice processing device according to claim 1, wherein
the detection unit detects the end of voice input by the user, and
the switching unit, after causing the plurality of speakers to function as microphones, stops the plurality of speakers from functioning as microphones and cancels the suspension of the voice output function when the detection unit detects the end of voice input.

The voice processing device according to claim 1 or 2, wherein
the predetermined space is a vehicle interior space formed inside a vehicle, and
the plurality of speakers are in-vehicle speakers arranged apart from each other in the left-right direction in the vehicle interior space.

The voice processing device according to claim 3, wherein the plurality of speakers are two tweeters provided at both ends of a dashboard.

A voice processing method performed by a voice processing device that is installed in a predetermined space in which a plurality of speakers are arranged and that is provided with a beamforming processing unit that performs beamforming processing on a plurality of voice signals, the method comprising:
a step in which a detection unit of the voice processing device detects that voice input is started by a user; and
a step in which a switching unit of the voice processing device, when the detection unit detects the start of voice input, stops a voice output function of the plurality of speakers, causes the plurality of speakers to function as microphones, and causes voice signals to be input to the beamforming processing unit from the plurality of speakers functioning as microphones.