JP7449070B2 - Voice input device, voice input method and its program

Info

Publication number: JP7449070B2
Application number: JP2019197231A
Authority: JP
Inventors: 剛樹西川
Original assignee: Panasonic Intellectual Property Corp of America
Current assignee: Panasonic Intellectual Property Corp of America
Priority date: 2019-03-27
Filing date: 2019-10-30
Publication date: 2024-03-13
Anticipated expiration: 2039-10-30
Also published as: JP2020160430A; CN111754986A

Description

The present disclosure relates to a voice input device, a voice input method, and a program using the voice input method.

For example, Patent Document 1 discloses a speech recognition device that includes: a voice input start operation means that enables voice input in response to a user operation; a voice input means that acquires the user's voice; an utterance start time learning data holding means that holds a learned utterance start time, that is, the learned time from when the user operates the voice input start operation means until the user actually starts speaking; and a speech recognition means that compares a measured time with the learned utterance start time from the holding means, determines whether the timed voice is the user's input voice, and performs speech recognition when it is.

This speech recognition device learns for each user and uses the learned utterance start time to recognize whether a voice is the user's.

Japanese Unexamined Patent Application Publication No. 2006-313261

However, the technique disclosed in Patent Document 1 requires learning in advance the period from the time the user operates the voice input device until the user actually starts speaking. A conventional speech recognition device may therefore incur an increased amount of calculation for learning.

An object of the present disclosure is therefore to provide a voice input device, a voice input method, and a program that can suppress an increase in the amount of calculation by identifying a speaker through simple processing.

A voice input device according to one aspect of the present disclosure includes: an acquisition unit that acquires each voice uttered by one or more speakers; a storage unit that stores each of the voices uttered by the one or more speakers and acquired by the acquisition unit; a trigger input unit to which a trigger is input; an utterance start detection unit that, each time the trigger is input to the trigger input unit, detects from each of the voices stored in the storage unit a start position at which an utterance started; a speaker identification unit that identifies one of the one or more speakers based on at least a first time point, at which the trigger is input to the trigger input unit, and a second time point, which is the utterance start position detected by the utterance start detection unit from each of the voices; and an utterance timing registration unit that registers at least which of the first time point and the second time point is earlier. The speaker identification unit identifies one of the one or more speakers based on the first time point, the second time point, and a plurality of pieces of registration information, registered by the utterance timing registration unit, that indicate the timing of the second time point relative to the first time point.

Note that some specific aspects of these may be realized using a system, a method, an integrated circuit, a computer program, or a computer-readable recording medium such as a CD-ROM, or using any combination of a system, a method, an integrated circuit, a computer program, and a recording medium.

According to the voice input device and the like of the present disclosure, an increase in the amount of calculation can be suppressed by identifying the speaker through simple processing.

FIG. 1 is a diagram showing an example of the external appearance of a speaker recognition device according to an embodiment and a scene in which the speaker recognition device is used while speakers are speaking.
FIG. 2A is a block diagram showing an example of the speaker recognition device according to the embodiment.
FIG. 2B is a block diagram showing an example of another speaker recognition device according to the embodiment.
FIG. 3 is a flowchart showing the operation of the speaker recognition device when the first speaker speaks.
FIG. 4 is a diagram illustrating the time series of the first time point and the second time point for each uttered voice, for the case where the first speaker speaks and the case where the second speaker speaks.
FIG. 5 is a flowchart showing the operation of the speaker recognition device when the second speaker speaks.
FIG. 6 is a flowchart showing the operation of the speaker identification unit of the speaker recognition device according to the embodiment.

A voice input device according to one aspect of the present disclosure includes: an acquisition unit that acquires each voice uttered by one or more speakers; a storage unit that stores each of the voices uttered by the one or more speakers and acquired by the acquisition unit; a trigger input unit to which a trigger is input; an utterance start detection unit that, each time the trigger is input to the trigger input unit, detects from each of the voices stored in the storage unit a start position at which an utterance started; and a speaker identification unit that identifies one of the one or more speakers based on at least a first time point, at which the trigger is input to the trigger input unit, and a second time point, which is the utterance start position detected by the utterance start detection unit from each of the voices.

According to this, one of the one or more speakers can be identified from, for example, the temporal order of the first time point, at which a trigger from a speaker is detected, and the second time point of the voice uttered by that speaker. In other words, without learning the period from the first time point to the second time point, the device can identify which of the one or more speakers uttered the voice acquired by the acquisition unit.

This voice input device can therefore suppress an increase in the amount of calculation by identifying the speaker through simple processing.

In particular, the speaker of a voice can be identified based on the timing of the utterance relative to the first time point, so the voice input device can identify the speaker with a simple operation. Because the device is simple to operate, complications such as arranging multiple buttons on it can be avoided. For example, when the trigger input unit is a button, a single button is enough to identify which of the one or more speakers is speaking, which allows a simpler device configuration.

A voice input method according to another aspect of the present disclosure includes: acquiring each voice uttered by one or more speakers; storing each of the acquired voices uttered by the one or more speakers in a storage unit; receiving input of a trigger; detecting, each time the trigger is input, a start position at which an utterance started from each of the voices stored in the storage unit; and identifying one of the one or more speakers based on at least a first time point, at which the trigger is input, and a second time point, which is the utterance start position detected from each of the voices.

This voice input method provides the same effects as the voice input device described above.

A program according to another aspect of the present disclosure is a program for causing a computer to execute the voice input method.

This program also provides the same effects as the voice input device described above.

A voice input device according to another aspect of the present disclosure includes an utterance timing registration unit that registers at least which of the first time point and the second time point is earlier. The speaker identification unit identifies one of the one or more speakers based on the first time point, the second time point, and a plurality of pieces of registration information, registered by the utterance timing registration unit, that indicate the timing of the second time point relative to the first time point.

According to this, the temporal order of the first time point and the second time point can be registered in advance as a condition desired by the one or more speakers. The speaker identification unit can therefore identify one of the one or more speakers simply by determining whether the temporal order of the first time point and the second time point is indicated in the registration information. As a result, this voice input device can identify the speaker more reliably with simple processing.

In a voice input device according to another aspect of the present disclosure, when registering the utterance timing of each of the one or more speakers, the utterance timing registration unit registers first registration information and second registration information. The first registration information associates one of the one or more speakers with first time information indicating that the second time point, the start position of the utterance, is later than the first time point, at which the trigger is input to the trigger input unit. The second registration information associates another of the one or more speakers with second time information indicating that the second time point is earlier than the first time point.

According to this, a speaker can register a condition of inputting the trigger to the acquisition unit before starting to speak, or a condition of inputting the trigger to the acquisition unit after starting to speak. If the speakers register their conditions in advance in this way, the voice input device can identify the speaker simply and reliably without learning.

In a voice input device according to another aspect of the present disclosure, the speaker identification unit calculates the timing of the second time point relative to the first time point and compares the calculated timing result against the plurality of pieces of registration information. If the second time point is later than the first time point, it determines that the speaker who spoke is a first speaker; if the second time point is earlier than the first time point, it determines that the speaker who spoke is a second speaker different from the first speaker.
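The ordering rule in this aspect can be illustrated with a minimal Python sketch. The function name `identify_speaker`, the use of seconds as floats, and the return labels are assumptions made for illustration; the disclosure does not prescribe an implementation, and it does not say what happens when the two time points coincide, so the sketch returns `None` in that case.

```python
from typing import Optional


def identify_speaker(first_time: float, second_time: float) -> Optional[str]:
    """Identify the speaker from the order of trigger and utterance start.

    first_time  -- time at which the trigger was input (first time point)
    second_time -- detected start of the utterance (second time point)
    Both are in seconds on the same clock (an assumption of this sketch).
    """
    if second_time > first_time:
        # Utterance started after the trigger: first speaker.
        return "first speaker"
    if second_time < first_time:
        # Utterance started before the trigger: second speaker.
        return "second speaker"
    # Equal time points are not covered by the description.
    return None
```

No learned model of the delay between trigger and utterance is involved; a single comparison decides the result, which is the source of the reduced calculation load described above.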

According to this, the speaker identification unit can calculate the timing of the second time point relative to the first time point from the first time point, at which the trigger is input to the trigger input unit, and the second time point detected by the utterance start detection unit. The utterance start detection unit can thereby calculate a result indicating the timing, that is, whether the first time point is earlier or later than the second time point. By comparing this calculated result with the plurality of pieces of registration information, the utterance start detection unit can identify more reliably which of the one or more speakers is speaking.

When there are multiple speakers, registering, for example, the period from the first time point to the second time point makes it possible to identify which speaker is speaking.

In a voice input device according to another aspect of the present disclosure, the trigger input unit is a voice input interface that accepts input of a preset voice, and the preset voice is input to the trigger input unit as the trigger.

According to this, the speaker only has to utter a preset phrase, such as a wake-up word, for the voice input device to perform magic word recognition and identify the speaker. The voice input device therefore offers excellent operability.

In a voice input device according to another aspect of the present disclosure, the trigger input unit is an operation button provided on the voice input device, and an accepted operation input is input to the trigger input unit as the trigger.

According to this, the speaker can reliably input the trigger to the trigger input unit by operating it.

Note that some specific aspects of these may be realized using a system, a method, an integrated circuit, a computer program, or a computer-readable recording medium such as a CD-ROM, or using any combination of a system, a method, an integrated circuit, a computer program, or a recording medium.

Each of the embodiments described below shows a specific example of the present disclosure. The numerical values, shapes, materials, components, arrangement positions of the components, and the like shown in the following embodiments are examples and are not intended to limit the present disclosure. Among the components in the following embodiments, those not recited in the independent claims are described as optional components. The contents of all the embodiments may also be combined.

A voice input device, a voice input method, and a program therefor according to one aspect of the present disclosure are described in detail below with reference to the drawings.

(Embodiment)
<Configuration: Speaker recognition device 1>
FIG. 1 is a diagram showing an example of the external appearance of the speaker recognition device 1 according to the embodiment and a scene in which the speaker recognition device 1 is used while speakers are speaking. FIG. 1 illustrates a plurality of speakers sharing the speaker recognition device 1 and using it when they speak.

As shown in FIG. 1, the speaker recognition device 1 is a device that acquires voices uttered by one or more speakers and, based on the acquired voices, identifies which of the one or more speakers is speaking. That is, the speaker recognition device 1 acquires the voice uttered by each of the one or more speakers and identifies the speaker for each acquired voice. The speaker recognition device 1 is an example of a voice input device.

Note that the speaker recognition device 1 may acquire a conversation between a speaker and a conversation partner and, based on the acquired conversation, identify which of the two is speaking.

In the present embodiment, the speaker recognition device 1 acquires the voice uttered by each of the one or more speakers and identifies the speaker based on the timing of each acquired voice and the timing of the input trigger.

FIG. 1 of the present embodiment illustrates a first speaker and a second speaker, who are a plurality of speakers, each using the speaker recognition device 1 individually and each speaking. For example, the second speaker may use the speaker recognition device 1, shown by the two-dot chain line, after speech recognition for the first speaker has finished. That is, the speakers may use the speaker recognition device 1 at separate times and on separate occasions, or they may use it at the same time when the first speaker and the second speaker converse. The first speaker and the second speaker are examples of speakers. Note that the second speaker may be the conversation partner of the first speaker.

Here, the first speaker and the second speaker may speak in the same language or in two different languages. In either case, whether the first language spoken by the first speaker and the second language spoken by the second speaker are the same or different, the speaker recognition device 1 identifies, for each voice uttered by the first speaker and the second speaker, whether the speaker is the first speaker or the second speaker. The first language and the second language are, for example, Japanese, English, French, German, or Chinese.

In the present embodiment, the first speaker is the owner of the speaker recognition device 1, and registration of the trigger input to the speaker recognition device 1 and of the timing of a speaker's utterance relative to that trigger is performed mainly by the first speaker. In other words, the first speaker is a user of the speaker recognition device 1 who understands how to operate it.

In the present embodiment, when a speaker speaks after inputting a trigger to the speaker recognition device 1, the device recognizes that, for example, the first speaker has spoken. When a trigger is input to the speaker recognition device 1 after another speaker has spoken, the device recognizes that, for example, the second speaker has spoken.

The speaker recognition device 1 is a mobile terminal that the first speaker can carry, such as a smartphone or a tablet terminal.

FIG. 2A is a block diagram showing the speaker recognition device 1 according to the embodiment.

As shown in FIG. 2A, the speaker recognition device 1 includes an utterance timing registration unit 25, an acquisition unit 21, a storage unit 22, a trigger input unit 23, an utterance start detection unit 24, a speaker identification unit 26, an output unit 31, and a power supply unit 35.

[Utterance timing registration unit 25]
The utterance timing registration unit 25 registers at least which of the first time point and the second time point is earlier. Specifically, the utterance timing registration unit 25 is a registration device that registers the timing of each of the one or more speakers' utterances relative to the input of the trigger.

The utterance timing registration unit 25 can set a desired condition in response to an operation by the one or more speakers and register the set condition. Specifically, when registering the utterance timing of each of the one or more speakers, the utterance timing registration unit 25 registers first registration information, which associates one of the one or more speakers with first time information indicating that the second time point, the start position of the utterance, is later than the first time point, at which the trigger is input to the trigger input unit 23. As a concrete example, a condition is set that the first speaker starts speaking after the trigger is input to the trigger input unit 23, and the utterance timing registration unit 25 registers first registration information that associates the first time information indicating this condition with label A. The utterance timing registration unit 25 contains a memory that stores the set first registration information. Note that the first registration information set by the utterance timing registration unit 25 may instead be stored in the storage unit 22.

When registering the utterance timings, the utterance timing registration unit 25 also registers second registration information, which associates another of the one or more speakers with second time information indicating that the second time point, the start position of the utterance, is earlier than the first time point, at which the trigger is input to the trigger input unit 23. As a concrete example, a condition is set that the second speaker starts speaking before the trigger is input to the trigger input unit 23, and the utterance timing registration unit 25 registers second registration information that associates the second time information indicating this condition with label B. The utterance timing registration unit 25 contains a memory that stores the set second registration information. Note that the second registration information set by the utterance timing registration unit 25 may instead be stored in the storage unit 22.
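One way to picture the first and second registration information is as a small table that pairs a piece of time information with a speaker label. The following Python sketch is only an illustration under assumptions: the names `TimeInfo`, `RegistrationInfo`, and `UtteranceTimingRegistry` are hypothetical, and the in-memory dictionary stands in for the memory that the description says is contained in the utterance timing registration unit 25.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class TimeInfo(Enum):
    # First time information: the utterance starts after the trigger.
    FIRST = "second time point is later than first time point"
    # Second time information: the utterance starts before the trigger.
    SECOND = "second time point is earlier than first time point"


@dataclass(frozen=True)
class RegistrationInfo:
    time_info: TimeInfo
    label: str  # e.g. "A" for the first speaker, "B" for the second


class UtteranceTimingRegistry:
    """Hypothetical stand-in for the utterance timing registration unit 25."""

    def __init__(self) -> None:
        self._by_time_info: Dict[TimeInfo, RegistrationInfo] = {}

    def register(self, time_info: TimeInfo, label: str) -> RegistrationInfo:
        info = RegistrationInfo(time_info, label)
        self._by_time_info[time_info] = info
        return info

    def label_for(self, time_info: TimeInfo) -> Optional[str]:
        info = self._by_time_info.get(time_info)
        return info.label if info is not None else None

    def all_registrations(self) -> List[RegistrationInfo]:
        # What would be handed to the speaker identification unit 26.
        return list(self._by_time_info.values())


registry = UtteranceTimingRegistry()
registry.register(TimeInfo.FIRST, "A")   # first registration information
registry.register(TimeInfo.SECOND, "B")  # second registration information
```

Keying the table by time information keeps the later lookup to a single dictionary access once the order of the two time points is known.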

For example, if the first speaker speaks under the condition of the first registration information set with label A and prompts the second speaker to speak under the condition of the second registration information set with label B (that is, the two speakers agree in advance on which condition each will use), the different speakers can speak under different conditions. Registering the utterance conditions individually with the utterance timing registration unit 25 thus gives the speaker identification unit 26 a basis for identifying the speaker.

The utterance timing registration unit 25 outputs the plurality of pieces of registered registration information, such as the first registration information and the second registration information, to the speaker identification unit 26.

Note that the utterance timing registration unit 25 can also set the period from the first time point, at which the trigger is input to the trigger input unit 23, to the second time point, at which the speaker starts speaking. That is, the utterance timing registration unit 25 may register, as registration information, a condition that the speaker starts speaking XX seconds after, or XX seconds or more after, the first time point at which the trigger is input to the trigger input unit 23. It may also register, as registration information, a condition that the trigger is input to the trigger input unit 23 XX seconds after, or XX seconds or more after, the speaker starts speaking. In other words, the utterance timing registration unit 25 may set the second time point to XX seconds after, or XX seconds or more after, the first time point, or set the first time point to XX seconds after, or XX seconds or more after, the second time point, and register the set information as registration information. Here, "XX" stands for an arbitrary number and does not necessarily denote the same length of time in each case.
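The period-based conditions can be sketched as registrations that carry a direction and a minimum gap in seconds. This is a hedged illustration, not the disclosed method: `TimedRegistration`, `matches`, and `identify` are hypothetical names, and the tie-breaking rule (prefer the matching registration with the largest minimum gap) is an assumption added so that several speakers can share the same direction.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TimedRegistration:
    label: str
    utterance_after_trigger: bool  # True: second time point follows first
    min_gap_s: float               # the "XX seconds or more" condition


def matches(reg: TimedRegistration, first_time: float, second_time: float) -> bool:
    """Check whether the observed time points satisfy one registration."""
    if reg.utterance_after_trigger:
        gap = second_time - first_time
    else:
        gap = first_time - second_time
    return gap > 0 and gap >= reg.min_gap_s


def identify(regs: Sequence[TimedRegistration],
             first_time: float, second_time: float) -> Optional[str]:
    """Return the label of the most specific matching registration, if any."""
    candidates = [r for r in regs if matches(r, first_time, second_time)]
    if not candidates:
        return None
    # Assumption of this sketch: the largest satisfied minimum gap wins.
    return max(candidates, key=lambda r: r.min_gap_s).label


regs = [
    TimedRegistration("A", utterance_after_trigger=True, min_gap_s=0.0),
    TimedRegistration("C", utterance_after_trigger=True, min_gap_s=2.0),
    TimedRegistration("B", utterance_after_trigger=False, min_gap_s=0.0),
]
```

With these example registrations, a speaker who waits at least two seconds after the trigger is distinguished from one who speaks immediately after it, even though both speak after the trigger.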

Note that the utterance timing registration unit 25 may register the length of time for which the trigger is continuously input to the trigger input unit 23 as registration information. For example, when the trigger input unit 23 is an operation button, the utterance timing registration unit 25 may also register how long the button is held down (that is, how long input to the trigger input unit 23 continues) in accordance with the timing of the speaker's utterance. The speaker identification unit 26 can then use the registered hold time as a basis for identifying the speaker.

For example, the utterance timing registration unit 25 may register, as registration information, a condition that the trigger continues to be input to the trigger input unit 23 for XX seconds, starting XX seconds after, or XX seconds or more after, the first time point at which the trigger was input to the trigger input unit 23. It may also register, as registration information, a condition that the trigger continues to be input to the trigger input unit 23 for XX seconds, starting XX seconds after, or XX seconds or more after, the speaker starts speaking.
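Using the hold time of an operation button as a basis for identification could look like the following Python sketch. The names `HoldRegistration` and `identify_by_hold`, and the representation of each condition as a half-open range of hold durations, are assumptions for illustration; the description only states that the registered hold time can serve as a basis for the decision.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class HoldRegistration:
    label: str
    min_hold_s: float  # inclusive lower bound on the button hold time
    max_hold_s: float  # exclusive upper bound on the button hold time


def identify_by_hold(regs: Sequence[HoldRegistration],
                     press_time: float, release_time: float) -> Optional[str]:
    """Return the label whose registered hold range contains the hold time."""
    held = release_time - press_time
    for reg in regs:
        if reg.min_hold_s <= held < reg.max_hold_s:
            return reg.label
    return None


hold_regs = [
    HoldRegistration("A", 0.0, 1.0),           # short press
    HoldRegistration("B", 1.0, float("inf")),  # long press
]
```

In practice such a check would presumably be combined with the order of the first and second time points rather than used alone.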

[Acquisition unit 21]
The acquisition unit 21 acquires the voices of one or more speakers when they speak. That is, the acquisition unit 21 acquires the voice uttered by each of the one or more speakers, converts the acquired voice into an audio signal, and outputs the converted audio signal to the storage unit 22.

The acquisition unit 21 is a microphone unit that acquires an audio signal by converting sound into an audio signal. Note that the acquisition unit 21 may instead be an input interface electrically connected to a microphone. That is, the acquisition unit 21 may acquire the audio signal from the microphone. The acquisition unit 21 may also be a microphone array unit made up of a plurality of microphones. Because the acquisition unit 21 only needs to be able to pick up the voices of speakers present around the speaker recognition device 1, its placement in the speaker recognition device 1 is not particularly limited.

[Storage unit 22]
The storage unit 22 stores audio information for the voice of each of the one or more speakers acquired by the acquisition unit 21. Specifically, the storage unit 22 stores audio information for the voice indicated by the audio signal obtained from the acquisition unit 21. That is, audio information for the voice uttered by each of the one or more speakers is stored in the storage unit 22 automatically.

The storage unit 22 resumes recording when the speaker recognition device 1 is activated. Alternatively, the storage unit 22 may start recording at the point when a speaker first inputs a trigger to the trigger input unit 23 after the speaker recognition device 1 is activated. That is, the storage unit 22 may start recording audio in response to the first trigger input by a speaker to the trigger input unit 23. The storage unit 22 may also suspend or stop recording audio in response to a trigger input to the trigger input unit 23.

Note that because the capacity of the storage unit 22 is limited, the audio information stored in the storage unit 22 may be deleted automatically, oldest first, once a specified capacity is reached. That is, the audio information may include information indicating the date and time (a time stamp) in addition to the speaker's voice. The storage unit 22 deletes old audio information based on the information indicating the date and time.
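The oldest-first deletion described above can be sketched as follows. This is only an illustration: the class names `AudioRecord` and `BoundedAudioStore`, the byte-based capacity, and the use of a floating-point time stamp are assumptions, since the description only states that old audio information is deleted based on its time stamp.

```python
from dataclasses import dataclass


@dataclass
class AudioRecord:
    timestamp: float  # date and time of the recording, in seconds (assumed representation)
    samples: bytes    # raw audio payload


class BoundedAudioStore:
    """Keeps audio records and evicts the oldest ones once a byte budget is exceeded."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.records = []
        self.used_bytes = 0

    def add(self, record: AudioRecord) -> None:
        self.records.append(record)
        self.used_bytes += len(record.samples)
        # Order by time stamp so that the oldest record is always first.
        self.records.sort(key=lambda r: r.timestamp)
        # Delete oldest records until the store fits again, always keeping the newest one.
        while self.used_bytes > self.capacity_bytes and len(self.records) > 1:
            oldest = self.records.pop(0)
            self.used_bytes -= len(oldest.samples)


store = BoundedAudioStore(capacity_bytes=10)
store.add(AudioRecord(timestamp=1.0, samples=b"aaaa"))
store.add(AudioRecord(timestamp=2.0, samples=b"bbbb"))
store.add(AudioRecord(timestamp=3.0, samples=b"cccc"))  # exceeds 10 bytes, evicts the record at 1.0
```

Sorting on every insertion keeps the sketch short; a real store would more likely append in time order and drop from the front.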

The storage unit 22 is configured with an HDD (Hard Disk Drive), a semiconductor memory, or the like.

[Trigger input unit 23]
A trigger is input to the trigger input unit 23 by a speaker. As a specific example, the trigger input unit 23 accepts input of a preset trigger from the speaker before the first speaker speaks. The trigger input unit 23 also accepts input of a preset trigger from the speaker after the second speaker speaks. That is, for the first speaker, the trigger input unit 23 accepts the trigger before the first speaker speaks, and for the second speaker, it accepts the trigger after the second speaker speaks. The trigger input unit 23 accepts a trigger input from a speaker each time any of the one or more speakers speaks.

Note that, in response to an operation input from the speaker, the trigger input unit 23 may start recording audio to the storage unit 22, or may suspend or stop recording audio to the storage unit 22.

When the trigger input unit 23 detects an input trigger, it generates an input signal and outputs the generated input signal to the speech start detection unit 24 and the speaker identification unit 26. The input signal includes information (a time stamp) indicating the first time point.

In the present embodiment, the trigger input unit 23 is a single operation button provided on the speaker recognition device 1. In this case, the operation input received when a speaker presses the operation button is input to the trigger input unit 23 as the trigger. That is, in the present embodiment, the trigger is an input signal produced by the speaker's operation of the trigger input unit 23. Note that two or more trigger input units 23 may be provided on the speaker recognition device 1.

Note that the trigger input unit 23 may be a touch sensor provided integrally with the display unit 33 of the speaker recognition device 1. In this case, the trigger input unit 23 may be displayed on the display unit 33 of the speaker recognition device 1 as an operation button that accepts operation input from the speaker.

FIG. 2B is a block diagram showing an example of another speaker recognition device 1 according to the embodiment.

As shown in FIG. 2B, the trigger input unit 23a may be a voice input interface that accepts input of a preset voice. In this case, the preset voice is input to the trigger input unit 23a as the trigger via the acquisition unit 21a. That is, in this case, the trigger is the voice uttered by the speaker and input to the trigger input unit 23a, which serves as the input signal. Here, the preset voice is a wake-up word or the like. For example, if the speaker recognition device 1 is configured so that the wake-up word "OK! ○○, ××" corresponds to the first speaker and "○○, OK! ××" corresponds to the second speaker, it identifies the speaker as the first speaker when the speaker utters "OK! ○○, ××" and as the second speaker when the speaker utters "○○, OK! ××". Note that when the trigger input unit 23a is a voice input interface, assigning a speaker to each voice content makes it possible to reliably distinguish the first speaker from the second speaker.
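The assignment of a speaker to each wake-up word can be pictured as a simple lookup. The sketch below assumes that speech recognition has already turned the wake-up word into text, and the phrases "alpha" and "beta" are hypothetical stand-ins for the unspecified ○○ and ×× in the description.

```python
from typing import Optional

# Hypothetical wake-up phrases; the description leaves the actual words unspecified.
WAKE_WORD_TO_SPEAKER = {
    "ok alpha, beta": "first speaker",
    "alpha, ok beta": "second speaker",
}


def speaker_from_wake_word(recognized_text: str) -> Optional[str]:
    """Return the speaker registered for a recognized wake-up phrase, or None."""
    # Normalize case, drop exclamation marks, and collapse repeated whitespace.
    normalized = " ".join(recognized_text.lower().replace("!", "").split())
    return WAKE_WORD_TO_SPEAKER.get(normalized)
```

A phrase that is not registered returns `None`, leaving the caller to decide whether to ignore the input or fall back to another identification method.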

[Speech start detection unit 24]
As shown in FIGS. 1 and 2A, the speech start detection unit 24 is a detection device that, each time a trigger is input to the trigger input unit 23, detects the start position at which an utterance began in each voice stored in the storage unit 22.

Specifically, in the voice of each piece of audio information stored in the storage unit 22, the speech start detection unit 24 detects the start position of a voice that the first speaker uttered between the first time point, at which the speaker input a trigger to the trigger input unit 23, and the end of a prescribed period after that time point, the voice being the one indicated by the audio information stored as a result of the first speaker's utterance. That is, the speech start detection unit 24 detects the start position, which is the second time point at which the first speaker's utterance began, within the interval from the first time point, at which the trigger input unit 23 detected the trigger input, until the prescribed period has elapsed.

Likewise, in the voice of each piece of audio information stored in the storage unit 22, the speech start detection unit 24 detects the start position of a voice that the second speaker began uttering between a point the prescribed period before the first time point and the first time point itself, at which the speaker input a trigger to the trigger input unit 23, the voice being the one indicated by the audio information stored as a result of the second speaker's utterance. That is, the speech start detection unit 24 detects the start position, which is the second time point at which the second speaker's utterance began, within the interval extending back from the first time point by the prescribed period.

For each voice, the speech start detection unit 24 generates start position information indicating the start position of the voice and outputs the generated start position information to the speaker identification unit 26. The start position information is information (a time stamp) indicating the start position, that is, the time point at which the speaker's utterance began.
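The description does not say how the start position is found within the stored audio. As one possible illustration, the sketch below uses a simple energy threshold over time-stamped frames, which is an assumed technique rather than the method of the embodiment. The names `Frame`, `detect_start_position`, and `energy_threshold` are likewise hypothetical.

```python
from typing import Optional, Sequence, Tuple

Frame = Tuple[float, float]  # (time stamp in seconds, frame energy), an assumed representation


def detect_start_position(
    frames: Sequence[Frame],
    search_from: float,
    search_to: float,
    energy_threshold: float,
) -> Optional[float]:
    """Return the time stamp of the first frame inside [search_from, search_to]
    whose energy reaches the threshold while the preceding frame was below it."""
    previous_active = False
    for timestamp, energy in frames:
        active = energy >= energy_threshold
        if active and not previous_active and search_from <= timestamp <= search_to:
            return timestamp
        previous_active = active
    return None


frames = [(0.0, 0.1), (0.5, 0.9), (1.0, 0.8), (1.5, 0.1), (2.0, 0.7)]
first_time_point = 1.2   # time stamp carried by the input signal
prescribed_period = 1.0  # assumed value for the example

# Utterance after the trigger: search from the first time point forward.
start_after = detect_start_position(
    frames, first_time_point, first_time_point + prescribed_period, 0.5)
# Utterance before the trigger: search backward from the first time point.
start_before = detect_start_position(
    frames, first_time_point - prescribed_period, first_time_point, 0.5)
```

The two calls correspond to the two search intervals described above: the prescribed period following the first time point, and the prescribed period preceding it.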

[Speaker identification unit 26]
The speaker identification unit 26 is a device that identifies one speaker from among the one or more speakers based on the first time point, at which a trigger is input to the trigger input unit 23, the second time point, which is the utterance start position that the speech start detection unit 24 detected from each voice, and the plurality of pieces of registration information, registered by the utterance timing registration unit 25, that indicate the timing of the second time point relative to the first time point.

Specifically, when the speaker identification unit 26 obtains the input signal indicating the first time point from the trigger input unit 23 and the start position information from the speech start detection unit 24, it calculates the timing of the second time point relative to the first time point. That is, the speaker identification unit 26 compares the first time point indicated by the input signal with the second time point indicated by the start position information and calculates their temporal order. The result calculated by the speaker identification unit 26 indicates the timing of the second time point relative to the first time point.

When the speaker identification unit 26 obtains the registration information from the utterance timing registration unit 25, it checks the calculated result, which indicates the timing of the second time point relative to the first time point, against the plurality of pieces of registration information. If the second time point is later than the first time point, it determines that the speaker who spoke is the first speaker and thereby identifies the speaker. If the second time point is earlier than the first time point, it determines that the speaker who spoke is the second speaker and thereby identifies the speaker.

More specifically, the speaker identification unit 26 determines which speaker is speaking from the voices uttered by the one or more speakers during the prescribed period, which extends before and after the first time point at which the trigger input was accepted by the trigger input unit 23. Taking the first time point as the reference, the speaker identification unit 26 selects, from among the voices stored in the storage unit 22, the most recent (latest) voice uttered by a speaker either in the interval extending back from the first time point by the prescribed period or in the interval from the first time point until the prescribed period has elapsed. The speaker identification unit 26 identifies one of the speakers based on the selected voice.

Here, the prescribed period is, for example, a few seconds such as 1 second or 2 seconds, and may be, for example, 10 seconds. The speaker identification unit 26 thus identifies the speaker based on the first time point and the second time point of the voice most recently uttered by each of the one or more speakers. This avoids the problem that, if the speaker identification unit 26 were to identify a speaker based on a voice that is too old, it could not accurately identify the speaker who spoke most recently.
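The selection of the most recent voice within the prescribed period can be sketched as follows, assuming that each stored voice is represented by the time stamp of its start position. The function name `select_latest_onset` and the example values are illustrative only.

```python
from typing import Optional, Sequence


def select_latest_onset(
    onsets: Sequence[float],
    first_time_point: float,
    prescribed_period: float,
) -> Optional[float]:
    """Return the latest utterance start within the prescribed period before
    or after the first time point, or None if no utterance falls in that range."""
    in_window = [
        t for t in onsets
        if first_time_point - prescribed_period <= t <= first_time_point + prescribed_period
    ]
    return max(in_window) if in_window else None


# Utterance start times in seconds; 1.0 is too old and 25.0 is too late to be considered.
stored_onsets = [1.0, 8.5, 9.7, 25.0]
selected = select_latest_onset(stored_onsets, first_time_point=10.0, prescribed_period=2.0)
```

Restricting candidates to the window is what prevents an old utterance, such as the one at 1.0 seconds here, from being attributed to the current trigger.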

The speaker identification unit 26 outputs result information, which includes the result of identifying the speaker, to the output unit 31. The result information includes information indicating which of the one or more speakers was identified. For example, the result information includes information indicating that the audio information stored as a result of the speaker's utterance belongs to the identified first speaker, or information indicating that it belongs to the identified second speaker.

[Display unit 33]
The display unit 33 is, for example, a monitor such as a liquid crystal panel or an organic EL panel. The display unit 33 displays, as text, the speaker indicated by the result information obtained from the speaker identification unit 26. For example, when a speaker speaks, the display unit 33 shows an indication that the speaker who spoke is the first speaker, or an indication that the speaker who spoke is the second speaker. The display unit 33 is an example of the output unit 31.

Note that the speaker recognition device 1 may include an audio output unit. In this case, the audio output unit may be a loudspeaker that outputs, as audio, the speaker indicated by the result information obtained from the speaker identification unit 26. That is, when a speaker speaks, the audio output unit outputs audio indicating that the speaker indicated by the result information is the first speaker, or audio indicating that the speaker indicated by the result information is the second speaker. The audio output unit is an example of the output unit 31.

[Power supply unit 35]
The power supply unit 35 is, for example, a primary battery or a secondary battery, and is electrically connected via wiring to the utterance timing registration unit 25, the acquisition unit 21, the storage unit 22, the trigger input unit 23, the speech start detection unit 24, the speaker identification unit 26, the output unit 31, and so on. The power supply unit 35 supplies power to the utterance timing registration unit 25, the acquisition unit 21, the storage unit 22, the trigger input unit 23, the speech start detection unit 24, the speaker identification unit 26, the output unit 31, and so on.

<Operation>
The operation of the speaker recognition device 1 configured as described above will now be described.

FIG. 3 is a flowchart showing the operation of the speaker recognition device 1 when the first speaker speaks. FIG. 4 is a diagram illustrating the time series of the first time point and the second time point for each uttered voice, for the case where the first speaker speaks and the case where the second speaker speaks.

In FIGS. 3 and 4, it is assumed that first registration information is registered in the memory of the utterance timing registration unit 25. The first registration information associates label A with first time information indicating the condition that the first speaker starts speaking after a speaker inputs a trigger to the trigger input unit 23. It is likewise assumed that second registration information is registered in the memory of the utterance timing registration unit 25. The second registration information associates label B with second time information indicating the condition that the second speaker starts speaking before a speaker inputs a trigger to the trigger input unit 23.

As shown in FIGS. 2A, 3, and 4, first, a trigger for starting acquisition of each voice by the acquisition unit 21 is input to the trigger input unit 23. That is, before one of the speakers speaks, the trigger input unit 23 accepts input of the preset trigger from the speaker. The trigger input unit 23 thereby detects the trigger input by the speaker (S11). On detecting the trigger input, the trigger input unit 23 generates an input signal and outputs the generated input signal to the speech start detection unit 24 and the speaker identification unit 26.

Next, the acquisition unit 21 acquires the voice uttered by the one speaker (S12). The acquisition unit 21 converts the acquired voice into an audio signal and outputs the converted audio signal to the storage unit 22.

Next, the storage unit 22 stores audio information for the voice indicated by the audio signal acquired by the acquisition unit 21 (S13). That is, audio information for the most recent voice uttered by the one speaker is stored in the storage unit 22 automatically.

Next, on obtaining the input signal from the trigger input unit 23, the speech start detection unit 24 detects the start position (second time point) at which the utterance began in the voice of the audio information stored in the storage unit 22 (S14). Specifically, the speech start detection unit 24 detects the start position of the voice that the one speaker uttered immediately after the speaker input the trigger to the trigger input unit 23, the voice being the one indicated by the audio information stored as a result of that speaker's utterance. The speech start detection unit 24 generates start position information indicating the start position of the voice and outputs the generated start position information to the speaker identification unit 26.

Next, the speaker identification unit 26 identifies either the first speaker or the second speaker based on the first time point, at which the trigger is input to the trigger input unit 23, the second time point, which is the utterance start position that the speech start detection unit 24 detected from each voice, and the plurality of pieces of registration information, registered by the utterance timing registration unit 25, that indicate the timing of the second time point relative to the first time point (S15). In FIG. 3, because the first time point is earlier than the second time point, the speaker identification unit 26 identifies the voice of the start position information (the uttered voice) as belonging to the first speaker. That is, the speaker identification unit 26 identifies the one speaker as the first speaker.

Next, the speaker identification unit 26 outputs result information, which includes the result of identifying the first speaker, to the output unit 31 (S16).

The speaker recognition device 1 then ends the process.

FIG. 5 is a flowchart showing the operation of the speaker recognition device 1 when the second speaker speaks. Description of processing that is the same as in FIG. 3 is omitted where appropriate.

As shown in FIGS. 2A, 4, and 5, first, the acquisition unit 21 acquires the voice uttered by the other speaker (S21). The acquisition unit 21 converts the acquired voice into an audio signal and outputs the converted audio signal to the storage unit 22.

Next, a trigger for starting acquisition of each voice by the acquisition unit 21 is input to the trigger input unit 23. That is, after the other speaker has spoken, the trigger input unit 23 accepts input of the preset trigger from the speaker. The trigger input unit 23 thereby detects the trigger input by the speaker (S22). On detecting the trigger input, the trigger input unit 23 generates an input signal and outputs the generated input signal to the speech start detection unit 24 and the speaker identification unit 26.

Next, the storage unit 22 stores audio information for the voice indicated by the audio signal acquired by the acquisition unit 21 (S13). That is, audio information for the most recent voice uttered by the other speaker is stored in the storage unit 22 automatically.

Next, on obtaining the input signal from the trigger input unit 23, the speech start detection unit 24 detects the start position (second time point) at which the utterance began in the voice of the audio information stored in the storage unit 22 (S14). Specifically, the speech start detection unit 24 detects the start position of the voice that the other speaker uttered immediately after the speaker input the trigger to the trigger input unit 23, the voice being the one indicated by the audio information stored as a result of that speaker's utterance. The speech start detection unit 24 generates start position information indicating the start position of the voice and outputs the generated start position information to the speaker identification unit 26.

Next, the speaker identification unit 26 identifies either the first speaker or the second speaker based on the first time point, at which the trigger is input to the trigger input unit 23, the second time point, which is the utterance start position that the speech start detection unit 24 detected from each voice, and the plurality of pieces of registration information, registered by the utterance timing registration unit 25, that indicate the timing of the second time point relative to the first time point (S15). In FIG. 5, because the second time point is earlier than the first time point, the speaker identification unit 26 identifies the voice of the start position information as belonging to the second speaker. That is, the speaker identification unit 26 identifies the other speaker as the second speaker.

Next, the speaker identification unit 26 outputs result information, which includes the result of identifying the second speaker, to the output unit 31 (S16).

FIG. 6 is a flowchart showing the operation of the speaker identification unit 26 of the speaker recognition device 1 according to the embodiment.

As shown in FIGS. 3, 5, and 6, first, when the speaker identification unit 26 obtains the input signal indicating the first time point from the trigger input unit 23 and the start position information indicating the second time point from the speech start detection unit 24, it calculates the timing of the second time point relative to the first time point (S31). That is, the speaker identification unit 26 compares the first time point with the second time point and calculates their temporal order.

The speaker identification unit 26 checks the calculated result, which indicates the timing of the second time point relative to the first time point, against the registration information and determines whether the first time point is earlier than the second time point (S32).

If the first time point is earlier than the second time point (YES in S32), the speaker identification unit 26 determines that the result matches the content indicated by the first registration information among the registration information, and determines that the speaker who spoke is the first speaker (S33).

The speaker identification unit 26 outputs result information, which includes the result of identifying the first speaker from among the first speaker and the second speaker, to the display unit. The speaker identification unit 26 then ends the process.

If the first time point is later than the second time point (NO in S32), the speaker identification unit 26 determines that the result matches the content indicated by the second registration information among the registration information, and determines that the speaker who spoke is the second speaker (S34).

The speaker identification unit 26 outputs result information, which includes the result of identifying the second speaker from among the first speaker and the second speaker, to the display unit. The speaker identification unit 26 then ends the process.
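The flow of steps S31 to S34 can be sketched as a comparison of two time stamps followed by a lookup in the registration information. The sketch below is illustrative only: the names `Timing`, `Registration`, and `identify_speaker` are assumptions, and the registrations mirror the label A and label B example used for FIGS. 3 and 4.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Timing(Enum):
    AFTER_TRIGGER = "utterance starts after the trigger"
    BEFORE_TRIGGER = "utterance starts before the trigger"


@dataclass(frozen=True)
class Registration:
    label: str      # label associated with a speaker, e.g. "A" or "B"
    timing: Timing  # registered timing of the second time point relative to the first


# Assumed registrations matching the example: A speaks after the trigger, B before it.
REGISTRATIONS = (
    Registration("A", Timing.AFTER_TRIGGER),
    Registration("B", Timing.BEFORE_TRIGGER),
)


def identify_speaker(
    first_time_point: float,
    second_time_point: float,
    registrations: Sequence[Registration] = REGISTRATIONS,
) -> Optional[str]:
    # S31: timing of the second time point relative to the first time point.
    offset = second_time_point - first_time_point
    # S32: is the first time point earlier than the second time point?
    # Equal time stamps take the NO branch, as in a two-way flowchart decision.
    timing = Timing.AFTER_TRIGGER if offset > 0 else Timing.BEFORE_TRIGGER
    # S33 / S34: return the label whose registered timing matches the result.
    for registration in registrations:
        if registration.timing is timing:
            return registration.label
    return None
```

For example, a trigger at 10.0 seconds followed by an utterance starting at 10.8 seconds yields label A, while an utterance that started at 9.4 seconds yields label B. No voice features are compared, which is consistent with the low computational cost the description claims.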

<Effects>
Next, the effects of the speaker recognition device 1 according to the present embodiment will be described.

As described above, the speaker recognition device 1 according to the present embodiment includes: the acquisition unit 21, which acquires each voice uttered by one or more speakers; the storage unit 22, which stores each voice, uttered by the one or more speakers, that the acquisition unit 21 has acquired; the trigger input unit 23, to which a trigger is input; the speech start detection unit 24, which, each time a trigger is input to the trigger input unit 23, detects the start position at which an utterance began in each voice stored in the storage unit 22; and the speaker identification unit 26, which identifies one speaker from among the one or more speakers based on at least the first time point, at which the trigger is input to the trigger input unit 23, and the second time point, which is the utterance start position that the speech start detection unit 24 detected from each voice.

With this configuration, one speaker can be identified from among the one or more speakers based on, for example, the temporal order of the first time point, at which a trigger from one of the speakers is detected, and the second time point of the voice uttered by the speaker. In other words, even without learning the period from the first time point to the second time point, it is possible to identify which of the one or more speakers is the speaker of the voice acquired by the acquisition unit 21.

Therefore, the speaker recognition device 1 identifies the speaker through simple processing, which suppresses an increase in the amount of computation.

In particular, the speaker of a voice can be identified based on the timing of the utterance relative to the first time point. Therefore, the speaker recognition device 1 can identify the speaker of the voice with a simple operation. Furthermore, because operating the speaker recognition device 1 is simple, added complexity in the device, such as arranging multiple buttons on it, can be avoided. For example, when the trigger input unit 23 is a button, the voice input device 1 can identify which of the one or more speakers is speaking even with a single button, so the configuration of the voice input device 1 can be made simpler.

Furthermore, the voice input method according to the present embodiment includes: acquiring the voice of each of one or more speakers as they speak; storing each acquired voice of the one or more speakers in the storage unit 22; receiving input of a trigger; detecting, each time a trigger is input, the start position at which an utterance began from each voice stored in the storage unit 22; and identifying one of the one or more speakers based on at least a first time point, at which the trigger is input, and a second time point, which is the utterance start position detected from each voice.

This voice input method also provides the same effects as the speaker recognition device 1 described above.

Furthermore, the program according to the present embodiment is a program for causing a computer to execute the voice input method.

This program also provides the same effects as the speaker recognition device 1 described above.

Furthermore, the speaker recognition device 1 according to the present embodiment includes an utterance timing registration unit 25 that registers at least which of the first time point and the second time point is earlier. The speaker identification unit 26 identifies one of the one or more speakers based on the first time point, the second time point, and a plurality of pieces of registration information, registered by the utterance timing registration unit 25, that indicate the timing of the second time point relative to the first time point.

According to this, the temporal order of the first time point and the second time point can be registered in advance as a condition chosen by each of the one or more speakers. The speaker identification unit 26 therefore only needs to determine whether the temporal order of the first time point and the second time point is indicated in the registration information in order to identify one of the one or more speakers. As a result, the speaker recognition device 1 can identify the speaker more reliably with simple processing.

Furthermore, in the speaker recognition device 1 according to the present embodiment, when registering the utterance timing of each of the one or more speakers, the utterance timing registration unit 25 registers first registration information, which associates one of the one or more speakers with first time information indicating that the second time point, the start position of the utterance, is later than the first time point at which the trigger is input to the trigger input unit 23. The utterance timing registration unit 25 also registers second registration information, which associates another of the one or more speakers with second time information indicating that the second time point is earlier than the first time point.

According to this, a speaker can register the condition of inputting a trigger to the acquisition unit 21 before starting to speak, or the condition of inputting a trigger to the acquisition unit 21 after starting to speak. If the speakers register their conditions in advance in this way, the speaker recognition device 1 can identify the speaker simply and reliably without learning.
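The registration described above can be pictured as a small lookup table that pairs each timing condition with one speaker. The following Python sketch is illustrative only: the names `Timing`, `Registration`, and `UtteranceTimingRegistry` are hypothetical and do not appear in the disclosure, and the rule of one speaker per timing condition is an assumption made to keep the lookup unambiguous.

```python
from dataclasses import dataclass
from enum import Enum


class Timing(Enum):
    # Hypothetical labels for the two kinds of time information.
    SPEECH_AFTER_TRIGGER = "first time information"    # second time point is later
    SPEECH_BEFORE_TRIGGER = "second time information"  # second time point is earlier


@dataclass(frozen=True)
class Registration:
    """One piece of registration information: a speaker tied to a timing."""
    speaker: str
    timing: Timing


class UtteranceTimingRegistry:
    """Illustrative stand-in for the utterance timing registration unit 25."""

    def __init__(self):
        self._by_timing = {}

    def register(self, speaker, timing):
        # Assumption: at most one speaker may claim each timing condition.
        if timing in self._by_timing:
            raise ValueError(f"{timing.name} is already registered")
        registration = Registration(speaker, timing)
        self._by_timing[timing] = registration
        return registration

    def lookup(self, timing):
        # Returns the registered speaker, or None if nothing is registered.
        registration = self._by_timing.get(timing)
        return None if registration is None else registration.speaker
```

With this sketch, one speaker would be registered under `Timing.SPEECH_AFTER_TRIGGER` and another under `Timing.SPEECH_BEFORE_TRIGGER`, mirroring the first and second registration information.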

Furthermore, in the speaker recognition device 1 according to the present embodiment, the speaker identification unit 26 calculates the timing of the second time point relative to the first time point and compares the result indicating the calculated timing with the plurality of pieces of registration information. If the second time point is later than the first time point, the speaker identification unit 26 determines that the speaker who spoke is a first speaker; if the second time point is earlier than the first time point, it determines that the speaker who spoke is a second speaker different from the first speaker.

According to this, the speaker identification unit 26 can calculate the timing of the second time point relative to the first time point from the first time point, at which the trigger was input to the trigger input unit 23, and the second time point detected by the utterance start detection unit 24. The utterance start detection unit 24 can thereby calculate a result indicating the timing, namely whether the first time point is earlier or later than the second time point. As a result, by comparing the calculated result indicating the timing with the plurality of pieces of registration information, the utterance start detection unit 24 can more reliably identify which of the one or more speakers is speaking.
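The comparison described above reduces to checking the order of two timestamps and looking up the matching registration. The Python sketch below is a minimal illustration under stated assumptions: the function name `identify_speaker`, the `"after"` and `"before"` keys, and the use of seconds as the time unit are all hypothetical. The disclosure does not say what happens when the two time points coincide, so the sketch returns `None` in that case.

```python
def identify_speaker(trigger_time, utterance_start_time, registrations):
    """Identify a speaker from the order of two time points.

    trigger_time:         first time point (trigger input), in seconds.
    utterance_start_time: second time point (detected utterance start), in seconds.
    registrations:        dict mapping "after" or "before" to a speaker label,
                          where the key states whether the utterance starts
                          after or before the trigger (hypothetical encoding).
    """
    if utterance_start_time > trigger_time:
        timing = "after"    # second time point is later: first speaker
    elif utterance_start_time < trigger_time:
        timing = "before"   # second time point is earlier: second speaker
    else:
        return None         # assumption: a tie is left unidentified
    # No learned model is involved; a single dictionary lookup suffices.
    return registrations.get(timing)


registrations = {"after": "first speaker", "before": "second speaker"}
print(identify_speaker(10.0, 10.4, registrations))
print(identify_speaker(10.0, 9.7, registrations))
```

Because the decision depends only on the sign of the difference between the two time points, no per-user learning of the interval length is needed, which is the source of the reduced computation noted earlier.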

Furthermore, in the speaker recognition device 1 according to the present embodiment, the trigger input unit 23 is a voice input interface that accepts input of a preset voice. The preset voice is input to the trigger input unit 23 as the trigger.

According to this, the speaker only needs to utter a preset voice, such as a wake-up word, for the speaker recognition device 1 to perform magic word recognition and identify the speaker. The speaker recognition device 1 therefore offers excellent operability.

Furthermore, in the speaker recognition device 1 according to the present embodiment, the trigger input unit 23 is an operation button provided on the speaker recognition device 1. The operation input accepted by the button is input to the trigger input unit 23 as the trigger.

According to this, the speaker can reliably input a trigger to the trigger input unit 23 by operating the trigger input unit 23.

(Other variations, etc.)
Although the present disclosure has been described above based on the embodiments, the present disclosure is not limited to these embodiments.

For example, in the voice input device, the voice input method, and the program thereof according to each of the above embodiments, the direction of the speaker relative to the voice input device may be estimated based on the voice acquired by the acquisition unit. In this case, the acquisition unit of a microphone array unit may be used to estimate the sound source direction, relative to the voice input device, of each speaker's utterance. Specifically, the voice input device may calculate the time difference (phase difference) between the sounds arriving at the respective microphones of the acquisition unit and estimate the sound source direction using, for example, a delay time estimation method.
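One common way to realize a delay time estimate is to cross-correlate the signals from two microphones and convert the best-matching lag into an angle. The Python sketch below illustrates that idea only; it is not taken from the disclosure. It assumes two microphones, a far-field source, a speed of sound of 343 m/s, and hypothetical names (`estimate_delay_samples`, `direction_of_arrival`). A positive angle here means the sound reached the reference microphone first.

```python
import math


def estimate_delay_samples(ref, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `ref`.

    A positive lag means the sound reached `other` later than `ref`.
    Uses a plain time-domain cross-correlation over lags in [-max_lag, max_lag].
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n in range(len(ref)):
            m = n + lag
            if 0 <= m < len(other):
                score += ref[n] * other[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def direction_of_arrival(ref, other, sample_rate, mic_distance,
                         speed_of_sound=343.0):
    """Estimate the arrival angle in degrees from the inter-microphone delay.

    Far-field model: delay = mic_distance * sin(angle) / speed_of_sound.
    """
    # Physically possible delays are bounded by the microphone spacing.
    max_lag = int(math.ceil(mic_distance / speed_of_sound * sample_rate))
    lag = estimate_delay_samples(ref, other, max_lag)
    delay = lag / sample_rate
    # Clamp so that rounding of max_lag cannot push asin out of its domain.
    sine = max(-1.0, min(1.0, delay * speed_of_sound / mic_distance))
    return math.degrees(math.asin(sine))
```

With more than two microphones, pairwise delays could be combined to resolve the direction more robustly; that extension is outside this sketch.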

Furthermore, in the voice input device, the voice input method, and the program thereof according to each of the above embodiments, the voice input device may detect the sections of the speaker's voice acquired by the acquisition unit and, if a period during which no speaker voice can be acquired continues for a predetermined period or longer, automatically abort or stop recording.
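The automatic stop described above can be sketched as a counter over consecutive silent frames. The Python example below is a minimal illustration under stated assumptions: voice activity is approximated by a simple per-frame energy threshold, and the names `recording_stop_index`, `energy_threshold`, and `max_silent_frames` are hypothetical. The disclosure does not specify how voice sections are detected or how long the predetermined period is.

```python
def recording_stop_index(frame_energies, energy_threshold, max_silent_frames):
    """Return the index of the frame at which recording would stop.

    frame_energies:    per-frame energy values of the acquired audio.
    energy_threshold:  frames below this value are treated as silence
                       (assumption: energy stands in for voice detection).
    max_silent_frames: the predetermined period, expressed in frames.

    Returns None if the silence never lasts long enough.
    """
    silent_run = 0
    for index, energy in enumerate(frame_energies):
        if energy < energy_threshold:
            silent_run += 1
            if silent_run >= max_silent_frames:
                return index
        else:
            # Any voiced frame resets the silence counter.
            silent_run = 0
    return None
```

Expressing the predetermined period in frames keeps the sketch independent of the sample rate; a real device would convert a duration in seconds into a frame count.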

Furthermore, the voice input method according to each of the above embodiments may be realized by a program executed on a computer, and such a program may be stored in a storage device.

Furthermore, each processing unit included in the voice input device, the voice input method, and the program thereof according to each of the above embodiments is typically realized as an LSI, which is an integrated circuit. These may each be made into an individual chip, or a single chip may include some or all of them.

Furthermore, circuit integration is not limited to LSI and may be realized with a dedicated circuit or a general-purpose processor. An FPGA (Field Programmable Gate Array) that can be programmed after the LSI is manufactured, or a reconfigurable processor in which the connections and settings of circuit cells inside the LSI can be reconfigured, may also be used.

Note that in each of the above embodiments, each component may be configured with dedicated hardware or may be realized by executing a software program suited to that component. Each component may be realized by a program execution unit, such as a CPU or a processor, reading and executing a software program recorded on a recording medium such as a hard disk or a semiconductor memory.

Furthermore, all numbers used above are examples given to explain the present disclosure concretely, and the embodiments of the present disclosure are not limited to those numbers.

Furthermore, the division of functional blocks in the block diagram is only an example: a plurality of functional blocks may be realized as one functional block, one functional block may be divided into several, and some functions may be moved to other functional blocks. The functions of a plurality of functional blocks having similar functions may also be processed in parallel or by time division by a single piece of hardware or software.

Furthermore, the order in which the steps in the flowchart are executed is an example given to explain the present disclosure concretely, and the steps may be executed in another order. Some of the steps may also be executed simultaneously (in parallel) with other steps.

In addition, the present disclosure also includes forms obtained by applying to the embodiments various modifications that those skilled in the art may conceive, and forms realized by arbitrarily combining the components and functions of the embodiments without departing from the spirit of the present disclosure.

The present disclosure can be applied to a voice input device, a voice input method, and a program thereof that are used to identify which of a plurality of speakers produced each utterance.

1 Speaker recognition device (voice input device)
21 Acquisition unit
22 Storage unit
23 Trigger input unit
24 Utterance start detection unit
25 Utterance timing registration unit
26 Speaker identification unit

Claims

A voice input device comprising:
an acquisition unit that acquires the voice of each of one or more speakers as they speak;
a storage unit that stores each of the voices of the one or more speakers acquired by the acquisition unit;
a trigger input unit to which a trigger is input;
an utterance start detection unit that, each time the trigger is input to the trigger input unit, detects from each of the voices stored in the storage unit a start position at which an utterance began;
a speaker identification unit that identifies one of the one or more speakers based on at least a first time point at which the trigger is input to the trigger input unit and a second time point of the utterance start position detected from each of the voices by the utterance start detection unit; and
an utterance timing registration unit that registers at least which of the first time point and the second time point is earlier,
wherein the speaker identification unit identifies one of the one or more speakers based on the first time point, the second time point, and a plurality of pieces of registration information, registered by the utterance timing registration unit, that indicate the timing of the second time point relative to the first time point.

The voice input device according to claim 1, wherein, when registering the utterance timing of each of the one or more speakers, the utterance timing registration unit:
registers first registration information, which is registration information that associates one of the one or more speakers with first time information indicating that the second time point of the start position at which the utterance began is later than the first time point at which the trigger is input to the trigger input unit; and
registers second registration information, which is registration information that associates another of the one or more speakers with second time information indicating that the second time point of the start position at which the utterance began is earlier than the first time point at which the trigger is input to the trigger input unit.

The voice input device according to claim 1 or 2, wherein the speaker identification unit:
calculates the timing of the second time point relative to the first time point; and
compares the result indicating the calculated timing with the plurality of pieces of registration information, determines that the speaker who spoke is a first speaker if the second time point is later than the first time point, and determines that the speaker who spoke is a second speaker different from the first speaker if the second time point is earlier than the first time point.

The voice input device according to any one of claims 1 to 3, wherein:
the trigger input unit is a voice input interface that accepts input of a preset voice; and
the preset voice is input to the trigger input unit as the trigger.

The voice input device according to any one of claims 1 to 3, wherein:
the trigger input unit is an operation button provided on the voice input device; and
an accepted operation input is input to the trigger input unit as the trigger.

A voice input method comprising:
acquiring the voice of each of one or more speakers as they speak;
storing each of the acquired voices of the one or more speakers in a storage unit;
receiving input of a trigger;
detecting, each time the trigger is input, a start position at which an utterance began from each of the voices stored in the storage unit;
identifying one of the one or more speakers based on at least a first time point at which the trigger is input and a second time point of the utterance start position detected from each of the voices;
registering at least which of the first time point and the second time point is earlier; and
identifying one of the one or more speakers based on the first time point, the second time point, and a plurality of pieces of registration information indicating the timing of the second time point relative to the first time point.

A program for causing a computer to execute the voice input method according to claim 6.