JP2020190671A

JP2020190671A - Speech-to-text converter, speech-to-text conversion method and speech-to-text conversion program

Info

Publication number: JP2020190671A
Application number: JP2019096723A
Authority: JP
Inventors: Kimiko Kawashima; Kenji Yasunaga
Original assignee: Nippon Telegraph and Telephone West Corp
Current assignee: Nippon Telegraph and Telephone West Corp
Priority date: 2019-05-23
Filing date: 2019-05-23
Publication date: 2020-11-26
Anticipated expiration: 2039-05-23
Also published as: JP6735392B1

Abstract

To output a speech recognition result with higher accuracy. SOLUTION: A noise suppression unit 11 suppresses noise in an original speech waveform f1, and a speech segment detection unit 12 detects speech segments tj from the noise-suppressed speech waveform f2. A speech waveform cutting unit 13 cuts the original speech waveform f1 and the noise-suppressed speech waveform f2 at each speech segment tj to obtain segmental speech waveforms f1_tj and f2_tj. A speech recognition unit 14 then recognizes each of the segmental speech waveforms f1_tj and f2_tj, before and after noise suppression, with each of a plurality of speech recognition engines ei, and takes the result with the larger number of characters as the speech recognition result Rij of engine ei for speech segment tj. A recognition result correction unit 15 compares the speech recognition results Rij for each speech segment tj to correct the speech recognition result. SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for improving speech recognition accuracy.

In recent years, speech recognition technology has come into wide use. For example, smart speakers, which are network-connected speakers with a built-in microphone that can be operated by speech recognition, have become widespread. Speech recognition engines are offered by various companies, which makes it easy to convert speech into text.

Noise suppression techniques for improving speech recognition accuracy have also been studied (for example, Non-Patent Document 1).

"Speech processing technology for improving speech recognition accuracy in noisy environments", Nippon Telegraph and Telephone Corporation, [retrieved April 22, 2019], Internet <URL: http://www.ntt.co.jp/svlab/activity/category_2/product2_29.html>

The characteristics of recognition results differ from one speech recognition engine to another, and each engine has its strengths and weaknesses. Because the training data and recognition algorithms differ between engines, some engines achieve high accuracy on well-formed, written-style speech, while others achieve high accuracy on casual, colloquial speech. Some engines output only the portions estimated to have high recognition accuracy, while others output every portion they were able to recognize.

Moreover, noise suppression improves recognition accuracy in some portions but not in others, so applying it does not necessarily raise accuracy. For example, when a noise suppression technique is applied, accuracy improves in noisy portions because the noise is suppressed. In portions without noise, however, the suppression processing can degrade sound quality and thereby lower recognition accuracy.

The present invention has been made in view of the above, and its object is to output a more accurate speech recognition result.

A speech-to-text conversion device according to the present invention comprises: a noise suppression unit that suppresses noise in an input speech waveform; a speech recognition unit that, with each of a plurality of speech recognition engines, obtains a first speech recognition result by recognizing the speech waveform and a second speech recognition result by recognizing the noise-suppressed speech waveform, and selects whichever of the first and second speech recognition results has more characters as the speech recognition result of that engine; and a recognition result correction unit that compares the speech recognition results of the plurality of speech recognition engines with one another and corrects the speech recognition result.

A speech-to-text conversion method according to the present invention comprises the steps of: suppressing noise in an input speech waveform; obtaining, with each of a plurality of speech recognition engines, a first speech recognition result by recognizing the speech waveform and a second speech recognition result by recognizing the noise-suppressed speech waveform; selecting whichever of the first and second speech recognition results has more characters as the speech recognition result of that engine; and comparing the speech recognition results of the plurality of speech recognition engines with one another and correcting the speech recognition result.

According to the present invention, a more accurate speech recognition result can be output.

FIG. 1 is a functional block diagram showing the configuration of the speech-to-text conversion device of the present embodiment.
FIG. 2 is a flowchart showing the processing flow of the speech-to-text conversion device of the present embodiment.
FIG. 3 is a flowchart showing the flow of the speech waveform cutting process.
FIG. 4 is a diagram showing the relationship between the length of the silent section at the head of a speech waveform and recognition accuracy.
FIG. 5 is a flowchart showing the flow of the speech recognition process.
FIG. 6 is a flowchart showing the flow of the recognition result correction process.
FIG. 7 is a diagram showing an example in which speech recognition results are divided into morphemes and mismatched portions are extracted.
FIG. 8 is a diagram showing an example of the correction state flag.

Embodiments of the present invention will now be described with reference to the drawings.

(Configuration of the speech-to-text conversion device)
FIG. 1 is a functional block diagram showing the configuration of the speech-to-text conversion device 1 of the present embodiment. The speech-to-text conversion device 1 receives speech as input and outputs text that is the result of recognizing that speech. In addition to the text, the speech-to-text conversion device 1 may output a correction state indicating how the speech recognition result was corrected.

The speech-to-text conversion device 1 shown in FIG. 1 includes a noise suppression unit 11, an utterance section detection unit 12, a speech waveform cutting unit 13, a speech recognition unit 14, and a recognition result correction unit 15. These units may be implemented on a computer having an arithmetic processing unit, a storage device, and the like, with the processing of each unit executed by a program. The program is stored in a storage device of the speech-to-text conversion device 1; it can also be recorded on a recording medium such as a magnetic disk, an optical disk, or a semiconductor memory, or provided over a network.

The noise suppression unit 11 receives the original speech waveform f1 that is the target of speech recognition, performs noise suppression processing, and outputs the noise-suppressed speech waveform f2. The noise suppression processing can use, for example, the speech processing technology of Non-Patent Document 1 or the technology implemented in noise-canceling earphones. The original speech waveform f1 and the noise-suppressed speech waveform f2 are input to the speech waveform cutting unit 13.

The utterance section detection unit 12 receives the noise-suppressed speech waveform f2 and detects the utterance sections tj (j = 1, 2, ..., m) in which a person is speaking. A VAD (Voice Activity Detection) library, such as those published by Google and others, can be used for this detection. The utterance section detection unit 12 may instead detect the utterance sections from the original speech waveform f1.
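The source points to an off-the-shelf VAD library for this step. As a rough stand-in that only illustrates the input and output of the detection, the sketch below uses a simple amplitude threshold, which is not the method of any particular library. The function name `detect_sections` and the parameters `threshold` and `min_gap` are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def detect_sections(
    waveform: Sequence[float], threshold: float, min_gap: int
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges whose amplitude reaches `threshold`.

    A section is closed once more than `min_gap` consecutive samples
    stay below the threshold. `end` is exclusive.
    """
    sections: List[Tuple[int, int]] = []
    start = None        # first active sample of the open section
    last_active = None  # most recent active sample
    for n, sample in enumerate(waveform):
        if abs(sample) >= threshold:
            if start is None:
                start = n
            last_active = n
        elif start is not None and n - last_active > min_gap:
            sections.append((start, last_active + 1))
            start = None
    if start is not None:
        sections.append((start, last_active + 1))
    return sections


# Toy noise-suppressed waveform f2 with two bursts of activity.
f2 = [0.0, 0.5, 0.6, 0.0, 0.0, 0.0, 0.7, 0.8, 0.0]
sections = detect_sections(f2, threshold=0.3, min_gap=2)
```

Each returned pair plays the role of one utterance section tj expressed as sample indices.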

The speech waveform cutting unit 13 cuts the utterance section tj out of each of the original speech waveform f1 and the noise-suppressed speech waveform f2, and prepends a silent waveform to each cut-out waveform. It outputs to the speech recognition unit 14 the section speech waveform f1_tj, cut out of f1 for each utterance section tj with the silent waveform added, and the section speech waveform f2_tj, cut out of f2 in the same way.

The speech recognition unit 14 uses a plurality of speech recognition engines ei (i = 1, 2, ..., n) to recognize, for each utterance section tj, the section speech waveforms f1_tj and f2_tj from before and after noise suppression. For each engine ei, it takes whichever of the two recognition results has more characters as the speech recognition result Rij of engine ei for utterance section tj. In other words, for each utterance section tj the speech recognition unit 14 outputs the speech recognition results Rij of the plurality of engines ei.

The speech recognition unit 14 may contain the plurality of speech recognition engines ei itself, or may use external speech recognition services. Any form is acceptable as long as a plurality of different engines ei are used. For an engine that outputs multiple candidate results, the result with the highest confidence is adopted. Alternatively, several of the highest-confidence results may be output and compared in the downstream recognition result correction unit 15.

For each utterance section tj, the recognition result correction unit 15 compares the speech recognition results Rij of the engines ei to identify mismatched portions and, for each mismatched portion, adopts the result produced by the larger number of engines. When the speech input to the speech-to-text conversion device 1 accompanies video or slides, the recognition result correction unit 15 also compares the results Rij for the mismatched portion with the character recognition results from the video or slides and corrects the portion to the best-fitting content. The character recognition results may be extracted from the video or slides by a separate device and supplied to the speech-to-text conversion device 1, or the device 1 may receive the video or slides and extract them itself.

In addition to the corrected text, the recognition result correction unit 15 outputs the correction state of the mismatched portions of the results Rij. For example, it attaches to each corrected portion information indicating whether the correction came from comparing speech recognition results or from comparison with character recognition.

(Operation of the speech-to-text conversion device)
Next, the operation of the speech-to-text conversion device 1 of the present embodiment will be described.

FIG. 2 is a flowchart showing the processing flow of the speech-to-text conversion device 1 of the present embodiment.

In step S1, the noise suppression unit 11 performs noise suppression processing on the original speech waveform f1 and outputs the noise-suppressed speech waveform f2.

In step S2, the utterance section detection unit 12 detects the utterance sections tj from the noise-suppressed speech waveform f2.

In step S3, the speech waveform cutting unit 13 cuts the utterance section tj out of each of the original speech waveform f1 and the noise-suppressed speech waveform f2, and prepends a silent waveform to the cut-out section speech waveforms f1_tj and f2_tj. The speech waveform cutting process is described in detail later.

If the original speech waveform f1 is short, steps S2 and S3 may be skipped and the original speech waveform f1 and the noise-suppressed speech waveform f2 passed directly to the speech recognition unit 14.

In step S4, the speech recognition unit 14 uses the plurality of speech recognition engines ei to recognize each of the section speech waveforms f1_tj and f2_tj and obtains the speech recognition results Rij. The speech recognition process is described in detail later.

In step S5, the recognition result correction unit 15 compares the speech recognition results Rij of the plurality of engines ei, adopts the appropriate recognition result, and outputs the text. The recognition result correction unit 15 may also correct the speech recognition result using character recognition results related to the original speech. The recognition result correction process is described in detail later.

(Speech waveform cutting process)
FIG. 3 is a flowchart showing the flow of the speech waveform cutting process. The speech waveform cutting unit 13 receives the original speech waveform f1, the noise-suppressed speech waveform f2, and the utterance sections tj, and executes the speech waveform cutting process.

In step S31, the speech waveform cutting unit 13 cuts the utterance section tj out of the original speech waveform f1.

In step S32, the speech waveform cutting unit 13 cuts the utterance section tj out of the noise-suppressed speech waveform f2.

In step S33, the speech waveform cutting unit 13 prepends a silent waveform to each of the waveforms cut out of the original speech waveform f1 and the noise-suppressed speech waveform f2 for utterance section tj. It outputs the section speech waveform f1_tj, which is the utterance section tj of f1 with silence added, and the section speech waveform f2_tj, which is the utterance section tj of f2 with silence added.

As shown in FIG. 4, recognition accuracy improves when the silent section preceding the utterance is at least a certain length. The speech waveform cutting unit 13 therefore determines in advance a silence duration at which recognition accuracy saturates, and prepends a silent section of that duration to the cut-out section speech waveforms f1_tj and f2_tj.
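Steps S31 to S33 can be sketched as follows. This is a minimal illustration that assumes waveforms are plain lists of samples and utterance sections are (start, end) sample index pairs. The function name and the `silence_samples` parameter are hypothetical, and the silence length would in practice be the pre-determined duration at which accuracy saturates.

```python
from typing import List, Sequence, Tuple


def cut_with_leading_silence(
    waveform: Sequence[float],
    sections: Sequence[Tuple[int, int]],
    silence_samples: int,
) -> List[List[float]]:
    """Cut each (start, end) section out of `waveform` and prepend silence."""
    silence = [0.0] * silence_samples  # silent waveform added at the head
    return [silence + list(waveform[start:end]) for start, end in sections]


# The same sections are applied to both waveforms so that f1_tj and
# f2_tj cover the same time span.
f1 = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # original speech waveform (toy data)
f2 = [0.0, 0.2, 0.3, 0.0, 0.5, 0.6]  # noise-suppressed waveform (toy data)
sections = [(1, 3), (4, 6)]

f1_segments = cut_with_leading_silence(f1, sections, silence_samples=2)
f2_segments = cut_with_leading_silence(f2, sections, silence_samples=2)
```

Applying the same section boundaries to f1 and f2 is what lets the later character-count comparison treat the two results as alternatives for the same utterance.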

In step S34, the speech waveform cutting unit 13 determines whether all utterance sections have been processed. If an unprocessed utterance section remains, the process returns to step S31 and processes the next utterance section tj+1. Once all utterance sections have been cut out, the speech waveform cutting process ends.

(Speech recognition process)
FIG. 5 is a flowchart showing the flow of the speech recognition process. The speech recognition unit 14 receives the section speech waveforms f1_tj and f2_tj from before and after noise suppression and, using each of the plurality of speech recognition engines, obtains a speech recognition result for each utterance section.

In step S41, the speech recognition unit 14 selects one speech recognition engine ei from the plurality of speech recognition engines.

In step S42, the speech recognition unit 14 uses the engine ei selected in step S41 to recognize the section speech waveform f1_tj cut out of the original speech waveform f1.

In step S43, the speech recognition unit 14 uses the engine ei selected in step S41 to recognize the section speech waveform f2_tj cut out of the noise-suppressed speech waveform f2.

In step S44, the speech recognition unit 14 compares the character counts of the speech recognition results obtained in steps S42 and S43 and adopts the result with more characters as the speech recognition result Rij of engine ei for utterance section tj. Comparing the recognition results for the waveforms before and after noise suppression accounts for the fact that noise suppression improves recognition accuracy in some portions but not in others. Adopting the result with the larger character count helps prevent portions of the speech from going unrecognized.
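The selection rule of step S44 is small enough to state directly in code. The sketch below is illustrative; the function name is hypothetical, and keeping the original-waveform result on a tie is an assumption, since the source does not say how ties are handled.

```python
def select_longer(result_original: str, result_suppressed: str) -> str:
    """Return whichever recognition result has more characters.

    On a tie the result for the original waveform is kept
    (an assumption; the source leaves ties unspecified).
    """
    if len(result_suppressed) > len(result_original):
        return result_suppressed
    return result_original


# Example: the noise-suppressed waveform yielded a more complete result.
r_ij = select_longer("私は山に", "私は山に登り")
```

Note that `len` counts Unicode code points, which matches a per-character count for ordinary Japanese text.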

In step S45, the speech recognition unit 14 determines whether all utterance sections have been processed. If an unprocessed utterance section remains, the process returns to step S42 and processes the next utterance section tj+1.

In step S46, the speech recognition unit 14 determines whether all speech recognition engines have been used. If an unused engine remains, the process returns to step S41, selects the next engine ei+1, and processes the utterance sections in order from the first. Steps S42 to S45 may also be executed in parallel across the plurality of speech recognition engines.
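The parallel variant mentioned for steps S42 to S45 might look like the sketch below, which runs one worker per engine using the standard library thread pool. Engines are modeled as plain callables from a waveform to a string; this interface, the function names, and the stub engines are all assumptions made for illustration and do not correspond to any real recognition API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Waveform = Sequence[float]
Engine = Callable[[Waveform], str]  # hypothetical engine interface


def run_engine(
    engine: Engine, f1_segments: Sequence[Waveform], f2_segments: Sequence[Waveform]
) -> List[str]:
    """Steps S42 to S45 for one engine: recognize both waveforms per section."""
    results = []
    for seg1, seg2 in zip(f1_segments, f2_segments):
        r1 = engine(seg1)  # S42: original waveform
        r2 = engine(seg2)  # S43: noise-suppressed waveform
        results.append(r2 if len(r2) > len(r1) else r1)  # S44: longer text
    return results


def recognize_parallel(
    engines: Sequence[Engine],
    f1_segments: Sequence[Waveform],
    f2_segments: Sequence[Waveform],
) -> List[List[str]]:
    """Return results[i][j]: the result of engine i for utterance section j."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = [
            pool.submit(run_engine, e, f1_segments, f2_segments) for e in engines
        ]
        return [f.result() for f in futures]  # keeps engine order


# Stub engines that stand in for real recognizers.
engine_a: Engine = lambda seg: "a" * sum(1 for s in seg if s != 0)
engine_b: Engine = lambda seg: "b" * len(seg)

R = recognize_parallel(
    [engine_a, engine_b],
    f1_segments=[[0.1, 0.0], [0.3]],
    f2_segments=[[0.1, 0.2], [0.0]],
)
```

Threads suit this case when each engine call is a network request to an external service, since the work is then I/O-bound.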

(Recognition result correction process)
FIG. 6 is a flowchart showing the flow of the recognition result correction process. For each utterance section tj, the recognition result correction unit 15 compares the speech recognition results Rij of the engines ei and corrects the speech recognition result based on the comparison.

In step S51, the recognition result correction unit 15 compares the speech recognition results of the engines for utterance section tj and extracts the mismatched portions. Specifically, it divides each speech recognition result Rij into morphemes using a tool such as MeCab or Juman, then uses a library such as difflib to compare the results of the engines morpheme by morpheme and extract the mismatched portions.

FIG. 7 shows an example in which speech recognition results are divided into morphemes and the mismatched portions are extracted. The example shows the recognition results of six speech recognition engines e1 to e6 for utterance section tj, divided into morphemes. For utterance section tj, engines e1 to e3 output 「私は山に登り」 ("I climb the mountain"), engines e4 and e5 output 「わしは山に乗り」, and engine e6 outputs 「私は山に乗り」. When the results are divided into morphemes and compared, 「私」 (watashi, "I") versus 「わし」 (washi, a colloquial first-person pronoun) and 「登り」 (nobori, "climb") versus 「乗り」 (nori, "ride") are extracted as mismatched portions.
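The mismatch extraction in this example can be reproduced with `difflib.SequenceMatcher` from the Python standard library. The sketch assumes the results have already been split into morphemes (for instance by MeCab), so the token lists below are written by hand; the helper name `mismatches` is hypothetical.

```python
import difflib
from typing import List, Sequence, Tuple


def mismatches(
    a: Sequence[str], b: Sequence[str]
) -> List[Tuple[List[str], List[str]]]:
    """Return the morpheme runs where sequences `a` and `b` differ."""
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (list(a[i1:i2]), list(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep replace / insert / delete spans only
    ]


# Hand-tokenized results of engine e1 and engine e4 from the FIG. 7 example.
e1 = ["私", "は", "山", "に", "登り"]
e4 = ["わし", "は", "山", "に", "乗り"]

diff = mismatches(e1, e4)
```

Comparing token lists rather than raw strings keeps each mismatch aligned to a whole morpheme, which is what the later vote operates on.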

In step S52, for each mismatched portion, the recognition result correction unit 15 adopts the result output by the larger number of speech recognition engines. In the example of FIG. 7, for the portion where 「私」 and 「わし」 disagree, four engines (e1 to e3 and e6) recognized 「私」 and two engines (e4 and e5) recognized 「わし」, so the recognition result correction unit 15 adopts 「私」. For the portion where 「登り」 and 「乗り」 disagree, three engines output each candidate, so either may be adopted.
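The majority decision of step S52 can be sketched with `collections.Counter`. The function name is hypothetical, and returning all tied candidates (rather than picking one) is a design choice made here so that a later stage, such as the character recognition comparison, can break the tie.

```python
from collections import Counter
from typing import List, Sequence


def vote(candidates: Sequence[str]) -> List[str]:
    """Return the candidate(s) output by the largest number of engines.

    More than one element is returned when the top count is tied.
    """
    counts = Counter(candidates)
    top = max(counts.values())
    return [word for word, count in counts.items() if count == top]


# Candidates from engines e1..e6 in the FIG. 7 example.
first_slot = vote(["私", "私", "私", "わし", "わし", "私"])
last_slot = vote(["登り", "登り", "登り", "乗り", "乗り", "乗り"])
```

Here `first_slot` holds a single winner, while `last_slot` holds both candidates because the engines split three to three.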

In step S53, for each mismatched portion, the recognition result correction unit 15 compares the character recognition results with the candidate recognition results and adopts the more appropriate candidate. For example, character recognition results are obtained from the video or slides over an interval that includes the 10 seconds before and after utterance section tj, the semantic vectors of the character recognition results and of each candidate for the mismatched portion are compared, and the candidate whose meaning is closest to the character recognition result is adopted. Semantic vectors can be derived with a vectorization method such as word2vec. In the example of FIG. 7, if the text 「山登り」 ("mountain climbing") is obtained from the video, the recognition result correction unit 15 adopts 「登り」 for the portion where 「登り」 and 「乗り」 disagree.
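One plausible way to compare semantic vectors is cosine similarity, sketched below. The source does not name a similarity measure, so cosine similarity is an assumption, and the two-dimensional vectors are made-up toy values standing in for real word2vec embeddings. The function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def pick_by_ocr(
    candidates: Sequence[str], ocr_word: str, vectors: Dict[str, Sequence[float]]
) -> str:
    """Pick the candidate whose vector is closest to the OCR word's vector."""
    return max(candidates, key=lambda w: cosine(vectors[w], vectors[ocr_word]))


# Toy vectors (NOT real embeddings), chosen so that 登り lies near 山登り.
toy_vectors = {
    "山登り": [0.9, 0.1],
    "登り": [0.8, 0.2],
    "乗り": [0.1, 0.9],
}

chosen = pick_by_ocr(["登り", "乗り"], "山登り", toy_vectors)
```

With real embeddings the vectors would come from a trained model, and the candidate set would be the tied or disputed morphemes from the comparison between engines.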

Steps S52 and S53 may be performed in the reverse order. If the same mismatched portion is corrected in both step S52 and step S53, the correction with the higher confidence may be adopted.

In step S54, the recognition result correction unit 15 sets the correction state flag based on the corrections made in steps S52 and S53. FIG. 8 shows an example of the correction state flag. In the example of FIG. 8, the flag is set to 1 if the speech recognition result was not corrected in step S52 or step S53, to 2 if it was corrected in step S52 based on the comparison between speech recognition results, and to 3 if it was corrected in step S53 based on the comparison with the character recognition results. The flag values are not limited to these.
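The flag values of the FIG. 8 example map naturally onto an enumeration. The sketch below is illustrative: the member names are invented, and giving step S53 precedence when both steps corrected the same portion is an assumption, since the source does not specify that case.

```python
from enum import IntEnum


class CorrectionState(IntEnum):
    """Correction state flag values from the FIG. 8 example."""

    NOT_CORRECTED = 1                 # no correction in S52 or S53
    CORRECTED_BY_ENGINE_COMPARISON = 2  # corrected in S52
    CORRECTED_BY_CHARACTER_RECOGNITION = 3  # corrected in S53


def correction_state(corrected_in_s52: bool, corrected_in_s53: bool) -> CorrectionState:
    """Derive the flag; S53 wins if both steps corrected (an assumption)."""
    if corrected_in_s53:
        return CorrectionState.CORRECTED_BY_CHARACTER_RECOGNITION
    if corrected_in_s52:
        return CorrectionState.CORRECTED_BY_ENGINE_COMPARISON
    return CorrectionState.NOT_CORRECTED


flag = correction_state(corrected_in_s52=True, corrected_in_s53=False)
```

Because `IntEnum` members compare equal to integers, the flag can be emitted as the plain values 1, 2, and 3 alongside the text.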

In step S55, the recognition result correction unit 15 outputs, for utterance section tj, the text Tj of the speech recognition result together with the correction state flag fj set in step S54.

In step S56, the recognition result correction unit 15 determines whether all utterance sections have been processed. If an unprocessed utterance section remains, the process returns to step S51 and processes the next utterance section tj+1.

As described above, in the present embodiment the noise suppression unit 11 suppresses noise in the original speech waveform f1, the utterance section detection unit 12 detects the utterance sections tj from the noise-suppressed speech waveform f2, and the speech waveform cutting unit 13 cuts the original speech waveform f1 and the noise-suppressed speech waveform f2 at each utterance section tj to obtain the section speech waveforms f1_tj and f2_tj. The speech recognition unit 14 recognizes each of the section speech waveforms f1_tj and f2_tj, before and after noise suppression, with each of the plurality of speech recognition engines ei and takes the result with more characters as the speech recognition result Rij of engine ei for utterance section tj. The recognition result correction unit 15 then compares the results Rij for each utterance section tj and corrects the speech recognition result. Speech recognition accuracy can thereby be improved in a way that accounts for whether noise suppression is effective and for the strengths and weaknesses of each engine.

According to the present embodiment, the speech waveform cutting unit 13 prepends a silent waveform to the section speech waveforms f1_tj and f2_tj, which improves the accuracy of recognizing them.

According to the present embodiment, the recognition result correction unit 15 corrects the speech recognition result based on character recognition results extracted from the video accompanying the original speech waveform, which yields a speech recognition result that fits the meaning of the speech.

According to the present embodiment, the recognition result correction unit 15 outputs a correction state flag indicating how the speech recognition result was corrected, which makes it possible to judge the validity of the speech recognition result.

1 ... Speech-to-text converter
11 ... Noise suppression unit
12 ... Speech segment detection unit
13 ... Speech waveform cutting unit
14 ... Speech recognition unit
15 ... Recognition result correction unit

Claims

A speech-to-text converter comprising:
a noise suppression unit that suppresses noise in an input speech waveform;
a speech recognition unit that, with each of a plurality of speech recognition engines, obtains a first speech recognition result by recognizing the speech waveform and a second speech recognition result by recognizing a noise-suppressed speech waveform in which the noise has been suppressed, and selects whichever of the first speech recognition result and the second speech recognition result has more characters as the speech recognition result of that speech recognition engine; and
a recognition result correction unit that compares the speech recognition results of the plurality of speech recognition engines with one another and corrects the speech recognition result.

The speech-to-text converter according to claim 1, further comprising:
a speech segment detection unit that detects speech segments from the speech waveform; and
a speech waveform cutting unit that cuts the speech waveform and the noise-suppressed speech waveform at each speech segment and adds a silent waveform to the head of each segmental speech waveform cut at each speech segment,
wherein the speech recognition unit recognizes, for each speech segment, the segmental speech waveforms cut out from each of the speech waveform and the noise-suppressed speech waveform.

The speech-to-text converter according to claim 1 or 2, wherein the recognition result correction unit corrects the speech recognition result based on a character recognition result extracted from video accompanying the speech waveform.

The speech-to-text converter according to any one of claims 1 to 3, wherein the recognition result correction unit outputs information indicating how the speech recognition result was corrected.

A speech-to-text conversion method comprising:
a step of suppressing noise in an input speech waveform;
a step of obtaining, with each of a plurality of speech recognition engines, a first speech recognition result by recognizing the speech waveform and a second speech recognition result by recognizing a noise-suppressed speech waveform in which the noise has been suppressed;
a step of selecting whichever of the first speech recognition result and the second speech recognition result has more characters as the speech recognition result of that speech recognition engine; and
a step of comparing the speech recognition results of the plurality of speech recognition engines with one another and correcting the speech recognition result.

A speech-to-text conversion program that causes a computer to operate as each unit of the speech-to-text converter according to any one of claims 1 to 4.