JP7113474B2

JP7113474B2 - data segmentation device

Info

Publication number: JP7113474B2
Application number: JP2018148249A
Authority: JP
Inventors: シャオリンワン; 将夫内山; 英一郎隅田
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2018-08-07
Filing date: 2018-08-07
Publication date: 2022-08-05
Anticipated expiration: 2038-08-07
Also published as: JP2020024277A

Description

The present invention relates to a method for segmenting sequence data that is continuous in time series, for example, to a technique for sentence segmentation.

To realize simultaneous interpretation (real-time interpretation) by machine, both speech recognition and machine translation must be performed. In other words, automatic simultaneous interpretation requires integrating automatic speech recognition with machine translation.

The data (text data) obtained by automatic speech recognition contains no sentence breaks (segmentation). Machine translation, on the other hand, requires input text that has already been divided into sentence units.

In recent years, to integrate automatic speech recognition with machine translation, techniques have been developed that divide the text data obtained by automatic speech recognition into sentence units in real time, thereby obtaining text delimited sentence by sentence.

For example, there is a technique that automatically performs sentence segmentation of conversational speech using an n-gram language model (see, for example, Non-Patent Document 1).

A method using such an n-gram language model regards a sentence boundary (break position) as a hidden event that occurs between input words. The method then computes the likelihood of the input words both when a sentence boundary is assumed to exist and when it is assumed not to exist. Specifically, with the input words (word data) denoted ..., w_{t-1}, w_t, w_{t+1}, ..., the method sets the following two hypotheses, (Assumption 1) and (Assumption 2).
(Assumption 1):
No sentence boundary (break position) exists after the word w_t, and the input data remains ..., w_{t-1}, w_t, w_{t+1}, ....
(Assumption 2):
A sentence boundary (break position) exists after the word w_t, and the input data is ..., w_{t-1}, w_t, </s>, <s>, w_{t+1}, .... Here, "</s>" indicates the end position of a sentence and "<s>" indicates the start position of a sentence.

The method using the n-gram language model predicts sentence boundaries (break positions) by comparing the probability of (Assumption 1) with the probability of (Assumption 2). For example, it predicts a sentence boundary based on the confidence s_t that a sentence boundary exists after the word w_t, where s_t is defined by the following formula.

o: order of the n-gram language model
Here, for example, "p(</s> | w_{t-o+2}^t)" denotes the probability that </s> occurs after w_{t-o+2}, ..., w_{t-1}, w_t. The same applies to the other notation.
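The comparison of the two hypotheses can be illustrated with a toy case. The following is a minimal sketch, assuming a bigram model (o = 2), a transition probability p(<s> | </s>) of 1, and a hypothetical dictionary `lm` that maps (previous word, word) to a conditional probability. It is not the formula for s_t used in the cited method; it only shows how a log-probability ratio between (Assumption 2) and (Assumption 1) could be formed from the terms in which the two sequences differ.

```python
import math
from typing import Dict, Tuple

# Hypothetical toy bigram table: (previous word, word) -> p(word | previous word).
Bigram = Dict[Tuple[str, str], float]


def boundary_log_ratio(lm: Bigram, w_t: str, w_next: str, floor: float = 1e-6) -> float:
    """Return log P(Assumption 2) - log P(Assumption 1) for a bigram model.

    Only the terms that differ between the two sequences are used.
    Unseen bigrams fall back to `floor` (an assumption, not a real smoothing scheme).
    """
    def p(prev: str, word: str) -> float:
        return lm.get((prev, word), floor)

    # Assumption 2: ..., w_t, </s>, <s>, w_next, ...  (p(<s> | </s>) taken as 1)
    with_boundary = math.log(p(w_t, "</s>")) + math.log(p("<s>", w_next))
    # Assumption 1: ..., w_t, w_next, ...
    without_boundary = math.log(p(w_t, w_next))
    return with_boundary - without_boundary


toy_lm = {
    ("strawberries", "</s>"): 0.5,
    ("<s>", "how"): 0.2,
    ("strawberries", "how"): 0.01,
}
ratio = boundary_log_ratio(toy_lm, "strawberries", "how")
```

A positive value means the toy model favours a boundary after `w_t`. Because only a fixed window of words is consulted, the sketch also shows why such a model cannot see dependencies longer than its order.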

Andreas Stolcke and Elizabeth Shriberg. 1996. Automatic linguistic segmentation of conversational speech. In Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on, volume 2, pages 1005-1008. IEEE.

However, the above method using the n-gram language model has the following two problems.

First, the method cannot capture long-range dependencies within a sentence. Sentences are usually longer than the order of the n-gram, so for a sentence consisting of more than n words the method cannot properly judge the dependencies within that sentence and, as a result, cannot properly detect sentence boundaries.

Second, the method predicts sentence boundaries (break positions) by comparing the joint (generative) probabilities of the two sequences ((Assumption 1) and (Assumption 2) above). However, the detection accuracy (for example, sentence boundary detection accuracy) of a model that uses joint probabilities (a generative model) is inferior to that of a model that uses conditional probabilities (a discriminative model).

In view of the above problems, an object of the present invention is to realize a data segmentation device capable of executing sequence data segmentation processing (for example, sentence segmentation processing) in real time, regardless of the number of unit data items (for example, the number of words) constituting the sequence data (for example, a sentence).

A first invention for solving the above problems is a data segmentation device comprising a neural network unit and a boundary detection unit.

The neural network unit receives unit data (for example, word data) constituting sequence data (for example, a sentence), which is data continuous in time series. It outputs (m+1)-dimensional vector data consisting of m+1 elements in total: m elements (m: a natural number), each of which indicates the probability of being a boundary position of the sequence data, and one additional element.

The boundary detection unit determines the boundary positions of the sequence data (for example, sentences) based on the (m+1)-dimensional vector data output from the neural network unit.

In this data segmentation device, learning is performed using a neural network model that takes, for example, word data as input and outputs a vector indicating the positions and probabilities of sentence boundaries, and a trained model is obtained. The device then uses the trained model to perform, for example, sentence boundary detection. That is, by examining the values of the m elements, each of which indicates the probability of being a sentence boundary position, the device can detect the position of a sentence boundary easily and appropriately. Moreover, because the device outputs (m+1)-dimensional vector data, the additional element can be set so that the sum of all elements of the (m+1)-dimensional vector is, for example, 1. This makes it easy, for example, to treat the output layer of the neural network unit as a softmax layer (an output layer whose activation function is the softmax function).

A second invention is the first invention, wherein the neural network unit includes a recurrent neural network.

As a result, this data segmentation device can take into account, for example, dependencies across long sentences, and can perform highly accurate sentence boundary detection and sentence segmentation regardless of the number of words constituting a sentence.

A third invention is the first or second invention, further comprising a threshold vector setting unit that sets an m-dimensional threshold vector.

The boundary detection unit determines the boundary positions of the sequence data (for example, sentences) based on the m-dimensional threshold vector and the (m+1)-dimensional vector data.

A fourth invention is the third invention, wherein the boundary detection unit determines the boundary positions of the sequence data (for example, sentences) by comparing each of the m elements of the m-dimensional threshold vector with the corresponding one of the m elements of the (m+1)-dimensional vector data that indicate the probability of being a boundary position.

As a result, this data segmentation device can appropriately determine (detect) sentence boundary positions with only a simple comparison.

A fifth invention is the fourth invention, wherein the boundary detection unit compares the m elements of the m-dimensional threshold vector with the m probability elements of the (m+1)-dimensional vector data in order, starting from the candidate boundary position that is temporally closest to the current time, and, once a boundary position of the sequence data (for example, a sentence) has been determined, performs no further comparisons.

As a result, this data segmentation device can determine, in order from the position temporally closest to the current time t, whether a sentence boundary exists after, for example, each word input to the neural network unit. When a sentence boundary is detected by comparison with the threshold vector θ, the device terminates the sentence boundary detection processing immediately. The device can therefore perform sentence boundary detection before much time has elapsed since the user started speaking, and so can execute sentence segmentation in real time.
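The comparison with early termination described for the fourth and fifth inventions can be sketched as follows. This is a minimal illustration, assuming that element k of the output vector refers to a boundary k words back from the current word and that a boundary is declared when the probability strictly exceeds the threshold; the function name `detect_boundary` and the strict inequality are assumptions, not details stated in the text.

```python
from typing import Optional, Sequence


def detect_boundary(y_t: Sequence[float], theta: Sequence[float]) -> Optional[int]:
    """Return the first k (1-based) whose probability exceeds its threshold.

    y_t   : (m+1)-dimensional output vector; the last element is the extra one.
    theta : m-dimensional threshold vector.
    k = 1 is the candidate closest in time to the current word, so scanning
    k = 1, 2, ..., m checks the nearest candidates first.
    """
    m = len(theta)
    if len(y_t) != m + 1:
        raise ValueError("y_t must have exactly one more element than theta")
    for k in range(1, m + 1):
        if y_t[k - 1] > theta[k - 1]:
            return k  # stop here: later candidates are not compared
    return None  # no boundary detected at this time step


nearest = detect_boundary([0.7, 0.1, 0.1, 0.1], [0.5, 0.5, 0.5])
second = detect_boundary([0.05, 0.6, 0.3, 0.05], [0.5, 0.5, 0.5])
nothing = detect_boundary([0.1, 0.1, 0.1, 0.7], [0.5, 0.5, 0.5])
```

Returning on the first hit is what keeps the per-word cost at no more than m comparisons.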

A sixth invention is any one of the third to fifth inventions, wherein the threshold vector setting unit defines a value F_1 as
F_1 = 2 × Precision × Recall / (Precision + Recall)
Precision: the proportion of data predicted to be correct that is actually correct
Recall: the proportion of actually correct data that is predicted to be correct
and defines an evaluation value score as
score = F_1 − α × latency
latency: delay time (amount of delay)
α: coefficient
and sets the threshold vector θ so that, on the data set used for tuning the threshold vector θ, the evaluation value score is greater than a predetermined value.
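As a worked illustration of the evaluation value, the following sketch computes F_1 and score from boundary counts. The count names `tp`, `fp`, and `fn` (correctly detected, spurious, and missed boundaries) and the function names are assumptions introduced here; how latency is measured and which value of α is used are not specified in this passage.

```python
def f1_value(tp: int, fp: int, fn: int) -> float:
    """F_1 = 2 * Precision * Recall / (Precision + Recall)."""
    if tp == 0:
        return 0.0  # avoids division by zero when nothing was detected correctly
    precision = tp / (tp + fp)  # share of predicted boundaries that are correct
    recall = tp / (tp + fn)     # share of true boundaries that were predicted
    return 2.0 * precision * recall / (precision + recall)


def evaluation_score(tp: int, fp: int, fn: int, latency: float, alpha: float) -> float:
    """score = F_1 - alpha * latency."""
    return f1_value(tp, fp, fn) - alpha * latency


# 8 correct boundaries, 2 spurious, 2 missed; latency of 2.0 units, alpha = 0.1
example_score = evaluation_score(tp=8, fp=2, fn=2, latency=2.0, alpha=0.1)
```

With these numbers Precision and Recall are both 0.8, so F_1 is 0.8 and the score is about 0.6. A larger α penalizes delayed detection more heavily.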

In this data segmentation device, data segmentation can be performed using the threshold vector θ tuned as described above. Because the tuned threshold vector θ is optimized based on an evaluation value that takes into account both the accuracy and the latency of data segmentation, it is appropriate as the threshold vector used for threshold processing when data segmentation is executed.

That is, by executing, for example, sentence boundary detection and sentence segmentation using the threshold vector θ tuned as described above, this data segmentation device can realize sentence boundary detection and sentence segmentation with high accuracy and low latency.

Note that the threshold vector θ may instead be set so that the evaluation value score is maximized on the data set used for tuning the threshold vector θ.

According to the present invention, it is possible to realize a sequence data segmentation method (for example, a sentence segmentation method) and a data segmentation device capable of executing sequence data segmentation processing (for example, sentence segmentation processing) in real time, regardless of the number of unit data items (for example, the number of words) constituting the sequence data (for example, a sentence). Furthermore, by using the sentence segmentation method and the sentence segmentation device, highly accurate machine translation can be performed in real time, and real-time simultaneous interpretation by machine can be realized.

FIG. 1 is a schematic configuration diagram of the simultaneous interpretation system 1000 according to the first embodiment.
FIG. 2 is a schematic configuration diagram of the sentence division unit 2 of the sentence segmentation device 100 according to the first embodiment.
FIG. 3 is a diagram in which the data input/output relationships of the layers of the neural network unit 22 of the sentence segmentation device 100 during learning are unrolled in time series.
FIG. 4 is a diagram in which the data input/output relationships of the layers of the neural network unit 22 of the sentence segmentation device 100 during prediction (execution) are unrolled in time series.
FIG. 5 is a flowchart of the sentence boundary detection processing.
FIG. 6 shows pseudo-code of the algorithm for tuning the threshold vector θ.
FIG. 7 is a schematic configuration diagram of the speaker identification system 2000 according to the second embodiment.
FIG. 8 is a schematic configuration diagram of the data division unit 2A of the data division device 100A of the speaker identification system 2000 according to the second embodiment.
FIG. 9 is a schematic configuration diagram of the video identification system 3000 according to the third embodiment.
FIG. 10 is a schematic configuration diagram of the data division unit 2B of the data division device 100B of the video identification system 3000 according to the third embodiment.
FIG. 11 is a schematic configuration diagram of the video identification system 3000A according to the first modification of the third embodiment.
FIG. 12 is a schematic configuration diagram of the video identification system 3000B according to the second modification of the third embodiment.
FIG. 13 is a block diagram showing the internal configuration of a computer that implements the first embodiment.

[First Embodiment]
A first embodiment is described below with reference to the drawings.

<1.1: Configuration of the Simultaneous Interpretation System>
FIG. 1 is a schematic configuration diagram of the simultaneous interpretation system 1000 according to the first embodiment.

FIG. 2 is a schematic configuration diagram of the sentence division unit 2 of the sentence segmentation device 100 according to the first embodiment.

As shown in FIG. 1, the simultaneous interpretation system 1000 includes a speech recognition unit Aud1, a sentence segmentation device 100 (data segmentation device), and a machine translation unit MT1.

The speech recognition unit Aud1 receives speech data Din acquired by a speech input device such as a microphone. The speech recognition unit Aud1 has, for example, databases of an acoustic model, a language model, and a dictionary. It performs speech recognition on the speech data Din using the acoustic model, the language model, and the dictionary, and obtains text data D1 corresponding to the speech data Din. The speech recognition unit Aud1 then outputs the obtained text data D1 to the sentence segmentation device 100.

As shown in FIG. 1, the sentence segmentation device 100 includes a threshold vector setting unit 1 and a sentence division unit 2.

The threshold vector setting unit 1 obtains the threshold vector θ by learning with a predetermined data set, and outputs the obtained threshold vector θ to the sentence division unit 2.

As shown in FIG. 2, the sentence division unit 2 includes a word acquisition unit 21, a neural network unit 22, a sentence boundary detection unit 23, and a sentence acquisition unit 24.

The word acquisition unit 21 receives the text data D1 output from the speech recognition unit Aud1. It obtains word data x_t from the text data D1 and outputs the word data x_t to the neural network unit 22.

The neural network unit 22 is a neural network having a plurality of RNN (Recurrent Neural Network) layers. As shown in FIG. 2, the neural network unit 22 includes an embedding layer 221, a first RNN layer 222, a second RNN layer 223, a third RNN layer 224, an output mapping layer 225, and a softmax layer 226.

The neural network unit 22 receives the word data x_t output from the word acquisition unit 21, performs neural network processing using the word data x_t, and obtains sentence segmentation determination data y_t. It then outputs the sentence segmentation determination data y_t to the sentence boundary detection unit 23.

The embedding layer 221 converts the word data x_t into distributed representation data by a matrix transformation using an embedding matrix, and outputs the distributed representation data to the first RNN layer 222.

The first RNN layer 222, the second RNN layer 223, and the third RNN layer 224 are each composed of an RNN. The first RNN layer 222 receives the distributed representation data xo_emb(t) output from the embedding layer 221 at time t and the data xo_RNN1(t−1) output from the first RNN layer 222 at time t−1, and performs RNN processing using them. That is, the first RNN layer 222 executes processing corresponding to
xo_RNN1(t) = W_rec × xo_RNN1(t−1) + W_1 × xo_emb(t)
W_rec: weight matrix
W_1: weight matrix
to obtain the output data xo_RNN1(t) of the first RNN layer at time t, and outputs that data to the second RNN layer 223.

The second RNN layer 223 receives the data xo_RNN1(t) output from the first RNN layer 222 at time t and the data xo_RNN2(t−1) output from the second RNN layer 223 at time t−1, and performs RNN processing using them. That is, the second RNN layer 223 executes processing corresponding to
xo_RNN2(t) = W_rec2 × xo_RNN2(t−1) + W_2 × xo_RNN1(t)
W_rec2: weight matrix
W_2: weight matrix
to obtain the output data xo_RNN2(t) of the second RNN layer at time t, and outputs that data to the third RNN layer 224.

The third RNN layer 224 receives the data xo_RNN2(t) output from the second RNN layer 223 at time t and the data xo_RNN3(t−1) output from the third RNN layer 224 at time t−1, and performs RNN processing using them. That is, the third RNN layer 224 executes processing corresponding to
xo_RNN3(t) = W_rec3 × xo_RNN3(t−1) + W_3 × xo_RNN2(t)
W_rec3: weight matrix
W_3: weight matrix
to obtain the output data xo_RNN3(t) of the third RNN layer at time t, and outputs that data to the output mapping layer 225.
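The three recurrences above share one pattern: each layer adds a weighted copy of its own previous output to a weighted copy of the current output of the layer below. The following is a minimal sketch of one time step through such a stack, written with plain Python lists. It implements only the linear form given in the text; practical RNN cells usually add a bias and a nonlinearity, which are omitted here. The helper names (`matvec`, `rnn_step`, `stacked_step`) are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, v: Vector) -> Vector:
    """Multiply matrix w by column vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]


def rnn_step(w_rec: Matrix, w_in: Matrix, prev_out: Vector, layer_in: Vector) -> Vector:
    """xo(t) = W_rec x xo(t-1) + W_in x input(t), as in the text."""
    recurrent = matvec(w_rec, prev_out)
    feedforward = matvec(w_in, layer_in)
    return [r + f for r, f in zip(recurrent, feedforward)]


def stacked_step(layers: List[Tuple[Matrix, Matrix]],
                 states: List[Vector],
                 xo_emb: Vector) -> List[Vector]:
    """Advance every layer by one time step.

    layers : (W_rec, W_in) per layer, bottom layer first.
    states : each layer's output at time t-1.
    xo_emb : embedding-layer output at time t.
    Returns each layer's output at time t.
    """
    new_states = []
    layer_in = xo_emb
    for (w_rec, w_in), prev_out in zip(layers, states):
        out = rnn_step(w_rec, w_in, prev_out, layer_in)
        new_states.append(out)
        layer_in = out  # the next layer up consumes this layer's output at time t
    return new_states


# One-dimensional toy weights: W_rec = 0.5 and W_in = 1.0 in all three layers.
toy_layers = [([[0.5]], [[1.0]])] * 3
step1 = stacked_step(toy_layers, [[0.0], [0.0], [0.0]], [2.0])
step2 = stacked_step(toy_layers, step1, [0.0])
```

Because each layer carries its previous output forward, information from words seen many steps earlier can still influence the current output, which a fixed-order n-gram model cannot do.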

The output mapping layer 225 is composed of, for example, a neural network. The output mapping layer 225 receives the data output from the third RNN layer 224. It weights the data output from each node of the third RNN layer 224, sums the weighted data, and applies an activation function (for example, tanh(x)) to the sum, thereby obtaining data whose dimension equals the number of nodes in the softmax layer 226. The output mapping layer 225 then outputs the obtained data to the softmax layer 226.

The softmax layer 226 uses, for example, the softmax function as its activation function and outputs an (m+1)-dimensional vector (m: a natural number) as output data. The softmax layer 226 applies the activation function to the data output from the output mapping layer 225 and obtains (m+1)-dimensional vector data. For example, the output value y_t(i) of the i-th node (i: a natural number, 1 ≤ i ≤ m+1) of the softmax layer 226 is calculated by the following formula (softmax function).

The softmax layer 226 obtains (m+1)-dimensional vector data whose elements are the output values y_t(i) of the respective nodes, and outputs this vector data as the output data y_t (sentence segmentation determination data y_t) to the sentence boundary detection unit 23.
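The last two layers can be sketched as follows. The sketch assumes the conventional definition of the softmax function, y(i) = exp(a_i) / Σ_j exp(a_j), where a_i is the i-th value received from the output mapping layer; the symbol a and the function names are introduced here for illustration. The maximum is subtracted before exponentiation for numerical stability, which leaves the result unchanged.

```python
import math
from typing import List


def output_mapping(weights: List[List[float]], node_outputs: List[float]) -> List[float]:
    """Weighted sum of the third RNN layer's outputs followed by tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, node_outputs))) for row in weights]


def softmax(a: List[float]) -> List[float]:
    """Conventional softmax: non-negative outputs that sum to 1."""
    peak = max(a)
    exps = [math.exp(value - peak) for value in a]
    total = sum(exps)
    return [e / total for e in exps]


# m = 3, so the softmax layer has m + 1 = 4 nodes (toy weights, 1-dimensional RNN output).
mapped = output_mapping([[1.0], [0.0], [-1.0], [0.5]], [2.0])
y_t = softmax(mapped)
uniform = softmax([0.0, 0.0, 0.0, 0.0])
```

Since the outputs always sum to 1, the (m+1)-th element automatically takes whatever probability mass is not assigned to the m boundary candidates.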

The sentence boundary detection unit 23 receives the sentence segmentation determination data y_t output from the neural network unit 22 and the threshold vector θ output from the threshold vector setting unit 1. Based on the sentence segmentation determination data y_t and the threshold vector θ, it performs processing to detect sentence boundaries and obtains detection result data δ_t. The sentence boundary detection unit 23 then outputs the detection result data δ_t to the sentence acquisition unit 24.

The sentence acquisition unit 24 receives the text data D1 output from the speech recognition unit Aud1 and the detection result data δ_t output from the sentence boundary detection unit 23. Based on the detection result data δ_t, the sentence acquisition unit 24 divides the text data D1 into sentence units. That is, it obtains, as data D2, the text data D1 with data indicating sentence boundaries (for example, the symbol <EOS>) added according to the detection result data δ_t, and outputs the data D2 to the machine translation unit MT1.
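The marking step performed by the sentence acquisition unit can be illustrated as follows. This is a minimal sketch, assuming the detection results have already been collected into a set of 1-based word positions after which a boundary was detected; the actual format of δ_t is not specified in the text, and the function name `mark_sentence_ends` is hypothetical.

```python
from typing import Iterable, List


def mark_sentence_ends(words: List[str],
                       boundary_after: Iterable[int],
                       marker: str = "<EOS>") -> List[str]:
    """Insert `marker` after every word whose 1-based position is in `boundary_after`."""
    ends = set(boundary_after)
    marked = []
    for position, word in enumerate(words, start=1):
        marked.append(word)
        if position in ends:
            marked.append(marker)
    return marked


d1_words = "i 'd like some strawberries how much does it cost".split()
d2_words = mark_sentence_ends(d1_words, [5])
```

Here the marker lands directly after "strawberries", so a downstream translator could consume everything up to the marker as one sentence.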

Referring to FIG. 1, the machine translation unit MT1 receives the data D2 output from the sentence segmentation device 100. The machine translation unit MT1 performs machine translation on the data D2 and obtains the translated data Dout.

<1.2: Operation of the Simultaneous Interpretation System>
The operation of the simultaneous interpretation system 1000 configured as described above is explained below.

(1.2.1: Learning Processing)
First, the learning processing of the neural network of the neural network unit 22 of the sentence segmentation device 100 in the simultaneous interpretation system 1000 is described.

Let a sequence S of sentences be S = (S_1, S_2, ...). That is, sentence S_{i+1} is the sentence that follows sentence S_i. One training sample (X_i, n_i) is extracted from (S_i, S_{i+1}). With
S_i = (w_1^i, w_2^i, ..., w_{n_i}^i),
(1) for 1 ≤ t ≤ n_i,
x_t = w_t^i
and
(2) for n_i+1 ≤ t ≤ n_i+m,
x_t = w_{t−n_i}^{i+1}.
Here, X_i = (x_1, x_2, ..., x_{n_i+m}), and X_i is the sequence of input words.

When the data y_t is teacher data (ideal data), y_t is defined as follows.
y_t^{<k>} = 1 if 1 ≤ t ≤ n_i and k = m+1
y_t^{<k>} = 1 if n_i+1 ≤ t ≤ n_i+m and k = t−n_i
y_t^{<k>} = 0 otherwise
Accordingly, the following criterion is adopted in order to minimize the cross entropy E(S) between the actual data y_t (the output data y_t obtained when training data is input) and the teacher data.

The neural network unit 22 of the sentence segmentation device 100 receives training data and obtains output data y_t. The parameters of the neural network of the neural network unit 22 (the weights between synapses) are then determined so that the cross entropy E(S) for the obtained data y_t satisfies a predetermined criterion. A trained model is then constructed in the neural network of the neural network unit 22 using the determined parameters (the weights between synapses).
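For orientation, the quantity being minimized can be sketched as below. The sketch assumes the conventional cross entropy between teacher vectors and predicted vectors, summed over time steps: E = −Σ_t Σ_k teacher_t^{<k>} × log(predicted_t^{<k>}). The exact criterion used in this publication is not shown in the text, so this is an illustration of the standard definition only, and the function name and the small constant `eps` are assumptions.

```python
import math
from typing import List


def cross_entropy(teacher: List[List[float]],
                  predicted: List[List[float]],
                  eps: float = 1e-12) -> float:
    """Sum over time steps of -sum_k teacher[k] * log(predicted[k])."""
    total = 0.0
    for target_vec, output_vec in zip(teacher, predicted):
        for target, output in zip(target_vec, output_vec):
            if target > 0.0:
                # eps guards against log(0) when the network outputs exactly 0
                total -= target * math.log(max(output, eps))
    return total


# One time step, m = 3: the teacher says "no boundary yet" (last element is 1).
loss = cross_entropy([[0, 0, 0, 1]], [[0.1, 0.1, 0.1, 0.7]])
perfect = cross_entropy([[1, 0, 0, 0]], [[1.0, 0.0, 0.0, 0.0]])
```

The loss falls toward 0 as the network places more probability on the element that the teacher data marks with 1.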

例えば、文章データＤ１が「ｉ‘ｄｌｉｋｅｓｏｍｅｓｔｒａｗｂｅｒｒｉｅｓｈｏｗｍｕｃｈｄｏｅｓｉｔｃｏｓｔ」である場合について、図３を用いて説明する。 For example, a case where text data D1 is "i'd like some strawberries how much does it cost" will be described with reference to FIG.

図３は、学習時における文章分割装置１００のニューラルネットワーク部２２の各層のデータ入出力関係を時系列に展開した図である。 FIG. 3 is a diagram showing the data input/output relationship of each layer of the neural network unit 22 of the text segmentation device 100 during learning, developed in time series.

図３に示すように、ニューラルネットワーク部２２には、文章データＤ１から出力した以下の単語データｘ_ｔが入力される。なお、ｍ＝３とする。
ｘ_１＝「ｉ」
ｘ_２＝「‘ｄ」
ｘ_３＝「ｌｉｋｅ」
ｘ_４＝「ｓｏｍｅ」
ｘ_５＝「ｓｔｒａｗｂｅｒｒｉｅｓ」
ｘ_６＝「ｈｏｗ」
ｘ_７＝「ｍｕｃｈ」
ｘ_８＝「ｄｏｅｓ」
そして、ニューラルネットワーク部２２の出力は、ｍ＋１次元のベクトルである。時刻ｔのニューラルネットワーク部２２の出力は、ｙ_ｔであり、
ｙ_ｔ＝（ｙ_ｔ ^＜１＞，ｙ_ｔ ^＜２＞，・・・，ｙ_ｔ ^＜ｍ＞，ｙ_ｔ ^{＜ｍ＋１＞}）
ｙ_ｔ ^＜１＞：単語ｗ_ｔ－１（入力データｘ_ｔ－１）の後に文章の境界（区切り）がある確率
ｙ_ｔ ^＜２＞：単語ｗ_ｔ－２（入力データｘ_ｔ－２）の後に文章の境界（区切り）がある確率
・・・
ｙ_ｔ ^＜ｍ＞：単語ｗ_ｔ－ｍ（入力データｘ_ｔ－ｍ）の後に文章の境界（区切り）がある確率
ｙ_ｔ ^{＜ｍ＋１＞}：ｙ_ｔの全ての要素の加算値を「１」とするための値
ｙ_ｔ ^{＜ｍ＋１＞}は、以下の数式を満たす。

上記の場合、図３に示すように、「ｓｔｒａｗｂｅｒｒｉｅｓ」の後に文章の境界があるので、教師データｙ_ｔを以下のデータとして、学習を行う。
ｙ_１＝ｙ_２＝ｙ_３＝ｙ_４＝ｙ_５＝（０，０，０，１）
ｙ_６＝（１，０，０，０）
ｙ_７＝（０，１，０，０）
ｙ_８＝（０，０，１，０）
上記以外の訓練用データについても同様にして学習を行う。つまり、上記クロスエントロピーＥ（Ｓ）が所定の基準を満たすように、ニューラルネットワーク部２２のニューラルネットワークのパラメータ（各シナプス間の重み付け）を決定する。そして、決定したパラメータ（各シナプス間の重み付け）を用いて、ニューラルネットワーク部２２のニューラルネットワークにおいて、学習済みモデルを構築する。 As shown in FIG. 3, the following word data x_t, obtained from the text data D1, are input to the neural network unit 22. Here, m = 3.
x_1 = "i"
x_2 = "'d"
x_3 = "like"
x_4 = "some"
x_5 = "strawberries"
x_6 = "how"
x_7 = "much"
x_8 = "does"
The output of the neural network unit 22 is an (m+1)-dimensional vector. The output of the neural network unit 22 at time t is y_t, where
y_t = (y_t^<1>, y_t^<2>, ..., y_t^<m>, y_t^<m+1>)
y_t^<1>: probability that there is a sentence boundary (delimiter) after word w_{t−1} (input data x_{t−1})
y_t^<2>: probability that there is a sentence boundary (delimiter) after word w_{t−2} (input data x_{t−2})
...
y_t^<m>: probability that there is a sentence boundary (delimiter) after word w_{t−m} (input data x_{t−m})
y_t^<m+1>: the value that makes the sum of all elements of y_t equal to 1
y_t^<m+1> satisfies the following formula.
y_t^<m+1> = 1 − (y_t^<1> + y_t^<2> + ... + y_t^<m>)

In the above case, as shown in FIG. 3, there is a sentence boundary after "strawberries", so learning is performed using the following teacher data y_t.
y_1 = y_2 = y_3 = y_4 = y_5 = (0, 0, 0, 1)
y_6 = (1, 0, 0, 0)
y_7 = (0, 1, 0, 0)
y_8 = (0, 0, 1, 0)
Learning is performed in the same way for training data other than the above. That is, the parameters of the neural network of the neural network unit 22 (the weights between synapses) are determined so that the cross entropy E(S) satisfies a predetermined criterion, and a trained model is constructed in the neural network of the neural network unit 22 using the determined parameters.
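The construction of one training sample and its teacher data can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the specification: the function name make_training_sample is hypothetical, sentences are assumed to be already tokenized into word lists, and the following sentence is assumed to contain at least m words.

```python
from typing import List, Tuple


def make_training_sample(
    s_i: List[str], s_next: List[str], m: int
) -> Tuple[List[str], List[List[int]]]:
    """Build the input words X_i and the one-hot teacher vectors y_1..y_{n_i+m}."""
    n_i = len(s_i)
    # x_t = w_t^i for t <= n_i, followed by the first m words of S_{i+1}.
    x = s_i + s_next[:m]
    teacher = []
    for t in range(1, len(x) + 1):
        y = [0] * (m + 1)
        if t <= n_i:
            y[m] = 1            # k = m + 1: no boundary within the last m words
        else:
            y[t - n_i - 1] = 1  # k = t - n_i: the boundary lies k words back
        teacher.append(y)
    return x, teacher


# The example of FIG. 3 with m = 3.
words, labels = make_training_sample(
    ["i", "'d", "like", "some", "strawberries"],
    ["how", "much", "does", "it", "cost"],
    m=3,
)
for t, (word, y) in enumerate(zip(words, labels), start=1):
    print(t, word, y)
```

With the sentences above, the sketch yields the same eight input words and the same teacher vectors y_1 to y_8 as listed for FIG. 3.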

（１．２．２：予測処理）
次に、同時通訳システム１０００において、上記学習処理により取得した学習済みモデルを用いた予測処理、すなわち、同時通訳処理について説明する。 (1.2.2: prediction processing)
Next, in the simultaneous interpretation system 1000, prediction processing using the trained model obtained by the above learning processing, that is, simultaneous interpretation processing will be described.

以下では、説明便宜のため、学習用の文章データＤ１が「ｉ‘ｄｌｉｋｅｓｏｍｅｓｔｒａｗｂｅｒｒｉｅｓｈｏｗｍｕｃｈｄｏｅｓｉｔｃｏｓｔ」である場合について、説明する。 For convenience of explanation, a case where the text data D1 for learning is "i'd like some strawberries how much does it cost" will be explained below.

図４は、予測時（実行時）における文章分割装置１００のニューラルネットワーク部２２の各層のデータ入出力関係を時系列に展開した図である。 FIG. 4 is a diagram showing the data input/output relationship of each layer of the neural network unit 22 of the text segmentation device 100 at the time of prediction (at the time of execution) developed in chronological order.

図４に示すように、ニューラルネットワーク部２２には、文章データＤ１から出力した以下の単語データｘ_ｔが入力される。なお、ｍ＝３とする。 As shown in FIG. 4, the neural network unit 22 receives the following word data _xt output from the sentence data D1. Note that m=3.

単語取得部２１は、音声認識部Ａｕｄ１から入力された文章データＤ１から単語データｘ_ｔ（時刻ｔにおける単語データｘ_ｔ）を取得する。具体的には、単語取得部２１は、ｔ＝１～８（１≦ｔ≦８）において、文章データＤ１から以下の単語データｘ_ｔを取得し、ニューラルネットワーク部２２の埋込層２２１に入力する。
ｘ_１＝「ｉ」
ｘ_２＝「‘ｄ」
ｘ_３＝「ｌｉｋｅ」
ｘ_４＝「ｓｏｍｅ」
ｘ_５＝「ｓｔｒａｗｂｅｒｒｉｅｓ」
ｘ_６＝「ｈｏｗ」
ｘ_７＝「ｍｕｃｈ」
ｘ_８＝「ｄｏｅｓ」
埋込層２２１では、入力された単語データｘ_ｔに対応する分散表現データが取得される。取得された分散表現データは、第１ＲＮＮ層２２２に入力される。 The word acquisition unit 21 acquires word data x_t (the word data x_t at time t) from the text data D1 input from the speech recognition unit Aud1. Specifically, for t = 1 to 8 (1 ≤ t ≤ 8), the word acquisition unit 21 acquires the following word data x_t from the text data D1 and inputs them to the embedding layer 221 of the neural network unit 22.
x_1 = "i"
x_2 = "'d"
x_3 = "like"
x_4 = "some"
x_5 = "strawberries"
x_6 = "how"
x_7 = "much"
x_8 = "does"
The embedding layer 221 acquires distributed representation data corresponding to the input word data x_t. The acquired distributed representation data is input to the first RNN layer 222.

第１ＲＮＮ層２２２は、時刻ｔにおいて埋込層２２１から出力される分散表現データｘｏ_ｅｍｂ（ｔ）と、時刻ｔ－１において第１ＲＮＮ層２２２から出力されたデータｘｏ_ＲＮＮ１（ｔ－１）とを用いて、ＲＮＮによる処理を実行する。つまり、第１ＲＮＮ層２２２は、
ｘｏ_ＲＮＮ１（ｔ）＝Ｗ_ｒｅｃ×ｘｏ_ＲＮＮ１（ｔ－１）＋Ｗ_１×ｘｏ_ｅｍｂ（ｔ）
Ｗ_ｒｅｃ：重み行列
Ｗ_１：重み行列
に相当する処理を実行し、時刻ｔの第１ＲＮＮ層の出力データｘｏ_ＲＮＮ１（ｔ）を取得し、当該データを第２ＲＮＮ層２２３に出力する。 The first RNN layer 222 executes RNN processing using the distributed representation data xo_emb(t) output from the embedding layer 221 at time t and the data xo_RNN1(t−1) output from the first RNN layer 222 at time t−1. That is, the first RNN layer 222 executes processing corresponding to
xo_RNN1(t) = W_rec × xo_RNN1(t−1) + W_1 × xo_emb(t)
W_rec: weight matrix
W_1: weight matrix
to obtain the output data xo_RNN1(t) of the first RNN layer at time t, and outputs that data to the second RNN layer 223.

第２ＲＮＮ層２２３は、時刻ｔにおいて第１ＲＮＮ層２２２から出力されるデータｘｏ_ＲＮＮ１（ｔ）と、時刻ｔ－１において第２ＲＮＮ層２２３から出力されたデータｘｏ_ＲＮＮ２（ｔ－１）とを用いて、ＲＮＮによる処理を実行する。つまり、第２ＲＮＮ層２２３は、
ｘｏ_ＲＮＮ２（ｔ）＝Ｗ_ｒｅｃ２×ｘｏ_ＲＮＮ２（ｔ－１）＋Ｗ_２×ｘｏ_ＲＮＮ１（ｔ）
Ｗ_ｒｅｃ２：重み行列
Ｗ_２：重み行列
に相当する処理を実行し、時刻ｔの第２ＲＮＮ層の出力データｘｏ_ＲＮＮ２（ｔ）を取得し、当該データを第３ＲＮＮ層２２４に出力する。 The second RNN layer 223 executes RNN processing using the data xo_RNN1(t) output from the first RNN layer 222 at time t and the data xo_RNN2(t−1) output from the second RNN layer 223 at time t−1. That is, the second RNN layer 223 executes processing corresponding to
xo_RNN2(t) = W_rec2 × xo_RNN2(t−1) + W_2 × xo_RNN1(t)
W_rec2: weight matrix
W_2: weight matrix
to obtain the output data xo_RNN2(t) of the second RNN layer at time t, and outputs that data to the third RNN layer 224.

第３ＲＮＮ層２２４は、時刻ｔにおいて第２ＲＮＮ層２２３から出力されるデータｘｏ_ＲＮＮ２（ｔ）と、時刻ｔ－１において第３ＲＮＮ層２２４から出力されたデータｘｏ_ＲＮＮ３（ｔ－１）とを用いて、ＲＮＮによる処理を実行する。つまり、第３ＲＮＮ層２２４は、
ｘｏ_ＲＮＮ３（ｔ）＝Ｗ_ｒｅｃ３×ｘｏ_ＲＮＮ３（ｔ－１）＋Ｗ_３×ｘｏ_ＲＮＮ２（ｔ）
Ｗ_ｒｅｃ３：重み行列
Ｗ_３：重み行列
に相当する処理を実行し、時刻ｔの第３ＲＮＮ層の出力データｘｏ_ＲＮＮ３（ｔ）を取得し、当該データを出力マッピング層２２５に出力する。 The third RNN layer 224 executes RNN processing using the data xo_RNN2(t) output from the second RNN layer 223 at time t and the data xo_RNN3(t−1) output from the third RNN layer 224 at time t−1. That is, the third RNN layer 224 executes processing corresponding to
xo_RNN3(t) = W_rec3 × xo_RNN3(t−1) + W_3 × xo_RNN2(t)
W_rec3: weight matrix
W_3: weight matrix
to obtain the output data xo_RNN3(t) of the third RNN layer at time t, and outputs that data to the output mapping layer 225.
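The three recurrences above share one form, so the stacked RNN layers can be sketched with a single step function. This is an illustrative sketch only: the names matvec, rnn_step and run_stack are hypothetical, the initial state of each layer is assumed to be a zero vector, and the update is kept purely linear to mirror the formulas as written (practical RNN layers usually also apply a nonlinearity or gating).

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, v: Sequence[float]) -> Vector:
    """Multiply matrix w by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in w]


def rnn_step(w_rec: Matrix, w_in: Matrix, prev: Vector, inp: Sequence[float]) -> Vector:
    """xo(t) = W_rec x xo(t-1) + W_in x input(t)."""
    return [r + i for r, i in zip(matvec(w_rec, prev), matvec(w_in, inp))]


def run_stack(
    layers: List[Tuple[Matrix, Matrix]],
    embeddings: List[Vector],
    hidden: int,
) -> List[Vector]:
    """Feed embeddings through stacked RNN layers; return top-layer output per time step."""
    states = [[0.0] * hidden for _ in layers]  # assumed zero initial state
    outputs = []
    for x in embeddings:
        inp = x
        for idx, (w_rec, w_in) in enumerate(layers):
            states[idx] = rnn_step(w_rec, w_in, states[idx], inp)
            inp = states[idx]  # output of this layer feeds the next layer
        outputs.append(inp)
    return outputs


# Toy weights: every layer halves its previous state and passes its input through.
half = [[0.5, 0.0], [0.0, 0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]
top = run_stack([(half, identity)] * 3, [[1.0, 0.0], [0.0, 1.0]], hidden=2)
print(top)
```

The output of the last layer at each time step corresponds to xo_RNN3(t), which is what the output mapping layer 225 receives.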

出力マッピング層２２５は、第３ＲＮＮ層２２４の各ノードから出力されるデータに対して重み付けを行い、重み付け後のデータを加算し、さらに、当該加算結果に対して、活性化関数による処理（例えば、ｔａｎｈ（ｘ）による処理）を実行し、ソフトマックス層２２６のノード数と同じ次元のデータを取得する。そして、出力マッピング層２２５は、取得したデータをソフトマックス層２２６に出力する。 The output mapping layer 225 weights the data output from each node of the third RNN layer 224, adds the weighted data, and further processes the addition result by an activation function (for example, tanh(x)) is executed to acquire data of the same dimension as the number of nodes in the softmax layer 226 . The output mapping layer 225 then outputs the acquired data to the softmax layer 226 .

ソフトマックス層２２６は、出力マッピング層２２５から出力されるデータに対して、活性化関数を用いた処理を実行し、ｍ＋１次元のベクトルデータを取得する。例えば、ソフトマックス層２２６のｉ番目（ｉ：自然数、１≦ｉ≦ｍ＋１）のノードの出力値ｙ_ｔ（ｉ）は、下記の数式（Ｓｏｆｔｍａｘ関数）により算出される。

ソフトマックス層２２６は、各ノードに対応する出力値ｙ_ｔ（ｉ）を要素とするｍ＋１次元のベクトルデータを取得し、取得したｍ＋１次元のベクトルデータを出力データｙ_ｔ（センテンス・セグメンテーション判定用データｙ_ｔ）として文章境界検出部２３に出力する。 The softmax layer 226 performs processing using an activation function on the data output from the output mapping layer 225 to obtain m+1-dimensional vector data. For example, the output value y _t (i) of the i-th (i: natural number, 1≤i≤m+1) node of the softmax layer 226 is calculated by the following formula (Softmax function).
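The Softmax formula referred to above is not reproduced in this text. The standard Softmax function maps inputs z_1, ..., z_{m+1} to exp(z_i) / Σ_j exp(z_j), and the following sketch assumes that this standard form is the one intended. The function name softmax and the input values are illustrative only.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    """Standard softmax: exp(z_i) / sum_j exp(z_j), shifted by max(z) for stability."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical output-mapping values for m = 3 (four nodes).
probs = softmax([0.2, 1.5, -0.3, 0.1])
print(probs, sum(probs))
```

The resulting elements are positive and sum to 1, which is consistent with the description of y_t^<m+1> as the value that brings the sum of all elements of y_t to 1.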

図４の場合、ｙ_１～ｙ_８は、以下のデータとして取得されたものとする。
ｙ_１＝（ｙ_１ ^＜１＞，ｙ_１ ^＜２＞，ｙ_１ ^＜３＞，ｙ_１ ^＜４＞）
＝（０．１５，０．２１，０．１８，０．４６）
ｙ_２＝（ｙ_２ ^＜１＞，ｙ_２ ^＜２＞，ｙ_２ ^＜３＞，ｙ_２ ^＜４＞）
＝（０．１３，０．２４，０．２１，０．４２）
ｙ_３＝（ｙ_３ ^＜１＞，ｙ_３ ^＜２＞，ｙ_３ ^＜３＞，ｙ_３ ^＜４＞）
＝（０．２５，０．１１，０．２２，０．４２）
ｙ_４＝（ｙ_４ ^＜１＞，ｙ_４ ^＜２＞，ｙ_４ ^＜３＞，ｙ_４ ^＜４＞）
＝（０．３６，０．２４，０．２１，０．１９）
ｙ_５＝（ｙ_５ ^＜１＞，ｙ_５ ^＜２＞，ｙ_５ ^＜３＞，ｙ_５ ^＜４＞）
＝（０．１７，０．１９，０．１３，０．５１）
ｙ_６＝（ｙ_６ ^＜１＞，ｙ_６ ^＜２＞，ｙ_６ ^＜３＞，ｙ_６ ^＜４＞）
＝（０．３３，０．２４，０．２１，０．２２）
ｙ_７＝（ｙ_７ ^＜１＞，ｙ_７ ^＜２＞，ｙ_７ ^＜３＞，ｙ_７ ^＜４＞）
＝（０．１５，０．５１，０．１２，０．２２）
ｙ_８＝（ｙ_８ ^＜１＞，ｙ_８ ^＜２＞，ｙ_８ ^＜３＞，ｙ_８ ^＜４＞）
＝（０．１３，０．２４，０．６１，０．０２）
また、閾値ベクトル設定部１は、所定のデータセットを用いて学習した閾値ベクトルθをセンテンス分割部２に出力する。なお、ここでは、閾値ベクトルθは、
θ＝（θ^＜１＞，θ^＜２＞，θ^＜３＞）＝（０．４，０．５，０．６）
であるものとする。 In the case of FIG. 4, y_1 to y_8 are assumed to be obtained as the following data.
y_1 = (y_1^<1>, y_1^<2>, y_1^<3>, y_1^<4>)
= (0.15, 0.21, 0.18, 0.46)
y_2 = (y_2^<1>, y_2^<2>, y_2^<3>, y_2^<4>)
= (0.13, 0.24, 0.21, 0.42)
y_3 = (y_3^<1>, y_3^<2>, y_3^<3>, y_3^<4>)
= (0.25, 0.11, 0.22, 0.42)
y_4 = (y_4^<1>, y_4^<2>, y_4^<3>, y_4^<4>)
= (0.36, 0.24, 0.21, 0.19)
y_5 = (y_5^<1>, y_5^<2>, y_5^<3>, y_5^<4>)
= (0.17, 0.19, 0.13, 0.51)
y_6 = (y_6^<1>, y_6^<2>, y_6^<3>, y_6^<4>)
= (0.33, 0.24, 0.21, 0.22)
y_7 = (y_7^<1>, y_7^<2>, y_7^<3>, y_7^<4>)
= (0.15, 0.51, 0.12, 0.22)
y_8 = (y_8^<1>, y_8^<2>, y_8^<3>, y_8^<4>)
= (0.13, 0.24, 0.61, 0.02)
The threshold vector setting unit 1 outputs a threshold vector θ, learned using a predetermined data set, to the sentence dividing unit 2. Here, the threshold vector θ is assumed to be
θ = (θ^<1>, θ^<2>, θ^<3>) = (0.4, 0.5, 0.6).

文章境界検出部２３は、ニューラルネットワーク部２２から出力されるセンテンス・セグメンテーション判定用データｙ_ｔと、閾値ベクトルθとに基づいて、文章境界を検出する処理を実行する。この処理について、図５のフローチャートを用いて説明する。 The sentence boundary detection unit 23 executes processing for detecting sentence boundaries based on the sentence segmentation determination data y_t output from the neural network unit 22 and the threshold vector θ. This processing is described with reference to the flowchart of FIG. 5.

図５は、文章境界検出処理のフローチャートである。 FIG. 5 is a flow chart of sentence boundary detection processing.

（ステップＳ１）：
ステップＳ１において、文章境界検出部２３は、ｉ＝１に設定する処理を行う。 (Step S1):
In step S1, the sentence boundary detection unit 23 performs processing for setting i=1.

（ステップＳ２）：
ステップＳ２において、文章境界検出部２３は、センテンス・セグメンテーション判定用データｙ_ｔのｉ番目の要素ｙ_ｔ ^＜ｉ＞と、閾値ベクトルθのｉ番目の要素θ^＜ｉ＞との比較処理を行う。そして、ｙ_ｔ ^＜ｉ＞＞θ^＜ｉ＞である場合、処理をステップＳ３に進め、ｙ_ｔ ^＜ｉ＞＞θ^＜ｉ＞ではない場合、処理をステップＳ４に進める。 (Step S2):
In step S2, the sentence boundary detection unit 23 compares the i-th element y_t^<i> of the sentence segmentation determination data y_t with the i-th element θ^<i> of the threshold vector θ. If y_t^<i> > θ^<i>, the process proceeds to step S3; otherwise, the process proceeds to step S4.

（ステップＳ３）：
ステップＳ３において、文章境界検出部２３は、文章境界決定処理を行う。具体的には、文章境界検出部２３は、ｙ_ｔ ^＜ｉ＞＞θ^＜ｉ＞であるので、文章境界（文章の区切り位置）が時刻ｔ－ｉにニューラルネットワーク部２２に入力された単語ｘ_ｔ－ｉの後であると判定する。そして、文章境界検出部２３は、当該判定結果を含むデータをセンテンス取得部２４に出力する。具体的には、文章境界検出部２３は、時刻ｔにおいて、ニューラルネットワーク部２２に入力された単語ｘ_ｔの後に文章境界があると判定した場合、δ_ｔ＝１とし、ニューラルネットワーク部２２に入力された単語ｘ_ｔの後に文章境界がないと判定した場合、δ_ｔ＝０とする。そして、δ_ｔを含むデータをセンテンス取得部２４に出力する。 (Step S3):
In step S3, the sentence boundary detection unit 23 performs sentence boundary determination processing. Specifically, since y_t^<i> > θ^<i>, the sentence boundary detection unit 23 determines that the sentence boundary (sentence delimiting position) is after the word x_{t−i} that was input to the neural network unit 22 at time t−i, and outputs data including the determination result to the sentence acquisition unit 24. Specifically, when the sentence boundary detection unit 23 determines that there is a sentence boundary after the word x_t input to the neural network unit 22 at time t, it sets δ_t = 1; when it determines that there is no sentence boundary after the word x_t, it sets δ_t = 0. It then outputs data including δ_t to the sentence acquisition unit 24.

例えば、図４の場合、ｙ_７ ^＜２＞＞θ^＜２＞（０．５１＞０．５）であるので、δ_５＝１（５＝７－２）となり、文章境界検出部２３は、ｘ_５（「ｓｔｒａｗｂｅｒｒｉｅｓ」）の後に、文章境界があると判定する。 For example, in the case of FIG. 4, y_7^<2> > θ^<2> (0.51 > 0.5), so δ_5 = 1 (5 = 7 − 2), and the sentence boundary detection unit 23 determines that there is a sentence boundary after x_5 ("strawberries").

（ステップＳ４）：
ステップＳ４において、文章境界検出部２３は、値ｉと値ｍとを比較し、ｉ＞ｍである場合、処理を終了させ、ｉ＞ｍではない場合、処理をステップＳ５に進める。 (Step S4):
In step S4, the sentence boundary detection unit 23 compares the value i and the value m, and terminates the process if i>m, and advances the process to step S5 if i>m is not true.

ステップＳ５では、値ｉを１だけインクリメントし、処理をステップＳ２に戻す。 In step S5, the value i is incremented by 1, and the process returns to step S2.

文章境界検出部２３では、上記のようにして、文章境界検出処理が実行される。 The sentence boundary detection section 23 executes the sentence boundary detection process as described above.
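The flow of steps S1 to S5 can be sketched as follows. This is an illustrative sketch, not the implementation of the sentence boundary detection unit 23: the function name detect_boundary is hypothetical, and the loop is assumed to compare only the m elements of y_t that have a corresponding threshold in θ.

```python
from typing import List, Optional


def detect_boundary(y_t: List[float], theta: List[float], t: int) -> Optional[int]:
    """Return the position of the word after which a boundary is detected, or None."""
    m = len(theta)
    for i in range(1, m + 1):          # steps S1, S4, S5: i runs from 1 to m
        if y_t[i - 1] > theta[i - 1]:  # step S2: compare y_t^<i> with theta^<i>
            return t - i               # step S3: boundary after x_{t-i}; stop at once
    return None                        # no element exceeded its threshold


# Values of FIG. 4 (m = 3).
outputs = [
    [0.15, 0.21, 0.18, 0.46],
    [0.13, 0.24, 0.21, 0.42],
    [0.25, 0.11, 0.22, 0.42],
    [0.36, 0.24, 0.21, 0.19],
    [0.17, 0.19, 0.13, 0.51],
    [0.33, 0.24, 0.21, 0.22],
    [0.15, 0.51, 0.12, 0.22],
    [0.13, 0.24, 0.61, 0.02],
]
theta = [0.4, 0.5, 0.6]
results = [detect_boundary(y, theta, t) for t, y in enumerate(outputs, start=1)]
print(results)
```

With these values no boundary is reported for t = 1 to 6, and a boundary after word 5 ("strawberries") is reported at t = 7 (via y_7^<2>) and again at t = 8 (via y_8^<3>).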

上記の通り、文章境界検出部２３では、現時刻ｔに時間的に近い方から順番に、ニューラルネットワーク部２２に入力された単語の後に文章境界があるか否かを判定する。そして、文章境界検出部２３は、センテンス・セグメンテーション判定用データｙ_ｔと、閾値ベクトルθとの比較処理により、文章境界があると判定したら、即座に文章境界を出力し処理を終了させる。したがって、文章境界検出部２３では、高速に文章境界を検出することができる。また、文章境界検出部２３では、上記の通り、現時刻ｔから１～ｍステップ前の時刻（ｔ－１～ｔ－ｍ）までのｍ個のデータについてのみ、文章境界検出処理を行うので、文章境界が検出されない場合において、処理が不要に長引くことを防止することができる。 As described above, the sentence boundary detection unit 23 determines whether there is a sentence boundary after each word input to the neural network unit 22, in order starting from the word closest in time to the current time t. When the sentence boundary detection unit 23 determines, by comparing the sentence segmentation determination data y_t with the threshold vector θ, that there is a sentence boundary, it immediately outputs the sentence boundary and ends the processing. The sentence boundary detection unit 23 can therefore detect sentence boundaries at high speed. In addition, as described above, the sentence boundary detection unit 23 performs the sentence boundary detection processing only on the m pieces of data from 1 to m steps before the current time t (times t−1 to t−m), which prevents the processing from being unnecessarily prolonged when no sentence boundary is detected.

センテンス取得部２４は、文章境界検出部２３により取得された検出結果データδ_ｔに基づいて、文章データＤ１を文章単位に分割する。例えば、センテンス取得部２４は、検出結果データδ_ｔに基づいて、文章データＤ１の文章の境界を示すデータ（＜ＥＯＳ＞）を付与したデータをデータＤ２として取得し、取得したデータＤ２を機械翻訳部ＭＴ１に出力する。例えば、図４の場合、δ_５＝１であるため、センテンス取得部２４は、「ｓｔｒａｗｂｅｒｒｉｅｓ」の後に文章境界があると適切に判定することができる。 The sentence acquisition unit 24 divides the sentence data D1 into sentence units based on the detection result data δ _t acquired by the sentence boundary detection unit 23 . For example, based on the detection result data δ _t , the sentence acquisition unit 24 acquires data to which data (<EOS>) indicating a sentence boundary of the sentence data D1 is added as the data D2, and machine-translates the acquired data D2. output to the unit MT1. For example, in the case of FIG. 4, δ ₅ =1, so the sentence acquisition unit 24 can appropriately determine that there is a sentence boundary after “strawberries”.
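How the detection results might be turned into the data D2 can be sketched as follows. This is only an illustration: the function name insert_eos is hypothetical, the boundaries are assumed to be given as the 1-based positions t for which δ_t = 1, and the marker string "<EOS>" follows the example given above.

```python
from typing import Iterable, List


def insert_eos(words: List[str], boundaries: Iterable[int], eos: str = "<EOS>") -> List[str]:
    """Insert the boundary marker after every word position t with delta_t = 1."""
    marks = set(boundaries)
    out: List[str] = []
    for t, word in enumerate(words, start=1):
        out.append(word)
        if t in marks:
            out.append(eos)  # sentence boundary after word x_t
    return out


# Example of FIG. 4: delta_5 = 1.
d1 = ["i", "'d", "like", "some", "strawberries", "how", "much", "does"]
d2 = insert_eos(d1, [5])
print(" ".join(d2))
```

For the example of FIG. 4, the marker is placed after "strawberries", so a downstream machine translation step can treat the words before the marker as one sentence.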

機械翻訳部ＭＴ１は、文章分割装置１００から出力されるデータＤ２に対して、機械翻訳処理を実行し、機械翻訳処理後のデータＤｏｕｔを取得する。このとき、文章分割装置１００から出力されるデータＤ２は、文章の区切り（文章境界）が明示されているデータであるため、機械翻訳部ＭＴ１は、機械翻訳対象とする文章を適切に取得することができる。つまり、機械翻訳部ＭＴ１は、文章単位に機械翻訳処理を実行することができる。例えば、図４の場合、δ_５＝１であり、「ｓｔｒａｗｂｅｒｒｉｅｓ」の後に文章境界があると判定することができるので、機械翻訳部ＭＴ１は、「ｉ‘ｄｌｉｋｅｓｏｍｅｓｔｒａｗｂｅｒｒｉｅｓ」を一文と判定した上で翻訳文を出力し、次の一文が「ｈｏｗ」から始まることを適切に把握することができる。 The machine translation unit MT1 performs machine translation processing on the data D2 output from the text segmentation device 100 and obtains the data Dout resulting from the machine translation processing. Because the data D2 output from the text segmentation device 100 explicitly indicates the sentence delimiters (sentence boundaries), the machine translation unit MT1 can appropriately obtain the sentences to be machine-translated. That is, the machine translation unit MT1 can execute machine translation processing sentence by sentence. For example, in the case of FIG. 4, δ_5 = 1 and it can be determined that there is a sentence boundary after "strawberries", so the machine translation unit MT1 can treat "i'd like some strawberries" as one sentence and output its translation, and can appropriately recognize that the next sentence starts with "how".

したがって、機械翻訳部ＭＴ１は、文章境界が適切に判断された文章単位に機械翻訳を行うことができ、その結果、精度の高い機械翻訳結果を取得することができる。機械翻訳部ＭＴ１により取得された機械翻訳結果データは、データＤｏｕｔとして出力される。 Therefore, the machine translation unit MT1 can perform machine translation on sentence units whose sentence boundaries are appropriately determined, and as a result, can obtain a highly accurate machine translation result. The machine translation result data acquired by the machine translation unit MT1 is output as data Dout.

以上のように、同時通訳システム１０００では、単語データを入力とし、文章境界が存在する位置および確率を示すベクトルを出力とするニューラルネットワークによるモデルを用いて学習処理を行い、学習済みモデルを取得する。そして、同時通訳システム１０００では、上記の学習済みモデルを用いて、文章境界を検出する処理を行う。同時通訳システム１０００では、閾値ベクトルθを導入し、現時刻ｔに時間的に近い方から順番に、ニューラルネットワーク部２２に入力された単語の後に文章境界があるか否かを判定する。そして、同時通訳システム１０００では、閾値ベクトルθを用いて比較処理により、文章境界を検出したら即座に文章境界検出処理を終了させるとともに、平均してユーザの音声入力開始から文章境界検出までの時間(遅延時間)が短いため、リアルタイムで文章分割処理を実行することができる。 As described above, the simultaneous interpretation system 1000 performs learning processing using a neural network model that takes word data as input and outputs a vector indicating the positions and probabilities of sentence boundaries, and thereby obtains a trained model. The simultaneous interpretation system 1000 then uses this trained model to detect sentence boundaries. The simultaneous interpretation system 1000 introduces a threshold vector θ and determines whether there is a sentence boundary after each word input to the neural network unit 22, in order starting from the word closest in time to the current time t. When a sentence boundary is detected by the comparison processing using the threshold vector θ, the simultaneous interpretation system 1000 immediately ends the sentence boundary detection processing. In addition, because the time from the start of the user's speech input to sentence boundary detection (the delay time) is short on average, sentence segmentation processing can be executed in real time.

また、同時通訳システム１０００では、単語データを入力とし、文章境界が存在する位置および確率を示すベクトルを出力とするニューラルネットワークにおいて、ＲＮＮを用いているため、長い文章の依存性も考慮することができ、文章を構成する単語数に関係なく、精度の高い文章境界検出処理、センテンス・セグメンテーション処理を実行することができる。 In addition, since the simultaneous interpretation system 1000 uses RNN in a neural network that receives word data as input and outputs vectors that indicate the positions and probabilities of sentence boundaries, it is possible to consider the dependency of long sentences. Highly accurate sentence boundary detection processing and sentence segmentation processing can be executed regardless of the number of words constituting a sentence.

また、同時通訳システム１０００では、上記の通り、精度の高い文章境界検出処理、センテンス・セグメンテーション処理を実行できるので、文章境界を明示したデータを機械翻訳部ＭＴ１に入力し、機械翻訳部ＭＴ１が当該データに対して、機械翻訳処理を実行することで、精度の高い機械翻訳処理結果をリアルタイムで取得することができる。そして、このようにして取得した機械翻訳結果を、例えば、ディスプレイ等にテキストデータとして表示することで、同時通訳処理（リアルタイム通訳処理）を実行することができる。また、同時通訳システム１０００において、上記のようにして取得した機械翻訳結果を、例えば、音声合成処理部により、音声合成処理を行い、機械翻訳結果に対応する合成音声を出力することで、同時通訳処理（リアルタイム通訳処理）を実行することができる。 As described above, the simultaneous interpretation system 1000 can execute highly accurate sentence boundary detection processing and sentence segmentation processing. By executing machine translation processing on the data, highly accurate machine translation processing results can be obtained in real time. Simultaneous interpretation processing (real-time interpretation processing) can be executed by displaying the obtained machine translation result as text data on a display or the like, for example. Further, in the simultaneous interpretation system 1000, the machine translation result obtained as described above is subjected to speech synthesis processing by, for example, a speech synthesis processing unit, and a synthesized speech corresponding to the machine translation result is output. Processing (real-time interpretation processing) can be performed.

≪変形例≫
次に、第１実施形態の変形例について、説明する。 <<Modification>>
Next, a modified example of the first embodiment will be described.

本変形例の同時通訳システムでは、閾値ベクトル設定部１において、センテンス・セグメンテーションの正確さ（ａｃｃｕｒａｃｙ）と遅延時間（ｌａｔｅｎｃｙ）とを考慮した評価値を導入し、当該評価値に基づいて、閾値ベクトルθをチューニングする方法について、説明する。 In the simultaneous interpretation system of this modified example, the threshold vector setting unit 1 introduces an evaluation value that considers the accuracy and latency of sentence segmentation, and based on the evaluation value, sets the threshold vector A method for tuning θ will be described.

まず、値Ｆ_１（Ｆ値）を
Ｆ_１＝２×Ｐｒｅｃｉｓｉｏｎ×Ｒｅｃａｌｌ／（Ｐｒｅｃｉｓｉｏｎ＋Ｒｅｃａｌｌ）
Ｐｒｅｃｉｓｉｏｎ：正しいと予測したデータのうち、実際に正しいデータであった割合
Ｒｅｃａｌｌ：実際に正しいデータであるもののうち、正しいと予測されたデータの割合
とする。 First, the value F_1 (F-measure) is defined as
F_1 = 2 × Precision × Recall / (Precision + Recall)
Precision: the proportion of the data predicted to be correct that is actually correct
Recall: the proportion of the actually correct data that is predicted to be correct

そして、評価値ｓｃｏｒｅを
ｓｃｏｒｅ＝Ｆ_１－α×ｌａｔｅｎｃｙ
ｌａｔｅｎｃｙ：遅延時間（遅延量）
とする。 The evaluation value score is then defined as
score = F_1 − α × latency
latency: delay time (amount of delay)

なお、αは係数であり、例えば、α＝０．０１である。 Note that α is a coefficient, and for example, α=0.01.
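The evaluation value can be computed as in the following sketch. The function names f1_score and evaluation_score are hypothetical, and the specification does not state the unit of latency here, so the latency value in the example (an average delay expressed as a number of words) is an assumption made purely for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """F_1 = 2 x Precision x Recall / (Precision + Recall); 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluation_score(
    precision: float, recall: float, latency: float, alpha: float = 0.01
) -> float:
    """score = F_1 - alpha x latency, with alpha = 0.01 as in the example above."""
    return f1_score(precision, recall) - alpha * latency


# Hypothetical measurements on a tuning data set.
s = evaluation_score(precision=0.8, recall=0.8, latency=3.0)
print(round(s, 4))
```

A larger α penalizes delay more strongly, so candidate threshold vectors that detect boundaries earlier are favored even at some cost in F_1.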

そして、閾値ベクトルθをチューニングするために用いるデータセットにおいて、上記評価値ｓｃｏｒｅを最大にするように、貪欲法（ＧｒｅｅｄｙＡｌｇｏｒｉｔｈｍ）を用いたサーチを行う。例えば、図６に疑似コードを示したアルゴリズムにより、閾値ベクトルθをチューニングする。この手法では、その親データの評価値ｓｃｏｒｅが大きな値となる閾値ベクトルの優先順位を高くする。そして、上記手法では、θ^＜ｋ＞が降順となるようにし、ヒューリスティック手法により探索空間を刈り取る（取り除く）ことで、閾値ベクトルθをチューニングする。 Then, in the data set used for tuning the threshold vector θ, a search is performed using a greedy algorithm so as to maximize the evaluation value score. For example, the algorithm shown in pseudocode in FIG. 6 tunes the threshold vector θ. In this method, a threshold vector having a large evaluation value score of its parent data is given a high priority. Then, in the above method, θ ^<k> is arranged in descending order, and the threshold vector θ is tuned by pruning (removing) the search space using a heuristic method.

上記によりチューニングされた閾値ベクトルθは、センテンス・セグメンテーションの正確さ（ａｃｃｕｒａｃｙ）と遅延時間（ｌａｔｅｎｃｙ）とを考慮した評価値に基づいて、最適化されているため、センテンス・セグメンテーションを実行するときの閾値処理に用いる閾値ベクトルθとして適切である。 The threshold vector θ tuned by the above is optimized based on the evaluation value considering sentence segmentation accuracy and latency, so when executing sentence segmentation It is appropriate as the threshold vector θ used for threshold processing.

つまり、本変形例の同時通訳システムでは、上記のようにしてチューニングされた閾値ベクトルθを用いて、文章境界検出処理、センテンス・セグメンテーション処理を実行することで、高精度かつ低遅延の文章境界検出処理、センテンス・セグメンテーション処理を実現することができる。 In other words, in the simultaneous interpretation system of this modified example, the threshold vector θ tuned as described above is used to execute the sentence boundary detection process and the sentence segmentation process, thereby achieving sentence boundary detection with high accuracy and low delay. processing, sentence segmentation processing can be implemented.

［第２実施形態］
次に、第２実施形態について、説明する。 [Second embodiment]
Next, a second embodiment will be described.

なお、上記実施形態と同様の部分については、同一符号を付し、詳細な説明を省略する。 Parts similar to those of the above-described embodiment are given the same reference numerals, and detailed description thereof is omitted.

図７は、第２実施形態に係る話者識別システム２０００の概略構成図である。 FIG. 7 is a schematic configuration diagram of a speaker identification system 2000 according to the second embodiment.

図８は、第２実施形態に係る話者識別システム２０００のデータ分割装置１００Ａのデータ分割部２Ａの概略構成図である。 FIG. 8 is a schematic configuration diagram of the data dividing section 2A of the data dividing device 100A of the speaker identification system 2000 according to the second embodiment.

第２実施形態では、第１実施形態の同時通訳システム１０００の文章分割装置１００に類似する構成を有するデータ分割装置１００Ａを用いて、話者識別システム２０００を構築し、話者識別処理を実現させる方法について説明する。 In the second embodiment, a data dividing device 100A having a configuration similar to the sentence dividing device 100 of the simultaneous interpretation system 1000 of the first embodiment is used to construct a speaker identification system 2000 to implement speaker identification processing. I will explain how.

話者識別システム２０００は、図７に示すように、第１実施形態の同時通訳システム１０００において、音声認識部Ａｕｄ１を音声特徴量取得部Ｐｒｅ１に置換し、文章分割装置１００をデータ分割装置１００Ａに置換し、機械翻訳部ＭＴ１を話者識別部Ｐｏｓｔ１に置換した構成を有している。 As shown in FIG. 7, in the simultaneous interpretation system 1000 of the first embodiment, the speaker identification system 2000 replaces the speech recognition unit Aud1 with the speech feature acquisition unit Pre1, and replaces the sentence segmentation device 100 with the data segmentation device 100A. In this configuration, the machine translation unit MT1 is replaced with the speaker identification unit Post1.

音声特徴量取得部Ｐｒｅ１は、入力データＤｉｎ（例えば、音声データ）から時間的に連続した音声特徴量を取得し、取得した音声特徴量を含むデータをデータＤ１Ａとしてデータ分割装置１００Ａに出力する。 The speech feature acquisition unit Pre1 acquires temporally continuous speech features from the input data Din (for example, speech data), and outputs data including the acquired speech features as data D1A to the data division device 100A.

データ分割装置１００Ａは、図７に示すように、閾値ベクトル設定部１と、データ分割部２Ａとを備える。 As shown in FIG. 7, the data dividing device 100A includes a threshold vector setting section 1 and a data dividing section 2A.

閾値ベクトル設定部１は、第１実施形態の閾値ベクトル設定部１と同様の構成を有している。 The threshold vector setting unit 1 has the same configuration as the threshold vector setting unit 1 of the first embodiment.

データ分割部２Ａは、図８に示すように、単位データ取得部２１Ａと、ニューラルネットワーク部２２と、境界検出部２３Ａと、分割データ取得部２４Ａとを備える。 As shown in FIG. 8, the data division section 2A includes a unit data acquisition section 21A, a neural network section 22, a boundary detection section 23A, and a division data acquisition section 24A.

単位データ取得部２１Ａは、第１実施形態の単語取得部２１と類似の処理を実行する機能部であり、入力されるデータＤ１Ａ（例えば、時間的に連続した音声特徴量のデータ）から、ニューラルネットワークでの処理単位となるデータ（単位データ）を取得し、取得したデータをデータｘ_ｔとしてニューラルネットワーク部２２に出力する。 The unit data acquisition unit 21A is a functional unit that executes processing similar to that of the word acquisition unit 21 of the first embodiment. From the input data D1A (for example, temporally continuous speech feature data), it acquires data (unit data) that serves as the processing unit of the neural network, and outputs the acquired data to the neural network unit 22 as data x_t.

ニューラルネットワーク部２２では、学習処理において、第１実施形態と同様の処理が実行される。なお、第１実施形態のニューラルネットワーク部２２では、センテンスの区切り位置において教師データｙ_ｔのｙ_ｔ ^＜ｋ＞の値を「１」として学習処理を実行したが、本実施形態のニューラルネットワーク部２２では、話者が切り替わった時刻を区切り位置として当該区切り位置において教師データｙ_ｔのｙ_ｔ ^＜ｋ＞の値を「１」として学習処理を実行する。 In the neural network unit 22, processing similar to that of the first embodiment is executed in the learning processing. In addition, in the neural network unit 22 of the first embodiment, the learning process is executed with the value of y _t ^<k> of the teacher data y _t set to “1” at the sentence delimiter position, but the neural network unit 22 of the present embodiment Now, learning processing is executed with the time at which the speaker is switched as a delimiter position, and the value of y _t ^<k> of the teacher data y _t is set to “1” at the delimiter position.

ニューラルネットワーク部２２は、予測処理において、第１実施形態のニューラルネットワーク部２２と同様の処理を実行する。 The neural network unit 22 performs the same processing as the neural network unit 22 of the first embodiment in the prediction processing.

境界検出部２３Ａは、第１実施形態の文章境界検出部２３と同様の構成を有しており、ニューラルネットワーク部２２から出力されるデータｙ_ｔと、閾値ベクトル設定部１から出力される閾値ベクトルθとに基づいて、第１実施形態と同様の処理により、データ境界（話者が切り替わったタイミング）を検出する。そして、境界検出部２３Ａは、検出結果データδ_ｔを分割データ取得部２４Ａに出力する。 The boundary detection unit 23A has the same configuration as the sentence boundary detection unit 23 of the first embodiment. Based on the data y_t output from the neural network unit 22 and the threshold vector θ output from the threshold vector setting unit 1, it detects data boundaries (the timing at which the speaker switches) by the same processing as in the first embodiment. The boundary detection unit 23A then outputs the detection result data δ_t to the divided data acquisition unit 24A.

分割データ取得部２４Ａは、第１実施形態のセンテンス取得部２４と同様の構成を有しており、境界検出部２３Ａにより取得された検出結果データδ_ｔに基づいて、データＤ１Ａを話者ごとのデータに分割する。例えば、分割データ取得部２４Ａは、検出結果データδ_ｔに基づいて、データＤ１Ａの境界（話者が変わったタイミング）を示すデータ（例えば、特別な記号）を付与したデータをデータＤ２Ａとして取得し、取得したデータＤ２Ａを話者識別部Ｐｏｓｔ１に出力する。 The divided data acquisition unit 24A has the same configuration as the sentence acquisition unit 24 of the first embodiment, and divides the data D1A into data for each speaker based on the detection result data δ_t acquired by the boundary detection unit 23A. For example, based on the detection result data δ_t, the divided data acquisition unit 24A acquires, as data D2A, the data D1A with data (for example, a special symbol) indicating its boundaries (the timing at which the speaker changes) added, and outputs the acquired data D2A to the speaker identification unit Post1.

話者識別部Ｐｏｓｔ１は、データ分割部２Ａから出力されるデータＤ２Ａを入力する。話者識別部Ｐｏｓｔ１は、データＤ２ＡからデータＤ１Ａの境界（話者が変わったタイミング）を特定し、話者を識別する処理を実行し、当該識別処理の結果を含むデータＤｏｕｔを出力する。 The speaker identification unit Post1 receives the data D2A output from the data dividing section 2A. From the data D2A, the speaker identification unit Post1 identifies the boundaries of the data D1A (the timing at which the speaker changes), executes processing for identifying the speaker, and outputs data Dout including the result of the identification processing.

上記により、話者識別システム２０００では、時間的に連続するデータから、特定の区切り（本実施形態では、話者の変更）を検出する処理を実行することができる。そして、話者識別システム２０００では、第１実施形態と同様の手法を用いているので、検出結果を取得するまでの時間が短く、その結果、リアルタイムで検出処理を実行することができる。 As described above, the speaker identification system 2000 can execute a process of detecting a specific delimiter (change of speaker in this embodiment) from temporally continuous data. Since the speaker identification system 2000 uses the same method as in the first embodiment, it takes a short time to obtain the detection result, and as a result, the detection process can be executed in real time.

［第３実施形態］
次に、第３実施形態について、説明する。 [Third Embodiment]
Next, a third embodiment will be described.

図９は、第３実施形態に係る映像識別システム３０００の概略構成図である。 FIG. 9 is a schematic configuration diagram of a video identification system 3000 according to the third embodiment.

図１０は、第３実施形態に係る映像識別システム３０００のデータ分割装置１００Ｂのデータ分割部２Ｂの概略構成図である。 FIG. 10 is a schematic configuration diagram of the data dividing section 2B of the data dividing device 100B of the video identification system 3000 according to the third embodiment.

第３実施形態では、第２実施形態の話者識別システム２０００のデータ分割装置１００Ａと同様の構成を有するデータ分割装置１００Ｂを用いて、映像識別システム３０００を構築し、映像識別処理（シーンチェンジの検出処理）を実現させる方法について説明する。 In the third embodiment, a video identification system 3000 is constructed using a data division device 100B having the same configuration as the data division device 100A of the speaker identification system 2000 of the second embodiment, and video identification processing (scene change processing) is performed. detection processing) will be described.

映像識別システム３０００は、図９に示すように、第２実施形態の話者識別システム２０００において、音声特徴量取得部Ｐｒｅ１を映像データ取得部Ｐｒｅ２に置換し、データ分割装置１００Ａをデータ分割装置１００Ｂに置換し、話者識別部Ｐｏｓｔ１をシーンチェンジ検出部Ｐｏｓｔ２に置換した構成を有している。 As shown in FIG. 9, the video identification system 3000 replaces the audio feature acquisition unit Pre1 with the video data acquisition unit Pre2 in the speaker identification system 2000 of the second embodiment, and replaces the data division device 100A with the data division device 100B. , and the speaker identification section Post1 is replaced with the scene change detection section Post2.

映像データ取得部Ｐｒｅ２は、入力データＤｉｎ（例えば、映像データ、あるいは、映像データをＣＮＮ（ＣｏｎｖｏｌｕｔｉｏｎａｌＮｅｕｒａｌＮｅｔｗｏｒｋ）により圧縮して取得したデータ）から時間的に連続したデータを取得し、取得したデータをデータＤ１Ｂとしてデータ分割装置１００Ｂに出力する。 The video data acquisition unit Pre2 acquires temporally continuous data from input data Din (for example, video data or data acquired by compressing video data by a CNN (Convolutional Neural Network)), and converts the acquired data into It is output to the data dividing device 100B as data D1B.

データ分割装置１００Ｂは、図９に示すように、閾値ベクトル設定部１と、データ分割部２Ｂとを備える。 As shown in FIG. 9, the data dividing device 100B includes a threshold vector setting section 1 and a data dividing section 2B.

データ分割部２Ｂは、図１０に示すように、単位データ取得部２１Ｂと、ニューラルネットワーク部２２と、境界検出部２３Ｂと、分割データ取得部２４Ｂとを備える。 As shown in FIG. 10, the data division unit 2B includes a unit data acquisition unit 21B, a neural network unit 22, a boundary detection unit 23B, and a division data acquisition unit 24B.

単位データ取得部２１Ｂは、第１実施形態の単語取得部２１と類似の処理を実行する機能部であり、入力されるデータＤ１Ｂ（例えば、時間的に連続した映像のデータ、あるいは、時間的に連続した映像のデータのＣＮＮによる圧縮後のデータ）から、ニューラルネットワークでの処理単位となるデータ（単位データ）を取得し、取得したデータをデータｘ_ｔとしてニューラルネットワーク部２２に出力する。 The unit data acquisition unit 21B is a functional unit that executes processing similar to that of the word acquisition unit 21 of the first embodiment. From the input data D1B (for example, temporally continuous video data, or such video data after compression by a CNN), it acquires data (unit data) that serves as the processing unit of the neural network, and outputs the acquired data to the neural network unit 22 as data x_t.

In the learning process, the neural network unit 22 performs the same processing as in the first embodiment. Whereas the neural network unit 22 of the first embodiment was trained with the value y_t^<k> of the teacher data y_t set to "1" at sentence boundary positions, the neural network unit 22 of the present embodiment treats the time at which the scene switches (the scene change time) as the boundary position and is trained with the value y_t^<k> of the teacher data y_t set to "1" at that boundary position.
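One possible way to build such teacher data is sketched below. The indexing convention is an assumption, since the first embodiment's definition is not restated here: element k (1 ≤ k ≤ m) is taken to mean "a boundary lies right after unit t-k", the (m+1)-th element means "no boundary within the previous m units", and only the nearest boundary is marked. The name `make_teacher_vector` is hypothetical.

```python
from typing import List, Set


def make_teacher_vector(t: int, boundaries: Set[int], m: int) -> List[int]:
    """Build an (m+1)-element teacher vector y_t for time step t.

    Assumed convention (not confirmed by the text): element k (1-based,
    k <= m) is 1 when a boundary lies right after unit t-k; the last
    element is 1 when no boundary is found within the previous m units.
    """
    y = [0] * (m + 1)
    for k in range(1, m + 1):
        if (t - k) in boundaries:
            y[k - 1] = 1  # nearest boundary wins in this sketch
            return y
    y[m] = 1  # "no boundary" element
    return y


scene_changes = {3}  # hypothetical: a scene change right after unit 3
labels = [make_teacher_vector(t, scene_changes, m=2) for t in range(3, 7)]
```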

The boundary detection unit 23B has the same configuration as the sentence boundary detection unit 23 of the first embodiment. Based on the data y_t output from the neural network unit 22 and the threshold vector θ output from the threshold vector setting unit 1, it detects data boundaries (the timing at which the scene switches) by the same processing as in the first embodiment. The boundary detection unit 23B then outputs the detection result data δ_t to the divided data acquisition unit 24B.
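A minimal sketch of the threshold comparison follows, based on the element-wise comparison and early stop stated in the claims. Two details are assumptions: that index k = 1 corresponds to the candidate boundary closest to the current time, and that the comparison is a strict "greater than". The name `detect_boundary` is hypothetical.

```python
from typing import List, Optional


def detect_boundary(y_t: List[float], theta: List[float]) -> Optional[int]:
    """Return the first k (1-based) with y_t^<k> > theta^<k>, else None.

    y_t has m+1 elements; only the first m are compared with the
    m-dimensional threshold vector theta.
    """
    m = len(theta)
    if len(y_t) != m + 1:
        raise ValueError("y_t must have exactly one more element than theta")
    for k in range(1, m + 1):  # assumed: k = 1 is the nearest candidate
        if y_t[k - 1] > theta[k - 1]:
            return k  # stop here: the remaining elements are not compared
    return None


delta_t = detect_boundary([0.2, 0.7, 0.1], [0.5, 0.6])
```

Returning on the first element that exceeds its threshold means a decision can be emitted without evaluating the remaining candidates.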

The divided data acquisition unit 24B has the same configuration as the sentence acquisition unit 24 of the first embodiment and divides the data D1B into data for each scene based on the detection result data δ_t acquired by the boundary detection unit 23B. For example, based on the detection result data δ_t, the divided data acquisition unit 24B acquires, as data D2B, data to which data (for example, a special symbol) indicating the boundaries of the data D1B (the timing at which the scene switches) has been added, and outputs the acquired data D2B to the scene change detection unit Post2.
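Adding a special symbol at each detected boundary could look like the sketch below. The marker string `"<b>"`, the function name `mark_boundaries`, and the representation of boundaries as a set of unit indices are assumptions for illustration only.

```python
from typing import List, Sequence, Set

BOUNDARY = "<b>"  # hypothetical special symbol marking a boundary


def mark_boundaries(units: Sequence[str], boundary_after: Set[int]) -> List[str]:
    """Copy `units`, inserting BOUNDARY after every index in `boundary_after`."""
    marked: List[str] = []
    for index, unit in enumerate(units):
        marked.append(unit)
        if index in boundary_after:
            marked.append(BOUNDARY)
    return marked


d2b = mark_boundaries(["f0", "f1", "f2", "f3"], {1})
```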

The scene change detection unit Post2 receives the data D2B output from the data division unit 2B. From the data D2B, the scene change detection unit Post2 identifies the boundaries of the data D1B (the timing at which the scene switches), performs processing to detect scene changes, and outputs data Dout containing the detection result.
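On the receiving side, recovering the boundaries from marked data might be sketched as follows. The marker `"<b>"` and the function name `split_on_boundaries` are hypothetical, and the text does not prescribe this representation of D2B.

```python
from typing import List, Sequence


def split_on_boundaries(d2b: Sequence[str], marker: str = "<b>") -> List[List[str]]:
    """Split a marked sequence into segments, dropping the marker itself."""
    segments: List[List[str]] = []
    current: List[str] = []
    for item in d2b:
        if item == marker:
            segments.append(current)  # a boundary closes the current segment
            current = []
        else:
            current.append(item)
    segments.append(current)  # the final (possibly still open) segment
    return segments


scenes = split_on_boundaries(["f0", "f1", "<b>", "f2", "f3"])
```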

As described above, the video identification system 3000 can perform processing to detect a specific delimiter (in this embodiment, a scene change) in temporally continuous data. Because the video identification system 3000 uses the same method as the first embodiment, the time required to obtain a detection result is short, and the detection processing can therefore be performed in real time.

<<First Modification>>
Next, a first modification of the third embodiment will be described.

FIG. 11 is a schematic configuration diagram of a video identification system 3000A according to the first modification of the third embodiment.

The first modification of the third embodiment describes a method of constructing a video identification system 3000A using a data division device 100B having the same configuration as the data division device 100A of the speaker identification system 2000 of the second embodiment, and of realizing video identification processing (person detection processing).

As shown in FIG. 11, the video identification system 3000A has the configuration of the video identification system 3000 of the third embodiment in which the video data acquisition unit Pre2 is replaced with the video data acquisition unit Pre3 and the scene change detection unit Post2 is replaced with the person detection unit Post3.

The video data acquisition unit Pre3 acquires temporally continuous data from the input data Din (for example, video data, or data obtained by compressing video data with a CNN (Convolutional Neural Network)), and outputs the acquired data to the data division device 100B as data D1B.

The data division device 100B performs the same processing as in the third embodiment.

The unit data acquisition unit 21B is a functional unit that performs processing similar to that of the word acquisition unit 21 of the first embodiment. From the input data D1B (for example, temporally continuous video data, or such data after compression by a CNN), it acquires the data that serves as the processing unit of the neural network (unit data) and outputs the acquired data to the neural network unit 22 as data x_t.

In the learning process, the neural network unit 22 performs the same processing as in the first embodiment. Whereas the neural network unit 22 of the first embodiment was trained with the value y_t^<k> of the teacher data y_t set to "1" at sentence boundary positions, the neural network unit 22 of this modification treats the time at which a person is detected (the time at which a person starts to appear in the video, or the time at which a person disappears from the video) as the boundary position and is trained with the value y_t^<k> of the teacher data y_t set to "1" at that boundary position.
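If per-frame annotations of whether a person is visible are available, the boundary times used for training could be derived as in the sketch below. The boolean `present` list and the function name `presence_transitions` are assumptions; the text does not say how the annotations are stored.

```python
from typing import List


def presence_transitions(present: List[bool]) -> List[int]:
    """Return frame indices t at which presence differs from frame t-1.

    A False -> True change is a person starting to appear; a True -> False
    change is a person disappearing. Both count as boundary positions.
    """
    return [t for t in range(1, len(present)) if present[t] != present[t - 1]]


boundary_times = presence_transitions([False, False, True, True, False])
```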

The boundary detection unit 23B has the same configuration as the sentence boundary detection unit 23 of the first embodiment. Based on the data y_t output from the neural network unit 22 and the threshold vector θ output from the threshold vector setting unit 1, it detects data boundaries (the time at which a person starts to appear in the video, or the time at which a person disappears from the video) by the same processing as in the first embodiment. The boundary detection unit 23B then outputs the detection result data δ_t to the divided data acquisition unit 24B.

The divided data acquisition unit 24B has the same configuration as the sentence acquisition unit 24 of the first embodiment and divides the data D1B into data for each scene based on the detection result data δ_t acquired by the boundary detection unit 23B. For example, based on the detection result data δ_t, the divided data acquisition unit 24B acquires, as data D2B, data to which data (for example, a special symbol) indicating the boundaries of the data D1B (the time at which a person starts to appear in the video, or the time at which a person disappears from the video) has been added, and outputs the acquired data D2B to the person detection unit Post3.

The person detection unit Post3 receives the data D2B output from the data division unit 2B. From the data D2B, the person detection unit Post3 identifies the boundaries of the data D1B (the time at which a person starts to appear in the video, or the time at which a person disappears from the video), performs processing to detect a person, and outputs data Dout containing the detection result.

As described above, the video identification system 3000A can perform processing to detect a specific delimiter (in this modification, the time at which a person starts to appear in the video, or the time at which a person disappears from the video) in temporally continuous data. Because the video identification system 3000A uses the same method as the first embodiment, the time required to obtain a detection result is short, and the detection processing can therefore be performed in real time.

<<Second Modification>>
Next, a second modification of the third embodiment will be described.

FIG. 12 is a schematic configuration diagram of a video identification system 3000B according to the second modification of the third embodiment.

The second modification of the third embodiment describes a method of constructing a video identification system 3000B using a data division device 100B having the same configuration as the data division device 100A of the speaker identification system 2000 of the second embodiment, and of realizing video identification processing (criminal act detection processing).

As shown in FIG. 12, the video identification system 3000B has the configuration of the video identification system 3000 of the third embodiment in which the video data acquisition unit Pre2 is replaced with the video data acquisition unit Pre4 and the scene change detection unit Post2 is replaced with the criminal act detection unit Post4.

The video data acquisition unit Pre4 acquires temporally continuous data from the input data Din (for example, video data, or data obtained by compressing video data with a CNN (Convolutional Neural Network)), and outputs the acquired data to the data division device 100B as data D1B.

In the learning process, the neural network unit 22 performs the same processing as in the first embodiment. Whereas the neural network unit 22 of the first embodiment was trained with the value y_t^<k> of the teacher data y_t set to "1" at sentence boundary positions, the neural network unit 22 of this modification treats the time at which a person is detected (the start time of a criminal act, or the end time of a criminal act) as the boundary position and is trained with the value y_t^<k> of the teacher data y_t set to "1" at that boundary position.
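When annotated acts are given as (start, end) intervals, the corresponding boundary times could be collected as sketched below. The interval representation and the name `interval_boundaries` are assumptions made for illustration; the text only states that start and end times serve as boundary positions.

```python
from typing import List, Tuple


def interval_boundaries(intervals: List[Tuple[int, int]]) -> List[int]:
    """Collect the start and end times of annotated intervals as boundaries."""
    times = set()
    for start, end in intervals:
        if end < start:
            raise ValueError(f"interval ends before it starts: {(start, end)}")
        times.add(start)  # start time of the act
        times.add(end)    # end time of the act
    return sorted(times)


boundaries = interval_boundaries([(40, 42), (10, 25)])
```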

The divided data acquisition unit 24B has the same configuration as the sentence acquisition unit 24 of the first embodiment and divides the data D1B into data for each scene based on the detection result data δ_t acquired by the boundary detection unit 23B. For example, based on the detection result data δ_t, the divided data acquisition unit 24B acquires, as data D2B, data to which data (for example, a special symbol) indicating the boundaries of the data D1B (the start time of a criminal act, or the end time of a criminal act) has been added, and outputs the acquired data D2B to the criminal act detection unit Post4.

The criminal act detection unit Post4 receives the data D2B output from the data division unit 2B. From the data D2B, the criminal act detection unit Post4 identifies the boundaries of the data D1B (the start time of a criminal act, or the end time of a criminal act), performs processing to detect a criminal act, and outputs data Dout containing the detection result.

As described above, the video identification system 3000B can perform processing to detect a specific delimiter (in this modification, the start time of a criminal act, or the end time of a criminal act) in temporally continuous data. Because the video identification system 3000B uses the same method as the first embodiment, the time required to obtain a detection result is short, and the detection processing can therefore be performed in real time.

[Other embodiments]
Each functional unit of the simultaneous interpretation system described in the above embodiments (including the modifications) may be implemented by a single device (system) or by a plurality of devices.

Although the above embodiments describe the case where the input language is English, the input language is not limited to English and may be another language. That is, in the simultaneous interpretation system of the above embodiments (including the modifications), the source language and the target language may be any languages.

In the simultaneous interpretation system 1000 described in the above embodiments, each block may be individually implemented as a single chip by a semiconductor device such as an LSI, or some or all of the blocks may be integrated into a single chip.

Although the term LSI is used here, the device may also be called an IC, a system LSI, a super LSI, or an ultra LSI depending on the degree of integration.

The method of circuit integration is not limited to LSI; a dedicated circuit or a general-purpose processor may be used instead. An FPGA (Field Programmable Gate Array), which can be programmed after the LSI is manufactured, or a reconfigurable processor, in which the connections and settings of circuit cells inside the LSI can be reconfigured, may also be used.

Part or all of the processing of each functional block in each of the above embodiments may be implemented by a program. In that case, part or all of the processing of each functional block is performed by a central processing unit (CPU) in a computer. The program for each process is stored in a storage device such as a hard disk or ROM, and is executed in the ROM or after being read out to a RAM.

Each process of the above embodiments may be implemented in hardware or in software (including cases where it is implemented together with an OS (operating system), middleware, or a predetermined library). It may also be implemented by a combination of software and hardware.

For example, when each functional unit of the above embodiments is implemented in software, each functional unit may be implemented by software processing using the hardware configuration shown in FIG. 13 (for example, a hardware configuration in which a CPU, a GPU, a ROM, a RAM, an input unit, an output unit, a communication unit, a storage unit (for example, a storage unit implemented by an HDD, an SSD, or the like), an external media drive, and the like are connected by a bus Bus).

When each functional unit of the above embodiments is implemented in software, the software may be run on a single computer having the hardware configuration shown in FIG. 13, or may be run by distributed processing across a plurality of computers.

The execution order of the processing methods in the above embodiments is not necessarily limited to the order described, and may be changed without departing from the gist of the invention.

A computer program that causes a computer to execute the method described above, and a computer-readable recording medium on which that program is recorded, are included in the scope of the present invention. Examples of the computer-readable recording medium include a flexible disk, a hard disk, a CD-ROM, an MO, a DVD, a DVD-ROM, a DVD-RAM, a large-capacity DVD, a next-generation DVD, and a semiconductor memory.

The computer program is not limited to one recorded on the recording medium, and may be transmitted via a telecommunication line, a wireless or wired communication line, a network such as the Internet, or the like.

The specific configuration of the present invention is not limited to the above embodiments, and various changes and modifications are possible without departing from the gist of the invention.

According to the present invention, it is possible to realize a sentence segmentation method and a sentence segmentation device capable of performing sentence segmentation processing in real time regardless of the number of words constituting a sentence. The present invention is therefore useful in industries related to natural language processing and can be implemented in that field.

1000 Simultaneous interpretation system
100 Sentence segmentation device (data segmentation device)
1 Threshold vector setting unit
2 Sentence division unit
21 Word acquisition unit
22 Neural network unit
23 Sentence boundary detection unit
24 Sentence acquisition unit

Claims

A data segmentation device comprising:
a neural network unit that receives unit data constituting sequence data, which is data continuous in time series, and outputs (m+1)-dimensional vector data consisting of m+1 elements in total, namely m elements (m: a natural number), each of which is data indicating a probability of being a boundary position of the sequence data, and one further element; and
a boundary detection unit that determines a boundary position of the sequence data based on the (m+1)-dimensional vector data output from the neural network unit.

The data segmentation device according to claim 1, wherein
the neural network unit includes a recurrent neural network.

The data segmentation device according to claim 1 or 2, further comprising a threshold vector setting unit that sets an m-dimensional threshold vector, wherein
the boundary detection unit determines the boundary position of the sequence data based on the m-dimensional threshold vector and the (m+1)-dimensional vector data.

The data segmentation device according to claim 3, wherein
the boundary detection unit determines the boundary position of the sequence data by comparing each of the m elements of the m-dimensional threshold vector with the corresponding one of the m elements of the (m+1)-dimensional vector data that are data indicating the probability of being a boundary position of the sequence data.

The data segmentation device according to claim 4, wherein
the boundary detection unit performs processing to determine a sentence boundary position by comparing the m elements of the m-dimensional threshold vector with the m elements of the (m+1)-dimensional vector data that are data indicating the probability of being a boundary position of the sequence data, in order of how close in time the boundary position to be detected is to the current time, and, once a sentence boundary position has been determined, does not perform the subsequent comparison processing.

The data segmentation device according to any one of claims 3 to 5, wherein,
when a value F_1 is defined as
F_1 = 2 × Precision × Recall / (Precision + Recall)
Precision: the proportion of the data predicted to be correct that was actually correct
Recall: the proportion of the actually correct data that was predicted to be correct
and an evaluation value score is defined as
score = F_1 - α × latency
latency: delay time (amount of delay)
α: a coefficient
the threshold vector setting unit sets the threshold vector θ such that the evaluation value score is greater than a predetermined value on the data set used for tuning the threshold vector θ.
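The evaluation value defined above can be computed directly, and a threshold vector can be selected by trying candidates on a tuning set. The sketch below is only one possible reading: the claim requires the score to exceed a predetermined value, whereas this sketch simply keeps the best-scoring candidate. The `evaluate` callback (returning precision, recall, and latency for a given θ), the candidate list, and all function names are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

Theta = Tuple[float, ...]


def f1_score(precision: float, recall: float) -> float:
    """F_1 = 2 * Precision * Recall / (Precision + Recall)."""
    if precision + recall == 0:
        return 0.0  # convention used here to avoid division by zero
    return 2 * precision * recall / (precision + recall)


def evaluation_score(precision: float, recall: float,
                     latency: float, alpha: float) -> float:
    """score = F_1 - alpha * latency."""
    return f1_score(precision, recall) - alpha * latency


def tune_theta(candidates: Iterable[Theta],
               evaluate: Callable[[Theta], Tuple[float, float, float]],
               alpha: float) -> Tuple[Optional[Theta], float]:
    """Return the candidate theta with the highest score on the tuning set."""
    best_theta: Optional[Theta] = None
    best_score = float("-inf")
    for theta in candidates:
        precision, recall, latency = evaluate(theta)
        score = evaluation_score(precision, recall, latency, alpha)
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta, best_score


# Hypothetical tuning results: theta -> (precision, recall, latency).
results = {(0.3,): (0.6, 0.9, 1.0), (0.5,): (0.8, 0.8, 2.0), (0.7,): (0.9, 0.5, 3.0)}
best_theta, best_score = tune_theta(results.keys(), results.__getitem__, alpha=0.01)
```

The coefficient α sets how much delay is tolerated in exchange for accuracy: a larger α favors threshold vectors that decide earlier even at some cost in F_1.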