JP7493398B2 - Conversion device, learning device, and program

Info

Publication number: JP7493398B2
Application number: JP2020115497A
Authority: JP
Inventors: 伶遠藤; 岳士梶山
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2020-07-03
Filing date: 2020-07-03
Publication date: 2024-05-31
Anticipated expiration: 2040-07-03
Also published as: JP2022013136A

Description

The present invention relates to a conversion device, a learning device, and a program.

Technology that automatically recognizes the content shown in video is expected to serve as a means of assisting human communication. As one example, technology that captures sign language with a camera or similar device and automatically recognizes the resulting video (images) is expected to be used for communication between hearing-impaired people and hearing people.

Non-Patent Document 1 describes research on automatically recognizing German Sign Language, one of the sign languages, and converting it into German. For example, Figure 2 of Non-Patent Document 1 shows the schematic configuration of a sign language translator that translates a sign language into a spoken language. The sign language translator shown in Figure 2 includes an encoder and a decoder, each of which uses a recurrent neural network (RNN). The encoder takes a sequence of frame images as input and generates feature vectors. The decoder takes the feature vectors generated by the encoder as input and generates a sequence of words.

Non-Patent Document 2 describes sign language recognition using deep learning.

Non-Patent Document 1: Necati Cihan Camgoz, Simon Hadfield, Oscar Koller, Hermann Ney, Richard Bowden, "Neural Sign Language Translation", in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Non-Patent Document 2: Takeshi Kajiyama, Rei Endo, Naoto Kato, Yoshihiko Kawai, Hiroyuki Kaneko, "Evaluation Experiment of Japanese Sign Language Recognition Using Deep Learning", 2019 Annual Conference of the Institute of Image Information and Television Engineers, Proceedings, 11B-2, The Institute of Image Information and Television Engineers, 2019.

Technology for recognizing the content of video captured by a camera (for example, human gestures) has been put to practical use in application areas where a contactless human-machine interface is desired. Such areas include, for example, food factories and medical settings, where hygiene must be taken into account. However, technology that automatically recognizes continuous, complex human movements, such as a sign language, and converts them into another language has not yet reached a practical level.

No practical examples have been reported for automatic recognition of Japanese Sign Language, one of the sign languages used in Japan, either.

Furthermore, when the input sign language video has not been segmented into words in advance, automatically segmenting the video into sign language words and recognizing those words is even more difficult.

The present invention has been made in view of the problems described above. It aims to provide a conversion device, a learning device, and a program that take input data (for example, a frame image sequence that has not been segmented into predetermined units, such as the boundaries of the words to be converted to) and accurately output a symbol string corresponding to that input data (for example, a word string in a predetermined language representation).

[1] In order to solve the above problem, a conversion device according to one aspect of the present invention includes: an encoder unit that generates state data based on an input image sequence; a statistical information decoder unit that generates, based on the state data, statistical information, which is information on statistics about a symbol string; a decoder unit that generates a symbol string based on the state data and the statistical information; a learning data supply unit that supplies a set of a learning image sequence that is the basis of an input to the encoder unit, a correct symbol string that is the correct answer for the symbol string corresponding to the learning image sequence, and correct statistical information that is the correct answer for the statistical information corresponding to the symbol string; a loss calculation unit that calculates a loss representing the difference between a learning estimated symbol string, which is the symbol string generated by the decoder unit based on the state data that the encoder unit generates from the learning image sequence, and the correct symbol string supplied by the learning data supply unit for the learning image sequence; a statistical information loss calculation unit that calculates a statistical information loss representing the difference between learning estimated statistical information, which is the statistical information generated by the statistical information decoder unit based on the state data that the encoder unit generates from the learning image sequence, and the correct statistical information supplied by the learning data supply unit for the learning image sequence; and a control unit that performs control so that a learning processing mode and an estimation processing mode are switched and executed as appropriate. In the learning processing mode, the decoder unit generates the symbol string based on either the estimated statistical information, which is the statistical information generated by the statistical information decoder unit, or the correct statistical information supplied by the learning data supply unit. Also in the learning processing mode, the internal parameters of the decoder unit and the internal parameters of the encoder unit are adjusted based on the loss calculated by the loss calculation unit, and the internal parameters of the statistical information decoder unit and the internal parameters of the encoder unit are adjusted based on the statistical information loss calculated by the statistical information loss calculation unit. In the estimation processing mode, the encoder unit generates state data based on an image sequence to be estimated, the statistical information decoder unit generates the statistical information based on the state data generated by the encoder unit, and the decoder unit generates the symbol string based on the state data and the statistical information.

[2] A conversion device according to one aspect of the present invention includes: an encoder unit that generates state data based on an input image sequence; a statistical information decoder unit that generates, based on the state data, statistical information, which is information on statistics about a symbol string; and a decoder unit that generates a symbol string based on the state data and the statistical information. The internal parameters of the encoder unit, the internal parameters of the statistical information decoder unit, and the internal parameters of the decoder unit have each been adjusted in advance by machine learning processing.

[3] In another aspect of the present invention, in the above conversion device, the image sequence is video representing a sign language, and the symbol string generated by the decoder unit is a string of symbols representing the gloss notation of the sign language.

[4] A learning device according to one aspect of the present invention includes: an encoder unit that generates state data based on an input image sequence; a statistical information decoder unit that generates, based on the state data, statistical information, which is information on statistics about a symbol string; a decoder unit that generates a symbol string based on the state data and the statistical information; a learning data supply unit that supplies a set of a learning image sequence that is the basis of an input to the encoder unit, a correct symbol string that is the correct answer for the symbol string corresponding to the learning image sequence, and correct statistical information that is the correct answer for the statistical information corresponding to the symbol string; a loss calculation unit that calculates a loss representing the difference between a learning estimated symbol string, which is the symbol string generated by the decoder unit based on the state data that the encoder unit generates from the learning image sequence, and the correct symbol string supplied by the learning data supply unit for the learning image sequence; and a statistical information loss calculation unit that calculates a statistical information loss representing the difference between learning estimated statistical information, which is the statistical information generated by the statistical information decoder unit based on the state data that the encoder unit generates from the learning image sequence, and the correct statistical information supplied by the learning data supply unit for the learning image sequence. The learning device adjusts the internal parameters of the decoder unit and the internal parameters of the encoder unit based on the loss calculated by the loss calculation unit, and adjusts the internal parameters of the statistical information decoder unit and the internal parameters of the encoder unit based on the statistical information loss calculated by the statistical information loss calculation unit.

[5] In another aspect of the present invention, in the above learning device, the image sequence is video representing a sign language, and the symbol string generated by the decoder unit and the correct symbol string supplied by the learning data supply unit are strings of symbols representing the gloss notation of the sign language.

[6] Another aspect of the present invention is a program for causing a computer to function as the conversion device according to any one of [1] to [3] above.

[7] Another aspect of the present invention is a program for causing a computer to function as the learning device according to [4] or [5] above.

According to the present invention, the statistical information decoder unit outputs statistical information. This statistical information carries information on statistics about the symbol string that the decoder unit outputs. The decoder unit estimates the symbol string while being usefully constrained by this statistical information. This configuration improves the conversion accuracy of the conversion device. In addition, the learning device can build, through machine learning, a model for realizing such a conversion device.

A functional block diagram showing the schematic functional configuration of a conversion device according to an embodiment of the present invention.
A schematic diagram showing the flow of data during operation of the encoder unit and the statistical information decoder unit according to the embodiment.
A schematic diagram showing the flow of data during operation of the encoder unit and the decoder unit according to the embodiment.
A block diagram showing a more specific internal configuration of the encoder unit according to the embodiment.
A block diagram showing a more specific internal configuration of the statistical information decoder unit according to the embodiment.
A block diagram showing a more specific internal configuration of the decoder unit according to the embodiment.
A flowchart showing the procedure of the operation of the conversion device in the learning processing mode according to the embodiment.
A flowchart showing the procedure of the operation of the conversion device in the estimation processing mode according to the embodiment.
A block diagram showing an example of the internal configuration of the conversion device according to the embodiment and the like.

Next, an embodiment of the present invention will be described with reference to the drawings.

In this embodiment, a Transformer is used as the machine learning model. The Transformer is a model that can capture wide-ranging dependencies between input data and output data using only attention, without using an RNN (recurrent neural network) or a CNN (convolutional neural network). The Transformer itself is an existing technology and is described in the following reference.
Reference: Ashish Vaswani et al., Attention Is All You Need, arXiv:1706.03762v5, 2017

The features of the conversion device of this embodiment are as follows. The encoder unit 21 outputs an encoder output matrix (state data) based on the input frame image sequence. The statistical information decoder unit 61 outputs statistical information about the output word string based on this encoder output matrix. The decoder unit 31 outputs an estimated word string based on the encoder output matrix and the statistical information output by the statistical information decoder unit 61. The encoder unit 21, the statistical information decoder unit 61, and the decoder unit 31 are configured so that they can be trained by machine learning. In other words, machine learning processing can be performed on the encoder unit 21, the statistical information decoder unit 61, and the decoder unit 31 in advance using learning data, so that the internal parameters of each are adjusted beforehand.

Because the conversion device 1 of this embodiment includes the statistical information decoder unit 61 described above, it can calculate statistical information in the course of the conversion process. This statistical information is statistical information about the symbol string (for example, a word string in gloss notation) that the conversion device 1 outputs. Specific examples of the statistical information are described later. The statistical information can act, so to speak, as a kind of constraint on the output word string. In the conversion device 1, the decoder unit 31 outputs the estimated word string based on the calculated statistical information. With this configuration, the conversion device 1 of this embodiment can convert an input frame image sequence into an output symbol string more accurately than conversion devices based on conventional technology.

As an example, the input frame image sequence is video representing a sign language. In this case, the symbol string generated by the decoder unit 31 of the conversion device 1 may be a string of symbols representing the gloss notation of the sign language.

FIG. 1 is a functional block diagram showing the schematic functional configuration of the conversion device according to this embodiment. As shown in the figure, the conversion device 1 includes an input unit 10, an encoder unit 21, a decoder unit 31, an output unit 40, a statistical information decoder unit 61, a loss calculation unit 71, a statistical information loss calculation unit 72, a learning data supply unit 81, and a control unit 91. Each of these functional units can be realized, for example, by a computer and a program. Each functional unit also has storage means as necessary. The storage means is, for example, a variable in the program or memory allocated when the program is executed. Non-volatile storage means such as a magnetic hard disk device or a solid state drive (SSD) may also be used as necessary. At least some of the functions of each functional unit may be realized as dedicated electronic circuits rather than as a program.

The conversion device 1 can operate in either of two modes: a learning processing mode and an estimation processing mode. The function of each functional unit is described below.

The input unit 10 acquires an input frame image sequence (video) from outside and passes it to the encoder unit 21. This input frame image sequence contains, for example, the upper body of a signer captured by a camera or similar device. Let T_image be the number of frames in the input frame image sequence, C_image the number of image channels, H the resolution (number of pixels) in the height (vertical) direction, and W the resolution (number of pixels) in the width (horizontal) direction. The input frame image sequence can then be expressed as a four-dimensional array (matrix) with T_image × C_image × H × W elements. The elements of this array are pixel values. The number of channels C_image, the second dimension, is for example 3 for color images (for example, the three primary colors R, G, and B) and 1 for monochrome video. The frame images are preferably in color. The image resolution may be any value. However, considering the problem to be solved and the computational resources required (amount of computation, memory, and so on), relatively small images such as H = 256 and W = 256 may be used as an example. The images preferably show the signer's upper body, including the arms and hands. The area (total number of pixels) of the background, that is, the part of the image other than the signer, should be as small as possible. To that end, the angle of view and position of the camera are adjusted as appropriate. The background may be anything, but a plain background, for example, is preferable.

The number of frames T_image is arbitrary. However, at a frame rate of 30 frames per second (fps), for example, and considering the length of video that corresponds to one sentence in sign language, T_image can be expected to be roughly between 200 (equivalent to 6.67 seconds) and 300 (equivalent to 10.00 seconds). The conversion device 1 of this embodiment is assumed to process video of about that length.
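The array layout and the frame-count arithmetic above can be checked with a minimal sketch. The function names are hypothetical, and the default values (3 channels, 256 × 256 pixels, 30 fps) are simply the example figures quoted in the text, not requirements of the device.

```python
from math import prod


def frame_sequence_shape(t_image, c_image=3, h=256, w=256):
    """Return the (T_image, C_image, H, W) shape of an input frame image sequence."""
    return (t_image, c_image, h, w)


def duration_seconds(t_image, fps=30.0):
    """Length of the video in seconds for a given frame count and frame rate."""
    return t_image / fps


if __name__ == "__main__":
    for t_image in (200, 300):
        shape = frame_sequence_shape(t_image)
        # prod(shape) is the total number of pixel values held in the array.
        print(shape, prod(shape), round(duration_seconds(t_image), 2))
```

At these example settings a single 200-frame color sequence already holds about 39 million pixel values, which illustrates why the text suggests keeping H and W small.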

The encoder unit 21 takes a frame image sequence as input and outputs the encoder output matrix corresponding to that frame image sequence. The encoder output matrix is also called "state data". The encoder output matrix is data that represents the features of the input frame image sequence. In other words, the encoder unit 21 generates state data based on the input image sequence. When operating in the learning processing mode, the encoder unit 21 performs processing based on the frame image sequence (part of the learning data) supplied by the learning data supply unit 81. When operating in the estimation processing mode, the encoder unit 21 performs processing based on the frame image sequence passed from the input unit 10. The encoder unit 21 passes the encoder output matrix to the statistical information decoder unit 61 and to the decoder unit 31.

The statistical information decoder unit 61 generates statistical information by estimation, based on the encoder output matrix output by the encoder unit 21. This statistical information is information on statistics about the symbol string that the decoder unit 31 outputs. The statistical information decoder unit 61 passes the statistical information to the decoder unit 31. In the learning processing mode, the statistical information decoder unit 61 also passes this statistical information to the statistical information loss calculation unit 72.

The decoder unit 31 estimates a word string (symbol string) based on the encoder output matrix output by the encoder unit 21 and the statistical information, and outputs the estimated word string. That is, the decoder unit 31 generates a symbol string based on the encoder output matrix and the statistical information. In the learning processing mode, the decoder unit 31 passes the generated estimated word string to the loss calculation unit 71. In the estimation processing mode, the decoder unit 31 passes the generated estimated word string to the output unit 40. The estimated word string output by the decoder unit 31 is, for example, a symbol string in the gloss notation of the sign language corresponding to the input video.

In the estimation processing mode, the statistical information used by the decoder unit 31 is the statistical information output by the statistical information decoder unit 61. In the learning processing mode, the statistical information used by the decoder unit 31 may be either the statistical information output by the statistical information decoder unit 61 or the correct statistical information supplied by the learning data supply unit 81. In other words, in the learning processing mode, the decoder unit 31 generates the symbol string based on either the estimated statistical information, which is the statistical information generated by the statistical information decoder unit 61, or the correct statistical information supplied by the learning data supply unit 81.
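The choice of which statistical information reaches the decoder unit 31 in each mode could be sketched as follows. The names `Mode`, `select_statistics`, and the `use_correct` flag are hypothetical; the text only states that either source may be used during learning and does not say how the choice is made.

```python
from enum import Enum


class Mode(Enum):
    LEARNING = "learning"
    ESTIMATION = "estimation"


def select_statistics(mode, estimated, correct=None, use_correct=False):
    """Pick the statistical information handed to the decoder unit.

    estimated: output of the statistical information decoder unit.
    correct:   correct statistical information from the learning data supply unit.
    use_correct: hypothetical switch; only meaningful in the learning mode.
    """
    if mode is Mode.ESTIMATION:
        # No correct answer exists at estimation time, so the estimate is used.
        return estimated
    if use_correct:
        if correct is None:
            raise ValueError("correct statistical information was not supplied")
        return correct
    return estimated
```

Feeding the correct statistics during learning resembles teacher forcing, while feeding the estimate matches the conditions of the estimation processing mode; this comparison is an interpretation, not a statement from the text.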

The encoder unit 21, the statistical information decoder unit 61, and the decoder unit 31 are each configured to include a machine learning model. These models are trainable. They may be realized, for example, using neural networks. An example in which each of these units is built with a Transformer, a type of neural network, is described later with reference to other figures. Neural networks can be trained using backpropagation.

The decoder unit 31 updates its internal parameters by backpropagation based on the loss calculated by the loss calculation unit 71. The statistical information decoder unit 61 updates its internal parameters by backpropagation based on the loss (statistical information loss) calculated by the statistical information loss calculation unit 72. The encoder unit 21 updates its internal parameters by backpropagation based on both the loss calculated by the loss calculation unit 71 and the loss calculated by the statistical information loss calculation unit 72.
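How the two losses reach the three units can be illustrated with a toy scalar model. Everything here is an assumption made for illustration: each unit is reduced to a single scalar parameter, both losses are squared errors, and the statistical information is treated as a constant input to the decoder so that the symbol loss does not flow into the statistical information decoder. The actual units are neural networks whose loss functions are not specified in this passage.

```python
def two_loss_gradients(x, y_correct, q_correct, a, b, c):
    """Gradients for a toy model with a shared encoder and two heads.

    a: encoder parameter, b: statistical information decoder parameter,
    c: decoder parameter (all hypothetical scalars).
    """
    state = a * x            # encoder output (state data)
    stats = b * state        # estimated statistical information
    y = c * state + stats    # decoder output; stats is treated as a constant here

    d_loss = 2.0 * (y - y_correct)           # d/dy of (y - y_correct) ** 2
    d_stat_loss = 2.0 * (stats - q_correct)  # d/dstats of (stats - q_correct) ** 2

    return {
        # The decoder is reached only by the symbol loss.
        "decoder": d_loss * state,
        # The statistical information decoder is reached only by the statistical loss.
        "statistical_information_decoder": d_stat_loss * state,
        # The shared encoder accumulates gradients from both losses.
        "encoder": d_loss * c * x + d_stat_loss * b * x,
    }


if __name__ == "__main__":
    print(two_loss_gradients(x=1.0, y_correct=3.0, q_correct=2.0, a=1.0, b=1.0, c=1.0))
```

The point of the sketch is the routing: the encoder gradient is the sum of two terms, one per loss, while each decoder receives a gradient from its own loss only.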

When operating in the estimation processing mode, the output unit 40 receives the estimated word string produced by the decoder unit 31 and outputs it to the outside.

One feature of this embodiment is that the statistical information decoder unit 61 estimates the statistical information before the decoder unit 31 estimates the output word string corresponding to the input frame image sequence. Another feature is that the decoder unit 31 uses this estimated statistical information to estimate the word string.
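The order of operations in the estimation processing mode can be sketched as plain function composition. The `convert` function and the stand-in callables are hypothetical placeholders for the trained units; only the call order reflects the text.

```python
def convert(frames, encoder, statistics_decoder, decoder):
    """Run one estimation pass: encode, estimate statistics, then decode."""
    state = encoder(frames)                  # state data (encoder output matrix)
    statistics = statistics_decoder(state)   # estimated before the word string
    return decoder(state, statistics)        # word string constrained by statistics


if __name__ == "__main__":
    calls = []

    def toy_encoder(frames):
        calls.append("encoder")
        return len(frames)

    def toy_statistics_decoder(state):
        calls.append("statistics_decoder")
        return {"length": state}

    def toy_decoder(state, statistics):
        calls.append("decoder")
        return ["<word>"] * statistics["length"]

    print(convert([0, 1, 2], toy_encoder, toy_statistics_decoder, toy_decoder))
    print(calls)
```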

The statistical information described above is statistical information about the word string in gloss notation that the conversion device 1 outputs. It has the following property: given a word string that is an output of the conversion device 1, the statistical information for that word string can be uniquely determined by a predetermined algorithm with a reasonable, bounded amount of computation. In other words, the statistical information can be computed easily and uniquely from the word string. Specific examples of the correct statistical information are described later.
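The property described above, that the statistical information follows uniquely and cheaply from the word string, can be illustrated with a minimal sketch. The particular statistics chosen here (string length and per-word counts) and the function name are assumptions for illustration only; the concrete statistics used by the embodiment are described elsewhere in the document.

```python
from collections import Counter


def statistics_from_glosses(glosses):
    """Derive illustrative statistics from a gloss word string in one linear pass."""
    counts = Counter(glosses)
    return {
        "length": len(glosses),
        # Sorted so that the same word string always yields the same result.
        "counts": dict(sorted(counts.items())),
    }
```

Because such a function is deterministic, correct statistical information for the learning data can be produced from the correct word strings alone, without separate annotation.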

The loss calculation unit 71 calculates a loss, which is the difference between the estimated word string obtained by the decoder unit 31 and the correct word string provided by the learning data supply unit 81. More precisely, the loss calculation unit 71 calculates a loss representing the difference between the learning estimated symbol string, which is the symbol string generated by the decoder unit 31 based on the encoder output matrix (state data) that the encoder unit 21 generates from the learning image sequence, and the correct symbol string supplied by the learning data supply unit 81 for the learning image sequence. In the learning processing mode, this loss is used to update the parameters of the decoder unit 31 and the encoder unit 21 by backpropagation.

統計情報ロス算出部７２は、統計情報デコーダー部６１が求めた推定統計情報と、学習データ供給部８１が与える正解統計情報との間の差であるロス（統計情報ロス）を算出する。統計情報ロス算出部７２は、学習用画像系列に基づいてエンコーダー部２１が生成する状態データ、に基づいて統計情報デコーダー部６１が生成する統計情報である学習用推定統計情報と、学習用画像系列に対応して学習データ供給部８１が供給する正解統計情報と、の差を表す統計情報ロスを算出する。このロス（統計情報ロス）は、学習処理モードにおいて、誤差逆伝播法を用いて、統計情報デコーダー部６１やエンコーダー部２１のパラメーターを更新するために用いられる。 The statistical information loss calculation unit 72 calculates a loss (statistical information loss) that is the difference between the estimated statistical information obtained by the statistical information decoder unit 61 and the correct statistical information provided by the learning data supply unit 81. The statistical information loss calculation unit 72 calculates a statistical information loss that represents the difference between the learning estimated statistical information, which is statistical information generated by the statistical information decoder unit 61 based on state data generated by the encoder unit 21 based on the learning image sequence, and the correct statistical information supplied by the learning data supply unit 81 in response to the learning image sequence. This loss (statistical information loss) is used to update the parameters of the statistical information decoder unit 61 and the encoder unit 21 using the backpropagation method in the learning processing mode.

学習データ供給部８１は、学習用のデータを変換装置１内の各部に供給する。
具体的には、学習データ供給部８１は、入力フレーム画像系列と、正解語列と、正解統計情報との組を多数集めた学習データセットを持つ。学習データ供給部８１は、学習処理モードの際に、そのデータの組を順次１組ずつ供給する。つまり、学習データ供給部８１は、エンコーダー部２１への入力の基となる学習用画像系列と、その学習用画像系列に対応する記号列の正解である正解記号列と、その正解記号列に対応する統計情報の正解である正解統計情報と、の組を供給する。具体的には、学習データ供給部８１は、入力フレーム画像系列を、エンコーダー部２１に渡す。また、学習データ供給部８１は、正解語列を、ロス算出部７１に渡す。また、学習データ供給部８１は、正解統計情報を、統計情報ロス算出部７２に渡す。 The learning data supplying unit 81 supplies learning data to each unit within the conversion device 1 .
Specifically, the learning data supply unit 81 holds a learning data set consisting of a large number of sets, each made up of an input frame image sequence, a correct word string, and correct statistical information. In the learning processing mode, the learning data supply unit 81 supplies these sets one at a time in sequence. That is, the learning data supply unit 81 supplies a set consisting of a learning image sequence that is the basis of the input to the encoder unit 21, a correct symbol string that is the correct answer for the symbol string corresponding to that learning image sequence, and correct statistical information that is the correct answer for the statistical information corresponding to that correct symbol string. Specifically, the learning data supply unit 81 passes the input frame image sequence to the encoder unit 21, the correct word string to the loss calculation unit 71, and the correct statistical information to the statistical information loss calculation unit 72.

なお、学習データ供給部８１は、正解統計情報をデコーダー部３１に渡してもよい。これは、デコーダー部３１が、学習処理モードにおいて、正解統計情報に基づいて語列を推定する処理を行う場合のためである。 The learning data supply unit 81 may pass the correct answer statistical information to the decoder unit 31. This is for the case where the decoder unit 31 performs processing to estimate a word string based on the correct answer statistical information in the learning processing mode.

学習データ供給部８１は、正解語列に基づいて、予め正解統計情報を求めて、学習データセットの一部として記憶しておいてもよい。また、学習データ供給部８１は、正解語列に基づいて、学習処理の際にその都度正解統計情報を算出するようにしてもよい。 The learning data supply unit 81 may obtain correct answer statistical information in advance based on the correct answer word string and store it as part of the learning data set. The learning data supply unit 81 may also calculate correct answer statistical information each time the learning process is performed based on the correct answer word string.

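正解語列のデータは、入力フレーム画像列に基づいて、例えば、人手で作成する。正解語列は、例えば、手話の内容に対応するグロス記号列である。この正解語列は、例えば、要素数がＴ_ｗｏｒｄのベクトルで表現される。Ｔ_ｗｏｒｄは、正解語列の長さ（記号数）である。このベクトルの要素は、グロスＩＤの数値である。グロスＩＤは、グロス表現の記号を識別するための数値であり、例えば、０以上の整数値で表わされる。グロスＩＤは、特殊記号に対しても付与される。特殊記号とは、例えば、＜ｓｏｓ＞、＜ｅｏｓ＞、＜ｕｎｋ＞という３種類の記号である。＜ｓｏｓ＞は、文の開始を表す記号である。ここでの「ｓｏｓ」は、「Start of Sequence」の略である。＜ｅｏｓ＞は、文の終端を表す記号である。ここでの「ｅｏｓ」は、「End of Sequence」の略である。＜ｕｎｋ＞は、未知語を表す記号である。「ｕｎｋ」は、「unknown」を意味する。例えば、学習データセットに含まれなかった語は記号＜ｕｎｋ＞に対応し得る。上記の３種類の特殊記号を含んで、学習データセットに含まれるグロス記号の種類数プラス３個の記号を、グロスＩＤは識別する。この、（グロス記号の種類数）＋３の値をＶとする。言い換えれば、特殊記号を含む記号の種類数がＶである。なお、正解語列のデータは、必ず、特殊記号＜ｓｏｓ＞で始まり、特殊記号＜ｅｏｓ＞で終わるようにしてよい。 The correct word string data is created, for example, manually on the basis of the input frame image sequence. The correct word string is, for example, a gloss symbol string corresponding to the content of the sign language. This correct word string is represented, for example, by a vector with T_word elements, where T_word is the length (number of symbols) of the correct word string. Each element of this vector is a numerical gloss ID. A gloss ID is a numerical value for identifying a symbol of the gloss notation and is represented, for example, by an integer of 0 or more. Gloss IDs are also assigned to special symbols. The special symbols are, for example, the three symbols <sos>, <eos>, and <unk>. <sos> is a symbol representing the start of a sentence; "sos" stands for "Start of Sequence". <eos> is a symbol representing the end of a sentence; "eos" stands for "End of Sequence". <unk> is a symbol representing an unknown word; "unk" means "unknown". For example, a word that was not included in the learning data set can correspond to the symbol <unk>. Including the three special symbols above, the gloss IDs identify a number of symbols equal to the number of types of gloss symbols in the learning data set plus 3. This value, (number of types of gloss symbols) + 3, is denoted V. In other words, V is the number of types of symbols including the special symbols. The correct word string data may be arranged so that it always begins with the special symbol <sos> and ends with the special symbol <eos>.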
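
The ID assignment described above can be sketched in a few lines of Python. This is only an illustration: the function names `build_vocab` and `encode`, the choice of IDs 0 to 2 for the special symbols, and the sample glosses are assumptions, since the description states only that gloss IDs are integers of 0 or more and that V is the number of gloss types plus 3.

```python
SPECIAL_SYMBOLS = ["<sos>", "<eos>", "<unk>"]  # start, end, unknown word


def build_vocab(training_sentences):
    """Map every symbol to an integer gloss ID (0 or greater).

    Assumption: special symbols take IDs 0..2 and glosses follow in
    order of first appearance; the description fixes no ordering.
    """
    vocab = {symbol: i for i, symbol in enumerate(SPECIAL_SYMBOLS)}
    for sentence in training_sentences:
        for gloss in sentence:
            if gloss not in vocab:
                vocab[gloss] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Return a word string as gloss IDs, wrapped in <sos> ... <eos>."""
    unk = vocab["<unk>"]
    ids = [vocab["<sos>"]]
    ids += [vocab.get(gloss, unk) for gloss in sentence]
    ids.append(vocab["<eos>"])
    return ids


# Hypothetical gloss sentences, for illustration only.
sentences = [["TOMORROW", "RAIN"], ["TODAY", "RAIN", "STRONG"]]
vocab = build_vocab(sentences)
V = len(vocab)  # (number of gloss types) + 3
word_ids = encode(["TOMORROW", "SNOW"], vocab)  # SNOW is unseen, so it maps to <unk>
T_word = len(word_ids)  # length of the word string in symbols
```

Here four distinct glosses plus the three special symbols give V = 7, and the unseen gloss falls back to the ID of <unk>.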

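なお、変形例として、＜ｓｏｓ＞と＜ｅｏｓ＞とに同じ記号を用いてもよい。＜ｓｏｓ＞はデコーダーへの１番目の入力としてのみ用いられ、デコーダーが＜ｓｏｓ＞を出力することがないためである。一例として、＜ｓｏｓ＞と＜ｅｏｓ＞とを代表させて、記号＜ｅｏｓ＞を用いることができる。また、本実施形態では、デコーダーへの１番目の入力は、＜ｓｏｓ＞ではなく統計情報である。このように、本実施形態で使用しない記号＜ｓｏｓ＞をそもそも特殊記号として持たないようにしてもよい。この場合には、特殊記号は、＜ｅｏｓ＞と＜ｕｎｋ＞の２種類である。また、上記のＶは、（グロス記号の種類数）＋２である。＜ｓｏｓ＞を持たない形態とする場合には、実施形態の説明におけるＶの値について適宜読み替える。 As a modification, the same symbol may be used for both <sos> and <eos>. This is because <sos> is used only as the first input to the decoder, and the decoder never outputs <sos>. As one example, the symbol <eos> can be used to represent both <sos> and <eos>. Furthermore, in this embodiment, the first input to the decoder is not <sos> but the statistical information. Accordingly, the symbol <sos>, which this embodiment does not use, need not be included among the special symbols at all. In that case, there are two special symbols, <eos> and <unk>, and V above is (number of types of gloss symbols) + 2. When <sos> is omitted in this way, the value of V in the description of the embodiment should be read accordingly.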

正解統計情報は、統計情報の正解である。正解統計情報は、正解語列から統計的処理によって求められる。正解統計情報の具体例は、サイズがＶ×Ｎの行列である。Ｖは、上述した通り特殊記号を含む記号の種類数である。Ｎは２以上の整数であり、一例として、Ｎ＝４あるいはＮ＝５などとしてよい。 The correct answer statistical information is the correct answer of the statistical information. The correct answer statistical information is obtained by statistical processing from the correct word string. A specific example of the correct answer statistical information is a matrix of size V×N. As described above, V is the number of types of symbols including special symbols. N is an integer of 2 or more, and as an example, N=4 or N=5.

例えばＮ＝４とするとき、正解統計情報は、Ｖ個の、要素数４のワンホット（one hot）ベクトルである。Ｖ個のベクトルは、Ｖ種類の記号に対応する。各記号対応するベクトルは、［１，０，０，０］、［０，１，０，０］、［０，０，１，０］、［０，０，０，１］のいずれかである。このベクトルは、正解語列内における特定の記号の出現回数を表すベクトルである。正解語列内にある記号が出現しないばあいには、その記号に対応するベクトルは［１，０，０，０］である。正解語列内にその記号が１回だけ出現する場合には、その記号に対応するベクトルは［０，１，０，０］である。正解語列内にその記号が２回だけ出現する場合には、その記号に対応するベクトルは［０，０，１，０］である。正解語列内にその記号が３回以上出現する場合には、その記号に対応するベクトルは［０，０，０，１］である。 For example, when N=4, the correct statistical information consists of V one-hot vectors, each with four elements. The V vectors correspond to the V types of symbols. The vector corresponding to each symbol is one of [1,0,0,0], [0,1,0,0], [0,0,1,0], and [0,0,0,1], and represents the number of times that symbol appears in the correct word string. If a symbol does not appear in the correct word string, its vector is [1,0,0,0]. If it appears exactly once, its vector is [0,1,0,0]. If it appears exactly twice, its vector is [0,0,1,0]. If it appears three or more times, its vector is [0,0,0,1].

Ｎの値が４以外であってもよい。一般に１つの記号の統計的特徴を表すベクトルの要素数がＮのとき、次のようになる。即ち、正解語列内におけるその記号の出現回数が、０回のとき、１回のとき、・・・、（Ｎ－２）回のとき、（Ｎ－１）回以上のときにそれぞれ対応して、ベクトルは、［１，０，０，・・・，０］、［０，１，０，・・・，０］、・・・、［０，０，０，・・・，１］である。 The value of N may be other than 4. In general, when the number of elements of a vector representing the statistical characteristics of one symbol is N, it is as follows. That is, when the number of occurrences of that symbol in the correct word string is 0, 1, ..., (N-2), or (N-1) or more times, the vector is [1, 0, 0, ..., 0], [0, 1, 0, ..., 0], ..., [0, 0, 0, ..., 1].

例えばＮ＝２のときは、ベクトルは、その記号が正解語列内に出現するか否かのみを表す情報である。 For example, when N=2, the vector contains information only indicating whether the symbol appears in the correct word string.
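
The rule for deriving the V×N statistical information from a word string can be sketched as follows. The function name `correct_statistics` and the example IDs are hypothetical, and the sketch simply counts every ID in the list it is given (whether special symbols are counted is not stated in the description).

```python
from collections import Counter


def correct_statistics(word_ids, V, N=4):
    """Return a V x N matrix as a list of V one-hot rows.

    Row v has its 1 at column min(count of symbol v, N - 1), so the
    last column means "N - 1 or more occurrences".
    """
    counts = Counter(word_ids)
    matrix = []
    for symbol_id in range(V):
        row = [0] * N
        row[min(counts[symbol_id], N - 1)] = 1
        matrix.append(row)
    return matrix


# Hypothetical example: V = 5 symbol types, word string given as gloss IDs.
stats = correct_statistics([3, 4, 3, 3, 1], V=5, N=4)
# Symbol 3 appears three times, so its row is [0, 0, 0, 1].
```

Because the result depends only on per-symbol counts, it is cheap to compute and is uniquely determined by the word string, which is the property the description requires of the statistical information.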

Ｎの値は、４あるいは５程度が好ましい。Ｎの値が大きすぎても、手話表現の一文の中に特定の記号がそれほど多数回出現することは稀であり、統計情報のサイズと比べて表す情報がそれほど豊富にはならない。 The value of N is preferably around 4 or 5. A particular symbol rarely appears many times within a single sentence of sign language, so making N much larger adds little information relative to the increased size of the statistical information.

制御部９１は、変換装置１の全体を制御する。制御部９１が行う制御の一つは、変換装置１の動作モードの制御である。変換装置１は、学習処理モードまたは推定処理モードのいずれかで動作する。学習処理モードは、学習用データを用いて変換装置１内の各部の機械学習を行うモードである。推定処理モードは、入力されるフレーム画像系列に基づいて、その画像系列に対応する出力記号列を推定する処理を行うモードである。制御部９１は、現在の動作モードが学習処理モードであるか推定処理モードであるかを管理する。つまり、制御部９１は、学習処理モードと推定処理モードとを適宜切り替えて実行させるように変換装置１の中の各部を制御する。制御部９１は、現在の動作モードの情報を、他の各機能部に伝達する。これにより、各機能部が協調的に動作し、変換装置１の全体がその動作モードで動作する。 The control unit 91 controls the entire conversion device 1. One of the controls performed by the control unit 91 is control of the operation mode of the conversion device 1. The conversion device 1 operates in either a learning processing mode or an estimation processing mode. The learning processing mode is a mode in which machine learning of each part in the conversion device 1 is performed using learning data. The estimation processing mode is a mode in which a process is performed to estimate an output symbol string corresponding to an input frame image sequence based on the image sequence. The control unit 91 manages whether the current operation mode is the learning processing mode or the estimation processing mode. In other words, the control unit 91 controls each part in the conversion device 1 to switch between the learning processing mode and the estimation processing mode as appropriate. The control unit 91 transmits information on the current operation mode to each of the other functional units. As a result, each functional unit operates cooperatively, and the entire conversion device 1 operates in that operation mode.

この制御部９１が制御することにより、変換装置１は、各モードにおいて次のように動作する。つまり、学習処理モードにおいては、ロス算出部７１が算出したロスに基づいてデコーダー部３１の内部パラメーターとエンコーダー部２１の内部パラメーターとを調整するとともに、統計情報ロス算出部７２が算出した統計情報ロスに基づいて統計情報デコーダー部６１の内部パラメーターとエンコーダー部２１の内部パラメーターとを調整する。また、推定処理モードにおいては、エンコーダー部２１が推定対象の画像系列を基に状態データを生成し、エンコーダー部２１が生成した状態データを基に統計情報デコーダー部６１が統計情報を生成し、デコーダー部３１が状態データと統計情報とを基に記号列を生成する。 Under the control of this control unit 91, the conversion device 1 operates as follows in each mode. In the learning processing mode, the internal parameters of the decoder unit 31 and of the encoder unit 21 are adjusted based on the loss calculated by the loss calculation unit 71, and the internal parameters of the statistical information decoder unit 61 and of the encoder unit 21 are adjusted based on the statistical information loss calculated by the statistical information loss calculation unit 72. In the estimation processing mode, the encoder unit 21 generates state data based on the image sequence to be estimated, the statistical information decoder unit 61 generates statistical information based on the state data generated by the encoder unit 21, and the decoder unit 31 generates a symbol string based on the state data and the statistical information.
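
The data flow in the two modes can be sketched as plain function composition. In this sketch the callables `encoder`, `stats_decoder`, `decoder`, `loss_fn`, and `stats_loss_fn` are hypothetical stand-ins for the neural networks and loss calculation units, and the parameter update itself is omitted. Passing the correct statistical information to the decoder during learning is one option the description allows, not the only one.

```python
def estimate(frames, encoder, stats_decoder, decoder):
    """Estimation processing mode: frames -> state data -> statistics -> symbols."""
    state = encoder(frames)           # encoder unit 21
    stats = stats_decoder(state)      # statistical information decoder unit 61
    return decoder(state, stats)      # decoder unit 31


def training_losses(sample, encoder, stats_decoder, decoder,
                    loss_fn, stats_loss_fn):
    """Learning processing mode: compute both losses for one training sample."""
    frames, correct_words, correct_stats = sample
    state = encoder(frames)
    estimated_stats = stats_decoder(state)
    # Assumption: the decoder is conditioned on the correct statistics here.
    estimated_words = decoder(state, correct_stats)
    loss = loss_fn(estimated_words, correct_words)              # loss calculation unit 71
    stats_loss = stats_loss_fn(estimated_stats, correct_stats)  # statistical information loss calculation unit 72
    return loss, stats_loss
```

Both losses depend on the shared encoder output, which is why each of them can drive updates of the encoder parameters as well as those of its own decoder.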

このように、本実施形態では、変換装置１は、途中で統計情報を用いる方法によって語列を推定する。入力されるフレーム画像系列に基づいて対応する語列を推定するタスクは、次の２つタスクに分割することができる。その第１のタスクは、語列内に出現する語が何であるかを推定するタスク（統計情報を推定するタスク）である。そして、第２のタスクは、その統計情報が制約する条件下で出力される語列は何かを推定するタスク（語列内における語の順序を推定するタスク）である。本実施形態の変換装置１は、これらの２つのタスクに専用の機能部をそれぞれ含む構成を持つ。第１のタスクを担うのが統計情報デコーダー部６１であり、第２のタスクを担うのがデコーダー部３１である。言い換えれば、統計情報デコーダー部６１とデコーダー部３１のそれぞれは、第１のタスクと第２のタスクを同時に実行する場合に比べて、より簡単なタスクのみを実行すればよい。これにより、本実施形態の変換処理の精度は、従来技術におけるそれよりも、良くなる。 In this way, in this embodiment, the conversion device 1 estimates a word string by using statistical information along the way. The task of estimating a corresponding word string based on an input frame image sequence can be divided into the following two tasks. The first task is a task of estimating what words appear in the word string (a task of estimating statistical information). The second task is a task of estimating what word string is output under conditions constrained by the statistical information (a task of estimating the order of words in the word string). The conversion device 1 of this embodiment has a configuration including functional units dedicated to these two tasks. The statistical information decoder unit 61 is responsible for the first task, and the decoder unit 31 is responsible for the second task. In other words, each of the statistical information decoder unit 61 and the decoder unit 31 needs to execute only a simpler task compared to the case where the first task and the second task are executed simultaneously. As a result, the accuracy of the conversion process of this embodiment is improved compared to that of the conventional technology.

図２は、エンコーダー部２１および統計情報デコーダー部６１の動作時のデータの流れを示す概略図である。図示するように、エンコーダー部２１は、内部に、ニューラルネットワーク２００１を備える。また、統計情報デコーダー部６１は、内部に、ニューラルネットワーク６００１を備える。エンコーダー部２１のニューラルネットワーク２００１は、入力されるフレーム画像系列に基づく演算を行い、エンコーダー出力行列のデータを出力する。統計情報デコーダー部６１のニューラルネットワーク６００１は、ニューラルネットワーク２００１から渡されるエンコーダー出力行列に基づいて、演算を行い、推定統計情報を出力する。 Figure 2 is a schematic diagram showing the flow of data during operation of the encoder unit 21 and the statistical information decoder unit 61. As shown in the figure, the encoder unit 21 includes a neural network 2001 therein. The statistical information decoder unit 61 includes a neural network 6001 therein. The neural network 2001 of the encoder unit 21 performs calculations based on the input frame image sequence, and outputs encoder output matrix data. The neural network 6001 of the statistical information decoder unit 61 performs calculations based on the encoder output matrix passed from the neural network 2001, and outputs estimated statistical information.

推定処理モードにおいては、統計情報デコーダー部６１が求めた推定統計情報は、デコーダー部３１に渡され、推定語列の推定のために用いられる。 In the estimation processing mode, the estimated statistical information obtained by the statistical information decoder unit 61 is passed to the decoder unit 31 and used to estimate the estimated word string.

学習処理モードにおいては、統計情報ロス算出部７２が、統計情報の正解である正解統計情報と、ニューラルネットワーク６００１が出力した統計情報（推定統計情報）とのロス（統計情報ロス）を求める。ニューラルネットワーク６００１および２００１は、このロス（統計情報ロス）に基づいて、誤差逆伝播を行う。つまり、ニューラルネットワーク６００１および２００１は、ニューラルネットワーク６００１から２００１への伝播経路による処理で、それぞれのパラメーターを更新する。 In the learning processing mode, the statistical information loss calculation unit 72 calculates the loss (statistical information loss) between the correct answer statistical information, which is the correct answer to the statistical information, and the statistical information (estimated statistical information) output by the neural network 6001. The neural networks 6001 and 2001 perform error backpropagation based on this loss (statistical information loss). In other words, the neural networks 6001 and 2001 update their respective parameters by processing along the propagation path from the neural network 6001 to the neural network 2001.

図３は、エンコーダー部２１およびデコーダー部３１の動作時のデータの流れを示す概略図である。前述の通り、エンコーダー部２１は、ニューラルネットワーク２００１を備える。また、デコーダー部３１は、内部に、ニューラルネットワーク３００１を備える。デコーダー部３１のニューラルネットワーク３００１は、ニューラルネットワーク２００１から渡されるエンコーダー出力行列と、統計情報と、に基づいて、演算を行い、推定語列のデータを出力する。 Figure 3 is a schematic diagram showing the flow of data during operation of the encoder unit 21 and the decoder unit 31. As described above, the encoder unit 21 includes a neural network 2001. The decoder unit 31 includes an internal neural network 3001. The neural network 3001 of the decoder unit 31 performs calculations based on the encoder output matrix and statistical information passed from the neural network 2001, and outputs data on the estimated word sequence.

推定処理モードにおいては、デコーダー部３１が求めた推定語列は、出力部４０に渡され、入力フレーム画像系列に対応する語列として出力される。例えば、入力フレーム画像系列が手話を表す映像である場合には、出力される推定語列は、その手話に対応するグロス表現の記号列である。 In the estimation processing mode, the estimated word string obtained by the decoder unit 31 is passed to the output unit 40 and output as the word string corresponding to the input frame image sequence. For example, if the input frame image sequence is video of sign language, the output estimated word string is a symbol string in gloss notation corresponding to that sign language.

学習処理モードにおいては、ロス算出部７１が、入力フレーム画像系列に対応する正解語列と、ニューラルネットワーク３００１が出力した推定語列とのロスを求める。ニューラルネットワーク３００１および２００１は、このロスに基づいて、誤差逆伝播を行う。つまり、ニューラルネットワーク３００１および２００１は、ニューラルネットワーク３００１から２００１への伝播経路による処理で、それぞれのパラメーターを更新する。 In the learning processing mode, the loss calculation unit 71 calculates the loss between the correct word sequence corresponding to the input frame image sequence and the estimated word sequence output by the neural network 3001. The neural networks 3001 and 2001 perform error backpropagation based on this loss. In other words, the neural networks 3001 and 2001 update their respective parameters by processing along the propagation path from the neural network 3001 to the neural network 2001.

図４は、エンコーダー部２１のより具体的な構成を示すブロック図である。図示するように、エンコーダー部２１は、ニューラルネットワークとして、トランスフォーマー２００２を使用する。トランスフォーマー２００２は、入力フレーム画像系列を受け取り、その特徴を表す行列を出力する。この行列を、エンコーダー部出力行列と呼ぶ。エンコーダー出力行列は、Ｔ_{ｉｍａｇｅ}×Ｃ_{ｅｎｃｏｄｅｒ}の行列として表現される。前述の通り、Ｔ_{ｉｍａｇｅ}は、エンコーダー部２１に入力されるフレーム画像の数である。また、Ｃ_{ｅｎｃｏｄｅｒ}は、適宜定められる正整数である。Ｃ_{ｅｎｃｏｄｅｒ}の値に特に制限はない。エンコーダー部２１が出力する情報量と、コンピューターのメモリーの制約とを考慮して、Ｃ_{ｅｎｃｏｄｅｒ}の値を、例えば５１２以上且つ４０９６以下程度としてよい。 FIG. 4 is a block diagram showing a more specific configuration of the encoder unit 21. As shown in the figure, the encoder unit 21 uses a transformer 2002 as a neural network. The transformer 2002 receives an input frame image sequence and outputs a matrix representing the characteristics of the input frame image sequence. This matrix is called an encoder unit output matrix. The encoder output matrix is expressed as a matrix of T _image ×C _encoder . As described above, T _image is the number of frame images input to the encoder unit 21. Furthermore, C _encoder is a positive integer that is appropriately determined. There is no particular restriction on the value of C _encoder . Taking into consideration the amount of information output by the encoder unit 21 and the memory constraints of the computer, the value of C _encoder may be set to, for example, about 512 or more and 4096 or less.

図５は、統計情報デコーダー部６１のより具体的な構成を示すブロック図である。図示するように、統計情報デコーダー部６１は、内部のニューラルネットワークとして、トランスフォーマー６００２と、全結合層６００３とを含む。全結合層６００３は、全結合１層のニューラルネットワークである。 Figure 5 is a block diagram showing a more specific configuration of the statistical information decoder unit 61. As shown in the figure, the statistical information decoder unit 61 includes a transformer 6002 and a fully connected layer 6003 as an internal neural network. The fully connected layer 6003 is a fully connected single-layer neural network.

トランスフォーマー６００２は、エンコーダー部２１から渡されるエンコーダー出力行列を受け取る。エンコーダー出力行列は、Ｔ_{ｉｍａｇｅ}×Ｃ_{ｅｎｃｏｄｅｒ}のサイズを持つ行列である。一般にデコーダーは系列を出力する構造を持つものであるが、統計情報デコーダー部６１は統計情報を出力すればよい。そのため、トランスフォーマー６００２は、１×Ｖの行列を出力した時点で、動作を終了する。この動作は、デコーダーが第１番目の記号のみを出力する動作に相当する。トランスフォーマー６００２が出力する１×Ｖの行列は、Ｖ種類の記号のそれぞれについての出現回数の情報を表す。 The transformer 6002 receives the encoder output matrix passed from the encoder unit 21. The encoder output matrix is a matrix having a size of T _image ×C _encoder . Generally, a decoder has a structure for outputting a sequence, but the statistical information decoder unit 61 only needs to output statistical information. Therefore, the transformer 6002 ends its operation when it outputs a 1 × V matrix. This operation corresponds to the decoder outputting only the first symbol. The 1 × V matrix output by the transformer 6002 represents information on the number of occurrences of each of the V types of symbols.

全結合層６００３は、トランスフォーマー６００２が出力した上記の１×Ｖの行列を受け取り、Ｖ×Ｎの行列に変換して、推定統計情報として出力する。推定統計情報を表すＶ×Ｎの行列は、既に説明した正解統計情報を表すＶ×Ｎの行列と同じ構造を持つ。つまり、Ｖは記号（特殊記号を含む）の種類数である。また、Ｖ×Ｎの行列を構成するＶ個のベクトル（要素数がＮのベクトル）は、各記号の出現回数に関する情報（例えば、Ｎ＝４の場合、出現回数は、０回、１回、２回、３回以上のいずれか）を表す。 The fully connected layer 6003 receives the 1xV matrix output by the transformer 6002, converts it into a VxN matrix, and outputs it as estimated statistical information. The VxN matrix representing the estimated statistical information has the same structure as the VxN matrix representing the ground truth statistical information already explained. In other words, V is the number of types of symbols (including special symbols). In addition, the V vectors (vectors with N elements) that make up the VxN matrix represent information about the number of times each symbol appears (for example, when N=4, the number of appearances is either 0, 1, 2, or 3 or more times).

学習処理モードで動作する場合には、統計情報デコーダー部６１は、全結合層６００３が出力した上記のＶ×Ｎの行列（推定統計情報）を、統計情報ロス算出部７２に渡す。統計情報ロス算出部７２は、統計情報デコーダー部６１が出力した推定統計情報と、学習データ供給部８１が供給する正解統計情報との間のロス（統計情報ロス）を計算する。このとき、推定統計情報と正解統計情報とは、ともにＶ×Ｎの行列で表現されている。統計情報ロス算出部７２は、任意の適切な方法で、推定統計情報と正解統計情報とのロス（統計情報ロス）を算出してよい。統計情報ロス算出部７２は、例えば、推定統計情報と正解統計情報との交差エントロピー誤差を、ロスとして算出する。統計情報ロス算出部７２が計算したロスにしたがって、統計誤差の逆伝播路上にあるニューラルネットワークは内部のパラメーターを更新する。具体的には、ここでは、統計情報デコーダー部６１とエンコーダー部２１が、内部のパラメーターを更新する。パラメーターを更新する手法としては、例えば、確率的勾配降下法（ＳＧＤ：Stochastic Gradient Descent）や、Ａｄａｍ（Adaptive moment estimation）など、一般的なニューラルネットワークの最適化手法を用いることができる。 When operating in the learning processing mode, the statistical information decoder unit 61 passes the above V×N matrix (estimated statistical information) output by the fully connected layer 6003 to the statistical information loss calculation unit 72. The statistical information loss calculation unit 72 calculates the loss (statistical information loss) between the estimated statistical information output by the statistical information decoder unit 61 and the correct statistical information supplied by the learning data supply unit 81. At this time, both the estimated statistical information and the correct statistical information are expressed as a V×N matrix. The statistical information loss calculation unit 72 may calculate the loss (statistical information loss) between the estimated statistical information and the correct statistical information by any appropriate method. For example, the statistical information loss calculation unit 72 calculates the cross entropy error between the estimated statistical information and the correct statistical information as the loss. According to the loss calculated by the statistical information loss calculation unit 72, the neural network on the backpropagation path of the statistical error updates the internal parameters. Specifically, here, the statistical information decoder unit 61 and the encoder unit 21 update the internal parameters. 
As a method for updating parameters, for example, common neural network optimization methods such as Stochastic Gradient Descent (SGD) and Adaptive moment estimation (Adam) can be used.
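
As one possible reading of the cross-entropy error mentioned above, the following sketch compares an estimated V×N matrix with the one-hot correct statistical information. It assumes that each estimated row holds raw scores normalized with a softmax and that the per-symbol losses are averaged; neither detail is specified in the description, and the function names are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log of the softmax of one row of scores."""
    peak = max(scores)
    log_total = math.log(sum(math.exp(s - peak) for s in scores))
    return [s - peak - log_total for s in scores]


def statistics_cross_entropy(estimated_scores, correct_stats):
    """Mean cross-entropy over the V rows of two V x N matrices."""
    total = 0.0
    for score_row, target_row in zip(estimated_scores, correct_stats):
        log_probs = log_softmax(score_row)
        total -= sum(t * lp for t, lp in zip(target_row, log_probs))
    return total / len(correct_stats)


# Hypothetical values: uniform scores give a loss of log(N) per symbol.
uniform_loss = statistics_cross_entropy(
    [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    [[1, 0, 0, 0], [0, 0, 1, 0]],
)
```

With one-hot targets, each row contributes the negative log-probability assigned to the correct occurrence count, so the loss falls toward zero as the estimate concentrates on the right column.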

推定処理モードで動作する場合には、統計情報デコーダー部６１は、全結合層６００３が出力した上記のＶ×Ｎの行列（推定統計情報）を、デコーダー部３１に渡す。 When operating in the estimation processing mode, the statistical information decoder unit 61 passes the above V×N matrix (estimated statistical information) output by the fully connected layer 6003 to the decoder unit 31.

図６は、デコーダー部３１のより具体的な構成を示すブロック図である。図示するように、デコーダー部３１は、内部のニューラルネットワークとして、トランスフォーマー３００２と、全結合層３００３とを含む。 Figure 6 is a block diagram showing a more specific configuration of the decoder unit 31. As shown in the figure, the decoder unit 31 includes a transformer 3002 and a fully connected layer 3003 as an internal neural network.

トランスフォーマー３００２は、エンコーダー部出力行列と、統計情報とに基づいて、記号列を出力する。前述の通り、トランスフォーマー３００２が出力する記号列は、グロス記号および特殊記号から成る列である。 The transformer 3002 outputs a symbol string based on the encoder output matrix and the statistical information. As described above, the symbol string output by the transformer 3002 is a string consisting of gloss symbols and special symbols.

全結合層３００３は、全結合１層のニューラルネットワークである。全結合層３００３は、Ｖ×Ｎのサイズの行列を入力し、その行列を１×Ｖのサイズの行列に変換し、出力する。 The fully connected layer 3003 is a fully connected single layer neural network. The fully connected layer 3003 inputs a matrix of size V x N, converts the matrix to a matrix of size 1 x V, and outputs it.

本実施形態のトランスフォーマー３００２は、特殊記号＜ｓｏｓ＞（シーケンスの開始）の代わりに、統計情報を入力としてとる。つまり、トランスフォーマー３００２は、まず統計情報を入力にとり、１個目のグロス記号を出力する。次に、トランスフォーマー３００２は、その１個目のグロス記号を入力にとり、２個目のグロス記号を出力する。以後、同様に、トランスフォーマー３００２は、グロス記号の出力を続ける。トランスフォーマー３００２は、最後に特殊記号＜ｅｏｓ＞（シーケンスの終わり）を出力する。つまり、特殊記号＜ｅｏｓ＞を出力すると、トランスフォーマー３００２は、処理を終了する。 In this embodiment, the transformer 3002 takes the statistical information as input in place of the special symbol <sos> (start of sequence). That is, the transformer 3002 first takes the statistical information as input and outputs the first gloss symbol. Next, it takes that first gloss symbol as input and outputs the second gloss symbol. It continues to output gloss symbols in the same manner. Finally, the transformer 3002 outputs the special symbol <eos> (end of sequence); that is, once it has output <eos>, the transformer 3002 ends its processing.
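
The generation loop described above can be sketched as follows. The callable `step` is a hypothetical stand-in for the transformer 3002 (it receives the encoder output and all decoder inputs so far and returns the next symbol ID), and `max_len` is a safety bound added for this sketch; neither name comes from the description.

```python
def decode(step, state, stats_input, eos_id, max_len=50):
    """Generate gloss IDs until <eos> is produced."""
    inputs = [stats_input]      # first input: statistical information, not <sos>
    output = []
    for _ in range(max_len):    # assumed upper bound on the output length
        symbol = step(state, inputs)
        output.append(symbol)
        if symbol == eos_id:    # <eos> ends the processing
            break
        inputs.append(symbol)   # feed the symbol back as the next input
    return output
```

The only difference from a conventional autoregressive decoder loop is the first element of `inputs`, which carries the statistical information instead of a start-of-sequence symbol.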

なお、デコーダー部３１に入力される統計情報は、Ｖ×Ｎのサイズを持つ行列である。全結合層３００３は、このＶ×Ｎの行列として表現された統計情報を入力し、その統計情報を１×Ｖのサイズの統計情報に変換する。全結合層３００３は、出力する１×Ｖのサイズの行列（統計情報）を、トランスフォーマー３００２に渡すものである。 The statistical information input to the decoder unit 31 is a matrix with a size of V×N. The fully connected layer 3003 inputs the statistical information expressed as this V×N matrix and converts the statistical information into statistical information of a size of 1×V. The fully connected layer 3003 passes the output matrix (statistical information) of a size of 1×V to the transformer 3002.

デコーダー部３１は、最終的に長さＴ_ｐｒｅｄのベクトルを出力することを期待される。この長さＴ_ｐｒｅｄのベクトルが、デコーダー部３１による推定結果（推定語列）である。ただし、トランスフォーマー３００２は、推定語列に相当するベクトルを直接出力するのではなく、記号の出現確率の列であるＴ_ｐｒｅｄ×Ｖの行列を出力する。行列の要素は、長さＴ_ｐｒｅｄの記号列の位置ごとの、特定の記号が出現する確率の値である。前述の通り、Ｖは、語彙の規模、つまりグロス記号と特殊記号の種類の総数である。 The decoder unit 31 is expected to ultimately output a vector of length T_pred. This vector of length T_pred is the estimation result (estimated word string) of the decoder unit 31. However, the transformer 3002 does not directly output a vector corresponding to the estimated word string; it outputs a T_pred×V matrix, which is a sequence of symbol occurrence probabilities. Each element of the matrix is the probability that a specific symbol appears at a given position in the symbol string of length T_pred. As mentioned above, V is the size of the vocabulary, that is, the total number of types of gloss symbols and special symbols.

学習処理モードでは、ロス算出部７１は、推定語列と正解語列とのロスを求める。具体的には、ロス算出部７１は、推定語列に相当するＴ_ｐｒｅｄ×Ｖの行列と、正解語列に相当するＴ_ｐｒｅｄ×Ｖの行列とから、交差エントロピー誤差を求め、これをロスとする。このとき、推定語列に相当するＴ_ｐｒｅｄ×Ｖの行列は、推定語列の各位置における、Ｖ種類の記号の各々の出現確率を表す行列である。また、正解語列に相当するＴ_ｐｒｅｄ×Ｖの行列は、正解語列の各位置において、正解である記号の出現確率は１として、その他の記号の出現確率を０としたような確率分布を表す行列である。 In the learning processing mode, the loss calculation unit 71 calculates the loss between the estimated word string and the correct word string. Specifically, the loss calculation unit 71 calculates the cross entropy error from a T _pred ×V matrix corresponding to the estimated word string and a T _pred ×V matrix corresponding to the correct word string, and sets this as the loss. At this time, the T _pred ×V matrix corresponding to the estimated word string is a matrix representing the occurrence probability of each of V types of symbols at each position of the estimated word string. Also, the T _pred ×V matrix corresponding to the correct word string is a matrix representing a probability distribution in which the occurrence probability of the correct symbol is 1 and the occurrence probability of other symbols is 0 at each position of the correct word string.

ロス算出部７１が計算したロスにしたがって、推定語列の逆伝播路上にあるニューラルネットワークは内部のパラメーターを更新する。具体的には、ここでは、デコーダー部３１とエンコーダー部２１が、内部のパラメーターを更新する。パラメーターを更新する手法としては、例えば、確率的勾配降下法（ＳＧＤ：Stochastic Gradient Descent）や、Ａｄａｍ（Adaptive moment estimation）など、一般的なニューラルネットワークの最適化手法を用いることができる。 The neural networks on the backpropagation path of the estimated word string update their internal parameters according to the loss calculated by the loss calculation unit 71. Specifically, here, the decoder unit 31 and the encoder unit 21 update their internal parameters. As a method for updating the parameters, a common neural network optimization method can be used, such as stochastic gradient descent (SGD) or Adam (adaptive moment estimation).

なお、トランスフォーマーのモデルを用いる場合には、その構造上、推定語列の長さであるＴ_ｐｒｅｄと、正解語列の長さであるＴ_ｗｏｒｄは同じである。 When a transformer model is used, due to its structure, the length of the estimated word string, T_pred, is the same as the length of the correct word string, T_word.

推論処理モードでは、記号出現確率を表すＴ_ｐｒｅｄ×Ｖの行列から、推定語列が求められる。推定語列を求めるための一つの方法は、語列内の位置ごとに、即ち要素数Ｖのベクトルごとに、最も高い確率値を持つ要素に対応する記号を、その位置の記号として決定する方法（貪欲法，greedy algorithm）である。他の方法の例として、評価値の良い方から順に上位ｎ個（ｎは、適宜定められる正整数）の選択肢のみに絞りながら文全体の最適解（または準最適解）を探索する方法（ビームサーチ、beam search）を用いてもよい。なお、貪欲法も、ビームサーチも、それらの手法自体は既存の技術である。このようにして、トランスフォーマー３００２が出力する、グロスＩＤ出現確率を表すＴ_ｐｒｅｄ×Ｖの行列を、要素数Ｔ_ｐｒｅｄのベクトル（推定語列）に変換することができる。 In the inference processing mode, the estimated word string is obtained from the T_pred×V matrix representing symbol occurrence probabilities. One method of obtaining the estimated word string is to determine, for each position in the word string, that is, for each vector of V elements, the symbol corresponding to the element with the highest probability as the symbol at that position (greedy algorithm). As another example, a method may be used that searches for the optimal (or near-optimal) solution for the whole sentence while narrowing the candidates to only the top n in order of evaluation value, where n is a suitably chosen positive integer (beam search). Both the greedy algorithm and beam search are themselves existing techniques. In this way, the T_pred×V matrix of gloss ID occurrence probabilities output by the transformer 3002 can be converted into a vector with T_pred elements (the estimated word string).
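
The greedy method amounts to taking an argmax over each row of the probability matrix. A minimal sketch, with a hypothetical function name and made-up probabilities, is shown below; beam search is not covered here.

```python
def greedy_decode(prob_matrix):
    """Pick the most probable symbol ID at each position.

    `prob_matrix` is a T_pred x V list of rows; the result is a list of
    T_pred gloss IDs (the estimated word string).
    """
    return [max(range(len(row)), key=row.__getitem__) for row in prob_matrix]


# Hypothetical probabilities for T_pred = 3 positions and V = 4 symbols.
probs = [
    [0.1, 0.1, 0.1, 0.7],
    [0.2, 0.5, 0.2, 0.1],
    [0.6, 0.2, 0.1, 0.1],
]
estimated = greedy_decode(probs)
```

Each position is decided independently of the others, which is what distinguishes this method from beam search, where several partial sentences are kept and compared as a whole.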

推定語率に対応する、記号出現確率を表す行列は、上記の通りである。この行列は、Ｖ行Ｔ_ｐｒｅｄ列の行列である。この行列の第ｉ列（１≦ｉ≦Ｔ_ｐｒｅｄ）は、要素数Ｖのベクトル（上で破線で示す列ベクトル）である。このベクトルの要素は、ｐ_ｉ，１，ｐ_ｉ，２，・・・，ｐ_ｉ，Ｖである。すべての記号には、予めサフィックスが関連付けられている。例えば、第１番目の記号の出現確率は、ｐ_ｉ，１である。他の記号の出現確率も同様である。 The matrix representing the symbol occurrence probabilities corresponding to the estimated word string is as shown above. This matrix has V rows and T_pred columns. The i-th column (1≦i≦T_pred) of this matrix is a vector with V elements (the column vector indicated by a dashed line above). The elements of this vector are p_{i,1}, p_{i,2}, ..., p_{i,V}. Every symbol is associated with a suffix (index) in advance. For example, the occurrence probability of the first symbol at position i is p_{i,1}, and likewise for the other symbols.
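
Written out from the definitions in this paragraph, the matrix would take the following form. This is a reconstruction under the stated indexing, in which p_{i,j} is the occurrence probability of the j-th symbol at position i; the name P is introduced here only for convenience.

```latex
P =
\begin{pmatrix}
p_{1,1} & p_{2,1} & \cdots & p_{T_{pred},1} \\
p_{1,2} & p_{2,2} & \cdots & p_{T_{pred},2} \\
\vdots  & \vdots  & \ddots & \vdots \\
p_{1,V} & p_{2,V} & \cdots & p_{T_{pred},V}
\end{pmatrix}
```

The i-th column collects the V probabilities for position i, so this V-row, T_pred-column layout is the transpose of the T_pred×V arrangement used elsewhere in the description.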

図７は、変換装置１における機械学習処理（学習処理モードの動作）の手順を示すフローチャートである。以下、このフローチャートに沿って、機械学習処理の手順について説明する。 Figure 7 is a flowchart showing the steps of the machine learning process (operation in the learning process mode) in the conversion device 1. Below, the steps of the machine learning process are explained in accordance with this flowchart.

ステップＳ１０１において、学習データ供給部８１は、１件の学習用データを学習データセットから取得して、変換装置１に供給する。ここで、１件の学習用データは、入力フレーム画像系列（入力映像）と、正解語列と、正解統計情報とを含む。この１件の学習用データにおいて、正解語列は、その入力フレーム画像系列に対応する正解の手話記号列である。また、正解統計情報は、その正解語列に対応する正解の統計情報である。学習データ供給部８１は、正解統計情報を、正解語列に基づいて予め計算しておき、学習データセット内に持っている。この学習用データは、機械学習を行うモデルの入力データと出力データ（正解データ）に相当する。エンコーダー部２１から統計情報デコーダー部６１に続く系列に関しては、入力フレーム画像系列が入力データであり、正解統計情報が正解の出力データである。エンコーダー部２１からデコーダー部３１に続く系列に関しては、入力フレーム画像系列が入力データであり、正解語列が正解の出力データである。あるいは、学習データ供給部８１は、正解統計情報を、正解語列に基づいて本ステップでその都度算出するようにしてもよい。学習データ供給部８１は、入力フレーム画像系列のデータを、エンコーダー部２１に渡す。また、学習データ供給部８１は、正解語列のデータを、ロス算出部７１に渡す。また、学習データ供給部８１は、正解統計情報のデータを、統計情報ロス算出部７２に渡す。 In step S101, the learning data supply unit 81 acquires one learning data from the learning data set and supplies it to the conversion device 1. Here, the one learning data includes an input frame image sequence (input video), a correct answer word sequence, and correct answer statistical information. In this one learning data, the correct answer word sequence is a correct sign language symbol sequence corresponding to the input frame image sequence. Also, the correct answer statistical information is correct answer statistical information corresponding to the correct answer word sequence. The learning data supply unit 81 calculates the correct answer statistical information in advance based on the correct answer word sequence and stores it in the learning data set. This learning data corresponds to the input data and output data (correct answer data) of a model that performs machine learning. With respect to the sequence continuing from the encoder unit 21 to the statistical information decoder unit 61, the input frame image sequence is the input data, and the correct answer statistical information is the correct answer output data. With respect to the sequence continuing from the encoder unit 21 to the decoder unit 31, the input frame image sequence is the input data, and the correct answer word sequence is the correct answer output data. 
Alternatively, the learning data supplying unit 81 may calculate the correct answer statistical information each time in this step based on the correct answer word string. The learning data supplying unit 81 passes data of the input frame image sequence to the encoder unit 21. The learning data supplying unit 81 also passes data of the correct answer word string to the loss calculation unit 71. The learning data supplying unit 81 also passes data of the correct answer statistical information to the statistical information loss calculation unit 72.
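As a minimal sketch of step S101, the following shows how correct answer statistical information could be derived from a correct answer word sequence and bundled into one training example. The vocabulary, the function names, and the use of a plain count vector are assumptions made for illustration; the embodiment itself encodes the statistical information as a matrix, which is not reproduced here.

```python
from collections import Counter

# Hypothetical toy vocabulary; the real symbol set is not given in this section.
VOCAB = ["<s>", "</s>", "I", "BOOK", "READ", "LIKE"]

def correct_statistics(correct_words, vocab=VOCAB):
    """Return the occurrence count of each vocabulary symbol in the correct word sequence."""
    counts = Counter(correct_words)
    return [counts[symbol] for symbol in vocab]

def make_training_example(frames, correct_words):
    """Bundle one training example: input frames, correct words, correct statistics."""
    return {
        "frames": frames,                                      # input frame image sequence
        "correct_words": correct_words,                        # correct answer word sequence
        "correct_stats": correct_statistics(correct_words),    # correct answer statistical information
    }

example = make_training_example(
    frames=["frame0", "frame1"],          # placeholders standing in for image data
    correct_words=["I", "BOOK", "BOOK", "READ"],
)
print(example["correct_stats"])  # [0, 0, 1, 2, 1, 0]
```

Computing the counts inside make_training_example corresponds to calculating the statistics each time in this step; storing the result in the data set corresponds to calculating them in advance.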

次に、ステップＳ１０２において、エンコーダー部２１は、ステップＳ１０１で渡された入力フレーム画像系列を基に、順伝播を行う。即ち、エンコーダー部２１は、エンコーディング処理を行う。エンコーダー部２１は、順伝播の結果として、エンコーダー出力行列を出力する。エンコーダー部２１は、このエンコーダー出力行列を、統計情報デコーダー部６１とデコーダー部３１とにそれぞれ渡す。 Next, in step S102, the encoder unit 21 performs forward propagation based on the input frame image sequence passed in step S101. That is, the encoder unit 21 performs encoding processing. The encoder unit 21 outputs an encoder output matrix as a result of the forward propagation. The encoder unit 21 passes this encoder output matrix to the statistical information decoder unit 61 and the decoder unit 31, respectively.

次に、ステップＳ１０３において、統計情報デコーダー部６１は、エンコーダー部２１から渡されたエンコーダー出力行列を基に、順伝搬を行う。この順伝搬の結果として、統計情報デコーダー部６１は統計情報の推定値である推定統計情報を出力する。統計情報デコーダー部６１は、この推定統計情報を、統計情報ロス算出部７２に渡す。 Next, in step S103, the statistical information decoder unit 61 performs forward propagation based on the encoder output matrix passed from the encoder unit 21. As a result of this forward propagation, the statistical information decoder unit 61 outputs estimated statistical information, which is an estimated value of the statistical information. The statistical information decoder unit 61 passes this estimated statistical information to the statistical information loss calculation unit 72.

次に、ステップＳ１０４において、デコーダー部３１は、エンコーダー部２１から渡されたエンコーダー出力行列と、統計情報とを基に、順伝搬を行う。この順伝搬の結果として、デコーダー部３１は入力フレーム画像系列に対応する推定語列を推定する。デコーダー部３１は、求めた推定語列を、ロス算出部７１に渡す。 Next, in step S104, the decoder unit 31 performs forward propagation based on the encoder output matrix and statistical information passed from the encoder unit 21. As a result of this forward propagation, the decoder unit 31 estimates an estimated word sequence corresponding to the input frame image sequence. The decoder unit 31 passes the obtained estimated word sequence to the loss calculation unit 71.

上記の通りステップＳ１０４において、デコーダー部３１は、統計情報を用いて推定語列を求める。このときの統計情報として、推定統計情報を使ってもよいし、正解統計情報を使ってもよい。また、デコーダー部３１がステップＳ１０４において推定統計情報を用いるか正解統計情報を用いるかを、確率的に分布させるようにしてもよい。つまり、デコーダー部３１が、確率ｐで推定統計情報を用いて、確率（１－ｐ）で正解統計情報を用いるようにしてもよい。ここで、ｐは、０≦ｐ≦１を満たす実数である。一例として、ｐ＝０．５としてもよい。なお、推定統計情報は、統計情報デコーダー部６１がステップＳ１０３で求めたデータである。正解統計情報は、学習データ供給部８１が供給するデータである。制御部９１は、ステップＳ１０４においてどちらの統計情報がデコーダー部３１に渡されるかを、適宜、制御する。 As described above, in step S104, the decoder unit 31 uses statistical information to obtain an estimated word string. The statistical information used at this time may be estimated statistical information or correct statistical information. In addition, whether the decoder unit 31 uses estimated statistical information or correct statistical information in step S104 may be distributed probabilistically. In other words, the decoder unit 31 may use estimated statistical information with a probability p and correct statistical information with a probability (1-p). Here, p is a real number that satisfies 0≦p≦1. As an example, p=0.5 may be used. The estimated statistical information is the data obtained by the statistical information decoder unit 61 in step S103. The correct statistical information is the data supplied by the learning data supply unit 81. The control unit 91 appropriately controls which statistical information is passed to the decoder unit 31 in step S104.
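The probabilistic choice described above can be sketched as follows. The function name and the use of Python's random module are assumptions for illustration; the source only states that estimated statistical information is used with probability p and correct answer statistical information with probability (1-p).

```python
import random

def select_statistics(estimated_stats, correct_stats, p, rng=random):
    """Return estimated_stats with probability p, otherwise correct_stats."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must satisfy 0 <= p <= 1")
    # rng.random() is uniform on [0, 1), so the comparison is true with probability p.
    return estimated_stats if rng.random() < p else correct_stats

# Toy usage with p = 0.5 and a seeded generator for reproducibility.
rng = random.Random(0)
picks = [select_statistics("estimated", "correct", 0.5, rng) for _ in range(10000)]
fraction_estimated = picks.count("estimated") / len(picks)
```

With p = 1 the decoder always receives the estimated statistics, and with p = 0 it always receives the correct answer statistics, which matches the two fixed alternatives mentioned in the text.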

次に、ステップＳ１０５において、変換装置１は、２種類のロスを計算する。具体的には、ロス算出部７１は、デコーダー部３１が出力した推定語列と、学習データ供給部８１が与えた正解語列との間のロスを算出する。統計情報ロス算出部７２は、統計情報デコーダー部６１が出力した推定統計情報と、学習データ供給部８１が与えた正解統計情報との間のロス（統計情報ロス）を算出する。 Next, in step S105, the conversion device 1 calculates two types of loss. Specifically, the loss calculation unit 71 calculates the loss between the estimated word string output by the decoder unit 31 and the correct answer word string provided by the training data supply unit 81. The statistical information loss calculation unit 72 calculates the loss (statistical information loss) between the estimated statistical information output by the statistical information decoder unit 61 and the correct answer statistical information provided by the training data supply unit 81.

次に、ステップＳ１０６において、デコーダー部３１は、ステップＳ１０５においてロス算出部７１が算出したロスに基づいて、誤差逆伝播を行う。この誤差逆伝播により、デコーダー部３１は、内部のニューラルネットワークの各ノードにおける演算パラメーターの値を更新する。 Next, in step S106, the decoder unit 31 performs error backpropagation based on the loss calculated by the loss calculation unit 71 in step S105. Through this error backpropagation, the decoder unit 31 updates the values of the calculation parameters at each node of the internal neural network.

次に、ステップＳ１０７において、統計情報デコーダー部６１は、ステップＳ１０５において統計情報ロス算出部７２が算出したロス（統計情報ロス）に基づいて、誤差逆伝播を行う。この誤差逆伝播により、統計情報デコーダー部６１は、内部のニューラルネットワークの各ノードにおける演算パラメーターの値を更新する。 Next, in step S107, the statistical information decoder unit 61 performs error backpropagation based on the loss (statistical information loss) calculated by the statistical information loss calculation unit 72 in step S105. Through this error backpropagation, the statistical information decoder unit 61 updates the values of the calculation parameters at each node of its internal neural network.

次に、ステップＳ１０８において、エンコーダー部２１は、デコーダー部３１や統計情報デコーダー部６１の側から逆伝播してくるデータに基づいて、誤差逆伝播を行う。この誤差逆伝播により、エンコーダー部２１は、内部のニューラルネットワークの各ノードにおける演算パラメーターの値を更新する。具体的には、エンコーダー部２１は、まずデコーダー部３１から逆伝播してくるデータに基づいて、誤差逆伝播による内部パラメーターの更新を行う。エンコーダー部２１は、次に、統計情報デコーダー部６１から逆伝播してくるデータに基づいて、誤差逆伝播による内部パラメーターの更新を行う。なお、この順序を逆にしてもよい。即ち、エンコーダー部２１は、まず統計情報デコーダー部６１からのデータに基づく誤差逆伝播を行い、次にデコーダー部３１からのデータに基づく誤差逆伝播を行うようにしてもよい。 Next, in step S108, the encoder unit 21 performs error backpropagation based on the data backpropagated from the decoder unit 31 and the statistical information decoder unit 61. Through this error backpropagation, the encoder unit 21 updates the values of the calculation parameters at each node of the internal neural network. Specifically, the encoder unit 21 first updates the internal parameters through error backpropagation based on the data backpropagated from the decoder unit 31. The encoder unit 21 then updates the internal parameters through error backpropagation based on the data backpropagated from the statistical information decoder unit 61. Note that this order may be reversed. That is, the encoder unit 21 may first perform error backpropagation based on the data from the statistical information decoder unit 61, and then perform error backpropagation based on the data from the decoder unit 31.

ステップＳ１０８における処理を次のようにしても良い。エンコーダー部２１は、デコーダー部３１から逆伝搬してくるデータに基づいて、誤差逆伝搬による内部パラメーターの更新量を計算する。また、エンコーダー部２１は、統計情報デコーダー部６１から逆伝搬してくるデータに基づいて、誤差逆伝搬による内部パラメーターの更新量を計算する。エンコーダー部２１は、これら２つの更新量を合計し、その更新量を内部パラメーターの更新を行う。 The processing in step S108 may be performed as follows. The encoder unit 21 calculates the amount of update of the internal parameters due to error backpropagation based on the data backpropagated from the decoder unit 31. The encoder unit 21 also calculates the amount of update of the internal parameters due to error backpropagation based on the data backpropagated from the statistical information decoder unit 61. The encoder unit 21 adds up these two update amounts and uses this update amount to update the internal parameters.
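The summed-update variant of step S108 can be illustrated with a small sketch. It assumes plain gradient descent with a fixed learning rate and treats the two backpropagated signals as ready-made gradient lists; the optimizer, the function names, and the numeric values are all assumptions, since the source does not specify them.

```python
def sgd_update_amounts(gradients, learning_rate):
    """Update amount for each parameter under plain gradient descent (an assumption)."""
    return [-learning_rate * g for g in gradients]

def apply_summed_updates(params, grads_from_decoder, grads_from_stats_decoder, learning_rate):
    """Compute both update amounts, add them, and apply the total once."""
    from_decoder = sgd_update_amounts(grads_from_decoder, learning_rate)
    from_stats = sgd_update_amounts(grads_from_stats_decoder, learning_rate)
    return [w + a + b for w, a, b in zip(params, from_decoder, from_stats)]

def apply_sequential_updates(params, grads_first, grads_second, learning_rate):
    """Apply the update from one side first, then the update from the other side."""
    after_first = [w + u for w, u in zip(params, sgd_update_amounts(grads_first, learning_rate))]
    return [w + u for w, u in zip(after_first, sgd_update_amounts(grads_second, learning_rate))]

params = [1.0, 2.0]
grads_decoder = [0.5, -1.0]   # toy gradients arriving from the decoder side
grads_stats = [0.25, 0.5]     # toy gradients arriving from the statistical information decoder side
summed = apply_summed_updates(params, grads_decoder, grads_stats, learning_rate=0.5)
sequential = apply_sequential_updates(params, grads_decoder, grads_stats, learning_rate=0.5)
print(summed)  # [0.625, 2.25]
```

Under these assumptions, and when both gradients come from the same forward pass, the summed variant and the sequential variant give the same parameters in either order. With optimizers that keep internal state this equivalence would not necessarily hold.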

ステップＳ１０９において、制御部９１は、学習データセット内のすべての学習用データを用いた処理を完了したか否かを判定する。全ての学習用データを処理済みである場合（ステップＳ１０９：ＹＥＳ）には、次のステップＳ１１０に進む。まだ学習データ用が残っている場合（ステップＳ１０９：ＮＯ）には、次のデータを処理するためにステップＳ１０１に戻る。 In step S109, the control unit 91 determines whether or not processing using all the learning data in the learning dataset has been completed. If all the learning data has been processed (step S109: YES), the process proceeds to the next step S110. If there is still learning data remaining (step S109: NO), the process returns to step S101 to process the next data.

ステップＳ１１０に進んだ場合には、制御部９１は、現在の学習データセットを用いた学習処理の所定回数の繰り返しが完了したか否かを判定する。なお、この回数は、例えば、予め定めておくものとする。所定回数の処理が完了した場合（ステップＳ１１０：ＹＥＳ）には、本フローチャート全体の処理を終了する。所定回数の処理が完了していない場合（ステップＳ１１０：ＮＯ）には、次の回の処理を行うためにステップＳ１０１に戻る。なお、本ステップにおいて、予め定めておいた回数に基づいて全体の処理を終了するか否かの判断を行う代わりに、他の判断基準に基づいた判断を行うようにしてもよい。一例として、更新対象であるニューラルネットワークのパラメーター集合の値の収束状況（十分に収束しているか否か）に基づいて、全体の処理を終了するか否かの判断を行うようにしてもよい。 When the process proceeds to step S110, the control unit 91 determines whether or not the learning process using the current learning data set has been repeated a predetermined number of times. This number of times is, for example, predetermined. If the process has been completed a predetermined number of times (step S110: YES), the entire process of this flowchart is terminated. If the process has not been completed a predetermined number of times (step S110: NO), the process returns to step S101 to perform the next process. In this step, instead of determining whether or not to terminate the entire process based on a predetermined number of times, a determination may be made based on other criteria. As an example, the determination of whether or not to terminate the entire process may be made based on the convergence status (whether or not it has converged sufficiently) of the values of the parameter set of the neural network to be updated.
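The loop structure of steps S109 and S110 can be sketched as below. The callable train_step is a hypothetical stand-in for steps S102 to S108 on one example, and using the largest parameter change in a pass as the convergence measure is only one possible criterion chosen for illustration.

```python
def run_training(dataset, train_step, max_epochs, tolerance=None):
    """Repeat passes over the data set until max_epochs or until convergence.

    train_step(example) is a hypothetical callable that processes one example
    and returns the size of the parameter change it caused.
    """
    epochs_done = 0
    for _ in range(max_epochs):              # S110: predetermined number of repetitions
        largest_change = 0.0
        for example in dataset:              # S109: continue until every example is used
            largest_change = max(largest_change, train_step(example))
        epochs_done += 1
        # Alternative stop criterion: the parameters have converged sufficiently.
        if tolerance is not None and largest_change < tolerance:
            break
    return epochs_done

# Toy usage: each step halves a fake "parameter change".
state = {"change": 1.0}

def toy_train_step(example):
    state["change"] /= 2
    return state["change"]

epochs_fixed = run_training(["ex1", "ex2"], toy_train_step, max_epochs=5)
state["change"] = 1.0
epochs_converged = run_training(["ex1", "ex2"], toy_train_step, max_epochs=5, tolerance=0.1)
```

In the toy run, the fixed-count variant performs all 5 passes, while the convergence variant stops after the third pass, when the largest change in that pass first falls below the tolerance.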

以上の処理の手順により、エンコーダー部２１と、統計情報デコーダー部６１と、デコーダー部３１との学習が進む。学習により、エンコーダー部２１と、統計情報デコーダー部６１と、デコーダー部３１とのそれぞれの内部のパラメーターが調整される。これにより、変換装置１は、より良い精度で、入力フレーム画像系列に対応する出力データ（具体例としては、記号の列。さらに具体的な例としては、手話に対応するグロス表記の単語列。）を生成するようになる。 Through the above processing steps, learning progresses in the encoder unit 21, the statistical information decoder unit 61, and the decoder unit 31, and the internal parameters of each are adjusted. As a result, the conversion device 1 generates, with better accuracy, output data corresponding to the input frame image sequence (for example, a symbol string; more specifically, a word string in gloss notation corresponding to sign language).

図８は、変換装置１における語列推定処理（推定処理モードの動作）の手順を示すフローチャートである。以下、このフローチャートに沿って説明する。 Figure 8 is a flowchart showing the steps of the word string estimation process (operation in the estimation processing mode) in the conversion device 1. The following explanation follows this flowchart.

ステップＳ２０１において、入力部１０は、外部から入力フレーム画像系列を取得する。この入力フレーム画像系列は、例えば、人の上半身を映した手話の映像である。入力部１０は、この入力フレーム画像系列を、エンコーダー部２１に渡す。 In step S201, the input unit 10 acquires an input frame image series from the outside. This input frame image series is, for example, a sign language video showing a person's upper body. The input unit 10 passes this input frame image series to the encoder unit 21.

ステップＳ２０２において、エンコーダー部２１は、ステップＳ２０１で渡された入力フレーム画像系列を基に、順伝播を行う。即ち、エンコーダー部２１は、エンコーディング処理を行う。エンコーダー部２１は、順伝播の結果として、エンコーダー出力行列を出力する。エンコーダー部２１は、このエンコーダー出力行列を、統計情報デコーダー部６１とデコーダー部３１とにそれぞれ渡す。 In step S202, the encoder unit 21 performs forward propagation based on the input frame image sequence passed in step S201. That is, the encoder unit 21 performs encoding processing. The encoder unit 21 outputs an encoder output matrix as a result of the forward propagation. The encoder unit 21 passes this encoder output matrix to the statistical information decoder unit 61 and the decoder unit 31, respectively.

ステップＳ２０３において、統計情報デコーダー部６１は、エンコーダー部２１から渡されたエンコーダー出力行列を基に、順伝搬を行う。この順伝搬の結果として、統計情報デコーダー部６１は統計情報の推定値である推定統計情報を出力する。統計情報デコーダー部６１は、この推定統計情報を、デコーダー部３１に渡す。 In step S203, the statistical information decoder unit 61 performs forward propagation based on the encoder output matrix passed from the encoder unit 21. As a result of this forward propagation, the statistical information decoder unit 61 outputs estimated statistical information, which is an estimated value of the statistical information. The statistical information decoder unit 61 passes this estimated statistical information to the decoder unit 31.

ステップＳ２０４において、デコーダー部３１は、エンコーダー部２１から渡されたエンコーダー出力行列と、統計情報デコーダー部６１から渡された統計情報とを基に、順伝搬を行う。この順伝搬の結果として、デコーダー部３１は入力フレーム画像系列に対応する推定語列を推定する。デコーダー部３１は、求めた推定語列のデータを、出力部４０に渡す。 In step S204, the decoder unit 31 performs forward propagation based on the encoder output matrix passed from the encoder unit 21 and the statistical information passed from the statistical information decoder unit 61. As a result of this forward propagation, the decoder unit 31 obtains an estimated word sequence corresponding to the input frame image sequence. The decoder unit 31 passes the data of the estimated word sequence to the output unit 40.

ステップＳ２０５において、出力部４０は、デコーダー部３１から渡された推定語列のデータを、外部に出力する。 In step S205, the output unit 40 outputs the data of the estimated word sequence passed from the decoder unit 31 to the outside.

以上説明した処理により、変換装置１は、入力されるフレーム画像系列に対応する推定語列を出力することができる。フレーム画像系列が例えば手話の映像である場合、変換装置１は、その手話の翻訳結果である語列を出力することができる。 By the process described above, the conversion device 1 can output a sequence of estimated words corresponding to the input sequence of frame images. If the sequence of frame images is, for example, a sign language video, the conversion device 1 can output a sequence of words that is the translation result of that sign language.
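The data flow of steps S201 to S205 can be summarized in a short sketch. The three callables are toy stand-ins for the trained encoder unit 21, statistical information decoder unit 61, and decoder unit 31; their internals here are invented purely so that the example runs and do not reflect the actual neural networks.

```python
def convert(frames, encoder, statistics_decoder, decoder):
    """Estimation processing mode: frames -> encoder output -> statistics -> word sequence."""
    encoder_output = encoder(frames)                        # S202
    estimated_stats = statistics_decoder(encoder_output)    # S203
    return decoder(encoder_output, estimated_stats)         # S204

def toy_encoder(frames):
    # Stand-in for the encoder unit 21: one number per frame.
    return [float(len(frame)) for frame in frames]

def toy_statistics_decoder(encoder_output):
    # Stand-in for unit 61: pretends each of two symbols occurs once.
    return {"HELLO": 1, "WORLD": 1}

def toy_decoder(encoder_output, stats):
    # Stand-in for unit 31: emits each symbol as many times as the statistics say.
    return [symbol for symbol, count in stats.items() for _ in range(count)]

estimated_words = convert(["f0", "f1"], toy_encoder, toy_statistics_decoder, toy_decoder)
print(estimated_words)  # ['HELLO', 'WORLD']
```

The point of the sketch is the wiring: the encoder output is computed once and passed to both downstream units, and the decoder receives the estimated statistics as a second input.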

本実施形態では、推定処理モードにおいて、エンコーダー部２１が算出したエンコーダー出力行列（入力データの特徴を表すデータ）に基づいて、予め機械学習済みの統計情報デコーダー部６１が統計情報を算出する。統計情報は、入力データに対応する出力データ（出力記号列）の統計的特徴を表す情報である。そして、デコーダー部３１は、この統計情報と、上記のエンコーダー出力行列とに基づいて、出力記号列を推定する。このような構成では、統計情報デコーダー部６１が適切な学習を完了している場合、統計情報デコーダー部６１が出力する統計情報は、デコーダー部３１が出力記号列を推定する際の良好な制約となる。これにより、変換装置１は、入力データに対応して精度の良い出力記号列を出力することができるようになる。 In this embodiment, in the estimation processing mode, the statistical information decoder unit 61, which has been trained by machine learning in advance, calculates statistical information based on the encoder output matrix (data representing the features of the input data) calculated by the encoder unit 21. The statistical information is information representing the statistical characteristics of the output data (output symbol string) corresponding to the input data. The decoder unit 31 then estimates the output symbol string based on this statistical information and the above-mentioned encoder output matrix. In this configuration, if the statistical information decoder unit 61 has been trained appropriately, the statistical information it outputs serves as a good constraint when the decoder unit 31 estimates the output symbol string. This enables the conversion device 1 to output an accurate output symbol string for the input data.

なお、入力フレーム画像系列に含まれるフレーム画像の枚数は任意である。一例として、フレーム画像の枚数は２００枚以上且つ３００枚以下程度であることが望ましい。これは、３０フレーム毎秒のレートにおいては、６．６７秒以上且つ１０．００秒以下の映像に相当する。この長さは、例えば、手話における一文の長さとして妥当である。より長い映像を扱う場合には、適宜、例えば手話の文ごとに入力フレーム画像系列を区切ってもよい。この場合、統計情報としては、区切りごとにリセットした記号出現回数をカウントしたものを用いてもよい。 The number of frame images included in the input frame image sequence is arbitrary. As an example, it is desirable for the number of frame images to be roughly between 200 and 300. At a rate of 30 frames per second, this corresponds to a video of between 6.67 and 10.00 seconds. This length is reasonable as, for example, the length of one sentence in sign language. When handling longer videos, the input frame image sequence may be divided as appropriate, for example at each sign language sentence. In this case, the statistical information may be symbol occurrence counts that are reset at each division.
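The arithmetic above, and the idea of resetting the counts at each division, can be checked with a short sketch. The function names and the toy segments are assumptions for illustration only.

```python
from collections import Counter

def duration_seconds(num_frames, fps=30):
    """Length of a video in seconds for a given number of frames and frame rate."""
    return num_frames / fps

def per_segment_counts(segments):
    """Count symbol occurrences separately for each segment, so counts reset per segment."""
    return [Counter(segment) for segment in segments]

print(round(duration_seconds(200), 2), round(duration_seconds(300), 2))  # 6.67 10.0

# Two hypothetical sentences from one longer video, each given as its word sequence.
segment_counts = per_segment_counts([["A", "B", "A"], ["A"]])
```

Here the symbol "A" is counted twice in the first segment and once in the second, rather than three times overall.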

以上、実施形態を説明したが、本発明はさらに次のような変形例でも実施することが可能である。複数の変形例を、組み合わせることが可能な限りにおいて、組み合わせて実施してもよい。 Although the embodiment has been described above, the present invention can also be implemented in the following modified examples. Multiple modified examples may be implemented in combination to the extent that they can be combined.

［第１変形例］
上記実施形態では、エンコーダー部２１や、デコーダー部３１や、統計情報デコーダー部６１は、ニューラルネットワークの構造として、トランスフォーマーを用いた。変形例として、エンコーダー部２１や、デコーダー部３１や、統計情報デコーダー部６１が、他の構造を用いてもよい。例えば、エンコーダー部２１や、デコーダー部３１や、統計情報デコーダー部６１が、トランスフォーマーの代わりにＲＮＮ（ＲｅｃｕｒｒｅｎｔＮｅｕｒａｌＮｅｔｗｏｒｋ）等を用いても良い。 [First Modification]
In the above embodiment, the encoder unit 21, the decoder unit 31, and the statistical information decoder unit 61 use a transformer as the neural network structure. As a modification, the encoder unit 21, the decoder unit 31, and the statistical information decoder unit 61 may use other structures. For example, the encoder unit 21, the decoder unit 31, and the statistical information decoder unit 61 may use a recurrent neural network (RNN) or the like instead of a transformer.

［第２変形例］
上記実施形態では、統計情報として、記号の出現回数を用いた。変形例として、記号の出現頻度を統計情報として用いてもよい。 [Second Modification]
In the above embodiment, the number of occurrences of a symbol is used as the statistical information. As a modified example, the frequency of occurrence of a symbol may be used as the statistical information.

［第３変形例］
上記実施形態では、統計情報として、記号の出現回数の情報のみを用いた。変形例として、出現回数の情報に加えて、出力記号列の長さを、統計情報に含めるようにしてよい。この場合、学習データ供給部８１が供給する正解統計情報が、正解語列の長さの情報を含むようにする。 [Third Modification]
In the above embodiment, only the information on the number of occurrences of symbols is used as the statistical information. As a modified example, in addition to the information on the number of occurrences, the length of the output symbol string may be included in the statistical information. In this case, the correct answer statistical information provided by the training data providing unit 81 includes information on the length of the correct answer word string.
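The second and third modifications can be sketched together as follows. Interpreting "frequency of occurrence" as the count divided by the length of the word string is an assumption, as are the function name and the dictionary layout of the result.

```python
from collections import Counter

def statistics_with_frequency_and_length(words, vocab):
    """Return per-symbol relative frequencies plus the length of the word string."""
    counts = Counter(words)
    length = len(words)
    # Guard against an empty word string to avoid division by zero.
    frequencies = [counts[symbol] / length if length else 0.0 for symbol in vocab]
    return {"frequencies": frequencies, "length": length}

stats = statistics_with_frequency_and_length(["A", "B", "A", "A"], vocab=["A", "B", "C"])
print(stats)  # {'frequencies': [0.75, 0.25, 0.0], 'length': 4}
```

Keeping the length next to the frequencies means the original counts can still be recovered by multiplication, so no information is lost relative to the count-based statistics.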

［第４変形例］
上記実施形態では、統計情報デコーダー部６１の中の全結合層６００３（図５参照）が、１×Ｖの行列を入力し、Ｖ×Ｎの行列に変換していた。また、統計情報デコーダー部６１は、このＶ×Ｎの行列（統計情報）を、デコーダー部３１に渡していた。そして、デコーダー部３１の中の全結合層３００３（図６参照）が、Ｖ×Ｎの行列を入力し、１×Ｖの行列に変換していた。全結合層３００３は、この１×Ｎの行列を、トランスフォーマー３００２に渡していた。 [Fourth Modification]
In the above embodiment, the fully connected layer 6003 (see FIG. 5) in the statistical information decoder unit 61 inputs a 1×V matrix and converts it into a V×N matrix. The statistical information decoder unit 61 passes this V×N matrix (statistical information) to the decoder unit 31. The fully connected layer 3003 (see FIG. 6) in the decoder unit 31 inputs a V×N matrix and converts it into a 1×V matrix. The fully connected layer 3003 passes this 1×N matrix to the transformer 3002.

本変形例では、上記のような処理手順によって１×Ｎの行列をトランスフォーマー３００２に渡す代わりに、次のような処理手順を実行する。即ち、トランスフォーマー６００２が出力する１×Ｖの行列（全結合層６００３による変換前の統計情報）を保存しておく。そして、この１×Ｖの行列を、デコーダー部３１の中のトランスフォーマー３００２への入力とする。本変形例では、デコーダー部３１が持つ全結合層３００３を省略することが可能となる。 In this modified example, instead of passing a 1×N matrix to the transformer 3002 using the above-mentioned processing procedure, the following processing procedure is executed. That is, the 1×V matrix output by the transformer 6002 (the statistical information before conversion by the fully connected layer 6003) is saved. Then, this 1×V matrix is used as the input to the transformer 3002 in the decoder unit 31. In this modified example, it is possible to omit the fully connected layer 3003 of the decoder unit 31.

［第５変形例］
上記実施形態において、統計情報を、Ｖ×Ｎのサイズを持つ行列とした（ただし、Ｎ≧２）。変形例として、統計情報を、Ｖ×１のサイズを持つ行列としてよい。この行列が、Ｖ種類の記号（特殊記号を含む）のそれぞれが、出力記号列内に存在する確率を表すようにする。推定統計情報において、Ｖ×１の行列の要素は、各記号が出力記号列内に含まれる確率を表す。また、正解統計情報において、Ｖ×１の行列の要素は、１または０の値をとる。値が１の場合には、正解語列の中にその記号が含まれる。値が０の場合には、正解語列の中にその記号が含まれない。このような統計情報を用いる点以外は、上記実施系遺体と同様にしてよい。 [Fifth Modification]
In the above embodiment, the statistical information is a matrix having a size of V×N (where N≧2). As a modified example, the statistical information may be a matrix having a size of V×1. This matrix represents the probability that each of the V types of symbols (including special symbols) is present in the output symbol string. In the estimated statistical information, the elements of the V×1 matrix represent the probability that each symbol is included in the output symbol string. In the correct answer statistical information, the elements of the V×1 matrix take a value of 1 or 0. If the value is 1, the symbol is included in the correct answer word string. If the value is 0, the symbol is not included in the correct answer word string. Except for the use of such statistical information, this modification may be the same as the above embodiment.
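A minimal sketch of the V×1 presence representation follows. The correct answer vector is built directly from the description above; the use of binary cross-entropy as the statistical information loss is an assumption, since the source does not name a loss function for this modification.

```python
import math

def presence_vector(correct_words, vocab):
    """Correct answer statistics: 1 if the symbol occurs in the word string, else 0."""
    present = set(correct_words)
    return [1 if symbol in present else 0 for symbol in vocab]

def binary_cross_entropy(estimated_probs, targets, eps=1e-12):
    """Mean binary cross-entropy between estimated probabilities and 0/1 targets (assumed loss)."""
    total = 0.0
    for p, t in zip(estimated_probs, targets):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)

target = presence_vector(["A", "A", "C"], vocab=["A", "B", "C"])
print(target)  # [1, 0, 1]
loss_uninformed = binary_cross_entropy([0.5, 0.5, 0.5], target)
loss_perfect = binary_cross_entropy([1.0, 0.0, 1.0], target)
```

An estimate of 0.5 for every symbol gives a loss of about 0.693 (the natural logarithm of 2), while an estimate that matches the target gives a loss close to zero.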

［第６変形例］
上記実施形態において、統計情報は、単語（記号）の出現回数であった。変形例において、その代わりに、単語（記号）の連鎖の出現回数（ないしは頻度）についての統計情報を用いるようにしてもよい。例えば、いわゆるｎグラム（ｎ－ｇｒａｍ；ｎ個の単語の連鎖；ｎ≧２）の出現回数の情報を統計情報とする。学習データ供給部８１は、正解語列に含まれるｎグラムの出現回数を数え上げ、正解統計情報を生成する。学習データ供給部８１は、その正解統計情報を、統計情報ロス算出部７２に提供する。統計情報ロス算出部７２は、学習処理モードにおいて、上記の正解統計情報と、統計情報デコーダー部６１が求めた推定統計情報との間のロス（統計情報ロス）を算出する。統計情報デコーダー部６１およびエンコーダー部２１は、このロスに基づいて、誤差逆伝播法により、内部のパラメーターを更新する。また、推定処理モードにおいて、統計情報デコーダー部６１は、推定結果である推定統計情報を、デコーダー部３１に渡す。デコーダー部３１は、この推定統計情報に基づいて、出力記号列の推定処理を行う。なお、本変形例において、統計情報のニューラルネットワーク上での表現方法は、設計として適宜定めるようにする。 [Sixth Modification]
In the above embodiment, the statistical information is the number of occurrences of words (symbols). In a modified example, instead, statistical information on the number of occurrences (or frequency) of a chain of words (symbols) may be used. For example, information on the number of occurrences of so-called n-grams (n-grams; a chain of n words; n≧2) is used as the statistical information. The learning data supplying unit 81 counts the number of occurrences of n-grams included in the correct word string to generate correct statistical information. The learning data supplying unit 81 provides the correct statistical information to the statistical information loss calculating unit 72. In the learning processing mode, the statistical information loss calculating unit 72 calculates a loss (statistical information loss) between the correct statistical information and the estimated statistical information obtained by the statistical information decoder unit 61. The statistical information decoder unit 61 and the encoder unit 21 update their internal parameters by the error backpropagation method based on this loss. In the estimation processing mode, the statistical information decoder unit 61 passes the estimated statistical information, which is the estimation result, to the decoder unit 31. The decoder unit 31 performs an estimation process for an output symbol string based on this estimated statistical information. In this modification, the method for expressing the statistical information on a neural network is determined appropriately as part of the design.
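Counting n-grams in a correct answer word string, as the learning data supply unit 81 does in this modification, can be sketched as follows. The function name and the tuple-keyed Counter are assumptions; as the text notes, how these counts are represented inside the neural network is left to the design.

```python
from collections import Counter

def ngram_counts(words, n=2):
    """Count every chain of n consecutive words in the word string (n >= 2)."""
    if n < 2:
        raise ValueError("this modification assumes n >= 2")
    # A word string shorter than n yields an empty range and therefore no n-grams.
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

bigrams = ngram_counts(["A", "B", "A", "B"], n=2)
print(bigrams[("A", "B")], bigrams[("B", "A")])  # 2 1
```

Unlike single-symbol counts, these statistics also carry information about word order, because ("A", "B") and ("B", "A") are counted separately.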

［第７変形例］
上記の実施形態において、エンコーダー部２１や、デコーダー部３１や、統計情報デコーダー部６１のそれぞれは、学習処理モードと推定処理モードのどちらの動作モードでも動作するように構成されていた。本変形例では、エンコーダー部２１や、デコーダー部３１や、統計情報デコーダー部６１のそれぞれは、学習処理モードでのみ動作するようにする。この場合、実施形態において「変換装置１」として説明した装置は、モデルの学習を行うための「学習装置」として機能する。このような学習装置が動作することにより、モデルの学習を行える。学習済みのモデル（学習済みのパラメーター値のデータを含む）は、当該装置を、あるいはモデルの移植先の他の装置（コンピューター等）を、変換装置として稼働させることができる。 [Seventh Modification]
In the above embodiment, the encoder unit 21, the decoder unit 31, and the statistical information decoder unit 61 are configured to operate in both the learning processing mode and the estimation processing mode. In this modified example, the encoder unit 21, the decoder unit 31, and the statistical information decoder unit 61 are configured to operate only in the learning processing mode. In this case, the device described as the "conversion device 1" in the embodiment functions as a "learning device" for training the model. The model can be trained by operating such a learning device. The trained model (including the data of the learned parameter values) allows this device, or another device (such as a computer) to which the model is ported, to operate as a conversion device.

［第８変形例］
上記の実施形態において、エンコーダー部２１や、デコーダー部３１や、統計情報デコーダー部６１のそれぞれは、学習処理モードと推定処理モードのどちらの動作モードでも動作するように構成されていた。本変形例では、エンコーダー部２１や、デコーダー部３１や、統計情報デコーダー部６１のそれぞれは、推定処理モードでのみ動作するようにする。この場合、エンコーダー部２１や、デコーダー部３１や、統計情報デコーダー部６１のモデルの学習は予め済ませておいたものとする。つまり、エンコーダー部２１の内部パラメーターは機械学習処理によって予め調整済みである。また、統計情報デコーダー部６１の内部パラメーターは機械学習処理によって予め調整済みである。また、デコーダー部３１の内部パラメーターも機械学習処理によって予め調整済みである。例えば、他の装置（コンピューター等）から学習済みのモデル（学習済みのパラメーター値のデータを含む）を移植してもよい。この場合の変換装置１もまた、良い精度で入力フレーム画像系列から出力記号列への変換を行う。 [Eighth Modification]
In the above embodiment, the encoder unit 21, the decoder unit 31, and the statistical information decoder unit 61 are configured to operate in both the learning processing mode and the estimation processing mode. In this modification, the encoder unit 21, the decoder unit 31, and the statistical information decoder unit 61 are configured to operate only in the estimation processing mode. In this case, it is assumed that the models of the encoder unit 21, the decoder unit 31, and the statistical information decoder unit 61 have been previously learned. That is, the internal parameters of the encoder unit 21 have been previously adjusted by machine learning processing. The internal parameters of the statistical information decoder unit 61 have also been previously adjusted by machine learning processing. The internal parameters of the decoder unit 31 have also been previously adjusted by machine learning processing. For example, a learned model (including data of learned parameter values) may be transferred from another device (such as a computer). In this case, the conversion device 1 also converts the input frame image sequence into an output symbol string with high accuracy.
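Porting a trained model between a learning device and a conversion device, as in the seventh and eighth modifications, amounts to moving the learned parameter values. The sketch below assumes the parameters are held as a dictionary of lists of numbers and uses JSON as the exchange format; both choices, and all names and values, are hypothetical.

```python
import json

def export_parameters(parameters):
    """Serialize learned parameter values (name -> list of numbers) to a JSON string."""
    return json.dumps(parameters, sort_keys=True)

def import_parameters(serialized):
    """Restore parameter values on the device that will run the estimation processing mode."""
    return json.loads(serialized)

# Toy parameter values standing in for the three trained units.
learned = {
    "encoder_21": [0.1, -0.2],
    "statistics_decoder_61": [0.3],
    "decoder_31": [0.5, 0.25],
}
restored = import_parameters(export_parameters(learned))
```

After the round trip, the restored dictionary holds the same values as the original, so a device that only implements the estimation processing mode can use them without retraining.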

図９は、上記の実施形態やその変形例における変換装置１（第７変形例における「学習装置」を含む）の内部構成の例を示すブロック図である。変換装置１は、コンピューターを用いて実現され得る。図示するように、そのコンピューターは、中央処理装置９０１と、ＲＡＭ９０２と、入出力ポート９０３と、入出力デバイス９０４や９０５等と、バス９０６と、を含んで構成される。コンピューター自体は、既存技術を用いて実現可能である。中央処理装置９０１は、ＲＡＭ９０２等から読み込んだプログラムに含まれる命令を実行する。中央処理装置９０１は、各命令にしたがって、ＲＡＭ９０２にデータを書き込んだり、ＲＡＭ９０２からデータを読み出したり、算術演算や論理演算を行ったりする。ＲＡＭ９０２は、データやプログラムを記憶する。ＲＡＭ９０２に含まれる各要素は、アドレスを持ち、アドレスを用いてアクセスされ得るものである。なお、ＲＡＭは、「ランダムアクセスメモリー」の略である。入出力ポート９０３は、中央処理装置９０１が外部の入出力デバイス等とデータのやり取りを行うためのポートである。入出力デバイス９０４や９０５は、入出力デバイスである。入出力デバイス９０４や９０５は、入出力ポート９０３を介して中央処理装置９０１との間でデータをやりとりする。バス９０６は、コンピューター内部で使用される共通の通信路である。例えば、中央処理装置９０１は、バス９０６を介してＲＡＭ９０２のデータを読んだり書いたりする。また、例えば、中央処理装置９０１は、バス９０６を介して入出力ポートにアクセスする。 Figure 9 is a block diagram showing an example of the internal configuration of the conversion device 1 (including the "learning device" in the seventh modified example) in the above embodiment and its modified examples. The conversion device 1 can be realized using a computer. As shown in the figure, the computer is configured to include a central processing unit 901, a RAM 902, an input/output port 903, input/output devices 904 and 905, etc., and a bus 906. The computer itself can be realized using existing technology. The central processing unit 901 executes instructions included in a program read from the RAM 902, etc. According to each instruction, the central processing unit 901 writes data to the RAM 902, reads data from the RAM 902, and performs arithmetic operations and logical operations. The RAM 902 stores data and programs. Each element included in the RAM 902 has an address and can be accessed using the address. Note that RAM is an abbreviation for "random access memory." The input/output port 903 is a port through which the central processing unit 901 exchanges data with external input/output devices, etc. Input/output devices 904 and 905 are input/output devices. Input/output devices 904 and 905 exchange data with central processing unit 901 via input/output port 903. 
Bus 906 is a common communication path used inside the computer. For example, central processing unit 901 reads and writes data in RAM 902 via bus 906. Also, for example, central processing unit 901 accesses the input/output port via bus 906.

なお、上述した実施形態における変換装置１（第７変形例における「学習装置」を含む）の少なくとも一部の機能をコンピューターで実現することができる。その場合、この機能を実現するためのプログラムをコンピューター読み取り可能な記録媒体に記録して、この記録媒体に記録されたプログラムをコンピューターシステムに読み込ませ、実行することによって実現しても良い。なお、ここでいう「コンピューターシステム」とは、ＯＳや周辺機器等のハードウェアを含むものとする。また、「コンピューター読み取り可能な記録媒体」とは、フレキシブルディスク、光磁気ディスク、ＲＯＭ、ＣＤ－ＲＯＭ、ＤＶＤ－ＲＯＭ、ＵＳＢメモリー等の可搬媒体、コンピューターシステムに内蔵されるハードディスク等の記憶装置のことをいう。さらに「コンピューター読み取り可能な記録媒体」とは、インターネット等のネットワークや電話回線等の通信回線を介してプログラムを送信する場合の通信線のように、一時的に、動的にプログラムを保持するもの、その場合のサーバーやクライアントとなるコンピューターシステム内部の揮発性メモリーのように、一定時間プログラムを保持しているものも含んでも良い。また上記プログラムは、前述した機能の一部を実現するためのものであっても良く、さらに前述した機能をコンピューターシステムにすでに記録されているプログラムとの組み合わせで実現できるものであっても良い。 At least some of the functions of the conversion device 1 (including the "learning device" in the seventh modified example) in the above-mentioned embodiment can be realized by a computer. In that case, a program for realizing this function may be recorded in a computer-readable recording medium, and the program recorded in this recording medium may be read into a computer system and executed to realize the function. Note that the "computer system" here includes hardware such as an OS and peripheral devices. Furthermore, the "computer-readable recording medium" refers to portable media such as flexible disks, optical magnetic disks, ROMs, CD-ROMs, DVD-ROMs, and USB memories, and storage devices such as hard disks built into computer systems. Furthermore, the "computer-readable recording medium" may also include those that temporarily and dynamically hold a program, such as a communication line when a program is transmitted via a network such as the Internet or a communication line such as a telephone line, and those that hold a program for a certain period of time, such as a volatile memory inside a computer system that is a server or client in such a case. 
Furthermore, the above-mentioned program may be for realizing some of the above-mentioned functions, and may further be capable of realizing the above-mentioned functions in combination with a program already recorded in the computer system.

以上、この発明の実施形態について図面を参照して詳述してきたが、具体的な構成はこの実施形態に限られるものではなく、この発明の要旨を逸脱しない範囲の設計等も含まれる。 The above describes an embodiment of the present invention in detail with reference to the drawings, but the specific configuration is not limited to this embodiment, and includes designs that do not deviate from the gist of the present invention.

［評価実験］
上で説明した実施形態の変換装置１を用いて、１００エポック（epoch）の機械学習処理を実施した。その結果、統計情報を用いて推定語列を求める変換装置１のほうが、統計情報を用いずに推定語列を求める従来技術の場合よりも、ＢＬＥＵ値において約５％の変換精度の改善を確認できた。 [Evaluation experiment]
Machine learning processing of 100 epochs was carried out using the conversion device 1 of the embodiment described above. As a result, the conversion device 1, which obtains the estimated word string using statistical information, was confirmed to improve conversion accuracy by about 5% in BLEU score compared with the conventional technology, which obtains the estimated word string without using statistical information.

本発明は、例えば、映像を基に記号列を生成するあらゆる適用領域（一例として、映像理解等）に利用することができる。特に手話映像を対象とした処理を行う場合には、聴覚障害者と健聴者のコミュニケーションに利用したり、手話学習者の教育に利用したり、コンテンツ配信の事業に利用したり、することができる。但し、本発明の利用範囲はここに例示したものには限られない。 The present invention can be used, for example, in any application area where a symbol string is generated based on video (one example is video understanding). In particular, when processing sign language video, the present invention can be used for communication between the hearing impaired and hearing people, for the education of sign language learners, and for content distribution businesses. However, the scope of use of the present invention is not limited to the examples given here.

１変換装置
１０入力部
２１エンコーダー部
３１デコーダー部
４０出力部
６１統計情報デコーダー部
７１ロス算出部
７２統計情報ロス算出部
８１学習データ供給部
９１制御部
９０１中央処理装置
９０２ＲＡＭ
９０３入出力ポート
９０４，９０５入出力デバイス
９０６バス
２００１ニューラルネットワーク
２００２トランスフォーマー
３００１ニューラルネットワーク
３００２トランスフォーマー
３００３全結合層
６００１ニューラルネットワーク
６００２トランスフォーマー
６００３全結合層
REFERENCE SIGNS LIST
1 Conversion device
10 Input unit
21 Encoder unit
31 Decoder unit
40 Output unit
61 Statistical information decoder unit
71 Loss calculation unit
72 Statistical information loss calculation unit
81 Learning data supply unit
91 Control unit
901 Central processing unit
902 RAM
903 Input/output port
904, 905 Input/output device
906 Bus
2001 Neural network
2002 Transformer
3001 Neural network
3002 Transformer
3003 Fully connected layer
6001 Neural network
6002 Transformer
6003 Fully connected layer

Claims

A conversion device comprising:
an encoder unit that generates state data based on an input image sequence;
a statistical information decoder unit that generates, based on the state data, statistical information, which is information on statistics about a symbol string;
a decoder unit that generates the symbol string based on the state data and the statistical information;
a learning data supply unit that supplies a set of a learning image sequence serving as the basis of an input to the encoder unit, a correct symbol string that is the correct answer for the symbol string corresponding to the learning image sequence, and correct statistical information that is the correct answer for the statistical information corresponding to the symbol string;
a loss calculation unit that calculates a loss representing the difference between a learning estimated symbol string, which is the symbol string generated by the decoder unit based on the state data generated by the encoder unit from the learning image sequence, and the correct symbol string supplied by the learning data supply unit for the learning image sequence;
a statistical information loss calculation unit that calculates a statistical information loss representing the difference between learning estimated statistical information, which is the statistical information generated by the statistical information decoder unit based on the state data generated by the encoder unit from the learning image sequence, and the correct statistical information supplied by the learning data supply unit for the learning image sequence; and
a control unit that performs control so that a learning processing mode and an estimation processing mode are switched between and executed as appropriate,
wherein, in the learning processing mode, the decoder unit generates the symbol string based on either estimated statistical information, which is the statistical information generated by the statistical information decoder unit, or the correct statistical information supplied by the learning data supply unit,
wherein, in the learning processing mode, an internal parameter of the decoder unit and an internal parameter of the encoder unit are adjusted based on the loss calculated by the loss calculation unit, and an internal parameter of the statistical information decoder unit and an internal parameter of the encoder unit are adjusted based on the statistical information loss calculated by the statistical information loss calculation unit, and
wherein, in the estimation processing mode, the encoder unit generates state data based on an image sequence to be estimated, the statistical information decoder unit generates the statistical information based on the state data generated by the encoder unit, and the decoder unit generates the symbol string based on the state data and the statistical information.
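The data flow recited for the two modes can be sketched as follows. This is an illustrative sketch only, not the claimed implementation: the class name, the method names, and the toy stand-in functions are hypothetical, and the statistical information is assumed here to be a per-vocabulary occurrence vector, which the claim itself does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class ConversionDeviceSketch:
    """Wires together three callables standing in for the three units."""
    encoder: Callable[[Sequence[float]], List[float]]
    stats_decoder: Callable[[List[float]], List[float]]
    decoder: Callable[[List[float], List[float]], List[str]]

    def estimate(self, frames: Sequence[float]) -> List[str]:
        # Estimation mode: the decoder always uses the estimated statistics.
        state = self.encoder(frames)
        stats = self.stats_decoder(state)
        return self.decoder(state, stats)

    def learn_forward(
        self, frames: Sequence[float], correct_stats: Optional[List[float]] = None
    ) -> Tuple[List[str], List[float]]:
        # Learning mode: the decoder uses either the estimated statistics
        # or the correct statistics supplied with the training example.
        state = self.encoder(frames)
        estimated_stats = self.stats_decoder(state)
        used = estimated_stats if correct_stats is None else correct_stats
        return self.decoder(state, used), estimated_stats


# Toy stand-ins (assumptions): frames are plain floats, the vocabulary is made up.
VOCAB = ["HELLO", "THANKS", "RAIN"]
device = ConversionDeviceSketch(
    encoder=lambda frames: [sum(frames) / len(frames)],
    stats_decoder=lambda state: [1.0 if state[0] > t else 0.0 for t in (0.0, 0.5, 1.0)],
    decoder=lambda state, stats: [w for w, s in zip(VOCAB, stats) if s > 0.5],
)

estimated = device.estimate([0.5, 1.0])
with_correct, est_stats = device.learn_forward([0.5, 1.0], correct_stats=[0.0, 0.0, 1.0])
```

In a real system each callable would be a trained network. The sketch only shows which inputs reach the decoder in each mode, and that the estimated statistics are still produced in learning mode so that a statistical information loss can be computed from them.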

The conversion device according to claim 1,
wherein the image sequence is video representing sign language, and
the symbol string generated by the decoder unit is a string of symbols representing the gloss notation of the sign language.

A learning device comprising:
an encoder unit that generates state data based on an input image sequence;
a statistical information decoder unit that generates, based on the state data, statistical information, which is information on statistics about a symbol string;
a decoder unit that generates the symbol string based on the state data and the statistical information;
a learning data supply unit that supplies a set of a learning image sequence serving as the basis of an input to the encoder unit, a correct symbol string that is the correct answer for the symbol string corresponding to the learning image sequence, and correct statistical information that is the correct answer for the statistical information corresponding to the symbol string;
a loss calculation unit that calculates a loss representing the difference between a learning estimated symbol string, which is the symbol string generated by the decoder unit based on the state data generated by the encoder unit from the learning image sequence, and the correct symbol string supplied by the learning data supply unit for the learning image sequence; and
a statistical information loss calculation unit that calculates a statistical information loss representing the difference between learning estimated statistical information, which is the statistical information generated by the statistical information decoder unit based on the state data generated by the encoder unit from the learning image sequence, and the correct statistical information supplied by the learning data supply unit for the learning image sequence,
wherein an internal parameter of the decoder unit and an internal parameter of the encoder unit are adjusted based on the loss calculated by the loss calculation unit, and an internal parameter of the statistical information decoder unit and an internal parameter of the encoder unit are adjusted based on the statistical information loss calculated by the statistical information loss calculation unit.
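The parameter adjustment recited above, in which the encoder unit is adjusted by both losses while each decoder is adjusted only by its own loss, can be illustrated with a deliberately tiny example. Everything here is an assumption made for illustration: each unit is reduced to a single scalar weight, both losses are squared errors, the two encoder gradients are simply summed, and plain gradient descent is used. The claim prescribes none of these choices.

```python
def training_step(w_enc, w_dec, w_stat, x, y_symbol, y_stat, lr=0.05):
    """One gradient-descent step on a scalar toy model (hypothetical names).

    state       = w_enc * x          (encoder unit)
    symbol_pred = w_dec * state      (decoder unit)
    stat_pred   = w_stat * state     (statistical information decoder unit)
    """
    state = w_enc * x
    symbol_err = w_dec * state - y_symbol
    stat_err = w_stat * state - y_stat

    loss = symbol_err ** 2        # stand-in for the symbol-string loss
    stat_loss = stat_err ** 2     # stand-in for the statistical information loss

    # Each decoder weight receives a gradient only from its own loss.
    grad_dec = 2.0 * symbol_err * state
    grad_stat = 2.0 * stat_err * state
    # The encoder weight receives gradients from both losses.
    grad_enc = 2.0 * symbol_err * w_dec * x + 2.0 * stat_err * w_stat * x

    new_weights = (
        w_enc - lr * grad_enc,
        w_dec - lr * grad_dec,
        w_stat - lr * grad_stat,
    )
    return new_weights, loss, stat_loss


weights = (1.0, 1.0, 1.0)
weights, loss_before, stat_loss_before = training_step(*weights, x=1.0, y_symbol=2.0, y_stat=3.0)
_, loss_after, stat_loss_after = training_step(*weights, x=1.0, y_symbol=2.0, y_stat=3.0)
```

After one step both losses are smaller than before, and the encoder weight has moved further than either decoder weight because it accumulates gradient from both objectives.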

The learning device according to claim 3,
wherein the image sequence is video representing sign language, and
the symbol string generated by the decoder unit and the correct symbol string supplied by the learning data supply unit are strings of symbols representing the gloss notation of the sign language.

A program for causing a computer to function as the conversion device according to claim 1 or 2.

A program for causing a computer to function as the learning device according to claim 3 or 4.