JP2017037424A - Learning device, recognition device, learning program and recognition program

Info

Publication number: JP2017037424A
Application number: JP2015157596A
Authority: JP
Inventors: Seiki Inoue (井上 誠喜)
Original assignee: Nippon Hoso Kyokai NHK
Current assignee: Japan Broadcasting Corp
Priority date: 2015-08-07
Filing date: 2015-08-07
Publication date: 2017-02-16

Abstract

PROBLEM TO BE SOLVED: To provide a learning device, a recognition device, a learning program, and a recognition program that improve the accuracy of image recognition of finger characters and the like.
SOLUTION: A learning device 11 includes: a distance image generation means 24 for generating a CG distance image of a person corresponding to a preset CG character model by using motion data including at least one of the shape, posture, and motion of the CG character model; a region extraction means 25 for extracting a predetermined region of the CG character model from the CG distance image obtained by the distance image generation means 24, based on position information of the CG character model included in the motion data; and a neural network learning means 26 for training a neural network that takes as input the grayscale information contained in the predetermined region obtained by the region extraction means 25 and outputs correct-answer data for the predetermined region from that grayscale information.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a learning device, a recognition device, a learning program, and a recognition program, and more particularly to a learning device, a recognition device, a learning program, and a recognition program for improving the accuracy of image recognition of finger characters (fingerspelling) and the like.

Conventionally, there are methods that photograph an object such as a hand with an imaging means such as a camera and recognize a finger character from the shape of the hand in the captured image (see, for example, Non-Patent Documents 1 to 3).

In the method of Non-Patent Document 1, the user wears a glove on which predetermined fingers and the wrist are given predetermined colors, and the gloved hand is photographed so that the finger positions can be recognized with high accuracy. In the method of Non-Patent Document 2, a large number of distance images are input as learning data, and finger characters are recognized by applying principal component analysis to that learning data. In the method of Non-Patent Document 3, finger characters are recognized from distance images captured by a distance image camera, by extracting the hand region according to distance and by using features such as HOG (Histograms of Oriented Gradients).

Non-Patent Document 1: Takahiro Sugaya et al., "Basic study on a sign language recognition method using a visible light camera and color gloves", Human Interface Symposium 2014.
Non-Patent Document 2: Fumika Nakashima et al., "Finger character recognition system based on distance images using Kinect (registered trademark)", The Institute of Electronics, Information and Communication Engineers, HCG Symposium 2014.
Non-Patent Document 3: Seiki Inoue et al., "A study on sign language word motion recognition using distance information", The Institute of Image Information and Television Engineers, ITE Technical Report Vol. 36, No. 9, February 21, 2012.

However, in the method of Non-Patent Document 1 described above, gloves with the predetermined colors must be prepared in advance and worn during finger character recognition. In addition, gloves that fit the size of each user's hand must be prepared, and the information associating color information with finger information must be managed, so the method is very laborious.

In the method of Non-Patent Document 2, the finger region is extracted using distance images, but collecting the learning images is a problem. Whether color images or distance images are used, photographing a large number of images takes considerable work and effort. For example, the appearance of a hand changes greatly not only from person to person but also with the orientation of the hand and slight differences in how the fingers are bent. Moreover, a group of images created only with a specific person or under specific conditions causes problems in versatility and robustness.

The method of Non-Patent Document 3 uses CG (Computer Graphics) technology, so finger character recognition can be performed easily without photographing teacher images. However, settings such as the orientation of the hand and the degree of finger bending differ from one administrator to another, which greatly affects the accuracy of finger character recognition.

The present invention has been made in view of the problems described above, and an object thereof is to provide a learning device, a recognition device, a learning program, and a recognition program for improving the accuracy of image recognition of finger characters and the like.

In order to solve the above problems, the present invention employs means having the following features.

In one aspect, a learning device according to the present invention includes: a distance image generation means for generating a CG distance image of a person corresponding to a preset CG character model by using motion data including at least one of the shape, posture, and motion of the CG character model; a region extraction means for extracting a predetermined region of the CG character model from the CG distance image obtained by the distance image generation means, based on position information of the CG character model included in the motion data; and a learning means for training a neural network that takes as input the grayscale information contained in the predetermined region obtained by the region extraction means and outputs correct-answer data for the predetermined region from the input grayscale information.

In another aspect, a recognition device according to the present invention recognizes the predetermined region using the neural network obtained from the learning device described above, and includes: an image acquisition means for acquiring a distance image, containing a person, captured by a distance image camera; a region extraction means for extracting a predetermined region from the distance image obtained by the image acquisition means; and a recognition means for inputting the grayscale information of each pixel in the predetermined region obtained by the region extraction means into the neural network and recognizing the finger character corresponding to the predetermined region.

In another aspect, a learning program according to the present invention causes a computer to function as the learning device described above. In another aspect, a recognition program according to the present invention causes a computer to function as the recognition device described above.

According to the present invention, the accuracy of image recognition can be improved.

A diagram showing an example of the schematic configuration of the finger character recognition system in the present embodiment.
A flowchart showing an example of the finger character recognition process in the present embodiment.
A flowchart showing an example of the learning process.
A flowchart showing an example of the finger image acquisition process.
A diagram for explaining a specific example of the finger character recognition process in the present embodiment.
A diagram (part 1) showing an example of motion data.
A diagram (part 2) showing an example of motion data.
A diagram for explaining a specific example of character recognition processing using a neural network.

Preferred embodiments of the learning device, recognition device, learning program, and recognition program according to the present invention are described in detail below with reference to the drawings.

<Schematic configuration example of finger character recognition system>
First, a schematic configuration example of the finger character recognition system in the present embodiment is described with reference to the drawings. FIG. 1 is a diagram showing an example of the schematic configuration of the finger character recognition system in the present embodiment. The finger character recognition system 10 shown in the example of FIG. 1 includes a learning device 11 and a recognition device 12. The learning device 11 and the recognition device 12 are connected so that they can exchange data via a communication network 13 such as the Internet, a LAN (Local Area Network), WiFi (registered trademark), or Bluetooth (registered trademark).

The finger character recognition system 10 uses image processing to recognize finger characters, such as those used in sign language, from an image of the subject's hand and fingers. The finger character recognition system 10 recognizes finger motions (finger characters) with a simple configuration, without requiring the subject to wear any equipment.

The learning device 11 registers in advance motion data including at least one of the shape, posture, and motion of the object to be recognized (for example, a person's hand), and generates a large number of CG distance images by changing parameters and the like of the registered motion data. In the present invention, a neural network trained on the generated CG distance images is prepared. A neural network is used to predict an output value from an input value: by learning in advance from actual input values and the corresponding actual output values, it can estimate the output value for a new input value. In the present embodiment, for example, a neural network that takes a distance image as the input value and a finger character as the output value is prepared, but the embodiment is not limited to this. The present embodiment may also have another learning model or a database that associates the input values and output values described above.

The recognition device 12 acquires the neural network trained by the learning device 11 via the communication network 13. The recognition device 12 then inputs a distance image captured by a distance image camera or the like into the acquired neural network to perform finger character recognition.

As a result, finger motions such as finger characters can be recognized simply and with high accuracy, without having the subject wear any special equipment.

In the finger character recognition system 10 shown in FIG. 1, the learning device 11 and the recognition device 12 are provided separately, but the configuration is not limited to this, and they may be provided as a single integrated device (for example, a finger character recognition device). Also, in the finger character recognition system 10 shown in FIG. 1, the learning device 11 and the recognition device 12 are in a one-to-one relationship, but the configuration is not limited to this and may be, for example, an M-to-N relationship (where M and N are integers of 1 or more). Next, specific functional configuration examples of the learning device 11 and the recognition device 12 are described.

<Functional configuration example of learning device 11>
The learning device 11 includes an input means 21, an output means 22, a motion data management means 23, a distance image generation means 24, a region extraction means (learning region extraction means) 25, a neural network learning means 26 as an example of a learning means, a storage means 27, and a transmission/reception means 28.

The input means 21 accepts various inputs for the learning process of the neural network for finger character recognition in the present embodiment (for example, starting and ending the learning process and entering various setting information). If the learning device 11 is a general-purpose computer such as a PC (Personal Computer), the input means 21 is, for example, a keyboard or a pointing device such as a mouse. If the learning device 11 is a tablet terminal, a smartphone, or the like, the input means 21 is a touch panel or the like. The input means 21 may also be a voice input device, such as a microphone, that accepts the above inputs by voice.

The output means 22 outputs the content entered through the input means 21, the results of processing executed based on that input, and the like. The output means 22 is, for example, a display or a speaker. The output means 22 may be a touch panel integrated with the input means 21.

The motion data management means 23 stores in the storage means 27, and manages, a preset CG character model and the motion data for moving each joint and other parts of that CG character model. The motion data may be motion capture data indicating the shape, posture, motion, and the like of the whole body of the target CG character, or motion capture data indicating the shape, posture, motion, and the like of a part of the body (for example, a hand or fingers). Because the appearance of the CG character also changes with the angle of view and other properties of the camera (imaging means) that captures it, the motion data may include camera data such as the angle of view of the camera that captures the CG character.

The distance image generation means 24 generates a CG distance image of a person based on the motion data corresponding to the CG character model stored in advance in the storage means 27. The distance image generation means 24 may generate a CG distance image of the whole-body CG character model, or only a CG distance image of a partial region that includes the finger region (for example, the upper body).

The distance image generation means 24 also generates a plurality of distance images by changing parameters and the like included in the motion data. For example, the distance image generation means 24 generates finger regions in which the parameters for the angle or position of the wrist joint of the CG character model are rotated by a predetermined angle, or shifted in position, relative to the camera direction. The distance image generation means 24 may also generate a plurality of finger regions in which the parameters for the angle or position of a predetermined finger joint are bent by a predetermined angle or shifted in position, or a plurality of finger regions in which the camera parameters described above are changed. In this way, a plurality of different CG distance images can be generated easily.
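As an illustrative sketch only, the parameter variation described above can be pictured as enumerating combinations of offsets applied to a base pose. The parameter names (`wrist_angle_deg`, `index_bend_deg`, `camera_fov_deg`) and the offset values are hypothetical; the embodiment does not specify a data format or which combinations are used.

```python
from itertools import product


def vary_motion_parameters(base_pose, wrist_offsets_deg, finger_offsets_deg, camera_fovs_deg):
    """Return one pose dict per combination of wrist, finger, and camera settings."""
    variants = []
    for wrist, finger, fov in product(wrist_offsets_deg, finger_offsets_deg, camera_fovs_deg):
        pose = dict(base_pose)  # copy so the registered pose is left untouched
        pose["wrist_angle_deg"] = base_pose["wrist_angle_deg"] + wrist
        pose["index_bend_deg"] = base_pose["index_bend_deg"] + finger
        pose["camera_fov_deg"] = fov
        variants.append(pose)
    return variants


# Hypothetical base pose for one finger character.
base = {"wrist_angle_deg": 0.0, "index_bend_deg": 10.0, "camera_fov_deg": 60.0}
variants = vary_motion_parameters(base, [-10, 0, 10], [-5, 0, 5], [50, 60])
print(len(variants))  # 3 wrist offsets x 3 finger offsets x 2 camera settings
```

Each resulting pose would then be rendered as one CG distance image, so a single registered motion yields many teacher images.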

The region extraction means 25 performs threshold processing and position (coordinate) determination processing based on the position information of the CG character model included in the motion data, and extracts a predetermined region (for example, the finger region) of the CG character model from the CG distance image of the person. The region extraction means 25 extracts a finger region image of a predetermined image size. Making the image the same size as the finger character image that the recognition device 12 or the like actually obtains from a subject with a distance image camera improves the recognition accuracy for finger characters and the like.
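A minimal sketch of this kind of extraction follows, assuming the distance image is a list of rows of depth values and that the hand's pixel position is already known from the motion data. The function name, the depth-difference threshold, and the background value are assumptions made for illustration, not details taken from the embodiment.

```python
def extract_hand_region(depth, hand_row, hand_col, size, max_depth_diff, background=0):
    """Crop a size x size window centred on the hand and blank out other depths."""
    rows, cols = len(depth), len(depth[0])
    hand_depth = depth[hand_row][hand_col]
    top = hand_row - size // 2
    left = hand_col - size // 2
    patch = []
    for r in range(top, top + size):
        patch_row = []
        for c in range(left, left + size):
            inside = 0 <= r < rows and 0 <= c < cols
            # Threshold step: keep only pixels whose depth is close to the hand's depth.
            if inside and abs(depth[r][c] - hand_depth) <= max_depth_diff:
                patch_row.append(depth[r][c])
            else:
                patch_row.append(background)
        patch.append(patch_row)
    return patch


cg_depth = [
    [200, 200, 200, 200],
    [200, 50, 52, 200],
    [200, 51, 50, 200],
    [200, 200, 200, 200],
]
patch = extract_hand_region(cg_depth, hand_row=1, hand_col=1, size=3, max_depth_diff=5)
for row in patch:
    print(row)
```

Because the window size is fixed, every extracted patch has the same dimensions regardless of where the hand appears in the image.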

The neural network learning means 26 learns, as learning data, the finger region information extracted from the distance images of a plurality of CG finger regions, and constructs a neural network. The distance images of the CG finger regions are preferably unified to a predetermined pixel size, for example 128 × 128 pixels, but the size is not limited to this.
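The embodiment does not specify the network architecture or training algorithm at this point. Purely to illustrate learning a mapping from grayscale values to a class label, the sketch below trains a single-layer softmax classifier by gradient descent on two toy 2 × 2 patches flattened to four values. The classifier, the learning rate, the epoch count, and the data are all stand-in assumptions, far smaller than a 128 × 128 input.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights, biases, x):
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b for w, b in zip(weights, biases)]


def train_softmax(samples, labels, num_classes, epochs=200, lr=0.5):
    """Fit weights and biases with per-sample gradient descent on cross-entropy loss."""
    dim = len(samples[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            probs = softmax(class_scores(weights, biases, x))
            for k in range(num_classes):
                grad = probs[k] - (1.0 if k == y else 0.0)
                for j in range(dim):
                    weights[k][j] -= lr * grad * x[j]
                biases[k] -= lr * grad
    return weights, biases


def predict(weights, biases, x):
    scores = class_scores(weights, biases, x)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy "patches" with grayscale values scaled to [0, 1]; labels are class indices.
samples = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
labels = [0, 1]
weights, biases = train_softmax(samples, labels, num_classes=2)
print(predict(weights, biases, [0.9, 0.8, 0.1, 0.0]))
```

In practice a multi-layer network and many CG-generated images per finger character would replace this toy setup.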

The storage means 27 stores the various information required for the learning process in the present embodiment. Specifically, the storage means 27 stores the data required for the learning process (for example, the CG character model and motion data), the various programs for executing the learning process (applications, software, and the like), various setting information, and so on. The storage means 27 also stores the trained neural network, execution progress and results (history information), error information, user information, and the like. The information stored in the storage means 27 may instead be stored in another device connected via the communication network 13, for example.

The transmission/reception means 28 is a communication means for exchanging data with external devices such as the recognition device 12 via the communication network 13. The transmission/reception means 28 may acquire information such as the neural network described above, the program for executing the learning process, and execution results from an external device via the communication network 13, or output such information to an external device.

As described above, the learning device 11 generates a person distance image using motion data acquired in advance at high quality, and then extracts the finger region. The learning device 11 trains the neural network using a plurality of such teacher distance images. The learning device 11 uses CG to learn from teacher distance images that differ in, for example, the orientation of the hand, the bending angle of the fingers, and the spread between the fingers. Using CG makes it easy to generate a plurality of teacher distance images.

<Functional configuration example of recognition device 12>
The recognition device 12 includes an input means 31, an output means 32, an image acquisition means 33, a region extraction means (recognition region extraction means) 34, an image processing means 35, a finger character recognition means 36 as an example of a recognition means, a storage means 37, and a transmission/reception means 38.

The input means 31 accepts various inputs for the finger character recognition process in the present embodiment (for example, starting and ending the recognition process and entering various setting information). If the recognition device 12 is a general-purpose computer such as a PC, the input means 31 is, for example, a keyboard or a pointing device such as a mouse. If the recognition device 12 is a tablet terminal, a smartphone, or the like, the input means 31 is a touch panel or the like. The input means 31 may also be a voice input device, such as a microphone, that accepts the above inputs by voice.

The output means 32 outputs the content entered through the input means 31, the results of processing executed based on that input, and the like. The output means 32 is, for example, a display or a speaker.

The image acquisition means 33 acquires a distance image of an actual person (subject) captured by an imaging means such as a distance image camera. The image acquisition means 33 may itself be an imaging means such as a distance image camera, or may acquire, via the communication network 13, a distance image captured by an external imaging means.

The region extraction means 34 applies noise processing to the person distance image acquired by the image acquisition means 33, removing noise from the image with a low-pass filter, a median filter, or the like. The region extraction means 34 also obtains an image of the finger portion from the person distance image, based on the distance information of each part contained in the person distance image from the distance image camera (for example, grayscale information corresponding to distance). To do so, the region extraction means 34 sets in advance a threshold corresponding to a predetermined distance and extracts the portion nearer to the camera than that threshold. The region extraction means 34 then adjusts the image of the finger portion to the image size used by the region extraction means 25 of the learning device 11 (for example, 128 × 128 pixels) and extracts the finger region image.
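The three steps above (noise removal, near-distance thresholding, size adjustment) can be sketched as follows. This is an illustration under assumptions: pixel values are taken to be distances in millimetres with 0 meaning "no measurement", the filter is a 3 × 3 median that leaves the border untouched, and resizing uses nearest-neighbour sampling. None of these choices is specified in the embodiment.

```python
from statistics import median


def median_filter3(depth):
    """Replace each interior pixel with the median of its 3 x 3 neighbourhood."""
    rows, cols = len(depth), len(depth[0])
    out = [row[:] for row in depth]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = [depth[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            out[r][c] = median(window)
    return out


def keep_near(depth, threshold, background=0):
    """Keep only valid pixels that are nearer to the camera than the threshold."""
    return [[v if 0 < v <= threshold else background for v in row] for row in depth]


def resize_nearest(image, size):
    """Resample the image to size x size using nearest-neighbour lookup."""
    rows, cols = len(image), len(image[0])
    return [[image[r * rows // size][c * cols // size] for c in range(size)]
            for r in range(size)]


noisy = [[800, 800, 800], [800, 0, 800], [800, 800, 800]]
print(median_filter3(noisy)[1][1])            # the isolated dropout is replaced
print(keep_near([[500, 1500], [0, 900]], 1000))
print(resize_nearest([[1, 2], [3, 4]], 4))
```

A real pipeline would also crop to the bounding box of the remaining pixels before resizing to 128 × 128.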

The region extraction means 34 extracts the finger region from the person distance image using the same processing that the learning device 11 uses to extract the finger region of the CG character model from the CG distance image. This increases the similarity between the two kinds of finger regions and thereby improves, for example, the recognition accuracy for finger characters.

The image processing means 35 corrects the brightness level of the finger image obtained by the region extraction means 34. For example, the image processing means 35 corrects the brightness level to match that of the CG distance images in the learning device 11, or removes brightness differences among the multiple finger images obtained by the region extraction means 34, which improves the accuracy of finger character recognition.
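One simple way to picture such a correction is a linear rescaling of the foreground grayscale values into a fixed target range, so that captured images and CG images share the same brightness span. The sketch below makes that assumption; the function name, the target range, and the use of 0 as background are hypothetical and not taken from the embodiment.

```python
def match_brightness(image, target_min, target_max, background=0):
    """Linearly map foreground values so they span [target_min, target_max]."""
    foreground = [v for row in image for v in row if v != background]
    if not foreground:
        return [row[:] for row in image]
    lo, hi = min(foreground), max(foreground)
    if hi == lo:
        # A flat region has no contrast to stretch; place it mid-range.
        flat = (target_min + target_max) / 2
        return [[flat if v != background else background for v in row] for row in image]
    scale = (target_max - target_min) / (hi - lo)
    return [[target_min + (v - lo) * scale if v != background else background
             for v in row] for row in image]


captured = [[0, 100], [150, 200]]
corrected = match_brightness(captured, target_min=50, target_max=250)
print(corrected)
```

Applying the same target range to every finger image removes brightness differences between images taken at different distances.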

The finger character recognition means 36 inputs the distance image information obtained by the image processing means 35 into the trained neural network, which the transmission/reception means 38 has acquired from the learning device 11 via the communication network 13, and obtains the recognition result for the finger character or the like.

The storage means 37 stores the various information required for the recognition process in the present embodiment. Specifically, the storage means 37 stores the learning data obtained from the learning device 11 (for example, the neural network), the various programs for executing the recognition process (applications, software, and the like), various setting information, and so on. The storage means 37 may also store execution progress and results (history information), error information, user information, and the like. The information stored in the storage means 37 may be acquired from another device connected via the communication network 13, for example.

The transmission/reception means 38 is a communication means for exchanging data with external devices such as the learning device 11 via the communication network 13. The transmission/reception means 38 may acquire information such as the neural network described above, the program for executing the learning process, and execution results from an external device via the communication network 13, or output such information to an external device.

As described above, by using a neural network for finger character recognition trained on CG-generated teacher distance images, the recognition device can recognize with high accuracy the finger characters of a subject actually captured with a distance image camera.

<Finger character recognition process>
Next, an example of the finger character recognition process in the finger character recognition system 10 of the present embodiment is described using a flowchart. FIG. 2 is a flowchart showing an example of the finger character recognition process in the present embodiment. In the example of FIG. 2, the learning device 11 performs a learning process using a CG character model stored in advance and motion data including at least one of the shape, posture, motion, and the like of that model (S01). Details of the learning process in S01 are described later.

Next, the recognition device 12 performs a finger image acquisition process on a distance image of a person (subject) actually captured with a distance image camera or the like (S02). Details of the finger image acquisition process in S02 are described later. The learning process of S01 and the recognition process from S02 onward need not be performed consecutively; it is sufficient that the neural network resulting from S01 is available when the processing from S02 onward is executed.

Next, the recognition device 12 performs finger character recognition on the finger image obtained in S02 (S03) and outputs the recognition result (S04). The recognition device 12 then determines whether to end the process (S05). If the process is not to end (NO in S05), it returns to S02. If the user or the like has performed an end operation, or if all acquired finger images have been judged (YES in S05), the process ends.
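The control flow of S02 to S05 can be sketched as a simple loop. The sketch assumes the frames arrive as an iterable and that extraction and recognition are supplied as functions; the names `run_recognition`, `extract_hand`, `recognize`, and `stop_requested`, and the stub logic in the example, are hypothetical.

```python
def run_recognition(frames, extract_hand, recognize, stop_requested=lambda: False):
    """Process frames one by one until they run out or a stop is requested."""
    results = []
    for frame in frames:
        hand = extract_hand(frame)   # S02: finger image acquisition
        label = recognize(hand)      # S03: finger character recognition
        results.append(label)        # S04: output of the recognition result
        if stop_requested():         # S05: end on user request
            break
    return results                   # S05: also ends when all frames are judged


# Stub stages standing in for the real extraction and the trained network.
def extract_hand(frame):
    return [v for v in frame if 0 < v <= 1000]


def recognize(hand):
    return "a" if len(hand) == 1 else "unknown"


frames = [[500, 1500], [500, 600], [2000, 400]]
results = run_recognition(frames, extract_hand, recognize)
print(results)
```

Because the loop only needs a ready-made `recognize` function, training (S01) can be run at any earlier time, as noted above.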

<S01: Learning process>
FIG. 3 is a flowchart illustrating an example of the learning process. FIG. 3(A) shows the flowchart of the learning process, and FIG. 3(B) shows example images that supplement part of the process.

In the example of FIG. 3, the learning device 11 acquires motion data corresponding to the CG character model from the storage means 27 or the like (S11). Next, the learning device 11 changes the motion data (S12) and generates a CG distance image (S13). In S12 and S13, a person distance image (CG distance image) of the CG character model is generated using, for example, motion data managed in advance by the motion data management means 23. Ordinary motion data records a series of movements in which the model performs a finger character motion from the standard "attention" posture and then returns to the standard posture. In the present embodiment, therefore, the timing of the finger character motion is determined from the joint movement information included in the motion data, and a distance image of the whole body, upper body, or the like of the CG character model at that timing (the CG distance image 41 shown in FIG. 3(B)) is generated.

The CG character model has a skeleton and skin information that deforms smoothly according to the movement of the skeleton. The motion data records angle data for each joint. The skeleton holds the position and angle of each joint; the joint angles change according to the motion data, and the bones rotate accordingly.

In the present embodiment, therefore, distance images are generated while changing, for example, the rotation of the right wrist (RightWrist) joint about the X axis in steps of, for example, +5 degrees. This change alters the appearance of the finger region in the distance image.

In the above example, the joint of the CG character is rotated about the X axis, but it may be rotated about the Y axis or the Z axis, or about two or more of the X, Y, and Z axes. Also, although the joint angle of the CG character is changed in the above example, the present invention is not limited to this; for example, the degree of finger bending (bending angle) or the spread between fingers may be changed. The distance image generation means 24 repeats such changes to the wrist and finger joint parameters a plurality of times to generate a plurality of CG distance images.
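The parameter sweep described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: the function name `rotation_variants`, the dictionary representation of a pose, and the number of steps are hypothetical; only the RightWrist joint, the X axis, and the +5 degree step come from the text.

```python
from copy import deepcopy

def rotation_variants(pose, joint="RightWrist", axis="Xrotation",
                      step_deg=5.0, count=4):
    """Return `count` copies of `pose`, the k-th with one joint angle offset by k * step_deg."""
    variants = []
    for k in range(1, count + 1):
        variant = deepcopy(pose)           # leave the original pose untouched
        variant[joint][axis] += k * step_deg
        variants.append(variant)
    return variants

# Hypothetical base pose for one frame of a finger character motion.
base_pose = {"RightWrist": {"Xrotation": 10.0, "Yrotation": 0.0, "Zrotation": 0.0}}
poses = rotation_variants(base_pose)
```

Each returned pose would then be rendered as one CG distance image; sweeping the Y or Z axis, or finger bending angles, only changes the `joint` and `axis` arguments.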

The distance image generation means 24 may also distort the generated CG distance image two-dimensionally. For example, it may generate a distorted CG distance image by randomly applying at least one of a luminance change, rotation, enlargement/reduction, and translation to the finger region in the image, or by perturbing the position of each pixel. Letting (x, y) and (x', y') be the pixel positions before and after processing, the distortion can be applied using, for example, the following formula, although the present invention is not limited to this.
(x', y') = (x + α * random(0, 1), y + α * random(0, 1))
Here, α is a constant and random is a function that generates a random number.
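The pixel-position perturbation in the formula above could look like the following sketch. The function name `jitter_pixels`, the nested-list image representation, the rounding to the nearest pixel, and the background fill value are assumptions made for illustration; the text specifies only the formula itself.

```python
import random

def jitter_pixels(image, alpha, rng=None, background=255):
    """Move each pixel (x, y) to (x + alpha*r1, y + alpha*r2), with r1, r2 drawn from [0, 1)."""
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    # Pixels that receive no value keep the background (assumed: far, i.e. bright).
    out = [[background] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            nx = int(round(x + alpha * rng.random()))
            ny = int(round(y + alpha * rng.random()))
            if 0 <= nx < width and 0 <= ny < height:
                out[ny][nx] = image[y][x]
    return out
```

Because this is a forward mapping, some output pixels may be left at the background value when α is large; a practical implementation would probably interpolate or fill such holes.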

Next, the region extraction means 25 generates the region of the finger part. In general, a distance image is darker for nearer objects and brighter for farther ones. In sign language motion, for example, the hands and fingers are in front of the body, so the finger part is a relatively dark region compared with its surroundings. The region extraction means 25 therefore sets in advance a threshold corresponding to a predetermined distance and uses it to cut out the part nearer than that distance (closer to the camera) (S14). For example, in the CG distance image 41 of the CG character model shown in FIG. 3(B), if the CG character model is virtually placed 1 m from the camera and signs from that position, the finger part comes out in front of the body of the model. The region extraction means 25 therefore cuts out objects at less than 1 m and extracts the cut-out region as the finger part (S15). The method of extracting the finger part is not limited to this.
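The distance-threshold cut-out of S14 can be illustrated with a short sketch. It assumes the distance image is available as a nested list of distances in metres; the function name `cut_out_near` and the use of `None` as the background marker are hypothetical.

```python
def cut_out_near(depth_m, threshold_m=1.0, background=None):
    """Keep pixels closer to the camera than threshold_m; mark everything else as background."""
    return [[d if d < threshold_m else background for d in row] for row in depth_m]

# Body at 1.0 m, hand held out in front of it at about 0.7 m.
depth = [
    [1.00, 1.00, 1.00],
    [1.00, 0.72, 0.75],
    [1.00, 0.70, 1.00],
]
hand = cut_out_near(depth)
```

Only the three pixels nearer than 1 m survive, which corresponds to the cut-out region treated as the finger part.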

In S15, trimming (extraction of the finger region) is performed so that, for example, the finger region is at the center of the image and the edge of the finger region touches the upper side of the image, and the finger region image 42 shown in the example of FIG. 3(B) is extracted. The region extraction means 25 may place the finger region at the center of the finger region image 42, or may place it so that its edge touches the upper side of the image. This extraction of the finger region is repeated, while changing the finger and wrist joints by means of parameters or the like, until a predetermined number of images is obtained. The same processing is performed for the other finger character shapes.
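One possible reading of the trimming in S15 is sketched below: the crop window is centered horizontally on the finger region and its top edge is aligned with the topmost finger pixel. This interpretation, the helper names `hand_window` and `crop`, and the square window are assumptions for illustration.

```python
def hand_window(mask, size):
    """Return (left, top) of a size x size window that is horizontally centered
    on the hand pixels and whose top row is the first row containing the hand."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, on in enumerate(row) if on]
    if not rows:
        raise ValueError("mask contains no hand pixels")
    top = min(rows)
    left = round((min(cols) + max(cols)) / 2 - (size - 1) / 2)
    return left, top

def crop(image, left, top, size, fill=None):
    """Cut the window out of the image, padding with `fill` outside the image bounds."""
    h, w = len(image), len(image[0])
    return [[image[y][x] if 0 <= y < h and 0 <= x < w else fill
             for x in range(left, left + size)]
            for y in range(top, top + size)]

mask = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0, 0],
]
window = hand_window(mask, size=4)
cropped = crop(mask, *window, size=4, fill=0)
```

Placing every training and recognition image in the same position this way means the network does not have to learn where in the frame the hand appears.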

Next, the neural network learning means 26 trains the neural network using the finger regions extracted in S15 (S16). The trained neural network is, for example, stored in the storage means 27 or output to the recognition device 12. The finger distance images obtained from the motion data are correct-answer data in which, for example, a predetermined finger character shape is formed. Consequently, a neural network capable of outputting finger character recognition results is constructed.

It is then determined whether to end the process (S17). If the process is not to end (NO in S17), the process returns to S11 and continues. In this case, new motion data is acquired in S11 and the neural network is trained as described above, but the present invention is not limited to this. If the process is to end in response to a user instruction or the like (YES in S17), the learning process ends.

Thus, in the present embodiment, a plurality of finger regions that differ in appearance, such as finger position and angle relative to the camera direction, can be extracted for the same finger character from a single piece of CG motion data.

<S02: Example of finger image acquisition process>
Next, an example of the finger image acquisition process of S02 described above will be described using a flowchart. FIG. 4 is a flowchart illustrating an example of the finger image acquisition process. FIG. 4(A) shows the flowchart of the finger image acquisition process, and FIG. 4(B) shows example images that supplement part of the process.

In the example of FIG. 4, the recognition device 12 uses the image acquisition means 33 to acquire an actual person distance image captured by a distance image camera (S21) and removes noise from the acquired image (S22). Unlike the CG used in the learning process, the person distance image acquired in S21 is obtained by actual photographing with a distance image camera, so some pixels in the image may contain noise. The noise removal of S22 is therefore applied to the person distance image obtained by the distance image camera (for example, the person distance image 51 shown in FIG. 4(B)). In S22, noise can be removed with a general low-pass filter, median filter, or the like, but the method is not limited to these.
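As an illustration of the median filtering mentioned for S22, the following sketch applies a 3 × 3 median filter to a nested-list image. The window size, the decision to leave border pixels unchanged, and the function name `median_filter_3x3` are assumptions; the text names only the filter type.

```python
from statistics import median

def median_filter_3x3(image):
    """Replace each interior pixel with the median of its 3x3 neighbourhood."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]  # border pixels are copied unchanged
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [image[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = median(window)
    return out

# A flat surface with one isolated noisy pixel in the middle.
noisy = [[100] * 5 for _ in range(5)]
noisy[2][2] = 0
clean = median_filter_3x3(noisy)
```

An isolated outlier such as a dropped depth reading is removed, while edges between the hand and the background are preserved better than with simple averaging.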

Next, as in the learning process described above (for example, S14 and S15), the region extraction means 34 cuts out a region from the image obtained in S22 using the same threshold based on the distance from the camera or the like (S23), and extracts the finger region (for example, the finger region 52 shown in FIG. 4(B)) from the cut-out region (S24). Because the size of the finger region may differ, in S24 the size is adjusted so that it matches for a specific finger character, for example the finger character "て" (te), in which all fingers are extended.

Next, the image processing means 35 performs density conversion so that the image has the same brightness (shading) as the CG distance images input when the neural network was trained in the learning device 11 (S25). In S25, the image processing means 35, for example, sets brightness thresholds for the finger region in advance and adjusts the density so that it falls within the range of those thresholds, thereby obtaining the finger region (for example, the finger region 53 shown in FIG. 4(B)); the processing is not limited to this. The image processing means 35 then outputs the obtained finger image (S26).
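The density conversion of S25 could be realized, for example, as a linear rescaling of the finger-region brightness into a preset range. The sketch below assumes exactly that; the linear mapping, the function name `match_density`, and the background sentinel value are hypothetical choices, since the text states only that the density is adjusted to fall within preset thresholds.

```python
def match_density(image, target_min, target_max, background=255):
    """Linearly map the brightness of non-background pixels onto [target_min, target_max]."""
    hand = [v for row in image for v in row if v != background]
    if not hand:
        return [row[:] for row in image]
    lo, hi = min(hand), max(hand)
    span = hi - lo

    def convert(v):
        if v == background:
            return v
        if span == 0:  # flat region: place it in the middle of the target range
            return round((target_min + target_max) / 2)
        return round(target_min + (v - lo) * (target_max - target_min) / span)

    return [[convert(v) for v in row] for row in image]

captured = [[255, 10, 20], [30, 255, 255]]
adjusted = match_density(captured, target_min=40, target_max=80)
```

Here the target range would be chosen to match the brightness range of the finger region in the CG distance images used for training.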

<Specific example of finger character recognition in the present embodiment>
Next, a specific example of finger character recognition in the present embodiment will be described with reference to the drawings. FIG. 5 is a diagram for explaining a specific example of the finger character recognition process in the present embodiment. In the finger character recognition process of the present embodiment, a neural network is first trained using, as teacher images (teacher data), CG distance images generated from finger character motion data for sign language that was captured with high quality for the finger part.

In the present embodiment, as shown in FIG. 5, the learning device 11 performs neural network learning 63 using a CG character model 61 and motion data 62 corresponding to the CG character model 61. The CG character model 61 used for learning is not limited to a single person; models of a plurality of persons may be used.

FIG. 5(A) shows an example of the surface data of a CG character 61a, and FIG. 5(B) shows the relationship between the vertices of the CG character 61a of (A) and the bones of the CG character.

The CG character 61a is represented by a set of polygons 61b, as shown in FIG. 5(B). The polygons 61b are shown as quadrilaterals by way of example, but they are not limited to this and may be triangles. Each quadrilateral consists of four vertices, which can be represented as shown in FIG. 5(B).

For example, as shown in FIG. 5(B), the CG character 61a has bones (a skeleton) 61c at the center of the body; the bones 61c rotate and the vertices move accordingly, thereby producing postures, motions, and the like.

The shape, posture, and motion of the CG character 61a can be obtained using, for example, a technique called motion capture. In motion capture, the movement of an actual person is captured with a plurality of cameras, and the rotation amount of each bone is obtained from the captured data and the trajectory of the body's movement. That is, to make the CG character 61a move naturally, the movement of the bones 61c is used as the reference, and the movement of the skin surface shown in FIG. 5(A) is controlled in accordance with it, which realizes smooth and natural motion.

The surface data shown in FIG. 5(A) is attached using a dedicated skinning technique, an editing tool for it, or the like: when the motion of the CG character is generated, the skin and clothing are associated with, for example, the bones 61c of the CG character 61a (skinning).

FIGS. 6 and 7 are diagrams (part 1 and part 2) showing an example of motion data. In the example of FIGS. 6 and 7, line numbers (1) to (74) are added for convenience of explanation. FIGS. 6 and 7 show an example of a BVH file for one sign language word, but the usable file format is not limited to this. In the present embodiment, a BVH file exists for each sign language word.

The BVH file format is a motion data file format proposed by Biovision. A BVH file is written as text. The handling of the X, Y, and Z axes (for example, which axis corresponds to the vertical direction) can be set arbitrarily. A BVH file first describes information about the joint nodes and then describes the joint rotation amounts, which are expressed as Euler angles.

A BVH file consists of two parts: for example, a HIERARCHY part (lines (1) to (68)) that describes the skeleton hierarchy of the CG character, and a MOTION part (lines (69) to (74)) that describes the motion data.

In the example of FIG. 6, the first entry, "ROOT Hips" (line (2)), defines the root body part of the CG character under the name Hips. The OFFSET two lines below (line (4)) defines the position of Hips in absolute coordinates. The next line (line (5)) indicates that Hips has six parameters (CHANNELS 6): the absolute position in the three-dimensional (X, Y, Z) directions (Xposition, Yposition, Zposition) and the rotation angle (bending angle) about each axis (Xrotation, Yrotation, Zrotation).

The "OFFSET" entries in the HIERARCHY part define the initial values of the X, Y, and Z absolute positions; the initial values of the rotation angles are all 0. The next line, "JOINT RightHip", defines the next joint, RightHip, which is connected to Hips.

The OFFSET two lines below that (line (8)) defines the position relative to Hips, the joint from which RightHip is connected. The next line defines that RightHip also has the same six parameters as Hips.

The remaining joints of the body are defined in the same way as RightHip. For sign language, for example, the CG character is defined as data with 112 joints, and the example of FIGS. 6 and 7 shows settings for joints such as the right shoulder (line (27)), elbow (line (31)), forearm (line (35)), wrist (line (39)), and fingers (base of the thumb, first joint, second joint, and fingertip) (lines (43) to (61)). In the present embodiment, joint node information is held for each finger so that motions such as sign language can be recorded with high quality.

As shown from line (69) of FIG. 7 onward, the latter part begins with the keyword "MOTION". "Frames: 321" on line (70) indicates that motion data for 321 frames follows, expressing one word in sign language. "Frame Time" on line (71) gives the duration of each frame. In the example of FIG. 7 it is set to 0.0166667, which corresponds to 1/60 second, meaning that the motion data contains 60 frames per second.
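The relationship between the frame time, the frame rate, and the length of the recorded word can be checked with a few lines of arithmetic. The variable names are illustrative; the two input values are those given for FIG. 7.

```python
frames = 321              # "Frames: 321" on line (70)
frame_time = 0.0166667    # "Frame Time" on line (71), in seconds

fps = round(1 / frame_time)        # frames per second
duration_s = frames * frame_time   # total length of the motion, about 5.35 seconds
```

In other words, the 321 frames of this example describe a little over five seconds of motion at 60 frames per second.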

The subsequent lines store, frame by frame, the values of all the parameters defined in the first part. Each line has (number of joints) × (number of parameters per joint) elements. For example, with 112 joints and six parameters per joint, each line has 112 × 6 = 672 values.

The space-separated values on each line are ordered from the beginning as Xposition of Hips, Yposition of Hips, Zposition of Hips, Zrotation of Hips, Xrotation of Hips, Yrotation of Hips, Xposition of RightHip, Yposition of RightHip, Zposition of RightHip, and so on. The motion data file is not limited to this example.
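The layout of one MOTION line described above can be sketched as follows. The sketch assumes, as in the text, that every joint has the same six channels in the order listed; the function names `split_frame` and `values_per_frame` and the two-joint sample line are hypothetical, and this is not a general BVH parser.

```python
CHANNELS = ["Xposition", "Yposition", "Zposition",
            "Zrotation", "Xrotation", "Yrotation"]

def values_per_frame(joint_count, channels=CHANNELS):
    """Number of values on one MOTION line: joints x channels per joint."""
    return joint_count * len(channels)

def split_frame(line, joint_names, channels=CHANNELS):
    """Map one space-separated MOTION line to {joint: {channel: value}}."""
    values = [float(tok) for tok in line.split()]
    if len(values) != values_per_frame(len(joint_names), channels):
        raise ValueError("unexpected number of values on the frame line")
    frame = {}
    for j, name in enumerate(joint_names):
        start = j * len(channels)
        frame[name] = dict(zip(channels, values[start:start + len(channels)]))
    return frame

# Made-up line for a skeleton reduced to two joints.
frame = split_frame("1 2 3 4 5 6 7 8 9 10 11 12", ["Hips", "RightHip"])
```

With this mapping, a value such as the X rotation of the right wrist can be looked up by name and modified before the CG distance image is rendered.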

In the present embodiment, when the finger part forming a finger character is changed (deformed), the parameters of, for example, the joints of the right hand part (for example, the right shoulder, right elbow, and right wrist) are adjusted.

Thus, in the present embodiment, a CG distance image can be generated from the CG character model 61 together with the position information and rotation angle information of the corresponding motion data 62, and a plurality of CG distance images for learning can easily be generated by changing the angle and position parameters of each joint (for example, the rotation angle information and position coordinate information).

In the present embodiment, as the neural network learning 63, a neural network is trained so that finger characters such as (あ) to (お) shown in the example of FIG. 5(C) are output as recognition results. The recognition targets are, for example, the 46 finger characters from "あ" (a) to "ん" (n), but the present invention is not limited to this. In the neural network learning 63, the motion data 62 is used to change the apparent shape of part of the hand and fingers, and learning is performed using a plurality of different CG distance images for the same finger character.

Next, as shown in FIG. 5, the recognition device 12 performs distance image acquisition 64 on an actual person and performs finger region extraction 65 on the obtained distance image. The image obtained by the distance image acquisition 64 is, for example, the image 64a shown in FIG. 5(D). The image extracted by the finger region extraction 65 is, for example, the finger image 65a shown in FIG. 5(E). This image is converted to a predetermined image size, for example 128 × 128 pixels, and its density is adjusted to a preset density (shading) by density conversion so that no discrepancy arises with the training data or between distance images.

The image 65a obtained by the above adjustment is input to the neural network trained in the neural network learning 63, and finger character recognition 66 by the neural network is performed.

FIG. 8 is a diagram for explaining a specific example of character recognition using a neural network. In the present embodiment, deep learning is used as an example of a neural network, and more specifically a convolutional neural network (CNN), which is one form of deep learning, is used, but the present invention is not limited to this.

A CNN is composed of layers such as convolution layers, pooling layers, and fully connected layers. In a CNN, a combination of a convolution layer and a pooling layer is repeated an arbitrary number of times, and the result is input to the fully connected layers to obtain the final output.

The example of FIG. 8(A) shows neurons that repeatedly pass their outputs from the lowest layer toward the upper layers. The neural network shown in FIG. 8(A) takes as input the shading information (brightness or luminance information) of every pixel of the finger image (for example, 128 × 128 pixels). That is, the shading information is input to the neurons of the lowest layer pixel by pixel, in the order pixel (1, 1), pixel (1, 2), pixel (1, 3), and so on of the finger image obtained by the finger region extraction 65, and the most likely finger character is output as the recognition result from the neurons of the top layer. In the example network of FIG. 8(A), the combination of a convolution layer and a pooling layer is stacked five times, followed by three fully connected layers, and finally 46 outputs corresponding to the 46 characters are produced, but the present invention is not limited to this. The number of pixels of the input distance image is also not limited to 128 × 128 and may be, for example, 256 × 256. It can be set arbitrarily according to the size of the finger image actually obtained with the distance image camera.
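To give a feel for the size of such a network, the following sketch traces the spatial size of the feature maps through the five convolution and pooling stages. It assumes size-preserving convolutions and 2 × 2 pooling that halves each side, neither of which is stated in the text; the function name `feature_map_sizes` is also hypothetical.

```python
def feature_map_sizes(input_size=128, stages=5, pool=2):
    """Side length of the feature map after each conv+pool stage,
    assuming the convolution keeps the size and pooling divides it by `pool`."""
    sizes = [input_size]
    for _ in range(stages):
        sizes.append(sizes[-1] // pool)
    return sizes

input_values = 128 * 128               # one brightness value per pixel
sizes_128 = feature_map_sizes(128)     # stages applied to a 128 x 128 input
sizes_256 = feature_map_sizes(256)     # the same stages on a 256 x 256 input
outputs = 46                           # one output neuron per finger character
```

Under these assumptions a 128 × 128 input shrinks to 4 × 4 feature maps before the three fully connected layers, and a 256 × 256 input to 8 × 8.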

As shown for example in FIG. 8(B), the neural network has learned, for the finger character "あ" (a), a plurality of finger character images whose apparent angle and the like differ according to the parameters ((あ1) to (あ4) in the example of FIG. 8(B)). All of these belong to (あ).

When any of these is input, the output of the neuron corresponding to "あ" in the top layer shown in FIG. 8(A) takes the highest value (ideally 1.0), and the remaining neurons output low values (ideally 0.0). Similarly, when (い) is input, the top-layer neuron corresponding to (い) outputs the highest value and the others output low values. The neural network is trained so that, for the other characters as well, a high value is output from the corresponding neuron.
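Reading the recognition result from the top layer amounts to picking the neuron with the highest output. A minimal sketch follows; the ordering of the 46 output neurons in `KANA` and the function name `recognize` are assumptions, since the text does not say which neuron corresponds to which character.

```python
# Assumed ordering of the 46 output neurons (gojuon order).
KANA = ("あいうえおかきくけこさしすせそたちつてと"
        "なにぬねのはひふへほまみむめもやゆよらりるれろわをん")

def recognize(outputs, labels=KANA):
    """Return (character, score) for the output neuron with the highest value."""
    if len(outputs) != len(labels):
        raise ValueError("expected one output value per finger character")
    best = max(range(len(outputs)), key=outputs.__getitem__)
    return labels[best], outputs[best]

# Made-up top-layer outputs: the third neuron responds most strongly.
outputs = [0.01] * 46
outputs[2] = 0.93
result = recognize(outputs)
```

A practical system might also reject the result when the highest score is below some confidence level, although the text does not describe such a step.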

This makes it possible to recognize the finger character closest to the shape of the fingers in the photographed image. In the example of FIG. 5, the finger character (う) is output as the recognition result.

<Application examples of the present embodiment>
As described above, the character recognition of the present embodiment makes it possible to recognize finger characters from, for example, a distance image. Although the finger character recognition described above is a method for recognizing finger characters from one hand, it is not limited to this; for example, finger character recognition may be performed for both hands in the image. Fixed sign language expressions may also be recognized from the recognition results for the left and right hands, for example recognizing "Nara" from the sign (pose) imitating the Great Buddha of Nara.

Furthermore, in the present embodiment, by recognizing a plurality of finger characters in time series, words spelled with several finger characters may be recognized, such as "阿波" (Awa) from the finger character "ア" (a) followed by the finger character "ワ" (wa). In addition, by recognizing the movement of the hands and fingers in time series, kanji-based signs such as "幸せ" (happiness) can also be recognized.
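One simple way to combine per-frame recognition results into a word is sketched below: consecutive identical results are merged, short runs are discarded as noise, and the resulting character sequence is looked up in a word table. The merging rule, the `min_run` value, the `WORDS` table, and the function names are all assumptions for illustration; the text does not specify how the time series is processed.

```python
from itertools import groupby

# Hypothetical word table mapping spelled-out characters to a word.
WORDS = {"あわ": "阿波"}

def collapse_repeats(per_frame_chars, min_run=3):
    """Merge runs of identical per-frame results, dropping runs shorter than min_run."""
    return "".join(ch for ch, run in groupby(per_frame_chars)
                   if len(list(run)) >= min_run)

def spell_word(per_frame_chars, words=WORDS, min_run=3):
    """Return the matching word, or the raw character sequence if none matches."""
    spelled = collapse_repeats(per_frame_chars, min_run)
    return words.get(spelled, spelled)

# Per-frame results with one misrecognized frame between the two characters.
frames_seq = list("あああいわわわわ")
word = spell_word(frames_seq)
```

The single-frame result between the two characters is treated as noise because it does not persist for `min_run` frames.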

Furthermore, as an application of the present embodiment, recognition is not limited to the finger region: the posture of the whole person may be recognized, or the person may be recognized in time series and the person's movement (motion) recognized from the results.

<Execution program>
The learning device 11 and the recognition device 12 of the finger character recognition system 10 described above can each be implemented by a computer equipped with, for example, a CPU (Central Processing Unit); a volatile storage medium such as a RAM (Random Access Memory); a nonvolatile storage medium such as a ROM (Read Only Memory); input devices such as a mouse, keyboard, and pointing device; a display unit that displays various data such as the CG character model, motion data, images such as CG distance images and person distance images, extracted finger images, and finger character recognition results; and an interface for communicating with external devices.

Accordingly, each function of the finger character recognition system 10 can be realized by causing the CPU to execute a program that describes the function. These programs can also be stored on and distributed via recording media such as magnetic disks (floppy disks, hard disks, etc.), optical discs (CD-ROM, DVD, etc.), and semiconductor memories.

That is, by generating execution programs (for example, a learning program, a recognition program, or a finger character recognition program that integrates them) for causing a computer to execute the processing of each component described above, and installing the generated programs on, for example, a general-purpose personal computer, server, or tablet terminal, the learning process, the recognition process (finger character recognition process), and the like of the present embodiment can be realized through the cooperation of hardware resources and software.

As described above, the present invention can improve the accuracy of image recognition of finger characters and the like. For example, according to the present embodiment, the large number of images needed to train the neural network can be generated efficiently using CG in the learning process. In the recognition process, the distance images generated by CG and the distance images captured with a distance image camera or the like look similar, which enables highly accurate recognition. In addition, the present embodiment makes it easy to perform high-quality finger character recognition using an inexpensive distance image camera.
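The generation of many training images from CG can be pictured with a minimal sketch that perturbs joint angles of a base pose, in the spirit of changing joint position or angle parameters in the motion data. This is only an illustration under stated assumptions: the pose format (a dictionary of joint names to angles in degrees), the joint names, the perturbation range, and the function `vary_joint_angles` are all hypothetical and do not come from the embodiment.

```python
import random


def vary_joint_angles(base_pose, max_offset_deg, count, seed=0):
    """Return `count` variants of base_pose with each joint angle perturbed.

    base_pose: dict mapping a joint name to an angle in degrees
               (hypothetical format, assumed for this sketch).
    max_offset_deg: largest absolute offset added to any joint angle.
    """
    rng = random.Random(seed)  # fixed seed keeps the generated set reproducible
    variants = []
    for _ in range(count):
        # Build a new dict so the base pose itself is never modified.
        variant = {
            joint: angle + rng.uniform(-max_offset_deg, max_offset_deg)
            for joint, angle in base_pose.items()
        }
        variants.append(variant)
    return variants


# Hypothetical base pose for one finger character; joint names are illustrative.
base_pose = {"wrist": 0.0, "index_mcp": 10.0, "thumb_cmc": 35.0}
poses = vary_joint_angles(base_pose, max_offset_deg=5.0, count=100)
```

Each variant would then be rendered as a separate CG distance image labeled with the same finger character, so one set of motion data yields many training samples.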

Although preferred embodiments of the present invention have been described in detail above, the present invention is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention as set forth in the claims. It is also possible to combine some or all of the embodiments described above.

DESCRIPTION OF SYMBOLS

10 Finger character recognition system
11 Learning device
12 Recognition device
13 Communication network
21, 31 Input means
22, 32 Output means
23 Motion data management means
24 Distance image generation means
25 Region extraction means (learning region extraction means)
26 Neural network learning means (learning means)
27, 37 Storage means
28, 38 Transmission/reception means
33 Image acquisition means
34 Region extraction means (recognition region extraction means)
35 Image processing means
36 Finger character recognition means (recognition means)
41 CG distance image
42 Finger region image
51 Person distance image
52, 53 Finger region
61 CG character model
62 Motion data
63 Neural network learning
64 Distance image acquisition
65 Finger region extraction
66 Finger character recognition

Claims

A learning device comprising:
distance image generation means for generating a CG distance image of a person corresponding to a preset CG character model, using motion data that includes information on at least one of the shape, posture, and motion of the CG character model;
region extraction means for extracting a predetermined region of the CG character model from the CG distance image obtained by the distance image generation means, based on position information of the CG character model included in the motion data; and
learning means for training a neural network that takes as input the grayscale information contained in the predetermined region obtained by the region extraction means and outputs correct-answer data for the predetermined region from the input grayscale information.

The learning device according to claim 1, wherein the distance image generation means generates a plurality of different CG distance images by changing a parameter relating to the position or angle of a joint portion of the CG character model included in the motion data.

The learning device according to claim 1 or 2, wherein the learning means trains a neural network that takes as input the grayscale information of each pixel of a CG distance image including a finger region as the predetermined region, and outputs correct-answer data of the finger character for the predetermined region from the input grayscale information.

A recognition device that recognizes the predetermined region using a neural network obtained from the learning device according to any one of claims 1 to 3, the recognition device comprising:
image acquisition means for acquiring a distance image, captured by a distance image camera, that includes a person;
region extraction means for extracting a predetermined region from the distance image obtained by the image acquisition means; and
recognition means for inputting the grayscale information of each pixel included in the predetermined region obtained by the region extraction means into the neural network and recognizing the finger character corresponding to the predetermined region.

The recognition device according to claim 4, further comprising image processing means for performing density conversion and image size adjustment on the distance image of the predetermined region extracted by the region extraction means.

A learning program for causing a computer to function as the learning device according to any one of claims 1 to 3.

A recognition program for causing a computer to function as the recognition device according to claim 4 or 5.
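
As an illustrative note outside the claims, the density conversion and image size adjustment applied to an extracted distance image region might be sketched as follows. This is a rough sketch under stated assumptions, not the claimed implementation: it assumes a depth value of 0 means "no measurement", maps nearer pixels to brighter gray levels, and uses nearest-neighbor resizing; the function names and the 4x4 output size are hypothetical.

```python
def convert_density(depth, out_max=255):
    """Linearly map valid depth values to gray levels 1..out_max.

    Assumption for this sketch: a depth of 0 marks a pixel with no
    measurement and stays 0; nearer pixels become brighter.
    """
    valid = [d for row in depth for d in row if d > 0]
    if not valid:
        return [[0 for _ in row] for row in depth]
    near, far = min(valid), max(valid)
    span = far - near
    result = []
    for row in depth:
        new_row = []
        for d in row:
            if d <= 0:
                new_row.append(0)            # keep "no measurement" as 0
            elif span == 0:
                new_row.append(out_max)      # flat region: single gray level
            else:
                new_row.append(1 + round((out_max - 1) * (far - d) / span))
        result.append(new_row)
    return result


def resize_nearest(image, out_h, out_w):
    """Resize a 2D list to out_h x out_w using nearest-neighbor sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def preprocess_region(depth, size):
    """Density conversion followed by size adjustment to size x size."""
    return resize_nearest(convert_density(depth), size, size)


# Tiny hypothetical 2x2 depth patch (values in millimeters).
patch = [[0, 500], [600, 700]]
gray = preprocess_region(patch, 4)
```

Rescaling the gray range per region and fixing the output size in this way is one plausible means of giving camera-captured regions the same value range and dimensions as the CG regions used for training.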