WO2022264268A1

WO2022264268A1 - Learning device, estimation device, method for these, and program

Info

Publication number: WO2022264268A1
Application number: PCT/JP2021/022703
Authority: WO
Inventors: 瑛彦高島; 亮増村
Original assignee: 日本電信電話株式会社
Priority date: 2021-06-15
Filing date: 2021-06-15
Publication date: 2022-12-22
Also published as: JPWO2022264268A1

Abstract

Provided is a learning device that learns the parameters used to estimate face key points and a face orientation in a way that captures more universal facial features. The learning device converts a learning face image into intermediate features via a neural network function using a model parameter ^θ_k or ^θ_p. When the learning face image consists of face images for learning face key points, the learning device converts the intermediate features into estimated face key point vectors via a neural network function using the model parameter ^θ_k, and updates the model parameter ^θ_k using the estimated face key point vectors and the face key point correct labels for the learning face image. When the learning face image consists of face images for learning a face orientation, the learning device converts the intermediate features into an estimated face orientation vector via a neural network function using the model parameter ^θ_p, and updates the model parameter ^θ_p using the estimated face orientation vector and a face orientation correct label.

Description

LEARNING DEVICE, ESTIMATION DEVICE, METHODS FOR THESE, AND PROGRAM

The present invention relates to an estimation technique for estimating face keypoints and face orientation from a face image, and to a technique for learning the parameters used in that estimation.

Face keypoints are several to a dozen or so feature points plotted on the eyebrows, eyes, nose, mouth, and contour of a human face. They are features that quantitatively capture the movement of facial expressions and of the face, and estimating face keypoints from a face image is important for understanding a person's state and inner feelings. Face orientation, on the other hand, is the rotation angle of the face about three axes (roll, pitch, yaw); estimating the face orientation from a face image makes it possible to analyze face movement and the direction of attention. Face keypoints and face orientation are generally estimated using neural networks. In the prior art, face keypoints and face orientation are estimated independently, using independent neural network models each trained only on its own learning data (Non-Patent Documents 1 and 2). The correct labels for both face keypoints and face orientation are vector data. Face keypoints are expressed as continuous-valued data giving the two-dimensional or three-dimensional coordinates of each keypoint in the image. Any number of keypoints can be used, for example 21 or 68. Face orientation is continuous-valued data giving the roll, pitch, and yaw components of the rotation angle. A neural network can estimate face keypoints and face orientation by, for example, extracting image features with the convolutional and pooling layers widely used in image recognition and the like, and then regressing to the face keypoint or face orientation vector data with subsequent fully connected layers.

Non-Patent Documents 1 and 2 use neural networks to estimate face keypoints and face orientation.

In the prior art, the model parameters used for estimating face keypoints and the model parameters used for estimating face orientation are each learned by an independent neural network using independent learning data. However, face keypoints and face orientation are both features that capture face movement. If the model parameters could be learned with both kinds of features taken into account, more universal facial features could be acquired, and highly accurate estimation of face keypoints and face orientation could be expected. Because the prior art learns the two sets of model parameters independently, it has the problem that only local facial features can be captured.

An object of the present invention is to provide an estimation device that can estimate face keypoints and face orientation capturing more universal facial features by learning the model parameters used for both estimations within a single neural network system, a learning device for the parameters used in the estimation, methods for these, and a program.

To solve the above problem, according to one aspect of the present invention, a learning device includes: a shared network unit that uses a model parameter ^θ_k or ^θ_p to convert a learning face image S^b into an intermediate feature v^b by a neural network function; a face keypoint network unit that, when the learning face image S^b consists of face keypoint learning face images, uses the model parameter ^θ_k to convert the intermediate feature v^b into an estimated face keypoint vector Z'_k,b by a neural network function; a face keypoint model parameter optimization unit that updates the model parameter ^θ_k using the estimated face keypoint vector Z'_k,b and the face keypoint correct labels T^k_1,…,T^k_|M| for the learning face image S^b; a face orientation network unit that, when the learning face image S^b consists of face orientation learning face images, uses the model parameter ^θ_p to convert the intermediate feature v^b into an estimated face orientation vector Z'_p,b by a neural network function; and a face orientation model parameter optimization unit that updates the model parameter ^θ_p using the estimated face orientation vector Z'_p,b and the face orientation correct labels T^p_1,…,T^p_|N|. The learning device acquires a learned model parameter θ_k corresponding to the model parameter ^θ_k and a learned model parameter θ_p corresponding to the model parameter ^θ_p.

According to the present invention, face keypoints and face orientation can be estimated in a way that captures more universal facial features.

FIG. 1 is a diagram showing a configuration example of the estimation system according to the first embodiment.
FIG. 2 is a functional block diagram of the learning device according to the first embodiment.
FIG. 3 is a diagram showing an example of the processing flow of the learning device according to the first embodiment.
FIG. 4 is a functional block diagram of the estimation device according to the first embodiment.
FIG. 5 is a diagram showing an example of the processing flow of the estimation device according to the first embodiment.
FIG. 6 is a diagram showing a configuration example of a computer to which the present method is applied.

Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and duplicate description is omitted. In the following description, symbols such as "^" used in the text should properly be written directly above the immediately preceding character, but because of the limitations of text notation they are written immediately after that character. In formulas, these symbols are written in their proper positions. Unless otherwise noted, processing performed on each element of a vector or matrix applies to all elements of that vector or matrix.

<Points of the first embodiment>
In this embodiment, both the learning data for the model parameters used for estimating face keypoints and the learning data for the model parameters used for estimating face orientation are learned by a learning device that forms a single neural network system. The learning face images for the face keypoint model parameters and the learning face images for the face orientation model parameters are input alternately to the shared network unit in the learning device, where more universal facial features, acquired from both kinds of learning face images, are learned. Downstream of the shared network unit, the network branches into a face keypoint network unit and a face orientation network unit. Each branch minimizes the error of its estimates using its own correct labels and updates the model parameters. At estimation time, a face keypoint estimation unit and a face orientation estimation unit are used, respectively. The face keypoint estimation unit estimates face keypoints using the network architecture of the shared network unit followed by the face keypoint network unit, together with the model parameters learned by the learning device. Similarly, the face orientation estimation unit estimates the face orientation using the network architecture of the shared network unit followed by the face orientation network unit, together with the model parameters learned by the learning device.

<First embodiment>
FIG. 1 shows a configuration example of the estimation system according to the first embodiment.

The estimation system includes a learning device 100 and an estimation device 200.

The learning device 100 and the estimation device 200 are, for example, special devices configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: Random Access Memory), and the like. The learning device 100 and the estimation device 200 execute each process under the control of the central processing unit, for example. Data input to the learning device 100 and the estimation device 200 and data obtained in each process are stored, for example, in the main storage device, and the data stored in the main storage device is read out to the central processing unit as needed and used for other processing. At least part of each processing unit of the learning device 100 and the estimation device 200 may be implemented in hardware such as an integrated circuit. Each storage unit of the learning device 100 and the estimation device 200 can be implemented, for example, by a main storage device such as RAM or by middleware such as a relational database or a key-value store. However, each storage unit need not be located inside the learning device 100 and the estimation device 200; it may be implemented by an auxiliary storage device such as a hard disk, an optical disc, or a semiconductor memory element such as flash memory, and provided outside the learning device 100 and the estimation device 200.

First, the learning device 100 is described.

<Learning device 100>
The learning device 100 takes as input learning data K=(S^k_1,T^k_1),…,(S^k_|M|,T^k_|M|) for learning the model parameter θ_k used for estimating face keypoints (hereinafter also called the "face keypoint parameter θ_k") and learning data P=(S^p_1,T^p_1),…,(S^p_|N|,T^p_|N|) for learning the model parameter θ_p used for estimating face orientation (hereinafter also called the "face orientation parameter θ_p"). It learns the model parameters θ_k and θ_p using the learning data K and P, and acquires and outputs the learned model parameters θ_k and θ_p. Here, M is the data size of the learning data K, and N is the data size of the learning data P. S^k_m (m=1,…,M) is a learning face image in the learning data for the face keypoint parameter θ_k. A learning face image is, for example, an image in which only the face has been cropped out, with a resolution of 224x224 pixels and three RGB channels. T^k_m (m=1,…,M) is a correct label in the learning data for the face keypoint parameter θ_k (hereinafter also called a "face keypoint correct label"). It is, for example, vector data storing the coordinates of each face keypoint, in a format such as [[X1,Y1],…,[X68,Y68]]. S^p_n (n=1,…,N) is a learning face image in the learning data for the face orientation parameter θ_p. Like the learning face image S^k_m of the learning data K, it is, for example, an image in which only the face has been cropped out, with a resolution of, for example, 224x224 pixels and three RGB channels. T^p_n (n=1,…,N) is a correct label in the learning data for the face orientation parameter θ_p (hereinafter also called a "face orientation correct label"). It is a vector storing the rotation angles of the face orientation in the three directions roll, pitch, and yaw, in a format such as [roll, pitch, yaw].
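The two label formats described above can be pictured with a minimal Python sketch. The function names and the sample values are hypothetical and serve only to illustrate the shapes; the 68-point, two-dimensional layout is the example given in the text, not a requirement.

```python
def is_keypoint_label(label, num_points=68, dims=2):
    """Check the shape [[X1, Y1], ..., [X68, Y68]] of a face keypoint correct label."""
    return (
        len(label) == num_points
        and all(len(point) == dims for point in label)
        and all(isinstance(c, (int, float)) for point in label for c in point)
    )


def is_orientation_label(label):
    """Check the shape [roll, pitch, yaw] of a face orientation correct label."""
    return len(label) == 3 and all(isinstance(a, (int, float)) for a in label)


# Hypothetical label values, used only to show the two formats.
keypoint_label = [[float(i), float(2 * i)] for i in range(68)]
orientation_label = [20.0, 43.0, -151.0]
```

Passing `dims=3` to `is_keypoint_label` covers the three-dimensional keypoint format in the same way.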

FIG. 2 shows a functional block diagram of the learning device 100, and FIG. 3 shows its processing flow.

The learning device 100 includes a data selection unit 110, a shared network unit 120, a face keypoint network unit 130, a face keypoint model parameter optimization unit 140, a face orientation network unit 150, and a face orientation model parameter optimization unit 160.

An overview of each unit follows.

The data selection unit 110 takes as input the learning face images S^k_1,…,S^k_|M| for the face keypoint parameter θ_k (hereinafter also called "face keypoint learning face images") and the learning face images S^p_1,…,S^p_|N| for the face orientation parameter θ_p (hereinafter also called "face orientation learning face images"), and selects the learning data to be used in one batch.

The shared network unit 120 takes one batch of learning face images as input and acquires intermediate features using an arbitrary neural network that outputs intermediate features.

The face keypoint network unit 130 takes the intermediate features as input and acquires face keypoint estimates using an arbitrary neural network that outputs face keypoint estimates.

The face keypoint model parameter optimization unit 140 takes the face keypoint estimates and the face keypoint correct labels as input, calculates the error of the estimated face keypoint coordinates, and updates the model parameter θ_k based on the error.

The face orientation network unit 150 takes the intermediate features as input and acquires a face orientation estimate using an arbitrary neural network that outputs face orientation estimates.

The face orientation model parameter optimization unit 160 takes the face orientation estimate and the face orientation correct label as input, calculates the error of the estimated face orientation rotation angles, and updates the model parameter θ_p based on the error.

The processing described above is the learning procedure for one batch (a chunk of data partially selected from the learning data). By repeating it, the entire data set can be learned any number of times.

The details of each unit follow.

<Data selection unit 110>
Input: face keypoint learning face images S^k_1,…,S^k_|M|; face orientation learning face images S^p_1,…,S^p_|N|
Output: one batch of learning face images S^b; data identifier i^b
The data selection unit 110 takes as input the face keypoint learning face images S^k_1,…,S^k_|M| and the face orientation learning face images S^p_1,…,S^p_|N|, and selects the data used in one round of learning, that is, one batch of learning face images S^b (S110). Here, b is an index indicating the batch number. For example, the face keypoint learning face images S^k_1,…,S^k_|M| and the face orientation learning face images S^p_1,…,S^p_|N| are each divided into B batches, so that b=1,2,…,2B. One batch of learning face images S^b consists of, for example, 16 images. Batches are created separately for the face keypoint learning face images and for the face orientation learning face images, so the learning face images in any one batch are either all face keypoint learning face images or all face orientation learning face images. As a data selection procedure, for example, the first batch of learning consists entirely of face keypoint learning face images, the second batch consists entirely of face orientation learning face images, and this alternation is repeated thereafter. The data identifier i^b is an identifier indicating whether one batch of learning face images S^b consists of face keypoint learning face images or of face orientation learning face images.
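The alternating selection described above can be sketched as follows. This is a minimal illustration, not the device's implementation: the identifier values `"k"` and `"p"` and the function names are assumptions, and when the two data sets yield different numbers of batches the leftover batches are simply emitted one after another.

```python
from itertools import zip_longest

KEYPOINT, ORIENTATION = "k", "p"  # hypothetical values of the data identifier i^b


def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def alternating_batches(keypoint_images, orientation_images, batch_size=16):
    """Yield (S^b, i^b) pairs, alternating keypoint and orientation batches."""
    k_batches = make_batches(keypoint_images, batch_size)
    p_batches = make_batches(orientation_images, batch_size)
    for k_batch, p_batch in zip_longest(k_batches, p_batches):
        if k_batch is not None:
            yield k_batch, KEYPOINT
        if p_batch is not None:
            yield p_batch, ORIENTATION
```

Each yielded batch is homogeneous, so the identifier alone tells the downstream units which branch should process it.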

<Shared network unit 120>
Input: one batch of learning face images S^b; data identifier i^b; the parameters, among the updated model parameters ^θ_k or ^θ_p, that correspond to the neural network constituting the shared network unit 120
Output: one batch of intermediate features v^b; data identifier i^b
The shared network unit 120 is composed of an arbitrary neural network, for example four convolutional layers.

Prior to the conversion process, the shared network unit 120 receives the parameters, among the updated model parameters ^θ_k (the output of the face keypoint model parameter optimization unit 140) or ^θ_p (the output of the face orientation model parameter optimization unit 160), that correspond to the arbitrary neural network constituting the shared network unit 120.

Using the parameters, among the updated model parameters ^θ_k or ^θ_p, that correspond to the arbitrary neural network constituting the shared network unit 120, the shared network unit 120 converts one batch of learning face images S^b into intermediate features v^b by the function of that neural network (S120). For example, when the learning face images S^b of the b-th batch consist of Q images, let v^b_q be the intermediate feature corresponding to the q-th image, so that v^b=[v^b_1,v^b_2,…,v^b_Q], where q=1,2,…,Q.

The shared network unit 120 also outputs the data identifier i^b unchanged.

<Face keypoint network unit 130>
Input: one batch of intermediate features v^b; data identifier i^b; the parameters, among the updated model parameters ^θ_k, that correspond to the neural network constituting the face keypoint network unit 130
Output: estimated face keypoint vector Z'_k,b
The face keypoint network unit 130 is composed of an arbitrary neural network, for example two fully connected layers.

Prior to the conversion process, the face keypoint network unit 130 receives the parameters, among the updated model parameters ^θ_k (the output of the face keypoint model parameter optimization unit 140), that correspond to the arbitrary neural network constituting the face keypoint network unit 130.

When the data identifier i^b indicates that the learning face images S^b consist of face keypoint learning face images (YES in S129), the face keypoint network unit 130 uses the received parameters to convert one batch of intermediate features v^b into the estimated face keypoint vector Z'_k,b by the function of the arbitrary neural network (S130). An estimated face keypoint vector is vector data storing, as an array, the estimated coordinates of each face keypoint. In other words, an estimated face keypoint vector is an estimate of a face keypoint vector. A face keypoint vector is vector data storing the coordinates of each face keypoint. For example, for a model that estimates 68 two-dimensional keypoints, with X1 the X coordinate and Y1 the Y coordinate of the first keypoint, the face keypoint vector is a vector such as [[X1,Y1],…,[X68,Y68]]. For a model that estimates three-dimensional keypoints, with Z1 the Z coordinate of the first keypoint, it is a vector such as [[X1,Y1,Z1],…,[X68,Y68,Z68]]. For example, let Z'_k,b,q be the estimated face keypoint vector corresponding to the q-th intermediate feature v^b_q of the b-th batch, so that Z'_k,b=[Z'_k,b,1,Z'_k,b,2,…,Z'_k,b,Q]. Each estimated face keypoint vector Z'_k,b,q is a vector such as [[X1,Y1],…,[X68,Y68]] or [[X1,Y1,Z1],…,[X68,Y68,Z68]].

When the data identifier i^b indicates that the learning face images S^b consist of face orientation learning face images, the face keypoint network unit 130 and the face keypoint model parameter optimization unit 140 perform no processing.
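The routing by data identifier can be sketched as below. This is a toy under stated assumptions: plain linear maps stand in for the convolutional and fully connected layers, each "image" is already a short feature vector, and the identifier value `"k"` and the dictionary keys are hypothetical names.

```python
def linear(x, weights, bias):
    """Apply y = W x + b to one feature vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def forward_batch(batch, identifier, params):
    """Run one batch through the shared part, then through exactly one branch."""
    shared_w, shared_b = params["shared"]
    features = [linear(x, shared_w, shared_b) for x in batch]  # intermediate features v^b
    # The identifier decides which branch runs; the other branch does nothing.
    branch = "keypoint" if identifier == "k" else "orientation"
    head_w, head_b = params[branch]
    return [linear(v, head_w, head_b) for v in features]
```

Only the selected branch is evaluated for a given batch, which mirrors the statement that the other network unit and its optimization unit perform no processing.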

<Face keypoint model parameter optimization unit 140>
Input: estimated face keypoint vector Z'_k,b; face keypoint correct labels T^k_1,…,T^k_|M|
Output: updated model parameter ^θ_k or learned model parameter θ_k
The model parameters ^θ_k and θ_k are each the concatenation of the parameters corresponding to the neural network constituting the shared network unit 120 and the parameters corresponding to the neural network constituting the face keypoint network unit 130.
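Because the shared network unit 120 takes its parameters from whichever of ^θ_k or ^θ_p was updated last, the shared part is effectively common to both parameter sets. One way to picture this, offered only as an assumption about how it could be realized, is to let both sets hold a reference to the same shared list. The dictionary layout and the function name are hypothetical.

```python
def build_parameter_sets(shared, keypoint_head, orientation_head):
    """Return (theta_k, theta_p); both reference the same shared parameter list."""
    theta_k = {"shared": shared, "head": keypoint_head}
    theta_p = {"shared": shared, "head": orientation_head}
    return theta_k, theta_p


shared = [0.1, 0.2]
theta_k, theta_p = build_parameter_sets(shared, [1.0], [2.0, 3.0])

# An update made while training the keypoint branch...
theta_k["shared"][0] -= 0.05
# ...is visible through theta_p as well, because the shared part is one object.
```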

The face keypoint model parameter optimization unit 140 updates and thereby optimizes the model parameter ^θ_k using the estimated face keypoint vector Z'_k,b and the face keypoint correct labels T^k_1,…,T^k_|M| (S140). For example, the face keypoint model parameter optimization unit 140 calculates the error between the estimated face keypoint vector Z'_k,b and the face keypoint correct labels T^k_1,…,T^k_|M|, and updates the model parameter ^θ_k so as to minimize the error. The error can be, for example, the mean squared error (MSE) or the mean absolute error (MAE), and the parameters can be updated by, for example, gradient descent.
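The two error measures and a gradient descent update can be sketched as follows. The one-weight model `prediction = weight * x`, the learning rate, and the step count are illustrative assumptions chosen to keep the example short; they are not taken from the text.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mae(pred, target):
    """Mean absolute error between two equal-length lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mse_grad_wrt_weight(weight, inputs, targets):
    """Derivative of mse([weight * x], targets) with respect to weight."""
    n = len(inputs)
    return sum(2.0 * (weight * x - t) * x for x, t in zip(inputs, targets)) / n


def fit_weight(inputs, targets, weight=0.0, learning_rate=0.05, steps=100):
    """Plain gradient descent on the MSE of the toy model prediction = weight * x."""
    for _ in range(steps):
        weight -= learning_rate * mse_grad_wrt_weight(weight, inputs, targets)
    return weight
```

In a real network the same update is applied to every entry of the parameter vector, with the gradients obtained by backpropagation.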

The above processes S120 to S140 are repeated until a predetermined condition is satisfied (S170). The predetermined condition is a condition for determining whether the parameter updates have converged. For example, the condition may be that (i) the number of updates has exceeded a predetermined number, or (ii) the difference between the parameters before and after an update is smaller than a predetermined value.
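The two example conditions can be written as a small check. The thresholds and the use of the largest absolute per-parameter change as the "difference" are assumptions; the text does not specify how the difference is measured.

```python
def should_stop(update_count, old_params, new_params, max_updates=1000, tolerance=1e-6):
    """Return True when either example convergence condition holds."""
    # (i) the number of updates has exceeded a predetermined number
    if update_count > max_updates:
        return True
    # (ii) the parameters changed by less than a predetermined value
    largest_change = max(abs(n - o) for n, o in zip(new_params, old_params))
    return largest_change < tolerance
```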

<Face orientation network unit 150>
Input: one batch of intermediate features v^b; data identifier i^b; the parameters, among the updated model parameters ^θ_p, that correspond to the neural network constituting the face orientation network unit 150
Output: estimated face orientation vector Z'_p,b
The face orientation network unit 150 is composed of an arbitrary neural network, for example two fully connected layers.

Prior to the conversion process, the face orientation network unit 150 receives the parameters, among the updated model parameters ^θ_p (the output of the face orientation model parameter optimization unit 160), that correspond to the arbitrary neural network constituting the face orientation network unit 150.

When the data identifier i^b indicates that the learning face images S^b consist of face orientation learning face images (NO in S129), the face orientation network unit 150 uses the received parameters to convert one batch of intermediate features v^b into the estimated face orientation vector Z'_p,b by the function of the arbitrary neural network (S150). An estimated face orientation vector is vector data storing the estimated rotation angles of the face orientation in the three directions roll, pitch, and yaw. In other words, an estimated face orientation vector is an estimate of a face orientation vector. A face orientation vector is a vector storing the rotation angles of the face orientation in the three directions roll, pitch, and yaw. The rotation angle values range, for example, from -180 degrees to 180 degrees, and a face orientation vector is a vector such as [20, 43, -151]. For example, let Z'_p,b,q be the estimated face orientation vector corresponding to the q-th intermediate feature v^b_q of the b-th batch, so that Z'_p,b=[Z'_p,b,1,Z'_p,b,2,…,Z'_p,b,Q]. Each estimated face orientation vector Z'_p,b,q is a vector such as [20, 43, -151].

When the data identifier i^b indicates that the learning face images S^b consist of face keypoint learning face images, the face orientation network unit 150 and the face orientation model parameter optimization unit 160 perform no processing.

＜顔向きモデルパラメータ最適化部１６０＞
入力：推定顔向きベクトルZ'_p,b、顔向き正解ラベルT^p ₁,…,T^p _|N|
出力：更新したモデルパラメータ^θ_pまたは学習済みモデルパラメータθ_p
　モデルパラメータ^θ_pおよびθ_pは、共有ネットワーク部１２０を構成するニューラルネットワークに対応するパラメータと顔向きネットワーク部１５０を構成するニューラルネットワークに対応するパラメータとを連結したものである。 <Face pose model parameter optimization unit 160>
Input: estimated face orientation vector Z' _p,b , correct face orientation label T ^p ₁ ,...,T ^p _|N|
Output: updated model parameters ^θ _p or learned model parameters θ _p
The model parameters ^θ _p and θ _p are obtained by concatenating the parameters corresponding to the neural networks forming the shared network section 120 and the parameters corresponding to the neural networks forming the face orientation network section 150 .

　顔向きモデルパラメータ最適化部１６０は、推定顔向きベクトルZ'_p,bと顔向き正解ラベルT^p ₁,…,T^p _|N|とを用いて、モデルパラメータ^θ_pを更新し（Ｓ１６０）、最適化を行う。例えば、顔向きモデルパラメータ最適化部１６０は、推定顔向きベクトルZ'_p,bと顔向き正解ラベルT^p ₁,…,T^p _|N|との間の誤差を計算し、誤差を最小化するようにモデルパラメータ^θ_pを更新し、最適化を行う。誤差は例えばMSE誤差やMAE誤差などを用いることができ、パラメータの更新方法としては勾配降下法等を用いることができる。 The face orientation model parameter optimization unit 160 _{updates the model parameter ^θ p} ^using the estimated face orientation vector _Z'p _,b and the face orientation correct label ^Tp1 ,...,Tp _|N| (S160 ) and perform optimization. For example, the face orientation model parameter optimization unit 160 calculates the error between the estimated face orientation vector Z′ ^p _,b and the face _orientation correct label T ^p ₁ , . Update the model parameter ^θ _p so that For example, an MSE error or an MAE error can be used as the error, and a gradient descent method or the like can be used as a parameter update method.

The above processes S120 to S160 are repeated until a predetermined condition is satisfied (S170).

Furthermore, the above processes S110 to S170 are performed for all batch data (learning data). For example, it is determined whether any unprocessed batch data remains (S180); if unprocessed batch data remains (NO in S180), processes S110 to S170 are performed on it, and if none remains (YES in S180), the processing ends.
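The control flow of steps S110 to S180 can be sketched as two nested loops. The sketch below is only one reading of the flow: the function and parameter names are hypothetical, the two update callbacks stand in for the keypoint and orientation branches, the stopping rule `condition_met` is left to the caller because the text says only "a predetermined condition", and the step numbers in the comments are an approximate mapping.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

KEYPOINT = "keypoint"
ORIENTATION = "orientation"

# (learning face images S^b, correct labels, data identifier i^b)
Batch = Tuple[Sequence, Sequence, str]


def train_all_batches(
    batches: Iterable[Batch],
    update_keypoint: Callable[[Sequence, Sequence], None],
    update_orientation: Callable[[Sequence, Sequence], None],
    condition_met: Callable[[int], bool],
) -> List[str]:
    """Run the per-batch update loop and return which branch ran each time."""
    history: List[str] = []
    for images, labels, identifier in batches:    # next batch; stops when none is left (S180)
        iteration = 0
        while True:                               # S120 to S160
            if identifier == KEYPOINT:            # branch on the data identifier (S129)
                update_keypoint(images, labels)   # keypoint head and its optimizer
            else:
                update_orientation(images, labels)  # orientation head (S150) and optimizer (S160)
            history.append(identifier)
            iteration += 1
            if condition_met(iteration):          # predetermined condition (S170)
                break
    return history
```

For instance, passing `condition_met=lambda n: n >= 2` performs two updates on each batch before moving to the next one.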

After the above processing has been performed on all batch data, the finally obtained updated model parameters ^θ_k and ^θ_p are output as the learned model parameters θ_k and θ_p.

Next, the estimation device 200 will be described.

<Estimation device 200>
The estimation device 200 receives the learned model parameters θ_k and θ_p prior to the estimation processing. It takes the face image S to be estimated as input, estimates face keypoints using the learned model parameter θ_k, estimates the face orientation using the learned model parameter θ_p, and outputs the estimated face keypoint vector Z_k and the estimated face orientation vector Z_p.

FIG. 4 is a functional block diagram of the estimation device 200, and FIG. 5 shows its processing flow.

The estimation device 200 includes a face keypoint estimation unit 210 and a face orientation estimation unit 220.

<Face keypoint estimation unit 210>
Input: face image S, model parameter θ_k
Output: estimated face keypoint vector Z_k for the face image S
The face keypoint estimation unit 210 receives the model parameter θ_k prior to the estimation processing.

The face keypoint estimation unit 210 estimates face keypoints from the face image S using the network architecture of the shared network unit 120 followed by the face keypoint network unit 130, together with the model parameter θ_k (S210), and obtains the estimate (the estimated face keypoint vector Z_k). For example, when the shared network unit 120 is an arbitrary neural network consisting of four convolutional layers and the face keypoint network unit 130 is an arbitrary neural network consisting of two fully connected layers, the face keypoint estimation unit 210 is a neural network consisting of those four convolutional layers followed by the two fully connected layers, and this neural network uses the model parameter θ_k.

<Face orientation estimation unit 220>
Input: face image S, model parameter θ_p
Output: estimated face orientation vector Z_p for the face image S
The face orientation estimation unit 220 receives the model parameter θ_p prior to the estimation processing.

The face orientation estimation unit 220 estimates the face orientation from the face image S using the network architecture of the shared network unit 120 followed by the face orientation network unit 150, together with the model parameter θ_p (S220), and obtains the estimate (the estimated face orientation vector Z_p). For example, when the shared network unit 120 is an arbitrary neural network consisting of four convolutional layers and the face orientation network unit 150 is an arbitrary neural network consisting of two fully connected layers, the face orientation estimation unit 220 is a neural network consisting of those four convolutional layers followed by the two fully connected layers, and this neural network uses the model parameter θ_p.
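The way each estimation unit chains the shared part and a task-specific part can be illustrated with a toy sketch. Everything concrete here is an assumption for illustration: one dense layer with ReLU stands in for the four convolutional layers, one dense layer stands in for the two fully connected layers, the dictionary layout of `theta_k` and `theta_p` is hypothetical, and the numeric weights are made up. Both parameter sets reuse the same shared-layer values purely for brevity.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector]  # (weight matrix, bias vector)


def dense(layer: Layer, x: Vector) -> Vector:
    """Apply one fully connected layer: W x + b."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def estimate(face_image: Vector, theta: Dict[str, Layer]) -> Vector:
    """Shared part first, then the task-specific head, both read from theta."""
    intermediate = relu(dense(theta["shared"], face_image))  # stands in for unit 120
    return dense(theta["head"], intermediate)                # stands in for unit 130 or 150


shared_layer: Layer = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])

# theta_k: shared parameters plus a head that outputs 2 keypoints x 2 coordinates.
theta_k = {
    "shared": shared_layer,
    "head": ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]], [0.0, 0.0, 0.0, 0.0]),
}
# theta_p: shared parameters plus a head that outputs roll, pitch, yaw.
theta_p = {
    "shared": shared_layer,
    "head": ([[20.0, 0.0], [43.0, 0.0], [-151.0, 0.0]], [0.0, 0.0, 0.0]),
}

face_image = [1.0, -2.0]              # a flattened toy "image"
z_k = estimate(face_image, theta_k)   # estimated face keypoint vector
z_p = estimate(face_image, theta_p)   # estimated face orientation vector
```

The point of the sketch is structural: the same `estimate` function serves both tasks, and only the parameter set passed in decides whether the output is a keypoint vector or a three-component orientation vector.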

<Effects>
With the above configuration, face keypoints and face orientation can be estimated in a way that captures more universal facial features.

<Modifications>
In this embodiment, the estimation device 200 estimates both face keypoints and face orientation, but it may be configured to estimate only one of them. Even in that case, because the model parameters used for estimating face keypoints and face orientation are learned within a single neural network system during training, the device can estimate face keypoints or face orientation that capture more universal facial features.

In this embodiment, training alternates between batches composed entirely of face images for face keypoint learning and batches composed entirely of face images for face orientation learning. The batches need not alternate one at a time; for example, they may alternate ten at a time. However, alternating one batch at a time is desirable so that training does not become biased toward either set of learning data.
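The batch ordering described here can be sketched as a small generator. The function name `alternate_batches` and the `group_size` parameter are hypothetical; `group_size=1` corresponds to alternating one batch at a time, and `group_size=10` to the ten-at-a-time variant. How leftover batches are handled when the two sets differ in size is not specified in the text, so the sketch simply emits them in order.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def alternate_batches(
    keypoint_batches: Iterable[T],
    orientation_batches: Iterable[T],
    group_size: int = 1,
) -> Iterator[T]:
    """Yield group_size keypoint batches, then group_size orientation batches, repeatedly."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    kp_iter = iter(keypoint_batches)
    po_iter = iter(orientation_batches)
    while True:
        kp_group = list(islice(kp_iter, group_size))
        po_group = list(islice(po_iter, group_size))
        if not kp_group and not po_group:
            return  # both sources are exhausted
        yield from kp_group
        yield from po_group
```

With three batches of each kind, `group_size=1` gives the order k1, p1, k2, p2, k3, p3, while `group_size=2` gives k1, k2, p1, p2, k3, p3.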

<Other modifications>
The present invention is not limited to the above embodiment and modifications. For example, the various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually according to the processing capacity of the device executing them, or as needed. Other changes may be made as appropriate without departing from the gist of the present invention.

<Program and recording medium>
The various processes described above can be carried out by loading a program that executes each step of the above method into the storage unit 2020 of the computer shown in FIG. 6, and causing the control unit 2010, the input unit 2030, the output unit 2040, and so on to operate.

The program describing this processing can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which it is recorded. The program may also be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the program it has read. Alternatively, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The processing may also be executed by a so-called ASP (Application Service Provider) type service, which implements the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. The program in this embodiment includes information that is provided for processing by a computer and is equivalent to a program (for example, data that is not a direct command to a computer but has the property of defining the computer's processing).

In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing may be implemented in hardware.

Claims

A learning device comprising:
a shared network unit that converts a learning face image S^b into an intermediate feature v^b by a neural network function using a model parameter ^θ_k or ^θ_p;
a face keypoint network unit that, when the learning face image S^b consists of face images for face keypoint learning, converts the intermediate feature v^b into an estimated face keypoint vector Z'_k,b by a neural network function using the model parameter ^θ_k;
a face keypoint model parameter optimization unit that updates the model parameter ^θ_k using the estimated face keypoint vector Z'_k,b and face keypoint correct labels T^k_1,…,T^k_|M| for the learning face image S^b;
a face orientation network unit that, when the learning face image S^b consists of face images for face orientation learning, converts the intermediate feature v^b into an estimated face orientation vector Z'_p,b by a neural network function using the model parameter ^θ_p; and
a face orientation model parameter optimization unit that updates the model parameter ^θ_p using the estimated face orientation vector Z'_p,b and face orientation correct labels T^p_1,…,T^p_|N|,
wherein the learning device obtains a learned model parameter θ_k corresponding to the model parameter ^θ_k and a learned model parameter θ_p corresponding to the model parameter ^θ_p.
An estimation device that uses the model parameter θ_k learned by the learning device of claim 1, comprising:
a face keypoint estimation unit that estimates face keypoints from a face image S to be estimated, using the network architecture of the shared network unit followed by the face keypoint network unit, and the model parameter θ_k.
An estimation device that uses the model parameter θ_p learned by the learning device of claim 1, comprising:
a face orientation estimation unit that estimates the face orientation from the face image S, using the network architecture of the shared network unit followed by the face orientation network unit, and the model parameter θ_p.
An estimation device that uses the model parameter θ_k and the model parameter θ_p learned by the learning device of claim 1, comprising:
a face keypoint estimation unit that estimates face keypoints from a face image S to be estimated, using the network architecture of the shared network unit followed by the face keypoint network unit, and the model parameter θ_k; and
a face orientation estimation unit that estimates the face orientation from the face image S, using the network architecture of the shared network unit followed by the face orientation network unit, and the model parameter θ_p.
A learning method comprising:
a shared network step of converting a learning face image S^b into an intermediate feature v^b by a neural network function using a model parameter ^θ_k or ^θ_p;
a face keypoint network step of, when the learning face image S^b consists of face images for face keypoint learning, converting the intermediate feature v^b into an estimated face keypoint vector Z'_k,b by a neural network function using the model parameter ^θ_k;
a face keypoint model parameter optimization step of updating the model parameter ^θ_k using the estimated face keypoint vector Z'_k,b and face keypoint correct labels T^k_1,…,T^k_|M| for the learning face image S^b;
a face orientation network step of, when the learning face image S^b consists of face images for face orientation learning, converting the intermediate feature v^b into an estimated face orientation vector Z'_p,b by a neural network function using the model parameter ^θ_p; and
a face orientation model parameter optimization step of updating the model parameter ^θ_p using the estimated face orientation vector Z'_p,b and face orientation correct labels T^p_1,…,T^p_|N|,
wherein a learned model parameter θ_k corresponding to the model parameter ^θ_k and a learned model parameter θ_p corresponding to the model parameter ^θ_p are obtained.
An estimation method that uses the model parameter θ_k learned by the learning method of claim 5, comprising:
a face keypoint estimation step of estimating face keypoints from a face image S to be estimated, using the network architecture of the shared network step followed by the face keypoint network step, and the model parameter θ_k.
An estimation method that uses the model parameter θ_p learned by the learning method of claim 5, comprising:
a face orientation estimation step of estimating the face orientation from the face image S, using the network architecture of the shared network step followed by the face orientation network step, and the model parameter θ_p.
A program for causing a computer to function as the learning device of claim 1 or the estimation device of any one of claims 2 to 4.