JP6879072B2

JP6879072B2 - Processing methods, programs, information processing equipment, and image processing equipment

Info

Publication number: JP6879072B2
Application number: JP2017121615A
Authority: JP
Inventors: 敬貴小宮山
Original assignee: Konica Minolta Inc
Current assignee: Konica Minolta Inc
Priority date: 2017-06-21
Filing date: 2017-06-21
Publication date: 2021-06-02
Anticipated expiration: 2037-06-21
Also published as: JP2019008421A

Description

本発明は、処理方法、プログラム、情報処理装置、および画像処理装置に関する。 The present invention relates to processing methods, programs, information processing devices, and image processing devices.

従来、取得した画像から物体を認識する画像認識の分野では認識精度の改善が図られている。これは、多層構造の畳み込みニューラルネットワーク（ＣｏｎｖｏｌｕｔｉｏｎａｌＮｅｕｒａｌＮｅｔｗｏｒｋ：以下、「ＣＮＮ」という）によるところが大きい。 Conventionally, recognition accuracy has been improved in the field of image recognition in which an object is recognized from an acquired image. This is largely due to the multi-layered convolutional neural network (Convolutional Neural Network: hereinafter referred to as "CNN").

ＣＮＮ処理では、複数の畳込み層が含まれており、この畳み込み層では畳み込み演算を行う。畳み込み演算では、複数の積和演算を繰り返し行うため、多くの処理が必要になり時間がかかるという問題がある。 The CNN process includes a plurality of convolutional layers, and the convolutional layer performs a convolutional operation. In the convolution operation, since a plurality of product-sum operations are repeatedly performed, there is a problem that many processes are required and it takes time.

例えば、特許文献１に開示された技術では、グラフィックプロセッサ（ＧｒａｐｈｉｃｓＰｒｏｃｅｓｓｉｎｇＵｎｉｔ：以下、「ＧＰＵ」という）を用いて、畳み込み演算を行っている。畳み込み演算時にＧＰＵのメモリに処理対象の画像データを単純に展開した場合には、必要なメモリサイズが増加し、コストが増加するという課題がある。この課題を解決するために、特許文献１に開示された技術では、画像データを複数のデータブロックに分割し、ローカルメモリに分割した複数のデータブロックと、複数のフィルターを同時に読み込んで畳み込み演算を並列的に計算している。 For example, the technique disclosed in Patent Document 1 uses a graphics processor (Graphics Processing Unit; hereinafter "GPU") to perform the convolution operation. If the image data to be processed is simply expanded in GPU memory during the convolution operation, the required memory size grows and the cost increases. To solve this problem, the technique disclosed in Patent Document 1 divides the image data into a plurality of data blocks, simultaneously reads the divided data blocks and a plurality of filters into local memory, and computes the convolution operations in parallel.

特開２０１６−４５７２号公報Japanese Unexamined Patent Publication No. 2016-4572

しかしながら、特許文献１に開示された技術は、ＧＰＵを前提とするものである。一方で、例えば監視カメラ等の装置への搭載した組み込みシステムによる画像認識処理を想定した場合においては、組み込みシステムにＧＰＵを搭載すると、消費電力量が大きく現実的ではない。結果として、消費電力量の小さいＣＰＵを利用せざるをえない。 However, the technique disclosed in Patent Document 1 is premised on GPU. On the other hand, when image recognition processing is assumed by an embedded system mounted on a device such as a surveillance camera, if the GPU is mounted on the embedded system, the power consumption is large and it is not realistic. As a result, there is no choice but to use a CPU with low power consumption.

本発明は、上記事情に鑑みてなされたものであり、消費電力量の小さいプロセッサ、特にＳＩＭＤ（ＳｉｎｇｌｅＩｎｓｔｒｕｃｔｉｏｎＭｕｌｔｉｐｌｅＤａｔａ）命令を処理可能なプロセッサを含む情報処理装置を用いて、効率的に畳み込み演算を行う処理方法を提供することを目的とする。 The present invention has been made in view of the above circumstances, and an efficient convolution operation is performed by using an information processing device including a processor having a small power consumption, particularly a processor capable of processing SIMD (Single Instruction Multiple Data) instructions. It is an object of the present invention to provide a processing method for performing the above.

本発明の上記目的は、下記の手段によって達成される。 The above object of the present invention is achieved by the following means.

（１）ＳＩＭＤ用に利用可能な複数のレジスターと、
ＳＩＭＤ処理するプロセッサと、を備える情報処理装置を制御する方法であって、
（ａ）カーネルサイズから使用する前記レジスターのサイズを選択するステップと、
（ｂ）選択したサイズのレジスターの数、前記カーネルサイズから、カーネルの分割量を決定するステップと、
（ｃ）決定した前記カーネルの分割量で分割したカーネル毎に、前記レジスターに格納し、畳み込み演算を並列処理するステップと、
を含む、処理方法。 (1) A processing method for controlling an information processing device that includes a plurality of registers available for SIMD and a processor that performs SIMD processing, the method including:
(a) a step of selecting, from the kernel size, the size of the registers to be used;
(b) a step of determining the kernel division amount from the number of registers of the selected size and the kernel size; and
(c) a step of storing each kernel portion, divided by the determined kernel division amount, in the registers and processing the convolution operation in parallel.

（２）前記レジスターに１度に格納するカーネルの行数をｎとしたとき、
前記ステップ（ｂ）では、下記式を満たす最大の値に、前記ｎを設定する、上記（１）に記載の処理方法。
ｋ−（ｘ＋ｊ）≧（ｂ＋１）ｎ
ここでｊは、ｊ＝（ｗ＋１−ｙ）／ｂ、
ｂ＝ｗ／ｑ、
ｊ、ｂは小数点以下を切り上げた整数であり、
ｗ：カーネルサイズ
ｋ：利用可能な前記レジスターの個数
ｑ：１つの前記レジスターに格納するデータ数
ｘ：畳み込み演算に必要な前記レジスターの個数
ｙ：スキップする画素数である。 (2) The processing method according to (1) above, wherein, when n is the number of kernel rows stored in the registers at one time, step (b) sets n to the largest value satisfying the following formula:
k - (x + j) ≥ (b + 1)n
where j = (w + 1 - y) / b and b = w / q, with j and b each rounded up to an integer, and
w: kernel size
k: number of available registers
q: number of data elements stored in one register
x: number of registers required for the convolution operation
y: number of pixels to skip.

（３）前記カーネルサイズが１１×１１のときに、
前記ステップ（ｂ）では、前記カーネルを２行ずつの分割に決定し、
前記ステップ（ｃ）では、前記カーネルを２行分毎に前記レジスターに格納し、格納した後、２行毎に畳み込み演算を実行する、上記（１）に記載の処理方法。 (3) The processing method according to (1) above, wherein, when the kernel size is 11 × 11, step (b) decides to divide the kernel into two-row portions, and step (c) stores the kernel in the registers two rows at a time and then executes the convolution operation for each two rows.

（４）前記カーネルサイズが１１×１１のときに、
前記ステップ（ａ）では、１２８ビットのレジスターを選択し、
前記ステップ（ｃ）では、入力画素４画素分を１２８ビットのレジスターに格納し、４画素ずつの畳み込み演算を並列処理する、上記（３）に記載の処理方法。 (4) When the kernel size is 11 × 11,
In step (a), a 128-bit register is selected.
The processing method according to (3) above, wherein in step (c), four input pixels are stored in a 128-bit register, and convolution operations of four pixels are processed in parallel.

（５）前記カーネルサイズが５×５以下のときに、
前記ステップ（ｂ）では、前記カーネルを分割しないことを決定し、
前記ステップ（ｃ）では、前記カーネルを１度に前記レジスターに格納し、格納した後に畳み込み演算を実行する、上記（１）に記載の処理方法。 (5) When the kernel size is 5 × 5 or less
In step (b), it is decided not to split the kernel.
The processing method according to (1) above, wherein in the step (c), the kernel is stored in the register at one time, and after the storage, the convolution operation is executed.

（６）前記カーネルサイズが５×５以下のときに、
前記ステップ（ａ）では、６４ビットのレジスターを選択し、
前記ステップ（ｃ）では、入力画素２画素分を６４ビットの前記レジスターに格納し、２画素ずつの畳み込み演算を並列処理する、上記（５）に記載の処理方法。 (6) When the kernel size is 5 × 5 or less
In step (a), a 64-bit register is selected and
The processing method according to (5) above, wherein in the step (c), two input pixels are stored in the 64-bit register, and a convolution operation of two pixels is processed in parallel.

（７）前記複数のレジスターのうち、カーネルを格納するレジスターを固定しておき、入力画素用のレジスターに格納するデータをスライドさせながら、中間バッファに格納した結果データを更新する、上記（１）から上記（６）のいずれか１つに記載の処理方法。 (7) The processing method according to any one of (1) to (6) above, wherein, among the plurality of registers, the registers storing the kernel are kept fixed, and the result data stored in the intermediate buffer is updated while the data stored in the input-pixel registers is slid.

（８）ＳＩＭＤ用に利用可能な複数のレジスターと、
ＳＩＭＤ処理するプロセッサと、を備える情報処理装置を制御するプログラムであって、
（ａ）カーネルサイズから使用する前記レジスターのサイズを選択するステップと、
（ｂ）選択したサイズのレジスターの数、前記カーネルサイズから、カーネルの分割量を決定するステップと、
（ｃ）決定した前記カーネルの分割量で分割したカーネル毎に、前記レジスターに格納し、畳み込み演算を並列処理するステップと、
を含む、方法を実行するためのプログラム。 (8) A program for controlling an information processing device that includes a plurality of registers available for SIMD and a processor that performs SIMD processing, the program causing execution of a method including:
(a) a step of selecting, from the kernel size, the size of the registers to be used;
(b) a step of determining the kernel division amount from the number of registers of the selected size and the kernel size; and
(c) a step of storing each kernel portion, divided by the determined kernel division amount, in the registers and processing the convolution operation in parallel.

（９）前記レジスターに１度に格納するカーネルの行数をｎとしたとき、
前記ステップ（ｂ）では、下記式を満たす最大の値に、前記ｎを設定する、上記（８）に記載のプログラム。
ｋ−（ｘ＋ｊ）≧（ｂ＋１）ｎ
ここでｊは、ｊ＝（ｗ＋１−ｙ）／ｂ、
ｂ＝ｗ／ｑ、
ｊ、ｂは小数点以下を切り上げた整数であり、
ｗ：カーネルサイズ
ｋ：利用可能な前記レジスターの個数
ｑ：１つの前記レジスターに格納するデータ数
ｘ：畳み込み演算に必要な前記レジスターの個数
ｙ：スキップする画素数である。 (9) The program according to (8) above, wherein, when n is the number of kernel rows stored in the registers at one time, step (b) sets n to the largest value satisfying the following formula:
k - (x + j) ≥ (b + 1)n
where j = (w + 1 - y) / b and b = w / q, with j and b each rounded up to an integer, and
w: kernel size
k: number of available registers
q: number of data elements stored in one register
x: number of registers required for the convolution operation
y: number of pixels to skip.

（１０）前記カーネルサイズが１１×１１のときに、
前記ステップ（ｂ）では、前記カーネルを２行ずつの分割に決定し、
前記ステップ（ｃ）では、前記カーネルを２行分毎に前記レジスターに格納し、格納した後、２行毎に畳み込み演算を実行する、上記（８）に記載のプログラム。 (10) The program according to (8) above, wherein, when the kernel size is 11 × 11, step (b) decides to divide the kernel into two-row portions, and step (c) stores the kernel in the registers two rows at a time and then executes the convolution operation for each two rows.

（１１）前記カーネルサイズが１１×１１のときに、
前記ステップ（ａ）では、１２８ビットのレジスターを選択し、
前記ステップ（ｃ）では、入力画素４画素分を１２８ビットのレジスターに格納し、４画素ずつの畳み込み演算を並列処理する、上記（１０）に記載のプログラム。 (11) When the kernel size is 11 × 11,
In step (a), a 128-bit register is selected.
The program according to (10) above, in the step (c), the input pixel 4 pixels are stored in a 128-bit register, and the convolution operation of 4 pixels is processed in parallel.

（１２）前記カーネルサイズが５×５以下のときに、
前記ステップ（ｂ）では、前記カーネルを分割しないことを決定し、
前記ステップ（ｃ）では、前記カーネルを１度に前記レジスターに格納し、格納した後に畳み込み演算を実行する、上記（８）に記載のプログラム。 (12) When the kernel size is 5 × 5 or less
In step (b), it is decided not to split the kernel.
The program according to (8) above, in the step (c), the kernel is stored in the register at one time, and after the storage, the convolution operation is executed.

（１３）前記カーネルサイズが５×５以下のときに、
前記ステップ（ａ）では、６４ビットのレジスターを選択し、
前記ステップ（ｃ）では、入力画素２画素分を６４ビットの前記レジスターに格納し、２画素ずつの畳み込み演算を並列処理する、上記（１２）に記載のプログラム。 (13) The program according to (12) above, wherein, when the kernel size is 5 × 5 or less, step (a) selects 64-bit registers, and step (c) stores two input pixels in each 64-bit register and processes the convolution operation in parallel, two pixels at a time.

（１４）前記複数のレジスターのうち、カーネルを格納するレジスターを固定しておき、入力画素用のレジスターに格納するデータをスライドさせながら、中間バッファに格納した結果データを更新する、上記（８）から上記（１３）のいずれか１つに記載のプログラム。 (14) The program according to any one of (8) to (13) above, wherein, among the plurality of registers, the registers storing the kernel are kept fixed, and the result data stored in the intermediate buffer is updated while the data stored in the input-pixel registers is slid.

（１５）ＳＩＭＤ用に利用可能な複数のレジスターと、
ＳＩＭＤ処理するプロセッサと、を備える情報処理装置であって、
カーネルサイズから使用する前記レジスターのサイズを選択し、選択したサイズのレジスターの数、前記カーネルサイズから、カーネルの分割量を決定し、決定した前記カーネルの分割量で分割したカーネル毎に、前記レジスターに格納し、畳み込み演算を並列処理する、情報処理装置。 (15) An information processing device including a plurality of registers available for SIMD and a processor that performs SIMD processing, wherein the device selects the size of the registers to be used from the kernel size, determines the kernel division amount from the number of registers of the selected size and the kernel size, stores each kernel portion divided by the determined division amount in the registers, and processes the convolution operation in parallel.

（１６）前記決定した分割量で分割され、前記レジスターに１度に格納するカーネルの行数をｎとしたとき、
下記式を満たす最大の値に、前記ｎを設定する、上記（１５）に記載の情報処理装置。
ｋ−（ｘ＋ｊ）≧（ｂ＋１）ｎ
ここでｊは、ｊ＝（ｗ＋１−ｙ）／ｂ、
ｂ＝ｗ／ｑ、
ｊ、ｂは小数点以下を切り上げた整数であり、
ｗ：カーネルサイズ
ｋ：利用可能な前記レジスターの個数
ｑ：１つの前記レジスターに格納するデータ数
ｘ：畳み込み演算に必要な前記レジスターの個数
ｙ：スキップする画素数である。 (16) The information processing device according to (15) above, wherein, when n is the number of rows of the kernel, divided by the determined division amount, that are stored in the registers at one time, n is set to the largest value satisfying the following formula:
k - (x + j) ≥ (b + 1)n
where j = (w + 1 - y) / b and b = w / q, with j and b each rounded up to an integer, and
w: kernel size
k: number of available registers
q: number of data elements stored in one register
x: number of registers required for the convolution operation
y: number of pixels to skip.

（１７）前記複数のレジスターのうち、カーネルを格納するレジスターを固定しておき、入力画素用のレジスターに格納するデータをスライドさせながら、中間バッファに格納した結果データを更新する、上記（１５）または上記（１６）に記載の情報処理装置。 (17) The information processing device according to (15) or (16) above, wherein, among the plurality of registers, the registers storing the kernel are kept fixed, and the result data stored in the intermediate buffer is updated while the data stored in the input-pixel registers is slid.

（１８）撮像装置が生成した画像を取得する画像取得部と、
前記画像に映る人の特徴、および前記画像に映る人の周辺物体の形状、位置又は種別を示す周辺特徴を抽出する、上記（１５）から上記（１７）のいずれか１つに記載の情報処理装置と、
を備える、画像処理装置。 (18) An image processing device including: an image acquisition unit that acquires an image generated by an imaging device; and the information processing device according to any one of (15) to (17) above, which extracts features of a person appearing in the image and peripheral features indicating the shape, position, or type of objects around that person.

本発明に係る処理方法によれば、カーネルサイズから使用するレジスターのサイズを選択し、選択したサイズのレジスターの数、カーネルサイズから、カーネルの分割量を決定し、決定したカーネルの分割量で分割したカーネル毎に、レジスターに格納し、畳み込み演算を並列処理する。このようにすることで、効率的に畳み込み演算を行える。 According to the processing method according to the present invention, the size of the register to be used is selected from the kernel size, the kernel division amount is determined from the number of registers of the selected size and the kernel size, and the kernel is divided by the determined kernel division amount. Each kernel is stored in a register and convolution operations are processed in parallel. By doing so, the convolution operation can be performed efficiently.

本発明の実施形態に係る情報処理装置を示すブロック図である。 A block diagram showing the information processing device according to an embodiment of the present invention.
５×５サイズのカーネルを６４ビットレジスターおよび１２８ビットレジスターにそれぞれ格納した状態を示す模式図である。 A schematic diagram showing a 5 × 5 kernel stored in 64-bit registers and in 128-bit registers, respectively.
１１×１１サイズのカーネルを１２８ビットレジスターに格納した状態を示す模式図である。 A schematic diagram showing an 11 × 11 kernel stored in 128-bit registers.
１１×１１サイズのカーネルの分割量が２行の場合の処理を説明する模式図である。 A schematic diagram explaining the processing when the division amount of an 11 × 11 kernel is two rows.
ＣＮＮ処理においてストライド４の場合の窓の重なり状態を示す図である。 A diagram showing how the windows overlap for a stride of 4 in the CNN processing.
情報処理装置が実行するＣＮＮ処理のフローチャートを示す図である。 A flowchart of the CNN processing executed by the information processing device.
図６Ａに続く、フローチャートを示す図である。 A flowchart continuing from FIG. 6A.
レジスターのデータ格納状態を示す模式図である。 A schematic diagram showing the data storage state of the registers.
レジスターのデータ格納状態を示す模式図である。 A schematic diagram showing the data storage state of the registers.
変形例に係るフローチャートを示す図である。 A flowchart according to a modification.
実施形態に係る画像処理装置を示すブロック図である。 A block diagram showing the image processing device according to the embodiment.
画像処理装置の機能ブロックを示す図である。 A diagram showing the functional blocks of the image processing device.

以下、添付した図面を参照して、本発明の実施形態を説明する。なお、図面の説明において同一の要素には同一の符号を付し、重複する説明を省略する。また、図面の寸法比率は、説明の都合上誇張されており、実際の比率とは異なる場合がある。 Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. In the description of the drawings, the same elements are designated by the same reference numerals, and duplicate description will be omitted. In addition, the dimensional ratios in the drawings are exaggerated for convenience of explanation and may differ from the actual ratios.

図１は本発明の実施形態に係る情報処理装置を示すブロック図である。同図に示すように情報処理装置１００は、ＳＩＭＤプロセッサ１０、汎用プロセッサ２０、メモリ３０を備え、これらの要素はデータバス、コマンドバス、アドレスバス等のバスにより互いに接続されている。この情報処理装置１００は、例えば監視カメラ等の装置の組み込みシステム用の情報処理装置である。 FIG. 1 is a block diagram showing an information processing apparatus according to an embodiment of the present invention. As shown in the figure, the information processing apparatus 100 includes a SIMD processor 10, a general-purpose processor 20, and a memory 30, and these elements are connected to each other by buses such as a data bus, a command bus, and an address bus. The information processing device 100 is an information processing device for an embedded system of a device such as a surveillance camera.

ＳＩＭＤプロセッサ１０は、１つの命令で複数のデータを演算するＳＩＭＤ型（単一命令複数データ処理）の命令を実行するプロセッサであり、例えばＡＲＭホールディングスのＳＩＭＤ拡張命令に対応したＮＥＯＮである。ＳＩＭＤ型の処理（以下、単に「ＳＩＭＤ処理」という）はベクトルデータ処理とも呼ばれる。 The SIMD processor 10 is a processor that executes SIMD type (single instruction and multiple data processing) instructions that calculate a plurality of data with one instruction, and is a NEON corresponding to the SIMD extension instruction of ARM Holdings, for example. SIMD type processing (hereinafter, simply referred to as "SIMD processing") is also called vector data processing.

ＳＩＭＤプロセッサ１０は、ＳＩＭＤレジスターファイル１１、ＡＬＵ（ＡｒｉｔｈｍｅｔｉｃＬｏｇｉｃＵｎｉｔ）１２、ＭＵＬ（ｍｕｌｔｉｐｌｉｅｒＵｎｉｔ）１３、シフター１４、およびＬＳ（Ｌｏａｄ／ＳｔｏｒｅＵｎｉｔ）１５を含む。 The SIMD processor 10 includes a SIMD register file 11, an ALU (Arithmetic Logic Unit) 12, a MUL (multiplier Unit) 13, a shifter 14, and an LS (Load / Store Unit) 15.

ＳＩＭＤレジスターファイル１１は、２５６ビット、１２８ビット、または６４ビットの長さの複数のＳＩＭＤ用レジスター（以下、単に「レジスター」という）で構成され得る。本実施形態においては、ＳＩＭＤレジスターファイル１１１を１６個の１２８ビット長のレジスター、または３２個の６４ビット長のレジスターとして用いることができる（以下、それぞれ「１２８ビットレジスター」、「６４ビットレジスター」ともいう）。そして、３２ビット長の単精度浮動小数点数や整数であれば、１２８ビットレジスターに４個分のオペランドが、６４ビットレジスターであれば２個分のオペランドが１度に格納可能である。このようなデータは、ベクトルデータとも呼ばれる。 The SIMD register file 11 may consist of a plurality of SIMD registers (hereinafter simply "registers") that are 256, 128, or 64 bits long. In the present embodiment, the SIMD register file 11 can be used as sixteen 128-bit registers or thirty-two 64-bit registers (hereinafter also "128-bit registers" and "64-bit registers", respectively). For 32-bit single-precision floating-point numbers or integers, a 128-bit register holds four operands at a time and a 64-bit register holds two. Such data is also called vector data.

ＡＬＵ１２は、加算、減算、および論理演算を行う演算器である。ＭＵＬ１３は、乗算、除算を行う演算器である。シフター１４は、ビットを左右にずらすシフト処理を行う演算器である。ＬＳ１５は、メモリ３０からレジスターへのデータのロード、およびレジスターからメモリ３０へのデータのロードの処理を行う。 The ALU 12 is an arithmetic unit that performs addition, subtraction, and logical operations. The MUL 13 is an arithmetic unit that performs multiplication and division. The shifter 14 is an arithmetic unit that shifts bits to the left or right. The LS 15 loads data from the memory 30 into the registers and stores data from the registers into the memory 30.

汎用プロセッサ２０は、ＳＩＭＤ処理以外の一般的なスカラデータ処理を実行するプロセッサである。汎用プロセッサ２０は、ＡＬＵ２２、ＭＵＬ２３、ＬＳ２４を含む。これらは、ＡＬＵ１２、ＭＵＬ１３、およびＬＳ１５と同様の機能を備えるので説明は省略する。 The general-purpose processor 20 is a processor that executes general scalar data processing other than SIMD processing. The general-purpose processor 20 includes ALU22, MUL23, and LS24. Since these have the same functions as ALU12, MUL13, and LS15, the description thereof will be omitted.

メモリ３０は、ＤＲＡＭ（ＤｙｎａｍｉｃＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）等のＲＡＭ、およびＮＡＮＤ型フラッシュメモリ等のＲＯＭを含む。また、さらにＨＤＤなどの補助記憶装置を含んでもよい。メモリ３０には、処理する画像データを一時的に格納したり、後述する情報処理装置１００が実行するＣＮＮ処理の実行方法のプログラム、情報処理装置１００が組み込まれた機器を制御する制御プログラム、各種データ等を格納したりする。 The memory 30 includes a RAM such as a DRAM (Dynamic Random Access Memory) and a ROM such as a NAND flash memory. Further, an auxiliary storage device such as an HDD may be further included. The memory 30 temporarily stores image data to be processed, a program of a method of executing CNN processing executed by the information processing device 100 described later, a control program for controlling a device in which the information processing device 100 is incorporated, and various types. Store data etc.

（カーネルの格納）
本実施形態では、ＣＮＮ処理を行う際には、図１で示したようなハードウェアリソースの制約のため、カーネル（ファイルターともいう）のサイズによっては全てを一度にＳＩＭＤレジスターファイル１１１のレジスターに格納しようとした場合、レジスターが不足してしまう。そこで、本実施形態では、ＳＩＭＤレジスターファイル１１１を分割した１２８ビットレジスターまたは６４ビットレジスターに、カーネルの一部を分割して格納する。なお、適用するカーネルのサイズは、処理に応じて異なる。 (Kernel storage)
In the present embodiment, because of hardware resource constraints such as those shown in FIG. 1, some kernel (also called filter) sizes cannot be stored in the registers of the SIMD register file 11 all at once during CNN processing; the registers would run out. Therefore, in the present embodiment, the kernel is divided and stored part by part in the 128-bit or 64-bit registers into which the SIMD register file 11 is partitioned. The kernel size to be applied differs depending on the processing.

図２は、高さ５行、幅５列のサイズのカーネル（以下、「５×５サイズ」等という）のカーネルを６４ビットレジスターおよび１２８ビットレジスターにそれぞれ格納した状態を示す模式図である。カーネルの格納は、ＳＩＭＤ処理をするために、１行毎に、１または複数のレジスターに格納する。５×５サイズのカーネルは２５個のデータから構成され、各データのサイズが単精度浮動小数点数の３２ビットの場合、１個の１２８ビットレジスターには、連続した４個のデータを格納できる。 FIG. 2 is a schematic diagram showing a state in which a kernel having a size of 5 rows in height and 5 columns in width (hereinafter referred to as “5 × 5 size” or the like) is stored in a 64-bit register and a 128-bit register, respectively. The kernel is stored in one or more registers for each line in order to perform SIMD processing. A 5x5 size kernel is composed of 25 pieces of data, and if the size of each piece of data is 32 bits, which is a single precision floating point number, one 128-bit register can store four consecutive pieces of data.

１行分の５個のデータを１２８ビットレジスターに格納した場合、同図に示すように２個の１２８ビットレジスターが必要であり、２個目の１２８ビットレジスターに格納されるデータは１個であり、残り３個分の空きが生じる。一方で、６４ビットレジスターに格納した場合は、１個の６４ビットレジスターには、それぞれ２個のカーネルのデータを格納できるので３個の６４ビットレジスターが必要であり、３個目の６４ビットレジスターには、１個分の空きが生じる。 When the five data elements of one row are stored in 128-bit registers, two 128-bit registers are required, as shown in the figure; the second register holds only one element, leaving three slots empty. When they are stored in 64-bit registers, each register holds two kernel data elements, so three 64-bit registers are required and the third has one empty slot.

両者の比較では、５×５サイズのカーネルの場合であれば６４ビットレジスターを選択した方が、格納効率が高く、有効にレジスターを活用できることが分かる。 A comparison of the two shows that in the case of a 5x5 size kernel, selecting a 64-bit register has higher storage efficiency and makes effective use of the register.

同様に１１×１１サイズのカーネルの場合には、６４ビットレジスター、および１２８ビットレジスターのいずれを選択した場合であっても、最後のレジスターでは１個分の空きが生じることになる。図３は、１１×１１サイズのカーネルを１２８ビットレジスターに格納した状態を示す模式図であり、３個目のレジスター（ｋ０２）では、１個分の空きがある。この場合、６４ビットレジスターおよび１２８ビットレジスターにおいて、格納効率はどちらも同じである。しかしながら、ＳＩＭＤ処理を考慮した場合、１２８ビットレジスターの方が、６４ビットレジスターに比べて格納しているデータ数が多いため、１サイクル（１つのロードまたはストア命令）でより多くのデータを一度に転送できるため、転送効率が高いと言える。 Similarly, in the case of an 11x11 size kernel, one free space will be created in the last register regardless of whether the 64-bit register or the 128-bit register is selected. FIG. 3 is a schematic diagram showing a state in which an 11 × 11 size kernel is stored in a 128-bit register, and there is space for one in the third register (k02). In this case, the storage efficiency is the same for both the 64-bit register and the 128-bit register. However, when considering SIMD processing, the 128-bit register stores more data than the 64-bit register, so more data can be stored at one time in one cycle (one load or store instruction). Since it can be transferred, it can be said that the transfer efficiency is high.
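The per-row register counts and empty slots discussed above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the patent text; the function name `row_layout` and its parameters are hypothetical, and 32-bit data elements are assumed as in the description.

```python
def row_layout(w, reg_bits, elem_bits=32):
    """Return (lanes, registers_per_row, empty_slots) for one kernel row.

    w         -- kernel width (number of data elements in a row)
    reg_bits  -- register width in bits (64 or 128 in the description)
    elem_bits -- width of one data element (assumed 32-bit here)
    """
    lanes = reg_bits // elem_bits          # elements that fit in one register
    regs = -(-w // lanes)                  # ceiling division: registers per row
    empty = regs * lanes - w               # unused slots in the last register
    return lanes, regs, empty


if __name__ == "__main__":
    for w in (5, 11):
        for bits in (64, 128):
            print(w, bits, row_layout(w, bits))
```

For a 5-wide row this gives three empty slots with 128-bit registers and one with 64-bit registers; for an 11-wide row both register sizes leave exactly one empty slot, matching the comparison in the text.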

ここで、本実施形態においては、ＳＩＭＤレジスターファイルはサイズ上の制約から、カーネルの格納に割り当てることが可能なレジスターの個数が限られる。例えば１２８ビットレジスターであれば利用可能なレジスター個数は１６個である。１１×１１サイズのカーネルを一度に１２８ビットレジスターに格納するためには、３３個（＝３×１１）のレジスターが必要であり、１６個のレジスターでは不足することになる。そこで後述する本実施形態に係る方法では、カーネルのサイズ等の条件に応じて、カーネルの分割量を決定し、決定した分割量でカーネルを複数に分割し、分割したカーネル毎に畳み込み演算を行う。これにより組み込みシステムのような限られたリソースであっても効率的に、ＣＮＮ処理を行えるようにする（図４参照）。 Here, in the present embodiment, the number of registers that can be allocated to the storage of the kernel is limited due to the size limitation of the SIMD register file. For example, in the case of a 128-bit register, the number of registers that can be used is 16. In order to store an 11x11 size kernel in a 128-bit register at a time, 33 (= 3x11) registers are required, and 16 registers are insufficient. Therefore, in the method according to the present embodiment described later, the division amount of the kernel is determined according to the conditions such as the size of the kernel, the kernel is divided into a plurality of parts by the determined division amount, and the convolution operation is performed for each divided kernel. .. This makes it possible to efficiently perform CNN processing even with limited resources such as embedded systems (see FIG. 4).
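The idea of convolving with one group of kernel rows at a time and accumulating partial sums can be illustrated in plain Python. This is only a sketch under stated assumptions: it models the arithmetic, not SIMD registers or NEON instructions, and the function names `conv2d` and `conv2d_row_split` are hypothetical. It shows that splitting the kernel into n-row portions yields the same output as the direct computation.

```python
def conv2d(image, kernel, stride):
    """Reference: direct sliding-window sum of products, no padding."""
    w = len(kernel)
    out_h = (len(image) - w) // stride + 1
    out_w = (len(image[0]) - w) // stride + 1
    out = [[0] * out_w for _ in range(out_h)]
    for oy in range(out_h):
        for ox in range(out_w):
            acc = 0
            for ky in range(w):
                for kx in range(w):
                    acc += image[oy * stride + ky][ox * stride + kx] * kernel[ky][kx]
            out[oy][ox] = acc
    return out


def conv2d_row_split(image, kernel, stride, n):
    """Same result, but only n kernel rows are used per pass.

    `out` plays the role of a result buffer that is updated after
    each n-row portion of the kernel has been processed.
    """
    w = len(kernel)
    out_h = (len(image) - w) // stride + 1
    out_w = (len(image[0]) - w) // stride + 1
    out = [[0] * out_w for _ in range(out_h)]
    for row0 in range(0, w, n):
        rows = kernel[row0:row0 + n]       # last portion may have fewer rows
        for oy in range(out_h):
            for ox in range(out_w):
                for dy, krow in enumerate(rows):
                    irow = image[oy * stride + row0 + dy]
                    for kx, kv in enumerate(krow):
                        out[oy][ox] += irow[ox * stride + kx] * kv
    return out
```

With an 11 × 11 kernel and n = 2, the loop makes six passes: five with two rows and a final one with the remaining single row.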

（演算結果の共通使用について）
本実施形態においては、ＣＮＮ処理（ＡｌｅｘＮｅｔ）におけるストライドは４である。図５は４画素分のストライドを実行する前と後の入力画素の範囲の重なり部分を示す図である。図５に示すようにｉ×ｉサイズの入力画素に対して、１１×１１サイズのカーネルを用いて、畳み込み演算をする場合、図５（ａ）に示すように、左上から１１×１１サイズの入力画素を、カーネルの対応する位置のデータと積和演算して、１つの出力画素のデータを算出する。太枠は、カーネルと積和演算する入力画素の範囲を示す窓である。 (About common use of calculation results)
In the present embodiment, the stride in the CNN processing (AlexNet) is 4. FIG. 5 shows the overlap between the ranges of input pixels before and after a stride of four pixels. When a convolution operation is performed on i × i input pixels using an 11 × 11 kernel, as shown in FIG. 5(a), the 11 × 11 input pixels starting from the upper left are multiplied by the kernel data at the corresponding positions and summed to calculate the data of one output pixel. The thick frame is the window indicating the range of input pixels on which the product-sum operation with the kernel is performed.

そして、次の出力画素は、ストライドの設定値分、すなわち図５（ｂ）に示すように４画素分スキップさせた入力画素を、カーネルと積和演算することで、算出される。このとき、１つ前の窓と、現時点の窓とは、一部重なるため、レジスターに格納した入力画素を共通で使用できる。なお、この重なる量は、スキップの設定値と、カーネルのサイズにより異なる。ここで、以下の例では、カーネルをストライドさせる画素数は、畳み込み演算のためにずらす画素数（後述の格納データ数ｑ）と一致している。また、このストライドする画素数は、畳み込み演算のためにずらす画素数に対して同じ、または整数倍であることが好ましい。以下に説明する本実施形態に係る方法では、この共通の使用できる領域内の入力画素をレジスターに格納し、複数の出力画素の畳み込み演算に使用することで処理の効率化を図る。 Then, the next output pixel is calculated by performing a product-sum calculation with the kernel for the set value of the stride, that is, the input pixel skipped by 4 pixels as shown in FIG. 5 (b). At this time, since the previous window and the current window partially overlap, the input pixels stored in the register can be used in common. The amount of overlap differs depending on the skip setting value and the kernel size. Here, in the following example, the number of pixels for striding the kernel matches the number of pixels shifted for the convolution operation (the number of stored data q described later). Further, the number of striding pixels is preferably the same as or an integral multiple of the number of pixels shifted for the convolution calculation. In the method according to the present embodiment described below, the input pixels in the common usable area are stored in the register and used for the convolution calculation of a plurality of output pixels to improve the processing efficiency.
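The overlap between consecutive windows can be checked with a small sketch. The helper `shared_columns` is hypothetical and only counts column indices; it does not model registers.

```python
def shared_columns(w, y, first=0):
    """Column indices covered by both a window and the next one.

    w     -- kernel (window) width
    y     -- number of pixels skipped between windows (stride)
    first -- leftmost column of the earlier window
    """
    prev = set(range(first, first + w))
    curr = set(range(first + y, first + y + w))
    return sorted(prev & curr)


if __name__ == "__main__":
    cols = shared_columns(11, 4)
    print(len(cols), cols)
```

For w = 11 and y = 4, seven columns are literally shared between the two windows. The description later works with w + 1 - y = 8 pixels, where the extra 1 is explained there as an adjustment to the number of data elements stored in a register at one time.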

（処理方法）
以下、図６Ａ〜図８を参照し、情報処理装置１００によるＣＮＮ処理の手順を説明する。図６Ａ、図６Ｂは情報処理装置１００が実行するＣＮＮ処理のフローチャートを示す図である。 (Processing method)
Hereinafter, the procedure of the CNN processing performed by the information processing device 100 will be described with reference to FIGS. 6A to 8. FIGS. 6A and 6B are flowcharts of the CNN processing executed by the information processing device 100.

（Ｓ１１１）
最初に、情報処理装置１００は、カーネルサイズからレジスターサイズの選択を行う。レジスターサイズの選択は、上述したように最初に、（１）格納効率の観点から格納効率がより高い方を選択する。格納効率で差がない場合には、次に（２）転送効率の観点から、レジスターサイズがより大きい方を選択する。例えばカーネルサイズが５×５であれば格納効率の観点から６４ビットレジスターを選択する。カーネルサイズが１１×１１であれば格納効率と転送効率の観点から１２８ビットレジスターを選択する。 (S111)
First, the information processing apparatus 100 selects the register size from the kernel size. As described above, when selecting the register size, first, (1) the one having the higher storage efficiency is selected from the viewpoint of the storage efficiency. If there is no difference in storage efficiency, then (2) from the viewpoint of transfer efficiency, the one with the larger register size is selected. For example, if the kernel size is 5 × 5, a 64-bit register is selected from the viewpoint of storage efficiency. If the kernel size is 11 × 11, 128-bit registers are selected from the viewpoint of storage efficiency and transfer efficiency.
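The two-criterion selection described above (storage efficiency first, then register size as a tie-breaker) might be sketched as follows. This is an illustrative reading of the text, not the patent's implementation; the function name `select_register_bits` and the candidate list are assumptions, and only the two kernel sizes named in the description (5 and 11) are checked.

```python
from fractions import Fraction


def select_register_bits(w, candidates=(64, 128), elem_bits=32):
    """Pick a register width for a kernel row of w elements.

    Criterion 1: higher storage efficiency (used slots / allocated slots).
    Criterion 2: on a tie, the larger register (better transfer efficiency).
    """
    def key(reg_bits):
        lanes = reg_bits // elem_bits
        regs = -(-w // lanes)                    # registers needed per row
        efficiency = Fraction(w, regs * lanes)   # exact, avoids float ties
        return (efficiency, reg_bits)

    return max(candidates, key=key)


if __name__ == "__main__":
    print(select_register_bits(5))    # 64-bit registers waste fewer slots
    print(select_register_bits(11))   # tie on efficiency, larger size wins
```

Using `Fraction` keeps the efficiency comparison exact, so the tie for the 11-wide row (11/12 for both sizes) is detected reliably and resolved by register width.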

（ステップＳ１１２）
次に、ステップＳ１１１で選択したレジスターサイズの使用レジスター数、カーネルサイズを用いて、カーネルの分割量を決定する。 (Step S112)
Next, the division amount of the kernel is determined using the number of registers used and the kernel size of the register size selected in step S111.

カーネルの分割量、すなわち分割行数ｎは、下記式（１）を満たす最大の数（整数）に設定する。
ｋ−（ｘ＋ｊ）≧（ｂ＋１）ｎ（１）
ここでｊは、ｊ＝（ｗ＋１−ｙ）／ｂで、
ｂ＝ｗ／ｑ
ｊ、ｂは小数点以下を切り上げた整数であり、
ｗ：カーネルサイズ
ｋ：レジスターの個数
ｑ：１つのレジスターに格納するデータ数
ｘ：畳み込み演算に必要なレジスター数
ｙ：スキップする画素数である。 The kernel division amount, that is, the number of rows per division n, is set to the largest integer satisfying the following formula (1):
k - (x + j) ≥ (b + 1)n   (1)
where j = (w + 1 - y) / b and b = w / q, with j and b each rounded up to an integer, and
w: kernel size
k: number of registers
q: number of data elements stored in one register
x: number of registers required for the convolution operation
y: number of pixels to skip.

（式（１）について）
カーネルサイズｗ×ｗに基づいて選択した、ｋ個のｚビットレジスターを使用する。 (About formula (1))
k z-bit registers, selected based on the kernel size w × w, are used.

最小ビット数を単数浮動小数点型の３２ビットとすると、ｚビットレジスターにはｚ／３２＝ｑ個のデータが格納される。 Assuming the minimum data width is the 32 bits of a single-precision floating-point number, a z-bit register stores q = z / 32 data elements.

畳み込み演算時に必要な積算、加算のためにｘ個のレジスターが必要である。この積算、加算演算に必要なレジスター数については、後述する。 x registers are required for the multiplication and addition performed during the convolution operation. The number of registers required for these operations is described later.

カーネルサイズｗ×ｗでは、１行でｚビットレジスターを用いた計算をｗ／ｑ＝ｂ回（ｂは小数点切り上げた整数）計算する必要がある。 With the kernel size w × w, it is necessary to perform the calculation using the z-bit register in one line w / q = b times (b is an integer rounded up to the nearest whole number).

また、入力画素をｙ画素ずつスキップして畳み込みを行う場合、（ｗ＋１−ｙ）画素分は共通して使用できる。 Further, when the input pixels are skipped by y pixels and the convolution is performed, the (w + 1-y) pixels can be used in common.

そのため、ｚビットレジスターにｑ個のデータが格納されることを考慮すると、中間バッファは（ｗ＋１−ｙ）／ｂ＝ｊ個用意する必要がある。（ｊは小数点を切り上げた整数）
よって、残りのレジスター数は、ｋ−（ｘ＋ｊ）＝ｍ個
カーネルをｎ行ずつに分割するとカーネルに必要なレジスター個数はｂｎ、入力画素データに必要なレジスター個数はｎ。
そこで、ｍ≧（ｂ＋１）ｎの条件を満たす、最大整数ｎを求める。 Therefore, considering that q data elements are stored in each z-bit register, j = (w + 1 - y) / b intermediate buffers must be prepared (j is rounded up to an integer).
The number of remaining registers is therefore m = k - (x + j). When the kernel is divided into portions of n rows, the kernel requires bn registers and the input pixel data requires n registers.
Accordingly, the largest integer n satisfying m ≥ (b + 1)n is obtained.

（本実施形態のｎ＝２となる具体例）
例えば、ステップＳ１１１で、カーネルサイズが１１×１１（ｗ＝１１）に応じて１２８ビットレジスターを選択した場合、ＳＩＭＤ処理に割当て可能なレジスター数ｋは、１６個である。１つの１２８ビットレジスターには、格納するデータ数ｑは、３２ビットの単精度浮動小数点数で４個である。 (Specific example in which n = 2 of this embodiment)
For example, when the 128-bit register is selected according to the kernel size of 11 × 11 (w = 11) in step S111, the number of registers k that can be assigned to the SIMD process is 16. The number of data q stored in one 128-bit register is four, which is a 32-bit single-precision floating-point number.

畳み込み演算時に必要な積算、加算のために少なくとも３個（ｘ個）のレジスター個数が必要である（後述）。１２８ビットレジスターには３２ビットのデータが４個（＝１２８／３２）格納される。 At least three (x) registers are required for the multiplication and addition performed during the convolution operation (described later); the calculation below uses x = 4. A 128-bit register stores four 32-bit data elements (= 128 / 32).

また、カーネルサイズは１１ｘ１１では、１行で１２８ビットレジスターを用いた計算を１１／４＝２．７回、小数点切り上げで３回行う必要がある（ｂ＝３）。 With an 11 × 11 kernel, each row requires 11 / 4 = 2.75 calculations using 128-bit registers, which rounds up to three (b = 3).

When the convolution skips four input pixels at a time, eight pixels (w + 1 − y = 11 + 1 − 4) can be shared, so three intermediate buffers suffice (j = 8/b = 8/3 ≈ 2.67, rounded up to j = 3). The added "1" is an adjustment value for matching the number of data values stored in a register at one time (q: 4).

The remaining registers number 16 − (4 + 3) = 9, with x = 4 registers reserved here for multiplication and addition.
When the kernel is divided into groups of n rows,
16 − (4 + 3) ≥ (b + 1)n = 4n.
The maximum integer n satisfying 9 ≥ 4n is n = 2.
In summary, of the 16 (k) 128-bit (z) registers, those with assigned uses are 6 (bn = 3n) for kernel storage, 2 (n) for input pixels, 3 (j) for intermediate buffers, and 4 (x) for multiplication and addition, for a total of 15.
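The register tally in this example can be checked with a small sketch. The function name `register_budget` and the dictionary keys are illustrative assumptions; the quantities k, b, n, j, and x are those of the text.

```python
def register_budget(k, b, n, j, x):
    """Tally register use and check it against the k available registers."""
    budget = {
        "kernel": b * n,      # b registers per kernel row, n rows
        "input": n,           # one register per input row
        "intermediate": j,    # intermediate buffers
        "arithmetic": x,      # multiplication and addition
    }
    used = sum(budget.values())
    if used > k:
        raise ValueError(f"{used} registers needed, only {k} available")
    return budget, used

budget, used = register_budget(k=16, b=3, n=2, j=3, x=4)
print(budget, used)
```

With n = 3 the same call would need 19 registers and raise an error, which is consistent with n = 2 being the largest feasible value.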

(When x = 3 registers are required for the multiplication and addition of the convolution operation)
To calculate (a + b) × (c + d) = e, a and b are stored in registers R0 and R1, respectively, and the result x of a + b is stored in register R0. Next, c and d are stored in registers R1 and R2, respectively, and the result y of c + d is stored in register R1. The result e of x × y is stored in register R0. Three registers, R0 to R2, are therefore required.

(Another example: when x = 4 registers are required for the multiplication and addition of the convolution operation)
To calculate (a + b) × (c + d) = e, a and b are stored in registers R0 and R1, respectively, and the result x of a + b is stored in register R2. Next, c and d are stored in registers R0 and R1, respectively, and the result y of c + d is stored in register R3. The result e of x × y is stored in register R3. Four registers, R0 to R3, are therefore required. As these examples show, the required number is only an example; the number of registers needed for multiplication and addition varies with the number of variables and the availability of registers.
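The two register schedules described above can be traced with a small simulation. This is a sketch only: a Python dictionary stands in for the register file, and the function names are assumptions introduced for illustration.

```python
def three_register_schedule(a, b, c, d):
    """(a + b) * (c + d) using R0 to R2, reusing R0 and R1 for results."""
    regs = {}
    regs["R0"], regs["R1"] = a, b
    regs["R0"] = regs["R0"] + regs["R1"]   # x = a + b overwrites a
    regs["R1"], regs["R2"] = c, d
    regs["R1"] = regs["R1"] + regs["R2"]   # y = c + d overwrites c
    regs["R0"] = regs["R0"] * regs["R1"]   # e = x * y
    return regs["R0"], len(regs)

def four_register_schedule(a, b, c, d):
    """Same calculation using R0 to R3, keeping results in R2 and R3."""
    regs = {}
    regs["R0"], regs["R1"] = a, b
    regs["R2"] = regs["R0"] + regs["R1"]   # x = a + b
    regs["R0"], regs["R1"] = c, d
    regs["R3"] = regs["R0"] + regs["R1"]   # y = c + d
    regs["R3"] = regs["R2"] * regs["R3"]   # e = x * y
    return regs["R3"], len(regs)

print(three_register_schedule(1, 2, 3, 4))  # (21, 3)
print(four_register_schedule(1, 2, 3, 4))   # (21, 4)
```

Both schedules give the same result; they differ only in how many registers are occupied.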

(Step S113)
The n rows of the kernel determined by the above procedure are stored in registers. The following description assumes that n = 2 has been determined under the conditions of the embodiment described above.

FIGS. 7 and 8 are schematic views showing the data storage state of the registers. FIG. 7(a) shows the state in which, in step S113, two rows of the kernel have been divided and stored in bn (= 6) registers. As shown in FIG. 3, the data of each of the two kernel rows is stored in registers in groups of four consecutive values. In the figure, each register is labeled with a code, k00 to k12, corresponding to the data it stores. The third registers, k02 and k12, hold kernel data in three of their four storage slots, leaving one slot free (see FIG. 3). This free slot is filled with zero.
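The packing of kernel rows into registers, including the zero-filled free slot, might be sketched as follows. Python lists stand in for registers here, and the function names `pack_row` and `pack_kernel_rows` are assumptions.

```python
def pack_row(row, q):
    """Split one row into q-wide chunks, zero-filling the last chunk."""
    padded = list(row) + [0.0] * (-len(row) % q)
    return [padded[i:i + q] for i in range(0, len(padded), q)]

def pack_kernel_rows(rows, q=4):
    """Pack each kernel row; result[r][c] plays the role of register k<r><c>."""
    return [pack_row(r, q) for r in rows]

rows = [[float(v) for v in range(1, 12)] for _ in range(2)]  # two rows of 11 values
packed = pack_kernel_rows(rows)
print(packed[0][2])  # [9.0, 10.0, 11.0, 0.0]
```

For two rows of an 11-wide kernel and q = 4, this produces six chunks, with the third chunk of each row ending in a zero.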

(Step S114)
Next, zero padding is performed, inserting the value 0 at the right and bottom edges of the input pixels. The padding width is, for example, 1, and is set as appropriate according to the required output pixel size.
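As a minimal sketch of this padding, assuming the input is held as a list of equal-length rows (the function name `zero_pad_right_bottom` is hypothetical):

```python
def zero_pad_right_bottom(image, pad=1):
    """Append `pad` zero columns on the right and `pad` zero rows at the bottom."""
    width = len(image[0])
    padded = [list(row) + [0] * pad for row in image]
    padded += [[0] * (width + pad) for _ in range(pad)]
    return padded

print(zero_pad_right_bottom([[1, 2], [3, 4]]))  # [[1, 2, 0], [3, 4, 0], [0, 0, 0]]
```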

(Step S115)
Starting from the top-left corner, 2 × 4 input pixels, corresponding to n (= 2) determined in step S112 and to the number of data values that one register can hold (4), are stored in n (= 2) registers. FIG. 7(b) shows the state in which the 2 × 4 input pixels have been stored in two registers in step S115. In this figure as well, each register is labeled with the code, d00 or d10, corresponding to the data it stores.

(Step S116)
A product-sum operation is performed on the data in the input pixel registers (d00, d10) and the data in the registers holding the left four kernel values (k00, k10), and the result S is stored in the register for intermediate buffer 1. In this product-sum operation (and in those below), SIMD processing uses the four values held in each input register and kernel register (for example, d00 and k00) to process the convolution operation four pixels at a time in parallel. FIG. 7(c) shows the state after the product-sum operation of step S116, with the result S stored in the register.
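One possible reading of this product-sum operation is a lane-wise multiply-accumulate, sketched below. The text does not state when the four lanes are summed together, so keeping the accumulator as four separate lanes is an assumption, as are the function name `mac_lanes` and the sample values.

```python
def mac_lanes(acc, data, kernel):
    """Lane-wise multiply-accumulate: acc[i] + data[i] * kernel[i] for each lane."""
    return [a + d * k for a, d, k in zip(acc, data, kernel)]

# Hypothetical register contents for two input rows and two kernel rows.
d00, d10 = [1, 2, 3, 4], [5, 6, 7, 8]
k00, k10 = [1, 0, 1, 0], [2, 2, 2, 2]

S = [0, 0, 0, 0]           # intermediate buffer 1, cleared
S = mac_lanes(S, d00, k00)  # first row
S = mac_lanes(S, d10, k10)  # second row
print(S)  # [11, 12, 17, 16]
```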

(Step S121)
Next, after skipping four pixels (the number q of data values per register), the next 2 × 4 input pixels are stored in the input pixel registers. Here, for example, the data d01 and d11 (see FIG. 7(b)) are stored in the two input pixel registers.

(Step S122)
A product-sum operation is performed on the data in the input pixel registers and the data in the registers holding the left four kernel values, and the result A is stored in the register for intermediate buffer 2. FIG. 8(a) shows the state after the product-sum operation of step S122, with the result A stored in the register.

(Step S123)
A product-sum operation is performed on the data in the input pixel registers and the data in the registers holding the center four kernel values, and the result B is added to the register for intermediate buffer 1. FIG. 8(b) shows the state after the product-sum operation of step S123, with the result B stored in the register.

(Step S131)
Next, after skipping four more pixels, the next 2 × 4 input pixels are stored in the input pixel registers. Here, for example, the data d02 and d12 (see FIG. 7(b)) are stored in the two input pixel registers.

(Step S132)
A product-sum operation is performed on the data in the input pixel registers and the data in the registers holding the left four kernel values, and the result C is stored in the register for intermediate buffer 3. FIG. 8(c) shows the state after the product-sum operation of step S132, with the result C stored in the register.

(Step S133)
A product-sum operation is performed on the data in the input pixel registers and the data in the registers holding the center four kernel values, and the result D is added to the register for intermediate buffer 2. FIG. 8(d) shows the state after the product-sum operation of step S133, with the result D stored in the register.

(Step S134)
A product-sum operation is performed on the data in the input pixel registers and the data in the registers holding the right four kernel values, and the result E is added to the register for intermediate buffer 1. FIG. 8(e) shows the state after the product-sum operation of step S134, with the result E stored in the register. At this point the input pixel register holds four values, but the fourth, that is, the twelfth value lying outside the window (see FIG. 5), is multiplied by the empty fourth slot of the kernel register (zero) before being added, so it does not affect the result of the product-sum operation.

(Step S141)
At this point, the product-sum result for two rows of one output pixel is held in the register for intermediate buffer 1, so the stored data is transferred to a predetermined address in the memory 30. If there are registers to spare, this transfer to the memory 30 may instead be deferred: the process passes through step S143, described later, executes step S131 onward again, and performs the transfer once the data for two output pixels is ready. This improves processing efficiency.

(Step S142)
Since the register for intermediate buffer 1 has served its purpose, the buffer data is slid and updated in the following order. First, the register for intermediate buffer 1 is updated with the data stored in the register for intermediate buffer 2. Next, the register for intermediate buffer 2 is updated with the data stored in the register for intermediate buffer 3. Finally, the register for intermediate buffer 3 is set to zero.

(Step S143)
If processing has not reached the last column of the (zero-padded) input pixels (NO), the process returns to step S131, where the next input pixels in the window, four pixels further on, are loaded, and the subsequent steps are executed. If processing has reached the last column (YES), the process proceeds to step S144.
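The column loop of steps S116 to S143 can be sketched as below. This is a simplified model rather than the patented implementation: the stride is assumed to equal q, each product-sum is reduced to a single scalar immediately instead of being kept as SIMD lanes, and all function names are illustrative.

```python
def pack_row(row, q):
    """Split a row into q-wide chunks, zero-filling the last chunk."""
    padded = list(row) + [0] * (-len(row) % q)
    return [padded[i:i + q] for i in range(0, len(padded), q)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sliding_partial_sums(in_rows, k_rows, q=4):
    """Partial sums over n rows for each output pixel, assuming stride == q."""
    k_chunks = [pack_row(r, q) for r in k_rows]   # kernel registers, kept fixed
    d_chunks = [pack_row(r, q) for r in in_rows]  # input, one column block per step
    b = len(k_chunks[0])                          # chunks per kernel row
    buffers = [0] * b                             # intermediate buffers 1..b
    first_pending = 0                             # window held in buffer 1
    outputs = []
    for c in range(len(d_chunks[0])):
        for i in range(b):
            t = c - (first_pending + i)           # kernel chunk meeting this input chunk
            if 0 <= t < b:
                buffers[i] += sum(dot(d[c], k[t]) for d, k in zip(d_chunks, k_chunks))
        if c - first_pending == b - 1:            # buffer 1 is complete
            outputs.append(buffers[0])            # transfer to memory
            buffers = buffers[1:] + [0]           # slide buffers, clear the last one
            first_pending += 1
    return outputs
```

Each input chunk is loaded once and multiplied against the left, center, and right kernel chunks, which is why the kernel registers can stay fixed while only the intermediate buffers shift.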

(Step S144)
If processing has not reached the last row of the kernel, that is, the 11th row (NO), the process proceeds to step S145. If processing through the last row is complete (YES), the calculation of one output pixel is finished (END). Thereafter, the window is strided in the row direction and the calculation is repeated for all output pixels.

(Step S145)
The next two rows of the kernel are stored in the six kernel registers, overwriting the previous contents. For example, if rows 1 and 2 are finished, the kernel data of rows 3 and 4 is stored in the registers. Because rows are stored two at a time, on the sixth pass only the single remaining row, row 11, is stored. In this case, the three registers of the unused row may all be filled with zeros before the product-sum calculation is performed.
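The grouping of kernel rows, including the zero-filled unused row on the last pass, might look like this. The function name `split_kernel_rows` is an assumption, and the kernel is modeled as a list of rows.

```python
def split_kernel_rows(kernel, n):
    """Split a kernel into groups of n rows, zero-filling the last group."""
    width = len(kernel[0])
    groups = []
    for start in range(0, len(kernel), n):
        group = [list(r) for r in kernel[start:start + n]]
        group += [[0] * width for _ in range(n - len(group))]  # unused rows
        groups.append(group)
    return groups

kernel = [[1] * 11 for _ in range(11)]
groups = split_kernel_rows(kernel, n=2)
print(len(groups))  # 6
```

For an 11 × 11 kernel and n = 2, the sixth group holds row 11 followed by one all-zero row.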

(Step S146)
The 2 × 4 input pixels at the left end of the next two rows are stored in the input pixel registers. The process then proceeds to step S116 and executes the subsequent steps.

According to the method of the present embodiment described above, the kernel division amount is determined, the kernel is divided by the determined amount, and the calculation is performed. This makes it possible to perform CNN processing efficiently even under limited hardware resources, such as in an embedded system with restricted power consumption. In particular, by keeping the registers that store the kernel fixed and updating the data in the intermediate buffer registers while shifting it, calculation results can be shared (see FIG. 5). This makes CNN processing even more efficient.

(Other Examples)
The configurations of the information processing device and processing method described above are the main configurations, presented to explain the features of the above embodiment. They are not limited to the above and may be modified in various ways within the scope of the claims. Nor do they exclude the configurations of, and the processing performed by, a general information processing device.

(Modification 1)
In the embodiment described above, the division amount is determined even when a 5 × 5 kernel is used, and the kernel is divided and stored in registers according to that amount. Depending on the kernel size, however, the kernel may instead be stored in registers all at once without division. FIG. 9 is a flowchart according to this modification.

(Step S211)
Here, the kernel size used for CNN processing is determined. If the kernel size is 5 × 5 or smaller, the process proceeds to step S212. If it exceeds 5 × 5, the process proceeds to step S111 of FIG. 6A.

(Step S212)
All kernel data is stored in registers. For a 5 × 5 kernel, for example, 64-bit registers are used and all kernel data is stored in 15 registers in total, three per row. The data in these registers is kept fixed until processing ends.
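The branch in step S211 and the register count in step S212 can be sketched as follows. The function names, the returned labels, and the 32-bit element size are assumptions; the 5 × 5 threshold and the 64-bit register example come from the text.

```python
import math

def registers_for_whole_kernel(w, register_bits, element_bits=32):
    """Registers needed to hold a full w x w kernel, one row at a time."""
    q = register_bits // element_bits   # data values per register
    return w * math.ceil(w / q)         # rows times registers per row

def choose_path(w, threshold=5):
    """Small kernels are stored whole; larger ones go through the split procedure."""
    return "store_whole_kernel" if w <= threshold else "split_kernel"

print(registers_for_whole_kernel(5, 64))  # 15
print(choose_path(5), choose_path(11))    # store_whole_kernel split_kernel
```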

(Step S213)
Thereafter, by a known procedure, the data within the 5 × 5 window of the input pixels is stored in registers in sequence, and the convolution operation is executed by a product-sum operation with the data in the registers holding the kernel, with the convolution processed two pixels at a time in parallel.

(Step S214)
As soon as the convolution operation for one pixel is complete, the result is transferred to memory and the process ends. Thereafter, the calculation is repeated for all output pixels.

Thus, in this modification, when the kernel is small, storing it in registers all at once without division reduces the number of load and store transfers to and from the registers, and as a result CNN processing can be performed efficiently.

(Incorporation into an image processing device)
The information processing device 100 of the present embodiment may be applied to an image processing device 60 such as a surveillance camera. FIG. 10 is a block diagram showing the configuration of a behavior recognition system that recognizes the behavior of a person in captured video. As shown in the figure, the behavior recognition system includes an imaging device 50 and the image processing device 60 according to the present embodiment.

The imaging device 50 is a general camera or a wide-angle camera, and generates image data by AD-converting the image signal produced by the camera's image sensor. The imaging device 50 can capture moving images by continuously generating image data frame by frame.

The image processing device 60 includes an image acquisition unit 70 and the information processing device 100 described above. The image acquisition unit 70 acquires the image data D1 of the moving image generated by the imaging device 50.

FIG. 11 is a diagram mainly showing the functional blocks of the information processing device. The information processing device 100 includes a human area detection unit 110, a human body feature extraction unit 120, a peripheral feature extraction unit 130, a peripheral feature filter unit 140, a behavior discrimination unit 150, and a learning unit 160.

The human area detection unit 110 detects, from the image of the image data D1, a partitioned human area that contains a person. The method used to detect the human area is arbitrary. For example, a difference image is detected from the moving image, and the human area is detected from that difference image. Alternatively, the human area detection unit 110 may use a trained neural network, template matching, a combination of HOG (Histograms of Oriented Gradients) features and an SVM (Support Vector Machine), or a method such as background subtraction.

The human body feature extraction unit 120 acquires the image data D1 and the data D2 indicating the human area from the human area detection unit 110, applies predetermined arithmetic processing to the image of the human area, and extracts the posture features of the person in the image. The human body feature extraction unit 120 then outputs the data D3 of the extracted posture features to the peripheral feature filter unit 140 and the behavior discrimination unit 150.

The human body feature extraction unit 120 extracts a person's posture features from the image using, for example, trained CNN processing. The CNN processing constituting the human body feature extraction unit 120 is, for example, trained with teacher data indicating the correspondence between images of a human body and the coordinates (two-dimensional positions or estimated three-dimensional positions) of the joint positions of the human body in those images (generally also referred to as R-CNN).

The human body feature extraction unit 120 includes, for example, a preprocessing unit, a convolution processing unit, and first and second fully connected units. Based on the data D2 indicating the human area, the preprocessing unit normalizes the image, for example by cutting the human area out of the full image and converting it to a predetermined size and aspect ratio.

The convolution processing unit consists of a plurality of feature extraction layers connected hierarchically. In each feature extraction layer, the convolution processing unit applies the convolution operation described above, activation processing, and pooling processing to the input data received from the preceding layer.

The first fully connected unit consists of, for example, a multilayer perceptron that fully connects a plurality of features. The first fully connected unit fully connects the intermediate calculation results obtained from the convolution processing unit to generate the data D3 indicating the person's posture features, and outputs the data D3 to the second fully connected unit and the peripheral feature filter unit 140.

The peripheral feature extraction unit 130 acquires the image data D1 and the data D2 indicating the human area from the human area detection unit 110, applies predetermined arithmetic processing to the image surrounding the human area, and extracts the peripheral features of objects around the person in the image. The peripheral feature extraction unit 130 then outputs the extracted peripheral feature data D4 to the peripheral feature filter unit 140.

Like the human body feature extraction unit 120, the peripheral feature extraction unit 130 extracts peripheral features from the image using, for example, CNN processing. The CNN processing constituting the peripheral feature extraction unit 130 is, for example, trained with teacher data indicating the correspondence between images of an object and the shape, type, or position of each part of that object. More preferably, it is trained with teacher data indicating the correspondence between images of the surroundings of a human body, including the human body, and the coordinates of the shape and positional relationship of objects in those images.

The peripheral feature filter unit 140 acquires the posture feature data D3 from the human body feature extraction unit 120 and the peripheral feature data D4 from the peripheral feature extraction unit 130. It then filters the peripheral features based on the data Da, which gives the importance of peripheral features set in association with posture features, and outputs the filtered peripheral feature data D4a to the behavior discrimination unit 150.

The behavior discrimination unit 150 acquires the posture feature data D3 from the human body feature extraction unit 120 and the filtered peripheral feature data D4a from the peripheral feature filter unit 140. Based on the time-series data of D3 and D4a, it discriminates the behavior class of the person in the image. The behavior discrimination unit 150 according to the present embodiment may perform the time-series analysis using a hierarchical LSTM (Long Short-Term Memory), a type of recurrent neural network. A hierarchical LSTM can recognize relationships over long time intervals (for example, one minute earlier) in addition to relationships over short time intervals (for example, the immediately preceding image frame).
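The data flow among units 110 to 150 can be summarized in a short sketch. Everything here is hypothetical scaffolding: the function `recognize_action` and its callable parameters stand in for the units and do not represent any actual interface of the device.

```python
def recognize_action(frame, detect, pose, surround, filt, classify, history):
    """Pass one frame (D1) through stand-ins for units 110 to 150."""
    d2 = detect(frame)            # unit 110: human area (D2)
    d3 = pose(frame, d2)          # unit 120: posture features (D3)
    d4 = surround(frame, d2)      # unit 130: peripheral features (D4)
    d4a = filt(d3, d4)            # unit 140: filtered peripheral features (D4a)
    history.append((d3, d4a))     # time series consumed by unit 150
    return classify(history)      # unit 150: behavior class
```

The caller keeps `history` across frames so that the classifier sees the time series rather than a single frame.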

The learning unit 160 performs machine learning using teacher data so that the human body feature extraction unit 120, the peripheral feature extraction unit 130, the peripheral feature filter unit 140, and the behavior discrimination unit 150 can execute the processing described above.

For example, the learning unit 160 uses teacher data in which normalized images of the human area are associated with a person's posture features (for example, joint positions) to adjust the network parameters (for example, weighting coefficients and biases) of the convolution processing unit and the first and second fully connected units of the human body feature extraction unit 120.

The learning unit 160 also uses, for example, teacher data in which normalized images of objects around a person are associated with object features (for example, the positional relationship with the human body) to adjust the network parameters (for example, weighting coefficients and biases) of the convolution processing unit and the fully connected unit of the peripheral feature extraction unit 130.

The learning unit 160 also uses, for example, teacher data in which a person's posture features are associated with the importance of peripheral objects to adjust the network parameters (for example, weighting coefficients and biases) of the fully connected unit of the peripheral feature filter unit 140.

The learning unit 160 also uses, for example, teacher data in which time-series data of a person's posture features and peripheral features is associated with the correct behavior class to adjust the network parameters (for example, weighting coefficients and biases) of the intermediate layer 61 and the fully connected unit of the behavior discrimination unit 150.

The learning unit 160 may perform these learning processes using, for example, the known error backpropagation method. The learning unit 160 then stores the network parameters adjusted by the learning process in an external storage unit or the like.

As described above, the image processing device 60 according to the present embodiment sets the importance of peripheral features in association with the posture features of the human body and filters the peripheral features accordingly, which makes it possible to extract only the peripheral objects related to the person's behavior. As a result, the image processing device 60 can estimate a person's behavior class with high accuracy even in environments where the type, position, or appearance of peripheral objects varies.

In particular, the image processing device 60 incorporating the information processing device 100 according to the present embodiment extracts, from the time-series data of posture features and peripheral features, temporal changes in the positional relationship between the human body posture and related peripheral objects, and estimates the person's behavior class from them. It can therefore estimate the behavior class with higher accuracy.

The program that operates the information processing device described above may be provided on a computer-readable recording medium such as a USB memory, a flexible disk, or a CD-ROM, or may be provided online via a network such as the Internet. In this case, the program code recorded on the computer-readable recording medium may be written in assembly language or machine language.

100 Information processing device
10 SIMD processor
11 SIMD register file
12 ALU
13 MUL
14 Shifter
15 LS
20 General-purpose processor
21 Scalar register file
22 ALU
23 MUL
24 LS
30 Memory
60 Image processing device

Claims

ＳＩＭＤ用に利用可能な複数のレジスターと、
ＳＩＭＤ処理するプロセッサと、を備える情報処理装置を制御する方法であって、
（ａ）カーネルサイズから使用する前記レジスターのサイズを選択するステップと、
（ｂ）選択したサイズのレジスターの数、前記カーネルサイズから、カーネルの分割量を決定するステップと、
（ｃ）決定した前記カーネルの分割量で分割したカーネル毎に、前記レジスターに格納し、畳み込み演算を並列処理するステップと、
を含む、処理方法。
A processing method for controlling an information processing device that includes a plurality of registers available for SIMD and a processor that performs SIMD processing, the method comprising:
(a) selecting, based on the kernel size, the size of the registers to be used;
(b) determining the kernel division amount from the number of registers of the selected size and the kernel size; and
(c) storing in the registers each kernel portion obtained by dividing the kernel by the determined division amount, and processing the convolution operation in parallel.
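The three steps of this claim can be sketched in plain Python to show that accumulating the result one kernel row group at a time gives the same output as using the whole kernel at once. This is an illustrative sketch only: the function names are hypothetical, SIMD lanes are not modelled (the inner loops stand in for them), and the register-size rule for kernel sizes between 5×5 and 11×11 is an assumption, since the claims name only those two cases.

```python
def select_register_bits(kernel_size):
    """Step (a): pick a register width from the kernel size.

    Assumption: only two cases are given in the claims
    (11x11 -> 128-bit, 5x5 or smaller -> 64-bit); the rule for
    sizes in between is a guess made for this sketch.
    """
    return 64 if kernel_size <= 5 else 128


def split_kernel_rows(kernel, rows_per_group):
    """Step (b) result applied: cut the kernel into groups of rows.

    Returns (first_row_index, rows) pairs.
    """
    return [(r0, kernel[r0:r0 + rows_per_group])
            for r0 in range(0, len(kernel), rows_per_group)]


def convolve_split(image, kernel, rows_per_group, stride=1):
    """Step (c): convolve by accumulating one kernel row group at a time.

    CNN-style convolution (no kernel flip), no padding.
    """
    w = len(kernel)
    out_h = (len(image) - w) // stride + 1
    out_w = (len(image[0]) - w) // stride + 1
    acc = [[0.0] * out_w for _ in range(out_h)]  # intermediate buffer
    for r0, group in split_kernel_rows(kernel, rows_per_group):
        # Only this row group would be resident in registers here.
        for oy in range(out_h):
            for ox in range(out_w):
                s = 0.0
                for dr, kernel_row in enumerate(group):
                    image_row = image[oy * stride + r0 + dr]
                    for c, kv in enumerate(kernel_row):
                        s += kv * image_row[ox * stride + c]
                acc[oy][ox] += s
    return acc


def convolve_direct(image, kernel, stride=1):
    """Reference: the same convolution with the whole kernel as one group."""
    return convolve_split(image, kernel, len(kernel), stride)
```

For example, `convolve_split(image, kernel, 2)` processes a 3×3 kernel as one group of two rows followed by one group of one row, and should match `convolve_direct(image, kernel)`.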

前記レジスターに１度に格納するカーネルの行数をｎとしたとき、
前記ステップ（ｂ）では、下記式を満たす最大の値に、前記ｎを設定する、請求項１に記載の処理方法。
ｋ−（ｘ＋ｊ）≧（ｂ＋１）ｎ
ここでｊは、ｊ＝（ｗ＋１−ｙ）／ｂ、
ｂ＝ｗ／ｑ、
ｊ、ｂは小数点以下を切り上げた整数であり、
ｗ：カーネルサイズ
ｋ：利用可能な前記レジスターの個数
ｑ：１つの前記レジスターに格納するデータ数
ｘ：畳み込み演算に必要な前記レジスターの個数
ｙ：スキップする画素数である。
The processing method according to claim 1, wherein, where n is the number of kernel rows stored in the registers at one time, step (b) sets n to the largest value satisfying the following expression:
k - (x + j) ≥ (b + 1)n
where j = (w + 1 - y) / b and b = w / q, with j and b each rounded up to an integer, and
w: kernel size
k: number of available registers
q: number of data items stored in one register
x: number of registers required for the convolution operation
y: number of pixels to skip.
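The expression in this claim can be evaluated directly. The sketch below computes b and j with round-up integer division and then takes the largest n satisfying k - (x + j) ≥ (b + 1)n. The function names are hypothetical, the cap of n at the kernel size w is an assumption added for the sketch, and the register counts used in the usage note (k and x) are illustrative values, not figures taken from the specification.

```python
def ceil_div(a, b):
    """Integer division rounded up (b must be positive)."""
    return -(-a // b)


def max_kernel_rows(w, k, q, x, y):
    """Largest n with k - (x + j) >= (b + 1) * n.

    w: kernel size, k: available registers, q: data items per register,
    x: registers needed for the convolution itself, y: pixels to skip.
    """
    b = ceil_div(w, q)            # registers per kernel row
    j = ceil_div(w + 1 - y, b)    # as defined in the claim, rounded up
    free = k - (x + j)            # registers left for kernel rows
    if free < 0:
        return 0                  # not even one row fits
    n = free // (b + 1)
    return min(n, w)              # assumption: never more rows than the kernel has
```

With the hypothetical values w = 11, q = 4, y = 4, k = 16, x = 4, this gives b = 3, j = 3, and n = 2. With w = 5, q = 2, y = 1, k = 32, x = 4, it gives n = 5, meaning the whole kernel fits without division.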

前記カーネルサイズが１１×１１のときに、
前記ステップ（ｂ）では、前記カーネルを２行ずつの分割に決定し、
前記ステップ（ｃ）では、前記カーネルを２行分毎に前記レジスターに格納し、格納した後、２行毎に畳み込み演算を実行する、請求項１に記載の処理方法。
The processing method according to claim 1, wherein, when the kernel size is 11 × 11, step (b) determines that the kernel is divided two rows at a time, and step (c) stores the kernel in the registers two rows at a time and, after storing, executes the convolution operation for every two rows.

前記カーネルサイズが１１×１１のときに、
前記ステップ（ａ）では、１２８ビットのレジスターを選択し、
前記ステップ（ｃ）では、入力画素４画素分を１２８ビットのレジスターに格納し、４画素ずつの畳み込み演算を並列処理する、請求項３に記載の処理方法。
The processing method according to claim 3, wherein, when the kernel size is 11 × 11, step (a) selects 128-bit registers, and step (c) stores four input pixels in a 128-bit register and processes the convolution operation in parallel four pixels at a time.

前記カーネルサイズが５×５以下のときに、
前記ステップ（ｂ）では、前記カーネルを分割しないことを決定し、
前記ステップ（ｃ）では、前記カーネルを１度に前記レジスターに格納し、格納した後に畳み込み演算を実行する、請求項１、３および４のいずれか一項に記載の処理方法。
The processing method according to any one of claims 1, 3, and 4, wherein, when the kernel size is 5 × 5 or smaller, step (b) determines that the kernel is not divided, and step (c) stores the kernel in the registers all at once and executes the convolution operation after storing.

前記カーネルサイズが５×５以下のときに、
前記ステップ（ａ）では、６４ビットのレジスターを選択し、
前記ステップ（ｃ）では、入力画素２画素分を６４ビットの前記レジスターに格納し、２画素ずつの畳み込み演算を並列処理する、請求項５に記載の処理方法。
The processing method according to claim 5, wherein, when the kernel size is 5 × 5 or smaller, step (a) selects 64-bit registers, and step (c) stores two input pixels in a 64-bit register and processes the convolution operation in parallel two pixels at a time.

前記複数のレジスターのうち、カーネルを格納するレジスターを固定しておき、入力画素用のレジスターに格納するデータをスライドさせながら、中間バッファに格納した結果データを更新する、請求項１から請求項６のいずれか一項に記載の処理方法。
The processing method according to any one of claims 1 to 6, wherein, among the plurality of registers, the registers storing the kernel are kept fixed, and the result data stored in an intermediate buffer is updated while the data stored in the registers for input pixels is slid.
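The fixed-kernel, sliding-input scheme of this claim can be illustrated for a single kernel row. In the sketch below the kernel row stays in place while a window of input pixels slides by one pixel per kernel tap, and each pass adds its contribution to an intermediate buffer of partial results. The name `accumulate_row`, the lane count of four, and the stride of one are assumptions made for illustration; Python lists stand in for SIMD registers.

```python
LANES = 4  # assumption: four values per register, as with 128-bit registers


def accumulate_row(kernel_row, input_row, buffer):
    """Add one kernel row's contribution to a row of the intermediate buffer.

    kernel_row is held fixed while a LANES-wide window of input_row
    slides one pixel per kernel tap (stride 1, no padding).
    """
    out_w = len(input_row) - len(kernel_row) + 1
    for x0 in range(0, out_w, LANES):
        lanes = min(LANES, out_w - x0)
        acc = buffer[x0:x0 + lanes]                    # load partial results
        for t, kv in enumerate(kernel_row):            # kernel value stays put
            window = input_row[x0 + t:x0 + t + lanes]  # slid input window
            acc = [a + kv * p for a, p in zip(acc, window)]
        buffer[x0:x0 + lanes] = acc                    # store updated results
    return buffer
```

Calling `accumulate_row` once per kernel row, each time with the matching input row and the same buffer, builds up one full output row of the convolution.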

ＳＩＭＤ用に利用可能な複数のレジスターと、
ＳＩＭＤ処理するプロセッサと、を備える情報処理装置を制御するプログラムであって、
（ａ）カーネルサイズから使用する前記レジスターのサイズを選択するステップと、
（ｂ）選択したサイズのレジスターの数、前記カーネルサイズから、カーネルの分割量を決定するステップと、
（ｃ）決定した前記カーネルの分割量で分割したカーネル毎に、前記レジスターに格納し、畳み込み演算を並列処理するステップと、
を含む、方法を実行するためのプログラム。
A program for controlling an information processing device that includes a plurality of registers available for SIMD and a processor that performs SIMD processing, the program being for executing a method comprising:
(a) selecting, based on the kernel size, the size of the registers to be used;
(b) determining the kernel division amount from the number of registers of the selected size and the kernel size; and
(c) storing in the registers each kernel portion obtained by dividing the kernel by the determined division amount, and processing the convolution operation in parallel.

前記レジスターに１度に格納するカーネルの行数をｎとしたとき、
前記ステップ（ｂ）では、下記式を満たす最大の値に、前記ｎを設定する、請求項８に記載のプログラム。
ｋ−（ｘ＋ｊ）≧（ｂ＋１）ｎ
ここでｊは、ｊ＝（ｗ＋１−ｙ）／ｂ、
ｂ＝ｗ／ｑ、
ｊ、ｂは小数点以下を切り上げた整数であり、
ｗ：カーネルサイズ
ｋ：利用可能な前記レジスターの個数
ｑ：１つの前記レジスターに格納するデータ数
ｘ：畳み込み演算に必要な前記レジスターの個数
ｙ：スキップする画素数である。
The program according to claim 8, wherein, where n is the number of kernel rows stored in the registers at one time, step (b) sets n to the largest value satisfying the following expression:
k - (x + j) ≥ (b + 1)n
where j = (w + 1 - y) / b and b = w / q, with j and b each rounded up to an integer, and
w: kernel size
k: number of available registers
q: number of data items stored in one register
x: number of registers required for the convolution operation
y: number of pixels to skip.

前記カーネルサイズが１１×１１のときに、
前記ステップ（ｂ）では、前記カーネルを２行ずつの分割に決定し、
前記ステップ（ｃ）では、前記カーネルを２行分毎に前記レジスターに格納し、格納した後、２行毎に畳み込み演算を実行する、請求項８に記載のプログラム。
The program according to claim 8, wherein, when the kernel size is 11 × 11, step (b) determines that the kernel is divided two rows at a time, and step (c) stores the kernel in the registers two rows at a time and, after storing, executes the convolution operation for every two rows.

前記カーネルサイズが１１×１１のときに、
前記ステップ（ａ）では、１２８ビットのレジスターを選択し、
前記ステップ（ｃ）では、入力画素４画素分を１２８ビットのレジスターに格納し、４画素ずつの畳み込み演算を並列処理する、請求項１０に記載のプログラム。
The program according to claim 10, wherein, when the kernel size is 11 × 11, step (a) selects 128-bit registers, and step (c) stores four input pixels in a 128-bit register and processes the convolution operation in parallel four pixels at a time.

前記カーネルサイズが５×５以下のときに、
前記ステップ（ｂ）では、前記カーネルを分割しないことを決定し、
前記ステップ（ｃ）では、前記カーネルを１度に前記レジスターに格納し、格納した後に畳み込み演算を実行する、請求項８、１０および１１のいずれか一項に記載のプログラム。
The program according to any one of claims 8, 10, and 11, wherein, when the kernel size is 5 × 5 or smaller, step (b) determines that the kernel is not divided, and step (c) stores the kernel in the registers all at once and executes the convolution operation after storing.

前記カーネルサイズが５×５以下のときに、
前記ステップ（ａ）では、６４ビットのレジスターを選択し、
前記ステップ（ｃ）では、入力画素２画素分を６４ビットの前記レジスターに格納し、２画素ずつの畳み込み演算を並列処理する、請求項１２に記載のプログラム。
The program according to claim 12, wherein, when the kernel size is 5 × 5 or smaller, step (a) selects 64-bit registers, and step (c) stores two input pixels in a 64-bit register and processes the convolution operation in parallel two pixels at a time.

前記複数のレジスターのうち、カーネルを格納するレジスターを固定しておき、入力画素用のレジスターに格納するデータをスライドさせながら、中間バッファに格納した結果データを更新する、請求項８から請求項１３のいずれか一項に記載のプログラム。
The program according to any one of claims 8 to 13, wherein, among the plurality of registers, the registers storing the kernel are kept fixed, and the result data stored in an intermediate buffer is updated while the data stored in the registers for input pixels is slid.

ＳＩＭＤ用に利用可能な複数のレジスターと、
ＳＩＭＤ処理するプロセッサと、を備える情報処理装置であって、
カーネルサイズから使用する前記レジスターのサイズを選択し、選択したサイズのレジスターの数、前記カーネルサイズから、カーネルの分割量を決定し、決定した前記カーネルの分割量で分割したカーネル毎に、前記レジスターに格納し、畳み込み演算を並列処理する、情報処理装置。
An information processing device comprising a plurality of registers available for SIMD and a processor that performs SIMD processing, wherein the information processing device selects, based on the kernel size, the size of the registers to be used, determines the kernel division amount from the number of registers of the selected size and the kernel size, stores in the registers each kernel portion obtained by dividing the kernel by the determined division amount, and processes the convolution operation in parallel.

前記決定した分割量で分割され、前記レジスターに１度に格納するカーネルの行数をｎとしたとき、
下記式を満たす最大の値に、前記ｎを設定する、請求項１５に記載の情報処理装置。
ｋ−（ｘ＋ｊ）≧（ｂ＋１）ｎ
ここでｊは、ｊ＝（ｗ＋１−ｙ）／ｂ、
ｂ＝ｗ／ｑ、
ｊ、ｂは小数点以下を切り上げた整数であり、
ｗ：カーネルサイズ
ｋ：利用可能な前記レジスターの個数
ｑ：１つの前記レジスターに格納するデータ数
ｘ：畳み込み演算に必要な前記レジスターの個数
ｙ：スキップする画素数である。
The information processing device according to claim 15, wherein, where n is the number of kernel rows obtained by dividing by the determined division amount and stored in the registers at one time, n is set to the largest value satisfying the following expression:
k - (x + j) ≥ (b + 1)n
where j = (w + 1 - y) / b and b = w / q, with j and b each rounded up to an integer, and
w: kernel size
k: number of available registers
q: number of data items stored in one register
x: number of registers required for the convolution operation
y: number of pixels to skip.

前記複数のレジスターのうち、カーネルを格納するレジスターを固定しておき、入力画素用のレジスターに格納するデータをスライドさせながら、中間バッファに格納した結果データを更新する、請求項１５または請求項１６に記載の情報処理装置。
The information processing device according to claim 15 or claim 16, wherein, among the plurality of registers, the registers storing the kernel are kept fixed, and the result data stored in an intermediate buffer is updated while the data stored in the registers for input pixels is slid.

撮像装置が生成した画像を取得する画像取得部と、
前記画像に映る人の特徴、および前記画像に映る人の周辺物体の形状、位置又は種別を示す周辺特徴を抽出する、請求項１５から請求項１７のいずれか一項に記載の情報処理装置と、
を備える、画像処理装置。
An image processing device comprising:
an image acquisition unit that acquires an image generated by an imaging device; and
the information processing device according to any one of claims 15 to 17, which extracts features of a person appearing in the image and peripheral features indicating the shape, position, or type of objects around the person appearing in the image.