JP2020173669A

JP2020173669A - Image recognition device, image recognition method, image recognition program, and image recognition system

Info

Publication number: JP2020173669A
Application number: JP2019075833A
Authority: JP
Inventors: Mu Ryu; Yasuhiro Okamoto; Daeju Kim; Satoshi Yamada
Original assignee: SoftBank Corp
Current assignee: SoftBank Corp
Priority date: 2019-04-11
Filing date: 2019-04-11
Publication date: 2020-10-22
Anticipated expiration: 2039-04-11
Also published as: JP6868052B2

Abstract

PROBLEM TO BE SOLVED: To provide an image recognition device, an image recognition method, and an image recognition program capable of properly recognizing a character and a numeral described in a non-standard form. SOLUTION: An image recognition device 10 includes: an acquisition unit 11 for acquiring an image having a photographed character string; a dividing unit 12 for dividing the image into a plurality of partial images; a calculation unit 13 for calculating information indicating a break of the character string for each of the plurality of partial images by using a first model 15b for extracting a feature quantity from each of the plurality of partial images, and a second model 15c for successively converting the feature quantities into information indicating the breaks of the character string; and an output unit 14 for outputting the information indicating the breaks of the character string. SELECTED DRAWING: Figure 7

Description

The present invention relates to an image recognition device, an image recognition method, an image recognition program, and an image recognition system.

Conventionally, image recognition devices that recognize characters and numerals appearing in an image by using OCR (Optical Character Recognition) technology have been used.

For example, Patent Document 1 below describes an image analysis device that extracts character information from a target image and, based on the position of that character information in the target image, associates it with the character information with which it forms a pair.

Further, Non-Patent Document 1 below describes a study in which character strings of handwritten Japanese text were segmented by using a plurality of methods.

Patent Document 1: JP-A-2018-92459

Non-Patent Document 1: Kha Cong Nguyen and Nakagawa Masaki, "Text-Line and Character Segmentation for Off-line Recognition of Handwritten Japanese Text", IEICE Technical Report (Shingaku Giho), vol. 115, no. 517, PRMU2015-173, pp. 53-58, March 2016

For example, the technique described in Patent Document 1 detects a region of an image surrounded by ruled lines and recognizes the characters written in that region. The technique described in Non-Patent Document 1 attempts segmentation of handwritten kanji, hiragana, and katakana. However, for a character composed of elements that can be separated into two or more parts, such as "川" or "ル", appropriate segmentation has been difficult.

Therefore, the present invention provides an image recognition device, an image recognition method, an image recognition program, and an image recognition system capable of appropriately segmenting a character string even when the character string includes a character composed of elements separable into two or more parts.

An image recognition device according to one aspect of the present invention includes: an acquisition unit that acquires an image in which a character string appears; a dividing unit that divides the image into a plurality of partial images; a calculation unit that calculates information representing character string delimiters for each of the plurality of partial images by using a first model that extracts a feature amount from each of the plurality of partial images and a second model that sequentially converts the feature amounts into information representing character string delimiters; and an output unit that outputs the information representing the character string delimiters.

According to this aspect, the image is divided into a plurality of partial images, the features of the partial images are captured by the first model, and the feature amounts are converted by the second model into information representing character string delimiters. As a result, the character string can be segmented appropriately even when the character string appearing in the image is composed of elements separable into two or more parts.

In the above aspect, the information representing the character string delimiters may be binary information indicating whether or not each partial image corresponds to a character string delimiter.

According to this aspect, the image can be divided, in accordance with the binary information, into regions that correspond to character string delimiters and regions that do not, so that the character string can be segmented appropriately.

In the above aspect, the first model may be a CNN (Convolutional Neural Network) that calculates a feature map as the feature amount from each of the plurality of partial images, and the second model may be an RNN (Recurrent Neural Network) that sequentially converts the feature maps into the binary information.

According to this aspect, the features of the plurality of partial images are appropriately captured by the feature maps calculated by the CNN, and the RNN converts the feature maps into information representing character string delimiters while taking into account the context of preceding and succeeding feature maps, so that the character string can be segmented more appropriately.

In the above aspect, the image recognition device may further include: a storage unit that stores learning data in which information representing character string delimiters is associated with a learning image in which a character string appears; and a generation unit that generates the first model and the second model by using the learning data.

According to this aspect, when learning data in which information representing character string delimiters is associated with learning images is given, it is possible to generate a first model and a second model that can appropriately calculate information representing the delimiters of a character string appearing in an image.

In the above aspect, the generation unit may generate the first model and the second model so as to minimize a CTC (Connectionist Temporal Classification) loss function.

According to this aspect, it is possible to generate a first model and a second model that appropriately segment character strings written with arbitrary character spacing and size.

An image recognition method according to another aspect of the present invention causes an image recognition device to execute: acquiring an image in which a character string appears; dividing the image into a plurality of partial images; calculating information representing character string delimiters for each of the plurality of partial images by using a first model that extracts a feature amount from each of the plurality of partial images and a second model that sequentially converts the feature amounts into information representing character string delimiters; and outputting the information representing the character string delimiters.

An image recognition program according to another aspect of the present invention causes an image recognition device to execute: acquiring an image in which a character string appears; dividing the image into a plurality of partial images; calculating information representing character string delimiters for each of the plurality of partial images by using a first model that extracts a feature amount from each of the plurality of partial images and a second model that sequentially converts the feature amounts into information representing character string delimiters; and outputting the information representing the character string delimiters.

An image recognition system according to another aspect of the present invention includes an image recognition device and a user terminal. The image recognition device has: an acquisition unit that acquires, from the user terminal, an image in which a character string appears; a dividing unit that divides the image into a plurality of partial images; a calculation unit that calculates information representing character string delimiters for each of the plurality of partial images by using a first model that extracts a feature amount from each of the plurality of partial images and a second model that sequentially converts the feature amounts into information representing character string delimiters; and an output unit that outputs the information representing the character string delimiters to the user terminal.

According to the present invention, it is possible to provide an image recognition device, an image recognition method, an image recognition program, and an image recognition system capable of appropriately segmenting a character string even when the character string includes a character composed of elements separable into two or more parts.

FIG. 1 is a diagram showing the network configuration of the image recognition system according to an embodiment of the present invention.
FIG. 2 is a diagram showing the functional blocks of the image recognition device according to the present embodiment.
FIG. 3 is a diagram showing the physical configuration of the image recognition device according to the present embodiment.
FIG. 4 is a diagram showing an example of an image acquired by the image recognition device according to the present embodiment.
FIG. 5 is a diagram showing an example of partial images divided by the image recognition device according to the present embodiment.
FIG. 6 is a conceptual diagram of the first model and the second model used by the image recognition device according to the present embodiment.
FIG. 7 is a diagram showing an example of character string delimiters calculated by the image recognition device according to the present embodiment.
FIG. 8 is a flowchart of the segmentation process executed by the image recognition device according to the present embodiment.
FIG. 9 is a flowchart of the learning process executed by the image recognition device according to the present embodiment.

Embodiments of the present invention will be described with reference to the accompanying drawings. In the figures, elements denoted by the same reference numerals have the same or similar configurations.

FIG. 1 is a diagram showing the network configuration of an image recognition system 100 according to an embodiment of the present invention. The image recognition system 100 includes an image recognition device 10 and a user terminal 20. The image recognition device 10 receives an image from the user terminal 20 via a communication network N and segments the character string appearing in the image. Here, segmentation of a character string means the process of dividing the character string into individual characters. The present embodiment describes an example in which the image recognition device 10 segments a character string written in katakana. However, the image recognition device 10 can also segment kanji, hiragana, and numerals, as well as character strings in which kanji, hiragana, and katakana are mixed. It can also segment character strings that include English letters or characters of other languages.

The user terminal 20 is a general-purpose computer and may be, for example, a smartphone with a camera. The user terminal 20 executes an image recognition application, captures an image of a character string, transmits the image to the image recognition device 10 via the communication network N, and receives the segmentation result for the character string from the image recognition device 10. The image recognition system 100 may also include an OCR device that performs OCR (Optical Character Recognition) based on the image of the segmented character string. In that case, the captured image of the character string may be transmitted from the user terminal 20 to the image recognition device 10, the image of the segmented character string may be transmitted from the image recognition device 10 to the OCR device, the OCR device may recognize the character string appearing in the image, and the recognition result may be transmitted from the OCR device to the user terminal 20.

FIG. 2 is a diagram showing the functional blocks of the image recognition device 10 according to the present embodiment. The image recognition device 10 includes an acquisition unit 11, a dividing unit 12, a calculation unit 13, an output unit 14, a storage unit 15, and a generation unit 16.

The acquisition unit 11 acquires an image from the user terminal 20. Together with the image, the acquisition unit 11 may acquire information on acceptance inspection associated with the image.

The dividing unit 12 divides the acquired image into a plurality of partial images. The dividing unit 12 may divide a rectangular image in which a character string appears into a plurality of rectangular partial images. Here, each partial image may be a rectangle whose side along the direction in which the characters are arranged is shorter than its side orthogonal to that direction. For example, when an image in which a character string is arranged in a single horizontal row measures A pixels wide by B pixels high, the dividing unit 12 may divide the image into a plurality of partial images each A/N pixels wide by B pixels high. Here, N is a numerical value greater than or equal to A. When A is not evenly divisible by N, the width of each partial image may be A/N rounded to the nearest integer or A/N rounded up.
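The division into vertical strips described above can be sketched as follows. This is a minimal illustration, not the device's actual implementation: the function names `strip_bounds` and `split_rows`, the list-of-rows image representation, and the handling of a narrower final strip are all assumptions of the sketch. The sketch also assumes N is chosen no larger than the image width, so that every strip is at least one pixel wide.

```python
import math


def strip_bounds(image_width, n_strips, rounding="ceil"):
    """Return (left, right) pixel column ranges for full-height vertical strips.

    The strip width is image_width / n_strips, rounded up ("ceil") or rounded
    half up ("round") when the division is not exact. Because of the rounding,
    the last strip may be narrower and the strip count may differ from n_strips.
    """
    if image_width <= 0 or n_strips <= 0:
        raise ValueError("image_width and n_strips must be positive")
    raw = image_width / n_strips
    if rounding == "ceil":
        width = math.ceil(raw)
    else:
        width = math.floor(raw + 0.5)  # round half up
    width = max(width, 1)  # never produce an empty strip

    bounds = []
    left = 0
    while left < image_width:
        right = min(left + width, image_width)
        bounds.append((left, right))
        left = right
    return bounds


def split_rows(image, bounds):
    """Cut an image (a list of pixel rows) into strips along the given bounds."""
    return [[row[left:right] for row in image] for left, right in bounds]


if __name__ == "__main__":
    image = [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]]  # 6 wide, 2 high
    for strip in split_rows(image, strip_bounds(6, 3)):
        print(strip)
```

Each strip keeps the full image height B, so a strip is narrow along the writing direction and tall across it, as the paragraph above describes.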

The calculation unit 13 calculates information representing character string delimiters for each of the plurality of partial images by using a first model 15b that extracts a feature amount from each of the plurality of partial images and a second model 15c that sequentially converts the feature amounts into information representing character string delimiters. The first model 15b and the second model 15c will be described in detail later.

The output unit 14 outputs the information representing the character string delimiters to the user terminal 20. However, the output unit 14 may instead output the information representing the character string delimiters to another device such as an OCR device. Thus, according to the image recognition device 10 of the present embodiment, the image is divided into a plurality of partial images, the features of the partial images are captured by the first model 15b, and the feature amounts are converted by the second model 15c into information representing character string delimiters, so that the character string can be segmented appropriately even when the character string appearing in the image is composed of elements separable into two or more parts.

The information representing the character string delimiters calculated by the calculation unit 13 may be binary information indicating whether or not each partial image corresponds to a character string delimiter. In accordance with the binary information, the image can then be divided into regions that correspond to character string delimiters and regions that do not, so that the character string can be segmented appropriately.

The storage unit 15 stores learning data 15a in which information representing character string delimiters is associated with a learning image in which a character string appears. The storage unit 15 also stores the first model 15b and the second model 15c.

The generation unit 16 generates the first model 15b and the second model 15c by using the learning data 15a. The generation unit 16 may generate the first model 15b and the second model 15c by supervised learning using the learning data 15a. That is, the generation unit 16 may calculate feature amounts from a learning image included in the learning data 15a by using the first model 15b, convert the feature amounts into information representing character string delimiters by using the second model 15c, compare the result with the information representing character string delimiters included in the learning data 15a, and update the parameters of the first model 15b and the second model 15c so that the error becomes smaller, thereby generating the first model 15b and the second model 15c. When learning data 15a in which information representing character string delimiters is associated with learning images is given, the generation unit 16 can thus generate a first model 15b and a second model 15c that can appropriately calculate information representing the delimiters of a character string appearing in an image.

The first model 15b may be a CNN (Convolutional Neural Network) that calculates a feature map as the feature amount from each of the plurality of partial images. The second model 15c may be an RNN (Recurrent Neural Network) that sequentially converts the feature maps into binary information indicating whether or not each partial image corresponds to a character string delimiter. Here, the RNN may be composed of, for example, a bidirectional LSTM (Long Short-Term Memory). In this way, the features of the plurality of partial images are appropriately captured by the feature maps calculated by the CNN, and the RNN converts the feature maps into information representing character string delimiters while taking into account the context of preceding and succeeding feature maps, so that the character string can be segmented more appropriately.
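To illustrate how a bidirectional recurrent model can take the preceding and succeeding strips into account, the following toy sketch runs a scalar recurrence over the strip sequence in both directions and thresholds the combined state. It is deliberately simplified and is not the model of this embodiment: it uses a plain tanh recurrence rather than an LSTM, one hypothetical scalar feature per strip instead of a CNN feature map, and hand-picked weights (`w_in`, `w_rec`, `w_out`, `bias`) rather than learned parameters.

```python
import math


def rnn_pass(features, w_in, w_rec):
    """Scalar recurrence h_t = tanh(w_in * x_t + w_rec * h_{t-1}), with h_0 = 0."""
    h = 0.0
    states = []
    for x in features:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def bidirectional_separator_bits(features, w_in=1.0, w_rec=0.5,
                                 w_out=1.0, bias=0.0, threshold=0.5):
    """Return one bit per strip: 1 for a delimiter strip, 0 otherwise.

    The forward pass summarizes the strips to the left of each position and
    the backward pass summarizes the strips to the right; both states feed a
    sigmoid output, so each decision depends on context from both sides.
    """
    forward = rnn_pass(features, w_in, w_rec)
    backward = rnn_pass(features[::-1], w_in, w_rec)[::-1]
    bits = []
    for f, b in zip(forward, backward):
        score = 1.0 / (1.0 + math.exp(-(w_out * (f + b) + bias)))
        bits.append(1 if score >= threshold else 0)
    return bits


if __name__ == "__main__":
    # Hypothetical per-strip features: positive means the strip looks blank.
    print(bidirectional_separator_bits([-2.0, -2.0, 2.0, 2.0, -2.0]))
```

In a trained model the weights would be learned from the learning data 15a, and the per-strip input would be the feature map produced by the first model 15b.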

The generation unit 16 may generate the first model 15b and the second model 15c so as to minimize a CTC (Connectionist Temporal Classification) loss function. The generation unit 16 may generate the first model 15b and the second model 15c by optimizing, for example by error backpropagation, the parameters of the CNN constituting the first model 15b and the RNN constituting the second model 15c so as to minimize the CTC loss function. The configurations of the CNN and the RNN included in the first model 15b and the second model 15c are arbitrary; for example, a GRU (Gated Recurrent Unit) may be used instead of the LSTM block. By using the CTC loss function, it is possible to generate a first model 15b and a second model 15c that appropriately segment character strings written with arbitrary character spacing and size.
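For reference, the CTC loss in its standard formulation can be written as follows. The symbols are notation introduced here and do not appear in this document: x is the input sequence of length T (here, the sequence of per-strip features), y is the target label sequence, π is a frame-level labeling of length T that may contain a blank symbol, and B is the map that merges repeated labels and then removes blanks. The document does not state how the delimiter information is encoded as the target sequence y, so that encoding is left open.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{y})} \prod_{t=1}^{T} p_t(\pi_t \mid \mathbf{x}),
\qquad
\mathcal{L}_{\mathrm{CTC}} = -\ln p(\mathbf{y} \mid \mathbf{x})
```

Because the sum runs over every frame-level labeling that collapses to y, the loss does not require the training labels to be aligned strip by strip with the input, which fits the stated aim of handling arbitrary character spacing and size.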

FIG. 3 is a diagram showing the physical configuration of the image recognition device 10 according to the present embodiment. The image recognition device 10 has a CPU (Central Processing Unit) 10a corresponding to a computation unit, a RAM (Random Access Memory) 10b corresponding to a storage unit, a ROM (Read Only Memory) 10c corresponding to a storage unit, a communication unit 10d, an input unit 10e, and a display unit 10f. These components are connected to one another via a bus so that data can be transmitted and received among them. Although this example describes a case in which the image recognition device 10 is composed of a single computer, the image recognition device 10 may be realized by combining a plurality of computers. The configuration shown in FIG. 3 is an example; the image recognition device 10 may have components other than these and need not have all of them.

The CPU 10a is a control unit that controls the execution of programs stored in the RAM 10b or the ROM 10c and computes and processes data. The CPU 10a is a computation unit that executes a program (the image recognition program) that divides an image in which a character string appears and calculates information representing character string delimiters for each of the plurality of partial images. The CPU 10a receives various data from the input unit 10e and the communication unit 10d, and displays the computation results on the display unit 10f or stores them in the RAM 10b.

The RAM 10b is the portion of the storage unit in which data can be rewritten, and may be composed of, for example, a semiconductor storage element. The RAM 10b may store the image recognition program executed by the CPU 10a and data such as the learning data. These are examples; the RAM 10b may store other data and need not store all of these.

The ROM 10c is the portion of the storage unit from which data can be read, and may be composed of, for example, a semiconductor storage element. The ROM 10c may store, for example, the image recognition program and data that is not rewritten.

The communication unit 10d is an interface that connects the image recognition device 10 to other devices. The communication unit 10d may be connected to a communication network N such as the Internet.

The input unit 10e accepts data input from an administrator of the image recognition device 10 and may include, for example, a keyboard and a touch panel.

The display unit 10f visually displays the computation results of the CPU 10a and may be composed of, for example, an LCD (Liquid Crystal Display). The display unit 10f may display the acquired image, the calculated information representing character string delimiters, and the like.

The image recognition program may be provided stored in a computer-readable storage medium such as the RAM 10b or the ROM 10c, or may be provided via a communication network connected through the communication unit 10d. In the image recognition device 10, the various operations described with reference to FIG. 2 are realized by the CPU 10a executing the image recognition program. These physical components are examples and need not be independent components. For example, the image recognition device 10 may include an LSI (Large-Scale Integration) in which the CPU 10a is integrated with the RAM 10b and the ROM 10c.

FIG. 4 is a diagram showing an example of an image IMG acquired by the image recognition device 10 according to the present embodiment. The image IMG contains the katakana character string "センタービル" ("center building"). When the character string in the image IMG is segmented by a conventional technique, the character "ル" may be erroneously split into "ノ" and "レ".

FIG. 5 is a diagram showing an example of partial images DIV1, DIV2, ... DIVN divided by the image recognition device 10 according to the present embodiment. The figure shows an example in which the image IMG is divided into N partial images DIV1, DIV2, ... DIVN. The height of each partial image DIV1, DIV2, ... DIVN is equal to that of the image IMG, and the width of each partial image is 1/N times the width of the image IMG.

FIG. 6 is a conceptual diagram of the first model 15b and the second model 15c used by the image recognition device 10 according to the present embodiment. The first model 15b is composed of a CNN and, based on the plurality of partial images DIV1, DIV2, ... DIVN divided from the image IMG, calculates a feature map FM for each of the partial images DIV1, DIV2, ... DIVN. The feature map FM may be an array of any dimension.

The second model 15c is composed of a bidirectional LSTM and sequentially converts the feature maps FM into binary information B representing character string delimiters. The binary information B is a bit string of 0s and 1s, and each bit indicates whether or not the corresponding partial image corresponds to a character string delimiter. In this example, a "1" in the binary information B indicates that the partial image corresponds to a character string delimiter, and a "0" indicates that the partial image does not correspond to a character string delimiter (that is, the partial image forms part of the character string).

FIG. 7 is a diagram showing an example of character string delimiters calculated by the image recognition device 10 according to the present embodiment. For an image IMG containing the character string "センタービル" ("center building"), the image recognition device 10 outputs first delimiter information SEP1, second delimiter information SEP2, third delimiter information SEP3, fourth delimiter information SEP4, fifth delimiter information SEP5, sixth delimiter information SEP6, and seventh delimiter information SEP7, each representing a delimiter of the character string. As a result, the six-character string "センタービル" is properly segmented.

Based on the binary information calculated by the second model 15c, the image recognition device 10 may determine that each region of the image IMG corresponding to a partial image whose binary information is "1" is a delimiter region of the character string, and may attach the first delimiter information SEP1, second delimiter information SEP2, third delimiter information SEP3, fourth delimiter information SEP4, fifth delimiter information SEP5, sixth delimiter information SEP6, and seventh delimiter information SEP7 to the image IMG.
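The mapping from the binary information B back to regions of the image IMG can be sketched as follows. This is only an illustration under stated assumptions: the function name `delimiter_regions` is hypothetical, all strips are assumed to have equal width, and merging adjacent "1" strips into a single delimiter region is a choice made for the example, not something the description specifies.

```python
def delimiter_regions(bits, image_width):
    """Map per-strip bits to delimiter regions of the image.

    Returns a list of (x_start, x_end) pixel ranges, with x_end exclusive.
    Assumption for this sketch: consecutive strips marked 1 are merged
    into one delimiter region.
    """
    n = len(bits)
    if n == 0 or image_width % n != 0:
        raise ValueError("image_width must be a non-zero multiple of len(bits)")
    strip_width = image_width // n
    regions = []
    run_start = None  # index of the first strip in the current run of 1s
    for i, bit in enumerate(bits):
        if bit == 1 and run_start is None:
            run_start = i
        elif bit != 1 and run_start is not None:
            regions.append((run_start * strip_width, i * strip_width))
            run_start = None
    if run_start is not None:  # a run of 1s reaches the right edge
        regions.append((run_start * strip_width, n * strip_width))
    return regions


# Eight strips over an 80-pixel-wide image: delimiters at both edges
# and one between two characters.
bits = [1, 0, 0, 1, 0, 0, 1, 1]
regions = delimiter_regions(bits, image_width=80)
```

With these inputs each strip is 10 pixels wide, so `regions` holds three pixel ranges, which would correspond to three pieces of delimiter information in the style of SEP1 to SEP3.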

FIG. 8 is a flowchart of the segmentation process executed by the image recognition device 10 according to the present embodiment. First, the image recognition device 10 acquires an image in which a character string is captured (S10). The image recognition device 10 then divides the image into a plurality of partial images (S11).

After that, the image recognition device 10 uses the first model 15b to extract a feature amount from each of the plurality of partial images (S12), and uses the second model 15c to sequentially convert the feature amounts into information representing the delimiters of the character string (S13).

Finally, the image recognition device 10 calculates the information representing the delimiters of the character string for each of the plurality of partial images and outputs it to the user terminal 20. The image recognition device 10 may also output the information representing the delimiters of the character string to another device, such as an OCR device.
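The flow of steps S11 to S13 can be sketched as a small pipeline. The two models here are deliberately trivial stand-ins, not the CNN and bidirectional LSTM of the embodiment: `mean_ink` (a hypothetical feature extractor) returns the fraction of ink pixels in a strip, and `mark_empty_strips` (a hypothetical sequence converter) marks a strip as a delimiter when it contains almost no ink. The pixel convention (1 = ink, 0 = background), the threshold, and all function names are assumptions made for the example.

```python
def split_into_strips(image, n):
    """S11: divide the image (list of pixel rows) into n equal-width strips."""
    width = len(image[0])
    if n <= 0 or width % n != 0:
        raise ValueError("n must be positive and must divide the image width")
    w = width // n
    return [[row[i * w:(i + 1) * w] for row in image] for i in range(n)]


def mean_ink(strip):
    """Stand-in for the first model: one scalar feature per strip."""
    pixels = [p for row in strip for p in row]
    return sum(pixels) / len(pixels)


def mark_empty_strips(features, threshold=0.1):
    """Stand-in for the second model: feature sequence -> bit sequence."""
    return [1 if f < threshold else 0 for f in features]


def segment(image, n, first_model, second_model):
    """Run S11 (divide), S12 (extract features), and S13 (convert to bits)."""
    strips = split_into_strips(image, n)
    features = [first_model(s) for s in strips]
    return second_model(features)


# Toy binary image: 1 = ink, 0 = background; six strips of width 1.
img = [
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
]
bits = segment(img, 6, mean_ink, mark_empty_strips)
```

Passing the models in as arguments keeps the control flow separate from the models themselves, so trained models could be substituted without changing `segment`.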

FIG. 9 is a flowchart of the learning process executed by the image recognition device 10 according to the present embodiment. First, the image recognition device 10 collects learning data 15a, in which information representing the delimiters of a character string is associated with a learning image in which that character string is captured, and stores the learning data in the storage unit 15 (S20).

After that, the image recognition device 10 uses the learning data 15a to execute the learning process of the first model 15b and the second model 15c so as to minimize the CTC loss function (S21). Here, the learning process may be a process of updating the parameters of the CNN constituting the first model 15b and the parameters of the RNN constituting the second model 15c by error backpropagation.

When the learning end condition is not satisfied (S22: NO), the image recognition device 10 executes the learning process of the first model 15b and the second model 15c again (S21). Here, the learning end condition may be, for example, that the value of the CTC loss function is equal to or less than a predetermined value, or that the number of epochs of the learning process is equal to or greater than a predetermined number.

On the other hand, when the learning end condition is satisfied (S22: YES), the image recognition device 10 stores the generated first model 15b and second model 15c in the storage unit 15.
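The loop between S21 and S22 can be sketched as follows. This is a hedged illustration only: `train_until_done` and `train_one_epoch` are hypothetical names, `train_one_epoch` stands for one pass of the learning process that returns the current loss value, and combining the two example end conditions with a logical OR is an assumption (the description presents them as alternatives).

```python
def train_until_done(train_one_epoch, loss_threshold, max_epochs):
    """Repeat the learning process (S21) until an end condition holds (S22).

    End condition assumed here: loss <= loss_threshold OR epochs >= max_epochs.
    Returns the number of epochs run and the last loss value.
    """
    epochs = 0
    while True:
        loss = train_one_epoch()  # S21: one pass of the learning process
        epochs += 1
        if loss <= loss_threshold or epochs >= max_epochs:  # S22: YES
            return epochs, loss
        # S22: NO, so the loop returns to S21


# A scripted sequence of loss values stands in for real training.
losses = iter([0.9, 0.5, 0.2, 0.05, 0.01])
epochs, final_loss = train_until_done(
    lambda: next(losses), loss_threshold=0.1, max_epochs=10
)
```

In this run the loss threshold triggers the stop; with a loss that never falls below the threshold, the epoch limit ends the loop instead.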

The embodiments described above are intended to facilitate understanding of the present invention and are not intended to limit its interpretation. The elements of the embodiments and their arrangement, materials, conditions, shapes, sizes, and the like are not limited to those illustrated and can be changed as appropriate. In addition, configurations shown in different embodiments can be partially replaced with or combined with one another.

10 ... image recognition device, 10a ... CPU, 10b ... RAM, 10c ... ROM, 10d ... communication unit, 10e ... input unit, 10f ... display unit, 11 ... acquisition unit, 12 ... division unit, 13 ... calculation unit, 14 ... output unit, 15 ... storage unit, 15a ... learning data, 15b ... first model, 15c ... second model, 16 ... generation unit, 20 ... user terminal, 100 ... image recognition system

Claims

An image recognition device comprising:
an acquisition unit that acquires an image in which a character string is captured;
a division unit that divides the image into a plurality of partial images;
a calculation unit that calculates, for each of the plurality of partial images, information representing a delimiter of the character string, using a first model that extracts a feature amount from each of the plurality of partial images and a second model that sequentially converts the feature amounts into the information representing the delimiter of the character string; and
an output unit that outputs the information representing the delimiter of the character string.

The image recognition device according to claim 1, wherein
the information representing the delimiter of the character string is binary information indicating whether or not the partial image corresponds to a delimiter of the character string.

The image recognition device according to claim 2, wherein
the first model is a CNN (Convolutional Neural Network) that calculates a feature map as the feature amount from each of the plurality of partial images, and
the second model is an RNN (Recurrent Neural Network) that sequentially converts the feature maps into the binary information.

The image recognition device according to any one of claims 1 to 3, further comprising:
a storage unit that stores learning data in which information representing a delimiter of a character string is associated with a learning image in which the character string is captured; and
a generation unit that generates the first model and the second model using the learning data.

The image recognition device according to claim 4, wherein
the generation unit generates the first model and the second model so as to minimize a CTC (Connectionist Temporal Classification) loss function.

An image recognition method that causes an image recognition device to execute:
acquiring an image in which a character string is captured;
dividing the image into a plurality of partial images;
calculating, for each of the plurality of partial images, information representing a delimiter of the character string, using a first model that extracts a feature amount from each of the plurality of partial images and a second model that sequentially converts the feature amounts into the information representing the delimiter of the character string; and
outputting the information representing the delimiter of the character string.

An image recognition program that causes an image recognition device to execute:
acquiring an image in which a character string is captured;
dividing the image into a plurality of partial images;
calculating, for each of the plurality of partial images, information representing a delimiter of the character string, using a first model that extracts a feature amount from each of the plurality of partial images and a second model that sequentially converts the feature amounts into the information representing the delimiter of the character string; and
outputting the information representing the delimiter of the character string.

An image recognition system comprising an image recognition device and a user terminal, wherein
the image recognition device includes:
an acquisition unit that acquires, from the user terminal, an image in which a character string is captured;
a division unit that divides the image into a plurality of partial images;
a calculation unit that calculates, for each of the plurality of partial images, information representing a delimiter of the character string, using a first model that extracts a feature amount from each of the plurality of partial images and a second model that sequentially converts the feature amounts into the information representing the delimiter of the character string; and
an output unit that outputs the information representing the delimiter of the character string to the user terminal.