CN112801030B - Target text region positioning method and device - Google Patents

Target text region positioning method and device

Info

Publication number
CN112801030B
Authority
CN
China
Prior art keywords
text
region
target
image
template image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110185262.XA
Other languages
Chinese (zh)
Other versions
CN112801030A (en)
Inventor
费志军
邱雪涛
高鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unionpay Co Ltd
Original Assignee
China Unionpay Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unionpay Co Ltd
Priority to CN202110185262.XA
Publication of CN112801030A
Application granted
Publication of CN112801030B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method and a device for positioning a target text region, belonging to the field of computer technology and relating to artificial intelligence and computer vision; the method and device are used to improve the accuracy of positioning a text region in a merchant door-head picture. The method comprises: determining at least one text primary selection region in a target image and acquiring a text template image corresponding to the target image; extracting features from the at least one text primary selection region to obtain primary selection region features; comparing the primary selection region features with the template image features of the text template image and determining at least one text selection region from the at least one text primary selection region; performing text recognition on the at least one text selection region and determining a target text region from among them according to the recognition result; and, if the text recognition result of the target text region is inconsistent with the label of the text template image, expanding the range of the target text region according to the label of the text template image.

Description

Target text region positioning method and device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for positioning a target text region.
Background
A door head is the plaque and related fixtures placed at the entrance of an enterprise, public institution, or individually owned business. It is a decorative element outside the shop and a means of beautifying the place of sale, decorating the storefront, and attracting customers.
A merchant's door head generally contains text such as the merchant name and merchant address. When the authenticity of a merchant is checked, an inspector must travel to the shop's address to take a photograph, after which an auditor verifies the information; this is inefficient and error-prone. Currently, to automatically recognize the characters in a merchant door-head picture, the position of the merchant-name text must first be located in the street-shot picture.
Existing image character recognition generally recognizes all characters in an image and cannot effectively distinguish the merchant-name text region from other text regions in a merchant door-head picture, which affects the accuracy of subsequent merchant-name recognition.
Disclosure of Invention
The embodiment of the invention provides a method and a device for positioning a target text region, which are used for improving the accuracy of positioning the text region in a merchant portal picture.
In one aspect, an embodiment of the present invention provides a method for positioning a target text region, including:
Determining at least one text primary selection area in a target image, and acquiring a text template image corresponding to the target image;
extracting features of the at least one text primary selection region to obtain primary selection region features;
comparing the features of the preliminary selected region with the template image features of the text template image, and determining at least one text selection region from the at least one text preliminary selected region;
performing text recognition on the at least one text selection area, and determining a target text area from the at least one text selection area according to a text recognition result;
comparing the text recognition result of the target text region with the label of the text template image, and if the text recognition result of the target text region is inconsistent with the label of the text template image, expanding the range of the target text region according to the label of the text template image to obtain the final target text region of the target image.
Optionally, the expanding the range of the target text area according to the label of the text template image includes:
determining the expansion direction of the target text region according to the label of the text template image and the text recognition result;
and expanding the target text region in the expansion direction until the text recognition result of the target text region is consistent with the label of the text template image.
Optionally, before extracting the features of the at least one text initial selection area, the method further includes:
extracting features of the text template image by using an image feature point extraction model to obtain a template image feature set of the text template image;
the feature extraction is performed on the at least one text initial selection area to obtain initial selection area features of the at least one text initial selection area, and the feature extraction comprises the following steps:
extracting features of the at least one text primary selection region by using the image feature point extraction model to obtain a primary selection region feature set of the at least one text primary selection region;
comparing the features of the preliminary selected region with the features of the template image of the text template image, and determining at least one text selection region from the at least one text preliminary selected region, comprising:
matching a preliminary region feature set of the at least one text preliminary region with a template image feature set of the text template image;
and taking a text primary selection region whose number of matched points is greater than a feature point threshold as a text selection region.
Optionally, the text recognition is performed on the at least one text selection area, and determining the target text area from the at least one text selection area according to the text recognition result includes:
performing text recognition on the at least one text selection area by using a text recognition model to obtain a text recognition result;
and taking a text selection area with the largest number of target characters in the text recognition result as the target text area.
The embodiment of the invention also provides an image text positioning network training method, which comprises the following steps:
acquiring a training image;
inputting the training image into a merchant text positioning network to obtain the merchant text position in the training image;
determining a target text region in the training image, wherein the target text region in the training image is obtained by the method;
and calculating a loss function according to the merchant text position and the target text region, optimizing parameters of the merchant text positioning network according to the loss function until the loss function is smaller than a preset threshold value, and determining the corresponding parameters as parameters corresponding to the merchant text positioning network to obtain the merchant text positioning network.
The embodiment of the invention also provides a device for positioning the target text region, which comprises:
the method comprises the steps of acquiring at least one text primary selection area in a target image, and acquiring a text template image corresponding to the target image;
the extraction unit is used for extracting the characteristics of the at least one text primary selection area to obtain primary selection area characteristics;
the selecting unit is used for comparing the features of the preliminary selected area with the template image features of the text template image, and determining at least one text selecting area from the at least one text preliminary selected area;
a determining unit, configured to perform text recognition on the at least one text selection area, and determine a target text area from the at least one text selection area according to a text recognition result;
and the expansion unit is used for comparing the text recognition result of the target text region with the label of the text template image, and expanding the range of the target text region according to the label of the text template image if the text recognition result of the target text region is inconsistent with the label of the text template image, so as to obtain the final target text region of the target image.
Optionally, the expansion unit is specifically configured to:
determine the expansion direction of the target text region according to the label of the text template image and the text recognition result;
and expand the target text region in the expansion direction until the text recognition result of the target text region is consistent with the label of the text template image.
Optionally, the extracting unit is specifically configured to perform feature extraction on the text template image by using an image feature point extraction model to obtain a template image feature set of the text template image; extracting features of the at least one text primary selection region by using the image feature point extraction model to obtain a primary selection region feature set of the at least one text primary selection region;
the selecting unit is specifically configured to match the primary selection region feature set of the at least one text primary selection region with the template image feature set of the text template image, and to take a text primary selection region whose number of matched points is greater than a feature point threshold as a text selection region.
Optionally, the determining unit is configured to:
performing text recognition on the at least one text selection area by using a text recognition model to obtain a text recognition result;
And taking a text selection area with the largest number of target characters in the text recognition result as the target text area.
The embodiment of the invention also provides an image text positioning network training device, which comprises:
an acquisition unit configured to acquire a training image;
the input unit is used for inputting the training image into a merchant text positioning network to obtain the merchant text position in the training image;
a positioning unit, configured to determine a target text region in the training image, where the target text region in the training image is obtained by the method described above;
and the optimizing unit is used for calculating a loss function according to the merchant text position and the target text region, optimizing parameters of the merchant text positioning network according to the loss function, and determining the corresponding parameters as parameters corresponding to the merchant text positioning network when the loss function is smaller than a preset threshold value to obtain the merchant text positioning network.
In another aspect, an embodiment of the present invention further provides a computer readable storage medium, where a computer program is stored, where the computer program, when executed by a processor, implements the method for positioning a target text region according to the first aspect.
In another aspect, an embodiment of the present invention further provides an electronic device, including a memory and a processor, where the memory stores a computer program that can be executed on the processor, and when the computer program is executed by the processor, the processor is caused to implement a method for positioning a target text area in the first aspect.
When performing text region positioning on a merchant's target image, the embodiment of the invention determines at least one text primary selection region in the target image and acquires a text template image corresponding to the target image. Features are extracted from the at least one text primary selection region to obtain primary selection region features; the primary selection region features are compared with the template image features of the text template image, and at least one text selection region is determined from the at least one text primary selection region; text recognition is performed on the at least one text selection region, and a target text region is determined from among them according to the recognition result. The text recognition result of the target text region is then compared with the label of the text template image; if they are inconsistent, the range of the target text region is expanded according to the label to obtain the final target text region of the target image. In this way, merchant door-head characters can be accurately positioned against the complex background of a door-head picture, interfering text regions in the image can be effectively screened out, and door-head characters separated by longer gaps can be assigned to the same target text region, which improves the positioning accuracy of the target text region and thereby ensures the accuracy of subsequent merchant-name recognition.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it will be apparent that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of a system architecture of a method for positioning a target text region according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for locating a target text region according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a merchant portal image according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a system frame corresponding to a method for locating a target text region according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a Chinese character recognition module according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a positioning device for a target text area according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The word "exemplary" is used hereinafter to mean "serving as an example, embodiment, or illustration." Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The terms "first," "second," and the like herein are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include one or more such features, and in the description of embodiments of the invention, unless otherwise indicated, "a plurality" means two or more. Furthermore, the term "include" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to the listed steps or elements, but may include other steps or elements not listed or inherent to such a process, method, article, or apparatus.
Merchant door-head recognition belongs to the category of text positioning and recognition in natural scenes: the merchant's door-head text must be recognized in an ordinarily shot picture. Deep network models are currently the most effective technical approach in this field. Natural scene text recognition is typically divided into three steps: 1) text positioning based on a deep network; 2) text recognition based on a deep network; 3) verification of the character recognition result.
Text recognition based on a deep network first requires a large-scale character image database that must contain annotated character samples in various colors and fonts, which is difficult to construct. Moreover, when current text positioning models are used to locate merchant names, two errors easily occur: the same merchant name is split into two text boxes, or extraneous characters are placed in the same text box as the merchant name.
In order to solve the technical problems in the related art, the embodiment of the application provides a method and a device for positioning a target text region. The positioning method of the target text region provided by the embodiment of the application can be applied to positioning scenes, text recognition scenes and the like of the target text region.
The following description is made for some simple descriptions of application scenarios applicable to the technical solution of the embodiment of the present application, and it should be noted that the application scenarios described below are only used for illustrating the embodiment of the present application, but not limiting. In the specific implementation, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
In order to further explain the technical solution provided by the embodiments of the present application, the following details are described with reference to the accompanying drawings and the detailed description. Although embodiments of the present application provide the method operational steps shown in the following embodiments or figures, more or fewer operational steps may be included in the method based on routine or non-inventive labor. In steps where there is logically no necessary causal relationship, the execution order of the steps is not limited to the execution order provided by the embodiments of the present application.
An application scenario of the method for positioning a target text region according to the embodiment of the present invention may be shown in fig. 1, where the application scenario includes a terminal device 101, a server 102, and a database 103.
The terminal device 101 is an electronic device with a photographing or image capturing function, capable of installing various clients and displaying the running interfaces of the installed clients; the electronic device may be mobile or fixed. For example, it may be a mobile phone, a tablet computer, a notebook computer, a desktop computer, a wearable device, a smart television, a vehicle-mounted device, or another electronic device capable of realizing the above functions. The client may be a video client, a browser client, or the like. Each terminal device 101 is connected to the server 102 through a communication network, which may be wired or wireless. The server 102 may be a server corresponding to a client, a single server, a server cluster or cloud computing center composed of multiple servers, or a virtualization platform.
While fig. 1 shows the database 103 residing independently of the server 102, in other possible implementations the database 103 may reside in the server 102.
The server 102 is connected to the database 103, which stores historical images, annotated samples, training text images, and the like. The server 102 receives a target image to be positioned from the terminal device 101, determines at least one text primary selection region in the target image, and acquires a text template image corresponding to the target image; extracts features from the at least one text primary selection region to obtain primary selection region features; compares the primary selection region features with the template image features of the text template image and determines at least one text selection region from the at least one text primary selection region; performs text recognition on the at least one text selection region and determines a target text region according to the recognition result; and compares the text recognition result of the target text region with the label of the text template image. If they are inconsistent, the range of the target text region is expanded according to the label to obtain the final target text region of the target image, thereby realizing the positioning of the target text region.
Further, the server 102 also acquires training images; inputs a training image into a merchant text positioning network to obtain the merchant text position in the training image; determines the target text region in the training image in the manner described above; and calculates a loss function from the merchant text position and the target text region, optimizing the parameters of the merchant text positioning network according to the loss function until the loss function is smaller than a preset threshold, at which point the corresponding parameters are determined to be the parameters of the merchant text positioning network, thereby completing its training.
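The training procedure just described — optimize until the loss falls below a preset threshold, then freeze the parameters — can be sketched as follows. This is only an illustration of the loop structure, not the patent's deep merchant text positioning network: the "network" here is a hypothetical one-parameter offset model fitted by gradient descent on mean squared error, so the sketch is runnable without any deep learning framework.

```python
# Hedged sketch of the training loop: iterate until loss < threshold.
# The real method trains a deep text-positioning network; here a trivial
# scalar-offset model (predicted = x + offset) stands in for it.

def train_offset(samples, lr=0.5, loss_threshold=1e-4, max_steps=1000):
    """Fit a scalar offset so that x + offset approximates each target t,
    minimizing mean squared error by gradient descent."""
    offset = 0.0
    loss = float("inf")
    for _ in range(max_steps):
        # MSE loss and its gradient with respect to the offset
        loss = sum((x + offset - t) ** 2 for x, t in samples) / len(samples)
        if loss < loss_threshold:
            break  # preset threshold reached: keep current parameters
        grad = sum(2 * (x + offset - t) for x, t in samples) / len(samples)
        offset -= lr * grad
    return offset, loss
```

With samples whose targets are always 3 greater than the inputs, the loop converges to an offset of 3 and stops once the loss is below the threshold, mirroring the stopping rule in the text above.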
It should be noted that, the method for positioning the target text region provided by the embodiment of the present invention may be applied to the server 102, where the server executes the method for positioning the target text region provided by the embodiment of the present invention; the method for locating the target text region provided by the invention can also be applied to the client of the terminal equipment, and the terminal equipment 101 can also be used for matching the server 102 with the client of the terminal equipment 101.
Fig. 2 is a flowchart of a method for locating a target text area according to an embodiment of the present invention. As shown in fig. 2, the method comprises the steps of:
step S201, determining at least one text initial selection area in a target image, and acquiring a text template image corresponding to the target image.
The target image may include, but is not limited to, an image file in a format of jpg, bmp, tif, gif, png, etc., and the target image may also be a screenshot. The target image may be an image uploaded after the terminal device captures in real time, or the target image may be an image acquired from a network, or the target image may be a locally stored image.
An example text template image is shown in fig. 3; it is a text template image of a bank's merchant door head.
After the server acquires the target image, a text localization model (TDN), for example a CRAFT model, may be used to determine the text regions in the target image: the target image is input into the text localization model to obtain the regions containing text. In general, a merchant door-head image contains multiple text regions; these regions are then taken as text primary selection regions, forming a set of text primary selection regions.
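The output of this step is simply a set of candidate boxes. As a self-contained illustration (not the CRAFT/TDN model itself, which is a deep network), the grouping of detected "text" cells into candidate regions can be mimicked by connected-component labeling on a binary text-ness mask; a real pipeline would obtain such a mask from the localization network:

```python
# Illustrative stand-in for step S201's region extraction: bounding boxes of
# 4-connected components of nonzero cells in a binary mask (list of lists).

def find_candidate_regions(mask):
    """Return bounding boxes (x0, y0, x1, y1) of 4-connected components."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                # flood fill to collect one connected component
                stack, x0, y0, x1, y1 = [(y, x)], x, y, x, y
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    x0, x1 = min(x0, cx), max(x1, cx)
                    y0, y1 = min(y0, cy), max(y1, cy)
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                boxes.append((x0, y0, x1, y1))
    return boxes
```

Each returned box plays the role of one text primary selection region in the set described above.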
Step S202, extracting features of the at least one text primary selection region to obtain primary selection region features.
Specifically, feature extraction may be performed for each pixel point of the text preliminary selection area.
A pixel point refers to the minimum unit, also called a pixel, in an image represented by a digital sequence. A pixel is an indivisible unit or element of the image. Each raster image contains a certain number of pixels, which determine the size of the image presented on the screen. A picture is made up of many pixels. For example, a picture of size 500 x 338 is formed by a matrix of 500 x 338 pixels: the picture is 500 pixels wide and 338 pixels high, for a total of 500 x 338 = 169,000 pixels. Hovering the mouse over such a picture displays its size in pixels.
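The arithmetic in the example above is simply width times height:

```python
# Total pixel count of a raster image, as in the 500 x 338 example.
def pixel_count(width, height):
    """Number of pixels in a width x height raster image."""
    return width * height

print(pixel_count(500, 338))  # 169000
```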
For each text primary selection region, features are extracted from each pixel point in the region to obtain the primary selection region features of that region.
Step S203, comparing the primary selection region features with the template image features of the text template image, and determining at least one text selection region from the at least one text primary selection region.
Specifically, feature extraction can be performed on each pixel point of the text template image, the feature of the primary selection area of each text primary selection area is compared with the template image feature of the text template image, and a text selection area is selected from all the text primary selection areas according to the comparison result.
Step S204, performing text recognition on the at least one text selection region, and determining a target text region from the at least one text selection region according to the text recognition result.
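The selection rule of step S204 — take the region whose recognition result contains the most target characters — can be sketched as below. The recognition model itself is not reproduced here; `recognized_texts` stands in for its per-region output, and "target characters" are taken to be characters that also appear in the template label, an assumption for illustration:

```python
# Hedged sketch of step S204: among the candidate regions, pick the one
# whose recognized text shares the most characters with the template label.

def pick_target_region(regions, recognized_texts, label):
    """Return the region whose recognized text contains the most
    characters that also occur in the template label."""
    def target_char_count(text):
        return sum(1 for ch in text if ch in label)
    best = max(zip(regions, recognized_texts),
               key=lambda pair: target_char_count(pair[1]))
    return best[0]
```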
Step S205, comparing the text recognition result of the target text region with the label of the text template image, and if the text recognition result of the target text region is inconsistent with the label of the text template image, expanding the range of the target text region according to the label of the text template image to obtain the final target text region of the target image.
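The expansion of step S205 can be illustrated on a one-dimensional text line. This is not the patent's implementation: recognition is replaced by direct string slicing so the loop terminates deterministically, and the initial region is assumed to lie inside the span of the label. The idea is the same as in the claims: grow the region toward the missing characters until the recognized text matches the label.

```python
# Illustrative sketch of step S205: widen [start, end) on a text line until
# the "recognized" text (here, a plain slice) equals the template label.
# Assumes the initial window sits somewhere inside the label's span.

def expand_until_match(line, start, end, label, step=1):
    """Grow the window left or right until line[start:end] == label."""
    while line[start:end] != label and (start > 0 or end < len(line)):
        window = line[start:end]
        if not label.endswith(window) and end < len(line):
            end += step      # label continues to the right of the window
        elif start > 0:
            start -= step    # label continues to the left of the window
        else:
            end += step
    return start, end
```

For example, a window initially covering only "ION PA" inside "xxUNION PAYzz" grows right to pick up the trailing "Y", then left to recover "UN", stopping exactly at the label's span.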
When performing text region positioning on a merchant's target image, the embodiment of the invention determines at least one text primary selection region in the target image and acquires a text template image corresponding to the target image. Features are extracted from the at least one text primary selection region to obtain primary selection region features; the primary selection region features are compared with the template image features of the text template image, and at least one text selection region is determined from the at least one text primary selection region; text recognition is performed on the at least one text selection region, and a target text region is determined from among them according to the recognition result. The text recognition result of the target text region is then compared with the label of the text template image; if they are inconsistent, the range of the target text region is expanded according to the label to obtain the final target text region of the target image. In this way, merchant door-head characters can be accurately positioned against the complex background of a door-head picture, interfering text regions in the image can be effectively screened out, and door-head characters separated by longer gaps can be assigned to the same target text region, which improves the positioning accuracy of the target text region and thereby ensures the accuracy of subsequent merchant-name recognition.
Further, before the feature extraction is performed on the at least one text initial selection area to obtain the initial selection area feature, the method further includes:
extracting features of the text template image by using an image feature point extraction model to obtain a template image feature set of the text template image;
the feature extraction is performed on the at least one text initial selection area to obtain initial selection area features of the at least one text initial selection area, and the feature extraction comprises the following steps:
extracting features of the at least one text primary selection region by using the image feature point extraction model to obtain a primary selection region feature set of the at least one text primary selection region;
comparing the features of the preliminary selected region with the features of the template image of the text template image, and determining at least one text selection region from the at least one text preliminary selected region, comprising:
matching a preliminary region feature set of the at least one text preliminary region with a template image feature set of the text template image;
and taking a text primary selection area with a number of matching points greater than the feature point threshold as a text selection area.
Specifically, a feature point extraction algorithm (e.g., SIFT) may be employed to generate a template image feature set of the text template image. For example, a text template image is input into an image feature point extraction model (FS), and feature extraction is performed on each pixel point of the text template image by using the model, so as to obtain a template image feature set.
And on the other hand, sequentially inputting the text primary selection areas into the image feature point extraction model to obtain a primary selection area feature set of each text primary selection area.
Then, for each text primary selection area, the primary selection area feature set of that area can be matched with the template image feature set of the text template image, and the number of points in the primary selection area feature set that match points in the template image feature set is determined. If the number of matched points is greater than the feature point threshold, the text primary selection area is taken as a text selection area; otherwise, the text primary selection area is deleted.
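A minimal sketch of this screening step, assuming feature points are already available as coordinate tuples. In practice a library such as OpenCV would supply real SIFT descriptors; the nearest-neighbour ratio-test matcher below is a simplified stand-in, and the threshold value is illustrative:

```python
import math

def match_count(region_feats, template_feats, ratio=0.75):
    """Count region feature points whose nearest template point passes
    Lowe's ratio test (nearest distance < ratio * second-nearest)."""
    count = 0
    for f in region_feats:
        dists = sorted(math.dist(f, t) for t in template_feats)
        if len(dists) >= 2 and dists[0] < ratio * dists[1]:
            count += 1
    return count

def screen_candidates(candidates, template_feats, threshold):
    """Keep only primary selection areas with enough matched feature points.

    candidates: list of (region_id, feature_point_list) pairs."""
    return [region_id for region_id, feats in candidates
            if match_count(feats, template_feats) > threshold]
```

The ratio test rejects ambiguous matches (a point roughly equidistant from two template points), which is why a single far-away point contributes no matches at all.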
Further, the text recognition of the at least one text selection area, and determining a target text area from the at least one text selection area according to a text recognition result, includes:
performing text recognition on the at least one text selection area by using a text recognition model to obtain a text recognition result;
and taking a text selection area with the largest number of target characters in the text recognition result as the target text area.
In the implementation process, text recognition is carried out on each text selection area by using a text recognition model (TR) to obtain a text recognition result of each text selection area. And comparing the text recognition result with the label of the text template image, and selecting a text selection area with the most characters in the merchant name in the recognition result as a target text area.
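The selection rule — keep the refined area whose recognition result contains the most characters of the merchant-name label — can be sketched as follows. The region identifiers and the multiset character-overlap count are illustrative assumptions, not the patent's exact scoring:

```python
from collections import Counter

def target_char_count(recognized, label):
    """Number of label characters that also appear in the recognition
    result (multiset intersection, so repeats count separately)."""
    rec, lab = Counter(recognized), Counter(label)
    return sum(min(n, rec[ch]) for ch, n in lab.items())

def pick_target_region(regions, label):
    """regions: list of (region_id, recognized_text); return the id whose
    recognition result shares the most characters with the label."""
    return max(regions, key=lambda r: target_char_count(r[1], label))[0]
```

Partial recognitions still score: a region reading only part of the merchant name outranks an unrelated sign, which is exactly what the subsequent expansion step relies on.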
Further, the expanding the range of the target text area according to the label of the text template image includes:
determining the expansion direction of the target text region according to the label of the text template image and the text recognition result;
and expanding the target text region towards the expansion direction until the text recognition result of the target text region is consistent with the label of the text template image.
In the specific implementation process, it is assumed that the feature point set extracted by the image feature point extraction model (FS) for the text template image Ti is Xt = {(xt1, yt1), …, (xtm, ytm)}, and the feature point set extracted by the FS for the target text region is Xta = {(xta1, yta1), …, (xtam, ytam)}.
If the recognized merchant door header information, i.e., the text recognition result of the target text region, contains less text than the label of the text template image, the target text region is supplemented in the direction in which the labeled text is missing. For example, if the label is "vendor bank" and the recognition result is "vendor", the missing direction is to the right, and one minimum unit (lx) is supplemented to the right based on max(xtam); if the recognized merchant door header information does not contain the merchant name at all, one minimum unit (ly) is supplemented upward and downward based on max(ytam) and min(ytam), respectively, so as to obtain a new target text region.
And continuing to perform text recognition aiming at the new target text region, comparing the text recognition result with the label of the text template image, and repeating the supplementing process until the text recognition result is consistent with the label of the text template image or the supplementing times exceed the supplementing threshold value.
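A sketch of this expansion loop under simplifying assumptions: axis-aligned boxes, an injected recognize callable, and prefix/suffix checks standing in for the label comparison that decides the missing direction. The step sizes lx and ly and the step budget mirror the minimum units and supplementing threshold above:

```python
def expand_region(box, recognize, label, lx, ly, max_steps=5):
    """box: (x0, y0, x1, y1); recognize: callable box -> recognized text.
    Grow the box toward the missing side until the recognition result
    matches the label or the step budget is exhausted."""
    x0, y0, x1, y1 = box
    for _ in range(max_steps):
        text = recognize((x0, y0, x1, y1))
        if text == label:
            break
        if text and label.startswith(text):      # tail missing -> grow right
            x1 += lx
        elif text and label.endswith(text):      # head missing -> grow left
            x0 -= lx
        else:                                    # nothing usable -> grow vertically
            y0 -= ly
            y1 += ly
    return (x0, y0, x1, y1)
```

With a fake recognizer that reveals one more character per 10 pixels of width, the loop grows the box rightward until the full label is read.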
Further, in an alternative embodiment, after determining the target text region in the target image, the embodiment of the present invention may use the target image to perform model training. After the final target text region of the target image is obtained, the method further comprises:
acquiring a training image;
inputting the training image into a business text positioning network to obtain the business text position in the training image;
determining a target text region in the training image, wherein the target text region in the training image is obtained by the positioning method of the target text region;
and calculating a loss function according to the merchant text position and the target text region, optimizing parameters of the merchant text positioning network according to the loss function until the loss function is smaller than a preset threshold value, and determining the corresponding parameters as parameters corresponding to the merchant text positioning network to obtain the merchant text positioning network.
In a specific embodiment, a training image may be input into a merchant text positioning network (MDN) (e.g., a CRAFT model) to predict the merchant door header position and output a predicted position Loc. A positioning loss function is calculated by comparing the predicted position Loc with the merchant door header position obtained by the above positioning method of the target text region, and the parameters of the merchant text positioning network are optimized by back-propagation according to the loss function until the loss function converges.
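The shape of that optimization loop can be illustrated on a toy one-parameter loss. The finite-difference gradient below merely stands in for the back-propagation a real detection network would use, and the learning rate, tolerance, and iteration cap are arbitrary illustrative values:

```python
def train_until_converged(loss, theta, lr=0.1, eps=1e-4, threshold=1e-6, max_iters=1000):
    """Minimal training skeleton: step theta against an estimated gradient
    until the loss drops below the threshold, mirroring the MDN parameter
    optimization driven by the template-derived target text regions."""
    for _ in range(max_iters):
        if loss(theta) < threshold:
            break                          # converged: loss under preset threshold
        grad = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
        theta -= lr * grad                 # gradient-descent update
    return theta
```

For the quadratic loss (theta - 3)^2 starting from 0, the loop settles near the minimum at 3 well within the iteration budget.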
The implementation process of the target text region positioning method provided by the embodiment of the invention is described by a specific example.
A system framework corresponding to the positioning method of the target text region in the specific embodiment is shown in FIG. 4. The text positioning module is used for positioning the merchant door header text and sending the position information to the subsequent recognition module. The character recognition module is used for recognizing the image containing the merchant door header to obtain merchant name information. The structure of the text positioning module is shown in fig. 5, and includes:
merchant door header location network (MDN): is responsible for positioning the character position of the merchant door head.
Text location network (TDN): and performing text positioning on the input image information to obtain all text position information in the image, and forming a merchant portal position candidate set H.
Image feature point extraction model (FS): extracting feature points (S) of a merchant portal template image and a feature point set of a merchant portal position candidate set H, removing candidate areas of all non-merchant portal positions in a feature point matching mode, combining feedback results of a text positioning network in the later period, dynamically adjusting a merchant portal position identification result, feeding back the merchant portal position to the merchant portal positioning network, and supervising training of MDN.
Text recognition network (TR): and extracting text information in the image.
In a specific embodiment, the method for positioning the target text region includes the following steps:
1. merchant door header information label
For each merchant, a merchant door header image template Ti is set and labeled with the merchant name, for example "vendor bank". The other door header images of that merchant are labeled only with the name "vendor bank"; no specific position labeling is carried out.
The image feature point extraction model (FS) generates a feature point set TVi for each image template Ti using a feature point extraction algorithm (e.g., SIFT).
2. Merchant door head positioning Model (MDN) training
(1) Merchant portal position candidate set generation
A text localization network (TDN) (e.g., a CRAFT model) obtains all text regions in the merchant image to form a candidate set Ac.
(2) Merchant portal candidate set screening
The image feature point extraction model (FS) performs feature extraction on the merchant door header position candidate set Ac generated by the TDN, and matches the extraction result with the corresponding template feature point set TVi. If the number of matching points of a candidate region is smaller than the threshold L, that candidate region is deleted, finally forming a reduced candidate set As.
(3) Acquiring text information
The text recognition network (TR) performs text recognition on each area in As and selects the area Asx whose recognition result contains the most characters of the merchant name. If the result of recognizing Asx with the TR is inconsistent with the merchant name label, the text information area needs to be adjusted.
(4) Text information area adjustment
Assume that the feature point set extracted by the image feature point extraction model (FS) for the merchant image template Ti is Xt = {(xt1, yt1), …, (xtm, ytm)}, and the feature point set extracted by the FS for Asx is Xta = {(xta1, yta1), …, (xtam, ytam)}.
If the merchant door header information obtained in step (3) contains less text than the actual label, Asx is supplemented in the direction in which the labeled text is missing. For example, if the label is "vendor bank" and the recognition result is "vendor", the missing direction is to the right, and the minimum unit lx is supplemented to the right based on max(xtam); if the merchant door header information obtained in step (3) does not contain the merchant name at all, the minimum unit ly is supplemented upward and downward based on max(ytam) and min(ytam), respectively, obtaining a new Asx.
Step (3) is performed for the new Asx. The loop is executed until the result of step (3) contains the actual label, or the number of supplements exceeds a maximum threshold.
(5) Merchant door header location network (MDN) parameter optimization
The merchant door header positioning network (MDN) (e.g., a CRAFT model) predicts the merchant door header position of the input sample, outputs a predicted position Loc, calculates a positioning loss function Loss against the merchant door header position obtained in step (3) (or step (4)), and optimizes the parameters of the MDN by back-propagation according to Loss until Loss converges.
The following are device embodiments of the present invention; for details not described in the device embodiments, reference may be made to the corresponding method embodiments above.
Referring to fig. 6, a block diagram of a target text region positioning device according to an embodiment of the invention is shown. The device comprises:
an obtaining unit 601, configured to determine at least one text primary selection area in a target image, and obtain a text template image corresponding to the target image;
an extracting unit 602, configured to perform feature extraction on the at least one text primary selection region to obtain primary selection region features;
a selecting unit 603, configured to compare the features of the preliminary selected area with the features of the template image of the text template image, and determine at least one text selecting area from the at least one text preliminary selected area;
A determining unit 604, configured to perform text recognition on the at least one text selection area, and determine a target text area from the at least one text selection area according to a text recognition result;
and the expansion unit 605 is configured to compare the text recognition result of the target text region with the label of the text template image, and if it is determined that the text recognition result of the target text region is inconsistent with the label of the text template image, expand the range of the target text region according to the label of the text template image, so as to obtain the final target text region of the target image.
Optionally, the expansion unit is specifically configured to:
determining the expansion direction of the target text region according to the label of the text template image and the text recognition result;
and expanding the target text region towards the expansion direction until the text recognition result of the target text region is consistent with the label of the text template image.
Optionally, the extracting unit is specifically configured to perform feature extraction on the text template image by using an image feature point extraction model to obtain a template image feature set of the text template image; extracting features of the at least one text primary selection region by using the image feature point extraction model to obtain a primary selection region feature set of the at least one text primary selection region;
The selecting unit is specifically configured to match the primary selection area feature set of the at least one text primary selection area with the template image feature set of the text template image; and take a text primary selection area with a number of matching points greater than the feature point threshold as a text selection area.
Optionally, the determining unit is configured to:
performing text recognition on the at least one text selection area by using a text recognition model to obtain a text recognition result;
and taking a text selection area with the largest number of target characters in the text recognition result as the target text area.
Corresponding to the method embodiments, an embodiment of the present invention further provides an electronic device. The electronic device may be a server, such as the server 102 shown in fig. 1, and includes at least a memory for storing data and a processor for data processing. The processor for data processing may be implemented by a microprocessor, a CPU, a GPU (Graphics Processing Unit), a DSP, or an FPGA. The memory stores operation instructions, which may be computer executable code, to implement the steps in the flow of the target text region positioning method according to the embodiment of the present invention.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention; as shown in fig. 7, the electronic device 70 according to the embodiment of the present invention includes: a processor 71, a display 72, a memory 73, an input device 76, a bus 75, and a communication device 74; the processor 71, memory 73, input device 76, display 72 and communication device 74 are all connected by a bus 75, which bus 75 is used for transferring data between the processor 71, memory 73, display 72, communication device 74 and input device 76.
The memory 73 may be used to store software programs and modules, such as the program instructions/modules corresponding to the method for locating a target text region in the embodiment of the present invention, and the processor 71 executes the software programs and modules stored in the memory 73, thereby executing various functional applications and data processing of the electronic device 70, such as the method for locating a target text region provided in the embodiment of the present invention. The memory 73 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function, and the like; the storage data area may store data created from the use of the electronic device 70, etc. In addition, the memory 73 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The processor 71 is a control center of the electronic device 70, connects various parts of the entire electronic device 70 using the bus 75 and various interfaces and lines, and performs various functions of the electronic device 70 and processes data by running or executing software programs and/or modules stored in the memory 73, and calling data stored in the memory 73. Alternatively, the processor 71 may include one or more processing units, such as a CPU, GPU (Graphics Processing Unit ), digital processing unit, or the like.
In the embodiment of the present invention, the processor 71 displays the determined target text region and text information to the user through the display 72.
The processor 71 may also be connected to a network via a communication device 74, and if the electronic device is a server, the processor 71 may transmit data between the communication device 74 and the terminal device.
The input device 76 is mainly used for obtaining input operations by a user, and the input device 76 may be different when the electronic devices are different. For example, when the electronic device is a computer, the input device 76 may be an input device such as a mouse, keyboard, etc.; when the electronic device is a portable device such as a smart phone, tablet computer, etc., the input device 76 may be a touch screen.
The embodiment of the invention also provides a computer storage medium, wherein the computer storage medium is stored with computer executable instructions for realizing the target text region positioning method of any embodiment of the invention.
In some possible embodiments, aspects of the method for locating a target text area provided by the present invention may also be implemented in the form of a program product, which includes a program code for causing a computer device to perform the steps of the method for locating a target text area according to the various exemplary embodiments of the present invention described above when the program product is run on the computer device, for example, the computer device may perform the locating procedure of the target text area in steps S201 to S205 as shown in fig. 2.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are only illustrative; for example, the division of units is only a logical function division, and there may be other divisions in actual implementation, such as: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be electrical, mechanical, or in other forms.
The units described as separate units may or may not be physically separate, and units displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present invention may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention.

Claims (10)

1. A method for locating a target text region, the method comprising:
determining at least one text primary selection area in a target image, and acquiring a text template image corresponding to the target image;
Extracting features of the at least one text primary selection region to obtain primary selection region features;
comparing the features of the preliminary selected region with the template image features of the text template image, and determining at least one text selection region from the at least one text preliminary selected region;
performing text recognition on the at least one text selection area, and determining a target text area from the at least one text selection area according to a text recognition result;
comparing the text recognition result of the target text region with the label of the text template image, and if the text recognition result of the target text region is inconsistent with the label of the text template image, determining, according to the label of the text template image and the text recognition result, the direction in which text is missing from the text recognition result compared with the label of the text template image as the expansion direction of the target text region; and expanding the target text region in the expansion direction until the text recognition result of the target text region is consistent with the label of the text template image, so as to obtain a final target text region of the target image;
The label of the text template image comprises a merchant name contained in the text template image.
2. The method of claim 1, wherein the feature extraction of the at least one text preliminary region, before obtaining the preliminary region feature, further comprises:
extracting features of the text template image by using an image feature point extraction model to obtain a template image feature set of the text template image;
the feature extraction is performed on the at least one text initial selection area to obtain initial selection area features of the at least one text initial selection area, and the feature extraction comprises the following steps:
extracting features of the at least one text initial region by using the image feature point extraction model to obtain an initial region feature set of the at least one text initial region;
comparing the features of the preliminary selected region with the features of the template image of the text template image, and determining at least one text selection region from the at least one text preliminary selected region, comprising:
matching a preliminary region feature set of the at least one text preliminary region with a template image feature set of the text template image;
and taking a text preliminary selection area with a number of matching points greater than the feature point threshold as the text selection area.
3. The method of claim 1, wherein said text identifying said at least one text refining area, determining a target text area from said at least one text refining area based on a text identification result, comprises:
performing text recognition on the at least one text selection area by using a text recognition model to obtain a text recognition result;
and taking a text selection area with the largest number of target characters in the text recognition result as the target text area.
4. A method for training an image text positioning network, the method comprising:
acquiring a training image;
inputting the training image into a business text positioning network to obtain the business text position in the training image;
determining a target text region in the training image, wherein the target text region in the training image is obtained by the method of any one of claims 1-3;
and calculating a loss function according to the merchant text position and the target text region, optimizing parameters of the merchant text positioning network according to the loss function until the loss function is smaller than a preset threshold value, and determining the corresponding parameters as parameters corresponding to the merchant text positioning network to obtain the merchant text positioning network.
5. A target text region locating apparatus, the apparatus comprising:
an acquisition unit, configured to determine at least one text primary selection area in a target image, and acquire a text template image corresponding to the target image;
the extraction unit is used for extracting the characteristics of the at least one text primary selection area to obtain primary selection area characteristics;
the selecting unit is used for comparing the features of the preliminary selected area with the template image features of the text template image, and determining at least one text selecting area from the at least one text preliminary selected area;
a determining unit, configured to perform text recognition on the at least one text selection area, and determine a target text area from the at least one text selection area according to a text recognition result;
the expansion unit is used for comparing the text recognition result of the target text region with the label of the text template image, and if the text recognition result of the target text region is inconsistent with the label of the text template image, determining, according to the label of the text template image and the text recognition result, the direction in which text is missing from the text recognition result compared with the label of the text template image as the expansion direction of the target text region; and expanding the target text region in the expansion direction until the text recognition result of the target text region is consistent with the label of the text template image, so as to obtain a final target text region of the target image;
The label of the text template image comprises a merchant name contained in the text template image.
6. The device according to claim 5, wherein the extracting unit is specifically configured to perform feature extraction on the text template image by using an image feature point extraction model to obtain a template image feature set of the text template image; extracting features of the at least one text primary selection region by using the image feature point extraction model to obtain a primary selection region feature set of the at least one text primary selection region;
the selecting unit is specifically configured to match the primary selection area feature set of the at least one text primary selection area with the template image feature set of the text template image; and take a text primary selection area with a number of matching points greater than the feature point threshold as a text selection area.
7. The apparatus according to claim 5, wherein the determining unit is configured to:
performing text recognition on the at least one text selection area by using a text recognition model to obtain a text recognition result;
and taking a text selection area with the largest number of target characters in the text recognition result as the target text area.
8. An image text positioning network training apparatus, the apparatus comprising:
an acquisition unit configured to acquire a training image;
the input unit is used for inputting the training image into a business text positioning network to obtain the business text position in the training image;
a positioning unit for determining a target text region in the training image, wherein the target text region in the training image is obtained by the method according to any one of claims 1-3;
and the optimizing unit is used for calculating a loss function according to the merchant text position and the target text region, optimizing parameters of the merchant text positioning network according to the loss function, and determining the corresponding parameters as parameters corresponding to the merchant text positioning network when the loss function is smaller than a preset threshold value to obtain the merchant text positioning network.
9. A computer-readable storage medium having a computer program stored therein, characterized in that: the computer program, when executed by a processor, implements the method of any of claims 1-4.
10. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program executable on the processor, which when executed by the processor causes the processor to implement the method of any of claims 1-4.
CN202110185262.XA 2021-02-10 2021-02-10 Target text region positioning method and device Active CN112801030B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110185262.XA CN112801030B (en) 2021-02-10 2021-02-10 Target text region positioning method and device


Publications (2)

Publication Number Publication Date
CN112801030A CN112801030A (en) 2021-05-14
CN112801030B true CN112801030B (en) 2023-09-01

Family

ID=75815110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110185262.XA Active CN112801030B (en) 2021-02-10 2021-02-10 Target text region positioning method and device

Country Status (1)

Country Link
CN (1) CN112801030B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107622266A (en) * 2017-09-21 2018-01-23 Ping An Technology (Shenzhen) Co., Ltd. Processing method, storage medium, and server for OCR recognition
US10032072B1 (en) * 2016-06-21 2018-07-24 A9.Com, Inc. Text recognition and localization with deep learning
CN111814794A (en) * 2020-09-15 2020-10-23 北京易真学思教育科技有限公司 Text detection method and device, electronic equipment and storage medium
CN112016546A (en) * 2020-08-14 2020-12-01 *** Co., Ltd. Text region positioning method and device
CN112241739A (en) * 2020-12-17 2021-01-19 北京沃东天骏信息技术有限公司 Method, device, equipment and computer readable medium for identifying text errors
CN112308046A (en) * 2020-12-02 2021-02-02 龙马智芯(珠海横琴)科技有限公司 Method, device, server and readable storage medium for positioning text region of image


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on text region localization methods in natural scenes; Wang Yi; China Master's Theses Full-text Database; full text *

Also Published As

Publication number Publication date
CN112801030A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN112580623B (en) Image generation method, model training method, related device and electronic equipment
CN108073910B (en) Method and device for generating human face features
CN107832662B (en) Method and system for acquiring image annotation data
WO2023015922A1 (en) Image recognition model training method and apparatus, device, and storage medium
CN109344762B (en) Image processing method and device
US20210271872A1 (en) Machine Learned Structured Data Extraction From Document Image
JP2012234494A (en) Image processing apparatus, image processing method, and program
JP7393472B2 (en) Display scene recognition method, device, electronic device, storage medium and computer program
CN112101386B (en) Text detection method, device, computer equipment and storage medium
CN108182457B (en) Method and apparatus for generating information
CN109784330B (en) Signboard content identification method, device and equipment
CN113627439A (en) Text structuring method, processing device, electronic device and storage medium
CN112016545A (en) Image generation method and device containing text
CN113378958A (en) Automatic labeling method, device, equipment, storage medium and computer program product
CN113642481A (en) Recognition method, training method, device, electronic equipment and storage medium
CN115101069A (en) Voice control method, device, equipment, storage medium and program product
CN113887375A (en) Text recognition method, device, equipment and storage medium
CN111881900B (en) Corpus generation method, corpus translation model training method, corpus translation model translation method, corpus translation device, corpus translation equipment and corpus translation medium
CN116361502B (en) Image retrieval method, device, computer equipment and storage medium
CN112801030B (en) Target text region positioning method and device
CN108170683B (en) Method and apparatus for obtaining information
US20190149878A1 (en) Determining and correlating visual context on a user device with user behavior using digital content on the user device
US11468658B2 (en) Systems and methods for generating typographical images or videos
CN114743030A (en) Image recognition method, image recognition device, storage medium and computer equipment
CN114612971A (en) Face detection method, model training method, electronic device, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant