CN113498520B - Character recognition method, character recognition device, and storage medium - Google Patents


Info

Publication number
CN113498520B
Authority
CN
China
Prior art keywords
text
group
convolution
text box
feature
Prior art date
Legal status
Active
Application number
CN202080000058.XA
Other languages
Chinese (zh)
Other versions
CN113498520A (en)
Inventor
黄光伟
李月
史新艳
Current Assignee
BOE Technology Group Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Priority date
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd
Publication of CN113498520A
Application granted
Publication of CN113498520B


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 — Arrangements for image or video recognition or understanding
    • G06V10/20 — Image preprocessing
    • G06V10/24 — Aligning, centring, orientation detection or correction of the image


Abstract

A character recognition method, a character recognition device, and a storage medium. The character recognition method comprises: acquiring an input image; performing text detection on the input image to determine a text box group, wherein the text box group comprises at least one text box; determining a target text box from the at least one text box, wherein the target text box comprises target text; acquiring a coordinate set of the at least one text box and a deflection angle of the at least one text box relative to a reference direction, determining a correction angle and a correction direction for the target text box according to the deflection angle and the coordinate set of the at least one text box, and rotating the target text box according to the correction angle and the correction direction to obtain a final target text box; and recognizing the final target text box to obtain the target text.

Description

Character recognition method, character recognition device, and storage medium
Technical Field
Embodiments of the present disclosure relate to a character recognition method, a character recognition apparatus, and a storage medium.
Background
When a user reads an article and encounters a new word to query, the following approaches can be used: (1) a paper dictionary, which is inconvenient to carry, requires page-by-page lookup, and is extremely inefficient; (2) a mobile phone application or an electronic dictionary, where keyboard input is time-consuming and cumbersome and easily breaks the reader's train of thought and dissipates attention; (3) a translation pen product, which is prone to mis-scanning and missed scanning and requires the user to adapt to the product's way of use.
Disclosure of Invention
At least one embodiment of the present disclosure provides a text recognition method, including: acquiring an input image; performing text detection on the input image to determine a text box group, wherein the text box group comprises at least one text box; determining a target text box from the at least one text box, wherein the target text box comprises target text; acquiring a coordinate set of the at least one text box and a deflection angle of the at least one text box relative to a reference direction, determining a correction angle and a correction direction for the target text box according to the deflection angle and the coordinate set of the at least one text box, and rotating the target text box according to the correction angle and the correction direction to obtain a final target text box; and recognizing the final target text box to obtain the target text.
For example, in the text recognition method provided in at least one embodiment of the present disclosure, the at least one text box includes N text boxes, N is a positive integer greater than 2, and determining the correction angle and the correction direction for the target text box according to the deflection angle and the coordinate set of the at least one text box includes: determining the average deflection angle of the N text boxes according to the N deflection angles corresponding to the N text boxes; judging whether the average deflection angle is larger than a first angle threshold or smaller than a second angle threshold; determining that the correction angle for the target text box is 0 degrees in response to the average deflection angle being greater than the first angle threshold or less than the second angle threshold; or, in response to the average deflection angle being less than or equal to the first angle threshold and greater than or equal to the second angle threshold, determining N aspect ratios respectively corresponding to the N text boxes according to N coordinate sets corresponding to the N text boxes, determining the correction direction for the target text box according to the N aspect ratios, and, in response to determining the correction direction, determining the correction angle according to the N deflection angles.
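For illustration, the threshold check described above can be sketched as follows; the function name is illustrative, and the default thresholds of 80 degrees and 10 degrees are taken from the example values given later in this disclosure.

```python
def average_angle_needs_correction(deflection_angles, first_threshold=80.0, second_threshold=10.0):
    # Sketch of the check above; returns False when the average deflection angle is already
    # close to 0 or 90 degrees, in which case the correction angle for the target text box
    # is simply 0 degrees.
    avg = sum(deflection_angles) / len(deflection_angles)
    return second_threshold <= avg <= first_threshold
```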
For example, in a text recognition method provided in at least one embodiment of the present disclosure, determining the correction direction for the target text box according to the N aspect ratios includes: dividing the N text boxes into a first text box subgroup and a second text box subgroup according to the N aspect ratios, wherein the aspect ratio of each text box in the first text box subgroup is greater than or equal to 1, and the aspect ratio of each text box in the second text box subgroup is less than 1; determining a first text box number and a second text box number according to the first text box subgroup and the second text box subgroup, wherein the first text box number is the number of text boxes in the first text box subgroup, and the second text box number is the number of text boxes in the second text box subgroup; and determining the correction direction according to the first text box number and the second text box number.
For example, in the text recognition method provided in at least one embodiment of the present disclosure, determining the correction direction according to the first number of text boxes and the second number of text boxes includes: determining that the correction direction is a counterclockwise direction in response to the first number of text boxes and the second number of text boxes meeting a first condition; or in response to the first number of text boxes and the second number of text boxes meeting a second condition, determining that the correction direction is clockwise, wherein the first condition is ra > rb+r0, the second condition is ra+r0 < rb, ra is the first number of text boxes, rb is the second number of text boxes, and r0 is a constant.
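A minimal sketch of this direction rule, assuming that the aspect ratio is width divided by height and using the constant r0 from the conditions above (the case where neither condition is met is handled by the 0-degree rule described next):

```python
def correction_direction(aspect_ratios, r0=2):
    # ra: number of text boxes whose aspect ratio (assumed to be width / height) is >= 1
    # rb: number of text boxes whose aspect ratio is < 1
    ra = sum(1 for r in aspect_ratios if r >= 1)
    rb = len(aspect_ratios) - ra
    if ra > rb + r0:          # first condition: counterclockwise correction
        return "counterclockwise"
    if ra + r0 < rb:          # second condition: clockwise correction
        return "clockwise"
    return None               # neither condition met: no direction, correction angle is 0
```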
For example, in the text recognition method provided in at least one embodiment of the present disclosure, in response to the average deflection angle being equal to or smaller than the first angle threshold and equal to or larger than the second angle threshold, the text recognition method further includes: in response to the first number of text boxes and the second number of text boxes not meeting the first condition and the second condition, determining that a correction angle for the target text box is 0 degrees.
For example, in the text recognition method provided in at least one embodiment of the present disclosure, r0 is 2.
For example, in a text recognition method provided in at least one embodiment of the present disclosure, in response to determining the correction direction, determining the correction angle according to the N deflection angles includes: sorting the N deflection angles in ascending order to obtain a first deflection angle to an N-th deflection angle, wherein the difference between the P-th deflection angle and the (P+1)-th deflection angle among the N deflection angles is larger than 10 degrees, and P is a positive integer smaller than N; dividing the N deflection angles into a first deflection angle group, a second deflection angle group and a third deflection angle group, wherein the deflection angles in the first deflection angle group are all 0 degrees, the second deflection angle group comprises the first deflection angle to the P-th deflection angle, and the third deflection angle group comprises the (P+1)-th deflection angle to the N-th deflection angle; determining a first angle number, a second angle number and a third angle number according to the first deflection angle group, the second deflection angle group and the third deflection angle group, wherein the first angle number is the number of deflection angles in the first deflection angle group, the second angle number is the number of deflection angles in the second deflection angle group, and the third angle number is the number of deflection angles in the third deflection angle group; and determining the correction angle according to the first angle number, the second angle number and the third angle number.
For example, in the text recognition method provided in at least one embodiment of the present disclosure, determining the correction angle according to the first angle number, the second angle number, and the third angle number includes: determining that the correction angle is 0 degrees in response to the first angle number meeting a third condition; or in response to the first angle number not meeting the third condition and the second angle number and the third angle number meeting a fourth condition, determining the correction angle as a first angle value; or in response to the first angle number not meeting the third condition and the second angle number and the third angle number meeting a fifth condition, determining the correction angle as a second angle value; or determining that the correction angle is 0 degrees in response to the first angle number not satisfying the third condition and the second and third angle numbers not satisfying the fourth and fifth conditions; wherein the third condition is s0 > ss1, the fourth condition is s1 > s2 + ss2, the fifth condition is s1 + ss2 < s2, s0 is the first angle number, s1 is the second angle number, s2 is the third angle number, and ss1 and ss2 are constants.
The first angle value is determined from the deflection angles ai in the second deflection angle group, where 1 ≤ i ≤ P and ai denotes the i-th deflection angle among the first through P-th deflection angles in the second deflection angle group.
The second angle value is determined from the deflection angles aj in the third deflection angle group, where P+1 ≤ j ≤ N and aj denotes the j-th deflection angle among the (P+1)-th through N-th deflection angles in the third deflection angle group.
For example, in the text recognition method provided in at least one embodiment of the present disclosure, ss1 is 5 and ss2 is 2.
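The grouping and selection logic described above can be sketched as follows. This sketch assumes that the first angle value and the second angle value are the averages of the second and third deflection angle groups, respectively; that is an assumption of the sketch, not a statement of the formulas in this disclosure.

```python
def correction_angle(deflection_angles, ss1=5, ss2=2):
    angles = sorted(deflection_angles)                     # ascending order
    n = len(angles)
    # P: index where the gap between consecutive sorted angles first exceeds 10 degrees
    p = next((i + 1 for i in range(n - 1) if angles[i + 1] - angles[i] > 10), n)
    s0 = sum(1 for a in angles if a == 0)                  # first angle number (0-degree angles)
    s1, s2 = p, n - p                                      # second and third angle numbers
    if s0 > ss1:                                           # third condition
        return 0.0
    if s1 > s2 + ss2:                                      # fourth condition
        return sum(angles[:p]) / s1                        # assumed: mean of the second group
    if s1 + ss2 < s2:                                      # fifth condition
        return sum(angles[p:]) / s2                        # assumed: mean of the third group
    return 0.0                                             # neither condition met
```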
For example, in the text recognition method provided in at least one embodiment of the present disclosure, the first angle threshold is 80 degrees, and the second angle threshold is 10 degrees.
For example, in the text recognition method provided in at least one embodiment of the present disclosure, a deflection angle of the final target text box with respect to the reference direction is greater than the first angle threshold or less than the second angle threshold.
For example, in the text recognition method provided in at least one embodiment of the present disclosure, the at least one text box includes N text boxes, N is 1 or 2, and determining the correction angle and the correction direction for the target text box according to the deflection angle and the coordinate set of the at least one text box includes: determining the correction angle for the target text box according to the deflection angle of the target text box; in response to determining the correction angle, determining an aspect ratio of the target text box according to the coordinate set of the target text box; and determining the correction direction for the target text box according to the aspect ratio of the target text box.
For example, in a text recognition method provided in at least one embodiment of the present disclosure, determining the correction direction for the target text box according to an aspect ratio of the target text box includes: determining that the correction direction is a counterclockwise direction in response to the aspect ratio of the target text box being 1 or more; or in response to the aspect ratio of the target text box being less than 1, determining that the correction direction is clockwise.
For example, in the text recognition method provided in at least one embodiment of the present disclosure, the at least one text box is a rectangular box, and the coordinate set of each text box in the at least one text box includes coordinates of at least three vertices of each text box.
For example, in the text recognition method provided in at least one embodiment of the present disclosure, the deflection angle of each text box in the at least one text box is greater than or equal to 0 degrees and less than or equal to 90 degrees.
For example, in the text recognition method provided in at least one embodiment of the present disclosure, rotating the target text box according to the correction angle and the correction direction to obtain the final target text box includes: rotating the input image according to the correction angle and the correction direction so that the target text box is rotated to obtain the final target text box; or cropping the target text box to obtain a cropped target text box, and rotating the cropped target text box according to the correction angle and the correction direction to obtain the final target text box.
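As an illustrative sketch of the second option (cropping the target text box first and then rotating it), using OpenCV and assuming that a positive angle denotes a counterclockwise rotation and that the border is filled with white:

```python
import cv2
import numpy as np

def rotate_cropped_target_box(image, box_points, correction_angle, correction_direction):
    # Crop the axis-aligned bounding rectangle of the target text box, then rotate the crop.
    pts = np.asarray(box_points, dtype=np.int32)
    x, y, w, h = cv2.boundingRect(pts)
    crop = image[y:y + h, x:x + w]
    # In OpenCV, a positive angle in getRotationMatrix2D rotates counterclockwise.
    signed_angle = correction_angle if correction_direction == "counterclockwise" else -correction_angle
    matrix = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), signed_angle, 1.0)
    return cv2.warpAffine(crop, matrix, (w, h), borderValue=(255, 255, 255))
```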
For example, in a text recognition method provided in at least one embodiment of the present disclosure, performing text detection on the input image to determine the text box group includes: performing scale transformation processing on the input image to obtain a plurality of intermediate input images, wherein the plurality of intermediate input images comprise the input image, and the sizes of the plurality of intermediate input images are different from each other; for each intermediate input image of the plurality of intermediate input images, performing text detection on each intermediate input image to obtain an intermediate text box group corresponding to each intermediate input image, thereby obtaining a plurality of intermediate text box groups corresponding to the plurality of intermediate input images, wherein each intermediate text box group comprises at least one intermediate text box; and determining the text box group according to the plurality of intermediate text box groups.
For example, in the text recognition method provided in at least one embodiment of the present disclosure, the at least one intermediate text box corresponds to the at least one text box one by one, each of the intermediate text box groups includes an i-th intermediate text box, the i-th intermediate text box corresponds to the i-th text box, i is greater than or equal to 1 and less than or equal to the number of intermediate text boxes in each of the intermediate text box groups, and determining the text box groups according to the plurality of intermediate text box groups includes: and for the ith text box, determining the coordinate set of the ith text box according to the coordinate sets corresponding to the ith intermediate text boxes of the intermediate text box sets, so as to determine the text box set.
For example, in the text recognition method provided in at least one embodiment of the present disclosure, performing text detection on each intermediate input image to obtain an intermediate text box group corresponding to each intermediate input image includes: performing text detection on each intermediate input image by using a text detection neural network to determine a text detection region group corresponding to each intermediate input image; and processing the text detection area group by using a minimum circumscribed rectangle algorithm to determine the intermediate text box group, wherein the text detection area group comprises at least one text detection area, the at least one text detection area corresponds to the at least one intermediate text box one by one, and each intermediate text box covers the corresponding text detection area.
For example, in the text recognition method provided in at least one embodiment of the present disclosure, the text detection neural network includes a first convolution module to a fifth convolution module, a first downsampling module to a fifth downsampling module, a full connection module, a first upsampling module to a third upsampling module, a first dimension reduction module to a fourth dimension reduction module, and a classifier, and the text detection is performed on each intermediate input image by using the text detection neural network to determine the text detection region group corresponding to each intermediate input image includes: performing convolution processing on each intermediate input image by using the first convolution module to obtain a first convolution characteristic image group; performing downsampling processing on the first convolution feature image group by using the first downsampling module so as to obtain a first downsampled feature image group; performing convolution processing on the first downsampled feature map group by using the second convolution module to obtain a second convolution feature map group; performing downsampling processing on the second convolution feature map set by using the second downsampling module to obtain a second downsampled feature map set; performing convolution processing on the second downsampled feature map group by using the third convolution module to obtain a third convolution feature map group; performing downsampling processing on the third convolution feature image group by using the third downsampling module to obtain a third downsampled feature image group, and performing dimension reduction processing on the third convolution feature image group by using the first dimension reduction module to obtain a first dimension reduction feature image group; performing convolution processing on the third downsampled feature map group by using the fourth convolution module to obtain a fourth convolution feature map group; performing downsampling processing on the fourth convolution feature image group by using the fourth downsampling module to obtain a fourth downsampled feature image group, and performing dimension reduction processing on the fourth convolution feature image group by using the second dimension reduction module to obtain a second dimension reduction feature image group; performing convolution processing on the fourth downsampled feature map group by using the fifth convolution module to obtain a fifth convolution feature map group; performing downsampling processing on the fifth convolution feature image group by using the fifth downsampling module to obtain a fifth downsampled feature image group, and performing dimension reduction processing on the fifth convolution feature image group by using the third dimension reduction module to obtain a third dimension reduction feature image group; performing convolution processing on the fifth downsampled feature map group by using the full connection module to obtain a sixth convolution feature map group; performing dimension reduction processing on the sixth convolution feature map set by using the fourth dimension reduction module to obtain a fourth dimension reduction feature map set; performing up-sampling processing on the fourth dimension-reduction feature map set by using the first up-sampling module so as to obtain a first up-sampling feature map set; performing fusion processing on the first up-sampling feature image group and the third 
dimension-reduction feature image group to obtain a first fusion feature image group; performing upsampling processing on the first fusion feature map set by using the second upsampling module to obtain a second upsampled feature map set; performing fusion processing on the second up-sampling feature image group and the second dimension-reduction feature image group to obtain a second fusion feature image group; performing upsampling processing on the second fusion feature map set by using the third upsampling module to obtain a third upsampled feature map set; performing fusion processing on the third upsampling feature map set and the first dimension-reduction feature map set to obtain a third fusion feature map set; classifying the third fusion feature image group by using the classifier to obtain a text classification prediction image and a connection classification prediction image; and determining the text detection area group according to the connection classification prediction graph and the text classification prediction graph.
For example, in the text recognition method provided in at least one embodiment of the present disclosure, the number of feature maps in the first convolution feature map group is 8, the number of feature maps in the second convolution feature map group is 16, the number of feature maps in the third convolution feature map group is 32, the number of feature maps in the fourth convolution feature map group is 64, the number of feature maps in the fifth convolution feature map group is 128, the number of feature maps in the sixth convolution feature map group is 256, the number of feature maps in the first dimension-reduction feature map group is 10, the number of feature maps in the second dimension-reduction feature map group is 10, the number of feature maps in the third dimension-reduction feature map group is 10, and the number of feature maps in the fourth dimension-reduction feature map group is 10.
For example, in the text recognition method provided in at least one embodiment of the present disclosure, the text detection neural network includes a first convolution module to a fifth convolution module, a first downsampling module to a fifth downsampling module, a full connection module, a first upsampling module to a third upsampling module, a first dimension reduction module to a fifth dimension reduction module, and a classifier, and the text detection is performed on each intermediate input image by using the text detection neural network, so as to determine a text detection region group corresponding to each intermediate input image includes: performing convolution processing on the input image by using the first convolution module to obtain a first convolution characteristic image group; performing downsampling processing on the first convolution feature image group by using the first downsampling module so as to obtain a first downsampled feature image group; performing convolution processing on the first downsampled feature map group by using the second convolution module to obtain a second convolution feature map group; performing downsampling processing on the second convolution feature image group by using the second downsampling module to obtain a second downsampled feature image group, and performing dimension reduction processing on the second convolution feature image group by using the first dimension reduction module to obtain a first dimension reduction feature image group; performing convolution processing on the second downsampled feature map group by using the third convolution module to obtain a third convolution feature map group; performing downsampling processing on the third convolution feature image group by using the third downsampling module to obtain a third downsampled feature image group, and performing dimension reduction processing on the third convolution feature image group by using the second dimension reduction module to obtain a second dimension reduction feature image group; performing convolution processing on the third downsampled feature map group by using the fourth convolution module to obtain a fourth convolution feature map group; performing downsampling processing on the fourth convolution feature image group by using the fourth downsampling module to obtain a fourth downsampled feature image group, and performing dimension reduction processing on the fourth convolution feature image group by using the third dimension reduction module to obtain a third dimension reduction feature image group; performing convolution processing on the fourth downsampled feature map group by using the fifth convolution module to obtain a fifth convolution feature map group; performing downsampling processing on the fifth convolution feature image group by using the fifth downsampling module to obtain a fifth downsampled feature image group, and performing dimension reduction processing on the fifth convolution feature image group by using the fourth dimension reduction module to obtain a fourth dimension reduction feature image group; performing convolution processing on the fifth downsampled feature map group by using the full connection module to obtain a sixth convolution feature map group; performing dimension reduction processing on the sixth convolution feature map set by using the fifth dimension reduction module to obtain a fifth dimension reduction feature map set; performing fusion processing on the fourth dimension reduction feature image group 
and the fifth dimension reduction feature image group to obtain a first fusion feature image group; performing upsampling processing on the first fusion feature map set by using the first upsampling module to obtain a first upsampled feature map set; performing fusion processing on the first up-sampling feature image group and the third dimension-reduction feature image group to obtain a second fusion feature image group; performing upsampling processing on the second fused feature map set by using the second upsampling module to obtain a second upsampled feature map set; performing fusion processing on the second up-sampling feature image group and the second dimension-reduction feature image group to obtain a third fusion feature image group; performing upsampling processing on the third fused feature map set by using the third upsampling module to obtain a third upsampled feature map set; performing fusion processing on the third upsampling feature map set and the first dimension-reduction feature map set to obtain a fourth fusion feature map set; classifying the fourth fusion feature image group by using the classifier to obtain a text classification prediction image and a connection classification prediction image; and determining the text detection area group according to the connection classification prediction graph and the text classification prediction graph.
For example, in the text recognition method provided in at least one embodiment of the present disclosure, the number of feature maps in the first convolution feature map set is 64, the number of feature maps in the second convolution feature map set is 128, the number of feature maps in the third convolution feature map set is 256, the number of feature maps in the fourth convolution feature map set is 512, the number of feature maps in the fifth convolution feature map set is 512, the number of feature maps in the sixth convolution feature map set is 512, and the number of feature maps in each of the first dimension-reduction feature map set to the fifth dimension-reduction feature map set is 18.
For example, in the text recognition method provided in at least one embodiment of the present disclosure, before the input image is acquired, the text recognition method further includes: training the text detection neural network to be trained to obtain the text detection neural network, training the text detection neural network to be trained to obtain the text detection neural network comprises: acquiring a training input image and a target text detection region group; processing the training input image by using the text detection neural network to be trained to obtain a training text detection region group; calculating a loss value of the text detection neural network to be trained according to the target text detection area group and the training text detection area group through a loss function; correcting parameters of the text detection neural network to be trained according to the loss value, obtaining the trained text detection neural network when the loss function meets a preset condition, and continuously inputting the training input image and the target text detection region group to repeatedly execute the training process when the loss function does not meet the preset condition.
For example, in a text recognition method provided in at least one embodiment of the present disclosure, the loss function includes a focal loss function.
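A minimal focal loss sketch that could serve as the loss function described above, written for a binary text/non-text prediction map; the alpha and gamma values are common defaults and are not taken from this disclosure:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Standard binary focal loss: down-weights well-classified (easy) pixels so that hard
    # text pixels contribute more to the gradient than the abundant background pixels.
    # targets is expected to be a float tensor of 0s and 1s with the same shape as logits.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                                  # probability of the true class
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()
```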
For example, in a text recognition method provided in at least one embodiment of the present disclosure, determining a target text box from the at least one text box includes: determining the position of the pen tip of the point translation pen; marking a region to be detected in the input image based on the position of the pen tip; determining at least one overlapping area between the region to be detected and the at least one text box, respectively; and determining the text box corresponding to the largest overlapping area among the at least one overlapping area as the target text box.
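The selection of the target text box can be sketched as follows, assuming that each text box is given as four corner points and that the region to be detected is a fixed-size square around the pen tip; the region size is an illustrative constant:

```python
import cv2
import numpy as np

def pick_target_box(text_boxes, pen_tip_xy, region_size=60):
    # Mark a square region to be detected around the pen tip, then return the text box
    # with the largest overlapping area with that region.
    x, y = pen_tip_xy
    half = region_size / 2.0
    region = np.array([[x - half, y - half], [x + half, y - half],
                       [x + half, y + half], [x - half, y + half]], dtype=np.float32)
    best_box, best_area = None, 0.0
    for box in text_boxes:                                 # box: 4 x 2 array of corner points
        area, _ = cv2.intersectConvexConvex(region, np.asarray(box, dtype=np.float32))
        if area > best_area:
            best_box, best_area = box, area
    return best_box
```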
For example, in the text recognition method provided in at least one embodiment of the present disclosure, the step of recognizing the final target text box to obtain the target text includes: performing recognition processing on the final target text box by using the text recognition neural network to obtain an intermediate text; and checking the intermediate text to obtain the target text.
For example, in the text recognition method provided in at least one embodiment of the present disclosure, the text recognition neural network is a multi-object rectified attention network.
For example, the text recognition method provided in at least one embodiment of the present disclosure further includes: and translating the target text to obtain and output a translation result of the target text.
At least one embodiment of the present disclosure provides a text recognition method, including: acquiring an input image; performing text detection on the input image by using a text detection neural network to determine a text box group, wherein the text box group comprises at least one text box; determining a target text box from the at least one text box, wherein the target text box comprises target text; rotating the target text box to obtain a final target text box; and recognizing the final target text box to obtain the target text, wherein the text detection neural network comprises first to fifth convolution modules and first to fourth dimension reduction modules, the number of convolution kernels in each convolution layer of the first convolution module is 8, the number of convolution kernels in each convolution layer of the second convolution module is 16, the number of convolution kernels in each convolution layer of the third convolution module is 32, the number of convolution kernels in each convolution layer of the fourth convolution module is 64, the number of convolution kernels in each convolution layer of the fifth convolution module is 128, the number of convolution kernels in the first dimension reduction module is 10, the number of convolution kernels in the second dimension reduction module is 10, the number of convolution kernels in the third dimension reduction module is 10, and the number of convolution kernels in the fourth dimension reduction module is 10.
For example, in a text recognition method provided in at least one embodiment of the present disclosure, performing text detection on the input image using the text detection neural network to determine a text box group includes: performing scale transformation processing on the input image to obtain a plurality of intermediate input images, wherein the plurality of intermediate input images comprise the input image, and the sizes of the plurality of intermediate input images are different from each other; for each intermediate input image in the plurality of intermediate input images, performing text detection on each intermediate input image by using the text detection neural network to obtain an intermediate text box group corresponding to each intermediate input image, thereby obtaining a plurality of intermediate text box groups corresponding to the plurality of intermediate input images, wherein each intermediate text box group comprises at least one intermediate text box; and determining the text box group according to the plurality of intermediate text box groups.
For example, in the text recognition method provided in at least one embodiment of the present disclosure, text detection is performed on each intermediate input image by using the text detection neural network to obtain an intermediate text box group corresponding to each intermediate input image, including: performing text detection on each intermediate input image by using the text detection neural network to determine a text detection region group corresponding to each intermediate input image; and processing the text detection area group by using a minimum circumscribed rectangle algorithm to determine the intermediate text box group, wherein the text detection area group comprises at least one text detection area, the at least one text detection area corresponds to the at least one intermediate text box one by one, and each intermediate text box covers the corresponding text detection area.
For example, in the text recognition method provided in at least one embodiment of the present disclosure, the text detection neural network further includes a first to fifth downsampling modules, a full connection module, a first to third upsampling modules, and a classifier, and the text detection neural network is used to perform text detection on each intermediate input image to determine the text detection region group corresponding to each intermediate input image, where the text detection region group includes: performing convolution processing on each intermediate input image by using the first convolution module to obtain a first convolution characteristic image group; performing downsampling processing on the first convolution feature image group by using the first downsampling module so as to obtain a first downsampled feature image group; performing convolution processing on the first downsampled feature map group by using the second convolution module to obtain a second convolution feature map group; performing downsampling processing on the second convolution feature map set by using the second downsampling module to obtain a second downsampled feature map set; performing convolution processing on the second downsampled feature map group by using the third convolution module to obtain a third convolution feature map group; performing downsampling processing on the third convolution feature image group by using the third downsampling module to obtain a third downsampled feature image group, and performing dimension reduction processing on the third convolution feature image group by using the first dimension reduction module to obtain a first dimension reduction feature image group; performing convolution processing on the third downsampled feature map group by using the fourth convolution module to obtain a fourth convolution feature map group; performing downsampling processing on the fourth convolution feature image group by using the fourth downsampling module to obtain a fourth downsampled feature image group, and performing dimension reduction processing on the fourth convolution feature image group by using the second dimension reduction module to obtain a second dimension reduction feature image group; performing convolution processing on the fourth downsampled feature map group by using the fifth convolution module to obtain a fifth convolution feature map group; performing downsampling processing on the fifth convolution feature image group by using the fifth downsampling module to obtain a fifth downsampled feature image group, and performing dimension reduction processing on the fifth convolution feature image group by using the third dimension reduction module to obtain a third dimension reduction feature image group; performing convolution processing on the fifth downsampled feature map group by using the full connection module to obtain a sixth convolution feature map group; performing dimension reduction processing on the sixth convolution feature map set by using the fourth dimension reduction module to obtain a fourth dimension reduction feature map set; performing up-sampling processing on the fourth dimension-reduction feature map set by using the first up-sampling module so as to obtain a first up-sampling feature map set; performing fusion processing on the first up-sampling feature image group and the third dimension-reduction feature image group to obtain a first fusion feature image group; performing upsampling processing on the first fusion feature 
map set by using the second upsampling module to obtain a second upsampled feature map set; performing fusion processing on the second up-sampling feature image group and the second dimension-reduction feature image group to obtain a second fusion feature image group; performing upsampling processing on the second fusion feature map set by using the third upsampling module to obtain a third upsampled feature map set; performing fusion processing on the third upsampling feature map set and the first dimension-reduction feature map set to obtain a third fusion feature map set; classifying the third fusion feature image group by using the classifier to obtain a text classification prediction image and a connection classification prediction image; and determining the text detection area group according to the connection classification prediction graph and the text classification prediction graph.
At least one embodiment of the present disclosure provides a text recognition device, including: the image acquisition device is used for acquiring an input image; a memory for storing the input image and computer readable instructions; a processor for reading the input image and executing the computer readable instructions, which when executed by the processor, perform the text recognition method according to any of the embodiments described above.
For example, the text recognition device provided in at least one embodiment of the present disclosure further includes a point translation pen, where the image acquisition device is arranged on the point translation pen, and the point translation pen is used for selecting the target text.
At least one embodiment of the present disclosure provides a storage medium that non-transitorily stores computer readable instructions, where the computer readable instructions, when executed by a computer, can perform the text recognition method according to any one of the embodiments described above.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly described below. It is apparent that the drawings in the following description relate only to some embodiments of the present disclosure and are not intended to limit the present disclosure.
FIG. 1 is a schematic flow chart of a text recognition method according to at least one embodiment of the present disclosure;
FIGS. 2A-2E are schematic illustrations of a plurality of intermediate input images provided in accordance with at least one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a text detection neural network provided in accordance with at least one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a pixel and a pixel adjacent to the pixel in a feature map according to at least one embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a text detection neural network;
FIG. 6 is a schematic diagram of a pixel and a neighboring pixel of the pixel in a feature map according to another embodiment of the present disclosure;
FIG. 7A is a schematic diagram of a text box group in an input image provided in accordance with at least one embodiment of the present disclosure;
FIG. 7B is a schematic diagram of a text box group in another input image provided in accordance with at least one embodiment of the present disclosure;
FIG. 8A is a schematic diagram of a text box group in an input image according to another embodiment of the present disclosure;
FIG. 8B is a schematic diagram of a text box group in another input image provided in another embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a text box in a coordinate system provided by at least one embodiment of the present disclosure;
FIG. 10 shows loss descent curves of a cross entropy loss function and a focal loss function provided by at least one embodiment of the present disclosure;
FIG. 11A is a schematic diagram of a model result of a text detection neural network based on a cross entropy loss function provided by at least one embodiment of the present disclosure;
FIG. 11B is a schematic diagram of a model result of a text detection neural network based on a focal loss function provided in at least one embodiment of the present disclosure;
FIG. 12 is a schematic block diagram of a text recognition device provided in accordance with at least one embodiment of the present disclosure; and
Fig. 13 is a schematic diagram of a storage medium provided in at least one embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present disclosure. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art without the need for inventive faculty, are within the scope of the present disclosure, based on the described embodiments of the present disclosure.
Unless defined otherwise, technical or scientific terms used in this disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed. In order to keep the following description of the embodiments of the present disclosure clear and concise, the present disclosure omits a detailed description of some known functions and known components.
The point translation pen may be used to query new words. When querying, the pen tip of the point translation pen is aligned below the new word, and the new word can be quickly queried and translated by slightly adjusting the pen tip. Compared with other ways of querying new words, the point translation pen has the advantages of higher efficiency, a light pen body, portability, and the like.
The point translation pen may perform character recognition based on optical character recognition (OCR) technology to enable querying and translating text, e.g., foreign words. At present, various text detection technologies are continuously emerging in the field of OCR, and most of the text detection technologies with good performance are realized based on deep learning algorithms. For example, the text detection technologies may include a pixel connection (PixelLink) algorithm, which detects text boxes based on image segmentation. The pixel connection algorithm achieves a good text detection effect, but it requires a large amount of computation, its neural network model is difficult to converge quickly, and it handles changes of text scale in an image poorly, so it cannot be used directly in a point translation application scenario.
At least one embodiment of the present disclosure provides a text recognition method, a text recognition device, and a storage medium. The text recognition method can rotate a selected target text box containing the target text to be translated, thereby improving the accuracy of text recognition. When the text recognition method is applied to point translation technology, the text to be translated can be selected with a click and the translation result can be displayed directly, replacing the operation mode of a traditional key-based electronic dictionary. This improves the convenience of text query while improving the accuracy of text recognition, greatly improves learning efficiency, and increases reading volume. It should be noted that the point translation technology may be implemented based on a point translation pen; however, the disclosure is not limited thereto, and a product implementing the point translation technology may take other suitable forms instead of a pen form.
Embodiments of the present disclosure will be described in detail below with reference to the attached drawings, but the present disclosure is not limited to these specific embodiments.
In some embodiments, the text recognition method includes: acquiring an input image; performing text detection on the input image to determine a text box group, wherein the text box group comprises at least one text box; determining a target text box from the at least one text box, wherein the target text box comprises target text; and identifying the target text box to obtain target text.
In some embodiments, the target text box is corrected after the target text box is determined from at least one text box so as to more quickly and accurately recognize the target text.
Fig. 1 is a schematic flow chart of a text recognition method according to at least one embodiment of the present disclosure.
The text recognition method may be applied to point translation technology. For example, in some embodiments, the text recognition method provided by the present disclosure may be applied to a point translation pen. The embodiments of the present disclosure do not limit the specific structure, form, etc., of the point translation pen. The text recognition method provided by the present disclosure may also be applied to other suitable electronic products. The present disclosure is described in detail below by taking the application of the text recognition method to a point translation pen as an example.
For example, as shown in FIG. 1, the text recognition method includes, but is not limited to, the following steps:
S100: acquiring an input image;
S101: performing text detection on the input image to determine a text box group, wherein the text box group comprises at least one text box;
S102: determining a target text box from the at least one text box, wherein the target text box comprises target text;
S103: acquiring a coordinate set of the at least one text box and a deflection angle of the at least one text box relative to a reference direction, determining a correction angle and a correction direction for the target text box according to the deflection angle and the coordinate set of the at least one text box, and rotating the target text box according to the correction angle and the correction direction to obtain a final target text box;
S104: and recognizing the final target text box to obtain the target text.
For example, in step S100, the input image may be an image captured by an image acquisition device. For example, the point translation pen may include a camera, and the image acquisition device may be the camera on the point translation pen; that is, the input image is captured by the camera disposed on the point translation pen.
For example, the input image may be a grayscale image or a color image. The shape of the input image may be rectangular, diamond-shaped, circular, etc., which is not particularly limited by the present disclosure. In the embodiments of the present disclosure, an input image is described as a rectangle.
For example, the input image may be an original image directly acquired by the image acquisition device, or may be an image obtained after preprocessing the original image. For example, in order to avoid the influence of data quality, data imbalance, and the like of the input image on the character recognition, the character recognition method provided by the embodiment of the present disclosure may further include an operation of preprocessing the input image before performing text detection on the input image. Preprocessing may eliminate extraneous or noise information in the input image to facilitate better processing of the input image. Preprocessing may include, for example, scaling, cropping, gamma (Gamma) correction, image enhancement, or noise reduction filtering of the input image.
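For illustration only, a preprocessing sketch covering a few of the operations mentioned above; the scale factor, gamma value, and choice of denoising filter are assumptions, and a BGR color image is assumed:

```python
import cv2
import numpy as np

def preprocess(original_image, gamma=1.2, scale=0.5):
    # Scaling, gamma correction, and noise-reduction filtering of the original image.
    image = cv2.resize(original_image, None, fx=scale, fy=scale)
    lut = (((np.arange(256) / 255.0) ** (1.0 / gamma)) * 255).astype(np.uint8)
    image = cv2.LUT(image, lut)                            # gamma correction via lookup table
    return cv2.fastNlMeansDenoisingColored(image)          # noise-reduction filtering
```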
For example, the input image includes at least one text, the at least one text including the target text. It should be noted that the target text is a text that the user wishes to detect. An input image refers to a form in which text is visually presented, such as pictures of text, video, and the like.
For example, the target text may include: a word in English, French, German, Spanish, etc., or a word or character in Chinese, Japanese, Korean, etc.
For example, all text boxes in the text box group are rectangular boxes, diamond boxes, etc. In the embodiments of the present disclosure, text boxes are exemplified as rectangular boxes, but the present disclosure is not limited thereto.
Fig. 2A-2E are schematic illustrations of a plurality of intermediate input images provided in accordance with at least one embodiment of the present disclosure.
For example, in step S101, at least one text is included within each text box in the text box group. In some embodiments, one text is included within each text box, e.g., one text may be one English word (e.g., "order" or the like), one Chinese word (e.g., "network" or the like), one Chinese character (e.g., "high" or the like), and the like. It should be noted that in some embodiments, multiple texts may be included in each text box.
For example, step S101 may include:
S1011: performing scale transformation processing on the input image to obtain a plurality of intermediate input images;
S1012: for each intermediate input image of the plurality of intermediate input images, performing text detection on each intermediate input image to obtain an intermediate text box group corresponding to each intermediate input image, thereby obtaining a plurality of intermediate text box groups corresponding to the plurality of intermediate input images, wherein each intermediate text box group comprises at least one intermediate text box;
S1013: determining the text box group according to the plurality of intermediate text box groups.
For example, in step S1011, since the pixel connection algorithm does not adapt well to changes of text scale in the input image, the input image may be transformed to different scales to construct an image pyramid (i.e., a plurality of intermediate input images), so that various text scales can be handled while improving the accuracy of text detection.
For example, the plurality of intermediate input images may include the input image, and the sizes of the plurality of intermediate input images are different from each other. For example, in some embodiments, the size of the input image is W×H, that is, the width of the input image is W and the height of the input image is H; the input image is subjected to scale transformation processing so that its size is adjusted to 1.5×(W×H), 0.8×(W×H), 0.6×(W×H), and 0.4×(W×H), respectively, thereby obtaining a plurality of intermediate input images.
For example, the plurality of intermediate input images may include a first intermediate input image, a second intermediate input image, a third intermediate input image, a fourth intermediate input image, and a fifth intermediate input image. FIG. 2A illustrates the first intermediate input image, whose size is 0.4×(W×H); FIG. 2B illustrates the second intermediate input image, whose size is 0.6×(W×H); FIG. 2C illustrates the third intermediate input image, whose size is 0.8×(W×H); FIG. 2D illustrates the fourth intermediate input image, whose size is W×H, that is, the fourth intermediate input image is the input image itself; and FIG. 2E illustrates the fifth intermediate input image, whose size is 1.5×(W×H). It should be noted that the sizes of the plurality of intermediate input images are not limited to the above-described sizes and may be set arbitrarily according to actual situations. In addition, the plurality of intermediate input images may not include the input image.
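A sketch of this scale transformation step using the scale factors listed above (the interpolation mode is an assumption):

```python
import cv2

def build_image_pyramid(input_image, scales=(0.4, 0.6, 0.8, 1.0, 1.5)):
    # One intermediate input image per scale factor; the factor 1.0 keeps the input image itself.
    return [cv2.resize(input_image, None, fx=s, fy=s, interpolation=cv2.INTER_LINEAR)
            for s in scales]
```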
For example, in step S1012, text detection is performed on each of the plurality of intermediate input images, thereby obtaining a plurality of intermediate text box groups corresponding to the plurality of intermediate input images one by one. The text detection mode of each intermediate input image is the same and is based on a pixel connection algorithm.
For example, the number of intermediate text boxes of each intermediate text box group may be the same, and each text group contained within the intermediate text box of each intermediate text box group may be the same. The "text group" represents a collection of text contained by all intermediate text boxes in the intermediate text box group. In the intermediate input images shown in fig. 2A to 2E, the number of intermediate text boxes in the intermediate text box group corresponding to the first intermediate input image may be 8, the number of intermediate text boxes in the intermediate text box group corresponding to the second intermediate input image may be 8, the number of intermediate text boxes in the intermediate text box group corresponding to the third intermediate input image may be 8, the number of intermediate text boxes in the intermediate text box group corresponding to the fourth intermediate input image may be 8, and the number of intermediate text boxes in the intermediate text box group corresponding to the fifth intermediate input image may be 8. Taking the first intermediate input image and the fifth intermediate input image as examples, the text group included in the intermediate text box of the intermediate text box group corresponding to the first intermediate input image includes text: "ur", "of", "French", "Spring's", "studio", "to", "view" and "desig"; the text group contained in the intermediate text box of the intermediate text box group corresponding to the fifth intermediate input image also includes text: "ur", "of", "French", "Spring's", "studio", "to", "view" and "desig". Also, an intermediate text box including "ur" corresponding to the first intermediate input image and an intermediate text box including "ur" corresponding to the fifth intermediate input image correspond to each other, an intermediate text box including "French" corresponding to the first intermediate input image and an intermediate text box including "French" corresponding to the fifth intermediate input image correspond to each other, and so on.
It should be understood that, in practical applications, since the sizes of the plurality of intermediate input images are different, the number of intermediate text boxes of the plurality of intermediate text box groups obtained by text detecting the plurality of intermediate input images may be different, and each text group included in the intermediate text boxes of each intermediate text box group may be different.
For example, in step S1012, performing text detection on each intermediate input image to obtain an intermediate text box group corresponding to each intermediate input image includes: performing text detection on each intermediate input image by using a text detection neural network to determine a text detection region group corresponding to each intermediate input image; and processing the text detection area group by using a minimum circumscribed rectangle algorithm to determine an intermediate text box group.
For example, the text detection neural network may employ a pixel connection (PixelLink) algorithm for text detection.
For example, the text detection area group includes at least one text detection area, the at least one text detection area corresponds one-to-one to the at least one intermediate text box, and each intermediate text box contains its corresponding text detection area, that is, the intermediate text box covers the corresponding text detection area. For example, after obtaining the text detection area group, contour detection may first be performed on the text detection area group by using the OpenCV-based contour detection (findContours) function to obtain the contours of all text detection areas in the text detection area group; then, the contours of all the text detection areas are processed by using the OpenCV-based minimum circumscribed rectangle (minAreaRect) function and a union method to obtain the minimum circumscribed rectangle of the contour of each text detection area, thereby finally obtaining all the intermediate text boxes in the intermediate text box group.
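As an illustration of the post-processing just described, the following Python sketch converts a binary text-detection mask into rectangular intermediate text boxes using OpenCV's findContours and minAreaRect functions. It is only a sketch: the mask format, the contour retrieval mode, and the helper name are assumptions for this example rather than details fixed by the present disclosure.

```python
import cv2
import numpy as np

def regions_to_boxes(region_mask):
    """Convert a binary text-detection mask into rectangular intermediate text boxes.

    region_mask: uint8 array of shape (H, W), non-zero where pixels were classified as text.
    Returns a list of 4x2 integer arrays, each holding the four vertex coordinates of the
    minimum circumscribed rectangle of one text detection area (an intermediate text box).
    """
    # Contour detection of all text detection areas in the mask (OpenCV >= 4 return signature).
    contours, _ = cv2.findContours(region_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    boxes = []
    for contour in contours:
        # Minimum circumscribed (rotated) rectangle of the contour.
        rect = cv2.minAreaRect(contour)      # ((cx, cy), (w, h), angle)
        vertices = cv2.boxPoints(rect)       # the 4 corner points of the rectangle
        boxes.append(np.intp(vertices))
    return boxes
```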
Fig. 3 is a schematic diagram of a text detection neural network provided in at least one embodiment of the present disclosure.
For example, the text detection neural network may employ the VGG16 network as a feature extractor and replace the full connection layers in the VGG16 network with convolution layers. In the PixelLink algorithm, feature fusion and pixel prediction follow the FPN (Feature Pyramid Network) idea, that is, the size of the feature maps output by successive convolution stages of the text detection neural network is halved stage by stage, while the number of convolution kernels in the convolution layers is doubled stage by stage.
For example, as shown in fig. 3, in some embodiments, the text detection neural network may include a first convolution module 301 through a fifth convolution module 305, a first downsampling module 306 through a fifth downsampling module 310, a full connection module 311, a first upsampling module 312 through a third upsampling module 314, a first dimension reduction module 315 through a fourth dimension reduction module 318, and a classifier 319.
For example, the first convolution module 301 may include two convolution layers conv1_1 and conv1_2, each convolution layer in the first convolution module 301 including 8 convolution kernels; the second convolution module 302 may include two convolution layers conv2_1 and conv2_2, each convolution layer in the second convolution module 302 including 16 convolution kernels; the third convolution module 303 may include three convolution layers conv3_1 through conv3_3, each convolution layer in the third convolution module 303 including 32 convolution kernels; the fourth convolution module 304 may include three convolution layers conv4_1 through conv4_3, each convolution layer in the fourth convolution module 304 including 64 convolution kernels; the fifth convolution module 305 may include three convolution layers conv5_1 through conv5_3, each convolution layer in the fifth convolution module 305 including 128 convolution kernels. It should be noted that each convolution layer includes an activation function, for example, the activation function may be a ReLU activation function.
For example, each of the first downsampling module 306 through the fifth downsampling module 310 may include a downsampling layer. On the one hand, the downsampling layer can reduce the scale of the input image, simplify the computation, and reduce overfitting to a certain extent; on the other hand, the downsampling layer can also perform feature compression to extract the main features of the input image. The downsampling layer reduces the size of the feature maps without changing their number; that is, the downsampling process is used to reduce the size of the feature maps, thereby reducing the amount of data of the feature maps. For example, in some embodiments, the downsampling layer may employ maximum pooling (max pooling) for downsampling, with a downsampling factor of 1/(2×2) for all downsampling layers, but the present disclosure is not limited thereto; for example, in other embodiments, downsampling may be implemented using average pooling, strided convolution, decimation (e.g., selecting fixed pixels), demultiplexing the output (demuxout, i.e., splitting the input image into multiple smaller images), or the like.
For example, the full connection module 311 includes two full connection layers fc6 and fc7. The full connection layer fc6 is a convolution layer and includes 256 convolution kernels, and the full connection layer fc7 is also a convolution layer and includes 256 convolution kernels.
For example, each of the first upsampling module 312 through the third upsampling module 314 may include an upsampling layer for performing upsampling processing, and the upsampling factor of all the upsampling layers may be 2×2. For example, the upsampling process is used to increase the size of the feature maps, thereby increasing the amount of data of the feature maps. For example, the upsampling layer may implement the upsampling process using strided transposed convolution, an interpolation algorithm, or another upsampling method. The interpolation algorithms may include, for example, bilinear interpolation, bicubic interpolation, and the like.
For example, each of the first dimension reduction module 315 through the fourth dimension reduction module 318 may include 1×1 convolution kernels; for example, the first dimension reduction module 315 may include 10 1×1 convolution kernels, the second dimension reduction module 316 may include 10 1×1 convolution kernels, the third dimension reduction module 317 may include 10 1×1 convolution kernels, and the fourth dimension reduction module 318 may include 10 1×1 convolution kernels.
For example, the classifier 319 may include two softmax classifiers, namely a first softmax classifier and a second softmax classifier. The first softmax classifier is used for classifying and predicting whether each pixel is text or non-text (positive or negative), and the second softmax classifier is used for classifying and predicting whether there is a link between each pixel and the pixels in its four neighborhoods.
In this disclosure, each of the convolution layer, the downsampling layer, the upsampling layer, and the like refers to a corresponding processing operation, that is, convolution processing, downsampling processing, upsampling processing, and the like, and a description thereof will not be repeated.
For example, performing text detection on each intermediate input image using a text detection neural network to determine a set of text detection regions corresponding to each intermediate input image includes: carrying out convolution processing on each intermediate input image by using a first convolution module to obtain a first convolution characteristic image group; performing downsampling processing on the first convolution feature image group by using a first downsampling module to obtain a first downsampled feature image group; performing convolution processing on the first downsampled feature map group by using a second convolution module to obtain a second convolution feature map group; performing downsampling processing on the second convolution feature image group by using a second downsampling module to obtain a second downsampled feature image group; performing convolution processing on the second downsampled feature map group by using a third convolution module to obtain a third convolution feature map group; performing downsampling processing on the third convolution feature image group by using a third downsampling module to obtain a third downsampled feature image group, and performing dimension reduction processing on the third convolution feature image group by using a first dimension reduction module to obtain a first dimension reduction feature image group; performing convolution processing on the third downsampled feature map group by using a fourth convolution module to obtain a fourth convolution feature map group; performing downsampling processing on the fourth convolution feature image group by using a fourth downsampling module to obtain a fourth downsampled feature image group, and performing dimension reduction processing on the fourth convolution feature image group by using a second dimension reduction module to obtain a second dimension reduction feature image group; carrying out convolution processing on the fourth downsampled feature map group by using a fifth convolution module to obtain a fifth convolution feature map group; performing downsampling processing on the fifth convolution feature image group by using a fifth downsampling module to obtain a fifth downsampled feature image group, and performing dimension reduction processing on the fifth convolution feature image group by using a third dimension reduction module to obtain a third dimension reduction feature image group; carrying out convolution processing on the fifth downsampled feature map group by using a full connection module to obtain a sixth convolution feature map group; performing dimension reduction processing on the sixth convolution feature image group by using a fourth dimension reduction module to obtain a fourth dimension reduction feature image group; performing up-sampling processing on the fourth dimension-reduction feature image group by using a first up-sampling module to obtain a first up-sampling feature image group; performing fusion processing on the first up-sampling feature image group and the third dimension-reduction feature image group to obtain a first fusion feature image group; performing upsampling processing on the first fusion feature image group by using a second upsampling module to obtain a second upsampled feature image group; performing fusion processing on the second up-sampling feature image group and the second dimension-reduction feature image group to obtain a second fusion feature image group; performing upsampling processing on the second fusion 
feature image group by using a third upsampling module to obtain a third upsampled feature image group; performing fusion processing on the third upsampling feature map set and the first dimension reduction feature map set to obtain a third fusion feature map set; classifying the third fusion feature image group by using a classifier to obtain a text classification prediction image and a connection classification prediction image; and determining a text detection area group according to the connection classification prediction graph and the text classification prediction graph.
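To make the data flow just described concrete, the following PyTorch sketch assembles a detector with the same stage widths (8/16/32/64/128 kernels), 1×1 dimension reduction to 10 channels, additive fusion, and a 2-map text head plus an 8-map link head. It is only an illustrative sketch under stated assumptions: the kernel sizes of the full connection layers, the use of bilinear interpolation for upsampling, and the 1×1 classifier heads are assumptions not fixed by the passage above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch, n_layers):
    # n_layers 3x3 convolutions with ReLU, keeping the spatial size (padding=1).
    layers = []
    for i in range(n_layers):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class SmallTextDetector(nn.Module):
    """Sketch of a Fig.-3-style lightweight detector: 8/16/32/64/128-kernel stages,
    1x1 dimension reduction to 10 channels, additive fusion, 2 text maps + 8 link maps."""
    def __init__(self):
        super().__init__()
        self.conv1 = conv_block(3, 8, 2)      # first convolution module
        self.conv2 = conv_block(8, 16, 2)     # second convolution module
        self.conv3 = conv_block(16, 32, 3)    # third convolution module
        self.conv4 = conv_block(32, 64, 3)    # fourth convolution module
        self.conv5 = conv_block(64, 128, 3)   # fifth convolution module
        self.pool = nn.MaxPool2d(2, 2)        # downsampling modules, factor 1/(2x2)
        # Full connection module realised as two convolution layers with 256 kernels
        # (the kernel sizes chosen here are an assumption).
        self.fc = nn.Sequential(nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True),
                                nn.Conv2d(256, 256, 1), nn.ReLU(inplace=True))
        # Dimension reduction modules: 10 convolution kernels of size 1x1 each.
        self.dr1 = nn.Conv2d(32, 10, 1)
        self.dr2 = nn.Conv2d(64, 10, 1)
        self.dr3 = nn.Conv2d(128, 10, 1)
        self.dr4 = nn.Conv2d(256, 10, 1)
        # Classifier heads: 2 text/non-text maps and 8 link/non-link maps.
        self.text_head = nn.Conv2d(10, 2, 1)
        self.link_head = nn.Conv2d(10, 8, 1)

    def forward(self, x):                     # x: (N, 3, 512, 512)
        cn1 = self.conv1(x)                   # 8 x 512 x 512
        cn2 = self.conv2(self.pool(cn1))      # 16 x 256 x 256
        cn3 = self.conv3(self.pool(cn2))      # 32 x 128 x 128
        cn4 = self.conv4(self.pool(cn3))      # 64 x 64 x 64
        cn5 = self.conv5(self.pool(cn4))      # 128 x 32 x 32
        cn6 = self.fc(self.pool(cn5))         # 256 x 16 x 16
        up = lambda t: F.interpolate(t, scale_factor=2, mode='bilinear', align_corners=False)
        fu1 = up(self.dr4(cn6)) + self.dr3(cn5)   # fusion = element-wise addition, 10 x 32 x 32
        fu2 = up(fu1) + self.dr2(cn4)             # 10 x 64 x 64
        fu3 = up(fu2) + self.dr1(cn3)             # 10 x 128 x 128
        # Softmax over the 2 text channels and over each (link, no-link) pair.
        text_prob = F.softmax(self.text_head(fu3), dim=1)
        link_logits = self.link_head(fu3)                         # (N, 8, H, W)
        link_prob = F.softmax(link_logits.view(x.shape[0], 4, 2, *link_logits.shape[2:]), dim=2)
        return text_prob, link_prob
```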
For example, as shown in fig. 3, in some embodiments, each intermediate input image may have a size of 512 x 512 with 3 channels, the 3 channels being the red, blue, and green channels, respectively.
For example, as shown in fig. 3, the number of feature maps in the first convolution feature map set CN1 is 8, and the size of each feature map in the first convolution feature map set CN1 may be 512×512; the number of feature maps in the second convolution feature map set CN2 is 16, and the size of each feature map in the second convolution feature map set CN2 may be 256×256; the number of feature maps in the third convolution feature map set CN3 is 32, and the size of each feature map in the third convolution feature map set CN3 may be 128×128; the number of feature maps in the fourth convolution feature map set CN4 is 64, and the size of each feature map in the fourth convolution feature map set CN4 may be 64×64; the number of feature maps in the fifth convolution feature map set CN5 is 128, and the size of each feature map in the fifth convolution feature map set CN5 may be 32×32; the number of feature maps in the sixth convolution feature map set CN6 is 256, and the size of each feature map in the sixth convolution feature map set CN6 may be 16×16.
For example, the size of the feature map in the third convolution feature map set CN3 is 1/(4*4) of the size of the intermediate input image, the size of the feature map in the fourth convolution feature map set CN4 is 1/(8×8) of the size of the intermediate input image, and the size of the feature map in the fifth convolution feature map set CN5 is 1/(16×16) of the size of the intermediate input image.
For example, the first convolution feature map set CN1 is input to the first downsampling module 306, and the first downsampling module 306 performs downsampling on the first convolution feature map set CN1 to obtain a first downsampled feature map set DP1, where the number of feature maps in the first downsampled feature map set DP1 is 8, and the size of each feature map in the first downsampled feature map set DP1 is 256×256. The first downsampled feature map set DP1 is the input to the second convolution module 302.

For example, the second convolution feature map set CN2 is input to the second downsampling module 307, and the second downsampling module 307 performs downsampling on the second convolution feature map set CN2 to obtain a second downsampled feature map set DP2, where the number of feature maps in the second downsampled feature map set DP2 is 16, and the size of each feature map in the second downsampled feature map set DP2 is 128×128. The second downsampled feature map set DP2 is the input to the third convolution module 303.

For example, the third convolution feature map set CN3 is input to the third downsampling module 308, and the third downsampling module 308 performs downsampling on the third convolution feature map set CN3 to obtain a third downsampled feature map set DP3, where the number of feature maps in the third downsampled feature map set DP3 is 32, and the size of each feature map in the third downsampled feature map set DP3 is 64×64. The third downsampled feature map set DP3 is the input to the fourth convolution module 304.

For example, the fourth convolution feature map set CN4 is input to the fourth downsampling module 309, and the fourth downsampling module 309 performs downsampling on the fourth convolution feature map set CN4 to obtain a fourth downsampled feature map set DP4, where the number of feature maps in the fourth downsampled feature map set DP4 is 64, and the size of each feature map in the fourth downsampled feature map set DP4 is 32×32. The fourth downsampled feature map set DP4 is the input to the fifth convolution module 305.

For example, the fifth convolution feature map set CN5 is input to the fifth downsampling module 310, and the fifth downsampling module 310 performs downsampling on the fifth convolution feature map set CN5 to obtain a fifth downsampled feature map set DP5, where the number of feature maps in the fifth downsampled feature map set DP5 is 128, and the size of each feature map in the fifth downsampled feature map set DP5 is 16×16. The fifth downsampled feature map set DP5 is the input to the full connection module 311.
For example, the full connection module 311 performs convolution processing on the fifth downsampled feature map set DP5 to obtain a sixth convolved feature map set CN6, where the number of feature maps in the sixth convolved feature map set CN6 is 256, and the size of each feature map in the sixth convolved feature map set CN6 is 16×16.
For example, the third convolution feature map set CN3 is further input to the first dimension reduction module 315, the first dimension reduction module 315 performs dimension reduction processing on the third convolution feature map set CN3 to obtain a first dimension reduction feature map set DR1, the number of feature maps in the first dimension reduction feature map set DR1 is 10, and the size of each feature map in the first dimension reduction feature map set DR1 is 128×128.
For example, the fourth convolution feature map set CN4 is further input to the second dimension reduction module 316, the second dimension reduction module 316 performs dimension reduction processing on the fourth convolution feature map set CN4 to obtain a second dimension reduction feature map set DR2, the number of feature maps in the second dimension reduction feature map set DR2 is 10, and the size of each feature map in the second dimension reduction feature map set DR2 is 64×64.
For example, the fifth convolution feature map set CN5 is further input to the third dimension reduction module 317, the third dimension reduction module 317 performs dimension reduction processing on the fifth convolution feature map set CN5 to obtain a third dimension reduction feature map set DR3, the number of feature maps in the third dimension reduction feature map set DR3 is 10, and the size of each feature map in the third dimension reduction feature map set DR3 is 32×32.
For example, the sixth convolution feature map set CN6 is further input to the fourth dimension reduction module 318, the fourth dimension reduction module 318 performs dimension reduction processing on the sixth convolution feature map set CN6 to obtain a fourth dimension reduction feature map set DR4, the number of feature maps in the fourth dimension reduction feature map set DR4 is 10, and the size of each feature map in the fourth dimension reduction feature map set DR4 is 16×16.
For example, the fourth dimension-reduction feature map set DR4 is input to the first upsampling module 312, and the first upsampling module 312 performs an upsampling process on the fourth dimension-reduction feature map set DR4 to obtain a first upsampled feature map set UP1, where the number of feature maps in the first upsampled feature map set UP1 is 10, and the size of each feature map in the first upsampled feature map set UP1 is 32×32. Then, the first UP-sampling feature map set UP1 and the third dimension-reduction feature map set DR3 are subjected to fusion processing to obtain a first fusion feature map set FU1. The number of feature maps in the first fused feature map set FU1 is 10, and the size of each feature map in the first fused feature map set FU1 is 32 x 32.
For example, the first fused feature map set FU1 is input to the second upsampling module 313, the second upsampling module 313 performs an upsampling process on the first fused feature map set FU1 to obtain a second upsampled feature map set UP2, the number of feature maps in the second upsampled feature map set UP2 is 10, and the size of each feature map in the second upsampled feature map set UP2 is 64×64. Then, the second UP-sampling feature map set UP2 and the second dimension-reduction feature map set DR2 are fused, so as to obtain a second fused feature map set FU2. The number of feature maps in the second fused feature map set FU2 is 10, and the size of each feature map in the second fused feature map set FU2 is 64×64.
For example, the second fused feature map set FU2 is input to the third upsampling module 314, and the third upsampling module 314 performs an upsampling process on the second fused feature map set FU2 to obtain a third upsampled feature map set UP3, where the number of feature maps in the third upsampled feature map set UP3 is 10, and the size of each feature map in the third upsampled feature map set UP3 is 128×128. Then, the third UP-sampling feature map set UP3 and the first dimension-reduction feature map set DR1 are subjected to fusion processing to obtain a third fusion feature map set FU3. The number of feature maps in the third fused feature map set FU3 is 10, and the size of each feature map in the third fused feature map set FU3 is 128×128.
It should be noted that, in the embodiment of the present disclosure, the fusion process may include a summation process, for example, "fusion process" may mean adding values of corresponding pixels in the corresponding feature map to obtain a new feature map. For example, for the first UP-sampling feature map group UP1 and the third dimension-reduction feature map group DR3, the "fusion processing" means adding the values of the pixel in one feature map of the first UP-sampling feature map group UP1 and the corresponding pixel in the feature map corresponding to the feature map in the third dimension-reduction feature map group DR3 to obtain a new feature map. The "fusion process" does not change the number and size of feature maps.
Fig. 4 is a schematic diagram of a pixel and a neighboring pixel of the pixel in a feature map according to at least one embodiment of the present disclosure.
For example, the classifier 319 performs classification processing on the third fusion feature map group FU3 to obtain a text classification prediction map and a connection classification prediction map. For example, the text classification prediction graph includes 2 feature graphs, and the connection classification prediction graph includes 8 feature graphs, and it should be noted that the values of the pixels in each of the text classification prediction graph and the connection classification prediction graph are equal to or greater than 0 and equal to or less than 1, and represent the text prediction probability or the connection prediction probability. The feature map in the text classification prediction map indicates a probability map of whether each pixel is text, and the feature map in the connection classification prediction map indicates a probability map of whether each pixel is connected to adjacent pixels of four neighborhoods of the pixel.
For example, the 2 feature maps in the text classification prediction map include a text feature map and a non-text feature map; the text feature map represents the prediction probability that each pixel in the intermediate input image belongs to text, the non-text feature map represents the prediction probability that each pixel in the intermediate input image belongs to non-text, and the values of corresponding pixel points in the two feature maps sum to 1. As shown in fig. 4, for the pixel PX1, if the value of the pixel PX1 in the text feature map is 0.75, that is, the prediction probability that the pixel PX1 belongs to text is 0.75, then the value of the pixel PX1 in the non-text feature map is 0.25, that is, the prediction probability that the pixel PX1 does not belong to text is 0.25. For example, in some embodiments, a type probability threshold may be set, for example, to 0.7; when the prediction probability of a pixel belonging to text is equal to or greater than the type probability threshold, the pixel belongs to text. Thus, since the prediction probability that the pixel PX1 belongs to text is 0.75, the pixel PX1 belongs to text, that is, the pixel PX1 is a positive pixel (pixel positive). Note that, if a pixel does not belong to text, the pixel is a negative pixel (pixel negative).
For example, as shown in fig. 4, in the direction R1, the pixel PX4 and the pixel PX5 are directly adjacent to the pixel PX1, and in the direction C1, the pixel PX2 and the pixel PX3 are directly adjacent to the pixel PX1, that is, the pixels PX2 to PX5 are adjacent pixels of four neighborhoods of the pixel PX1, and are located above, below, right, and left of the pixel PX1, respectively. In some embodiments, the pixel arrays in each feature map are arranged in a plurality of rows and columns, the direction R1 may be a row direction of the pixels, and the direction C1 may be a column direction of the pixels.
For example, the 8 feature maps in the connection classification prediction map may include a first classification feature map, a second classification feature map, a third classification feature map, a fourth classification feature map, a fifth classification feature map, a sixth classification feature map, a seventh classification feature map, and an eighth classification feature map. As shown in fig. 4, for the pixel PX1, the first classification feature map represents the connection prediction probability from the pixel PX1 toward the pixel PX2, and the second classification feature map represents the disconnection prediction probability from the pixel PX1 toward the pixel PX2; the third classification feature map represents the connection prediction probability from the pixel PX1 toward the pixel PX3, and the fourth classification feature map represents the disconnection prediction probability from the pixel PX1 toward the pixel PX3; the fifth classification feature map represents the connection prediction probability from the pixel PX1 toward the pixel PX4, and the sixth classification feature map represents the disconnection prediction probability from the pixel PX1 toward the pixel PX4; the seventh classification feature map represents the connection prediction probability from the pixel PX1 toward the pixel PX5, and the eighth classification feature map represents the disconnection prediction probability from the pixel PX1 toward the pixel PX5. Taking the determination of whether the pixel PX1 is connected to the pixel PX2 as an example, the connection between the pixel PX1 and the pixel PX2 is determined by the pixel PX1 and the pixel PX2: if the pixel PX1 and the pixel PX2 are both positive pixels, the connection between the pixel PX1 and the pixel PX2 is a positive connection (positive link); if one of the pixel PX1 and the pixel PX2 is a positive pixel, the connection between the pixel PX1 and the pixel PX2 is also a positive connection; and if both the pixel PX1 and the pixel PX2 are negative pixels, the connection between the pixel PX1 and the pixel PX2 is a negative connection (negative link).
For example, as shown in fig. 4, for the pixel PX1, the value of the pixel PX1 in the first classification feature map is 0.8, which means that the connection prediction probability of the pixel PX1 and the pixel PX2 is 0.8; the value of the pixel PX1 in the second classification feature map is 0.2, which means that the disconnection prediction probability of the pixel PX1 and the pixel PX2 is 0.2; the value of the pixel PX1 in the third classification feature map is 0.6, which means that the connection prediction probability of the pixel PX1 and the pixel PX3 is 0.6; the value of the pixel PX1 in the fourth classification feature map is 0.4, which means that the disconnection prediction probability of the pixel PX1 and the pixel PX3 is 0.4; and so on. For example, in some embodiments, a classification probability threshold may be set, for example, to 0.7; when the connection prediction probability of a pixel is equal to or greater than the classification probability threshold, the pixel may be connected to the corresponding adjacent pixel. For example, in the above example, the value of the pixel PX1 in the first classification feature map is 0.8, that is, the connection prediction probability (0.8) of the pixel PX1 and the pixel PX2 is greater than the classification probability threshold (0.7); thus, in the direction from the pixel PX1 toward the pixel PX2, the connection between the pixel PX1 and the pixel PX2 is a positive connection with a connection prediction probability of 0.8. The value of the pixel PX1 in the third classification feature map is 0.6, that is, the connection prediction probability (0.6) of the pixel PX1 and the pixel PX3 is smaller than the classification probability threshold (0.7); thus, in the direction from the pixel PX1 toward the pixel PX3, the connection between the pixel PX1 and the pixel PX3 is a negative connection.
It should be noted that the above type probability threshold and classification probability threshold are merely illustrative, and the type probability threshold and classification probability threshold may be set according to actual application requirements.
For example, according to the connection classification prediction map and the text classification prediction map, the text detection region group may be determined by means of a union-find approach. For example, each intermediate input image is passed through the text detection neural network shown in fig. 3 to obtain the classification prediction probability of text/non-text (positive/negative) for each pixel, and the connection prediction probability of whether there is a connection (link) between each pixel and the adjacent pixels in its four neighborhood directions. The text prediction result and the connection prediction result are filtered by setting the type probability threshold and the classification probability threshold, respectively, so that a positive pixel set and a positive connection set can be obtained; then the positive pixels are connected according to the positive connections so as to group the positive pixels together, for example, a set of connected domains (Connected Components) of the positive pixels can be generated by using a union-find method. In order to prevent the influence of noise, the connected domain set can be subjected to denoising processing, that is, connected domains whose short sides are smaller than 10 pixels or whose areas are smaller than 300 pixels are removed from the connected domain set. The connected domains remaining in the connected domain set after the denoising process represent the detected text detection regions.
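The grouping and denoising steps above can be sketched as follows. This is a simplified Python illustration only: the shapes of the probability arrays, the single-direction link test, the default thresholds, and the helper names are assumptions rather than details prescribed by the present disclosure.

```python
import cv2
import numpy as np

def find(parent, x):
    # Path-compressed find for the union-find structure.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def group_positive_pixels(text_prob, link_prob, type_thr=0.7, link_thr=0.7):
    """Group positive pixels into connected domains using positive links (4-neighborhood).

    text_prob: (H, W) probability that each pixel is text.
    link_prob: (4, H, W) link probability toward the up/down/right/left neighbor.
    Returns an (H, W) integer label map; 0 marks background.
    """
    H, W = text_prob.shape
    positive = text_prob >= type_thr
    parent = {p: p for p in zip(*np.nonzero(positive))}
    offsets = [(-1, 0), (1, 0), (0, 1), (0, -1)]   # up, down, right, left
    for d, (dy, dx) in enumerate(offsets):
        for (y, x) in parent:
            ny, nx = y + dy, x + dx
            if (ny, nx) in parent and link_prob[d, y, x] >= link_thr:
                ra, rb = find(parent, (y, x)), find(parent, (ny, nx))
                if ra != rb:
                    parent[ra] = rb                # union the two connected domains
    labels = np.zeros((H, W), dtype=np.int32)
    roots = {}
    for p in parent:
        r = find(parent, p)
        labels[p] = roots.setdefault(r, len(roots) + 1)
    return labels

def denoise(labels, min_side=10, min_area=300):
    """Remove connected domains whose short side is below min_side pixels
    or whose area is below min_area pixels."""
    out = labels.copy()
    for lab in np.unique(labels):
        if lab == 0:
            continue
        ys, xs = np.nonzero(labels == lab)
        pts = np.stack([xs, ys], axis=1).astype(np.float32)
        (_, _), (w, h), _ = cv2.minAreaRect(pts)   # short side of the minimum circumscribed rectangle
        if min(w, h) < min_side or len(xs) < min_area:
            out[labels == lab] = 0
    return out
```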
Fig. 5 is a schematic diagram of a text detection neural network.
For example, as shown in fig. 5, in other embodiments, the text detection neural network includes a first convolution module 501 through a fifth convolution module 505, a first downsampling module 506 through a fifth downsampling module 510, a full connection module 511, a first upsampling module 512 through a third upsampling module 514, a first dimension reduction module 515 through a fifth dimension reduction module 519, and a classifier 520.
For example, the first convolution module 501 may include two convolution layers conv51_1 and conv51_2, each convolution layer in the first convolution module 501 including 64 convolution kernels; the second convolution module 502 may include two convolution layers conv52_1 and conv52_2, each convolution layer in the second convolution module 502 including 128 convolution kernels; the third convolution module 503 may include three convolution layers conv53_1 through conv53_3, each convolution layer in the third convolution module 503 including 256 convolution kernels; the fourth convolution module 504 may include three convolution layers conv54_1 through conv54_3, each convolution layer in the fourth convolution module 504 including 512 convolution kernels; the fifth convolution module 505 may include three convolution layers conv55_1 through conv55_3, each convolution layer in the fifth convolution module 505 including 512 convolution kernels. It should be noted that each convolution layer includes an activation function, for example, the activation function may be a ReLU activation function.
For example, each of the first through fifth downsampling modules 506 through 510 may include a downsampling layer. For example, in some embodiments, the downsampling layer may employ maximum pooling (max pooling) for the downsampling process. The downsampling factors of the downsampling layers in the first to fourth downsampling modules 506 to 509 are 1/(2×2), and the downsampling factor of the downsampling layer in the fifth downsampling module 510 is 1, that is, the size of the feature map is unchanged after the feature map is processed by the downsampling layer in the fifth downsampling module 510.
For example, the full connection module 511 includes two full connection layers fc56 and fc57. The full connection layer fc56 is a convolution layer and includes 512 convolution kernels, and the full connection layer fc57 is also a convolution layer and includes 512 convolution kernels.
For example, each of the first to third upsampling modules 512 to 514 may include an upsampling layer for performing an upsampling process, and an upsampling factor of each upsampling layer may be 2×2.
For example, each of the first dimension reduction module 515 through the fifth dimension reduction module 519 may include 1×1 convolution kernels; for example, the first dimension reduction module 515 may include 18 1×1 convolution kernels, the second dimension reduction module 516 may include 18 1×1 convolution kernels, the third dimension reduction module 517 may include 18 1×1 convolution kernels, the fourth dimension reduction module 518 may include 18 1×1 convolution kernels, and the fifth dimension reduction module 519 may include 18 1×1 convolution kernels.
For example, the classifier 520 may include two softmax classifiers, namely a first softmax classifier and a second softmax classifier. The first softmax classifier is used for classifying and predicting whether each pixel is text or non-text (positive or negative), and the second softmax classifier is used for classifying and predicting whether there is a link between each pixel and the pixels in its eight neighborhoods.
For example, performing text detection on each intermediate input image using a text detection neural network to determine a set of text detection regions corresponding to each intermediate input image includes: carrying out convolution processing on the input image by using a first convolution module to obtain a first convolution characteristic image group; performing downsampling processing on the first convolution feature image group by using a first downsampling module to obtain a first downsampled feature image group; performing convolution processing on the first downsampled feature map group by using a second convolution module to obtain a second convolution feature map group; performing downsampling processing on the second convolution feature image group by using a second downsampling module to obtain a second downsampled feature image group, and performing dimension reduction processing on the second convolution feature image group by using a first dimension reduction module to obtain a first dimension reduction feature image group; performing convolution processing on the second downsampled feature map group by using a third convolution module to obtain a third convolution feature map group; performing downsampling processing on the third convolution feature image group by using a third downsampling module to obtain a third downsampled feature image group, and performing dimension reduction processing on the third convolution feature image group by using a second dimension reduction module to obtain a second dimension reduction feature image group; performing convolution processing on the third downsampled feature map group by using a fourth convolution module to obtain a fourth convolution feature map group; performing downsampling processing on the fourth convolution feature image group by using a fourth downsampling module to obtain a fourth downsampled feature image group, and performing dimension reduction processing on the fourth convolution feature image group by using a third dimension reduction module to obtain a third dimension reduction feature image group; carrying out convolution processing on the fourth downsampled feature map group by using a fifth convolution module to obtain a fifth convolution feature map group; performing downsampling processing on the fifth convolution feature image group by using a fifth downsampling module to obtain a fifth downsampled feature image group, and performing dimension reduction processing on the fifth convolution feature image group by using a fourth dimension reduction module to obtain a fourth dimension reduction feature image group; performing convolution processing on the fifth downsampled feature map group by using a full connection module to obtain a sixth convolution feature map group; performing dimension reduction processing on the sixth convolution feature image group by using a fifth dimension reduction module to obtain a fifth dimension reduction feature image group; performing fusion processing on the fourth dimension reduction feature image group and the fifth dimension reduction feature image group to obtain a first fusion feature image group; performing upsampling processing on the first fusion feature image group by using a first upsampling module to obtain a first upsampled feature image group; performing fusion processing on the first up-sampling feature image group and the third dimension-reduction feature image group to obtain a second fusion feature image group; performing upsampling processing on the 
second fused feature map set by using a second upsampling module to obtain a second upsampled feature map set; performing fusion processing on the second up-sampling feature image group and the second dimension-reduction feature image group to obtain a third fusion feature image group; performing upsampling processing on the third fusion feature image group by using a third upsampling module to obtain a third upsampled feature image group; performing fusion processing on the third upsampling feature map set and the first dimension reduction feature map set to obtain a fourth fusion feature map set; classifying the fourth fusion feature image group by using a classifier to obtain a text classification prediction image and a connection classification prediction image; and determining a text detection area group according to the connection classification prediction graph and the text classification prediction graph.
For example, as shown in fig. 5, in some embodiments, each intermediate input image may have a size of 512 x 512 with 3 channels, the 3 channels being the red, blue, and green channels, respectively.
For example, as shown in fig. 5, the number of feature maps in the first convolution feature map set CN51 is 64, and the size of each feature map in the first convolution feature map set CN51 may be 512×512; the number of feature maps in the second convolution feature map set CN52 is 128, and the size of each feature map in the second convolution feature map set CN52 may be 256×256; the number of feature maps in the third convolution feature map set CN53 is 256, and the size of each feature map in the third convolution feature map set CN53 may be 128×128; the number of feature maps in the fourth convolution feature map set CN54 is 512, and the size of each feature map in the fourth convolution feature map set CN54 may be 64×64; the number of feature maps in the fifth convolution feature map set CN55 is 512, and the size of each feature map in the fifth convolution feature map set CN55 may be 32×32; the number of feature maps in the sixth convolution feature map set CN56 is 512, and the size of each feature map in the sixth convolution feature map set CN56 may be 32×32.
For example, the size of the feature map in the second convolution feature map set CN52 is 1/(2×2) of the size of the intermediate input image, the size of the feature map in the third convolution feature map set CN53 is 1/(4*4) of the size of the intermediate input image, the size of the feature map in the fourth convolution feature map set CN54 is 1/(8×8) of the size of the intermediate input image, and the size of the feature map in the fifth convolution feature map set CN55 is 1/(16×16) of the size of the intermediate input image.
For example, the number of feature maps in the first downsampled feature map set DP51 is 64, and the size of each feature map in the first downsampled feature map set DP51 is 256×256; the number of feature maps in the second downsampled feature map set DP52 is 128, and the size of each feature map in the second downsampled feature map set DP52 is 128 x 128; the number of feature maps in the third downsampled feature map set DP53 is 256, and the size of each feature map in the third downsampled feature map set DP53 is 64 x 64; the number of feature maps in the fourth downsampled feature map set DP54 is 512, and the size of each feature map in the fourth downsampled feature map set DP54 is 32 x 32; the number of feature maps in the fifth downsampled feature map set DP55 is 512 and the size of each feature map in the fifth downsampled feature map set DP55 is 32 x 32.
For example, the number of feature images in each of the first to fifth dimension-reduction feature image groups DR51 to DR55 is 18. The size of each feature map in the first dimension-reduction feature map set DR51 is 256×256, the size of each feature map in the second dimension-reduction feature map set DR52 is 128×128, the size of each feature map in the third dimension-reduction feature map set DR53 is 64×64, the size of each feature map in the fourth dimension-reduction feature map set DR54 is 32×32, and the size of each feature map in the fifth dimension-reduction feature map set DR55 is 32×32.
For example, the number of feature maps in each of the first to fourth fusion feature map groups FU51 to FU54 is 18. The size of each feature map in the first fused feature map set FU51 is 32 x 32; the size of each feature map in the second fused feature map set FU52 is 64 x 64; the size of each feature map in the third fused feature map set FU53 is 128 x 128; each feature map in the fourth fused feature map set FU54 has a size of 256×256.
For example, the number of feature maps in each of the first UP-sampled feature map group UP51 to the third UP-sampled feature map group UP53 is 18. The size of each feature map in the first UP-sampled feature map set UP51 is 64 x 64; each feature map in the second upsampled feature map set UP52 has a size of 128 x 128; each feature map in the third UP-sampled feature map set UP53 has a size of 256×256.
Fig. 6 is a schematic diagram of a pixel and a neighboring pixel of the pixel in a feature map according to another embodiment of the disclosure.
For example, the classifier 520 performs a classification process on the fourth fused feature map set FU54 to obtain a text classification prediction map and a connection classification prediction map. For example, the text classification prediction graph includes 2 feature graphs, and the connection classification prediction graph includes 16 feature graphs, and it should be noted that the values of the pixels in each of the text classification prediction graph and the connection classification prediction graph are equal to or greater than 0 and equal to or less than 1, and represent the text prediction probability or the connection prediction probability. The feature map in the text classification prediction map represents a probability map of whether each pixel is text or not, and the feature map in the connection classification prediction map represents a probability map of whether each pixel is connected to the adjacent pixels of the eight neighbors of the pixel.
For example, as shown in fig. 6, the pixels PX2 to PX9 are adjacent pixels of the pixel PX1. In the direction R1, the pixel PX4 and the pixel PX5 are directly adjacent to the pixel PX1, and in the direction C1, the pixel PX2 and the pixel PX3 are directly adjacent to the pixel PX1; that is, the pixels PX2 to PX5 are adjacent to the pixel PX1 and are located above, below, to the right of, and to the left of the pixel PX1, respectively. In addition, the pixels PX6 to PX9 are located in the directions of the two diagonals of the rectangular pixel PX1: the pixel PX6 is located at the upper right corner of the pixel PX1, the pixel PX7 is located at the upper left corner of the pixel PX1, the pixel PX8 is located at the lower right corner of the pixel PX1, and the pixel PX9 is located at the lower left corner of the pixel PX1.
For example, each intermediate input image passes through the text detection neural network shown in fig. 5 to obtain the classification prediction probability of text/non-text (positive/negative) for each pixel, and the connection prediction probability of whether there is a connection (link) between each pixel and the eight adjacent pixels in its neighborhood directions (i.e., the pixels PX2 to PX9 in fig. 6). By setting the type probability threshold and the classification probability threshold, a positive pixel set and a positive connection set can be obtained; then the positive pixels are connected according to the positive connections so as to group the positive pixels together, for example, a set of connected domains (Connected Components) of the positive pixels can be generated by using a union-find method, and then the connected domain set is subjected to denoising processing, that is, connected domains whose short sides are smaller than 10 pixels or whose areas are smaller than 300 pixels are removed from the connected domain set. The connected domains remaining in the connected domain set after the denoising process represent the detected text detection regions.
It should be noted that, the method for performing text detection on each intermediate input image by using the text detection neural network shown in fig. 5 is similar to the method for performing text detection on each intermediate input image by using the text detection neural network shown in fig. 3, and the above description may be referred to, and the repetition is omitted.
For example, the network depth (i.e., the number of convolutional layers) of the text detection neural network shown in fig. 5 is the same as the network depth of the text detection neural network shown in fig. 3. In the text detection neural network shown in fig. 5, the number of convolution kernels in the convolution layer in the first convolution module in the text detection neural network is 64, and the number of convolution kernels in the convolution layer of each subsequent convolution module is doubled, and in the text detection neural network shown in fig. 3, the number of convolution kernels in the convolution layer in the first convolution module in the text detection neural network is 8, and the number of convolution kernels in the convolution layer of each subsequent convolution module is doubled. Meanwhile, in the feature fusion process, features extracted by the second to fifth convolution modules are fused in the text detection neural network shown in fig. 5, and features extracted by the third to fifth convolution modules are only fused in the text detection neural network shown in fig. 3. Therefore, compared with the text detection neural network shown in fig. 5, the text detection neural network shown in fig. 3 has the characteristics of small network model, small calculation amount and the like under the condition of ensuring the detection accuracy, for example, the size of the network model is reduced by about 50 times, the calculation speed is improved by about 10 times, the calculation amount of the text detection neural network can be reduced, the calculation efficiency of the text detection neural network is accelerated, the waiting time of a user is reduced, and the use experience of the user is improved.
In addition, in the text detection neural network shown in fig. 5, connections in eight neighborhood directions of each pixel need to be acquired, whereas in the text detection neural network shown in fig. 3, connections in only four neighborhood directions of each pixel need to be acquired. Thus, relative to the text detection neural network shown in fig. 5, the post-processing portion of the PixelLink algorithm in the text detection neural network shown in fig. 3 increases in speed by a factor of about 2, while the phenomenon of text sticking in the text detection regions (multiple words in one text detection region) is also improved.
Fig. 7A is a schematic diagram of a text box group in an input image according to at least one embodiment of the present disclosure, and fig. 7B is a schematic diagram of a text box group in another input image according to at least one embodiment of the present disclosure.
For example, fig. 7A shows the connection result based on the eight neighborhood directions of the pixels, and fig. 7B shows the connection result based on the four neighborhood directions of the pixels. As can be seen from fig. 7A and 7B, in fig. 7A, "any communications yet" is divided into the same text box, and "subjects in" is also divided into the same text box, that is, a phenomenon of text sticking occurs, where one text box may include a plurality of texts; for example, the text box corresponding to "any communications yet" includes three texts, namely the text "any", the text "communications" and the text "yet". As shown in fig. 7B, the text "any", the text "communications" and the text "yet" are respectively located in three text boxes, and the text "subjects" and the text "in" are also respectively located in two text boxes, so that the division of the text boxes is more accurate. It can also be seen from fig. 7A and 7B that the text boxes in fig. 7B more accurately cover the corresponding text.
For example, at least one intermediate text box in each set of intermediate text boxes corresponds one-to-one with at least one text box in the set of text boxes. Each intermediate text box group comprises an ith intermediate text box, the text box group comprises an ith text box, the ith intermediate text box corresponds to the ith text box, and i is more than or equal to 1 and less than or equal to the number of intermediate text boxes in each intermediate text box group.
For example, step S1013 includes: for the ith text box, determining the coordinate set of the ith text box according to the coordinate sets corresponding to the ith intermediate text boxes in the intermediate text box sets, thereby determining the coordinate sets of all the text boxes in the text box sets. Thus, the resulting text box group can be more accurate. For example, the coordinate set corresponding to each ith intermediate text box may be coordinates of four vertices (for example, four vertices are an upper left corner vertex, a lower left corner vertex, an upper right corner vertex, and a lower right corner vertex of a rectangle, respectively) of the ith intermediate text box of the rectangle, and the size and the position of the ith intermediate text box may be determined based on the coordinates of the four vertices.
For example, the intermediate text box group corresponding to the first intermediate input image includes a first ith intermediate text box, the intermediate text box group corresponding to the second intermediate input image includes a second ith intermediate text box, the intermediate text box group corresponding to the third intermediate input image includes a third ith intermediate text box, the intermediate text box group corresponding to the fourth intermediate input image includes a fourth ith intermediate text box, and the intermediate text box group corresponding to the fifth intermediate input image includes a fifth ith intermediate text box, and in the examples shown in fig. 2A-2E, the first ith intermediate text box to the fifth ith intermediate text box may be text boxes corresponding to "French", i.e., the texts in the first ith intermediate text box to the fifth ith intermediate text box are all "French".
For example, the coordinate sets corresponding to the ith intermediate text box of the plurality of intermediate text box sets may be weighted and summed to determine the coordinate set of the ith text box.
For example, weights may be set for the first i-th intermediate text box to the fifth i-th intermediate text box according to the actual application; for example, in some embodiments, the weights of the first i-th intermediate text box to the fifth i-th intermediate text box are all 1. Then, the coordinate sets corresponding to the first i-th intermediate text box to the fifth i-th intermediate text box are weighted and averaged to determine the coordinate set of the i-th text box; for example, the coordinates of the upper left corner vertices of the first i-th intermediate text box to the fifth i-th intermediate text box are weighted and averaged to obtain the coordinates of the upper left corner vertex of the i-th text box; the coordinates of the lower left corner vertices of the first i-th intermediate text box to the fifth i-th intermediate text box are weighted and averaged to obtain the coordinates of the lower left corner vertex of the i-th text box; the coordinates of the upper right corner vertices of the first i-th intermediate text box to the fifth i-th intermediate text box are weighted and averaged to obtain the coordinates of the upper right corner vertex of the i-th text box; and the coordinates of the lower right corner vertices of the first i-th intermediate text box to the fifth i-th intermediate text box are weighted and averaged to obtain the coordinates of the lower right corner vertex of the i-th text box, thereby determining the coordinate set of the i-th text box.
Before the coordinate sets corresponding to the plurality of intermediate text boxes are weighted and averaged, the coordinate sets corresponding to the plurality of intermediate text boxes need to be converted according to the sizes of the plurality of intermediate input images. For example, in the example shown in fig. 2A to 2E, for the coordinate set of the first i-th intermediate text box, since the size of the first intermediate input image is 0.4 x (W x H), the coordinate set of the first i-th intermediate text box needs to be enlarged by a factor of 2.5; for the coordinate set of the second i-th intermediate text box, since the size of the second intermediate input image is 0.6 x (W x H), the coordinate set of the second i-th intermediate text box needs to be enlarged by a factor of 5/3; for the coordinate set of the third i-th intermediate text box, since the size of the third intermediate input image is 0.8 x (W x H), the coordinate set of the third i-th intermediate text box needs to be enlarged by a factor of 5/4; for the coordinate set of the fourth i-th intermediate text box, since the size of the fourth intermediate input image is (W x H), the coordinate set of the fourth i-th intermediate text box may remain unchanged; and for the coordinate set of the fifth i-th intermediate text box, since the size of the fifth intermediate input image is 1.5 x (W x H), the coordinate set of the fifth i-th intermediate text box needs to be scaled by a factor of 2/3. Then, the converted coordinate sets corresponding to the first i-th intermediate text box to the fifth i-th intermediate text box are weighted and averaged to determine the coordinate set of the i-th text box.
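As an illustration of this scale conversion followed by weighted averaging, the following Python sketch merges the five coordinate sets of the i-th intermediate text boxes into the coordinate set of the i-th text box. The array layout, the vertex order, and the uniform weights are assumptions for this example.

```python
import numpy as np

# Scale factors that map coordinates detected on each intermediate input image
# back to the coordinate system of the input image (sizes as in Figs. 2A-2E).
scale_back = [2.5, 5 / 3, 5 / 4, 1.0, 2 / 3]
weights = [1.0, 1.0, 1.0, 1.0, 1.0]   # per-scale weights; all 1 in this example

def merge_ith_boxes(ith_boxes):
    """ith_boxes: list of five 4x2 arrays, the i-th intermediate text box detected on each
    intermediate input image (assumed vertex order: upper left, lower left, upper right,
    lower right). Returns the 4x2 coordinate set of the i-th text box."""
    rescaled = [np.asarray(box, dtype=np.float64) * s for box, s in zip(ith_boxes, scale_back)]
    w = np.asarray(weights, dtype=np.float64)
    # Weighted average of each vertex over the five scales.
    return np.tensordot(w, np.stack(rescaled), axes=1) / w.sum()
```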
It should be noted that, in the embodiment of the present disclosure, the manner of determining the coordinate set of the ith text box is not limited to the method described above, and other suitable methods may be adopted to determine the coordinate set of the ith text box according to the coordinate sets corresponding to the first ith intermediate text box to the fifth ith intermediate text box, which is not particularly limited in the present disclosure.
Fig. 8A is a schematic diagram of a text box group in an input image according to another embodiment of the present disclosure, and fig. 8B is a schematic diagram of a text box group in another input image according to another embodiment of the present disclosure.
For example, as shown in fig. 1, step S102 includes: determining the position of the pen tip of the point translation pen; marking a region to be detected in the input image based on the position of the pen tip; determining at least one overlapping area between the region to be detected and the at least one text box respectively; and determining the text box corresponding to the largest overlapping area in the at least one overlapping area as the target text box.
For example, in some embodiments, at least one overlap region corresponds one-to-one with at least one text box. When a certain text box in the input image is not overlapped with the area to be detected, the overlapping area corresponding to the text box is 0. It should be noted that, in other embodiments, at least one overlapping area and at least one text box do not correspond one to one, for example, when a certain text box in the input image and the area to be detected overlap each other, the text box has a corresponding overlapping area; when a certain text box in the input image does not overlap with the region to be detected, then the text box has no overlapping region, for example, in the example shown in fig. 8A, the region to be detected (i.e., the gray rectangular box) overlaps with only three text boxes in the input image, i.e., the number of overlapping regions is 3.
For example, the user may select the target text, i.e., the text to be translated, using the point translation pen. For example, the user may use the pen tip of the point translation pen to indicate the target text. Since the relative positions of the pen tip and the camera are fixed, the position of the pen tip in the input image captured by the camera is also fixed; for example, in some embodiments, the pen tip may be located at the center of one side edge (for example, the bottom edge shown in fig. 8A) of the input image, and a region to be detected with a fixed size is set according to the size of the text in the input image, for example, the region to be detected may be the off-white rectangular box shown in fig. 8A. By calculating the overlap between each of the at least one text box in the input image and the region to be detected, at least one overlapping area can be determined; the text box corresponding to the largest overlapping area in the at least one overlapping area is taken as the target text box, and the text in the target text box is the target text selected by the user. As shown in fig. 8A and 8B, among the plurality of text boxes of the input image, the region to be detected overlaps with the text box containing the text "applied" and has a first overlapping area; the region to be detected overlaps with the text box containing the text "Inte" and has a second overlapping area; the region to be detected overlaps with the text box containing the text "real" and has a third overlapping area; and, apart from these three text boxes, the remaining text boxes in the input image do not overlap with the region to be detected. The third overlapping area is the largest among the first to third overlapping areas, that is, the overlapping area between the text box containing the text "real" and the region to be detected is the largest, so the text box containing the text "real" is the target text box, and the text "real" is the target text. Note that fig. 8B only shows the target text box.
It is noted that in some embodiments, the region to be detected may also be fixed and not necessarily variable with the size of the text in the input image. In the example shown in fig. 8A, the region to be detected is rectangular, however, the present disclosure is not limited thereto, and the region to be detected may also be a diamond, a circle, or the like in a suitable shape.
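The overlap-based selection of the target text box described above can be sketched as follows; using shapely polygons (so that rotated text boxes are also handled) and the function name are illustrative choices, not part of this disclosure:

```python
from shapely.geometry import Polygon

def select_target_text_box(text_boxes, region_to_detect):
    """Return the index and vertices of the text box whose overlap with the
    region to be detected is largest, or (None, None) if nothing overlaps.

    text_boxes:       list of (4, 2) vertex arrays, one per detected text box.
    region_to_detect: (4, 2) vertex array placed according to the fixed
                      position of the pen tip in the input image.
    """
    region = Polygon(region_to_detect)
    best_idx, best_area = None, 0.0
    for idx, box in enumerate(text_boxes):
        overlap = Polygon(box).intersection(region).area   # 0.0 when disjoint
        if overlap > best_area:
            best_idx, best_area = idx, overlap
    if best_idx is None:
        return None, None
    return best_idx, text_boxes[best_idx]
```

A text box that does not overlap the region to be detected simply contributes an overlap area of 0 and is never selected.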
For example, the at least one text box comprises N text boxes, N being a positive integer greater than 2, that is, the text box group comprises at least three text boxes. At this time, in step S103, determining the correction angle and the correction direction for the target text box according to the deflection angle and the coordinate set of the at least one text box may include: determining the average deflection angles of the N text boxes according to the N deflection angles corresponding to the N text boxes; judging whether the average deflection angle is larger than a first angle threshold or smaller than a second angle threshold; determining that the correction angle for the target text box is 0 degrees in response to the average deflection angle being greater than the first angle threshold or less than the second angle threshold; or in response to the average deflection angle being less than or equal to the first angle threshold and greater than or equal to the second angle threshold, determining N aspect ratios respectively corresponding to the N text boxes according to the N coordinate sets corresponding to the N text boxes, determining a correction direction for the target text box according to the N aspect ratios, and in response to the correction direction, determining a correction angle according to the N deflection angles.
In the embodiment of the disclosure, after the target text box is obtained, the target text box can be rotated, and then the rotated target text box is subjected to text recognition, so that the accuracy of text recognition is improved.
For example, the set of coordinates of each of the at least one text box includes coordinates of at least three vertices of each text box. For rectangular text boxes, each text box has four vertices, and the set of coordinates for each text box includes the coordinates of the three vertices or the coordinates of the four vertices for each text box.
For example, in some embodiments, the first angle threshold is 80 degrees and the second angle threshold is 10 degrees.
For example, since the text recognition algorithm itself has a certain robustness, when the average deflection angle of the N text boxes is greater than the first angle threshold or less than the second angle threshold, the target text box does not need to be rotated, and at this time, the target text box is the final target text box, and text recognition is directly performed on the final target text box (i.e., the target text box). And when the average deflection angle of the N text boxes is smaller than or equal to the first angle threshold value and larger than or equal to the second angle threshold value, the target text boxes are required to be rotated to obtain final target text boxes, and then text recognition is carried out on the final target text boxes.
Fig. 9 is a schematic diagram of a text box in a coordinate system provided in at least one embodiment of the present disclosure.
For example, as shown in fig. 9, the origin of the coordinate system may be one vertex of the input image, for example, the vertex at the upper right corner; for the input image shown in fig. 8A, the origin of the coordinate system may be the vertex of the input image near the text box containing the text "with", i.e., the vertex at the upper right corner. The two coordinate axes (X-axis and Y-axis) of the coordinate system may be parallel to two adjacent sides of the input image, respectively.
In the embodiment of the present disclosure, in the minimum bounding rectangle algorithm, as shown in fig. 9, the vertex farthest from the X-axis is taken as the first vertex T1, and the coordinates (x0, y0) of the first vertex T1 are determined; then, starting from the first vertex T1, the second vertex T2, the third vertex T3 and the fourth vertex T4 of the text box are obtained in turn in the clockwise direction, and the coordinates (x1, y1) of the second vertex T2, the coordinates (x2, y2) of the third vertex T3 and the coordinates (x3, y3) of the fourth vertex T4 are determined. The angle of the text box is the angle through which a line is rotated counterclockwise about the first vertex T1 until it reaches the side of the text box nearest to T1, i.e., the angle θ shown in fig. 9; in the present disclosure, this angle θ is taken as the deflection angle of the text box.
It should be noted that, in the embodiment of the present disclosure, the width of the text box refers to the side of the text box that is reached when rotating counterclockwise about the first vertex T1 (i.e., the side corresponding to the deflection angle θ), and the length of the text box refers to the side adjacent to that side. For example, in the example shown in fig. 9, the width of the text box is denoted as Wd and the length of the text box is denoted as Hg, so the aspect ratio of the text box is denoted as Hg/Wd. In the example shown in fig. 9, the width Wd of the text box is less than the length Hg of the text box; however, in some embodiments, the width Wd of the text box may also be greater than or equal to the length Hg of the text box.
For example, in step S103, the reference direction may be a horizontal direction, and in the example shown in fig. 9, the reference direction may be parallel to the X-axis of the coordinate system.
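Assuming the vertex ordering T1–T4 produced by the minimum bounding rectangle step above, and taking the side T1–T2 as the width Wd and the side T2–T3 as the length Hg (an assumption that depends on the bounding-rectangle routine actually used), the deflection angle and the aspect ratio Hg/Wd can be computed roughly as follows:

```python
import numpy as np

def deflection_angle_and_aspect_ratio(t1, t2, t3):
    """Deflection angle (degrees) and aspect ratio Hg/Wd of a text box whose
    vertices are ordered T1 -> T2 -> T3 clockwise, T1 being the vertex
    farthest from the X-axis, in image coordinates (x right, y down)."""
    t1, t2, t3 = (np.asarray(p, dtype=float) for p in (t1, t2, t3))
    width_vec = t2 - t1               # assumed width side Wd of the text box
    length_vec = t3 - t2              # assumed length side Hg, adjacent to Wd
    wd = float(np.linalg.norm(width_vec))
    hg = float(np.linalg.norm(length_vec))
    # Angle between the width side and the horizontal reference direction,
    # folded into [0, 90] degrees as assumed in this disclosure.
    theta = float(np.degrees(np.arctan2(abs(width_vec[1]), abs(width_vec[0]))))
    aspect_ratio = hg / wd if wd > 0 else float("inf")
    return theta, aspect_ratio
```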
For example, determining the correction direction for the target text box from the N aspect ratios includes: dividing the N text boxes into a first text box subgroup and a second text box subgroup according to the N length-width ratios; determining the number of the first text boxes and the number of the second text boxes according to the first text box subgroup and the second text box subgroup, wherein the number of the first text boxes is the number of the text boxes in the first text box subgroup, and the number of the second text boxes is the number of the text boxes in the second text box subgroup; and determining a correction direction according to the first text box number and the second text box number.
For example, the text box group is divided into a first text box subgroup and a second text box subgroup. The aspect ratio of each text box in the first text box subgroup is greater than or equal to 1, that is, the length of each text box in the first text box subgroup is greater than or equal to its width; for example, the text box shown in fig. 9 belongs to the first text box subgroup. The aspect ratio of each text box in the second text box subgroup is less than 1, that is, the length of each text box in the second text box subgroup is less than its width.
For example, determining the correction direction based on the first number of text boxes and the second number of text boxes includes: determining that the correction direction is a counterclockwise direction in response to the first number of text boxes and the second number of text boxes meeting a first condition; or in response to the first number of text boxes and the second number of text boxes meeting a second condition, determining that the correction direction is clockwise.
For example, the first condition is ra > rb + r0, and the second condition is ra + r0 < rb, where ra is the first number of text boxes, rb is the second number of text boxes, and r0 is a constant; ra + rb = N.
For example, in some embodiments, r0 is 2, but the disclosure is not limited thereto, and the value of r0 may be set according to specific needs.
For example, in response to the average deflection angle being less than or equal to the first angle threshold and greater than or equal to the second angle threshold, the text recognition method further includes: in response to the first number of text boxes and the second number of text boxes satisfying neither the first condition nor the second condition, determining that the correction angle for the target text box is 0 degrees.
In summary, when the average deflection angle of the N text boxes is less than or equal to the first angle threshold and greater than or equal to the second angle threshold, the correction direction is determined as follows: if ra > rb + r0, the correction direction is counterclockwise; if ra + r0 < rb, the correction direction is clockwise; otherwise, the correction direction is 0.
In the above determination, "the correction direction is 0" means that the correction direction is arbitrary or that no correction is necessary.
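A minimal sketch of this correction-direction decision (the string return values and the function name are illustrative) is:

```python
def correction_direction(aspect_ratios, r0=2):
    """Decide the correction direction from the aspect ratios Hg/Wd of the N
    text boxes; r0 is the constant described above (2 in some embodiments)."""
    ra = sum(1 for r in aspect_ratios if r >= 1)   # first text box subgroup
    rb = sum(1 for r in aspect_ratios if r < 1)    # second text box subgroup
    if ra > rb + r0:                               # first condition
        return "counterclockwise"
    if ra + r0 < rb:                               # second condition
        return "clockwise"
    return 0                                       # no correction needed
```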
For example, when the correction direction is not 0, i.e., the correction direction is counterclockwise or clockwise, the correction angle may be determined according to the N deflection angles. When the correction direction is 0, it indicates that the target text box does not need to be corrected.
For example, from the N deflection angles, determining the correction angle includes: in response to the correction direction (i.e., in response to the correction direction not being 0), ordering the N deflection angles in ascending order to obtain a first deflection angle to an nth deflection angle, wherein a difference between a P-th deflection angle and a p+1th deflection angle in the N deflection angles is greater than 10 degrees, and P is a positive integer and less than N; dividing N deflection angles into a first deflection angle group, a second deflection angle group and a third deflection angle group, wherein the deflection angles in the first deflection angle group are all 0 degrees, the second deflection angle group comprises a first deflection angle to a P deflection angle, and the third deflection angle group comprises a P+1th deflection angle to an N deflection angle; determining a first angle number, a second angle number and a third angle number according to the first deflection angle group, the second deflection angle group and the third deflection angle group, wherein the first angle number is the number of deflection angles in the first deflection angle group, the second angle number is the number of deflection angles in the second deflection angle group, and the third angle number is the number of deflection angles in the third deflection angle group; and determining a correction angle according to the first angle number, the second angle number and the third angle number.
For example, determining the correction angle based on the first number of angles, the second number of angles, and the third number of angles includes: determining that the correction angle is 0 degrees in response to the first angle degree satisfying a third condition; or in response to the first angle number not meeting the third condition and the second angle number and the third angle number meeting the fourth condition, determining the correction angle as the first angle value; or in response to the first angle number not meeting the third condition and the second angle number and the third angle number meeting the fifth condition, determining the correction angle as a second angle value; or in response to the first angle number not satisfying the third condition and the second angle number and the third angle number not satisfying the fourth condition and the fifth condition, determining that the correction angle is 0 degrees.
For example, the third condition is s0 > ss1, the fourth condition is s1 > s2 + ss2, and the fifth condition is s1 + ss2 < s2, where s0 is the first angle number, s1 is the second angle number, s2 is the third angle number, and ss1 and ss2 are constants.
For example, in some embodiments, ss1 is 5 and ss2 is 2. The present disclosure is not limited thereto and the values of ss1 and ss2 may be set according to specific requirements.
For example, the first angle value may be expressed as the average of the deflection angles in the second deflection angle group, i.e.:
first angle value = (A1 + A2 + … + AP)/P,
where 1 ≤ i ≤ P, and Ai represents the i-th deflection angle among the first to P-th deflection angles in the second deflection angle group.
For example, the second angle value may be expressed as the average of the deflection angles in the third deflection angle group, i.e.:
second angle value = (A(P+1) + A(P+2) + … + AN)/(N − P),
where P + 1 ≤ j ≤ N, and Aj represents the j-th deflection angle among the (P+1)-th to N-th deflection angles in the third deflection angle group.
For example, when s0 > ss1, i.e., the number of text boxes with a deflection angle of 0 degrees is greater than ss1 (e.g., 5), the correction angle is determined to be 0 degrees, i.e., the intermediate target image does not need to be rotated. When s0 ≤ ss1 and s1 > s2 + ss2, it is determined that the intermediate target image needs to be rotated, and the correction angle is the first angle value. When s0 ≤ ss1 and s1 + ss2 < s2, it is determined that the intermediate target image needs to be rotated, and the correction angle is the second angle value. When none of the third condition, the fourth condition, and the fifth condition is satisfied, the correction angle is determined to be 0 degrees, that is, the intermediate target image does not need to be rotated.
In summary, when the correction direction is counterclockwise or clockwise, the correction angle is determined as follows: if s0 > ss1, the correction angle is 0 degrees; if s0 ≤ ss1 and s1 > s2 + ss2, the correction angle is the first angle value; if s0 ≤ ss1 and s1 + ss2 < s2, the correction angle is the second angle value; otherwise, the correction angle is 0 degrees.
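A sketch of this correction-angle decision is given below; note that treating the first and second angle values as the averages of the second and third deflection angle groups follows the reconstruction above, and the handling of 0-degree angles within the sorted list is one possible reading of the grouping:

```python
def correction_angle(deflection_angles, ss1=5, ss2=2, gap=10.0):
    """Determine the correction angle once a correction direction is fixed.

    deflection_angles: the N deflection angles (degrees) of the text boxes.
    ss1, ss2, gap:     constants described above (5, 2 and 10 degrees in
                       some embodiments).
    """
    angles = sorted(deflection_angles)             # first .. N-th deflection angle
    s0 = sum(1 for a in angles if a == 0)          # first deflection angle group
    if s0 > ss1:                                   # third condition: no rotation
        return 0.0

    n = len(angles)
    # P: position where consecutive sorted angles differ by more than `gap`.
    p = next((k + 1 for k in range(n - 1) if angles[k + 1] - angles[k] > gap), n)
    second_group, third_group = angles[:p], angles[p:]
    s1, s2 = len(second_group), len(third_group)

    if s1 > s2 + ss2:                              # fourth condition
        return sum(second_group) / s1              # first angle value (average)
    if s1 + ss2 < s2:                              # fifth condition
        return sum(third_group) / s2               # second angle value (average)
    return 0.0                                     # neither condition: no rotation
```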
for example, the at least one text box includes N text boxes, N being 1 or 2, that is, the text box group includes one or two text boxes, at which time the correction direction and the correction angle may be determined directly according to the deflection angle and the aspect ratio of the target text box. In step S103, determining a correction angle and a correction direction for the target text box based on the deflection angle and the coordinate set of the at least one text box includes: determining a correction angle for the target text box according to the deflection angle of the target text box; determining an aspect ratio of the target text box according to the coordinate set of the target text box in response to the correction angle; the correction direction for the target text box is determined based on the aspect ratio of the target text box.
For example, the correction angle for the target text box is the deflection angle of the target text box. It should be noted that, in some embodiments, when the deflection angle of the target text box is greater than the first angle threshold or less than the second angle threshold, the correction angle may be determined to be 0 degrees.
For example, in response to the correction angle, determining the correction direction for the target text box according to the aspect ratio of the target text box includes: determining that the correction direction is the counterclockwise direction in response to the aspect ratio of the target text box being greater than or equal to 1; or, in response to the aspect ratio of the target text box being less than 1, determining that the correction direction is the clockwise direction. Note that "in response to the correction angle" means that the correction angle is not 0 degrees.
For example, when the text box group includes two text boxes, the correction direction for the target text box may also be determined according to the aspect ratio of the two text boxes. For example, if the aspect ratio of the two text boxes is greater than or equal to 1, determining that the correction direction is a counterclockwise direction; or if the aspect ratio of the two text boxes is smaller than 1, determining that the correction direction is clockwise; or if the aspect ratio of one of the two text boxes is smaller than 1 and the aspect ratio of the other of the two text boxes is larger than or equal to 1, determining the correction direction according to the aspect ratio of the target text box, namely if the aspect ratio of the target text box is larger than or equal to 1, determining the correction direction to be a counterclockwise direction; if the aspect ratio of the target text box is smaller than 1, the correction direction is determined to be clockwise.
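For the case of one or two text boxes, the decision reduces to the target text box's own deflection angle and aspect ratio; a minimal sketch (the thresholds of 80 and 10 degrees follow the example values given earlier, and the return convention is illustrative) is:

```python
def correct_small_group(target_deflection_angle, target_aspect_ratio,
                        first_angle_threshold=80.0, second_angle_threshold=10.0):
    """Correction angle and direction when the text box group contains only
    one or two text boxes: the target text box's own deflection angle and
    aspect ratio drive the decision, following the description above."""
    if (target_deflection_angle > first_angle_threshold
            or target_deflection_angle < second_angle_threshold):
        return 0.0, 0                  # recognition is robust enough: no rotation
    direction = "counterclockwise" if target_aspect_ratio >= 1 else "clockwise"
    return target_deflection_angle, direction
```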
For example, the deflection angle of the final target text box relative to the reference direction is greater than the first angle threshold or less than the second angle threshold.
For example, the deflection angle of each of the at least one text box is 0 degrees or more and 90 degrees or less.
For example, in step S103, rotating the target text box by the correction angle to obtain the final target text box includes: rotating the input image according to the correction angle and the correction direction so that the target text box rotates to obtain a final target text box; or cutting the target text box to obtain a cut target text box, and rotating the cut target text box according to the correction angle and the correction direction to obtain a final target text box.
In this disclosure, in some embodiments, each text box may be marked in the input image in the form of a marking box, so that in a subsequent operation, the input image marked with the text box may be directly processed, that is, in this disclosure, the input image may not be subjected to a cutting operation, at which time the input image may be directly rotated according to the correction angle and the correction direction, so that the target text box is rotated to obtain a final target text box. In other embodiments, after determining the target text box, the target text box may be subjected to a cutting process to obtain a cut target text box, so that in a subsequent operation, the cut target text box may be directly processed, and at this time, the cut target text box may be rotated according to the correction angle and the correction direction, so as to obtain a final target text box.
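Either variant amounts to a single image rotation; the sketch below uses OpenCV (an illustrative choice) and applies the correction angle with the convention that positive angles rotate counterclockwise:

```python
import cv2

def rotate_to_final_target_box(image, correction_angle, correction_direction):
    """Rotate the input image (or the cut-out target text box, as a NumPy array
    such as the one returned by cv2.imread) by the correction angle in the
    correction direction; the white border fill is an illustrative choice."""
    if correction_direction == 0 or correction_angle == 0:
        return image                        # the target text box is already final
    # OpenCV treats a positive angle as a counterclockwise rotation.
    signed_angle = (correction_angle if correction_direction == "counterclockwise"
                    else -correction_angle)
    h, w = image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), signed_angle, 1.0)
    return cv2.warpAffine(image, matrix, (w, h), borderValue=(255, 255, 255))
```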
It should be noted that, in the embodiment of the present disclosure, the target text box and the final target text box do not differ in size, contained text, or the like; the only difference is that, if the target text box is rotated to obtain the final target text box, the deflection angle of the target text box relative to the reference direction is different from the deflection angle of the final target text box relative to the reference direction, whereas if the target text box does not need to be rotated, the final target text box is the target text box itself.
For example, before the input image is acquired, the text recognition method further includes: training the text detection neural network to be trained to obtain the text detection neural network.
For example, training the text detection neural network to be trained to obtain the text detection neural network includes: acquiring a training input image and a target text detection region group; processing the training input image by utilizing the text detection neural network to be trained to obtain a training text detection region group; calculating a loss value of the text detection neural network to be trained according to the target text detection area group and the training text detection area group through a loss function; correcting parameters of the text detection neural network to be trained according to the loss value, obtaining the trained text detection neural network when the loss function meets the preset condition, and continuously inputting the training input image and the target text detection region group to repeatedly execute the training process when the loss function does not meet the preset condition.
For example, in one example, the predetermined condition described above corresponds to the loss convergence of the loss function (i.e., the loss value is no longer significantly reduced) with the input of a certain number of training input images and target text detection region groups. For example, in another example, the predetermined condition is that the number of training times or training period reaches a predetermined number (for example, the predetermined number may be millions).
For example, the loss function includes a focal loss function. In the training stage of the neural network, to address the imbalance between positive and negative samples that easily occurs in the training data, the cross entropy loss function (Cross Entropy Loss) adopted by the PixelLink algorithm is replaced with a focal loss function (Focal Loss). The focal loss function can accelerate the convergence of the neural network model, mitigate the influence of the positive/negative sample imbalance in the image on the algorithm, and make the predicted text detection area more accurate.
For example, the focus loss function may be expressed as:
FL(p_t) = -α_t · (1 - p_t)^γ · log(p_t)
where p_t represents the classification probability for the different categories (e.g., the text prediction probability or the connection prediction probability), (1 - p_t)^γ represents the modulating (adjustment) factor, γ represents the focusing parameter and is a value greater than 0, and α_t represents a weighting factor with a value in [0, 1]; γ and α_t are both fixed values, e.g., γ = 2 and α_t = 1 in some embodiments.
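A PyTorch sketch of this focal loss for the per-pixel two-class predictions is shown below; the tensor shapes and the use of log-softmax are illustrative assumptions rather than details fixed by this disclosure:

```python
import torch.nn.functional as F

def focal_loss(logits, targets, alpha_t=1.0, gamma=2.0):
    """Per-pixel focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    logits:  raw two-class scores predicted by the network, shape (N, 2, H, W)
             (e.g. text/non-text scores, or the scores of one link direction).
    targets: ground-truth labels of shape (N, H, W), dtype torch.long, values 0/1.
    """
    log_probs = F.log_softmax(logits, dim=1)                       # log-probabilities
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t per pixel
    pt = log_pt.exp()                                              # p_t per pixel
    loss = -alpha_t * (1.0 - pt) ** gamma * log_pt                 # focal modulation
    return loss.mean()
```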
FIG. 10 is a graph of loss dip for a cross entropy loss function and a focus loss function provided by at least one embodiment of the present disclosure; fig. 11A is a schematic diagram of a model result of a text detection neural network based on a cross entropy loss function according to at least one embodiment of the present disclosure, and fig. 11B is a schematic diagram of a model result of a text detection neural network based on a focus loss function according to at least one embodiment of the present disclosure.
For example, as shown in fig. 10, the ordinate represents the loss (pixel_link_loss), and the abscissa represents the number of training iterations. The upper curve in fig. 10 is the loss descent curve of the cross entropy loss function, and the lower curve (i.e., the one closer to the abscissa) is the loss descent curve of the focus loss function. As can be seen from fig. 10, the model trained based on the focus loss function converges faster than the model trained based on the cross entropy loss function. For example, at a training iteration number of 120K (120000), the loss value based on the focus loss function is about 0.2, while the loss value based on the cross entropy loss function is about 0.73; that is, for the same number of training iterations, the loss value based on the focus loss function is smaller than that based on the cross entropy loss function, which means the model trained based on the focus loss function fits better. In addition, in fig. 11A, the words "multiple" and "essential" are divided into the same text detection area, i.e., text sticking (multiple words in one text detection area) occurs; in fig. 11B, the word "multiple" and the word "essential" are located in two separate text detection areas. That is, compared with the text detection neural network trained based on the cross entropy loss function, the text detection neural network trained based on the focus loss function produces more accurate text detection areas without text sticking when processing the intermediate input image.
For example, as shown in fig. 1, step S104 may include: identifying the final target text box by using a text identification neural network to obtain an intermediate text; and checking the intermediate text to obtain the target text.
For example, the text recognition neural network is a multi-object rectified attention network (MORAN), which may include a rectification sub-network (MORN) and a recognition sub-network (ASRN). First, the rectification sub-network decomposes the final target text box into a plurality of small images, regresses an offset for each small image, performs a smoothing operation on the offsets, and then performs a sampling operation on the final target text box to obtain a new, more regular horizontal text box, i.e., the rectified final target text box. The recognition sub-network inputs the rectified final target text box into an attention-based convolutional recurrent neural network for text recognition, thereby obtaining the recognized intermediate text.
Before the input image is acquired, the text recognition method further includes: training the multi-object rectified attention network to be trained to obtain the multi-object rectified attention network.
For example, the intermediate text obtained through the text recognition neural network may contain wrong characters, missing characters, extra characters, and the like. To improve accuracy, post-processing correction needs to be performed on the recognized intermediate text so as to correct semantic errors, logic errors, and the like, and obtain an accurate target text. For example, if the intermediate text is a word, a word database and a word segmentation database are first constructed respectively, character errors in the intermediate text are corrected through a matching algorithm, and the recognized characters are segmented in units of words, so that the target text is finally obtained and the accuracy of the overall algorithm is improved. For example, the word database and the word segmentation database may be the same database.
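As one possible form of such post-processing, the sketch below matches the recognized word against a word database using Python's difflib; the similarity cutoff and the function name are illustrative choices, not part of this disclosure:

```python
import difflib

def correct_intermediate_text(intermediate_text, word_database):
    """Post-process the recognized intermediate text: if it is not a known
    word, replace it with the closest word in the word database (the 0.8
    similarity cutoff is an illustrative choice)."""
    candidate = intermediate_text.strip().lower()
    if candidate in word_database:
        return candidate
    matches = difflib.get_close_matches(candidate, word_database, n=1, cutoff=0.8)
    return matches[0] if matches else candidate
```

With this cutoff, a word of roughly five or more letters with a single wrong character will generally still match its intended entry in the word database.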
For example, in some embodiments, the text recognition method further comprises: and translating the target text to obtain and output a translation result of the target text.
For example, a dictionary database is used to index the finally recognized target text so as to retrieve the translation result. For example, the translation result of the target text may be displayed on a display, or may be output as voice through a speaker or the like.
At least one embodiment of the present disclosure further provides a text recognition method. The text recognition method may be applied to point translation techniques, for example, to point translation strokes.
In some embodiments, the text recognition method includes: acquiring an input image; performing text detection on the input image to determine a text box group, wherein the text box group comprises at least one text box; determining a target text box from the at least one text box, wherein the target text box comprises target text; and identifying the target text box to obtain target text.
The text detection of the input image can be implemented by the following scheme: i.e. text detection of the input image using the text detection neural network shown in fig. 3.
For example, the text recognition method may include: acquiring an input image; performing text detection on the input image by using a text detection neural network to determine a text box group, wherein the text box group comprises at least one text box; determining a target text box from the at least one text box, wherein the target text box comprises target text; rotating the target text box to obtain a final target text box; and identifying the final target text box to obtain the target text.
For example, the text detection neural network is the text detection neural network shown in fig. 3. The text detection neural network comprises a first convolution module to a fifth convolution module, a first downsampling module to a fifth downsampling module, a full-connection module, a first upsampling module to a third upsampling module, a first dimension reduction module to a fourth dimension reduction module, and a classifier.
For example, the number of convolution kernels in each of the first convolution layers is 8, the number of convolution kernels in each of the second convolution layers is 16, the number of convolution kernels in each of the third convolution layers is 32, the number of convolution kernels in each of the fourth convolution layers is 64, and the number of convolution kernels in each of the fifth convolution layers is 128.
In the embodiment, under the condition of ensuring the detection accuracy, the text detection neural network has the characteristics of small network model, small calculated amount and the like, for example, compared with the traditional neural network based on PixelLink algorithm, the size of the network model is reduced by about 50 times, and the calculation speed is improved by about 10 times, so that the calculated amount of the text detection neural network can be reduced, the calculation efficiency of the text detection neural network is accelerated, the waiting time of a user is reduced, and the use experience of the user is improved.
Further, the number of convolution kernels in each convolution layer of the first dimension reduction module is 10, the number of convolution kernels in each convolution layer of the second dimension reduction module is 10, the number of convolution kernels in each convolution layer of the third dimension reduction module is 10, and the number of convolution kernels in each convolution layer of the fourth dimension reduction module is 10. That is, in the present embodiment, the text detection neural network only needs to obtain the connections of a pixel in four neighborhood directions. Thus, the speed of the post-processing portion of the PixelLink algorithm increases by a factor of about 2, while the text sticking problem of the text detection areas (multiple words in one text detection area) is alleviated.
Note that, for a specific description of the text detection neural network, reference is made to the above detailed description of the text detection neural network shown in fig. 3.
For example, text detection of an input image using a text detection neural network to determine a set of text boxes includes: performing scale transformation processing on the input images to obtain a plurality of intermediate input images; for each intermediate input image of the plurality of intermediate input images, performing text detection on each intermediate input image by using a text detection neural network to obtain an intermediate text box group corresponding to each intermediate input image, thereby obtaining a plurality of intermediate text box groups corresponding to the plurality of intermediate input images, wherein each intermediate text box group comprises at least one intermediate text box; a text box group is determined based on the plurality of intermediate text box groups.
For example, the plurality of intermediate input images include input images, and the plurality of intermediate input images are different in size from each other. It should be noted that, the description of the related intermediate input image may refer to the description in the embodiment of the above text recognition method, which is not repeated herein.
For example, at least one intermediate text box in each set of intermediate text boxes corresponds one-to-one with at least one text box in the set of text boxes. Each intermediate text box group comprises an ith intermediate text box, the text box group comprises an ith text box, the ith intermediate text box corresponds to the ith text box, and i is more than or equal to 1 and less than or equal to the number of intermediate text boxes in each intermediate text box group.
For example, from a plurality of intermediate text box groups, determining the text box group includes: for the ith text box, determining the coordinate set of the ith text box according to the coordinate sets corresponding to the ith intermediate text boxes in the intermediate text box sets, thereby determining the coordinate sets of all the text boxes in the text box sets. Thus, the resulting text box group can be more accurate.
For example, performing text detection on each intermediate input image by using a text detection neural network to obtain an intermediate text box group corresponding to each intermediate input image, including: performing text detection on each intermediate input image by using a text detection neural network to determine a text detection region group corresponding to each intermediate input image; and processing the text detection region group corresponding to each intermediate input image by using a minimum circumscribed rectangle algorithm to determine an intermediate text box group corresponding to each intermediate input image.
For example, the text detection area group corresponding to each intermediate input image includes at least one text detection area, the at least one text detection area corresponds to at least one intermediate text box one by one, and each intermediate text box covers the corresponding text detection area.
For example, performing text detection on each intermediate input image using the text detection neural network to determine the text detection region group corresponding to each intermediate input image includes: performing convolution processing on each intermediate input image by using the first convolution module to obtain a first convolution feature map group; performing downsampling processing on the first convolution feature map group by using the first downsampling module to obtain a first downsampled feature map group; performing convolution processing on the first downsampled feature map group by using the second convolution module to obtain a second convolution feature map group; performing downsampling processing on the second convolution feature map group by using the second downsampling module to obtain a second downsampled feature map group; performing convolution processing on the second downsampled feature map group by using the third convolution module to obtain a third convolution feature map group; performing downsampling processing on the third convolution feature map group by using the third downsampling module to obtain a third downsampled feature map group, and performing dimension reduction processing on the third convolution feature map group by using the first dimension reduction module to obtain a first dimension-reduction feature map group; performing convolution processing on the third downsampled feature map group by using the fourth convolution module to obtain a fourth convolution feature map group; performing downsampling processing on the fourth convolution feature map group by using the fourth downsampling module to obtain a fourth downsampled feature map group, and performing dimension reduction processing on the fourth convolution feature map group by using the second dimension reduction module to obtain a second dimension-reduction feature map group; performing convolution processing on the fourth downsampled feature map group by using the fifth convolution module to obtain a fifth convolution feature map group; performing downsampling processing on the fifth convolution feature map group by using the fifth downsampling module to obtain a fifth downsampled feature map group, and performing dimension reduction processing on the fifth convolution feature map group by using the third dimension reduction module to obtain a third dimension-reduction feature map group; performing convolution processing on the fifth downsampled feature map group by using the full connection module to obtain a sixth convolution feature map group; performing dimension reduction processing on the sixth convolution feature map group by using the fourth dimension reduction module to obtain a fourth dimension-reduction feature map group; performing upsampling processing on the fourth dimension-reduction feature map group by using the first upsampling module to obtain a first upsampled feature map group; performing fusion processing on the first upsampled feature map group and the third dimension-reduction feature map group to obtain a first fusion feature map group; performing upsampling processing on the first fusion feature map group by using the second upsampling module to obtain a second upsampled feature map group; performing fusion processing on the second upsampled feature map group and the second dimension-reduction feature map group to obtain a second fusion feature map group; performing upsampling processing on the second fusion feature map group by using the third upsampling module to obtain a third upsampled feature map group; performing fusion processing on the third upsampled feature map group and the first dimension-reduction feature map group to obtain a third fusion feature map group; classifying the third fusion feature map group by using the classifier to obtain a text classification prediction map and a connection classification prediction map; and determining the text detection region group according to the connection classification prediction map and the text classification prediction map.
For example, the number of feature maps in the first convolution feature map group is 8, the number of feature maps in the second convolution feature map group is 16, the number of feature maps in the third convolution feature map group is 32, the number of feature maps in the fourth convolution feature map group is 64, the number of feature maps in the fifth convolution feature map group is 128, the number of feature maps in the sixth convolution feature map group is 256, and the number of feature maps in each of the first to fourth dimension-reduction feature map groups is 10.
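A compact PyTorch sketch of this pipeline is given below. The channel counts (8/16/32/64/128, 256 for the full connection module, 10 for each dimension-reduction output) follow the description above; the number of convolution layers per module, the 3×3 kernels, the ReLU activations, bilinear upsampling, and treating "fusion" as element-wise addition are illustrative assumptions, as is splitting the final 10 channels into 2 text scores and 8 connection scores (4 neighborhood directions × 2):

```python
import torch.nn as nn
import torch.nn.functional as F

class TextDetectionNetSketch(nn.Module):
    """Illustrative sketch of the described text detection network."""

    def __init__(self):
        super().__init__()

        def conv_module(c_in, c_out):
            # One "convolution module": two 3x3 convolutions (layer count assumed).
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

        self.conv1 = conv_module(3, 8)        # first to fifth convolution modules
        self.conv2 = conv_module(8, 16)
        self.conv3 = conv_module(16, 32)
        self.conv4 = conv_module(32, 64)
        self.conv5 = conv_module(64, 128)
        self.fc = conv_module(128, 256)       # "full connection" module (convolutional)
        self.pool = nn.MaxPool2d(2, 2)        # stands in for downsampling modules 1-5
        self.reduce1 = nn.Conv2d(32, 10, 1)   # dimension reduction modules 1-4
        self.reduce2 = nn.Conv2d(64, 10, 1)
        self.reduce3 = nn.Conv2d(128, 10, 1)
        self.reduce4 = nn.Conv2d(256, 10, 1)

    def forward(self, x):
        def up_to(t, ref):
            # Upsampling module: resize t to the spatial size of ref.
            return F.interpolate(t, size=ref.shape[2:], mode="bilinear",
                                 align_corners=False)

        c1 = self.conv1(x)                    # first convolution feature map group
        c2 = self.conv2(self.pool(c1))        # second
        c3 = self.conv3(self.pool(c2))        # third
        c4 = self.conv4(self.pool(c3))        # fourth
        c5 = self.conv5(self.pool(c4))        # fifth
        c6 = self.fc(self.pool(c5))           # sixth convolution feature map group

        d1, d2 = self.reduce1(c3), self.reduce2(c4)   # dimension-reduction groups
        d3, d4 = self.reduce3(c5), self.reduce4(c6)

        f1 = up_to(d4, d3) + d3               # first fusion feature map group
        f2 = up_to(f1, d2) + d2               # second fusion feature map group
        f3 = up_to(f2, d1) + d1               # third fusion feature map group

        # Classifier: 10 channels = 2 text/non-text scores
        # + 4 neighborhood connections x 2 scores each.
        text_map, link_map = f3[:, :2], f3[:, 2:]
        return text_map, link_map
```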
For example, before the input image is acquired, the text recognition method further includes: training the text detection neural network to be trained to obtain the text detection neural network. When the text detection neural network to be trained is trained, the loss function can be a focus loss function, the focus loss function can accelerate the convergence rate of a neural network model, the influence of uneven positive and negative samples in an image on the algorithm effect is improved, and the predicted text detection area is more accurate.
For example, in some embodiments, rotating the target text box to obtain the final target text box includes: determining the correction angle and the correction direction of the target text box relative to the reference direction, and rotating the target text box according to the correction angle and the correction direction to obtain the final target text box. For example, the correction angle and the correction direction may be determined by any existing method, or by the method described in the above embodiments of the text recognition method of the present disclosure; in the latter case, rotating the target text box to obtain the final target text box includes: acquiring the coordinate set of the at least one text box and its deflection angle relative to the reference direction, determining the correction angle and the correction direction for the target text box according to the deflection angle and the coordinate set of the at least one text box, and rotating the target text box according to the correction angle and the correction direction to obtain the final target text box.
It should be noted that, the steps of "obtaining the input image", "determining the target text box from at least one text box", "identifying the final target text box to obtain the target text" may refer to the related description in the above embodiments of the text recognition method, and the repetition is not repeated here.
At least one embodiment of the present disclosure further provides a text recognition device, and fig. 12 is a schematic block diagram of a text recognition device provided by at least one embodiment of the present disclosure.
For example, as shown in fig. 12, the word recognition device 1200 includes an image capture device 1210, a memory 1220, and a processor 1230. It should be noted that the components of the word recognition device 1200 shown in fig. 12 are exemplary only, and not limiting, and that the word recognition device 1200 may have other components as desired for practical applications.
For example, the image capturing device 1210 is configured to capture an input image; memory 1220 is used to non-transitory store input images and computer readable instructions; processor 1230 is configured to read the input image and execute computer readable instructions that when executed by processor 1230 perform one or more steps of the word recognition method according to any of the embodiments described above.
For example, the image capturing device 1210 is an image capturing device described in the above embodiment of the text recognition method, and for example, the image capturing device 1210 may be various types of cameras.
For example, the word recognition device 1200 also includes a point translator 1250, the point translator 1250 being used to select target text. The image pickup device 1210 is provided on the point translation pen 1250, and for example, the image pickup device 1210 may be a camera provided on the point translation pen 1250.
It should be noted that the memory 1220 and the processor 1230 may be integrated in the translation pen 1250, that is, the image capturing device 1210, the memory 1220 and the processor 1230 are integrated in the translation pen 1250. However, the present disclosure is not limited thereto, and the point translation pen 1250 may be configured separately from the memory 1220 and the processor 1230 in physical locations, for example, the memory 1220 and the processor 1230 may be integrated in an electronic device (e.g., a computer, a mobile phone, etc.), the image pickup device 1210 is integrated in the point translation pen 1250, the point translation pen 1250 and the electronic device may be configured separately in physical locations, and communication between the point translation pen 1250 and the electronic device may be performed through a wired or wireless manner. That is, after the input image is acquired by the image acquisition device 1210 on the point translator 1250, the electronic device may receive the input image transmitted from the point translator 1250 via a wired or wireless manner and perform text recognition processing on the input image. For another example, the memory 1220 and the processor 1230 may be integrated in a cloud server, where the translation pen 1250 communicates with the cloud server in a wired or wireless manner, and the cloud server receives the input image and performs text recognition processing on the input image.
For example, the word recognition device 1200 may further include an output device for outputting the translation result of the target text. For example, the output device may include a display, a speaker, a projector, etc., the display may be used to display the translation result of the target text, and the speaker may be used to output the translation result of the target text in the form of voice. For example, the point translator 1250 may also include a communication module to enable communication between the point translator 1250 and an output device, such as to transmit translation results to the output device.
For example, processor 1230 may control other components in the word recognition device 1200 to perform desired functions. Processor 1230 may be a Central Processing Unit (CPU), a Tensor Processing Unit (TPU), a Graphics Processing Unit (GPU), or another device having data processing and/or program execution capabilities. The Central Processing Unit (CPU) may have an X86 or ARM architecture, etc. The GPU may be integrated directly on the motherboard alone or built into the north bridge chip of the motherboard, and may also be built into the Central Processing Unit (CPU).
For example, memory 1220 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, random Access Memory (RAM) and/or cache memory (cache) and the like. The non-volatile memory may include, for example, read-only memory (ROM), hard disk, erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, flash memory, and the like. One or more computer-readable instructions may be stored on the computer-readable storage medium and may be executed by processor 1230 to perform the various functions of word recognition device 1200.
For example, the image acquisition device 1210, the memory 1220, the processor 1230, and the output device may communicate with each other via a network connection. The network may include a wireless network, a wired network, and/or any combination of wireless and wired networks. The network may include a local area network, the Internet, a telecommunications network, an Internet of Things based on the Internet and/or a telecommunications network, any combination of the above, and/or the like. The wired network may use, for example, twisted pair, coaxial cable, or optical fiber transmission, and the wireless network may use, for example, a 3G/4G/5G mobile communication network, Bluetooth, Zigbee, or WiFi. The present disclosure does not limit the type and functionality of the network.
For example, a detailed description of the text recognition process performed by the text recognition device 1200 may refer to a related description in the embodiment of the text recognition method, and the repetition is omitted.
At least one embodiment of the present disclosure also provides a storage medium. For example, the storage medium may be a non-transitory storage medium. Fig. 13 is a schematic diagram of a storage medium provided in at least one embodiment of the present disclosure. For example, as shown in fig. 13, one or more computer-readable instructions 1301 may be non-transitory stored on the storage medium 1300. For example, the computer readable instructions 1301, when executed by a computer, may perform one or more steps in accordance with the text recognition method described above.
For example, the storage medium 1300 may be applied to the word recognition device 1200 described above, and may be, for example, the memory 1220 in the word recognition device 1200. The description of the storage medium 1300 may refer to the description of the memory in the embodiment of the word recognition device 1200, and the repetition is omitted.
For the purposes of this disclosure, the following points are also noted:
(1) The drawings of the embodiments of the present disclosure relate only to the structures related to the embodiments of the present disclosure, and other structures may refer to the general design.
(2) In the drawings for describing embodiments of the present invention, thicknesses and dimensions of layers or structures are exaggerated for clarity. It will be understood that when an element such as a layer, film, region or substrate is referred to as being "on" or "under" another element, it can be "directly on" or "under" the other element or intervening elements may be present.
(3) The embodiments of the present disclosure and features in the embodiments may be combined with each other to arrive at a new embodiment without conflict.
The foregoing is merely a specific embodiment of the disclosure, but the scope of the disclosure is not limited thereto and should be determined by the scope of the claims.

Claims (38)

1. A text recognition method, comprising:
Acquiring an input image;
Performing text detection on the input image to determine a text box group, wherein the text box group comprises at least one text box;
determining a target text box from the at least one text box, wherein the target text box comprises target text;
Acquiring a coordinate set and a deflection angle relative to a reference direction of the at least one text box, determining a correction angle and a correction direction for the target text box according to the deflection angle and the coordinate set of the at least one text box, and rotating the target text box according to the correction angle and the correction direction to obtain a final target text box;
identifying the final target text box to obtain the target text;
wherein text detecting the input image to determine the text box group includes:
Performing scale transformation processing on the input image to obtain a plurality of intermediate input images, wherein the plurality of intermediate input images comprise the input image, and the sizes of the plurality of intermediate input images are different from each other;
For each intermediate input image in the plurality of intermediate input images, performing text detection on each intermediate input image to obtain an intermediate text box group corresponding to each intermediate input image, thereby obtaining a plurality of intermediate text box groups corresponding to the plurality of intermediate input images, wherein the text detection is realized based on a pixel connection algorithm, and each intermediate text box group comprises at least one intermediate text box;
And determining the text box group according to the plurality of intermediate text box groups.
2. The text recognition method of claim 1, wherein the at least one text box comprises N text boxes, N being a positive integer greater than 2,
Determining the correction angle and the correction direction for the target text box from the deflection angle and the coordinate set of the at least one text box comprises:
determining the average deflection angles of the N text boxes according to the N deflection angles corresponding to the N text boxes;
judging whether the average deflection angle is larger than a first angle threshold or smaller than a second angle threshold;
Determining that a correction angle for the target text box is 0 degrees in response to the average deflection angle being greater than the first angle threshold or less than the second angle threshold; or alternatively
And determining N length-width ratios respectively corresponding to the N text boxes according to N coordinate sets corresponding to the N text boxes, determining the correction direction for the target text box according to the N length-width ratios, and determining the correction angle according to the N deflection angles in response to the correction direction.
3. The text recognition method of claim 2, wherein determining the correction direction for the target text box from the N aspect ratios comprises:
Dividing the N text boxes into a first text box subgroup and a second text box subgroup according to the N length-width ratios, wherein the length-width ratio of each text box in the first text box subgroup is greater than or equal to 1, and the length-width ratio of each text box in the second text box subgroup is less than 1;
Determining a first text box number and a second text box number according to the first text box subgroup and the second text box subgroup, wherein the first text box number is the number of text boxes in the first text box subgroup, and the second text box number is the number of text boxes in the second text box subgroup;
and determining the correction direction according to the first text box number and the second text box number.
4. The text recognition method of claim 3, wherein determining the correction direction based on the first number of text boxes and the second number of text boxes comprises:
Determining that the correction direction is a counterclockwise direction in response to the first number of text boxes and the second number of text boxes meeting a first condition; or alternatively
In response to the first number of text boxes and the second number of text boxes meeting a second condition, determining that the correction direction is clockwise,
The first condition is ra > rb+r0, the second condition is ra+r0 < rb, ra is the number of the first text boxes, rb is the number of the second text boxes, and r0 is a constant.
5. The text recognition method of claim 4, wherein, in response to the average deflection angle being less than or equal to the first angle threshold and greater than or equal to the second angle threshold, the text recognition method further comprises:
In response to the first number of text boxes and the second number of text boxes not meeting the first condition and the second condition, determining that a correction angle for the target text box is 0 degrees.
6. The text recognition method of claim 4, wherein r0 is 2.
7. The text recognition method of any of claims 2-6, wherein determining the correction angle from the N deflection angles in response to the correction direction comprises:
The N deflection angles are sequenced according to ascending order to obtain a first deflection angle to an N deflection angle in response to the correction direction, wherein the difference between the P deflection angle and the P+1 deflection angle in the N deflection angles is larger than 10 degrees, and P is a positive integer and smaller than N;
Dividing the N deflection angles into a first deflection angle group, a second deflection angle group and a third deflection angle group, wherein the deflection angles in the first deflection angle group are all 0 degrees, the second deflection angle group comprises a first deflection angle to the P-th deflection angle, and the third deflection angle group comprises a P+1th deflection angle to an N-th deflection angle;
Determining a first angle number, a second angle number and a third angle number according to the first deflection angle group, the second deflection angle group and the third deflection angle group, wherein the first angle number is the number of deflection angles in the first deflection angle group, the second angle number is the number of deflection angles in the second deflection angle group, and the third angle number is the number of deflection angles in the third deflection angle group;
and determining the correction angle according to the first angle number, the second angle number and the third angle number.
8. The text recognition method of claim 7, wherein determining the correction angle based on the first number of angles, the second number of angles, and the third number of angles comprises:
Determining that the correction angle is 0 degrees in response to the first angle degree meeting a third condition; or alternatively
Determining that the correction angle is a first angle value in response to the first number of angles not meeting the third condition and the second number of angles and the third number of angles meeting a fourth condition; or alternatively
Determining that the correction angle is a second angle value in response to the first number of angles not meeting the third condition and the second number of angles and the third number of angles meeting a fifth condition; or alternatively
Determining that the correction angle is 0 degrees in response to the first number of angles not satisfying the third condition and the second and third numbers of angles not satisfying the fourth and fifth conditions;
Wherein the third condition is s0 > ss1, the fourth condition is s1 > s2 + ss2, the fifth condition is s1 + ss2 < s2, s0 is the first angle number, s1 is the second angle number, s2 is the third angle number, and ss1 and ss2 are constants,
The first angle value is determined from the deflection angles ai in the second deflection angle group, wherein 1 ≤ i ≤ P and ai represents the i-th deflection angle among the first to P-th deflection angles in the second deflection angle group,
The second angle value is determined from the deflection angles aj in the third deflection angle group, wherein P+1 ≤ j ≤ N and aj represents the j-th deflection angle among the (P+1)-th to N-th deflection angles in the third deflection angle group.
9. The text recognition method of claim 8, wherein ss1 is 5 and ss2 is 2.
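Claims 7 to 9 can be read as the following selection rule. The sketch is a hedged Python illustration: the function name is invented, and because the first and second angle values are given by formulas not reproduced in this text, the sketch assumes they are the means of the second and third deflection angle groups.

```python
def correction_angle(group1, group2, group3, ss1: int = 5, ss2: int = 2) -> float:
    """Pick the correction angle from the three deflection angle groups.

    group1: deflection angles equal to 0 degrees (first group)
    group2: the 1st .. P-th sorted deflection angles (second group)
    group3: the (P+1)-th .. N-th sorted deflection angles (third group)
    ss1, ss2: constants (claim 9 uses 5 and 2)
    """
    s0, s1, s2 = len(group1), len(group2), len(group3)
    if s0 > ss1:                      # third condition
        return 0.0
    if s1 > s2 + ss2:                 # fourth condition
        return sum(group2) / s1       # assumed first angle value: mean of group 2
    if s1 + ss2 < s2:                 # fifth condition
        return sum(group3) / s2       # assumed second angle value: mean of group 3
    return 0.0                        # neither the fourth nor the fifth condition holds
```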
10. The text recognition method of any of claims 2-6, wherein the first angular threshold is 80 degrees and the second angular threshold is 10 degrees.
11. The text recognition method of any of claims 2-6, wherein a deflection angle of the final target text box relative to the reference direction is greater than the first angle threshold or less than the second angle threshold.
12. The text recognition method of claim 1, wherein the at least one text box comprises N text boxes, N being 1 or 2,
Determining the correction angle and the correction direction for the target text box from the deflection angle and the coordinate set of the at least one text box comprises:
determining the correction angle for the target text box according to the deflection angle of the target text box;
determining an aspect ratio of the target text box according to the coordinate set of the target text box in response to the correction angle;
And determining the correction direction for the target text box according to the aspect ratio of the target text box.
13. The text recognition method of claim 12, wherein determining the correction direction for the target text box according to an aspect ratio of the target text box comprises:
determining that the correction direction is a counterclockwise direction in response to the aspect ratio of the target text box being 1 or more; or alternatively
And determining that the correction direction is clockwise in response to the aspect ratio of the target text box being less than 1.
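For the single-box case of claims 12 and 13, a minimal sketch follows; treating the aspect ratio as the width divided by the height of the box's bounding extent, and representing the box as four vertices, are assumptions of the sketch.

```python
def direction_from_aspect_ratio(box_coords) -> str:
    """Single-box case (claims 12-13): choose the rotation direction from the
    width/height ratio of the target text box.

    box_coords: four (x, y) vertices of the target text box.
    """
    xs = [x for x, _ in box_coords]
    ys = [y for _, y in box_coords]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    aspect_ratio = width / max(height, 1e-6)   # guard against a degenerate box
    return "ccw" if aspect_ratio >= 1 else "cw"
```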
14. The text recognition method of any one of claims 1-6, wherein the at least one text box is a rectangular box, and the set of coordinates of each of the at least one text box includes coordinates of at least three vertices of each of the text boxes.
15. The text recognition method of any one of claims 1-6, wherein a deflection angle of each of the at least one text box is greater than or equal to 0 degrees and less than or equal to 90 degrees.
16. The text recognition method of any one of claims 1-6, wherein rotating the target text box by the correction angle and the correction direction to obtain the final target text box comprises:
Rotating the input image according to the correction angle and the correction direction so that the target text box is rotated to obtain the final target text box; or alternatively
and cropping the target text box to obtain a cropped target text box, and rotating the cropped target text box according to the correction angle and the correction direction to obtain the final target text box.
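One possible realisation of the second alternative in claim 16 (rotating a cropped target text box) is sketched below with OpenCV; the sign convention (positive angles rotate counterclockwise in warpAffine) and the canvas expansion are implementation choices, not claim language.

```python
import cv2
import numpy as np

def rotate_crop(crop: np.ndarray, angle: float, direction: str) -> np.ndarray:
    """Rotate a cropped target text box by the correction angle and direction.

    crop: the cropped target text box as an H x W x C image
    angle: correction angle in degrees; direction: "ccw" or "cw"
    """
    signed = angle if direction == "ccw" else -angle   # OpenCV: positive = CCW
    h, w = crop.shape[:2]
    center = (w / 2.0, h / 2.0)
    m = cv2.getRotationMatrix2D(center, signed, 1.0)
    # enlarge the canvas so no corner of the rotated box is clipped
    cos, sin = abs(m[0, 0]), abs(m[0, 1])
    new_w, new_h = int(h * sin + w * cos), int(h * cos + w * sin)
    m[0, 2] += new_w / 2.0 - center[0]
    m[1, 2] += new_h / 2.0 - center[1]
    return cv2.warpAffine(crop, m, (new_w, new_h))
```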
17. The text recognition method of claim 1, wherein the at least one intermediate text box corresponds one-to-one to the at least one text box,
Each intermediate text box group includes an i-th intermediate text box, the text box group includes an i-th text box, the i-th intermediate text box corresponds to the i-th text box, and i is greater than or equal to 1 and less than or equal to the number of intermediate text boxes in each intermediate text box group,
Determining the text box group according to the plurality of intermediate text box groups comprises:
for the i-th text box, determining the coordinate set of the i-th text box according to the coordinate sets corresponding to the i-th intermediate text boxes of the plurality of intermediate text box groups, so as to determine the text box group.
18. The text recognition method of claim 1, wherein performing text detection on each intermediate input image to obtain the set of intermediate text boxes corresponding to each intermediate input image comprises:
performing text detection on each intermediate input image by using a text detection neural network to determine a text detection region group corresponding to each intermediate input image;
And processing the text detection area group by using a minimum circumscribed rectangle algorithm to determine the intermediate text box group, wherein the text detection area group comprises at least one text detection area, the at least one text detection area corresponds to the at least one intermediate text box one by one, and each intermediate text box covers the corresponding text detection area.
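Claim 18's minimum circumscribed rectangle step can be illustrated with OpenCV's minimum-area rectangle routine; representing each text detection region as a binary mask is an assumption of this sketch.

```python
import cv2
import numpy as np

def region_to_text_box(region_mask: np.ndarray) -> np.ndarray:
    """Wrap one detected text region with its minimum circumscribed rectangle.

    region_mask: binary H x W mask of a single text detection region.
    Returns the 4 corner points of the covering rectangle.
    """
    ys, xs = np.nonzero(region_mask)                 # pixels belonging to the region
    points = np.stack([xs, ys], axis=1).astype(np.float32)
    rect = cv2.minAreaRect(points)                   # ((cx, cy), (w, h), angle)
    return cv2.boxPoints(rect)                       # 4 x 2 corner coordinates
```

The angle component of cv2.minAreaRect could also serve as the box's deflection angle, although the claims do not tie the two together.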
19. The text recognition method of claim 18, wherein the text detection neural network comprises first to fifth convolution modules, first to fifth downsampling modules, a full connection module, first to third upsampling modules, first to fourth dimension reduction modules, and a classifier,
Performing text detection on each intermediate input image by using the text detection neural network to determine the text detection region group corresponding to each intermediate input image comprises:
Performing convolution processing on each intermediate input image by using the first convolution module to obtain a first convolution characteristic image group;
Performing downsampling processing on the first convolution feature image group by using the first downsampling module so as to obtain a first downsampled feature image group;
performing convolution processing on the first downsampled feature map group by using the second convolution module to obtain a second convolution feature map group;
Performing downsampling processing on the second convolution feature map set by using the second downsampling module to obtain a second downsampled feature map set;
performing convolution processing on the second downsampled feature map group by using the third convolution module to obtain a third convolution feature map group;
Performing downsampling processing on the third convolution feature image group by using the third downsampling module to obtain a third downsampled feature image group, and performing dimension reduction processing on the third convolution feature image group by using the first dimension reduction module to obtain a first dimension reduction feature image group;
performing convolution processing on the third downsampled feature map group by using the fourth convolution module to obtain a fourth convolution feature map group;
Performing downsampling processing on the fourth convolution feature image group by using the fourth downsampling module to obtain a fourth downsampled feature image group, and performing dimension reduction processing on the fourth convolution feature image group by using the second dimension reduction module to obtain a second dimension reduction feature image group;
performing convolution processing on the fourth downsampled feature map group by using the fifth convolution module to obtain a fifth convolution feature map group;
Performing downsampling processing on the fifth convolution feature image group by using the fifth downsampling module to obtain a fifth downsampled feature image group, and performing dimension reduction processing on the fifth convolution feature image group by using the third dimension reduction module to obtain a third dimension reduction feature image group;
performing convolution processing on the fifth downsampled feature map group by using the full connection module to obtain a sixth convolution feature map group;
performing dimension reduction processing on the sixth convolution feature map set by using the fourth dimension reduction module to obtain a fourth dimension reduction feature map set;
performing up-sampling processing on the fourth dimension-reduction feature map set by using the first up-sampling module so as to obtain a first up-sampling feature map set;
Performing fusion processing on the first up-sampling feature image group and the third dimension-reduction feature image group to obtain a first fusion feature image group;
performing upsampling processing on the first fusion feature map set by using the second upsampling module to obtain a second upsampled feature map set;
Performing fusion processing on the second up-sampling feature image group and the second dimension-reduction feature image group to obtain a second fusion feature image group;
Performing upsampling processing on the second fusion feature map set by using the third upsampling module to obtain a third upsampled feature map set;
performing fusion processing on the third upsampling feature map set and the first dimension-reduction feature map set to obtain a third fusion feature map set;
Classifying the third fusion feature image group by using the classifier to obtain a text classification prediction image and a connection classification prediction image;
and determining the text detection area group according to the connection classification prediction graph and the text classification prediction graph.
20. The text recognition method of claim 19, wherein the number of feature maps in the first convolution feature map group is 8, the number of feature maps in the second convolution feature map group is 16, the number of feature maps in the third convolution feature map group is 32, the number of feature maps in the fourth convolution feature map group is 64, the number of feature maps in the fifth convolution feature map group is 128, the number of feature maps in the sixth convolution feature map group is 256,
and the number of feature maps in each of the first to fourth dimension-reduction feature map groups is 10.
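Claims 19 and 20 describe an encoder-decoder style detector. The PyTorch sketch below follows that dataflow with the claim-20 feature-map counts; the per-module layer count, the 3x3 kernels, max-pool downsampling, bilinear upsampling, addition as the fusion operation, and the 2-channel text / 8-channel link split of the classifier output are all assumptions, since the claims fix only the module order and the numbers of feature maps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_module(in_ch, out_ch, n_layers=2):
    # A "convolution module": a few 3x3 conv + ReLU layers. The layer count
    # and kernel size are assumptions; only out_ch comes from claim 20.
    layers = []
    for i in range(n_layers):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class TextDetectionNet(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        # Five convolution modules (8/16/32/64/128 maps); the "full connection"
        # module is realised here as a sixth convolution module with 256 maps.
        self.conv1 = conv_module(in_channels, 8)
        self.conv2 = conv_module(8, 16)
        self.conv3 = conv_module(16, 32)
        self.conv4 = conv_module(32, 64)
        self.conv5 = conv_module(64, 128)
        self.fc = conv_module(128, 256)
        self.pool = nn.MaxPool2d(2, 2)          # the five downsampling modules
        # Four 1x1 dimension-reduction modules, 10 maps each (claim 20).
        self.red1 = nn.Conv2d(32, 10, 1)
        self.red2 = nn.Conv2d(64, 10, 1)
        self.red3 = nn.Conv2d(128, 10, 1)
        self.red4 = nn.Conv2d(256, 10, 1)
        self.classifier = nn.Conv2d(10, 10, 1)  # text + link score maps

    @staticmethod
    def _up(x, like):
        # Upsampling module: bilinear resize to the resolution of `like`.
        return F.interpolate(x, size=like.shape[2:], mode="bilinear",
                             align_corners=False)

    def forward(self, x):
        c1 = self.conv1(x)
        c2 = self.conv2(self.pool(c1))
        c3 = self.conv3(self.pool(c2))      # 1/4 resolution
        c4 = self.conv4(self.pool(c3))      # 1/8
        c5 = self.conv5(self.pool(c4))      # 1/16
        c6 = self.fc(self.pool(c5))         # 1/32, the sixth feature map group
        r1, r2, r3, r4 = self.red1(c3), self.red2(c4), self.red3(c5), self.red4(c6)
        f1 = self._up(r4, r3) + r3          # first fusion feature map group
        f2 = self._up(f1, r2) + r2          # second fusion feature map group
        f3 = self._up(f2, r1) + r1          # third fusion feature map group
        scores = self.classifier(f3)
        text_pred = scores[:, :2]           # assumed text / non-text channels
        link_pred = scores[:, 2:]           # assumed pixel-link channels
        return text_pred, link_pred
```

With this layout both prediction maps come out at 1/4 of the input resolution, the resolution of the first dimension-reduction branch.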
21. The text recognition method of claim 18, wherein the text detection neural network comprises first to fifth convolution modules, first to fifth downsampling modules, a full connection module, first to third upsampling modules, first to fifth dimension reduction modules, and a classifier,
Performing text detection on each intermediate input image by using the text detection neural network to determine a text detection region group corresponding to each intermediate input image comprises:
Performing convolution processing on the input image by using the first convolution module to obtain a first convolution characteristic image group;
Performing downsampling processing on the first convolution feature image group by using the first downsampling module so as to obtain a first downsampled feature image group;
performing convolution processing on the first downsampled feature map group by using the second convolution module to obtain a second convolution feature map group;
Performing downsampling processing on the second convolution feature image group by using the second downsampling module to obtain a second downsampled feature image group, and performing dimension reduction processing on the second convolution feature image group by using the first dimension reduction module to obtain a first dimension reduction feature image group;
performing convolution processing on the second downsampled feature map group by using the third convolution module to obtain a third convolution feature map group;
Performing downsampling processing on the third convolution feature image group by using the third downsampling module to obtain a third downsampled feature image group, and performing dimension reduction processing on the third convolution feature image group by using the second dimension reduction module to obtain a second dimension reduction feature image group;
performing convolution processing on the third downsampled feature map group by using the fourth convolution module to obtain a fourth convolution feature map group;
Performing downsampling processing on the fourth convolution feature image group by using the fourth downsampling module to obtain a fourth downsampled feature image group, and performing dimension reduction processing on the fourth convolution feature image group by using the third dimension reduction module to obtain a third dimension reduction feature image group;
performing convolution processing on the fourth downsampled feature map group by using the fifth convolution module to obtain a fifth convolution feature map group;
Performing downsampling processing on the fifth convolution feature image group by using the fifth downsampling module to obtain a fifth downsampled feature image group, and performing dimension reduction processing on the fifth convolution feature image group by using the fourth dimension reduction module to obtain a fourth dimension reduction feature image group;
performing convolution processing on the fifth downsampled feature map group by using the full connection module to obtain a sixth convolution feature map group;
performing dimension reduction processing on the sixth convolution feature map set by using the fifth dimension reduction module to obtain a fifth dimension reduction feature map set;
Performing fusion processing on the fourth dimension reduction feature image group and the fifth dimension reduction feature image group to obtain a first fusion feature image group;
performing upsampling processing on the first fusion feature map set by using the first upsampling module to obtain a first upsampled feature map set;
performing fusion processing on the first up-sampling feature image group and the third dimension-reduction feature image group to obtain a second fusion feature image group;
performing upsampling processing on the second fused feature map set by using the second upsampling module to obtain a second upsampled feature map set;
performing fusion processing on the second up-sampling feature image group and the second dimension-reduction feature image group to obtain a third fusion feature image group;
performing upsampling processing on the third fused feature map set by using the third upsampling module to obtain a third upsampled feature map set;
performing fusion processing on the third upsampling feature map set and the first dimension-reduction feature map set to obtain a fourth fusion feature map set;
classifying the fourth fusion feature image group by using the classifier to obtain a text classification prediction image and a connection classification prediction image;
and determining the text detection area group according to the connection classification prediction graph and the text classification prediction graph.
22. The text recognition method of claim 21, wherein the number of feature maps in the first convolution feature map group is 64, the number of feature maps in the second convolution feature map group is 128, the number of feature maps in the third convolution feature map group is 256, the number of feature maps in the fourth convolution feature map group is 512, the number of feature maps in the fifth convolution feature map group is 512, the number of feature maps in the sixth convolution feature map group is 512,
and the number of feature maps in each of the first to fifth dimension-reduction feature map groups is 18.
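For comparison, the two claimed variants differ only in their channel widths and in the number of dimension-reduction branches; one compact way to record that is shown below. The dictionary names and the VGG-style reading of the 64-512 widths are illustrative shorthand, not claim language.

```python
# Feature-map counts fixed by claims 20 and 22. "conv" covers convolution
# modules 1-5 plus the full-connection module's output group; "reduce" lists
# the width of each dimension-reduction module.
LIGHTWEIGHT_VARIANT = {            # claims 19-20
    "conv": (8, 16, 32, 64, 128, 256),
    "reduce": (10, 10, 10, 10),
}
VGG_STYLE_VARIANT = {              # claims 21-22, with one extra reduction branch
    "conv": (64, 128, 256, 512, 512, 512),
    "reduce": (18, 18, 18, 18, 18),
}
```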
23. The text recognition method of claim 18, wherein prior to acquiring the input image, the text recognition method further comprises: training a text detection neural network to be trained to obtain the text detection neural network,
Training a text detection neural network to be trained to obtain the text detection neural network comprises:
acquiring a training input image and a target text detection region group;
Processing the training input image by using the text detection neural network to be trained to obtain a training text detection region group;
Calculating a loss value of the text detection neural network to be trained according to the target text detection area group and the training text detection area group through a loss function;
Correcting parameters of the text detection neural network to be trained according to the loss value, obtaining the trained text detection neural network when the loss function meets a preset condition, and continuing to input the training input image and the target text detection region group to repeat the above training process when the loss function does not meet the preset condition.
24. The text recognition method of claim 23, wherein the loss function comprises a focal loss function.
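Claims 23 and 24 describe a standard supervised training loop driven by a focal loss. The sketch below is one hedged reading of it: the optimizer, learning rate, alpha and gamma values, and the "loss below a threshold" stopping rule are assumptions, and the model is taken to return a per-pixel text score map as in the earlier sketch.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss on per-pixel text/non-text logits; alpha and gamma
    are common defaults, not values taken from the claims."""
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def train(model, loader, epochs=10, lr=1e-3, threshold=1e-3):
    """Sketch of claim 23: compute a loss between the training and target text
    detection regions, correct the parameters, and stop once the loss meets a
    preset condition (here assumed to be falling below a threshold)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for image, target_regions in loader:        # training input + target regions
            text_pred, _ = model(image)
            # resize the target region mask to the prediction resolution
            target = F.interpolate(target_regions, size=text_pred.shape[2:])
            loss = focal_loss(text_pred[:, :1], target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < threshold:             # preset condition met
                return model
    return model
```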
25. The text recognition method of any of claims 1-6, wherein determining a target text box from the at least one text box comprises:
determining the position of a pen point of the fixed point translation pen;
Marking a region to be detected in the input image based on the position of the pen point;
determining at least one overlapping area between the to-be-detected area and the at least one text box respectively;
And determining a text box corresponding to the largest overlapping area in the at least one overlapping area as the target text box.
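Claim 25's overlap test can be sketched as follows; the square probe region around the pen point and its size are illustrative, and the shapely library is used only as a convenient way to intersect arbitrary quadrilaterals.

```python
from shapely.geometry import Polygon, box

def pick_target_box(pen_tip, text_boxes, radius=40):
    """Mark a region around the pen point and pick the text box with the
    largest overlap (claim 25).

    pen_tip: (x, y) position of the pen point
    text_boxes: list of 4-point quadrilaterals [(x1, y1), ..., (x4, y4)]
    radius: half-width of the assumed square region to be detected
    """
    x, y = pen_tip
    probe = box(x - radius, y - radius, x + radius, y + radius)
    best_box, best_area = None, 0.0
    for quad in text_boxes:
        overlap = probe.intersection(Polygon(quad)).area
        if overlap > best_area:
            best_box, best_area = quad, overlap
    return best_box
```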
26. The text recognition method of any one of claims 1-6, wherein identifying the final target text box to obtain the target text comprises:
Performing recognition processing on the final target text box by using a text recognition neural network to obtain an intermediate text;
And checking the intermediate text to obtain the target text.
27. The text recognition method of claim 26, wherein the text recognition neural network is a multi-object rectified attention network.
28. The text recognition method of any one of claims 1-6, further comprising:
And translating the target text to obtain and output a translation result of the target text.
29. A text recognition method, comprising:
Acquiring an input image;
performing text detection on the input image by using a text detection neural network to determine a text box group, wherein the text detection is realized on the basis of a pixel connection algorithm, and the text box group comprises at least one text box;
determining a target text box from the at least one text box, wherein the target text box comprises target text;
Rotating the target text box to obtain a final target text box;
identifying the final target text box to obtain the target text,
Wherein the text detection neural network comprises first to fifth convolution modules and first to fourth dimension reduction modules,
The number of convolution kernels in each convolution layer in the first convolution module is 8, the number of convolution kernels in each convolution layer in the second convolution module is 16, the number of convolution kernels in each convolution layer in the third convolution module is 32, the number of convolution kernels in each convolution layer in the fourth convolution module is 64, the number of convolution kernels in each convolution layer in the fifth convolution module is 128,
The number of convolution kernels in each convolution layer in the first dimension reduction module is 10, the number of convolution kernels in each convolution layer in the second dimension reduction module is 10, the number of convolution kernels in each convolution layer in the third dimension reduction module is 10, and the number of convolution kernels in each convolution layer in the fourth dimension reduction module is 10;
Wherein performing text detection on the input image using the text detection neural network to determine a text box group comprises:
Performing scale transformation processing on the input image to obtain a plurality of intermediate input images, wherein the plurality of intermediate input images comprise the input image, and the sizes of the plurality of intermediate input images are different from each other;
for each intermediate input image in the plurality of intermediate input images, performing text detection on each intermediate input image by using the text detection neural network to obtain an intermediate text box group corresponding to each intermediate input image, thereby obtaining a plurality of intermediate text box groups corresponding to the plurality of intermediate input images, wherein each intermediate text box group comprises at least one intermediate text box;
And determining the text box group according to the plurality of intermediate text box groups.
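The multi-scale step spelled out in claim 29 (and echoed in claims 1 and 17) might be sketched as below. The scale set, the assumption that the detector returns the intermediate boxes of the several scales in a matching order, and the choice to merge corresponding boxes by averaging their corner coordinates are all assumptions; the claims only require that the final coordinate set be determined from the per-scale coordinate sets.

```python
import cv2
import numpy as np

def multi_scale_detect(image, detect_fn, scales=(1.0, 0.75, 0.5)):
    """Run text detection on several resized copies of the input, map each
    intermediate text box group back to the original resolution, and merge
    corresponding boxes into the final text box group.

    detect_fn(img) is assumed to return a list of 4x2 corner arrays, one per
    text box, in the same order at every scale.
    """
    per_scale = []
    for s in scales:
        resized = cv2.resize(image, None, fx=s, fy=s) if s != 1.0 else image
        boxes = detect_fn(resized)
        # rescale the intermediate boxes back to the original image coordinates
        per_scale.append([np.asarray(b, dtype=np.float32) / s for b in boxes])
    merged = []
    for group in zip(*per_scale):           # i-th intermediate box at each scale
        merged.append(np.mean(np.stack(group), axis=0))
    return merged
```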
30. The text recognition method of claim 29, wherein the at least one intermediate text box corresponds one-to-one with the at least one text box,
Each intermediate text box group includes an i-th intermediate text box, the text box group includes an i-th text box, the i-th intermediate text box corresponds to the i-th text box, and i is greater than or equal to 1 and less than or equal to the number of intermediate text boxes in each intermediate text box group,
Determining the text box group according to the plurality of intermediate text box groups comprises:
for the i-th text box, determining the coordinate set of the i-th text box according to the coordinate sets corresponding to the i-th intermediate text boxes of the plurality of intermediate text box groups, so as to determine the text box group.
31. The text recognition method of claim 29, wherein performing text detection on each intermediate input image using the text detection neural network to obtain the set of intermediate text boxes corresponding to each intermediate input image comprises:
Performing text detection on each intermediate input image by using the text detection neural network to determine a text detection region group corresponding to each intermediate input image;
And processing the text detection area group by using a minimum circumscribed rectangle algorithm to determine the intermediate text box group, wherein the text detection area group comprises at least one text detection area, the at least one text detection area corresponds to the at least one intermediate text box one by one, and each intermediate text box covers the corresponding text detection area.
32. The text recognition method of claim 31, wherein the text detection neural network further comprises first through fifth downsampling modules, a full connection module, first through third upsampling modules, and a classifier,
Performing text detection on each intermediate input image by using the text detection neural network to determine the text detection region group corresponding to each intermediate input image, including:
Performing convolution processing on each intermediate input image by using the first convolution module to obtain a first convolution characteristic image group;
Performing downsampling processing on the first convolution feature image group by using the first downsampling module so as to obtain a first downsampled feature image group;
performing convolution processing on the first downsampled feature map group by using the second convolution module to obtain a second convolution feature map group;
Performing downsampling processing on the second convolution feature map set by using the second downsampling module to obtain a second downsampled feature map set;
performing convolution processing on the second downsampled feature map group by using the third convolution module to obtain a third convolution feature map group;
Performing downsampling processing on the third convolution feature image group by using the third downsampling module to obtain a third downsampled feature image group, and performing dimension reduction processing on the third convolution feature image group by using the first dimension reduction module to obtain a first dimension reduction feature image group;
performing convolution processing on the third downsampled feature map group by using the fourth convolution module to obtain a fourth convolution feature map group;
Performing downsampling processing on the fourth convolution feature image group by using the fourth downsampling module to obtain a fourth downsampled feature image group, and performing dimension reduction processing on the fourth convolution feature image group by using the second dimension reduction module to obtain a second dimension reduction feature image group;
performing convolution processing on the fourth downsampled feature map group by using the fifth convolution module to obtain a fifth convolution feature map group;
Performing downsampling processing on the fifth convolution feature image group by using the fifth downsampling module to obtain a fifth downsampled feature image group, and performing dimension reduction processing on the fifth convolution feature image group by using the third dimension reduction module to obtain a third dimension reduction feature image group;
performing convolution processing on the fifth downsampled feature map group by using the full connection module to obtain a sixth convolution feature map group;
performing dimension reduction processing on the sixth convolution feature map set by using the fourth dimension reduction module to obtain a fourth dimension reduction feature map set;
performing up-sampling processing on the fourth dimension-reduction feature map set by using the first up-sampling module so as to obtain a first up-sampling feature map set;
Performing fusion processing on the first up-sampling feature image group and the third dimension-reduction feature image group to obtain a first fusion feature image group;
performing upsampling processing on the first fusion feature map set by using the second upsampling module to obtain a second upsampled feature map set;
Performing fusion processing on the second up-sampling feature image group and the second dimension-reduction feature image group to obtain a second fusion feature image group;
Performing upsampling processing on the second fusion feature map set by using the third upsampling module to obtain a third upsampled feature map set;
performing fusion processing on the third upsampling feature map set and the first dimension-reduction feature map set to obtain a third fusion feature map set;
Classifying the third fusion feature image group by using the classifier to obtain a text classification prediction image and a connection classification prediction image;
and determining the text detection area group according to the connection classification prediction graph and the text classification prediction graph.
33. The text recognition method of claim 32, wherein the number of feature maps in the first convolution feature map group is 8, the number of feature maps in the second convolution feature map group is 16, the number of feature maps in the third convolution feature map group is 32, the number of feature maps in the fourth convolution feature map group is 64, the number of feature maps in the fifth convolution feature map group is 128, the number of feature maps in the sixth convolution feature map group is 256,
and the number of feature maps in each of the first to fourth dimension-reduction feature map groups is 10.
34. The text recognition method of any one of claims 29-33, wherein prior to acquiring the input image, the text recognition method further comprises: training a text detection neural network to be trained to obtain the text detection neural network,
Training a text detection neural network to be trained to obtain the text detection neural network comprises:
acquiring a training input image and a target text detection region group;
Processing the training input image by using the text detection neural network to be trained to obtain a training text detection region group;
Calculating a loss value of the text detection neural network to be trained according to the target text detection area group and the training text detection area group through a loss function;
Correcting parameters of the text detection neural network to be trained according to the loss value, obtaining the trained text detection neural network when the loss function meets a preset condition, and continuing to input the training input image and the target text detection region group to repeat the above training process when the loss function does not meet the preset condition.
35. The text recognition method of claim 34, wherein the loss function comprises a focal loss function.
36. A text recognition device, comprising:
the image acquisition device is used for acquiring an input image;
a memory for storing the input image and computer readable instructions;
A processor for reading the input image and executing the computer readable instructions, which when executed by the processor perform the text recognition method according to any one of claims 1-35.
37. The text recognition device of claim 36, further comprising a point translation pen,
wherein the image acquisition device is arranged on the point translation pen, and the point translation pen is used for selecting the target text.
38. A storage medium non-transitorily storing computer readable instructions, wherein the computer readable instructions, when executed by a computer, perform the text recognition method of any one of claims 1-35.
CN202080000058.XA 2020-01-21 2020-01-21 Character recognition method, character recognition device, and storage medium Active CN113498520B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/073576 WO2021146937A1 (en) 2020-01-21 2020-01-21 Character recognition method, character recognition device and storage medium

Publications (2)

Publication Number Publication Date
CN113498520A (en) 2021-10-12
CN113498520B (en) 2024-05-17

Family

ID=76992750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080000058.XA Active CN113498520B (en) 2020-01-21 2020-01-21 Character recognition method, character recognition device, and storage medium

Country Status (2)

Country Link
CN (1) CN113498520B (en)
WO (1) WO2021146937A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627427B (en) * 2021-08-04 2023-09-22 中国兵器装备集团自动化研究所有限公司 Instrument reading method and system based on image detection technology
CN114757304B (en) * 2022-06-10 2022-09-09 北京芯盾时代科技有限公司 Data identification method, device, equipment and storage medium
CN116958981B (en) * 2023-05-31 2024-04-30 广东南方网络信息科技有限公司 Character recognition method and device
CN116740721B (en) * 2023-08-15 2023-11-17 深圳市玩瞳科技有限公司 Finger sentence searching method, device, electronic equipment and computer storage medium
CN117809318B (en) * 2024-03-01 2024-05-28 微山同在电子信息科技有限公司 Oracle identification method and system based on machine vision

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7720316B2 (en) * 2006-09-05 2010-05-18 Microsoft Corporation Constraint-based correction of handwriting recognition errors

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9177225B1 (en) * 2014-07-03 2015-11-03 Oim Squared Inc. Interactive content generation
CN109635805A (en) * 2018-12-11 2019-04-16 上海智臻智能网络科技股份有限公司 Image text location method and device, image text recognition methods and device
CN110490198A (en) * 2019-08-12 2019-11-22 上海眼控科技股份有限公司 Text orientation bearing calibration, device, computer equipment and storage medium
CN110659633A (en) * 2019-08-15 2020-01-07 坎德拉(深圳)科技创新有限公司 Image text information recognition method and device and storage medium

Also Published As

Publication number Publication date
WO2021146937A1 (en) 2021-07-29
CN113498520A (en) 2021-10-12

Similar Documents

Publication Publication Date Title
CN113498520B (en) Character recognition method, character recognition device, and storage medium
CN112396002B (en) SE-YOLOv 3-based lightweight remote sensing target detection method
CN108427924B (en) Text regression detection method based on rotation sensitive characteristics
CN109241982B (en) Target detection method based on deep and shallow layer convolutional neural network
CN109583483B (en) Target detection method and system based on convolutional neural network
CN109034184B (en) Grading ring detection and identification method based on deep learning
CN110766020A (en) System and method for detecting and identifying multi-language natural scene text
CN112348036A (en) Self-adaptive target detection method based on lightweight residual learning and deconvolution cascade
WO2021146951A1 (en) Text detection method and apparatus, and storage medium
CN110909724B (en) Thumbnail generation method of multi-target image
CN111046923B (en) Image target detection method and device based on bounding box and storage medium
WO2021218706A1 (en) Text identification method and apparatus, device, and storage medium
CN116385707A (en) Deep learning scene recognition method based on multi-scale features and feature enhancement
CN112348056A (en) Point cloud data classification method, device, equipment and readable storage medium
CN114037992A (en) Instrument reading identification method and device, electronic equipment and storage medium
CN116091946A (en) Yolov 5-based unmanned aerial vehicle aerial image target detection method
CN110516731B (en) Visual odometer feature point detection method and system based on deep learning
CN115482529A (en) Method, equipment, storage medium and device for recognizing fruit image in near scene
WO2022063321A1 (en) Image processing method and apparatus, device and storage medium
CN117456376A (en) Remote sensing satellite image target detection method based on deep learning
CN114998701B (en) Target detection loss optimization method based on pixel feature matching
CN115841672A (en) Character detection and identification method, device and equipment
CN115512207A (en) Single-stage target detection method based on multipath feature fusion and high-order loss sensing sampling
US11797854B2 (en) Image processing device, image processing method and object recognition system
CN113033593B (en) Text detection training method and device based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant