CN111062258B - Text region identification method, device, terminal equipment and readable storage medium


Info

Publication number
CN111062258B
Authority
CN
China
Prior art keywords
line
tail
head
image
text
Prior art date
Legal status
Active
Application number
CN201911159636.XA
Other languages
Chinese (zh)
Other versions
CN111062258A (en)
Inventor
施烈航
姚恒志
王志远
冯霞
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201911159636.XA
Publication of CN111062258A
Application granted
Publication of CN111062258B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Character Input (AREA)

Abstract

The application is applicable to the field of terminal artificial intelligence and the related field of computer vision, and provides a text region identification method, an apparatus, terminal equipment and a readable storage medium. The method comprises the following steps: acquiring an image to be detected, wherein the image to be detected comprises at least one text line; and inputting the image to be detected into a trained text recognition model for processing, to obtain a text region formed by the text line head and the text line tail of the at least one text line. Because the text recognition model is obtained through training with a head-tail feature point regression branch, which adjusts the weights of the line head and line tail in the model, the model can accurately recognize the head and tail of each text line. This avoids inaccurate text region identification caused by inaccurate recognition of line heads and line tails, improving both the accuracy of recognizing the head and tail of a text line and the accuracy of recognizing the text region.

Description

Text region identification method, device, terminal equipment and readable storage medium
Technical Field
The application belongs to the technical field of computer vision, and particularly relates to a text region identification method, a text region identification device, terminal equipment and a readable storage medium.
Background
With the continuous development of neural networks, their fields of application keep expanding. In the field of text recognition, models trained with neural networks can recognize text regions in images, and the characters within those text regions can then be recognized.
In the related art, when a model is trained with a neural network, the network extracts features from an image to obtain a feature layer, and densely samples the feature layer to determine text regions. In the process of determining a text region, a text line can be split into a plurality of sub-regions, and whether the left and right sides of each sub-region contain text information is used to determine the text line head, the text line tail and the text line body.
However, when training in this manner, the number of in-line samples is far greater than the number of line-head and line-tail samples, so the training samples are unbalanced; the model easily makes errors when determining the text line head and the text line tail, which leads to inaccurate text region identification.
Disclosure of Invention
The embodiment of the application provides a text region identification method, a device, terminal equipment and a readable storage medium, which can solve the problem of inaccurate text region identification.
In a first aspect, an embodiment of the present application provides a text region recognition method, where the method includes:
acquiring an image to be detected, wherein the image to be detected comprises at least one text line;
inputting the image to be detected into a trained text recognition model for processing to obtain a text region formed by the text line head and the text line tail of at least one text line, wherein the text recognition model is obtained by training according to head and tail characteristic point regression branches, and the head and tail characteristic point regression branches are used for adjusting weights of the text line head and the text line tail in the text recognition model.
In a first possible implementation manner of the first aspect, the preset recognition model includes an initial text recognition model and the head-tail feature point regression branch;
before the input of the image to be detected into the trained text recognition model, the method further comprises:
inputting a sample image into the preset recognition model to obtain a region recognition image output by the initial text recognition model, and a head recognition image and a tail recognition image output by the head and tail feature point regression branch;
calculating according to the region identification image, the sample text region of the sample image and the first loss function to obtain a region error;
calculating according to the head-of-line identification image, the tail-of-line identification image, the head-of-line target image, the tail-of-line target image and the second loss function to obtain head-of-line and tail-of-line errors, wherein the head-of-line target image and the tail-of-line target image are generated according to the sample image and the initial text identification model;
and training the initial text recognition model according to the region error and the head-to-tail error to obtain the text recognition model.
Based on the first possible implementation manner of the first aspect, in a second possible implementation manner, before the calculating according to the head-of-line identification image, the tail-of-line identification image, the head-of-line target image, the tail-of-line target image, and the second loss function, the method further includes:
acquiring line head midpoint position information and line tail midpoint position information of the sample image, wherein the line head midpoint position information is used for representing the midpoint position of the line head of the sample text in the sample image, and the line tail midpoint position information is used for representing the midpoint position of the line tail of the sample text in the sample image;
generating a head-of-line initial image and a tail-of-line initial image according to the downsampling magnification of the initial text recognition model;
and generating the head-of-line target image according to the position of the head-of-line midpoint position information in the head-of-line initial image mapping, and generating the tail-of-line target image according to the position of the tail-of-line midpoint position information in the tail-of-line initial image mapping.
In a third possible implementation manner, the acquiring the position information of the middle point of the line head and the position information of the middle point of the line tail of the sample image includes:
acquiring the sample image;
determining a line head coordinate and a line tail coordinate of the sample image according to the labeling information of the sample image, wherein the line head coordinate and the line tail coordinate both comprise an upper boundary coordinate and a lower boundary coordinate;
and determining the position information of the middle point of the line head according to the upper boundary coordinate and the lower boundary coordinate of the line head coordinate, and determining the position information of the middle point of the line tail according to the upper boundary coordinate and the lower boundary coordinate of the line tail coordinate.
Based on the third possible implementation manner of the first aspect, in a fourth possible implementation manner, the generating the line head target image according to the position of the line head midpoint position information mapped in the line head initial image includes:
acquiring a line head mapping position of a line head midpoint in the line head initial image according to the downsampling multiplying power of the initial text recognition model and the line head midpoint position information;
generating the line head target image by combining the line head initial image according to the line head mapping position, the line head coordinates and the downsampling multiplying power of the initial text recognition model;
the generating the line tail target image according to the position of the line tail midpoint position information in the line tail initial image mapping position comprises the following steps:
acquiring a line tail mapping position of a line tail midpoint in the line tail initial image according to the downsampling multiplying power of the initial text recognition model and the line tail midpoint position information;
and generating the line tail target image by combining the line tail initial image according to the line tail mapping position, the line tail coordinates and the downsampling multiplying power of the initial text recognition model.
Based on the second possible implementation manner of the first aspect, in a fifth possible implementation manner, the generating a head-of-line initial image and a tail-of-line initial image according to a downsampling magnification of the initial text recognition model includes:
calculating according to the downsampling ratio of the initial text recognition model and the resolution of the input image of the initial text recognition model to obtain initial resolution;
and generating the head-of-line initial image and the tail-of-line initial image respectively according to the initial resolution.
In a sixth possible implementation manner, based on the first possible implementation manner of the first aspect, the second loss function includes a second line head loss function and a second line tail loss function;
the calculating according to the head-of-line identification image, the tail-of-line identification image, the head-of-line target image, the tail-of-line target image and the second loss function to obtain head-of-line and tail-of-line errors comprises:
calculating according to the line head identification image, the line head target image and the second line head loss function to obtain a line head error;
calculating according to the line tail identification image, the line tail target image and the second line tail loss function to obtain a line tail error;
and summing the head-of-line error and the tail-of-line error to obtain the head-of-line tail-of-line error.
Based on the first possible implementation manner of the first aspect, in a seventh possible implementation manner, the training the initial text recognition model according to the region error and the line head-line tail error to obtain the text recognition model includes:
summing the region error and the line head-line tail error to obtain a model error;
training the initial text recognition model according to the model error to obtain the text recognition model.
In a second aspect, an embodiment of the present application provides a text region recognition apparatus, including:
the image acquisition module is used for acquiring an image to be detected, wherein the image to be detected comprises at least one text line;
the recognition module is used for inputting the image to be detected into a trained text recognition model for processing, so as to obtain a text region formed by the text line head and the text line tail of the at least one text line, the text recognition model is obtained through training according to head and tail characteristic point regression branches, and the head and tail characteristic point regression branches are used for adjusting weights of the text line head and the text line tail in the text recognition model.
In a first possible implementation manner of the second aspect, the preset recognition model includes an initial text recognition model and the head-tail feature point regression branch; the apparatus further comprises:
the input module is used for inputting a sample image into the preset recognition model to obtain an area recognition image output by the initial text recognition model, and a head recognition image and a tail recognition image output by the head and tail feature point regression branch;
the first calculation module is used for calculating according to the region identification image, the sample text region of the sample image and the first loss function to obtain a region error;
the second calculation module is used for calculating according to the head-of-line identification image, the tail-of-line identification image, the head-of-line target image, the tail-of-line target image and the second loss function to obtain the line head-line tail error, where the head-of-line target image and the tail-of-line target image are generated according to the sample image and the initial text recognition model;
and the training module is used for training the initial text recognition model according to the region error and the head-to-tail error to obtain the text recognition model.
Based on the first possible implementation manner of the second aspect, in a second possible implementation manner, the apparatus further includes:
the information acquisition module is used for acquiring line head midpoint position information and line tail midpoint position information of the sample image, wherein the line head midpoint position information is used for representing the midpoint position of the line head of the sample text in the sample image, and the line tail midpoint position information is used for representing the midpoint position of the line tail of the sample text in the sample image;
the first generation module is used for generating a head-of-line initial image and a tail-of-line initial image according to the downsampling multiplying power of the initial text recognition model;
the second generating module is used for generating the head-of-line target image according to the position of the head-of-line midpoint position information in the head-of-line initial image mapping, and generating the tail-of-line target image according to the position of the tail-of-line midpoint position information in the tail-of-line initial image mapping.
Based on the second possible implementation manner of the second aspect, in a third possible implementation manner, the information acquisition module is further configured to acquire the sample image; determining a line head coordinate and a line tail coordinate of the sample image according to the labeling information of the sample image, wherein the line head coordinate and the line tail coordinate both comprise an upper boundary coordinate and a lower boundary coordinate; and determining the position information of the middle point of the line head according to the upper boundary coordinate and the lower boundary coordinate of the line head coordinate, and determining the position information of the middle point of the line tail according to the upper boundary coordinate and the lower boundary coordinate of the line tail coordinate.
Based on the third possible implementation manner of the second aspect, in a fourth possible implementation manner, the second generating module is further configured to obtain a line head mapping position of a line head midpoint in the line head initial image according to a downsampling magnification of the initial text recognition model and the line head midpoint position information; generating the line head target image by combining the line head initial image according to the line head mapping position, the line head coordinates and the downsampling multiplying power of the initial text recognition model;
the second generation module is further configured to obtain a line tail mapping position of a line tail midpoint in the line tail initial image according to the downsampling magnification of the initial text recognition model and the line tail midpoint position information; and generating the line tail target image by combining the line tail initial image according to the line tail mapping position, the line tail coordinates and the downsampling multiplying power of the initial text recognition model.
Based on the second possible implementation manner of the second aspect, in a fifth possible implementation manner, the first generating module is further configured to calculate, according to a downsampling magnification of the initial text recognition model and a resolution of an input image of the initial text recognition model, to obtain an initial resolution; and respectively generating the head-of-line initial image and the tail-of-line initial image according to the initial resolution.
Based on the first possible implementation manner of the second aspect, in a sixth possible implementation manner, the second loss function includes a second line head loss function and a second line tail loss function;
the second calculation module is further configured to calculate according to the line head identification image, the line head target image, and the second line head loss function, to obtain a line head error; calculating according to the line tail identification image, the line tail target image and the second line tail loss function to obtain a line tail error; and summing the head-of-line error and the tail-of-line error to obtain the head-of-line tail-of-line error.
Based on the first possible implementation manner of the second aspect, in a seventh possible implementation manner, the training module is further configured to sum the region error and the line head-line tail error to obtain a model error; training the initial text recognition model according to the model error to obtain the text recognition model.
In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the text region identification method according to any one of the first aspects when the processor executes the computer program.
In a fourth aspect, an embodiment of the present application provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the text region recognition method according to any one of the first aspects above.
In a fifth aspect, an embodiment of the present application provides a computer program product, which, when run on a terminal device, causes the terminal device to perform the text region identification method according to any one of the first aspects above.
Compared with the prior art, the embodiment of the application has the beneficial effects that:
according to the embodiment of the application, the image to be detected comprising at least one text line is obtained, and the image to be detected is input into the text recognition model, so that the text recognition model is obtained through head-tail characteristic point regression branch training, the weights of the head and tail of the text line of the text recognition model are adjusted, the weights of the head and tail of the text line in the recognition process are improved, the text recognition model can accurately recognize the head and tail of each text line, the text recognition model can output a text region formed by the head and tail of the text line of a plurality of text lines, the problem of inaccurate text region recognition caused by inaccurate text head and tail recognition is avoided, the accuracy of recognizing the head and tail of the text line is improved, and the accuracy of recognizing the text region is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of a scene involved in a text region recognition method provided by the present application;
FIG. 2 is a block diagram of a portion of the structure of a mobile phone provided by the present application;
FIG. 3 is a schematic flow chart of a text region identification method provided by the present application;
FIG. 4 is a schematic flow chart of a method of training a text recognition model provided by the present application;
FIG. 5 is a schematic illustration of a text line area provided by the present application;
FIG. 6 is a schematic diagram of a line head target image provided by the present application;
FIG. 7 is a block diagram of a text region recognition device according to the present application;
FIG. 8 is a block diagram of another text region recognition device provided by the present application;
FIG. 9 is a block diagram of a text region recognition apparatus according to yet another embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
The terminology used in the following examples is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in the specification of the application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms, for example "one or more," unless the context clearly indicates otherwise.
The text region recognition method provided by the embodiment of the application can be applied to terminal equipment such as mobile phones, tablet computers, wearable devices, augmented reality (AR)/virtual reality (VR) devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks and personal digital assistants (PDA), and the embodiment of the application does not limit the specific type of the terminal equipment.
For example, the terminal device may be a Station (ST) in a WLAN, a cellular telephone, a cordless telephone, a Session Initiation Protocol (SIP) telephone, a wireless local loop (WLL) station, a personal digital assistant (PDA) device, a handheld device with wireless communication capabilities, a computing device or other processing device connected to a wireless modem, a computer, a laptop computer, a handheld communication device, a handheld computing device, a satellite radio, a wireless modem card, customer premise equipment (CPE) and/or another device for communication over a wireless system, as well as a mobile terminal in a next generation communication system such as a 5G network, or a mobile terminal in a future evolved public land mobile network (PLMN), etc.
By way of example and not limitation, when the terminal device is a wearable device, the wearable device may also be a general term for devices designed by applying wearable technology to everyday wear, such as glasses, gloves, watches, clothing and shoes. A wearable device is a portable device that is worn directly on the body or integrated into the user's clothing or accessories. A wearable device is not merely a hardware device; it can also realize powerful functions through software support, data interaction and cloud interaction. In a broad sense, wearable intelligent devices include devices that are full-featured, large-sized, and able to realize all or part of their functions without relying on a smartphone, such as smart watches or smart glasses.
Fig. 1 is a schematic view of a scene related to a text region recognition method provided by the present application, where, as shown in fig. 1, the scene includes: a terminal device 110 and an object 120 to be photographed.
The terminal device can photograph the object to be photographed to obtain an image to be detected comprising at least one text line, and identify the region where the text line is located in the image to be detected.
Moreover, the image to be detected may be an image including at least one text line photographed by the terminal device, or may be an image including at least one text line stored in advance by the terminal device.
In one possible implementation manner, the terminal device may acquire an image to be detected, input the image to be detected into a text recognition model trained in advance, and recognize the text line head and the text line tail of each text line through the text recognition model, where the text recognition model may determine the text region in the image to be detected according to the text line head and the text line tail of a plurality of text lines.
In the process of training the text recognition model, a sample image can be input into a preset recognition model comprising an initial text recognition model and a head-tail feature point regression branch, to obtain a region recognition image output by the initial text recognition model and the line head and line tail recognition images output by the head-tail feature point regression branch. A region error and a line head-line tail error are calculated from these recognition images, and the weights of the text line head and tail in the initial text recognition model are then adjusted according to the two errors to obtain the text recognition model, thereby improving the accuracy of recognizing the head and tail of a text line.
In addition, the terminal device in the embodiment of the application can be a terminal device in the field of terminal artificial intelligence, applied in the technical field of computers. The terminal device can photograph an object comprising at least one text line in a scene, and determine, from the captured image to be detected, the text region where the at least one text line is located, so that the terminal device can recognize each character in the text region. For example, the terminal device may photograph an English sentence in the scene to obtain an image to be detected, determine the text region in which the English sentence is located, and recognize and translate the English sentence in the text region to obtain the corresponding Chinese sentence.
Of course, the embodiment of the application can also be applied to scenarios such as tables, paragraphs and AR translation backfill tasks, and the embodiment of the application does not limit the application scenario of the terminal equipment.
Taking the terminal equipment as a mobile phone as an example, fig. 2 is a block diagram of a part of the structure of the mobile phone provided by the application. Referring to fig. 2, the mobile phone includes: radio frequency (RF) circuitry 210, memory 220, input unit 230, display unit 240, sensor 250, audio circuitry 260, wireless fidelity (WiFi) module 270, processor 280, and power supply 290. Those skilled in the art will appreciate that the handset structure shown in fig. 2 does not constitute a limitation; the handset may include more or fewer components than shown, combine certain components, or arrange the components differently.
The following describes the components of the mobile phone in detail with reference to fig. 2:
the RF circuit 210 may be used for receiving and transmitting signals during the process of receiving and transmitting information or communication, specifically, after receiving downlink information of the base station, the downlink information is processed by the processor 280; in addition, the data of the design uplink is sent to the base station. Typically, RF circuitry includes, but is not limited to, antennas, at least one amplifier, transceivers, couplers, low noise amplifiers (Low Noise Amplifier, LNAs), diplexers, and the like. In addition, the RF circuitry 210 may also communicate with networks and other devices via wireless communications. The wireless communications may use any communication standard or protocol including, but not limited to, global system for mobile communications (Global System of Mobile communication, GSM), general packet radio service (General Packet Radio Service, GPRS), code division multiple access (Code Division Multiple Access, CDMA), wideband code division multiple access (Wideband Code Division Multiple Access, WCDMA), long term evolution (Long Term Evolution, LTE)), email, short message service (Short Messaging Service, SMS), and the like.
The memory 220 may be used to store software programs and modules, and the processor 280 performs various functional applications and data processing of the cellular phone by executing the software programs and modules stored in the memory 220. The memory 220 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data (such as audio data, phonebook, etc.) created according to the use of the handset, etc. In addition, memory 220 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The input unit 230 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the handset. In particular, the input unit 230 may include a touch panel 231 and other input devices 232. The touch panel 231, also referred to as a touch screen, may collect touch operations on or near it (e.g., operations by the user on or near the touch panel 231 using any suitable object or accessory such as a finger or a stylus) and drive the corresponding connection device according to a preset program. Alternatively, the touch panel 231 may include two parts: a touch detection device and a touch controller. The touch detection device detects the position touched by the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends them to the processor 280; it can also receive commands from the processor 280 and execute them. In addition, the touch panel 231 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 230 may include other input devices 232 in addition to the touch panel 231. In particular, the other input devices 232 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 240 may be used to display information input by the user or information provided to the user, as well as various menus of the mobile phone. The display unit 240 may include a display panel 241; alternatively, the display panel 241 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like. Further, the touch panel 231 may cover the display panel 241; when the touch panel 231 detects a touch operation on or near it, the touch operation is transferred to the processor 280 to determine the type of the touch event, and the processor 280 then provides a corresponding visual output on the display panel 241 according to the type of the touch event. Although in fig. 2 the touch panel 231 and the display panel 241 are two independent components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 231 and the display panel 241 may be integrated to implement the input and output functions of the mobile phone.
The processor 280 is a control center of the mobile phone, and connects various parts of the entire mobile phone using various interfaces and lines, and performs various functions and processes of the mobile phone by running or executing software programs and/or modules stored in the memory 220, and calling data stored in the memory 220, thereby performing overall monitoring of the mobile phone. Optionally, the processor 280 may include one or more processing units; preferably, the processor 280 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 280.
The handset further includes a power supply 290 (e.g., a battery) for powering the various components. The power supply may be logically connected to the processor 280 via a power management system, which performs functions such as charging, discharging, and power consumption management.
Although not shown, the handset may also include a camera. Optionally, the position of the camera on the mobile phone may be front or rear, which is not limited by the embodiment of the present application.
Alternatively, the mobile phone may include a single camera, a dual camera, or a triple camera, which is not limited in the embodiment of the present application.
For example, a cell phone may include three cameras, one of which is a main camera, one of which is a wide angle camera, and one of which is a tele camera.
Alternatively, when the mobile phone includes a plurality of cameras, the plurality of cameras may be all front-mounted, all rear-mounted, or one part of front-mounted, another part of rear-mounted, which is not limited by the embodiment of the present application.
In addition, although not shown, the mobile phone may further include a Bluetooth module, etc., which will not be described herein.
Fig. 3 is a schematic flowchart of a text region recognition method provided by the present application, which may be applied to the terminal device 110 described above, as shown in fig. 3, by way of example and not limitation, and the method may include:
S301, acquiring an image to be detected, wherein the image to be detected comprises at least one text line.
In the process of detecting the text region where at least one text line in the image to be detected is located, the terminal equipment can analyze and identify the image to be detected through its text recognition model. Before this analysis and recognition, the terminal device may acquire the image to be detected including at least one text line.
In one possible implementation manner, the terminal device may photograph an object including at least one text line with a preset shooting function to obtain the image to be detected. For example, after detecting a user-triggered operation to start the shooting function, a shooting interface may be displayed showing at least one text line to be captured; if a user-triggered shooting operation is detected, the image displayed on the shooting interface may be stored to obtain the image to be detected.
Of course, the image to be detected may be obtained in other manners, for example, the image to be detected may be selected from a storage space of the terminal device according to an operation triggered by a user, and the manner of obtaining the image to be detected is not limited in the embodiment of the present application.
S302, inputting an image to be detected into a trained text recognition model for processing, and obtaining a text region formed by the text line head and the text line tail of at least one text line.
The text recognition model is obtained through training according to head and tail characteristic point regression branches, and the head and tail characteristic point regression branches are used for adjusting weights of the head and tail of the text in the text recognition model. For example, the head-to-tail feature point regression branch may be a feature point (landmark) regression branch.
After obtaining the image to be detected, the terminal device can input the image to be detected into a pre-trained text recognition model, and detect the text line head and the text line tail of each text line through the text recognition model, so as to obtain a text region through recognition.
In one possible implementation manner, the image to be detected may be input into the text recognition model, which may split each text line in the image to be detected to obtain a plurality of sub-regions. Combining the increased weights of the text line head and tail with the reduced weights within the text line, the model identifies the plurality of sub-regions and determines the type of each sub-region, that is, whether each sub-region is located at the text line head, the text line tail or within the text line. The text line head, text line tail or text line can thus be used as the type of each sub-region, and the text region of the image to be detected is determined according to the text line head and text line tail of each text line.
In summary, according to the text region recognition method provided by the embodiment of the application, an image to be detected including at least one text line is obtained and input into the text recognition model. Because the text recognition model is obtained through head-tail feature point regression branch training, the weights of the text line head and tail in the model are adjusted and their weight in the recognition process is increased, so that the model can accurately recognize the head and tail of each text line and output a text region formed by the line heads and line tails of a plurality of text lines. This avoids the problem of inaccurate text region recognition caused by inaccurate recognition of line heads and tails, improving both the accuracy of recognizing the head and tail of a text line and the accuracy of recognizing the text region.
Furthermore, in the process of identifying the text region by the above method, only the weights occupied by the head and tail of the text line are adjusted, that is, only parameters within the text recognition model are adjusted; the added computation is minimal, so recognition accuracy is improved while recognition efficiency is maintained.
Further, by accurately identifying the text region, situations such as missed characters and misjudgments in the process of recognizing each character in the text region can be effectively avoided, improving the accuracy of recognizing each character in the text region.
The above-mentioned embodiments are implemented based on a text recognition model in a terminal device, and the text recognition model may be trained on a large number of sample images. Referring to fig. 4, fig. 4 is a schematic flowchart of a method for training a text recognition model provided by the present application. By way of example and not limitation, the method may be applied to the terminal device 110 or to a server communicatively connected to the terminal device 110, and the method may include:
S401, acquiring position information of the middle point of the line head and the middle point of the line tail of the sample image.
The sample image may include labeling information, where the labeling information is used to indicate a vertex position of an area where a text line is located, the text line head may include an upper line head boundary and a lower line head boundary, and similarly, the text line tail may include an upper line tail boundary and a lower line tail boundary.
Moreover, the line head midpoint position information is used for representing the midpoint position of the line head of the sample text in the sample image and can be calculated according to the upper and lower boundaries of the text line head; similarly, the line tail midpoint position information is used for representing the midpoint position of the line tail of the sample text in the sample image and can be calculated according to the upper and lower boundaries of the text line tail.
For example, as shown in fig. 5, fig. 5 shows a text line and 4 vertices A, B, C, D of the area where the text line is located, the labeling information may be coordinates corresponding to the vertices A, B, C, D, where the vertex a may be an upper line boundary, the vertex D may be a lower line boundary, the vertex B may be an upper line tail boundary, and the vertex C may be a lower line tail boundary.
Correspondingly, the coordinates of point A and point D in the figure can be averaged to obtain the coordinates of point E, namely the line head midpoint; similarly, the coordinates of the line tail midpoint F can be obtained by averaging the coordinates of point B and point C.
In the process of training the initial text recognition model, a line head-line tail error is generated according to the line head target image and the line tail target image, so that the weights of the text line head and tail in the initial text recognition model are adjusted according to this error to obtain the text recognition model.
Since the line head target image and the line tail target image are generated based on the line head midpoint and the line tail midpoint of the text line, the line head midpoint position information and the line tail midpoint position information of the text line in the sample image must be determined from the line head and line tail of the text line before the target images are generated.
Alternatively, the sample image may be acquired first, and the first line coordinate and the last line coordinate of the sample image are determined according to the labeling information of the sample image, and then the first line midpoint position information is determined according to the upper boundary coordinate and the lower boundary coordinate of the first line coordinate, and then the last line midpoint position information is determined according to the upper boundary coordinate and the lower boundary coordinate of the last line coordinate.
In one possible implementation manner, a sample image may be acquired first, and label information corresponding to the sample image may be acquired, so as to obtain a line head coordinate for representing a line head of a text line in the sample image, and a line tail coordinate for representing a line tail of the text line in the sample image.
And the line head coordinate and the line tail coordinate can both comprise an upper boundary coordinate and a lower boundary coordinate, so that calculation can be performed according to the upper boundary coordinate and the lower boundary coordinate of the line head coordinate, and the average value of the upper boundary coordinate and the lower boundary coordinate is calculated to obtain the position information of the middle point of the line head, namely, the coordinate corresponding to the middle point of the line head.
Similarly, the calculation may be performed in the above manner to obtain the position information of the middle point of the line tail, which indicates the coordinates corresponding to the middle point of the line tail.
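As a purely illustrative sketch (not part of the original disclosure), the midpoint computation of S401 can be written as follows in Python; the function names are hypothetical, and the vertex layout A, B, C, D follows fig. 5:

def midpoint(p, q):
    # average two (x, y) coordinates
    return ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)

def head_tail_midpoints(a, b, c, d):
    # a = upper line head boundary, d = lower line head boundary,
    # b = upper line tail boundary, c = lower line tail boundary
    head_mid = midpoint(a, d)   # point E in fig. 5
    tail_mid = midpoint(b, c)   # point F in fig. 5
    return head_mid, tail_mid

# Example with labeling coordinates of one text line:
e, f = head_tail_midpoints((10, 20), (200, 22), (200, 52), (10, 50))
# e == (10.0, 35.0) is the line head midpoint; f == (200.0, 37.0) is the line tail midpoint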
S402, generating a head-of-line initial image and a tail-of-line initial image according to the downsampling magnification of the initial text recognition model.
The initial line head image and the initial line tail image can be all-zero feature images. Moreover, the downsampling magnification may be determined by a neural network backbone structure of the initial text recognition model, and the backbone may include a classification branch for determining whether the sampling point is a point within the text region and a regression branch for determining a boundary position of the text region below the sampling point.
In the process of generating the head-of-line target image and the tail-of-line target image, not only the head-of-line midpoint and the tail-of-line midpoint are needed, but also the head-of-line initial image and the tail-of-line initial image are generated according to the downsampling magnification of the initial text recognition model, so that in the subsequent step, the head-of-line target image and the tail-of-line target image can be generated according to the head-of-line initial image and the tail-of-line initial image.
Alternatively, the initial resolution may be obtained by performing calculation according to the downsampling magnification of the initial text recognition model and the resolution of the input image of the initial text recognition model, and then generating the initial head-of-line image and the initial tail-of-line image according to the initial resolution.
The input image may be an image into which a preset recognition model is input.
In one possible implementation manner, the downsampling magnification of the initial text recognition model and the resolution of the input image may be obtained respectively, and a quotient between the resolution of the input image and the downsampling magnification may be calculated, so that the quotient is used as the initial resolution, and a line head initial image and a line tail initial image consistent with the initial resolution are regenerated.
For example, if the resolution of the input image is 640×640 and the downsampling magnification is 8, the initial resolution may be calculated as 640/8 = 80, i.e. the head-of-line and tail-of-line initial images are 80×80.
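A minimal sketch of S402, assuming a square input image and integer division (the function name is hypothetical):

import numpy as np

def make_initial_images(input_resolution=640, downsample=8):
    # the initial resolution is the input resolution divided by the downsampling magnification
    r = input_resolution // downsample               # e.g. 640 / 8 = 80
    head_init = np.zeros((r, r), dtype=np.float32)   # all-zero line head feature image
    tail_init = np.zeros((r, r), dtype=np.float32)   # all-zero line tail feature image
    return head_init, tail_init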
S403, generating a head-of-line target image according to the head-of-line midpoint position information and a tail-of-line target image according to the tail-of-line midpoint position information.
Since the line head midpoint and the line tail midpoint calculated in S401 indicate the start point and the end point of the text line respectively, the line head target image and the line tail target image can be generated in the line head initial image and the line tail initial image based on these two midpoints.
Before generating the head-of-line target image and the tail-of-line target image, the head-of-line midpoint can be mapped into the head-of-line initial image and the tail-of-line midpoint into the tail-of-line initial image, obtaining the respective mapping positions, so that the head-of-line target image and the tail-of-line target image can be generated in the two initial images according to the two mapping positions.
Optionally, a line head mapping position of a line head point in the line head initial image can be obtained according to the downsampling ratio of the initial text recognition model and the line head point position information, and then a line head target image is generated by combining the line head initial image according to the line head mapping position, the line head coordinate and the downsampling ratio of the initial text recognition model.
In one possible implementation manner, the line head midpoint may be mapped according to the downsampling magnification of the initial text recognition model, that is, the quotient between the line head midpoint position information and the downsampling magnification is calculated, and this quotient is used as the coordinate of the line head mapping position in the line head initial image. With the line head mapping position as the center, the line head line height is calculated according to the upper and lower boundary coordinates of the line head coordinate, and, combined with the downsampling magnification, a randomly distributed image is generated in the line head initial image to obtain the line head target image.
For example, if the coordinates of the line head midpoint are (x, y) and the downsampling magnification is 8, the coordinates of the mapping position in the line head initial image are (x/8, y/8). If the line head line height is Ls, a two-dimensional Gaussian distribution map as shown in fig. 6 may be generated with (x/8, y/8) as the mean and Ls/(3×downsampling magnification) as the variance, and this two-dimensional Gaussian distribution map may be used as the line head target image.
Similarly, a line tail mapping position of a line tail middle point in a line tail initial image can be obtained according to the downsampling multiplying power of the initial text recognition model and the line tail middle point position information, and then a line tail target image is generated by combining the line tail initial image according to the line tail mapping position, the line tail coordinates and the downsampling multiplying power of the initial text recognition model.
Because the process of generating the end-of-line target image is similar to the process of generating the head-of-line target image described above, the description thereof will not be repeated.
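Both target images might therefore be generated as in the following sketch; treating Ls/(3×downsampling magnification) as the standard deviation of the Gaussian is an assumption (the text above calls it the variance), and all names are hypothetical:

import numpy as np

def target_image(init_img, midpoint_xy, line_height, downsample=8):
    # map the midpoint into the initial image: (x, y) -> (x/8, y/8)
    cx, cy = midpoint_xy[0] / downsample, midpoint_xy[1] / downsample
    sigma = line_height / (3.0 * downsample)    # assumed to act as the std-dev
    h, w = init_img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # two-dimensional Gaussian centred on the mapped midpoint, peak value 1
    gauss = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    return np.maximum(init_img, gauss)

The same function serves for the line head target image (with the line head midpoint and line head line height) and the line tail target image (with the line tail midpoint and line tail line height).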
In practical application, training is required according to a large number of sample images, but for simplicity and convenience, the present application is described by taking only one sample image as an example, and the number of sample images is not limited in the embodiment of the present application.
In addition, the embodiment of the present application is described with the example that S401 to S403 are executed before S404, but in practical application, S401 to S403 may be executed after S404 and before S405, or may be executed after S405 and before S406, and the embodiment of the present application does not limit the timing of executing S401 to S403.
S404, inputting the sample image into a preset recognition model to obtain an area recognition image output by the initial text recognition model, and a head recognition image and a tail recognition image output by the head and tail feature point regression branches.
The preset recognition model comprises an initial text recognition model and head-tail characteristic point regression branches, and the initial text recognition model and the head-tail characteristic point regression branches can recognize text areas where each text line in the sample image is located.
And the head-tail characteristic point regression branch is mainly used for detecting the head and tail of the text line of each text line in the sample image.
In one possible implementation manner, a sample image may be input into the preset recognition model; operations such as denoising and feature extraction are performed on the sample image through a neural network in the model to obtain feature information of the sample image. The feature information is input into the initial text recognition model and analyzed by it, so as to obtain a region recognition image indicating the area where each text line is located.
Meanwhile, the feature information can also be input into the head-tail feature point regression branch, which uses the feature information to identify the line head and line tail of each text line in the sample image, obtaining a line head recognition image indicating the positions of the text line heads and a line tail recognition image indicating the positions of the text line tails.
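A hypothetical PyTorch sketch of such a preset recognition model is given below; the patent does not specify a network architecture, so the backbone, channel counts and output shapes are all assumptions:

import torch
import torch.nn as nn

class PresetRecognitionModel(nn.Module):
    # shared backbone + region branch (initial text recognition model)
    # + head/tail feature point regression branch; purely illustrative
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.backbone = nn.Sequential(          # 8x downsampling magnification
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.region_branch = nn.Conv2d(feat_ch, 5, 1)  # region identification image
        self.head_branch = nn.Conv2d(feat_ch, 1, 1)    # line head recognition image
        self.tail_branch = nn.Conv2d(feat_ch, 1, 1)    # line tail recognition image

    def forward(self, x):
        feats = self.backbone(x)                # feature information of the image
        return (self.region_branch(feats),
                torch.sigmoid(self.head_branch(feats)),
                torch.sigmoid(self.tail_branch(feats)))

# region, head_img, tail_img = PresetRecognitionModel()(torch.rand(1, 3, 640, 640))
# head_img.shape == (1, 1, 80, 80), matching the initial resolution above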
S405, calculating according to the region identification image, the sample text region of the sample image and the first loss function to obtain a region error.
The region error may be a single parameter value. After the region identification image is obtained, it can be combined with the pre-labeled sample text region of the sample image and a preset first loss function to calculate the region error between the recognized text region and the sample text region.
In one possible implementation, the center, width and height of the text region identified in the region identification image may be compared with the center, width and height of the sample text region, respectively, to obtain the region error.
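A minimal sketch of such a comparison, assuming each region is reduced to a (center_x, center_y, width, height) tuple and the first loss function is a simple L1 distance (the patent does not specify its exact form):

```python
def region_error(pred_box, sample_box):
    # pred_box, sample_box: (center_x, center_y, width, height)
    # hypothetical first loss: sum of absolute differences over the four values
    return sum(abs(p - s) for p, s in zip(pred_box, sample_box))

# e.g. region_error((100, 40, 200, 30), (98, 41, 210, 28)) -> 15
```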
S406, calculating a line head-line tail error according to the line head identification image, the line tail identification image, the line head target image, the line tail target image and the second loss function.
The line head target image and the line tail target image are both generated from the sample image and the initial text recognition model; the generation process is described in S401 to S403.
After the line head identification image and the line tail identification image are obtained, they can be compared with the line head target image and the line tail target image, respectively, to obtain a line head error and a line tail error, from which the line head-line tail error can then be generated.
Generating the line head error and the line tail error also requires the second loss function; because the line head identification image and the line tail identification image correspond to different loss functions, the second loss function may comprise a second line head loss function and a second line tail loss function.
Optionally, the line head error may be calculated from the line head identification image, the line head target image and the second line head loss function, and the line tail error may be calculated from the line tail identification image, the line tail target image and the second line tail loss function; the line head error and the line tail error are then summed to obtain the line head-line tail error.
In one possible implementation, the line head identification image may be compared with the line head target image and combined with a preset second line head loss function to obtain the line head error for line head recognition; likewise, the line tail identification image may be compared with the line tail target image and combined with a preset second line tail loss function to obtain the line tail error for line tail recognition.
The line head error and the line tail error are then summed, and the sum is taken as the line head-line tail error, so that the initial text recognition model can be trained according to this error in the subsequent steps.
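Assuming both second loss functions are mean squared error (the patent leaves their exact form open), the computation might look like the following sketch:

```python
import torch
import torch.nn.functional as F

def line_head_line_tail_error(head_pred, tail_pred, head_target, tail_target):
    """Sketch of S406, with MSE standing in for both second loss functions."""
    head_err = F.mse_loss(head_pred, head_target)   # second line head loss
    tail_err = F.mse_loss(tail_pred, tail_target)   # second line tail loss
    return head_err + tail_err                      # summed line head-line tail error
```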
S407, training the initial text recognition model according to the region error and the line head-line tail error to obtain the text recognition model.
After the region error is obtained, the initial text recognition model can be trained with it and the line head-line tail error, so as to adjust the weights associated with text line heads and line tails in the model; with the adjusted weights, line heads and line tails can be recognized accurately.
Optionally, the region error and the line head-line tail error may be summed to obtain a model error, and the initial text recognition model may then be trained according to the model error to obtain the text recognition model.
In one possible implementation, the line head-line tail error and the region error are added to obtain their sum, i.e. the model error. Back propagation is then performed in the initial text recognition model according to the model error, each parameter in the model is adjusted, and the text recognition model with adjusted text line head and line tail weights is finally obtained.
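A sketch of one such training step, assuming the two errors are already available as differentiable scalar tensors and a standard gradient-based optimizer is used (all names are illustrative):

```python
import torch

def train_step(optimizer, region_err, head_tail_err):
    """Sketch of S407: sum the two errors into a model error and back-propagate."""
    model_err = region_err + head_tail_err  # the summed model error
    optimizer.zero_grad()
    model_err.backward()                    # adjust parameters by back propagation
    optimizer.step()
    return float(model_err.detach())
```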
In summary, according to the method for training a text recognition model provided by the embodiments of the application, a line head target image and a line tail target image are generated from the line heads and line tails of the text in a sample image. The text region of the sample image is recognized by a preset recognition model comprising an initial text recognition model and a head-tail feature point regression branch, yielding a region identification image, a line head identification image and a line tail identification image. The parameters of the initial text recognition model are then adjusted according to the region error obtained from the region identification image and the line head-line tail error obtained from the line head identification image and the line tail identification image, so as to obtain the text recognition model. By combining the region error with the line head-line tail error dedicated to line head and line tail recognition, the weights of text line heads and line tails in the initial text recognition model are adjusted, so that a text recognition model capable of accurately recognizing line heads and line tails is obtained; this improves the accuracy with which the model recognizes text line heads and line tails, and thus the accuracy of text region recognition.
Furthermore, the sample image requires no additional labeling during training, which reduces the cost of acquiring sample images.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
Fig. 7 is a block diagram of a text region recognition apparatus provided by an embodiment of the present application, corresponding to the text region recognition method described in the above embodiments; for convenience of explanation, only the parts related to the embodiments of the present application are shown.
Referring to fig. 7, the apparatus includes:
an image acquisition module 701, configured to acquire an image to be detected, where the image to be detected includes at least one text line;
the recognition module 702 is configured to input the image to be detected into a trained text recognition model and process it to obtain a text region formed by the text line head and text line tail of the at least one text line, where the text recognition model is trained with a head-tail feature point regression branch, and the head-tail feature point regression branch is used to adjust the weights of the text line head and the text line tail in the text recognition model.
Optionally, the preset recognition model includes an initial text recognition model and the head-tail feature point regression branch; referring to fig. 8, the apparatus further includes:
the input module 703 is configured to input a sample image into the preset recognition model, so as to obtain an area recognition image output by the initial text recognition model, and a head recognition image and a tail recognition image output by the head and tail feature point regression branch;
a first calculating module 704, configured to calculate according to the region identification image, the sample text region of the sample image, and the first loss function, to obtain a region error;
the second calculating module 705 is configured to calculate, according to the head-of-line identification image, the tail-of-line identification image, the head-of-line target image, the tail-of-line target image, and the second loss function, to obtain a head-of-line tail error, where the head-of-line target image and the tail-of-line target image are both generated according to the sample image and the initial text recognition model;
and a training module 706, configured to train the initial text recognition model according to the region error and the head-to-tail error of the line, so as to obtain the text recognition model.
Optionally, referring to fig. 9, the apparatus further includes:
an information obtaining module 707, configured to obtain first line midpoint position information and last line midpoint position information of the sample image, where the first line midpoint position information is used to represent a first line midpoint position of a sample text in the sample image, and the last line midpoint position information is used to represent a last line midpoint position of the sample text in the sample image;
A first generation module 708, configured to generate a head-of-line initial image and a tail-of-line initial image according to the downsampling magnification of the initial text recognition model;
the second generating module 709 is configured to generate the head-of-line target image according to the mapping position of the head-of-line midpoint position information in the head-of-line initial image, and to generate the tail-of-line target image according to the mapping position of the tail-of-line midpoint position information in the tail-of-line initial image.
Optionally, the information obtaining module 707 is further configured to obtain the sample image; determining a line head coordinate and a line tail coordinate of the sample image according to the labeling information of the sample image, wherein the line head coordinate and the line tail coordinate both comprise an upper boundary coordinate and a lower boundary coordinate; and determining the position information of the midpoint of the line head according to the upper boundary coordinate and the lower boundary coordinate of the line head coordinate, and determining the position information of the midpoint of the line tail according to the upper boundary coordinate and the lower boundary coordinate of the line tail coordinate.
Optionally, the second generating module 709 is further configured to obtain a line head mapping position of a line head midpoint in the line head initial image according to the downsampling magnification of the initial text recognition model and the line head midpoint position information; generating the line head target image by combining the line head initial image according to the line head mapping position, the line head coordinate and the downsampling multiplying power of the initial text recognition model;
The second generating module 709 is further configured to obtain a line tail mapping position of a line tail midpoint in the line tail initial image according to the downsampling magnification of the initial text recognition model and the line tail midpoint position information; and generating the line tail target image by combining the line tail initial image according to the line tail mapping position, the line tail coordinates and the downsampling multiplying power of the initial text recognition model.
Optionally, the first generating module 708 is further configured to calculate, according to the downsampling magnification of the initial text recognition model and a resolution of an input image of the initial text recognition model, to obtain an initial resolution; and respectively generating the line head initial image and the line tail initial image according to the initial resolution.
Optionally, the second loss function includes a second head-of-line loss function and a second tail-of-line loss function;
the second calculating module 705 is further configured to calculate, according to the line head identification image, the line head target image, and the second line head loss function, a line head error; calculating according to the line tail identification image, the line tail target image and the second line tail loss function to obtain a line tail error; and summing the head-of-line error and the tail-of-line error to obtain the head-of-line tail-of-line error.
Optionally, the training module 706 is further configured to sum the region error and the head-to-tail error of the line to obtain a model error; training the initial text recognition model according to the model error to obtain the text recognition model.
In summary, the text region recognition apparatus provided by the embodiments of the application acquires an image to be detected containing at least one text line and inputs it into the text recognition model. Because the text recognition model is trained with the head-tail feature point regression branch, the weights of text line heads and line tails in the model are adjusted and the influence of line heads and line tails in the recognition process is increased, so that the model can accurately recognize the line head and line tail of each text line and output a text region composed of the text line heads and line tails of the text lines. This avoids the inaccurate text region recognition caused by inaccurate line head and line tail recognition, improves the accuracy of recognizing text line heads and line tails, and thus improves the accuracy of text region recognition.
The embodiments of the application also provide a terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, where the processor implements the text region identification method according to any one of the embodiments corresponding to figs. 3 to 4 when executing the computer program.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements a text region recognition method according to any one of the corresponding embodiments of fig. 3 to 4.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts not described or detailed in one embodiment, reference may be made to the related descriptions of the other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the system embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program, which may be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to a terminal device, a recording medium, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk or an optical disk. In some jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them; although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application and are intended to be included in the scope of the present application.

Claims (18)

1. A text region identification method, the method comprising:
acquiring an image to be detected, wherein the image to be detected comprises at least one text line;
inputting the image to be detected into a trained text recognition model for processing to obtain a text region formed by a text line head and a text line tail of at least one text line, wherein the text recognition model is obtained by training according to head and tail characteristic point regression branches, and the head and tail characteristic point regression branches are used for adjusting weights of the text line head and the text line tail in the text recognition model; and the text recognition model determines a text region in the image to be detected according to the text line head and the text line tail of at least one text line.
2. The method of claim 1, wherein the pre-set recognition model comprises an initial text recognition model and the head-to-tail feature point regression branch;
before the input of the image to be detected into the trained text recognition model, the method further comprises:
inputting a sample image into the preset recognition model to obtain a region recognition image output by the initial text recognition model, and a head recognition image and a tail recognition image output by the head and tail feature point regression branch;
calculating according to the region identification image, the sample text region of the sample image and the first loss function to obtain a region error;
calculating according to the head-of-line identification image, the tail-of-line identification image, the head-of-line target image, the tail-of-line target image and the second loss function to obtain head-of-line and tail-of-line errors, wherein the head-of-line target image and the tail-of-line target image are generated according to the sample image and the initial text identification model;
and training the initial text recognition model according to the region error and the head-to-tail error to obtain the text recognition model.
3. The method of claim 2, wherein prior to said calculating from the head of line identification image, the tail of line identification image, the head of line target image, the tail of line target image, and the second loss function, the method further comprises:
Acquiring line head midpoint position information and line tail midpoint position information of the sample image, wherein the line head midpoint position information is used for representing the midpoint position of the line head of the sample text in the sample image, and the line tail midpoint position information is used for representing the midpoint position of the line tail of the sample text in the sample image;
generating a head-of-line initial image and a tail-of-line initial image according to the downsampling magnification of the initial text recognition model;
and generating the head-of-line target image according to the mapping position of the head-of-line midpoint position information in the head-of-line initial image, and generating the tail-of-line target image according to the mapping position of the tail-of-line midpoint position information in the tail-of-line initial image.
4. The method of claim 3, wherein the acquiring the head-of-line midpoint location information and the tail-of-line midpoint location information of the sample image comprises:
acquiring the sample image;
determining a line head coordinate and a line tail coordinate of the sample image according to the labeling information of the sample image, wherein the line head coordinate and the line tail coordinate both comprise an upper boundary coordinate and a lower boundary coordinate;
and determining the position information of the middle point of the line head according to the upper boundary coordinate and the lower boundary coordinate of the line head coordinate, and determining the position information of the middle point of the line tail according to the upper boundary coordinate and the lower boundary coordinate of the line tail coordinate.
5. The method of claim 4, wherein the generating the head-of-line target image according to the mapping position of the head-of-line midpoint position information in the head-of-line initial image comprises:
acquiring a line head mapping position of a line head midpoint in the line head initial image according to the downsampling multiplying power of the initial text recognition model and the line head midpoint position information;
generating the line head target image by combining the line head initial image according to the line head mapping position, the line head coordinates and the downsampling multiplying power of the initial text recognition model;
the generating the line tail target image according to the mapping position of the line tail midpoint position information in the line tail initial image comprises:
acquiring a line tail mapping position of a line tail midpoint in the line tail initial image according to the downsampling multiplying power of the initial text recognition model and the line tail midpoint position information;
and generating the line tail target image by combining the line tail initial image according to the line tail mapping position, the line tail coordinates and the downsampling multiplying power of the initial text recognition model.
6. The method of claim 3, wherein generating a head-of-line initial image and a tail-of-line initial image from the downsampling magnification of the initial text recognition model comprises:
Calculating according to the downsampling ratio of the initial text recognition model and the resolution of the input image of the initial text recognition model to obtain initial resolution;
and respectively generating the head-of-line initial image and the tail-of-line initial image according to the initial resolution.
7. The method of any of claims 2 to 6, wherein the second loss function comprises a second head-of-line loss function and a second tail-of-line loss function;
the calculating according to the head-of-line identification image, the tail-of-line identification image, the head-of-line target image, the tail-of-line target image and the second loss function to obtain head-of-line and tail-of-line errors comprises:
calculating according to the line head identification image, the line head target image and the second line head loss function to obtain a line head error;
calculating according to the line tail identification image, the line tail target image and the second line tail loss function to obtain a line tail error;
and summing the head-of-line error and the tail-of-line error to obtain the head-of-line tail-of-line error.
8. The method according to any one of claims 2 to 6, wherein the training the initial text recognition model according to the region error and the head-to-tail error to obtain the text recognition model comprises:
Summing the regional error and the head-to-tail error of the line to obtain a model error;
training the initial text recognition model according to the model error to obtain the text recognition model.
9. A text region recognition device, the device comprising:
the image acquisition module is used for acquiring an image to be detected, wherein the image to be detected comprises at least one text line;
the recognition module is used for inputting the image to be detected into a trained text recognition model for processing to obtain a text region formed by the text line head and the text line tail of the at least one text line, the text recognition model is obtained by training according to head and tail characteristic point regression branches, and the head and tail characteristic point regression branches are used for adjusting weights of the text line head and the text line tail in the text recognition model; and the text recognition model determines a text region in the image to be detected according to the text line head and the text line tail of at least one text line.
10. The apparatus of claim 9, wherein the pre-set recognition model comprises an initial text recognition model and the head-to-tail feature point regression branch; the apparatus further comprises:
The input module is used for inputting a sample image into the preset recognition model to obtain an area recognition image output by the initial text recognition model, and a head recognition image and a tail recognition image output by the head and tail feature point regression branch;
the first calculation module is used for calculating according to the region identification image, the sample text region of the sample image and the first loss function to obtain a region error;
the second calculation module is used for calculating according to the head-of-line identification image, the tail-of-line identification image, the head-of-line target image, the tail-of-line target image and the second loss function to obtain head-of-line tail errors, and the head-of-line target image and the tail-of-line target image are generated according to the sample image and the initial text identification model;
and the training module is used for training the initial text recognition model according to the region error and the head-to-tail error to obtain the text recognition model.
11. The apparatus of claim 10, wherein the apparatus further comprises:
the information acquisition module is used for acquiring line head midpoint position information and line tail midpoint position information of the sample image, wherein the line head midpoint position information is used for representing the midpoint position of the line head of the sample text in the sample image, and the line tail midpoint position information is used for representing the midpoint position of the line tail of the sample text in the sample image;
The first generation module is used for generating a head-of-line initial image and a tail-of-line initial image according to the downsampling multiplying power of the initial text recognition model;
the second generating module is used for generating the head-of-line target image according to the mapping position of the head-of-line midpoint position information in the head-of-line initial image, and generating the tail-of-line target image according to the mapping position of the tail-of-line midpoint position information in the tail-of-line initial image.
12. The apparatus of claim 11, wherein the information acquisition module is further configured to acquire the sample image; determining a line head coordinate and a line tail coordinate of the sample image according to the labeling information of the sample image, wherein the line head coordinate and the line tail coordinate both comprise an upper boundary coordinate and a lower boundary coordinate; and determining the position information of the middle point of the line head according to the upper boundary coordinate and the lower boundary coordinate of the line head coordinate, and determining the position information of the middle point of the line tail according to the upper boundary coordinate and the lower boundary coordinate of the line tail coordinate.
13. The apparatus of claim 12, wherein the second generating module is further configured to obtain a head of line mapping position of a head of line point in the head of line initial image according to a downsampling magnification of the initial text recognition model and the head of line point position information; generating the line head target image by combining the line head initial image according to the line head mapping position, the line head coordinates and the downsampling multiplying power of the initial text recognition model;
The second generation module is further configured to obtain a line tail mapping position of a line tail midpoint in the line tail initial image according to the downsampling magnification of the initial text recognition model and the line tail midpoint position information; and generating the line tail target image by combining the line tail initial image according to the line tail mapping position, the line tail coordinates and the downsampling multiplying power of the initial text recognition model.
14. The apparatus of claim 11, wherein the first generation module is further configured to calculate an initial resolution based on a downsampling magnification of the initial text recognition model and a resolution of an input image of the initial text recognition model; and respectively generating the head-of-line initial image and the tail-of-line initial image according to the initial resolution.
15. The apparatus of any of claims 10 to 14, wherein the second loss function comprises a second head-of-line loss function and a second tail-of-line loss function;
the second calculation module is further configured to calculate according to the line head identification image, the line head target image, and the second line head loss function, to obtain a line head error; calculating according to the line tail identification image, the line tail target image and the second line tail loss function to obtain a line tail error; and summing the head-of-line error and the tail-of-line error to obtain the head-of-line tail-of-line error.
16. The apparatus of any of claims 10 to 14, wherein the training module is further configured to sum the region error and the line head-to-line tail error to obtain a model error; training the initial text recognition model according to the model error to obtain the text recognition model.
17. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 8 when executing the computer program.
18. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the method according to any one of claims 1 to 8.
CN201911159636.XA 2019-11-22 2019-11-22 Text region identification method, device, terminal equipment and readable storage medium Active CN111062258B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911159636.XA CN111062258B (en) 2019-11-22 2019-11-22 Text region identification method, device, terminal equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911159636.XA CN111062258B (en) 2019-11-22 2019-11-22 Text region identification method, device, terminal equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN111062258A CN111062258A (en) 2020-04-24
CN111062258B true CN111062258B (en) 2023-10-24

Family

ID=70298126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911159636.XA Active CN111062258B (en) 2019-11-22 2019-11-22 Text region identification method, device, terminal equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111062258B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113869306A (en) * 2020-06-30 2021-12-31 北京搜狗科技发展有限公司 Text positioning method and device and electronic equipment
CN113343987B (en) * 2021-06-30 2023-08-22 北京奇艺世纪科技有限公司 Text detection processing method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008003095A2 (en) * 2006-06-29 2008-01-03 Google Inc. Recognizing text in images
CN104966051A (en) * 2015-06-03 2015-10-07 中国科学院信息工程研究所 Method of recognizing layout of document image
CN107679033A (en) * 2017-09-11 2018-02-09 百度在线网络技术(北京)有限公司 Text punctuate location recognition method and device
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism
CN110070085A (en) * 2019-04-30 2019-07-30 北京百度网讯科技有限公司 Licence plate recognition method and device


Also Published As

Publication number Publication date
CN111062258A (en) 2020-04-24


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant