CN114724168A - Training method of deep learning model, text recognition method, text recognition device and text recognition equipment


Info

Publication number
CN114724168A
CN114724168A
Authority
CN
China
Prior art keywords
text
sample image
loss
network
deep learning
Prior art date
Legal status
Pending (the status is an assumption, not a legal conclusion)
Application number
CN202210506419.9A
Other languages
Chinese (zh)
Inventor
蒋凯涛
杜宇宁
李晨霞
杨烨华
赖宝华
毕然
胡晓光
于佃海
马艳军
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210506419.9A
Publication of CN114724168A

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 — Handling natural language data
    • G06F 40/30 — Semantic analysis
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/08 — Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)
  • Character Discrimination (AREA)

Abstract

The present disclosure provides a training method of a deep learning model, relating to the field of artificial intelligence and, in particular, to the fields of natural language processing and image processing. The implementation scheme is as follows: performing character prediction on text information in a sample image according to features of the sample image to obtain a first text; performing semantic learning on the text information in the sample image according to the features of the sample image to obtain a second text; determining a loss of the deep learning model according to the first text and the second text; and adjusting parameters of the deep learning model according to the loss. The present disclosure also provides a training apparatus of the deep learning model, a text recognition method and apparatus, an electronic device, and a storage medium.

Description

Training method of deep learning model, text recognition method, text recognition device and text recognition equipment
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly, to natural language processing and image processing techniques. More specifically, the present disclosure provides a training method of a deep learning model, a text recognition method and apparatus, an electronic device, and a storage medium.
Background
OCR (Optical Character Recognition) is a technique that can convert picture information into text information that is easier to edit and store. Text recognition is a subtask of OCR, whose task is to recognize text content in a fixed area.
Disclosure of Invention
The disclosure provides a training method of a deep learning model, a text recognition method, a device, equipment and a storage medium.
According to a first aspect, there is provided a training method of a deep learning model, the method comprising: according to the characteristics of the sample image, performing character prediction on text information in the sample image to obtain a first text; according to the characteristics of the sample image, semantic learning is carried out on text information in the sample image to obtain a second text; determining the loss of the deep learning model according to the first text and the second text; and adjusting parameters of the deep learning model according to the loss.
According to a second aspect, there is provided a text recognition method, the method comprising: acquiring the characteristics of an image to be identified; inputting the characteristics of the image to be recognized into a character prediction network to obtain text information in the image to be recognized; the character prediction network is obtained by training according to the training method of the deep learning model.
According to a third aspect, there is provided an apparatus for training a deep learning model, the apparatus comprising: the character prediction module is used for performing character prediction on text information in the sample image according to the characteristics of the sample image to obtain a first text; the semantic learning module is used for performing semantic learning on the text information in the sample image according to the characteristics of the sample image to obtain a second text; the loss determining module is used for determining the loss of the deep learning model according to the first text and the second text; and the parameter adjusting module is used for adjusting the parameters of the deep learning model according to the loss.
According to a fourth aspect, there is provided a text recognition apparatus comprising: the acquisition module is used for acquiring the characteristics of the image to be identified; the recognition module is used for inputting the characteristics of the image to be recognized into the character prediction network to obtain text information in the image to be recognized; the character prediction network is obtained by training according to the training device of the deep learning model.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method provided in accordance with the present disclosure.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method provided according to the present disclosure.
According to a seventh aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method provided according to the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of an exemplary system architecture to which the training method and apparatus of a deep learning model and the text recognition method and apparatus may be applied, according to one embodiment of the present disclosure;
FIG. 2 is a flow diagram of a method of training a deep learning model according to one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a training method of a deep learning model according to one embodiment of the present disclosure;
FIG. 4 is a flow diagram of a text recognition method according to one embodiment of the present disclosure;
FIG. 5 is a block diagram of a training apparatus for deep learning models according to one embodiment of the present disclosure;
FIG. 6 is a block diagram of a text recognition device according to one embodiment of the present disclosure;
FIG. 7 is a block diagram of an electronic device for a training method and/or a text recognition method for a deep learning model according to one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
OCR technology is widely used in various scenarios such as bill recognition, bank card information recognition, formula recognition, and the like. OCR is also used for many downstream tasks such as subtitle translation, security monitoring, etc. In addition, OCR is also used for other visual tasks, such as video searching, etc. Text recognition is a subtask of OCR whose task is to recognize text content in a fixed area.
One text recognition approach is based mainly on the CTC (Connectionist Temporal Classification) decoding algorithm. For example, in a neural network model, the CTC algorithm is used after text detection to convert image information into text information. The input of the neural network model is a localized text-line picture, and the model predicts the text content in the picture together with a confidence score. CTC-based text recognition is fast, but its accuracy is limited.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
In the technical scheme of the disclosure, before the personal information of the user is acquired or collected, the authorization or the consent of the user is acquired.
Fig. 1 is a schematic diagram of an exemplary system architecture to which the training method of a deep learning model and the text recognition method can be applied, according to one embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied, intended to help those skilled in the art understand the technical content of the present disclosure; it does not mean that the embodiments of the present disclosure cannot be applied to other devices, systems, environments, or scenarios.
As shown in fig. 1, the system architecture 100 according to this embodiment may include a plurality of terminal devices 101, a network 102, and a server 103. The network 102 is the medium used to provide communication links between the terminal devices 101 and the server 103. The network 102 may include various connection types, such as wired and/or wireless communication links.
A user may use terminal device 101 to interact with server 103 over network 102 to receive or send messages and the like. The terminal device 101 may be a variety of electronic devices including, but not limited to, a smart phone, a tablet computer, a laptop computer, and the like.
At least one of the training method of the deep learning model and the text recognition method provided by the embodiments of the present disclosure may be generally performed by the server 103. Accordingly, the training device of the deep learning model and the text recognition device provided by the embodiments of the present disclosure may be generally disposed in the server 103. The training method of the deep learning model and the text recognition method provided by the embodiment of the disclosure may also be executed by a server or a server cluster which is different from the server 103 and can communicate with the terminal device 101 and/or the server 103. Accordingly, the training device of the deep learning model and the text recognition device provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 103 and capable of communicating with the terminal device 101 and/or the server 103.
FIG. 2 is a flow diagram of a method of training a deep learning model according to one embodiment of the present disclosure.
As shown in fig. 2, the training method 200 of the deep learning model may include operations S210 to S240.
In operation S210, character prediction is performed on text information in the sample image according to the characteristics of the sample image, so as to obtain a first text.
For example, the sample image may be an image containing text information, and the deep learning model may be trained with a goal of recognizing the text information in the sample image. The input to the deep learning model may be a sample image and the output may be textual information in the sample image.
The deep learning model may include a backbone network composed of a plurality of convolutional layers. The sample image is input into the deep learning model, and the features of the sample image are obtained through the backbone network.
Based on the characteristics of the sample image, a mechanism of character prediction may be employed to achieve the above text recognition goal. Specifically, the deep learning model may include a character prediction network, the character prediction network may be implemented based on a CTC decoding algorithm, and the character prediction network may implement prediction of text information in a sample image based on features of the sample image, so as to obtain the first text.
The CTC decoding algorithm has high decoding speed, so that the text information in the sample image can be quickly identified by adopting a character prediction mechanism.
In operation S220, semantic learning is performed on the text information in the sample image according to the features of the sample image, so as to obtain a second text.
Based on the characteristics of the sample image, a semantic learning mechanism can be adopted to achieve the text recognition target. Specifically, the deep learning model may include a semantic learning network, the semantic learning network may be implemented by an attention mechanism or a Transformer module, and the semantic learning network may implement semantic learning on text information in the sample image based on features of the sample image to obtain the second text.
Unlike the character prediction mechanism, text recognition based on a semantic learning mechanism generally requires recurrent decoding and therefore achieves higher recognition accuracy.
In operation S230, a loss of the deep learning model is determined according to the first text and the second text.
The loss of the deep learning model may include a loss of the character prediction network and a loss of the semantic learning network. The loss of the character prediction network can be obtained based on the difference between the actual text information in the sample image and the first text, and the loss of the semantic learning network can be obtained based on the difference between the actual text information in the sample image and the second text.
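The two-part loss described above can be sketched in Python. The disclosure mentions mean square error or cross entropy but does not fix the combination, so the per-character cross entropy and the equal weighting below are assumptions for illustration only:

```python
import math

def cross_entropy(pred_probs, target_index):
    # Negative log-likelihood of the ground-truth character class.
    return -math.log(pred_probs[target_index])

def model_loss(ctc_probs, attn_probs, targets, alpha=1.0, beta=1.0):
    # Hypothetical combination: sum each branch's per-character losses,
    # then weight them; alpha/beta are illustrative knobs, not values
    # from the source.
    first_loss = sum(cross_entropy(p, t) for p, t in zip(ctc_probs, targets))
    second_loss = sum(cross_entropy(p, t) for p, t in zip(attn_probs, targets))
    return alpha * first_loss + beta * second_loss
```

With perfect predictions both branch losses vanish; uncertain predictions contribute from both branches.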
In operation S240, parameters of the deep learning model are adjusted according to the loss.
Parameters of the deep learning model can be adjusted based on the loss of the character prediction network and the loss of the semantic learning network.
Based on these two losses (the loss of the character prediction network and the loss of the semantic learning network), various adjustment modes are possible for the three networks of the deep learning model (the backbone network, the character prediction network, and the semantic learning network).
For example, parameters of any one, any two, or all three of these networks may be adjusted based on the sum of the two losses.
For another example, the parameters of the character prediction network may be adjusted based on the loss of the character prediction network, and the parameters of the semantic learning network may be adjusted based on the loss of the semantic learning network.
For another example, the parameters of the character prediction network may be adjusted based on the loss of the character prediction network, and the parameters of the backbone network and the semantic learning network may be adjusted based on the loss of the semantic learning network.
It should be noted that the above adjustment modes are merely examples, and an actual adjustment mode may be determined according to actual needs, which is not limited in the present disclosure.
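The last adjustment mode above (first loss → character prediction network; second loss → backbone and semantic learning network) can be sketched with toy scalar parameters. The gradient values and learning rate are placeholders, not values from the source:

```python
def sgd_step(params, grad, lr=0.1):
    # One plain gradient-descent update on a list of scalar parameters.
    return [p - lr * grad for p in params]

def adjust_parameters(params, first_loss_grad, second_loss_grad):
    # Routing: the first loss (character prediction) updates only that
    # branch, while the second loss (semantic learning) updates both the
    # semantic branch and the shared backbone, so semantic learning
    # guides the backbone that the character prediction branch also uses.
    params["char_pred"] = sgd_step(params["char_pred"], first_loss_grad)
    params["semantic"] = sgd_step(params["semantic"], second_loss_grad)
    params["backbone"] = sgd_step(params["backbone"], second_loss_grad)
    return params
```

In a real framework this routing would correspond to passing each network's parameters to a separate optimizer (or parameter group) driven by the matching loss.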
After obtaining the features of the sample image, the deep learning model recognizes the text through two branches, a character prediction mechanism and a semantic learning mechanism, combining the advantages of both: the speed of text recognition is preserved while its accuracy is improved.
Furthermore, the character prediction network and the semantic learning network share parameters at the backbone network part, and the parameters of the backbone network are adjusted through the loss of the semantic learning network, so that the effect of guiding and training the character prediction network by using the semantic learning network is achieved, and the recognition effect of the character prediction network is further improved.
The embodiment of the disclosure realizes text recognition through character prediction and semantic learning respectively based on the characteristics of the sample image, can ensure the speed of text recognition, and improves the precision of text recognition.
FIG. 3 is a schematic diagram of a method of training a deep learning model according to another embodiment of the present disclosure.
As shown in fig. 3, the deep learning model 300 includes a backbone network 310, a character prediction network 320, and a semantic learning network 330. The backbone network may include a plurality of convolutional layers. The sample image 301 contains text information; it is input into the backbone network 310 to obtain the features 311 of the sample image. For example, if the size of the sample image 301 is 32 × 100 × 3 (H × W × C), inputting it into the backbone network 310 yields a feature vector of dimension 1 × 25 × 512, i.e., the features 311 of the sample image.
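The shape bookkeeping in this embodiment can be checked with a small sketch. The stride values below are assumptions chosen only so that they reproduce the quoted 32 × 100 × 3 → 1 × 25 × 512 mapping; the embodiment does not specify the backbone's internal strides:

```python
def backbone_output_shape(h, w, c, stride_h=32, stride_w=4, out_channels=512):
    # Hypothetical overall downsampling strides: height is collapsed to 1
    # and width to w/4, matching the 1 x 25 x 512 features quoted above.
    return (h // stride_h, w // stride_w, out_channels)

def to_time_steps(feature_shape):
    # The 1 x 25 x 512 feature map is read as 25 time steps of
    # 512-dimensional vectors for the sequence decoder.
    fh, fw, fc = feature_shape
    return (fh * fw, fc)
```

So a 32 × 100 × 3 sample image yields 25 decoding steps, one per feature-map column.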
The character prediction network 320 may include, for example, a CRNN (Convolutional Recurrent Neural Network) module and a CTC module. The CRNN module performs dimension conversion on the input features 311: the features of dimension 1 × 25 × 512 are converted into 25 sequences of dimension 1 × 512 and processed to obtain 25 × n features, where n may be the number of character classes (the size of the character dictionary). The CRNN module transmits the 25 × n features to the CTC module, which performs character alignment and character sequence prediction based on them to obtain the predicted first text. For example, each 1 × n column predicts one character: traversing each 1 × n column and taking the character with the highest probability yields a character sequence; the sequence is then aligned and deduplicated, and the character sequence with the highest overall probability is taken as the predicted first text. The CTC module decodes quickly and can therefore predict the first text rapidly. Based on the difference between the predicted first text and the actual text information in the sample image 301, a first loss 321 is calculated. For example, the first loss may be the mean square error or cross entropy between the first text and the actual text information in the sample image 301.
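The column-wise prediction described above corresponds to greedy CTC decoding. A minimal sketch over toy probability columns (the 25 × n logits of the embodiment would play the role of `columns`; the blank index 0 is an assumption):

```python
def ctc_greedy_decode(columns, blank=0):
    # Greedy CTC decoding: take the most probable class in each column,
    # then collapse consecutive repeats and drop the blank symbol.
    best = [max(range(len(col)), key=col.__getitem__) for col in columns]
    decoded, prev = [], None
    for k in best:
        if k != blank and k != prev:
            decoded.append(k)
        prev = k
    return decoded
```

A blank between two identical classes keeps both characters, which is how CTC represents doubled letters.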
The semantic learning network 330 may be implemented based on an attention network (Attention) or a Transformer module. It decodes the input features 311 recurrently, learning semantic features from the relationships among contextual features, and predicts the second text. Because the semantic learning network 330 uses recurrent decoding, the accuracy of the predicted second text is higher. Based on the difference between the predicted second text and the actual text information in the sample image, a second loss 331 may be calculated. For example, the second loss may be the mean square error or cross entropy between the second text and the actual text information in the sample image 301.
Parameters of the character prediction network 320 can be adjusted according to the first loss 321, and parameters of the backbone network 310 and the semantic learning network 330 can be adjusted according to the second loss 331, so that the updated deep learning model 300 is obtained.
For the next sample image 301, the process returns to the step of inputting the sample image into the backbone network 310 using the updated deep learning model 300, and the training process is repeated until a preset condition is reached (for example, the deep learning model 300 converges), at which point training ends.
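The training loop above can be sketched as follows. The stopping rule and the stub step are illustrative: the disclosure only says training repeats "until a preset condition is reached (e.g., convergence)", so the loss threshold here is an assumption:

```python
def train(run_step, max_steps=1000, loss_threshold=1e-3):
    # Repeat the forward/backward pass (one call of run_step per sample)
    # until the hypothetical preset condition is met: loss below threshold.
    for step in range(max_steps):
        loss = run_step()
        if loss < loss_threshold:
            return step, loss
    return max_steps, loss

# Stub step whose loss halves each call, standing in for one full pass
# (feature extraction, both branches, loss computation, parameter update).
state = {"loss": 1.0}
def fake_step():
    state["loss"] *= 0.5
    return state["loss"]
```

In practice `run_step` would draw the next sample image, compute the first and second losses, and apply the parameter adjustments described above.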
It can be understood that the character prediction network 320 and the semantic learning network 330 share parameters in the backbone network 310, and the parameters of the backbone network 310 are adjusted based on the second loss 331, so that the effect of guiding the character prediction network 320 to train by using the semantic learning network 330 is achieved, and the recognition effect of the character prediction network 320 is further improved.
FIG. 4 is a flow diagram of a text recognition method according to one embodiment of the present disclosure.
As shown in fig. 4, the text recognition method 400 may include operations S410 to S420.
In operation S410, a feature of an image to be recognized is acquired.
In operation S420, the features of the image to be recognized are input into the character prediction network, and text information in the image to be recognized is obtained.
The character prediction network is obtained by training according to the training method of the deep learning model. The deep learning model comprises a backbone network, a character prediction network and a semantic learning network.
For example, in the training stage, the character prediction network and the semantic learning network are trained together. Because the semantic learning network has higher text recognition accuracy, it is used to adjust the parameters of the backbone network; since the character prediction network shares the backbone network's parameters, the semantic learning network thereby guides the training of the character prediction network and improves its text recognition accuracy.
And in the stage of text recognition by using the trained deep learning model, the text recognition can be carried out by using only the character prediction network.
Specifically, an image to be recognized is input into a deep learning model, features of the image to be recognized are extracted through a backbone network, the features of the image to be recognized are transmitted to a character prediction network through the backbone network, and text information in the image to be recognized is decoded and output through the character prediction network based on the features of the image to be recognized.
In the stage of text recognition by using the trained deep learning model, only the character prediction network is used for text recognition, the semantic learning network does not need to be started, and the efficiency of the deep learning model for recognizing the text is not influenced.
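The inference path can be sketched as a simple composition of the trained components; the stub networks below are placeholders standing in for the trained backbone and character prediction network, not real implementations:

```python
def recognize_text(image, backbone, char_pred_net, decode):
    # Inference path: only the backbone and the character prediction
    # branch run; the semantic learning branch is never invoked.
    features = backbone(image)
    logits = char_pred_net(features)
    return decode(logits)

# Placeholder stand-ins for the trained networks.
stub_backbone = lambda img: [[0.1, 0.9], [0.8, 0.2]]   # fake 2-step features
stub_char_pred = lambda feats: feats                   # identity "head"
stub_decode = lambda logits: "ab"[: len(logits)]       # fake decoder
```

Because the semantic learning branch is dropped at inference time, the deployed model keeps CTC-level decoding speed.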
Therefore, performing text recognition with the character prediction network ensures the efficiency of text recognition while improving its accuracy.
FIG. 5 is a block diagram of a training apparatus for deep learning models, according to one embodiment of the present disclosure.
As shown in fig. 5, the training apparatus 500 of the deep learning model includes a character prediction module 501, a semantic learning module 502, a loss determination module 503, and a parameter adjustment module 504.
The character prediction module 501 is configured to perform character prediction on text information in the sample image according to characteristics of the sample image to obtain a first text.
The semantic learning module 502 is configured to perform semantic learning on the text information in the sample image according to the features of the sample image, so as to obtain a second text.
The loss determining module 503 is configured to determine a loss of the deep learning model according to the first text and the second text.
The parameter adjustment module 504 is configured to adjust parameters of the deep learning model based on the loss.
According to an embodiment of the present disclosure, a deep learning model includes a character prediction network and a semantic learning network.
And the character prediction module is used for inputting the characteristics of the sample image into a character prediction network to obtain a first text.
And the semantic learning module is used for inputting the characteristics of the sample image into a semantic learning network to obtain a second text.
According to an embodiment of the present disclosure, the deep learning model further comprises a backbone network; the device also comprises a characteristic extraction module which is used for inputting the sample image into the backbone network to obtain the characteristics of the sample image.
According to an embodiment of the present disclosure, the loss includes a first loss and a second loss, the first loss is determined based on the first text, and the second loss is determined based on the second text; the parameter adjusting module comprises a first adjusting unit and a second adjusting unit.
And the first adjusting unit is used for adjusting the parameters of the character prediction network according to the first loss.
And the second adjusting unit is used for adjusting parameters of the semantic learning network and the backbone network according to the second loss.
Fig. 6 is a block diagram of a text recognition device according to one embodiment of the present disclosure.
As shown in fig. 6, the text recognition device 600 may include an acquisition module 601 and a recognition module 602.
The obtaining module 601 is used for obtaining features of an image to be recognized.
The recognition module 602 is configured to input features of the image to be recognized into a character prediction network, so as to obtain text information in the image to be recognized.
The character prediction network is obtained by training according to the training device of the deep learning model.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be any of various general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 701 performs the methods and processes described above, such as the training method of the deep learning model and/or the text recognition method. For example, in some embodiments, the training method of the deep learning model and/or the text recognition method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the training method of the deep learning model and/or the text recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured in any other suitable way (e.g., by means of firmware) to perform the training method of the deep learning model and/or the text recognition method.
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. Such program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, without limitation herein, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (13)

1. A method of training a deep learning model, the method comprising:
according to the characteristics of a sample image, performing character prediction on text information in the sample image to obtain a first text;
according to the characteristics of the sample image, performing semantic learning on text information in the sample image to obtain a second text;
determining the loss of the deep learning model according to the first text and the second text; and
adjusting parameters of the deep learning model according to the loss.
2. The method of claim 1, wherein the deep learning model comprises a character prediction network and a semantic learning network;
the character prediction of the text information in the sample image according to the characteristics of the sample image to obtain a first text comprises the following steps:
inputting the characteristics of the sample image into the character prediction network to obtain the first text;
the semantic learning of the text information in the sample image according to the characteristics of the sample image to obtain the second text comprises:
inputting the characteristics of the sample image into the semantic learning network to obtain the second text.
3. The method of claim 2, wherein the deep learning model further comprises a backbone network; the method further comprises the following steps:
inputting the sample image into the backbone network to obtain the characteristics of the sample image.
4. The method of claim 3, wherein the loss comprises a first loss and a second loss, the first loss determined based on the first text and the second loss determined based on the second text; the adjusting parameters of the deep learning model according to the loss comprises:
adjusting parameters of the character prediction network according to the first loss; and
adjusting parameters of the semantic learning network and the backbone network according to the second loss.
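The split update rule of claims 1-4 can be sketched as follows. This is a minimal, hypothetical PyTorch-style illustration, not the patent's implementation: the network architectures, the vocabulary size, the per-position labeling, and the use of cross-entropy are all assumptions, and the semantic learning branch is reduced to a second classification head for brevity.

```python
import torch
from torch import nn

VOCAB = 100  # assumed vocabulary size (not specified by the patent)

# Backbone network (claim 3): extracts per-position features from the sample image.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d((1, 32)),  # collapse height, keep 32 horizontal positions
)
char_net = nn.Linear(64, VOCAB)  # character prediction network -> first text
sem_net = nn.Linear(64, VOCAB)   # semantic learning network    -> second text

# Claim 4: the first loss adjusts only the character prediction network;
# the second loss adjusts the semantic learning network and the backbone.
opt_char = torch.optim.SGD(char_net.parameters(), lr=0.1)
opt_sem = torch.optim.SGD(list(sem_net.parameters()) + list(backbone.parameters()), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def train_step(image, labels):
    feats = backbone(image).squeeze(2).permute(0, 2, 1)  # (batch, 32 positions, 64 channels)
    # detach() stops the first loss from flowing back into the backbone,
    # so each loss updates exactly the networks named in claim 4.
    first_loss = loss_fn(char_net(feats.detach()).reshape(-1, VOCAB), labels.reshape(-1))
    second_loss = loss_fn(sem_net(feats).reshape(-1, VOCAB), labels.reshape(-1))
    opt_char.zero_grad(); opt_sem.zero_grad()
    first_loss.backward()
    second_loss.backward()
    opt_char.step(); opt_sem.step()
    return first_loss.item(), second_loss.item()

image = torch.randn(2, 3, 32, 128)         # dummy sample images
labels = torch.randint(0, VOCAB, (2, 32))  # dummy per-position character labels
l1, l2 = train_step(image, labels)
```

Separate optimizers are one simple way to express the claimed split; summing the losses and masking gradients would be an alternative design.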
5. A text recognition method, comprising:
acquiring the characteristics of an image to be recognized; and
inputting the characteristics of the image to be recognized into a character prediction network to obtain text information in the image to be recognized;
wherein the character prediction network is trained according to the method of any one of claims 1-4.
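At inference time (claim 5), only the backbone and the trained character prediction network are used; the semantic learning branch is discarded. A toy sketch, with an assumed per-position argmax greedy decode over a hypothetical 5-symbol charset (the patent does not specify the decoding scheme):

```python
import torch
from torch import nn

CHARSET = ["-", "a", "b", "c", "d"]  # index 0 treated as blank; illustrative only

# Stand-ins for the trained backbone and character prediction network.
backbone = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1),
                         nn.AdaptiveAvgPool2d((1, 16)))
char_net = nn.Linear(8, len(CHARSET))

def recognize(image):
    """Map a batch of images to text strings via per-position argmax."""
    with torch.no_grad():
        feats = backbone(image).squeeze(2).permute(0, 2, 1)  # (batch, positions, channels)
        ids = char_net(feats).argmax(dim=-1)                 # best symbol per position
    # Greedy decode: drop blanks (a CTC-style collapse would also merge repeats).
    return ["".join(CHARSET[i] for i in row if i != 0) for row in ids.tolist()]

texts = recognize(torch.randn(1, 3, 32, 64))
```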
6. A training apparatus for a deep learning model, the apparatus comprising:
the character prediction module is used for performing character prediction on text information in the sample image according to the characteristics of the sample image to obtain a first text;
the semantic learning module is used for performing semantic learning on the text information in the sample image according to the characteristics of the sample image to obtain a second text;
the loss determining module is used for determining the loss of the deep learning model according to the first text and the second text; and
the parameter adjusting module is used for adjusting the parameters of the deep learning model according to the loss.
7. The apparatus of claim 6, wherein the deep learning model comprises a character prediction network and a semantic learning network;
the character prediction module is used for inputting the characteristics of the sample image into the character prediction network to obtain the first text;
the semantic learning module is used for inputting the characteristics of the sample image into the semantic learning network to obtain the second text.
8. The apparatus of claim 7, wherein the deep learning model further comprises a backbone network; the device also comprises a characteristic extraction module which is used for inputting the sample image into the backbone network to obtain the characteristics of the sample image.
9. The apparatus of claim 8, wherein the loss comprises a first loss and a second loss, the first loss determined based on the first text, the second loss determined based on the second text; the parameter adjustment module comprises:
a first adjusting unit, configured to adjust a parameter of the character prediction network according to the first loss; and
a second adjusting unit, configured to adjust parameters of the semantic learning network and the backbone network according to the second loss.
10. A text recognition apparatus comprising:
the acquisition module is used for acquiring the characteristics of the image to be recognized; and
the recognition module is used for inputting the characteristics of the image to be recognized into a character prediction network to obtain text information in the image to be recognized;
wherein the character prediction network is trained according to the apparatus of any one of claims 6-9.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 5.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 5.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 5.
CN202210506419.9A 2022-05-10 2022-05-10 Training method of deep learning model, text recognition method, text recognition device and text recognition equipment Pending CN114724168A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210506419.9A CN114724168A (en) 2022-05-10 2022-05-10 Training method of deep learning model, text recognition method, text recognition device and text recognition equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210506419.9A CN114724168A (en) 2022-05-10 2022-05-10 Training method of deep learning model, text recognition method, text recognition device and text recognition equipment

Publications (1)

Publication Number Publication Date
CN114724168A true CN114724168A (en) 2022-07-08

Family

ID=82230510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210506419.9A Pending CN114724168A (en) 2022-05-10 2022-05-10 Training method of deep learning model, text recognition method, text recognition device and text recognition equipment

Country Status (1)

Country Link
CN (1) CN114724168A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027562A (en) * 2019-12-06 2020-04-17 中电健康云科技有限公司 Optical character recognition method based on multi-scale CNN and RNN combined with attention mechanism
CN112257426A (en) * 2020-10-14 2021-01-22 北京一览群智数据科技有限责任公司 Character recognition method, system, training method, storage medium and equipment
CN112288018A (en) * 2020-10-30 2021-01-29 北京市商汤科技开发有限公司 Training method of character recognition network, character recognition method and device
US20210224568A1 (en) * 2020-07-24 2021-07-22 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for recognizing text
CN113609965A (en) * 2021-08-03 2021-11-05 同盾科技有限公司 Training method and device of character recognition model, storage medium and electronic equipment
CN113705313A (en) * 2021-04-07 2021-11-26 腾讯科技(深圳)有限公司 Text recognition method, device, equipment and medium
CN114372477A (en) * 2022-03-21 2022-04-19 北京百度网讯科技有限公司 Training method of text recognition model, and text recognition method and device


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035351A (en) * 2022-07-18 2022-09-09 北京百度网讯科技有限公司 Image-based information extraction model, method, device, equipment and storage medium
CN115376137A (en) * 2022-08-02 2022-11-22 北京百度网讯科技有限公司 Optical character recognition processing and text recognition model training method and device
CN115376137B (en) * 2022-08-02 2023-09-26 北京百度网讯科技有限公司 Optical character recognition processing and text recognition model training method and device
CN115358392A (en) * 2022-10-21 2022-11-18 北京百度网讯科技有限公司 Deep learning network training method, text detection method and text detection device
CN115358392B (en) * 2022-10-21 2023-05-05 北京百度网讯科技有限公司 Training method of deep learning network, text detection method and device
CN116844168A (en) * 2023-06-30 2023-10-03 北京百度网讯科技有限公司 Text determining method, training method and device for deep learning model
CN117475448A (en) * 2023-12-27 2024-01-30 苏州镁伽科技有限公司 Training method of image processing model, image processing method and device
CN117475448B (en) * 2023-12-27 2024-04-16 苏州镁伽科技有限公司 Training method of image processing model, image processing method and device

Similar Documents

Publication Publication Date Title
CN114724168A (en) Training method of deep learning model, text recognition method, text recognition device and text recognition equipment
CN113313022B (en) Training method of character recognition model and method for recognizing characters in image
CN113129870B (en) Training method, device, equipment and storage medium of speech recognition model
CN113901907A (en) Image-text matching model training method, image-text matching method and device
CN114861889B (en) Deep learning model training method, target object detection method and device
CN111368551B (en) Method and device for determining event main body
US20220108684A1 (en) Method of recognizing speech offline, electronic device, and storage medium
CN113889076B (en) Speech recognition and coding/decoding method, device, electronic equipment and storage medium
CN114863437B (en) Text recognition method and device, electronic equipment and storage medium
CN112528641A (en) Method and device for establishing information extraction model, electronic equipment and readable storage medium
CN113408272A (en) Method, device, equipment and storage medium for training abstract generation model
CN113239157B (en) Method, device, equipment and storage medium for training conversation model
CN113360700A (en) Method, device, equipment and medium for training image-text retrieval model and image-text retrieval
CN112559885A (en) Method and device for determining training model of map interest point and electronic equipment
CN114020950A (en) Training method, device and equipment of image retrieval model and storage medium
US20220358955A1 (en) Method for detecting voice, method for training, and electronic devices
CN114861637B (en) Spelling error correction model generation method and device, and spelling error correction method and device
CN115640520A (en) Method, device and storage medium for pre-training cross-language cross-modal model
CN115358243A (en) Training method, device, equipment and storage medium for multi-round dialogue recognition model
CN113468857A (en) Method and device for training style conversion model, electronic equipment and storage medium
CN113361523A (en) Text determination method and device, electronic equipment and computer readable storage medium
CN113360683A (en) Method for training cross-modal retrieval model and cross-modal retrieval method and device
CN114898742A (en) Method, device, equipment and storage medium for training streaming voice recognition model
CN113361522B (en) Method and device for determining character sequence and electronic equipment
CN114549904A (en) Visual processing and model training method, apparatus, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination