CN113361404A - Method, apparatus, device, storage medium and program product for recognizing text - Google Patents


Info

Publication number
CN113361404A
CN113361404A (application number CN202110629018.8A)
Authority
CN
China
Prior art keywords
text
recognized
image
labeled
marked
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110629018.8A
Other languages
Chinese (zh)
Inventor
刘清灿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110629018.8A
Publication of CN113361404A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/189 Automatic justification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems

Abstract

The application discloses a method, an apparatus, an electronic device, a storage medium, and a computer program product for recognizing text, and relates to the technical fields of image recognition and speech recognition. One embodiment of the method comprises: acquiring an image to be recognized; recognizing the image to be recognized to obtain a body text, a labeled text in the body text, and labeling information; and labeling the labeling information in correspondence with the labeled text in the body text to obtain a recognized text. By recognizing the body text, the labeled text, and the labeling information in the image and typesetting them according to the labeling layout in the image, the recognized text is obtained, which reduces the storage space required on the device and improves the user's look-up experience.

Description

Method, apparatus, device, storage medium and program product for recognizing text
Technical Field
The present application relates to the field of image processing, in particular to the fields of image recognition and speech recognition, and specifically to a method, an apparatus, an electronic device, a storage medium, and a computer program product for recognizing text.
Background
When reading paper books, readers often like to make annotations in them. However, paper books are inconvenient to carry, so the annotated content cannot be consulted anytime and anywhere; and photographing the annotated pages to keep them as images is also unhelpful, because such images are hard to categorize and therefore difficult to find during later look-up.
Disclosure of Invention
A method, an apparatus, an electronic device, a storage medium, and a computer program product for recognizing text are provided.
According to a first aspect, there is provided a method for recognizing text, comprising: acquiring an image to be recognized; recognizing the image to be recognized to obtain a body text, a labeled text in the body text, and labeling information; and labeling the labeling information in correspondence with the labeled text in the body text to obtain a recognized text.
According to a second aspect, there is provided an apparatus for recognizing text, comprising: an acquisition unit configured to acquire an image to be recognized; a recognition unit configured to recognize the image to be recognized to obtain a body text, a labeled text in the body text, and labeling information; and a labeling unit configured to label the labeling information in correspondence with the labeled text in the body text to obtain a recognized text.
According to a third aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first aspect.
According to a fourth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method as described in any one of the implementations of the first aspect.
According to a fifth aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method as described in any one of the implementations of the first aspect.
According to the technology of the application, the body text, the labeled text, and the labeling information in the image to be recognized are recognized and typeset according to the labeling layout in the image, yielding the recognized text; this reduces the storage space required on the device and improves the user's look-up experience.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is an exemplary system architecture diagram in which one embodiment according to the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for recognizing text according to the present application;
fig. 3 is a schematic diagram of an application scenario of the method for recognizing text according to the present embodiment;
FIG. 4 is a flow diagram of yet another embodiment of a method for recognizing text according to the present application;
FIG. 5 is a block diagram of one embodiment of an apparatus for recognizing text according to the present application;
FIG. 6 is a block diagram of a computer system suitable for use in implementing embodiments of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding; these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
Fig. 1 illustrates an exemplary architecture 100 to which the methods and apparatus for recognizing text of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The communication connections between the terminal devices 101, 102, 103 form a topological network, and the network 104 serves to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal devices 101, 102, 103 may be hardware devices or software supporting network connection for data interaction and data processing. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices supporting network connection, information acquisition, interaction, display, and processing, including but not limited to smartphones, tablet computers, e-book readers, laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented, for example, as multiple pieces of software or software modules providing distributed services, or as a single piece of software or software module. No specific limitation is made here.
The server 105 may be a server providing various services, for example a background processing server that obtains the image to be recognized sent by a user through the terminal devices 101, 102, and 103; recognizes the body text in that image, together with the labeled text and the labeling information within the body text; and obtains the recognized text by typesetting the body text, the labeled text, and the labeling information with reference to the labeling layout in the image. Optionally, the background processing server feeds the recognized text back to the terminal device. As an example, the server 105 may be a cloud server.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed cluster of multiple servers or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (for example, software or software modules providing distributed services) or as a single piece of software or software module. No specific limitation is made here.
It should be further noted that the method for recognizing text provided by the embodiments of the present application may be executed by the server, by the terminal device, or by the server and the terminal device in cooperation. Accordingly, the parts (for example, the units) of the apparatus for recognizing text may be entirely provided in the server, entirely provided in the terminal device, or distributed between the server and the terminal device.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. The system architecture may include only the electronic device (e.g., server or terminal device) on which the method for recognizing text operates, when the electronic device on which the method for recognizing text operates does not require data transmission with other electronic devices.
Referring to fig. 2, fig. 2 is a flowchart of a method for recognizing a text according to an embodiment of the present application, wherein the process 200 includes the following steps:
step 201, acquiring an image to be identified.
In this embodiment, an execution subject (for example, a terminal device or a server in fig. 1) of the method for recognizing a text may acquire an image to be recognized.
The image to be recognized contains several parts of text information, such as a body part and a labeled part, where the labeled part is contained in, and is a portion of, the body part. As an example, the image to be recognized may be an image of a reading note.
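The parts described above can be modeled with a small container type. The following is a minimal sketch, not from the patent: the class name, field names, and the span-lookup helper are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class RecognizedPage:
    """Hypothetical container for the parts recognized from one page image."""
    body_text: str    # full body text of the page
    marked_text: str  # the span of body_text the reader marked
    annotation: str   # the reader's note attached to that span

    def marked_span(self):
        """Locate the marked text inside the body text.

        Returns (start, end) character offsets, or None when the marked
        text cannot be found in the body text.
        """
        start = self.body_text.find(self.marked_text)
        if start == -1:
            return None
        return (start, start + len(self.marked_text))
```

A downstream typesetting step could use `marked_span()` to decide where to attach the annotation.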
When the execution subject described above is a terminal device having an image capturing function, the image to be recognized may be an image captured by the terminal device. When the execution subject is a server that is communicatively connected to a terminal device, the image to be recognized may be an image acquired from the terminal device.
Step 202, recognizing the image to be recognized to obtain a body text, a labeled text in the body text, and labeling information.
In this embodiment, the executing body recognizes the image to be recognized and obtains the body text, the labeled text in the body text, and the labeling information.
As an example, the executing body may recognize the image to be recognized through a pre-trained recognition model to obtain the body text, the labeled text in the body text, and the labeling information. The recognition model characterizes the correspondence between the image to be processed and the body text, the labeled text, and the labeling information, and may be a neural network model with text recognition capability.
As another example, the executing body may first mark the image areas corresponding to the body text, the labeled text, and the labeling information in the image to be recognized, and then recognize each image area with OCR (Optical Character Recognition), obtaining the body text, the labeled text, and the labeling information in turn.
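The region-by-region flow in this example can be sketched as follows. Everything here is an assumption for illustration: the region names, the box format, and the injected `crop` and `ocr` callables (which in practice might wrap an engine such as Tesseract) are not specified by the patent; they are injected so the flow itself can be exercised without an OCR engine.

```python
from typing import Callable, Dict, Tuple

# Assumed box format: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def recognize_regions(image, regions: Dict[str, Box],
                      crop: Callable, ocr: Callable) -> Dict[str, str]:
    """Cut each marked image area (body, labeled text, labeling info)
    out of the page and run OCR on each crop separately, returning the
    recognized text keyed by region name."""
    return {name: ocr(crop(image, box)) for name, box in regions.items()}
```

With a real engine, `crop` would slice the page image and `ocr` would call the recognizer; the stand-ins below only demonstrate the control flow.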
Step 203, labeling the labeling information in correspondence with the labeled text in the body text to obtain the recognized text.
In this embodiment, the executing body may label the labeling information in correspondence with the labeled text in the body text to obtain the recognized text.
As an example, the executing body may obtain the recognized text by typesetting the body text, the labeled text, and the labeling information with reference to their relative positions in the image to be recognized. The labeled text within the body text may be emphasized, for example by underlining or highlighting.
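As an illustration of this typesetting step, the sketch below inlines the labeling information next to the labeled text; the marker characters and bracketed note format are arbitrary choices for the example, not the patent's layout.

```python
def typeset(body: str, marked: str, note: str) -> str:
    """Re-typeset the recognized page: wrap the labeled span in **...**
    and append the labeling information right after it in brackets,
    mirroring the relative positions in the original reading note."""
    idx = body.find(marked)
    if idx == -1:
        return body  # labeled text not found; return the body unchanged
    end = idx + len(marked)
    return body[:idx] + "**" + marked + "**" + f" [note: {note}]" + body[end:]
```

A richer implementation would carry pixel coordinates from the recognition step instead of searching for the substring.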
In this embodiment, the body text, the labeled text, and the labeling information in the image to be recognized are recognized and typeset according to the labeling layout in the image, yielding the recognized text; this reduces the storage space required on the device and improves the user's look-up experience.
With continued reference to fig. 3, fig. 3 is a schematic diagram 300 of an application scenario of the method for recognizing text according to the present embodiment. In the application scenario of fig. 3, the user 301 photographs a reading note through the terminal device 302 to acquire an image to be recognized 3021. The terminal device 302 sends the image to be recognized 3021 to the server 303, and the server 303 recognizes it to obtain a body text 30211, a labeled text 30212 in the body text, and labeling information 30213; the labeling information 30213 is then labeled in correspondence with the labeled text 30212 in the body text 30211, yielding the recognized text 3022, which is presented to the user 301 through the terminal device 302.
In some optional implementations of this embodiment, for the case where the labeled part in the image to be recognized is marked with a preset mark, the executing body may perform step 202 as follows:
in response to determining that the body part in the image to be recognized includes the preset mark, recognize the body part in the image to be recognized and the labeled part marked by the preset mark within it, obtaining the body text, the labeled text, and the labeling information in turn.
The preset mark may be any marker with a marking function, such as an underline, brackets, or coloring in a specific color.
As an example, the executing body may recognize the body part, the labeled part, and the labeling part in the image to be recognized by ICR (Intelligent Character Recognition), obtaining the body text, the labeled text, and the labeling information in turn. In this case, the executing body directly locates and recognizes each part of the text in the image to be recognized, so text recognition can be performed quickly.
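As one concrete (and purely illustrative) way to check for a preset mark of the colored-highlight kind, a page image could be scanned for highlighter-colored pixels. The yellow thresholds and minimum-coverage fraction below are assumptions for the sketch, not values from the patent.

```python
import numpy as np


def has_highlight(rgb: np.ndarray, min_fraction: float = 0.01) -> bool:
    """Decide whether a page image (H x W x 3, uint8 RGB) contains a
    preset mark, here assumed to be a yellow highlighter stroke
    (bright red and green channels, low blue)."""
    r = rgb[..., 0].astype(int)
    g = rgb[..., 1].astype(int)
    b = rgb[..., 2].astype(int)
    mask = (r > 180) & (g > 180) & (b < 120)  # illustrative yellow test
    # Require the mark to cover at least min_fraction of the page.
    return bool(mask.mean() >= min_fraction)
```

The same shape of test could branch between the preset-mark path (step 202 above) and the speech-based path described next.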
In some optional implementations of this embodiment, for the case where the labeled part in the image to be recognized is not marked with a preset mark, the executing body may perform step 202 as follows:
first, in response to determining that a body part in an image to be recognized does not include a preset mark, a speech to be recognized is received.
The speech to be recognized enables the executing body to determine the labeled part within the body part. As examples, the speech to be recognized may be speech that completely matches the labeled text (formed by the user reading the labeled text aloud), speech giving only the beginning and ending parts of the labeled text, or speech describing the position of the labeled part within the body part.
Secondly, the speech to be recognized is recognized to obtain a speech text, and the text in the body text that matches the speech text is determined as the labeled text.
When the speech to be recognized completely matches the labeled text, the executing body can recognize it directly to determine the speech text. When the speech gives the beginning and ending parts of the labeled text, the executing body recognizes the two fragments, matches them against the body text, and determines the text that starts with the beginning fragment and ends with the ending fragment as the labeled text. When the speech describes the position of the labeled part within the body part, the executing body can perform semantic recognition on the speech text to determine the position information, and then locate the labeled text in the body text according to that position information.
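The begin/end-fragment case can be sketched as a plain substring search. This is an illustrative reading of the matching step; a real system would likely need fuzzy matching to tolerate speech-recognition errors.

```python
def locate_by_fragments(body: str, start_frag: str, end_frag: str):
    """Find the labeled text when speech gives only its beginning and end:
    take the span from the first occurrence of the start fragment to the
    end of the next occurrence of the end fragment. Returns None when
    either fragment cannot be found in order."""
    start = body.find(start_frag)
    if start == -1:
        return None
    end = body.find(end_frag, start)
    if end == -1:
        return None
    return body[start:end + len(end_frag)]
```

For the position-description case, the executing body would instead parse the spoken position (e.g. "second paragraph, first sentence") and index into the body text.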
In this case, the executing body can determine the labeled text by speech recognition, improving the convenience of using the application.
When the labeling information is handwritten text, text recognition may make errors. In this case, the executing body may perform step 202 as follows:
First, the body part and the labeled part are recognized, obtaining the body text and the labeled text in turn.
The body part and the labeled part are generally printed text, for which text recognition accuracy is high.
Secondly, the labeling part is segmented out of the image, obtaining the labeling information in image form.
In this implementation, to avoid recognition errors on the labeling part, the executing body segments the labeling part directly as an image, which improves the accuracy of information processing.
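Segmenting the labeling part out as an image crop, rather than OCR'ing the handwriting, can be as simple as an array slice. The box layout below is an assumed convention for the sketch; the patent does not specify coordinates.

```python
import numpy as np


def cut_annotation(page: np.ndarray, box) -> np.ndarray:
    """Segment the handwritten labeling part out of the page as an image
    crop instead of recognizing it as text, avoiding handwriting-OCR
    errors. `box` is (top, bottom, left, right) in pixel coordinates,
    an assumed layout."""
    top, bottom, left, right = box
    # Copy so the crop stays valid even if the page buffer is released.
    return page[top:bottom, left:right].copy()
```

The resulting crop would then be stored alongside the recognized body text and displayed in place of the annotation's text.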
In some optional implementations of this embodiment, the executing body may display the recognized text. As an example, after the executing body recognizes the image to be recognized for the first time and obtains the recognized text, the recognized text is displayed to the user by default. On subsequent viewings, the recognized text may be displayed in response to the user's viewing request.
In this implementation, the executing body can display the recognized text conveniently and flexibly, improving the flexibility of text display and the user's experience.
In some optional implementations of this embodiment, the executing body receives a user's editing operation and edits the recognized text according to that operation.
The editing operation may be any operation on the recognized text, including but not limited to adding, deleting, adjusting, and modifying content. As an example, for a deletion operation, the executing body may delete content of the recognized text that is unrelated to the labeled text.
In this implementation, the user can edit the recognized text flexibly, improving the flexibility and practicality of the application.
With continuing reference to FIG. 4, an exemplary flow 400 of another embodiment of the method for recognizing text according to the present application is shown, comprising the following steps:
step 401, acquiring an image to be identified.
Step 402, in response to determining that the body part in the image to be recognized includes the preset mark, recognizing the body part in the image to be recognized and the labeled part marked by the preset mark within it, obtaining a body text, a labeled text, and labeling information in turn.
Step 403, in response to determining that the body part in the image to be recognized does not include the preset mark, receiving a speech input to be recognized.
Step 404, recognizing the speech to be recognized to obtain a speech text, and determining the text in the body text that matches the speech text as the labeled text.
Step 405, displaying the recognized text.
And 406, receiving the editing operation of the user, and editing the identified text according to the editing operation.
As can be seen from this embodiment, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for recognizing text in this embodiment details the text recognition process in different situations as well as the text editing process, improving the flexibility and practicality of text recognition.
With continuing reference to fig. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for recognizing text, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied to various electronic devices.
As shown in fig. 5, the apparatus for recognizing text includes: an acquisition unit 501 configured to acquire an image to be recognized; a recognition unit 502 configured to recognize the image to be recognized, obtaining a body text, a labeled text in the body text, and labeling information; and a labeling unit 503 configured to label the labeling information in correspondence with the labeled text in the body text, obtaining the recognized text.
In some embodiments, the identifying unit 502 is further configured to: and in response to the fact that the body part in the image to be recognized comprises the preset mark, recognizing the body part in the image to be recognized, and the labeled part marked by the preset mark in the body part, and sequentially obtaining the body text, the labeled text and the labeled information.
In some embodiments, the identifying unit 502 is further configured to: receiving a voice to be recognized in response to determining that the text portion in the image to be recognized does not include the preset mark; and recognizing the voice to be recognized to obtain a voice text, and determining a text matched with the voice text in the text as a marked text.
In some embodiments, the identifying unit 502 is further configured to: recognizing the text part and the marked part to obtain a text and a marked text in sequence; and carrying out image segmentation on the labeling part to obtain labeling information in an image form.
In some embodiments, the above apparatus further comprises: a display unit (not shown in the figure) configured to display the recognized text.
In some embodiments, the above apparatus further comprises: and an editing unit (not shown in the figure) configured to receive an editing operation by a user and edit the recognized text according to the editing operation.
By recognizing the body text, the labeled text, and the labeling information in the image to be recognized and typesetting them according to the labeling layout in the image, the recognized text is obtained; this reduces the storage space required on the device and improves the user's look-up experience.
According to an embodiment of the present disclosure, the present disclosure also provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method for recognizing text described in any of the embodiments described above.
According to an embodiment of the present disclosure, there is also provided a readable storage medium storing computer instructions for enabling a computer to implement the method for recognizing text described in any of the above embodiments when executed.
The embodiments of the present disclosure provide a computer program product, which when executed by a processor is capable of implementing the method for recognizing text described in any of the embodiments above.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 602 or loaded from a storage unit 608 into a random access memory (RAM) 603. The RAM 603 can also store the various programs and data required for the operation of the device 600. The computing unit 601, the ROM 602, and the RAM 603 are connected to one another via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 601 performs the methods and processes described above, such as the method for recognizing text. For example, in some embodiments, the method for recognizing text may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the method for recognizing text described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the method for recognizing text in any other suitable way (for example, by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability found in conventional physical host and Virtual Private Server (VPS) services.
According to the technical solution of the embodiments of the present disclosure, an image to be recognized is acquired; the image to be recognized is recognized to obtain a body text, together with a labeled text and labeling information in the body text; and the labeling information is marked at the position of the corresponding labeled text in the body text to obtain a recognized text. By recognizing the body text, the labeled text, and the labeling information in the image to be recognized, and typesetting them according to the annotation format of the image, the recognized text is obtained, which reduces the storage space required on the device and improves the user's reading experience.
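The flow summarized above, recognize the image and then re-insert each piece of labeling information next to its labeled text, can be sketched as follows. Only the merge step is shown concretely; the function name, the dictionary shape, and the bracketed annotation format are illustrative assumptions, and any OCR engine would supply the recognized strings.

```python
def merge_annotations(body_text, annotations):
    """Insert each annotation directly after its labeled text in the body text.

    `annotations` maps a labeled snippet of the body text to its labeling
    information (e.g. a margin note recognized from the image).
    """
    recognized = body_text
    for labeled, note in annotations.items():
        # Place the annotation in brackets right after the labeled snippet,
        # mimicking the annotation layout of the original image.
        recognized = recognized.replace(labeled, f"{labeled} [{note}]")
    return recognized

result = merge_annotations(
    "The quick brown fox jumps over the lazy dog.",
    {"brown fox": "underlined in red", "lazy dog": "circled"},
)
print(result)
```

A real implementation would track character offsets from the OCR output rather than string replacement, but the merge of body text and labeling information is the same idea.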
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. A method for recognizing text, comprising:
acquiring an image to be recognized;
recognizing the image to be recognized to obtain a body text, and a labeled text and labeling information in the body text;
and marking the labeling information at the position corresponding to the labeled text in the body text to obtain a recognized text.
2. The method of claim 1, wherein the recognizing the image to be recognized to obtain a body text, and a labeled text and labeling information in the body text comprises:
in response to determining that a body part of the image to be recognized includes a preset mark, recognizing the body part of the image to be recognized and a labeled part marked by the preset mark within the body part, to sequentially obtain the body text, the labeled text, and the labeling information.
3. The method of claim 2, wherein the recognizing the image to be recognized to obtain a body text, and a labeled text and labeling information in the body text further comprises:
in response to determining that the body part of the image to be recognized does not include a preset mark, receiving a voice to be recognized;
and recognizing the voice to be recognized to obtain a voice text, and determining a text in the body text that matches the voice text as the labeled text.
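One way the matching in claim 3 could be realized, finding the passage of the body text that a recognized voice text refers to, is a fuzzy match over sentence candidates. The sketch below uses Python's standard `difflib`; the function name, the sentence-splitting heuristic, and the 0.6 similarity threshold are assumptions for illustration, not the claimed implementation.

```python
import difflib

def find_labeled_text(body_text, voice_text, threshold=0.6):
    """Return the sentence of the body text best matching the voice text,
    or None when no candidate clears the similarity threshold."""
    # Split the body text into rough sentence candidates.
    candidates = [s.strip() for s in body_text.split(".") if s.strip()]
    # difflib ranks candidates by SequenceMatcher ratio against the query.
    best = difflib.get_close_matches(voice_text, candidates, n=1, cutoff=threshold)
    return best[0] if best else None
```

Production systems would more likely match at the token or embedding level to tolerate speech-recognition errors; the cutoff controls how strict the match is.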
4. The method of claim 2, wherein the recognizing the body part of the image to be recognized and the labeled part marked by the preset mark within the body part to sequentially obtain the body text, the labeled text, and the labeling information comprises:
recognizing the body part and the labeled part to sequentially obtain the body text and the labeled text;
and performing image segmentation on the labeled part to obtain the labeling information in image form.
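The image segmentation of claim 4, cropping the labeled part out so that the labeling information is kept in image form, reduces at its simplest to extracting a bounding box. The pure-Python sketch below represents a grayscale page as a list of pixel rows; the function name and box convention are assumptions for illustration, not the disclosed segmentation method.

```python
def segment_labeled_part(image, box):
    """Crop the labeled region out of a grayscale image.

    `image` is a list of pixel rows; `box` is (top, left, bottom, right),
    with the bottom and right bounds exclusive.
    """
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]

# A blank 4x6 page with a two-pixel "handwritten margin note" on row 1.
page = [[0] * 6 for _ in range(4)]
page[1][2] = page[1][3] = 255
note_image = segment_labeled_part(page, (1, 2, 2, 4))
print(note_image)  # → [[255, 255]]
```

A practical pipeline would obtain the box from a detector and crop with an image library, but the cropped region preserved as pixels is exactly the "labeling information in image form" the claim describes.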
5. The method of any of claims 1-4, further comprising:
and displaying the recognized text.
6. The method of claim 5, further comprising:
and receiving an editing operation from a user, and editing the recognized text according to the editing operation.
7. An apparatus for recognizing text, comprising:
an acquisition unit configured to acquire an image to be recognized;
a recognition unit configured to recognize the image to be recognized to obtain a body text, and a labeled text and labeling information in the body text;
and a marking unit configured to mark the labeling information at the position corresponding to the labeled text in the body text to obtain a recognized text.
8. The apparatus of claim 7, wherein the recognition unit is further configured to:
in response to determining that a body part of the image to be recognized includes a preset mark, recognize the body part of the image to be recognized and a labeled part marked by the preset mark within the body part, to sequentially obtain the body text, the labeled text, and the labeling information.
9. The apparatus of claim 8, wherein the recognition unit is further configured to:
in response to determining that the body part of the image to be recognized does not include a preset mark, receive a voice to be recognized; and recognize the voice to be recognized to obtain a voice text, and determine a text in the body text that matches the voice text as the labeled text.
10. The apparatus of claim 8, wherein the recognition unit is further configured to:
recognize the body part and the labeled part to sequentially obtain the body text and the labeled text; and perform image segmentation on the labeled part to obtain the labeling information in image form.
11. The apparatus of any of claims 7-10, further comprising:
a display unit configured to display the recognized text.
12. The apparatus of claim 11, further comprising:
an editing unit configured to receive an editing operation from a user and edit the recognized text according to the editing operation.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-6.
15. A computer program product, comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-6.
CN202110629018.8A 2021-06-02 2021-06-02 Method, apparatus, device, storage medium and program product for recognizing text Pending CN113361404A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110629018.8A CN113361404A (en) 2021-06-02 2021-06-02 Method, apparatus, device, storage medium and program product for recognizing text


Publications (1)

Publication Number Publication Date
CN113361404A true CN113361404A (en) 2021-09-07

Family

ID=77532524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110629018.8A Pending CN113361404A (en) 2021-06-02 2021-06-02 Method, apparatus, device, storage medium and program product for recognizing text

Country Status (1)

Country Link
CN (1) CN113361404A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821566A (en) * 2022-05-13 2022-07-29 北京百度网讯科技有限公司 Text recognition method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107430683A (en) * 2016-03-31 2017-12-01 深圳市柔宇科技有限公司 Information correlation method, c bookmart and information correlation system
CN109462712A (en) * 2018-10-31 2019-03-12 段民兴 A kind of books extracts content automatic processing method
CN110956023A (en) * 2018-09-25 2020-04-03 珠海金山办公软件有限公司 Comment display method and device
CN111523292A (en) * 2020-04-23 2020-08-11 北京百度网讯科技有限公司 Method and device for acquiring image information
CN112256168A (en) * 2020-09-30 2021-01-22 北京百度网讯科技有限公司 Method and device for electronizing handwritten content, electronic equipment and storage medium
CN112560854A (en) * 2020-12-18 2021-03-26 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing image



Similar Documents

Publication Publication Date Title
CN108628830B (en) Semantic recognition method and device
CN113780098B (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN114428677B (en) Task processing method, processing device, electronic equipment and storage medium
US20150278248A1 (en) Personal Information Management Service System
US20230052906A1 (en) Entity Recognition Method and Apparatus, and Computer Program Product
CN113836314B (en) Knowledge graph construction method, device, equipment and storage medium
CN112507118A (en) Information classification and extraction method and device and electronic equipment
CN114429633A (en) Text recognition method, model training method, device, electronic equipment and medium
CN113657395A (en) Text recognition method, and training method and device of visual feature extraction model
CN113553428B (en) Document classification method and device and electronic equipment
CN113360685A (en) Method, device, equipment and medium for processing note content
CN113361404A (en) Method, apparatus, device, storage medium and program product for recognizing text
CN112699237B (en) Label determination method, device and storage medium
CN113963197A (en) Image recognition method and device, electronic equipment and readable storage medium
US11120074B2 (en) Streamlining citations and references
CN112528610A (en) Data labeling method and device, electronic equipment and storage medium
CN113590852B (en) Training method of multi-modal recognition model, multi-modal recognition method and device
WO2022105120A1 (en) Text detection method and apparatus from image, computer device and storage medium
CN113127058A (en) Data annotation method, related device and computer program product
CN110796137A (en) Method and device for identifying image
CN114064906A (en) Emotion classification network training method and emotion classification method
CN114445833A (en) Text recognition method and device, electronic equipment and storage medium
CN113961672A (en) Information labeling method and device, electronic equipment and storage medium
CN114187435A (en) Text recognition method, device, equipment and storage medium
CN112580620A (en) Sign picture processing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination