CN117197811A - Text recognition method and electronic equipment

Text recognition method and electronic equipment

Info

Publication number
CN117197811A
Authority
CN
China
Prior art keywords
text
text content
content
region
image
Prior art date
Legal status
Pending
Application number
CN202210597895.6A
Other languages
Chinese (zh)
Inventor
滕益华
吴觊豪
洪芳宇
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202210597895.6A
Priority to PCT/CN2023/096921 (published as WO2023231987A1)
Publication of CN117197811A
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/146 Aligning or centring of the image pick-up or image-field
    • G06V30/19 Recognition using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiment of the application provides a text recognition method and electronic equipment, wherein the method comprises the following steps: the electronic device may obtain an image of a first text region of an object to be identified and first text content. The electronic device classifies the image of the first text region together with the first text content, and displays a text recognition result of the first text region based on the classification result. The classification result includes a first classification, a second classification, or a third classification. For the first classification, the first text content is filtered from the text recognition result. For the second classification, the text recognition result comprises the corrected first text content. For the third classification, the text recognition result comprises the first text content. By comprehensively considering both the image of the text region and the text content, the electronic device can keep text content with semantic errors out of the text recognition result, thereby improving the user experience.

Description

Text recognition method and electronic equipment
Technical Field
The embodiment of the application relates to the field of terminal equipment, in particular to a text recognition method and electronic equipment.
Background
With the continuous development of communication technology, terminals such as mobile phones have become an indispensable part of people's daily lives. The user can communicate with other users by using the mobile phone, and can browse or process various information.
During use, when content displayed by the mobile phone is of interest, for example when the user is interested in a picture or in certain characters in an application interface, the user can recognize the characters in the picture or the interface through the text recognition function of an application. Typically the text recognition function is implemented based on optical character recognition (Optical Character Recognition, OCR) technology. Taking a picture as an example, the application can recognize characters in the picture based on OCR technology and output a recognition result. However, for a text recognition scene containing truncated text, the output of current OCR technology can differ greatly from the original text, which affects the user experience.
Disclosure of Invention
In order to solve the technical problems, the application provides a text recognition method and electronic equipment. In the method, the electronic equipment can output a text recognition result meeting the user requirement based on the image and the text content of the text region.
In a first aspect, an embodiment of the present application provides a text recognition method. The method comprises the following steps: the electronic device detects a text region of an object to be identified to obtain an image of a first text region, wherein the first text region comprises text content. The electronic device performs text content recognition on the acquired first text region to obtain first text content. Then, the electronic device classifies the first text content based on the image of the first text region to obtain a classification result, and displays a text recognition result of the first text region based on the classification result. Displaying the text recognition result may specifically include: if the classification result is the first classification, the first text content is filtered from the text recognition result; if the classification result is the second classification, the text recognition result comprises the corrected first text content; if the classification result is the third classification, the text recognition result comprises the first text content. In this way, by comprehensively considering the image information (i.e., the image of the text region) and the text information (i.e., the text content), the electronic device can filter the result of text content recognition (i.e., the first text content) when a large portion of the text content in the text region is missing, output a corrected result when only a small portion is missing, and output the corresponding text directly when no text content is missing. Therefore, only correct and semantically coherent results are presented in the text recognition result, and results with semantic errors are filtered out, which achieves a personified complex decision effect and improves the user experience.
Illustratively, the text recognition result is optionally displayed in the text recognition result display box 405 in FIG. 4. That is, if the text recognition result is the result indicated by the first classification (i.e., filtering), the result corresponding to the first text region in the text recognition result display box 405 is empty, i.e., the text content recognition result corresponding to the first text region (i.e., the first text content) is not displayed. If the text recognition result is the result indicated by the second classification (i.e., outputting corrected text content) or by the third classification (i.e., directly outputting the text content), the text recognition result display box 405 includes the corrected text content corresponding to the first text region or the text content of the first text region, respectively.
The text recognition result may also be a result corresponding to the text region itself. For example, if the text recognition result is the result indicated by the first classification (i.e., filtering), the text recognition result corresponding to the first text region displayed by the electronic device is empty (a blank may or may not be shown). If the text recognition result is the result indicated by the second classification (i.e., outputting corrected text content) or by the third classification (i.e., directly outputting the text content), the electronic device may display, in the text recognition result display box 405, the text content corresponding to the first text region (which may be the corrected text content or the result of text content recognition).
Illustratively, the classification result is optionally a numerical value, which is used to represent the classification item.
For example, the classification result may also include 3 values, and the classification corresponding to the largest value is the classification corresponding to the first text region.
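As a non-limiting illustration of the decision logic described above, the following Python sketch maps a three-value classification result to the displayed text. The function and variable names are assumptions made for illustration and are not part of the claimed method.

    # Illustrative sketch only: names and values are assumptions, not the application's API.
    def decide_display(scores, first_text, corrected_text):
        """scores: three values, one per classification; the largest value wins."""
        winner = max(range(len(scores)), key=lambda i: scores[i])
        if winner == 0:               # first classification: filter the first text content
            return None               # nothing is displayed for this text region
        if winner == 1:               # second classification: display the corrected text content
            return corrected_text
        return first_text             # third classification: display the first text content as-is

    # Example: the second value is the largest, so the corrected text is displayed.
    print(decide_display([0.1, 0.7, 0.2], "recognized text", "corrected text"))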
According to the first aspect, the electronic device classifies the image of the first text region together with the first text content to obtain a classification result, including: the electronic device obtains intermediate characterization information based on the image of the first text region and the first text content, and the electronic device classifies the intermediate characterization information to obtain the classification result. In this way, the electronic device makes more refined decisions on different input combinations by utilizing high-dimensional multi-modal semantic information, so that a personified complex decision effect can be obtained.
For example, the intermediate characterization information may be referred to as multimodal information.
For example, the intermediate characterization information may be used to characterize image features of the image of the first text region with text features of the first text content.
According to the first aspect, or any implementation manner of the first aspect, the electronic device classifies the intermediate characterization information to obtain a classification result, including: the electronic equipment classifies the intermediate characterization information through the classification model to obtain a classification result. Thus, the electronic equipment can classify the intermediate characterization information through the pre-trained classification model so as to obtain a corresponding classification result.
According to the first aspect, or any implementation manner of the first aspect, before the electronic device displays the text recognition result of the first text region based on the classification result, the method further includes: the electronic device corrects the intermediate characterization information to obtain the corrected first text content. Illustratively, before, concurrently with, or after the electronic device classifies the intermediate characterization information, the intermediate characterization information is corrected to obtain the corrected text content. The electronic device may determine whether to output the corrected text content based on the classification result. Illustratively, if the corrected text content does not need to be output, e.g., the classification result is the first classification or the third classification, the corrected text content is discarded.
According to the first aspect, or any implementation manner of the first aspect, the electronic device corrects the intermediate characterization information to obtain the corrected text content, including: the electronic device corrects the intermediate characterization information through a correction model to obtain the corrected first text content. In this way, the electronic device can correct the intermediate characterization information through a pre-trained correction model, so as to obtain the corrected text content.
According to the first aspect, or any implementation manner of the first aspect, the obtaining, by the electronic device, the intermediate characterization information based on the image of the first text region and the first text content includes: the electronic device performs image coding on the image of the first text region to obtain first image coding information; the electronic device performs text coding on the first text content to obtain first text coding information; and the electronic device performs multi-modal coding on the first image coding information and the first text coding information through a multi-modal coding model to obtain the intermediate characterization information. In this way, the electronic device can obtain higher-dimensional semantic information by coding the image of the text region and the text content, and can perform multi-modal coding on the first image coding information and the first text coding information through a pre-trained multi-modal coding model to obtain intermediate characterization information with high-dimensional semantics.
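The following sketch illustrates one possible shape of this encoding step with a transformer-style fusion encoder. The use of PyTorch, the module names, and the dimensions are assumptions for illustration, not details taken from the application.

    import torch
    import torch.nn as nn

    class MultiModalEncoder(nn.Module):
        """Hypothetical sketch: fuses image coding information and text coding
        information into intermediate characterization information."""
        def __init__(self, d_model=256, nhead=4, num_layers=2):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
            self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)

        def forward(self, image_tokens, text_tokens):
            # Concatenate the two modalities and let self-attention mix them.
            fused = torch.cat([image_tokens, text_tokens], dim=1)
            return self.fusion(fused)

    # image_tokens: e.g. embeddings of image patches of the first text region;
    # text_tokens: e.g. embeddings of the characters in the first text content.
    image_tokens = torch.randn(1, 32, 256)
    text_tokens = torch.randn(1, 16, 256)
    intermediate = MultiModalEncoder()(image_tokens, text_tokens)
    print(intermediate.shape)  # torch.Size([1, 48, 256])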
According to the first aspect, or any implementation manner of the first aspect, the multi-modal coding model, the classification model, and the correction model form a neural network, and training data of the neural network includes a second text region and second text content corresponding to the second text region, and a third text region and third text content corresponding to the third text region; the second text region includes partially missing text content, and the text content in the third text region is complete text content. In this way, by inputting images and text contents of text regions of different types (including text regions with missing text and text regions without missing text), the neural network can be trained iteratively, so that it can perform the corresponding functions, i.e., fusion, classification, and correction of the images and text contents of the text regions.
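A minimal sketch of how such training samples could be organized is shown below, assuming a simple pairing of a region image with its ground-truth text; the field names and file paths are illustrative assumptions, not prescribed by the application.

    from dataclasses import dataclass

    @dataclass
    class TrainingSample:
        region_image_path: str   # image of a text region (second or third text region)
        ground_truth_text: str   # the complete original text content for that region
        is_truncated: bool       # True if the region contains partially missing text

    # Hypothetical examples: one truncated region and one complete region.
    training_data = [
        TrainingSample("regions/top_line_cut.png", "complete original sentence", True),
        TrainingSample("regions/middle_line.png", "another complete sentence", False),
    ]

    for sample in training_data:
        print(sample.is_truncated, sample.ground_truth_text)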
According to the first aspect, or any implementation manner of the first aspect, the text recognition result of the first text region is displayed in a text recognition region, and the text recognition region further includes text content corresponding to a third text region in the object to be recognized. Therefore, the text recognition method can apply different processing modes to different text contents, so that the finally displayed text recognition results are all semantically coherent text content. Text content with incoherent semantics in the text content recognition result is filtered or corrected, so as to avoid its influence on the text recognition result.
According to the first aspect, or any implementation manner of the first aspect, if the first text region includes partially missing text content, the classification result is the first classification or the second classification. For example, the partially missing text content may be that each character in the text region is missing part of its information, for example, the upper half or the lower half may be missing. The partially missing text content may also be that at least one character in the text region is missing part of its information.
According to the first aspect, or any implementation manner of the first aspect above, the semantics expressed by the first text content are different from the semantics expressed by the text content in the first text region. Therefore, in the embodiment of the application, the text content recognition result can be screened so that text content whose semantics differ from the original semantics is filtered or corrected, thereby improving the user experience.
According to the first aspect, or any implementation manner of the first aspect, the object to be identified is a picture, a web page or a document.
In a second aspect, an embodiment of the present application provides a text recognition method. The method comprises the following steps: the electronic device detects a text region of an object to be identified to obtain an image of a first text region, where the first text region includes text content. The electronic device performs text content recognition on the first text region to obtain first text content. The electronic device displays a text recognition result of the first text region based on the image of the first text region and the first text content. The displaying includes: if the image of the first text region characterizes that the first text region includes partially missing text content and the first text content is semantically coherent text content, or the image of the first text region characterizes that the first text region does not include partially missing text content, the text recognition result comprises the first text content; if the image of the first text region characterizes that the first text region includes partially missing text content and the first text content includes text content with semantic errors, the first text content is filtered from the text recognition result, or the text recognition result comprises the corrected first text content. In this way, by comprehensively considering the image information (i.e., the image of the text region) and the text information (i.e., the text content), the electronic device can filter the result of text content recognition (i.e., the first text content) when a large portion of the text content in the text region is missing, output a corrected result when only a small portion is missing, and output the corresponding text directly when no text content is missing. Therefore, only correct and semantically coherent results are presented in the text recognition result, and results with semantic errors are filtered out, which achieves a personified complex decision effect and improves the user experience.
For example, the electronic device may detect, based on the image of the text region, whether the text content in the text region is truncated, i.e., whether it includes text with missing content. In one example, if the text content is not truncated, the first text content may be output directly. In another example, if the text content is truncated, it is detected whether the semantics of the first text content are coherent. If the semantics of the first text content are coherent, the first text content may be output directly. If the semantics of the first text content are not coherent, it is further detected whether the first text content can be corrected. If the first text content can be corrected, the corrected text content is output; if it cannot be corrected, the first text content is filtered.
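The decision flow in the preceding example can be summarized with a small sketch; the boolean checks are stand-ins for the model outputs described later, and their names are assumptions rather than the application's terminology.

    def display_result(is_truncated, is_coherent, can_correct, first_text, corrected_text):
        if not is_truncated:
            return first_text          # no missing text: output the recognized content directly
        if is_coherent:
            return first_text          # truncated, but the semantics are still coherent
        if can_correct:
            return corrected_text      # truncated and incoherent, but correctable
        return None                    # truncated and uncorrectable: filter the content

    # Example: truncated, incoherent, correctable -> the corrected text is displayed.
    print(display_result(True, False, True, "recognized text", "corrected text"))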
According to the second aspect, the electronic device displays the text recognition result of the first text region based on the image of the first text region and the first text content, including: if the image of the first text region characterizes that the first text region includes partially missing text content and the first text content includes semantically incoherent text content, the electronic device detects whether the first text content can be corrected. If the first text content cannot be corrected, the first text content is filtered from the text recognition result. If the first text content can be corrected, the text recognition result includes the corrected first text content. In this way, when it is detected that the text content in the first text region is truncated and the semantics of the first text content are not coherent, the electronic device may further detect whether the first text content can be corrected. If so, the electronic device may correct the first text content and output the corrected text content. If not, the electronic device filters the first text content. That is, the text recognition result of the first text region displayed by the electronic device is empty, or is the corrected text content, or is the original semantically coherent text content, so as to avoid the influence of an erroneous text content recognition result on the user.
According to the second aspect, or any implementation manner of the second aspect, if the first text content can be corrected, the method further includes: the electronic device corrects the first text content through a correction model to obtain the corrected first text content. In this way, the electronic device can correct the first text content through a pre-trained correction model, so as to obtain semantically coherent text content.
According to a second aspect, or any implementation manner of the second aspect, the electronic device displays a text recognition result of the first text region based on the image of the first text region and the first text content, including: the electronic equipment classifies the image of the first text region through a classification model to obtain a classification result; the classification result is used to indicate whether the first text area includes partially missing text content. In this way, the electronic device may classify the image of the text region through a pre-trained classification model to detect whether text content in the text region is truncated.
According to the second aspect, or any implementation manner of the second aspect, if the image of the first text region indicates that the first text region includes partially missing text content, the electronic device displays the text recognition result of the first text region based on the image of the first text region and the first text content, including: the electronic device performs semantic analysis on the first text content through a semantic model to obtain a semantic analysis result; the semantic analysis result is used to indicate whether the first text content includes semantically erroneous text content. In this way, the electronic device can perform semantic analysis on the text content through a pre-trained semantic model to obtain the semantic analysis result.
For example, the semantic analysis result may be a numerical value, and the electronic device may preset a semantic coherence threshold used to indicate the semantic coherence of text content. If the value of the semantic analysis result is greater than or equal to the threshold, the semantics of the first text content are coherent; if the value of the semantic analysis result is smaller than the threshold, the semantics of the first text content are not coherent.
According to the second aspect, or any implementation manner of the second aspect, the semantic analysis result is further used to indicate whether the first text content can be corrected, and the electronic device displays the text recognition result of the first text region based on the image of the first text region and the first text content, including: the electronic device determines, based on the semantic analysis result, whether the first text content can be corrected. The electronic device may set a correction threshold that is different from the semantic coherence threshold. If the value of the semantic analysis result is greater than or equal to the correction threshold, the first text content can be corrected; if the value of the semantic analysis result is smaller than the correction threshold, the first text content cannot be corrected.
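As a sketch of the two-threshold comparison just described (the threshold values are arbitrary examples, not values from the application, and the correction threshold is assumed here to be the lower of the two):

    SEMANTIC_COHERENCE_THRESHOLD = 0.8   # assumed example value
    CORRECTION_THRESHOLD = 0.4           # assumed example value

    def interpret_semantic_score(score):
        if score >= SEMANTIC_COHERENCE_THRESHOLD:
            return "coherent"        # output the first text content directly
        if score >= CORRECTION_THRESHOLD:
            return "correctable"     # output the corrected text content
        return "uncorrectable"       # filter the first text content

    for score in (0.9, 0.6, 0.2):
        print(score, interpret_semantic_score(score))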
According to the second aspect, or any implementation manner of the second aspect, the correction model, the classification model, and the semantic model form a neural network, and training data of the neural network includes a second text region and second text content corresponding to the second text region, and a third text region and third text content corresponding to the third text region; the second text region includes partially missing text content, and the text content in the third text region is complete text content. In this way, by inputting images and text contents of text regions of different types (including text regions with missing text and text regions without missing text), the neural network can be trained iteratively, so that it can perform the corresponding functions, i.e., truncation judgment, semantic analysis, and correction of the images and text contents of the text regions.
According to a second aspect, or any implementation manner of the second aspect, the text recognition result of the first text region is displayed in a text recognition region, and the text recognition region further includes text content corresponding to a third text region in the object to be recognized.
According to a second aspect, or any implementation manner of the second aspect above, the semantics of the semantically erroneous text content expression are different from the semantics of the corresponding text content expression in the first text region.
According to a second aspect, or any implementation manner of the second aspect, the object to be identified is a picture, a web page or a document.
In a third aspect, an embodiment of the present application provides an electronic device. The electronic device includes: one or more processors; a memory; and one or more computer programs, wherein the one or more computer programs are stored in the memory, and when executed by the one or more processors, cause the electronic device to perform the method of the first aspect or any possible implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device. The electronic device includes: one or more processors; a memory; and one or more computer programs, wherein the one or more computer programs are stored in the memory, and when executed by the one or more processors, cause the electronic device to perform the method of the second aspect or any possible implementation of the second aspect.
In a fifth aspect, embodiments of the present application provide a computer readable medium storing a computer program comprising instructions for performing the method of the first aspect or any possible implementation of the first aspect.
In a sixth aspect, embodiments of the present application provide a computer readable medium storing a computer program comprising instructions for performing the method of the second aspect or any possible implementation of the second aspect.
In a seventh aspect, embodiments of the present application provide a computer program comprising instructions for performing the method of the first aspect or any possible implementation of the first aspect.
In an eighth aspect, embodiments of the present application provide a computer program comprising instructions for performing the method of the second aspect or any possible implementation of the second aspect.
Drawings
FIG. 1 is a schematic diagram of an exemplary hardware structure of an electronic device;
FIG. 2 is a schematic diagram of an exemplary software architecture of an electronic device;
FIG. 3 is a schematic diagram of an exemplary text recognition scenario containing truncated text;
FIG. 4 is a schematic diagram of an application scenario to which the text recognition method in an embodiment of the present application is applied;
FIG. 5 is a schematic flow chart of an exemplary text recognition method;
FIG. 6 is a schematic diagram of exemplary text recognition;
FIG. 7 is an exemplary text image coding schematic diagram;
FIG. 8 is an exemplary image information encoding schematic diagram;
FIG. 9 is an exemplary image information encoding schematic diagram;
FIG. 10 is an exemplary Image Patch flattening schematic diagram;
FIG. 11 is a schematic diagram of exemplary text content encoding;
FIG. 12 is a schematic diagram of an exemplary text information encoding flow;
FIG. 13 is a schematic diagram of an exemplary process for obtaining intermediate characterization information;
FIG. 14a is a schematic diagram of exemplary multi-modal encoding;
FIG. 14b is a schematic diagram of an exemplary processing flow of a multi-modal encoder;
FIG. 14c is a schematic diagram of an exemplary classification flow;
FIG. 15 is a schematic diagram of exemplary text modification;
FIG. 16 is a schematic diagram of an exemplary processing flow of a modification module;
FIG. 17 is a schematic diagram of an exemplary processing flow of a Transformer Decoder;
FIG. 18a is a schematic diagram of an exemplary application scenario;
FIG. 18b is a schematic diagram of another exemplary application scenario;
FIG. 18c is a schematic diagram of yet another exemplary application scenario;
FIG. 18d is a schematic diagram of yet another exemplary application scenario;
FIG. 18e is a schematic diagram of yet another exemplary application scenario;
FIG. 19 is a flow chart of an exemplary text recognition method;
FIG. 20 is an exemplary text image processing schematic diagram;
FIG. 21 is a schematic diagram of an exemplary processing flow of a semantic model;
FIG. 22 is a schematic structural diagram of an exemplary apparatus.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments disclosed herein without departing from the scope of the application, are intended to be within the scope of the application.
Fig. 1 shows a schematic configuration of an electronic device 100. It should be understood that the electronic device 100 shown in fig. 1 is only one example of an electronic device, and that the electronic device 100 may have more or fewer components than shown in the figures, may combine two or more components, or may have different component configurations. The various components shown in fig. 1 may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
The electronic device 100 may include: processor 110, external memory interface 120, internal memory 121, universal serial bus (universal serial bus, USB) interface 130, charge management module 140, power management module 141, battery 142, antenna 1, antenna 2, mobile communication module 150, wireless communication module 160, audio module 170, speaker 170A, receiver 170B, microphone 170C, headset interface 170D, sensor module 180, keys 190, motor 191, indicator 192, camera 193, display 194, and subscriber identity module (subscriber identification module, SIM) card interface 195, etc. The sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
The processor 110 may include one or more processing units, such as: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a memory, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.
The controller may be a neural hub and a command center of the electronic device 100, among others. The controller can generate an operation control signal according to the instruction operation code and the time sequence signal to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system.
The charge management module 140 is configured to receive a charge input from a charger. The charger can be a wireless charger or a wired charger. In some wired charging embodiments, the charge management module 140 may receive a charging input of a wired charger through the USB interface 130. In some wireless charging embodiments, the charge management module 140 may receive wireless charging input through a wireless charging coil of the electronic device 100. The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142.
The power management module 141 is used for connecting the battery 142, and the charge management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 and provides power to the processor 110, the internal memory 121, the external memory, the display 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 may also be configured to monitor battery capacity, battery cycle number, battery health (leakage, impedance) and other parameters. In other embodiments, the power management module 141 may also be provided in the processor 110. In other embodiments, the power management module 141 and the charge management module 140 may also be disposed in the same device.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single or multiple communication bands. Different antennas may also be multiplexed to improve the utilization of the antennas. For example: antenna 1 may be multiplexed into a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 150 may provide a solution for wireless communication including 2G/3G/4G/5G, etc., applied to the electronic device 100. The mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (low noise amplifier, LNA), etc. The mobile communication module 150 may receive electromagnetic waves from the antenna 1, perform processes such as filtering, amplifying, and the like on the received electromagnetic waves, and transmit the processed electromagnetic waves to the modem processor for demodulation. The mobile communication module 150 can amplify the signal modulated by the modem processor, and convert the signal into electromagnetic waves through the antenna 1 to radiate. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the same device as at least some of the modules of the processor 110.
The modem processor may include a modulator and a demodulator. The modulator is used for modulating the low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low frequency baseband signal to the baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor outputs sound signals through an audio device (not limited to the speaker 170A, the receiver 170B, etc.), or displays images or video through the display screen 194. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be provided in the same device as the mobile communication module 150 or other functional module, independent of the processor 110.
The wireless communication module 160 may provide solutions for wireless communication including wireless local area network (wireless local area networks, WLAN) (e.g., wireless fidelity (wireless fidelity, wi-Fi) network), bluetooth (BT), global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field wireless communication technology (near field communication, NFC), infrared technology (IR), etc., applied to the electronic device 100. The wireless communication module 160 may be one or more devices that integrate at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, modulates the electromagnetic wave signals, filters the electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, frequency modulate it, amplify it, and convert it to electromagnetic waves for radiation via the antenna 2.
In some embodiments, antenna 1 and mobile communication module 150 of electronic device 100 are coupled, and antenna 2 and wireless communication module 160 are coupled, such that electronic device 100 may communicate with a network and other devices through wireless communication techniques. The electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display 194 includes a display panel. The display panel may employ a liquid crystal display (liquid crystal display, LCD), an organic light-emitting diode (organic light-emitting diode, OLED), an active-matrix organic light-emitting diode (active-matrix organic light emitting diode, AMOLED), a flexible light-emitting diode (flex light-emitting diode, FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (quantum dot light emitting diodes, QLED), or the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, N being a positive integer greater than 1.
The electronic device 100 may implement photographing functions through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
The ISP is used to process data fed back by the camera 193. The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. In some embodiments, electronic device 100 may include 1 or N cameras 193, N being a positive integer greater than 1.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to enable expansion of the memory capabilities of the electronic device 100. The external memory card communicates with the processor 110 through an external memory interface 120 to implement data storage functions. For example, files such as music, video, etc. are stored in an external memory card.
The internal memory 121 may be used to store computer executable program code that includes instructions. The processor 110 executes various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121. The internal memory 121 may include a storage program area and a storage data area. The storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like. The storage data area may store data created during use of the electronic device 100 (e.g., audio data, phonebook, etc.), and so on. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (universal flash storage, UFS), and the like.
The electronic device 100 may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or a portion of the functional modules of the audio module 170 may be disposed in the processor 110.
The software system of the electronic device 100 may employ a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In the embodiment of the application, an Android system with a layered architecture is taken as an example to illustrate the software structure of the electronic device 100. In other embodiments, the embodiments of the present application may also be applied to other systems such as HarmonyOS, and their implementation may refer to the technical solutions in the embodiments of the present application, which are not illustrated one by one.
Fig. 2 is a software configuration block diagram of the electronic device 100 according to the embodiment of the present application.
The layered architecture of the electronic device 100 divides the software into several layers, each with a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, from top to bottom: an application layer, an application framework layer, Android runtime (Android runtime) and system libraries, and a kernel layer.
The application layer may include a series of application packages.
As shown in fig. 2, the application package may include applications for cameras, gallery, calendar, phone calls, map, navigation, WLAN, bluetooth, music, video, short messages, text recognition, text processing, etc. The text recognition application may also be referred to as a text recognition module or a text recognition engine in the embodiment of the present application, which is not limited by the present application. The text recognition module may be used to recognize text regions and text content in a picture to be recognized (see below for specific concepts). The text processing application may also be referred to as a text processing module for further processing of the output of the text recognition module (for a specific processing procedure reference is made to the following embodiments). In the embodiment of the present application, the text processing module is used to further process the result of the text recognition module. In other embodiments, the steps performed by the text recognition module may be performed by the text processing module, or it should be understood that the steps performed by the text recognition module and the text processing module may be performed by one module, which is not limited by the present application.
The application framework layer provides an application programming interface (application programming interface, API) and programming framework for application programs of the application layer. The application framework layer includes a number of predefined functions.
As shown in FIG. 2, the application framework layer may include a window manager, a content provider, a view system, a telephony manager, a resource manager, a notification manager, and the like.
The window manager is used for managing window programs. The window manager can acquire the size of the display screen, judge whether a status bar exists, lock the screen, intercept the screen and the like.
The content provider is used to store and retrieve data and make such data accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebooks, etc.
The view system includes visual controls, such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, a display interface including a text message notification icon may include a view displaying text and a view displaying a picture.
The telephony manager is used to provide the communication functions of the electronic device 100. Such as the management of call status (including on, hung-up, etc.).
The resource manager provides various resources for the application program, such as localization strings, icons, pictures, layout files, video files, and the like.
The notification manager enables an application to display notification information in the status bar, and can be used to convey notification-type messages that automatically disappear after a short stay without user interaction. For example, the notification manager is used to notify that a download is complete, to give a message reminder, and the like. The notification manager may also display notifications in the status bar at the top of the system in the form of a chart or scroll-bar text, for example a notification of an application running in the background, or a notification that appears on the screen in the form of a dialog window. For example, a text message is prompted in the status bar, a prompt tone is emitted, the electronic device vibrates, or an indicator light blinks.
The system library may include a plurality of functional modules. For example: surface manager (surface manager), media Libraries (Media Libraries), three-dimensional graphics processing Libraries, 2D graphics engines (e.g., SGL), etc.
The surface manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.
Media libraries support a variety of commonly used audio, video format playback and recording, still image files, and the like. The media library may support a variety of audio and video encoding formats.
The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software. The kernel layer at least comprises a display driver, a camera driver, an audio driver, a sensor, a Bluetooth driver, a Wi-Fi driver and the like.
It is to be appreciated that the components contained in the system framework layer, the system library, and the runtime layer shown in fig. 2 do not constitute a particular limitation of the electronic device 100. In other embodiments of the application, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components.
Fig. 3 is a schematic diagram of an exemplary text recognition scenario with truncated text. Referring to fig. 3 (1), a picture 302 is displayed on a display interface 301 of the mobile phone. For example, the display interface 301 may be an application interface, for example, an interface of a system application such as a gallery application interface, and the interface 301 may also be an application interface of a third party application such as a chat application. That is, in the embodiment of the present application, the system in the mobile phone may have a text recognition function (i.e. the text recognition module in fig. 2), for example, the gallery application may call the text recognition module of the mobile phone to perform text recognition on the picture. Optionally, the third party application in the mobile phone may also have a text recognition function, and the implementation processes of the text recognition functions of different third party applications may be the same or different, which is not limited by the present application. Optionally, the third party application in the mobile phone may also call the text recognition module of the mobile phone, which is not limited by the present application.
Still referring to fig. 3 (1), the picture 302 includes text and images (of course, the picture 302 may include only text). It should be noted that, in the embodiment of the present application, only the text recognition scene of the picture is illustrated as an example, and in other embodiments, the present application may also be applied to the text recognition scene in the application interface, for example, the scene may be text recognition of a page displayed by the application of the browser, which is not limited by the present application.
Optionally, the picture 302 may be generated after the mobile phone responds to the user operation and performs the screenshot operation; the picture 302 may also be generated by the mobile phone through a photographing function; the picture 302 may also be a downloaded picture, etc., and the present application is not limited thereto.
Illustratively, the text in the picture 302 includes a plurality of lines, wherein the first line of text and the last line of text displayed in the picture 302 are truncated by the border of the picture 302; in the embodiment of the present application, such text (or characters) is referred to as "truncated text". It should be noted that fig. 3 only takes vertically truncated text as an example for illustration; the technical solution in the embodiment of the present application is equally applicable to recognition scenarios with horizontally truncated text and obliquely truncated text, and specific examples will be described below. Illustratively, the "vertically truncated text" described in embodiments of the present application is optionally text truncated perpendicular to the reading direction of the text. It will be appreciated that text lines may be occluded by the top and bottom edges of the screen, or by some fixed, frozen status bar, as the interface slides up and down. Taking the picture 302 as a screenshot of a web page as an example, a user slides the web page up and down while browsing, and accordingly the first line currently displayed by the web page may be cut off by the upper edge of the web page (which may also be understood as the upper border of the display frame). The user performs a screenshot operation on the currently displayed web page, and the mobile phone generates the picture 302 in response to the received screenshot operation. The first line of text displayed in the picture 302 is then "vertically truncated text". For example, the "horizontally truncated text" in the embodiments of the present application is text truncated along a text line, for example a text line truncated laterally by photographing or scanning. Illustratively, "obliquely truncated text" is optionally text truncated in a direction at an angle to the reading direction of the text.
Still referring to fig. 3 (1), the user may press the picture 302 for a long time. Referring to fig. 3 (2), an exemplary application displays an option box 303 in response to a received long press operation on the picture 302. Optionally, options box 303 includes, but is not limited to: sharing options, favorites options, extract text option 304, and the like. The location, size, and number and names of options included in the options box 303 are merely illustrative examples, and the present application is not limited thereto.
Illustratively, the user clicks the extract text option 304 to indicate that text in the picture 302 is to be extracted. The mobile phone starts a text recognition function in response to the received user operation (as described above, the text recognition function may be a text recognition function of an application itself or a text recognition function of a calling system, which is not limited by the present application).
In the embodiment of the application, the text recognition function optionally adopts OCR technology, which is mainly divided into two steps: the first step is text region detection, and the second step is text content recognition. Illustratively, the text region detection step optionally detects at least one text region in the image, i.e., identifies regions in the image that contain text. Illustratively, the text content recognition step optionally recognizes the text in an acquired text region, i.e., recognizes the specific text content in the text region. For the detailed steps of text region detection and text content recognition, reference may be made to related content in prior art embodiments, which the present application will not repeat.
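To make the two steps concrete, here is a minimal stand-in sketch; real systems would use trained detection and recognition models where the placeholders are, and the function names and return values are assumptions for illustration only.

    def detect_text_regions(image):
        """Step 1: text region detection - return regions of the image that contain text."""
        # Placeholder: a real detector would compute bounding boxes from the image.
        return [(10, 20, 300, 60), (10, 70, 300, 110)]

    def recognize_text_content(image, region):
        """Step 2: text content recognition - return the characters inside one region."""
        # Placeholder for a recognition model applied to the cropped region.
        return f"text recognized in region {region}"

    def ocr(image):
        return [recognize_text_content(image, region) for region in detect_text_regions(image)]

    print(ocr("picture_302.png"))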
Referring to fig. 3 (3), the display interface 301 exemplarily includes, but is not limited to: the reduced picture 302 and a text recognition result display box 305. It should be noted that the interface layout of the display interface 301 in the embodiment of the present application is only illustrative, and the present application is not limited thereto. The text recognition result display box 305 exemplarily includes, but is not limited to: the "paint select words" option, the text recognition result, and other options. Optionally, other options include, but are not limited to: a "full select" option, a "search" option, a "copy" option, a "translate" option, and the like. Each of the other options may be used to process the text recognition result accordingly.
Still referring to fig. 3, the text recognition result in the text recognition result display box 305 is, for example, the result recognized by the text recognition function. However, in this example, because the top line of text in the picture 302 is truncated (i.e., is the vertically truncated text described above), the top line of text is displayed incompletely. Accordingly, the result recognized by the text recognition function may be inaccurate. For example, as shown in fig. 3 (3), the original top line of text in the web page is "the first round match, full cheering when the full cheering is equal to 5", and because the top text was cut off by the upper border while the web page was browsed, the top text in the picture 302 obtained from the screenshot is truncated. When the application performs text recognition on the picture 302, the output result for the first line of text is "Japanese hole L lending, shihong baking eight-element earth bitten, 5", which differs greatly from the original text and contains semantic logic errors. Moreover, such a recognition result cannot be readily restored to the original text by techniques such as semantic reasoning, which affects the user experience. Illustratively, the recognition result corresponding to a text line that is not truncated in the picture 302 (e.g., the second line of text in the picture 302) does not differ from the original text.
The embodiment of the application provides a text recognition method, which takes a text image and text content as the input of a model (which may be called a text recognition model or a text recognition network), and obtains the coding information of each modality through the respective modal coding. The text processing module performs modal information fusion on the coding information corresponding to the text image and the coding information corresponding to the text content, and uses the fused information as the attention input of a classification decoder and a correction decoder. In principle, the model implicitly and comprehensively considers the image information (mainly the truncation condition) and the text information (mainly the semantic consistency), and uses high-dimensional multi-modal semantic information to make more refined decisions on different input combinations, so that an anthropomorphic, complex decision-making effect can be obtained. This anthropomorphic complex decision is reflected in the final result as three classification results: direct filtering, which indicates that the occlusion makes the semantics uncorrectable; corrected output, which indicates that the occlusion makes the semantics inconsistent but correctable; and direct output without correction, which indicates that there is no occlusion, or that occlusion exists but does not affect the semantics. That is, the text recognition method provided in the embodiment of the present application provides a more anthropomorphic processing scheme. Under normal conditions, if the text is occluded too much, the user cannot recognize the correct information with the naked eye, and can also determine that the text content read from the truncated text is incorrect; if the text is only slightly occluded, the user can infer the occluded text from the semantics; and for non-occluded text, the user can correctly read the corresponding text content. The technical scheme in the embodiment of the application can achieve this anthropomorphic reading effect: when the text occlusion (i.e. truncation) is large, no result is output; when the occlusion is small, the corrected result is output; and when there is no occlusion, the corresponding text is output. Therefore, the text recognition results that are presented are correct and semantically smooth, and the text contents with semantic errors are filtered out, thereby improving the user experience.
Fig. 4 is a schematic diagram of an application scenario of the text recognition method in the embodiment of the present application. Referring to fig. 4 (1), a gallery application is illustrated; after a user clicks a thumbnail corresponding to a picture 402 displayed in the gallery application, the gallery application may display the picture 402 in a display interface 401. Optionally, options (or controls) such as a sharing option, a favorites option, and the like are also included in the display interface 401.
The gallery application may invoke a text recognition module and a text processing module of the system to perform text recognition and processing on the picture 402 (which may also be referred to as a picture to be recognized or an image to be recognized). As described above, in the embodiment of the present application, text recognition includes both text region detection and text content recognition. Optionally, after receiving the operation of the user clicking the thumbnail corresponding to the picture 402, the text recognition module may perform the text region detection step to detect whether a text region is included in the picture 402. In this example, the picture 402 includes both an image and text (of course, it may also include only text), which is not limited by the present application. Accordingly, the text recognition module may detect at least one text region included in the picture 402. After the text recognition module detects that a text region is included in the picture 402, an "extract text in picture" option 403 may be displayed in the display interface 401. The user may click on the "extract text in picture" option 403 to indicate that the text content in the picture 402 is to be extracted. In response to the received user operation, the gallery application performs text recognition on the picture 402 through the text recognition module, i.e. performs the text content recognition step, to obtain the corresponding text content in each text region. In the embodiment of the application, the text processing module may further process the recognition result (including the text regions and the text contents) obtained by the text recognition module.

Referring to fig. 4 (2), the display interface 401 includes, but is not limited to: the reduced picture 402 and an extracted text display box 404. Optionally, the extracted text display box 404 includes, but is not limited to: a text recognition result display box 405 and other options. Other options include, but are not limited to: a "smear to select words" option, a "read full text" option, a "select all" option, a "search" option, a "copy" option, a "translate" option, and the like. It should be noted that the layout of the controls in the display interface shown in the embodiment of the present application is only an illustrative example, and the present application is not limited thereto.

For example, the text recognition result display box 405 includes the text content recognized by the text recognition module. As shown in fig. 4 (2), in the embodiment of the present application, for truncated text (for example, the first line of text), the mobile phone does not display the corresponding text in the text recognition result display box 405. That is, for a text recognition result that may have semantic errors or be garbled, the text processing module may choose not to output (i.e., not display) the text recognition result, so as to avoid the problem that the text recognition result differs greatly from the original text. Still referring to fig. 4 (2), for non-truncated text, the text processing module may display the corresponding text in the text recognition result display box 405. Optionally, in the embodiment of the present application, the text processing module may further correct the text content recognized by the text recognition module, so as to obtain correct text (which may also be understood as text that is close to or the same as the original text), and output the corrected result (i.e. display it in the text recognition result display box 405).
That is, in the embodiment of the present application, text with semantic errors is filtered out or corrected, so that the text recognition result displayed in the text recognition result display box 405 is semantically correct and coherent, thereby enhancing the user experience.
It should be noted that, in the embodiment of the present application, only the scene of recognizing and processing text in a picture is taken as an example. In other embodiments, the present application may also be applied to text recognition and processing in an application interface, for example, performing text recognition and processing on a page displayed by a browser application.
It should be further noted that, the picture 402 may be generated after the mobile phone responds to the user operation and performs the screenshot operation; the picture 402 can also be generated by the mobile phone through a photographing function; the picture 402 may also be a downloaded picture, etc., and the present application is not limited thereto.
It should be further noted that, in the embodiment of the present application, only the scenario in which the gallery application calls the text recognition module and the text processing module is taken as an example for explanation. The steps executed by the text recognition module and the text processing module in the embodiment of the application can also be applied to other applications. For example, the chat application may perform text recognition on the picture to be recognized by using its own text recognition function, and obtain a corresponding text recognition result. The chat application can call a text processing module of the mobile phone to further process the text recognition result. For another example, the chat application may also have a text recognition module and a text processing module, and implement the steps implemented by the text recognition module and the text processing module in the embodiments of the present application. For another example, the chat application may also invoke a text recognition module and a text processing module of the mobile phone, which is not limited by the present application.
It should be further noted that the steps executed by the text recognition module in the embodiment of the present application are merely illustrative examples. The steps executed by the text recognition module in the mobile phone and the steps executed by the text recognition module in an application may be the same or different; for specific details, reference may be made to the prior art, and the present application is not limited thereto. For example, the text recognition module in the mobile phone may perform text recognition using OCR technology and obtain corresponding recognition results, including text images and text contents (the concepts of text image and text content will be described below). The text recognition module in the chat application may perform text recognition using other technologies and obtain a corresponding recognition result, which also includes text images and text contents. Optionally, the recognition results obtained by the text recognition module of the chat application and by the text recognition module of the mobile phone may be the same or different; for example, the text recognition module in the mobile phone may recognize 5 text regions and obtain the corresponding text contents, while the text recognition module in the chat application may recognize 6 text regions and obtain the corresponding text contents, which is not limited by the present application. That is, the text processing module in the embodiment of the present application may further process the recognition result of any text recognition module (which may belong to the mobile phone and/or an application) to obtain a result that meets the user's requirement.
It should be further noted that the operations for triggering the text recognition and processing functions in different applications may be the same or different; the user operation referred to in this application (i.e. clicking on the "extract text" option) is merely illustrative, and the present application is not limited thereto.
It should be further noted that, in the embodiment of the present application, only the scene of the first line text truncation is taken as an example for illustration, and in other embodiments, the text recognition method in the embodiment of the present application is equally applicable to the scene including the last line text truncation.
It should be further noted that, in the embodiment of the present application, text truncated by a border is taken as an example. In other embodiments, the text may be truncated by a picture or for other reasons, which is not limited by the present application.
In one possible implementation, the text recognition module may perform the text region detection step on each picture in the gallery application with the phone on standby or the gallery application in the background, etc. That is, the text recognition module may perform the text region detection step on the pictures in the gallery application in advance, so that the "extract text in the picture" option box may be displayed immediately after the user clicks the picture including the text region, so as to improve the overall efficiency of text recognition and processing.
The text recognition method in the embodiment of the application is described in detail below with reference to the accompanying drawings. Fig. 5 is a flow chart illustrating an exemplary text recognition method. Referring to fig. 5, the text recognition module may obtain a result recognized based on the OCR technology, where the result includes at least one text image and text content corresponding to each text image. For example, fig. 6 is a schematic diagram illustrating text recognition shown in an exemplary manner, referring to fig. 6, the text recognition module performs text region detection on a picture 601 (i.e. a picture 402, a specific description may refer to the picture 402, and is not described herein in detail) by using an OCR technology to obtain at least one text region. Specifically, the text region detection may be understood as that after the OCR technology detects the region containing the text in the picture 601, at least one text region in the picture 601 is segmented to obtain at least one text image (i.e., an image corresponding to at least one text region in the picture 601). For example, as shown in fig. 6, the text recognition module detects a text region 602a containing text in the picture 601, and the text recognition module may divide the text region 602a (e.g., along a dotted line) to obtain an image corresponding to the text region 602a, abbreviated as text image 602a.
For example, the text recognition module may sequentially segment the region of the picture 601 that contains text, for example, an image of the text region 603a, abbreviated as text image 603a, may be obtained. In the embodiment of the present application, only the text area 602a and the text area 603a are taken as examples, and the text recognition module may obtain more text areas in the picture 601.
In one possible implementation, after the text recognition module recognizes the text region through OCR technology, the text recognition module may obtain a corresponding text image through a process such as radial or perspective transformation correction.
In another possible implementation, the size of the single text image may be the same as or greater than the size of the actual area occupied by the text content in the text image. Such as text image 602a, which has a size that is larger than the size of the area actually occupied by the text content, i.e., there is a blank area between the border of the text image and the text content (i.e., the edges of the text content).
Still referring to fig. 6, the text recognition module may perform text content recognition on the acquired at least one text region (i.e., text image) by OCR technology. Taking the text image 602a and the text image 603a as examples, the text recognition module performs text content recognition on the text image 602a and obtains a text content recognition result 602b (which may also be referred to as text content 602b), i.e., it recognizes that the text content in the text image 602a is "rikou L credit, shihong baked eight-element clay sting, 5". The text recognition module continues to recognize the other text images to obtain the corresponding text content recognition results; for example, the text recognition module performs text content recognition on the text image 603a by OCR technology to obtain the corresponding text content recognition result 603b (which may also be referred to as text content 603b), i.e., it recognizes that the text content in the text image 603a is "the impulse also shows high strength, the first round 107". It should be noted that, in this embodiment, only the text image 602a and the text image 603a are taken as examples; the text recognition module may perform text content recognition on each obtained text image based on the OCR technology to obtain the corresponding text content, which is not described one by one. It should be further noted that the text recognition module may perform text content recognition on the text images in parallel or sequentially, which is not limited by the present application.
Still referring to fig. 5, the text processing module obtains the recognition results from the text recognition module, for example including, but not limited to: the text image 602a and the corresponding text content 602b, and the text image 603a and the corresponding text content 603b. The text processing module performs the flow in fig. 5 for each text image and the text content corresponding to that text image input by the text recognition module. It should be noted that, after the text recognition module obtains the images corresponding to all text regions of the image to be recognized (for example, the picture 601) and the corresponding text contents, the text recognition module outputs the recognition results to the text processing module for further processing. The text processing module may perform the flow in fig. 5 on the acquired text images and text contents one by one, or may process multiple text images and text contents in parallel, and the present application is not limited thereto. Alternatively, the text recognition module may output a text content and the corresponding text image to the text processing module for processing as soon as that text content is obtained, which is not limited by the present application and will not be repeated hereinafter.
With continued reference to fig. 5, taking the text image 602a and the text content 602b as an example, the text processing module passes the text image 602a and the text content 602b through an encoding model (which may also be referred to as an encoding module) to obtain image encoding information corresponding to the text image 602a and text encoding information corresponding to the text content 602b. Optionally, the encoding model may include, but is not limited to, an image encoding model (which may be referred to as an image encoding module) and a text encoding model (which may be referred to as a text encoding module). For example, the image encoding model may be used to encode the text image 602a to obtain the image encoding information corresponding to the text image 602a. That is, the image encoding model may encode the text image as machine-recognizable or machine-understandable semantic information. Illustratively, the text encoding module may be configured to encode the text content 602b to obtain the text encoding information. It can also be understood that the text encoding module encodes the text content into machine-recognizable or machine-understandable semantic information.
It should be noted that the structures of the image encoding information and the text encoding information depend on the encoder structure adopted in the encoding process. The encoder in the embodiment of the present application is only an illustrative example and may be set according to actual requirements, which is not limited in this application.
It should be further noted that the processing of the text image 602a and the text content 602b by the text processing module may be sequential or parallel, which is not limited by the present application. For example, the text processing module may process the text image 602a to obtain the image encoding information, and then process the text content 602b to obtain the text encoding information. For another example, the text processing module may encode the text content 602b before encoding the text image 602a. For another example, the text processing module may encode the text image 602a and the text content 602b simultaneously, which is not limited by the present application.
Still referring to fig. 5, taking a text image 602a and a text content 602b as an example, the text processing module fuses the image coding information corresponding to the text image 602a and the text coding information corresponding to the text content 602b through a multi-mode model (which may also be referred to as a multi-mode coding module, a multi-mode fusion module, etc., which is not limited by the present application), so as to obtain multi-mode coding information, which may also be referred to as intermediate characterization information.
Illustratively, the text processing module corrects the intermediate characterization information through a correction model (which may also be referred to as a correction module), and classifies the intermediate characterization information through a classification model (which may also be referred to as a classification module) to obtain a classification result. In the embodiment of the application, the classification result includes three classification items: filtering, corrected output, and direct output. The filtering classification item optionally means filtering the text content, i.e. not displaying the corresponding text content in the text recognition result. The corrected output classification item optionally means outputting the corrected text; it can also be understood that the text content is displayed in the text recognition result after being corrected. The direct output classification item optionally means displaying the text content in the text recognition result directly, i.e. the text processing module may directly display the text content recognized by the text recognition module through the OCR technology in the text recognition result. Taking the intermediate characterization information corresponding to the text image 602a and the text content 602b as an example: in one example, if the classification result of the intermediate characterization information is the filtering classification item, the text processing module filters the text content 602b, i.e. does not display the text content 602b in the text recognition result, so as to avoid the influence of text with semantic errors on the text recognition result. In another example, if the classification result of the intermediate characterization information is the corrected output classification item, the text processing module may display the corrected result of the intermediate characterization information in the text recognition result. In yet another example, if the classification result of the intermediate characterization information is the direct output classification item, the text processing module displays the text content 602b in the text recognition result.
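For orientation only, the overall data flow described above (modal encoding, fusion, classification decoder and correction decoder) can be sketched as follows. This is a minimal PyTorch-style sketch in which all module choices, dimensions and hyper-parameters are assumptions of this description, not the claimed implementation:

```python
import torch
import torch.nn as nn

class TextRecognitionNetSketch(nn.Module):
    """Illustrative sketch: image/text modal encoders, multi-modal fusion,
    a 3-way classification head and a correction decoder (all assumed)."""
    def __init__(self, vocab_size=6000, d=256):
        super().__init__()
        self.image_encoder = nn.Linear(16 * 16 * 3, d)      # stands in for Patch Embedding
        self.text_encoder = nn.Embedding(vocab_size, d)     # stands in for Word Embedding
        enc_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(enc_layer, num_layers=3)
        self.classifier = nn.Linear(d, 3)                   # filter / corrected output / direct output
        dec_layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
        self.corrector = nn.TransformerDecoder(dec_layer, num_layers=2)

    def forward(self, patches, token_ids, prev_tokens):
        e_v = self.image_encoder(patches)                   # (B, Nv, d) image coding information
        e_t = self.text_encoder(token_ids)                  # (B, Nt, d) text coding information
        e_m = torch.cat([e_v, e_t], dim=1)                  # modal information fusion input
        e_ir = self.fusion(e_m)                             # intermediate characterization information
        cls_scores = self.classifier(e_ir[:, 0])            # scores of the three classification items
        corrected = self.corrector(self.text_encoder(prev_tokens), e_ir)  # correction branch
        return cls_scores, corrected
```

In this sketch the classification head reads the first fused position as a stand-in for the classification head token described later; the detailed steps below follow the flow in fig. 5.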
The following describes the flow in fig. 5 in detail, taking the text image 602a and the text content 602b as examples. Fig. 7 is a schematic diagram of exemplary text image encoding. Referring to fig. 7, in the embodiment of the present application, the process of encoding the text image 602a by the text processing module (specifically, by the image encoding model) includes Patch Embedding and Positional Encoding (position encoding), so as to convert the three-dimensional image information into two-dimensional image encoding information E_v.
It should be noted that, as described above, the structure of the encoded information (for example, two-dimensional encoded information) obtained by encoding the text image and the text content is obtained according to the encoder architecture, and the encoder architecture may be set according to practical requirements, for example, in other embodiments, the three-dimensional image information may be converted into the image encoded information with a higher dimension or a lower dimension, which is not limited by the present application, and will not be repeated hereinafter.
It should be further noted that, in the embodiment of the present application, only the Patch Embedding and Positional Encoding modes are taken as an example for encoding the text image. In other embodiments, the text image may be encoded in other encoding modes, which is not limited by the present application.
The specific processes of Patch Embedding and Positional Encoding include, but are not limited to:
(1) The text processing module divides the text image 602a into N Patches.
Fig. 8 is an exemplary image information encoding schematic diagram. Referring to fig. 8, the text processing module (specifically, the image encoding model, which will not be repeated hereinafter) may optionally resize (adjust) the height (or the width, or both the width and the height) of the text image 602a to a preset pixel value. For example, the text processing module may adjust the height of the text image 602a to 32 pixels (or 64 pixels, which may be set according to actual requirements, and the application is not limited thereto), and accordingly, the width of the text image 602a is adjusted in proportion to the height (i.e. according to the aspect ratio of the text image 602a). As shown in fig. 8, in the embodiment of the present application, the height of the adjusted text image 602a is H, and the width (which may also be referred to as the length) is W. In other embodiments, the text image may not be resized, and the present application is not limited thereto.
Still referring to FIG. 8, an exemplary text processing module divides the text Image 602a into N Image Patches. In the embodiment of the present application, assuming that the width of one Image Patch is w and the height is h, the number of Image Patches acquired by the text processing module is:
N = (H*W)/(h*w)    (1)
Alternatively, the values of h and w may be the same or different, for example, may be 16 pixels, and may be set according to practical requirements, which is not limited by the present application.
Alternatively, the value of N is a positive integer, for example, N may be rounded up.
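As a worked example of formula (1), assuming the adjusted height H is 32 pixels, the width W is 256 pixels, and each Image Patch is 16×16 (assumed values only):

```python
import math

H, W = 32, 256          # resized text-image height and width (assumed values)
h, w = 16, 16           # Image Patch height and width (assumed values)

N = math.ceil((H * W) / (h * w))   # N is rounded up to a positive integer
print(N)                # -> 32 Image Patches
```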
(2) The text processing module performs Patch Embedding on the N Image Patches.
Fig. 9 is an exemplary image information encoding schematic diagram. Referring to FIG. 9, an exemplary Patch Embedding flow includes, but is not limited to, the following steps:
Step a: the text processing module flattens each Image Patch to obtain a one-dimensional vector P_i corresponding to each Image Patch.
Specifically, the width of each Image Patch is w, the height is h, and the number of channels is c; accordingly, the size of each Image Patch is (h×w×c). The text processing module flattens each Image Patch to obtain a one-dimensional vector of length (h×w×c). For the i-th Image Patch, this one-dimensional vector is denoted as P_i.
For example, fig. 10 is an exemplary Image Patch flattening schematic diagram. Referring to fig. 10, the Image Patch 801 in fig. 8 is taken as an example. The size of the Image Patch 801 is (h×w×c). The text processing module flattens the Image Patch 801 to obtain the corresponding one-dimensional vector P_1, i.e. a one-dimensional vector of length (h×w×c). The text processing module may flatten each Image Patch in the above manner to obtain the N vectors P_1 … P_N shown in fig. 9.
Step b: the text processing module passes the N one-dimensional vectors P_i through a fully connected layer to obtain N one-dimensional tensors whose length is a preset length.
For example, still referring to fig. 9, the text processing module passes the N one-dimensional vectors P_i through a fully connected layer whose output length is embedding_size (which can be set according to actual requirements, and the application is not limited thereto), and obtains N one-dimensional tensors E_vi of length embedding_size. For example, as shown in fig. 9, the text processing module passes P_1 through the fully connected layer of output length embedding_size to obtain a one-dimensional tensor E_v1 of length embedding_size. The text processing module processes the N one-dimensional vectors in the same manner to obtain E_v1 … E_vN.
It should be noted that, in the embodiment of the present application, the preset length is only exemplified as embedding_size. In other embodiments, the preset length may be another value, which is related to the fully connected layer adopted, and the present application is not limited thereto.
Step c: the text processing module arranges the N one-dimensional tensors E_vi in order to obtain a two-dimensional tensor of dimension (N, embedding_size).
Illustratively, still referring to fig. 9, the text processing module arranges the N one-dimensional tensors E_v1 … E_vN in order to obtain a two-dimensional tensor E_v0, where the dimension of E_v0 is (N, embedding_size).
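Steps a to c can be sketched in a few lines of PyTorch; the values of h, w, c, N and embedding_size below are assumptions for illustration only:

```python
import torch
import torch.nn as nn

h, w, c = 16, 16, 3
embedding_size = 256
N = 32                                   # number of Image Patches, see formula (1)

patches = torch.rand(N, h, w, c)         # N Image Patches of size (h, w, c)

# Step a: flatten each patch into a one-dimensional vector P_i of length h*w*c
P = patches.reshape(N, h * w * c)        # (N, 768)

# Step b: a fully connected layer projects every P_i to length embedding_size
fc = nn.Linear(h * w * c, embedding_size)
E_vi = fc(P)                             # (N, embedding_size)

# Step c: the N one-dimensional tensors arranged in order form the
# two-dimensional tensor E_v0 of dimension (N, embedding_size)
E_v0 = E_vi                              # already stacked in patch order
print(E_v0.shape)                        # torch.Size([32, 256])
```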
It should be noted that the image encoding manner in the embodiment of the present application is merely an illustrative example. For example, in other embodiments, the text processing module may instead apply a convolution whose kernel size is (h, w), whose stride (step size) is h (or w) and whose number of output channels is embedding_size to the Image Patches, which may be set according to actual requirements. The purpose is to encode the N Image Patches and obtain machine-readable encoding information with higher-level semantics.
Optionally, in the embodiment of the present application, the text processing module may splice (concat) E_v0 with a classification head E_cls to obtain a two-dimensional tensor E_v1. Optionally, the dimension of E_cls is (1, embedding_size), which may be set according to actual requirements; the present application is not limited thereto. Optionally, the classification head E_cls is a learnable parameter of the neural network.
Illustratively, E_v1 can be expressed as:
E_v1 = [E_cls, E_v0]    (2)
For example, assume the classification head E_cls is a one-dimensional tensor of length embedding_size. Taking the above E_v0 as an example, the text processing module splices E_v0 with E_cls to obtain E_v1, where the dimension of E_v1 is (N+1, embedding_size).
It should be noted that, in the embodiment of the present application, only splicing E_v0 with E_cls is taken as an example. In other embodiments, other manners such as adding or fusing may also be used, and the present application is not limited thereto.
(3) The text processing module performs Positional Encoding on E_v1.
Illustratively, the text processing module adds the two-dimensional tensor E_v1 obtained above to a two-dimensional position code E_pos to obtain the image coding information E_v. Illustratively, the image coding information E_v can be expressed as:
E_v = E_v1 + E_pos    (3)
The dimension of the position code is related to the dimension of the result of the above processing; the present application is described by taking two dimensions as an example, and the present application is not limited thereto.
For example, suppose E_pos is a two-dimensional tensor whose dimension is (N+1, embedding_size). Optionally, E_pos is a learnable parameter of the neural network. In the embodiment of the application, for convenience of representation, N_v = N+1 is recorded.
As shown in fig. 9, taking the above E_v1 as an example, E_v1 passes through Positional Encoding to obtain the image coding information E_v, whose dimension is likewise (N_v, embedding_size). In the embodiment of the present application, the method of combining the image encoding and the position encoding is described by taking addition (add) as an example; in other embodiments, other combining methods are also possible, and the present application is not limited thereto.
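A minimal sketch of the splicing with E_cls and the Positional Encoding step, again with assumed dimensions and a learnable position code:

```python
import torch
import torch.nn as nn

N, embedding_size = 32, 256
E_v0 = torch.rand(N, embedding_size)

# Learnable classification head E_cls of dimension (1, embedding_size)
E_cls = nn.Parameter(torch.zeros(1, embedding_size))

# Splice (concat) E_cls and E_v0 to obtain E_v1 of dimension (N+1, embedding_size), formula (2)
E_v1 = torch.cat([E_cls, E_v0], dim=0)

# Learnable two-dimensional position code E_pos with the same shape as E_v1
E_pos = nn.Parameter(torch.zeros(N + 1, embedding_size))

# Image coding information E_v = E_v1 + E_pos, formula (3)
E_v = E_v1 + E_pos
print(E_v.shape)        # torch.Size([33, 256]), i.e. (N_v, embedding_size)
```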
Fig. 11 is a diagram illustrating exemplary text content encoding. Referring to fig. 11, in the embodiment of the present application, the text processing module (specifically, the text encoding model, which will not be repeated hereinafter) performs text information encoding on the text content 602b, which includes Word Embedding and Positional Encoding, so as to convert the text information into text encoding information (which may also be referred to as text encoded information) with higher-level semantic features, denoted as E_t.
It should be noted that, in the embodiment of the present application, only the Word Embedding and Positional Encoding modes are taken as an example for encoding the text content. In other embodiments, the text information encoding may also be performed in other encoding modes, which is not limited by the present application.
Fig. 12 is a schematic diagram illustrating a text information encoding process, referring to fig. 12, the process includes, but is not limited to, the following steps:
(1) The text processing module performs word segmentation processing on the text content 602b.
As shown in fig. 12, illustratively, the text processing module performs word segmentation on the text content 602b according to a preset character length, so as to obtain a word segmentation result (which may also be referred to as a word segmentation sequence).
In the embodiment of the present application, a preset character length of one character is taken as an example, that is, the text processing module divides each character (including punctuation marks) into one word to obtain m words (for example, m is 18, i.e. the text is divided into 18 words). That is, the word segmentation sequence w with sequence length m may be expressed as:
w = [w_1, w_2, …, w_m]
It should be noted that, in other embodiments, the preset character length may also be set according to actual requirements, for example, two characters, which is not limited by the present application. Optionally, the preset character lengths of different words may differ, for example, one word may be divided to contain one character and another word to contain two characters, which is not limited by the present application.
(2) The text processing module acquires a text sequence number corresponding to the word segmentation sequence.
In the embodiment of the application, the text processing module may preset a text sequence number table (which may also be called text sequence number information, a word code table, etc., and the application is not limited thereto), and the text sequence number table is used for indicating the correspondence between characters (words or single characters) and sequence numbers. For example, one character may correspond to the sequence number "12" in the text sequence number table, and another character may correspond to the sequence number "52"; the correspondence between characters and sequence numbers may be set according to actual requirements, and the application is not limited thereto. It should be noted that the correspondence between text and sequence numbers may be stored in the form of a table, or may be stored in other manners, which is not limited by the present application.
Alternatively, the text included in the text sequence number table may cover a dictionary, or may cover any book in the professional field, and the application is not limited thereto.
As shown in fig. 12, the text processing module may, for example, look up the sequence number (which may also be referred to as a text sequence number) corresponding to each word (word or character) in the word segmentation sequence w based on the text sequence number table, so as to obtain a text sequence number sequence n, where n may be expressed as:
n = [n_1, n_2, …, n_m]
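A small sketch of steps (1) and (2); the characters and sequence numbers in the toy text sequence number table below are made-up examples, not the table actually used by the method:

```python
# Toy text sequence number table: character -> sequence number (made-up values)
char_to_id = {"火": 1, "山": 2, "爆": 3, "发": 4, ",": 5, "<unk>": 0}

text_content = "火山爆发"

# Step (1): split the text into words of one character each (the assumed
# preset character length), giving the word segmentation sequence w
w = list(text_content)                    # ['火', '山', '爆', '发']

# Step (2): look up every character in the table to get the text sequence number sequence n
n = [char_to_id.get(ch, char_to_id["<unk>"]) for ch in w]
print(n)                                  # [1, 2, 3, 4]
```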
(3) The text processing module passes the text sequence number sequence n through Word Embedding to obtain a two-dimensional tensor E_t0.
Illustratively, the text processing module passes the text sequence number sequence n through the Embedding layer to obtain the two-dimensional tensor E_t0, which can be expressed as:
E_t0 = Embedding(n)    (4)
where the dimension of the two-dimensional tensor E_t0 is (m, embedding_size).
It should also be noted that the dimension of E_t0 is related to the Embedding layer, and the application is not limited thereto.
(4) The text processing module adds E_t0 to the position code to obtain the text information code E_t.
Illustratively, as shown in fig. 12, the text processing module adds E_t0 to the position code E_pos' to obtain the text information code E_t. Illustratively, the text information code E_t can be expressed as:
E_t = E_t0 + E_pos'    (5)
The dimension of the position code is related to the dimension of the result of the above processing; the present application is described by taking two dimensions as an example, and the present application is not limited thereto.
For example, suppose E_pos' is a two-dimensional tensor whose dimension is (m, embedding_size). Optionally, E_pos' is a learnable parameter of the neural network. In the embodiment of the application, for convenience of representation, N_t = m is recorded. As shown in fig. 12, taking the above E_t0 as an example, the text processing module adds E_t0 to E_pos' to obtain the text information code E_t.
in the embodiment of the present application, only the way of combining text encoding and position encoding is described as adding (add), and in other embodiments, other combining ways are also possible, and the present application is not limited thereto.
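Steps (3) and (4) can be sketched with an Embedding layer and a learnable position code; the vocabulary size, embedding_size and sequence length below are assumptions:

```python
import torch
import torch.nn as nn

vocab_size, embedding_size = 6000, 256
m = 4                                        # length of the word segmentation sequence

n = torch.tensor([1, 2, 3, 4])               # text sequence numbers from the lookup step

# Step (3): the Embedding layer maps the m sequence numbers to the
# two-dimensional tensor E_t0 of dimension (m, embedding_size), formula (4)
embedding = nn.Embedding(vocab_size, embedding_size)
E_t0 = embedding(n)                          # (4, 256)

# Step (4): add the learnable position code E_pos' of the same dimension, formula (5)
E_pos_t = nn.Parameter(torch.zeros(m, embedding_size))
E_t = E_t0 + E_pos_t                         # text information code E_t
print(E_t.shape)                             # torch.Size([4, 256]), i.e. (N_t, embedding_size)
```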
The position code in the embodiment of the present application may be a parameter-learning Embedding layer similar to Bert Positional Embedding, or may be positional encoding based on sine/cosine transforms similar to the native Transformer architecture, and may be set according to actual requirements, which is not limited by the present application.
Still referring to fig. 5, after the text processing module obtains the image encoding information and the text encoding information, the intermediate characterization information may be obtained based on the image encoding information and the text encoding information. Fig. 13 is a schematic flow chart of obtaining intermediate characterization information, please refer to fig. 13, which specifically includes but is not limited to the following steps:
(1) The text processing module performs feature fusion on the image coding information E_v and the text coding information E_t to obtain a mixed semantic code E_m (which may also be referred to as hybrid encoding information; the application is not limited).
The text processing module (specifically, the multi-modal coding model, which will not be repeated hereinafter) splices the image coding information E_v with the text coding information E_t to obtain the mixed semantic code E_m, which can be expressed as:
E_m = [E_v, E_t]    (6)
For example, combining the image coding information E_v and the text coding information E_t described above, the dimension of the mixed semantic code E_m is (N_v + N_t, embedding_size).
It should be noted that, in the embodiment of the present application, the fusion mode of the image coding information E_v and the text coding information E_t is described by taking splicing as an example; in other embodiments, other modes, such as addition, are also possible, and the application is not limited thereto.
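In code, the splicing of formula (6) is a concatenation along the sequence dimension; a sketch with the dimensions assumed above:

```python
import torch

N_v, N_t, embedding_size = 33, 4, 256
E_v = torch.rand(N_v, embedding_size)      # image coding information
E_t = torch.rand(N_t, embedding_size)      # text coding information

# Mixed semantic code E_m = [E_v, E_t], dimension (N_v + N_t, embedding_size)
E_m = torch.cat([E_v, E_t], dim=0)
print(E_m.shape)                           # torch.Size([37, 256])
```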
(2) The text processing module passes the mixed semantic code E_m through a multi-modal encoder to obtain the multi-modal encoded information (i.e., the intermediate characterization information).
Fig. 14a is a schematic diagram illustrating exemplary multi-modal encoding. Referring to fig. 14a, the text processing module passes the mixed semantic code E_m through the multi-modal encoder 1301 to obtain the multi-modal encoded information (i.e., the intermediate characterization information), denoted as E_IR. Illustratively, the multi-modal encoder may also be understood as a module for extracting, from the input mixed encoded information, high-dimensional semantic information that fuses the image information and the text information.
Optionally, the multi-modal Encoder 1301 is composed of stacked Transformer Encoders, for example, the number of stacked layers is L. Each Transformer Encoder consists essentially of a multi-head attention layer (Multi-Head Attention), layer normalization (Layer Normalization, i.e. Norm in fig. 14a) and a feed-forward neural network (Feed Forward in fig. 14a).
Fig. 14b is a schematic diagram of the processing flow of the multi-modal encoder 1301. Referring to fig. 14b, in the embodiment of the present application, the number of stacked layers L is 3, that is, the multi-modal encoder 1301 includes a multi-modal encoder 1301a, a multi-modal encoder 1301b, and a multi-modal encoder 1301c. It should be noted that the number of encoders described in the embodiment of the present application is merely illustrative, may be set according to actual requirements, and the present application is not limited thereto. Illustratively, the mixed semantic code E_m passes through the multi-modal encoder 1301a to obtain an output result. The output of the multi-modal encoder 1301a is used as the input of the multi-modal encoder 1301b, which continues the encoding. The multi-modal encoder 1301b encodes based on the output result of the multi-modal encoder 1301a, and its output result serves as the input of the multi-modal encoder 1301c. The multi-modal encoder 1301c encodes based on the output result of the multi-modal encoder 1301b to obtain the final output result, i.e. the multi-modal encoded information E_IR, which can be expressed as:
E_IR = TE(TE(TE(E_m)))    (7)
where TE denotes a single encoder layer in the multi-modal encoder 1301. The dimension of the multi-modal encoded information E_IR is (N_v + N_t, embedding_size).
it should be noted that, the internal processing flow of each layer in the multi-mode encoder 1301 may refer to the related content in the prior art embodiment, and the disclosure is not repeated.
It should be further noted that, in the embodiment of the present application, only a Transformer Encoder is taken as an example of the multi-modal encoder. In other embodiments, the multi-modal encoder may also be a bi-directional recurrent neural network, or a simpler convolutional neural network encoder, which may be set according to actual requirements, and the present application is not limited thereto.
It should be further noted that, the method of obtaining the multi-mode encoding by the text processing module is not limited to the method of splicing the image encoding information and the text encoding information and then passing through the multi-mode encoder, and in other embodiments, the text processing module may also fuse the image encoding information and the text encoding information after respectively passing through the corresponding encoders. For example, the text processing module obtains high-dimensional image semantic information from the image encoding information through the image encoder, and obtains high-dimensional text semantic information from the text encoding information through the text encoder. And the text processing module performs dimension alignment on the high-dimensional image semantic information and the high-dimensional text semantic information and then splices the high-dimensional image semantic information and the high-dimensional text semantic information so as to obtain intermediate characterization information. The specific mode can be set according to actual requirements, and the purpose of the specific mode is to acquire the image semantic features and the text semantic features with high dimensions.
With continued reference to fig. 5, an exemplary text processing module (which may be a classification model, and is not repeated below) may classify the intermediate characterization information to determine whether to output text content 602b based on the classification result.
Fig. 14c is a schematic diagram of an exemplary classification flow. Referring to fig. 14c, the text processing module may pass the multi-modal encoded information (i.e., the intermediate characterization information) through the classification model to obtain the classification result. Illustratively, the classification model may include, but is not limited to, a classification decoder and an argmax layer (or softmax layer). In the embodiment of the present application, the classification decoder is a fully connected layer, and the fully connected layer is an MLP (Multi-Layer Perceptron), which is taken as an example for explanation. Illustratively, the MLP may include a plurality of hidden layers. In the embodiment of the present application, only a fully connected layer (e.g., an MLP) is taken as an example of the classification decoder. In other embodiments, the classification decoder may be another decoder, for example, but not limited to, a decoder like Transformer Decoder or a recurrent neural network (Recurrent Neural Network, RNN) decoder, which may be set according to actual requirements, and the present application is not limited thereto; the aim is to output the corresponding classification result based on the input intermediate characterization information. It should further be noted that, in the embodiment of the present application, the argmax layer is only used as an example; in other embodiments, the argmax layer or the softmax layer may be set according to actual requirements, and the present application is not limited thereto; the aim is to output the classification item corresponding to the maximum score.
Optionally, in an embodiment of the present application, the classification result includes, but is not limited to, three classification items:
(a) Filtering
(b) Corrected output
(c) Direct output
After the multi-modal encoded information passes through the classification decoder, the obtained classification result includes the scores corresponding to the three classification items. The text processing module may pass the scores corresponding to the three classification items through an argmax layer or a softmax layer to obtain the final decision class.
By way of illustration: as described above, the dimension of the multi-modal encoded information E_IR is (N_v + N_t, embedding_size). Optionally, in the embodiment of the present application, a one-dimensional tensor E_IR0 of length embedding_size may be obtained from the multi-modal encoded information E_IR.
The text processing module passes the one-dimensional tensor E_IR0 through the fully connected layer and outputs a one-dimensional tensor T_out of length 3 (i.e. the same number as the classification items). Optionally, the fully connected layer may be an MLP, which may include a plurality of hidden layers. Correspondingly, T_out can be expressed as:
T_out = MLP(E_IR0)    (8)
where the dimension of T_out is 3. It can be understood that T_out contains the scores corresponding to the three classification items a, b and c, and can be expressed as:
T_out = [f(a), f(b), f(c)]
where f(a) is the score corresponding to the classification item a (i.e. the filtering classification item), f(b) is the score corresponding to the classification item b (i.e. the corrected output classification item), and f(c) is the score corresponding to the classification item c (i.e. the direct output classification item).
Illustratively, the text processing module passes T_out through the argmax layer and outputs the classification item corresponding to the maximum score. In the embodiment of the present application, only an MLP is taken as an example of the fully connected layer. In other embodiments, the fully connected layer may also be another decoder, for example, but not limited to, a decoder like Transformer Decoder or a recurrent neural network (Recurrent Neural Network, RNN) decoder, which may be set according to actual requirements, and the present application is not limited thereto; the aim is to output the corresponding classification result based on the input intermediate characterization information. It should further be noted that, in the embodiment of the present application, the argmax layer is only used as an example; in other embodiments, the argmax layer or the softmax layer may be set according to actual requirements, and the present application is not limited thereto; the aim is to output the classification item corresponding to the maximum score.
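A minimal sketch of the classification branch (the hidden size of the MLP and the input dimensions are assumptions):

```python
import torch
import torch.nn as nn

embedding_size = 256
E_IR = torch.rand(37, embedding_size)        # multi-modal encoded information, (N_v + N_t, embedding_size)

# One-dimensional tensor E_IR0 of length embedding_size taken from E_IR
# (here the first row is used as an assumption for illustration)
E_IR0 = E_IR[0]

# Classification decoder: an MLP with one hidden layer, output length 3
mlp = nn.Sequential(
    nn.Linear(embedding_size, 128),
    nn.ReLU(),
    nn.Linear(128, 3),
)
T_out = mlp(E_IR0)                           # [f(a), f(b), f(c)], formula (8)

# argmax layer: the classification item with the maximum score is the decision
labels = ["filtering", "corrected output", "direct output"]
decision = labels[int(torch.argmax(T_out))]
print(decision)
```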
In one example, if f(a) is the maximum value, the output result is a, i.e., the classification result is the filtering classification item. Accordingly, the text processing module may filter the corresponding text content, i.e., not display the corresponding text content in the text recognition result. For example, in the process of processing the text image 602a and the text content 602b by the text processing module, if it is detected that the classification result is of type a, i.e. the filtering classification item, the text processing module filters the text content 602b. As shown in (2) of fig. 4, the text recognition result then does not include the truncated first line of text, which avoids an erroneous recognition result of the truncated text affecting the user experience.
In another example, if f(c) is the maximum value, the output result is c, i.e., the classification result is the direct output classification item. That is, the result recognized by the OCR technology is correct. Accordingly, the text processing module may display the corresponding text content in the text recognition result. For example, during the processing of the text image 603a and the text content 603b by the text processing module, it is detected that the classification result is of type c, i.e. the direct output classification item. The text processing module determines that the text content 603b can be directly output, and as shown in (2) of fig. 4, the text processing module may display the text content 603b at the corresponding position in the text recognition result.
In yet another example, if f(b) is the maximum value, the output result is b, i.e., the classification result is the corrected output classification item. It can be understood that the result recognized by the OCR technology includes some errors that need to be corrected before it can be output. As described above (i.e., in fig. 5), each piece of multi-modal encoded information (i.e., intermediate characterization information) obtained by the text processing module is corrected by the correction module. After the text processing module detects that the classification result corresponding to a piece of multi-modal encoded information is the corrected output classification item, the text processing module may display the text content corrected by the correction module in the text recognition result. It should be noted that, if the classification result is of type a or type c, the text processing module discards (or ignores) the correction result output by the correction module.
Fig. 15 is a schematic diagram of exemplary text correction. Referring to fig. 15, in the embodiment of the application, the correction module includes a Transformer Decoder as an example. The text processing module obtains the corrected text content by passing the multi-modal encoded information (i.e., the intermediate characterization information) through the Transformer Decoder 1501, the fully connected layer and the argmax layer.
It should be noted that, in other embodiments, the correction module may also have other architectures, for example, including but not limited to: a forward decoder based on a recurrent neural network, a Bert Decoder architecture, a decoder with stepwise monotonic attention, etc., which may be set according to actual requirements, and the application is not limited. The aim is to correct the input intermediate characterization information to obtain the corrected text.
Referring to fig. 15, the Transformer Decoder 1501 is composed of Q stacked Transformer Decoders, where Q may be a positive integer greater than 0. A single Transformer Decoder may be denoted as TD and includes, but is not limited to: a masked multi-head attention layer, a multi-head attention layer, layer normalization (i.e. Norm in fig. 15) and a feed-forward neural network (i.e. Feed Forward in fig. 15). Details of the processing of each layer may refer to the relevant content in the prior art, and are not repeated here.
Optionally, in the Transformer Decoder architecture, the K vector and V vector of the Transformer Decoder are the multi-modal encoded information (i.e., the output of the Encoder), and the Q vector is the output of the masked multi-head attention layer.
Fig. 16 is a schematic diagram illustrating the processing flow of the correction module. Referring to fig. 16, it is assumed that the OCR recognition result obtained by the text processing module includes a text content and a text image, where the text content is "volcanic outbreak", i.e. the character corresponding to "eruption" is recognized incorrectly. The text processing module obtains the multi-modal encoded information corresponding to the text content and the text image based on the method in the above embodiment. The text processing module obtains the corresponding classification result based on the multi-modal encoded information, and the classification result is the corrected output classification item. Specific details may be found above and are not repeated here. Referring to fig. 16, the text processing module illustratively inputs the multi-modal encoded information into the Transformer Decoder 1501 as the K vector and V vector, and the start symbol <s>, after Output Embedding and Positional Encoding, is input into the Transformer Decoder 1501 as the Q vector. Fig. 17 is a schematic diagram of an exemplary processing flow of the Transformer Decoder.
Optionally, the Output Embedding may be Word Embedding; for the specific implementation, reference may be made to the manner in the above embodiment, or to other implementations in the prior art, which will not be described again in the present application.
Illustratively, assume that the number of stacked layers Q of the Transformer Decoder 1501 in the embodiment of the present application is 2 (which may be set according to actual requirements; the present application is not limited), including the Transformer Decoder 1501a and the Transformer Decoder 1501b. Illustratively, the text processing module inputs the multi-modal encoded information into the Transformer Decoder 1501a as the K vector and V vector, and the start symbol <s>, after Output Embedding and Positional Encoding, is input into the Transformer Decoder 1501a as the Q vector. The output of the Transformer Decoder 1501a is the Q vector input of the Transformer Decoder 1501b, and the multi-modal encoded information is input into the Transformer Decoder 1501b as the K vector and V vector. The output of the Transformer Decoder 1501b is denoted as E_dout1. E_dout1 passes through the fully connected layer to obtain E_out1, where the dimension of E_out1 is (seq_len, N_vocab). Optionally, the text processing module takes the row of E_out1 corresponding to the last position to obtain a one-dimensional tensor of length N_vocab. The text processing module passes this one-dimensional tensor through the argmax layer (the argmax or softmax layer may be set according to actual requirements, and the application is not limited). Here, N_vocab is optionally the number of texts included in the text sequence number table; for example, if the dictionary contains 100 words and corresponding sequence numbers, then N_vocab is 100. The value of seq_len is illustratively the number of output characters; for example, in the embodiment of the present application, the number of output characters is 5, including the characters "fire", "mountain", "explosion", "send" and the end symbol <end>. Illustratively, the value output by the argmax layer is used to indicate a sequence number in the dictionary. The text processing module may determine the corresponding word or character based on the sequence number. In this example, the text processing module may determine that the corresponding character is "fire". That is, from the multi-modal encoded information and the start symbol <s>, the text processing module obtains the character "fire" through the Transformer Decoder 1501.
Still referring to fig. 16, the multi-modal encoded information is used as the K vector and V vector, and the "fire" character and the start symbol <s> are used as the Q vector input of the Transformer Decoder 1501. Optionally, the "fire" character and the start symbol <s> pass through Output Embedding and Positional Encoding, and are then input into the Transformer Decoder 1501a as the Q vector. The Transformer Decoder 1501 outputs E_dout2 based on the multi-modal encoded information, the "fire" character and the start symbol <s>. E_dout2 passes through the fully connected layer to obtain E_out2. E_out2 passes through the argmax layer to obtain the corresponding value. The text processing module may determine the corresponding character based on this value, for example "mountain". That is, from the multi-modal encoded information, the "fire" character and the start symbol <s>, the text processing module obtains the character "mountain" through the Transformer Decoder 1501. For details not described, reference may be made to the related content of the character "fire" above, which is not repeated here.
With continued reference to fig. 16, the multi-modal encoded information is input into Transformer Decoder 1501 as the K vector and the V vector, and the "fire" character, the "mountain" character and the start symbol <s> are input into Transformer Decoder 1501 as the Q vector. Optionally, the "fire" character, the "mountain" character and the start symbol <s>, after passing through Output Embedding and Positional Encoding, are input into Transformer Decoder 1501a as the Q vector. Transformer Decoder 1501 outputs E_dout3 based on the multi-modal encoded information, the "fire" character, the "mountain" character and the start symbol <s>. E_dout3 passes through the fully connected layer to obtain E_out3. E_out3 passes through the argmax layer to obtain a corresponding value. The text processing module may determine the corresponding character, for example "explosion", based on the value. That is, the text processing module obtains the character "explosion" from the multi-modal encoded information, the "fire" character, the "mountain" character and the start symbol <s> through Transformer Decoder 1501. Thereby, the erroneous character in the recognition result of the OCR technology is corrected to "explosion". Details not described may refer to the relevant content of the character "fire" above, and are not repeated here.
With continued reference to fig. 16, the multi-modal encoded information is input into Transformer Decoder 1501 as the K vector and the V vector, and the "fire" character, the "mountain" character, the "explosion" character and the start symbol <s> are input into Transformer Decoder 1501 as the Q vector. Optionally, the "fire" character, the "mountain" character, the "explosion" character and the start symbol <s>, after passing through Output Embedding and Positional Encoding, are input into Transformer Decoder 1501a as the Q vector. Transformer Decoder 1501 outputs E_dout4 based on the multi-modal encoded information, the "fire" character, the "mountain" character, the "explosion" character and the start symbol <s>. E_dout4 passes through the fully connected layer to obtain E_out4. E_out4 passes through the argmax layer to obtain a corresponding value. The text processing module may determine the corresponding character, for example "send", based on the value. That is, the text processing module obtains the character "send" from the multi-modal encoded information, the "fire" character, the "mountain" character, the "explosion" character and the start symbol <s> through Transformer Decoder 1501. Details not described may refer to the relevant content of the character "fire" above, and are not repeated here.
With continued reference to fig. 16, the multi-modal encoded information is input into Transformer Decoder 1501 as the K vector and the V vector, and the "fire" character, the "mountain" character, the "explosion" character, the "send" character and the start symbol <s> are input into Transformer Decoder 1501 as the Q vector. Optionally, the "fire" character, the "mountain" character, the "explosion" character, the "send" character and the start symbol <s>, after passing through Output Embedding and Positional Encoding, are input into Transformer Decoder 1501a as the Q vector. Transformer Decoder 1501 outputs E_dout5 based on the multi-modal encoded information, the "fire" character, the "mountain" character, the "explosion" character, the "send" character and the start symbol <s>. E_dout5 passes through the fully connected layer to obtain E_out5. E_out5 passes through the argmax layer to obtain a corresponding value. The text processing module determines that the output result is the end symbol <end>, i.e., the loop ends.
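Building on the hypothetical decode_step() from the sketch above, the character-by-character loop of the correction module (feeding the previously generated characters back in as part of the Q input until <end> is produced) might look as follows; BOS_ID, END_ID and the toy dictionary are assumptions for illustration only, and in the actual example the four generated characters together form "volcanic eruption":

    import torch

    BOS_ID, END_ID = 0, 1
    id_to_char = {2: "fire", 3: "mountain", 4: "explosion", 5: "send"}   # toy dictionary (assumption)

    def correct_text(multimodal_memory, max_steps=16):
        tokens = [BOS_ID]                                  # start from the start symbol <s>
        out_chars = []
        for _ in range(max_steps):
            next_id = decode_step(torch.tensor([tokens]), multimodal_memory).item()
            if next_id == END_ID:                          # the end symbol <end> ends the loop
                break
            out_chars.append(id_to_char.get(next_id, "?"))
            tokens.append(next_id)                         # previously output characters become part of the Q input
        return out_chars                                   # e.g. ["fire", "mountain", "explosion", "send"]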
For example, after detecting that the classification result is b, i.e., the classification item indicating that the text content is to be corrected and then output, the text processing module may obtain the corrected result output by the correction module, i.e., "volcanic eruption". The text processing module displays the acquired corrected result in the recognition result.
It should be noted that the models involved in the embodiments of the present application include, but are not limited to: the image encoding model, the text encoding model, the multi-modal encoding model, the classification model and the correction model, which may together constitute a text processing model; the text processing model may also be understood as a neural network. When training the text processing model, the input data of the model is mainly text images (including truncated and non-truncated samples) and the corresponding text recognition content (i.e., text content). Each text image input into the model and its corresponding text content form a pair of training samples. Each pair of training samples may be labeled manually, the labels being the three classification items described above. That is, the input text image and text content are classified by manual annotation. In particular, for the correctable case, the text to be corrected is manually corrected to obtain the corrected text, which is used as the supervision data output by the text correction decoder. Optionally, the training process of the text processing model is supervised training; the classification decoder (i.e., the classification model) adopts the categorical cross-entropy loss as the loss function, and the text correction decoder (i.e., the correction model) is trained similarly to a native Transformer autoregressive decoder, using teacher forcing at each time step. The actual training process is joint training, since the two decoders share the encoder (i.e., the backbone of the neural network of the text processing model).
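A rough, non-authoritative sketch of the joint training described above is given below; the encoder, cls_head and corr_decoder arguments are placeholders standing in for the shared backbone and the two decoders, not the specific modules of the embodiment:

    import torch
    import torch.nn as nn

    ce_cls = nn.CrossEntropyLoss()                       # categorical cross-entropy for the classification decoder
    ce_corr = nn.CrossEntropyLoss()                      # per-time-step cross-entropy for the correction decoder

    def joint_loss(encoder, cls_head, corr_decoder, image, text_ids, cls_label, target_ids):
        memory = encoder(image, text_ids)                # shared backbone -> multi-modal encoded information
        cls_logits = cls_head(memory)                    # (1, 3): scores of the three classification items
        loss_cls = ce_cls(cls_logits, cls_label)

        # Teacher forcing: the manually corrected text, shifted right, is fed as the decoder
        # input, and the unshifted text is the per-step prediction target.
        decoder_in, decoder_target = target_ids[:, :-1], target_ids[:, 1:]
        corr_logits = corr_decoder(decoder_in, memory)   # (1, seq_len, N_vocab)
        loss_corr = ce_corr(corr_logits.transpose(1, 2), decoder_target)
        return loss_cls + loss_corr                      # the two decoders are trained jointly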
In one possible implementation, the truncated text may also include laterally truncated text and obliquely truncated text, as described above. It should be noted that, for laterally truncated text, for example where the first character of each line is truncated, the text recognition module can typically predict the correct text content based on processing such as OCR recognition. That is, laterally truncated text generally does not suffer from the semantic errors of the vertically truncated text described above. Correspondingly, when the scheme in the embodiment of the present application is applied to laterally truncated text, the corresponding processing can likewise be performed, and the processed result may differ little from the recognition result of the OCR technology. For a text line with a small oblique angle (for example, less than or equal to 10°), correct text content can also be obtained by prediction and the like in the OCR technology. That is, after processing by the scheme in the embodiment of the present application, the output result differs little from the recognition result of the OCR technology. For text with a large oblique angle (for example, greater than 10°), the OCR technology may not be able to fully recognize the text region. By way of example, as shown in fig. 18a, assuming the text line has an oblique angle of 30°, the OCR technology, when performing text region detection, recognizes only the portion shown by the broken line as the text region. When the OCR technology recognizes the text content of the detected text region, text content consistent with the original text can still be output according to the prediction function. It can also be understood that, for text with a larger oblique angle, the corresponding recognition result may not have the problem of semantic errors.
It should be noted that the technical solution in the embodiment of the present application can effectively solve the problem that the recognition result of partially occluded text has semantic errors. In an embodiment of the present application, "partial occlusion" is optionally an occlusion of the upper portion of all the text in an entire line, such as the top-line text occlusion scene shown in (1) of fig. 4. In one example, "partial occlusion" is optionally an occlusion of the lower portion of an entire line of text. For example, fig. 18b is a schematic diagram of an exemplary application scenario. Referring to fig. 18b, the image to be recognized includes a text line whose lower portion is truncated, and for the OCR recognition result corresponding to this text line, the text processing module may also process it based on the scheme described in the foregoing embodiments. In another example, "partial occlusion" is optionally an occlusion of the upper portion (or lower portion, or any portion) of only part of the text in a full line. For example, fig. 18c is a schematic diagram of another exemplary application scenario. Referring to fig. 18c, a portion of the text of a text line in the image to be recognized is occluded. That is, the original text is "multi-modal coded information (intermediate characterization information)", and the "intermediate characterization information" therein is partially occluded. Optionally, the text recognition module performs OCR recognition on the text line and may acquire a plurality of text regions. For example, as shown in fig. 18d, the text recognition module may recognize the text region corresponding to "multi-modal encoded information", the text region corresponding to the occluded "(intermediate characterization information)", and the text content corresponding to the two text regions. The text processing module may then execute the processing scheme in the embodiment of the present application on the images of the two text regions and the corresponding text content. Alternatively, the text recognition module performs OCR recognition on the text line and may acquire a single text region; for example, as shown in fig. 18e, the text recognition module may divide the occluded text portion and the non-occluded text portion into the same text region. The image and text content of that text region may likewise be processed by the scheme in the embodiment of the present application. That is, the technical scheme in the embodiment of the present application can be applied to various scenes in which text is occluded, so as to meet the requirements of text recognition in different scenes. Optionally, the embodiment of the present application can effectively solve the problem of text recognition for text lines with an occlusion rate of 20%-50% (the rate may float within this range; the present application is not limited). It should be noted that, as described above, if the occlusion rate of a text line is too high (for example, 80%), the corresponding text region may not be detected in the OCR stage; whereas if the occlusion rate is low, the OCR recognition result is likely to be correct, and the text processing module can directly output the corresponding text content, or output it after correction.
Fig. 19 is a flowchart of another text recognition method according to an embodiment of the present application. Referring to fig. 19, the method includes, but is not limited to:
(1) The text processing module obtains a classification result by passing the text image through the classification model.
(2) The text processing module judges whether text content is truncated or not based on the classification result.
For example, the text processing module may first preprocess the text image; the preprocessing may be, for example, restoring the text image. Details may refer to the relevant content of the above embodiments and are not repeated here.
For example, fig. 20 is a schematic diagram illustrating text image processing, and referring to fig. 20, for example, still taking text image 602a above as an example, the text processing module inputs the text image 602a (or may be the processed text image) into the classification model. The classification model may classify the text image 602a and obtain classification results.
Optionally, the training data used by the classification model in the training phase includes, but is not limited to, text images corresponding to truncated text and text images corresponding to non-truncated text.
Optionally, the classification model may be trained under the supervision of a cross-entropy loss function.
Alternatively, the classification model may include, but is not limited to, mainstream convolutional neural network (Convolutional Neural Network, CNN) based classification networks (e.g., VGG, ResNet, EfficientNet, etc.), or a ViT (Vision Transformer) classification model based on the Transformer structure and variants thereof. The purpose of the model is mainly to output the probability of the classification problem, i.e., the score corresponding to the truncated classification item or the non-truncated classification item.
Illustratively, denoting the classification model as CLS, the output result of the classification model (which may also be referred to as the classification result) may be expressed as:
score=CLS(I) (9)
wherein I is used to indicate a text image, the text image includes parameters of three dimensions of width, height and channel number, and the specific concept can refer to the relevant content of fig. 10, which is not repeated herein.
Optionally, the output result score is a value greater than 0 and less than 1. The closer the value is to 1, the higher the probability of truncation. The text processing module may set a truncation threshold, for example, 0.5, which may be set according to actual requirements; the present application is not limited. In one example, if the output result score is greater than or equal to the truncation threshold (0.5), it is determined that the text content corresponding to the text image is truncated text. In another example, if the output result score is smaller than the truncation threshold (0.5), it is determined that the text content corresponding to the text image is non-truncated text.
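As an illustrative sketch only of equation (9) and the threshold judgment (ResNet-18 with a sigmoid output is merely one of the CNN options mentioned above, not the specific network of the embodiment):

    import torch
    import torchvision.models as models

    backbone = models.resnet18(num_classes=1)            # any mainstream CNN/ViT classifier could stand in here

    def is_truncated(text_image, threshold=0.5):
        # text_image: (1, C, H, W) tensor built from the width/height/channel dimensions of I
        score = torch.sigmoid(backbone(text_image))      # score = CLS(I), a value between 0 and 1
        return score.item() >= threshold                 # the closer to 1, the more likely the text is truncated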
(3) Output the text content.
For example, if the text processing module determines that the text content corresponding to the text image is a non-truncated text, the corresponding text content may be directly output, that is, the corresponding text content is displayed in the recognition result. The undescribed portions may refer to relevant content of the above embodiments, and are not described here again.
(4) The text processing module obtains a semantic judgment result by passing the text content through the semantic model.
For example, if the text processing module determines that the text content corresponding to the text image is a truncated text, the text processing module inputs the text content corresponding to the text image into a semantic model (may also be referred to as a semantic judgment module).
For example, fig. 21 exemplarily illustrates the processing flow of the semantic model. Referring to fig. 21, the processing flow of the semantic model includes, but is not limited to, the following steps:
a. The text processing module performs word segmentation on the text content to obtain a word segmentation result.
For example, taking the text content 602b in the above embodiment as an example, the text processing module (specifically, the semantic model) performs word segmentation on the text content 602b, and obtains a corresponding word segmentation sequence number. The specific steps of word segmentation and text sequence number acquisition can refer to the relevant content in the above embodiments, and will not be repeated here.
b. The text processing module obtains E_text through Word Embedding and Positional Encoding.
The text processing module (specifically, the semantic model) obtains the text encoding information E_text by passing the obtained text sequence numbers through Word Embedding and Positional Encoding. Specific details may refer to the description related to fig. 12 in the above embodiments, and are not repeated here.
c. The text processing module passes E_text through the encoding module to obtain F_text.
Exemplarily, the text processing module passes E_text through the encoding module, i.e., the Encoder, to obtain encoded information with high-dimensional semantic features, i.e., F_text. The encoding module includes, but is not limited to: a CNN encoder, an RNN encoder, a BiRNN (bi-directional recurrent neural network) encoder (e.g., a bi-directional LSTM (Long Short-Term Memory network)), a Transformer Encoder, etc.; the present application is not limited thereto. The processing flow of the encoder may refer to the related descriptions of fig. 14a and 14b and is not repeated here, where in this implementation E_text replaces the multi-modal encoded information in fig. 14a and 14b.
Illustratively, denoting the encoding module as Encoder, F_text can be expressed as:
F_text=Encoder(E_text) (10)
d. The text processing module passes F_text through the decoding module to obtain the output score_t (i.e., the semantic judgment result).
Illustratively, denoting the decoding module as Decoder, score_t can be expressed as:
score_t=Decoder(F_text) (11)
Optionally, the decoding module includes, but is not limited to: an MLP (i.e., fully connected layer) decoder, a CNN decoder, an RNN decoder and a Transformer decoder, which may be set according to practical requirements; the present application is not limited thereto. The specific processing flow of the decoding module may refer to the relevant content of fig. 15, 16 and 17, and is not repeated here. Optionally, since the output result score_t in this example is the result of a classification problem, it can be understood that the output result is used to indicate whether the semantics are coherent or incoherent. Accordingly, the argmax layer may not be included in the decoder. An argmax layer may also be included in other embodiments, and the present application is not limited.
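A minimal sketch of one possible semantic judgment branch is given below, using the bi-directional LSTM encoder and MLP decoder from the options listed above; vocab_size, d_emb and d_hid are assumed dimensions, explicit positional encoding is omitted as a simplification (the LSTM already encodes order), and a sigmoid output plays the role of score_t:

    import torch
    import torch.nn as nn

    class SemanticModel(nn.Module):
        def __init__(self, vocab_size=10000, d_emb=256, d_hid=256):
            super().__init__()
            self.word_embedding = nn.Embedding(vocab_size, d_emb)                        # Word Embedding
            self.encoder = nn.LSTM(d_emb, d_hid, bidirectional=True, batch_first=True)   # BiRNN encoder
            self.decoder = nn.Sequential(nn.Linear(2 * d_hid, d_hid), nn.ReLU(),
                                         nn.Linear(d_hid, 1))                            # MLP decoder

        def forward(self, token_ids):
            # token_ids: (1, seq_len) word-segmentation sequence numbers
            e_text = self.word_embedding(token_ids)                      # E_text
            f_text, _ = self.encoder(e_text)                             # F_text = Encoder(E_text)
            score_t = torch.sigmoid(self.decoder(f_text[:, -1]))         # score_t = Decoder(F_text)
            return score_t                                               # value in (0, 1): coherence score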
Illustratively, in this example, the input of the semantic model is mainly a line of text or a character string, and the output is a category (i.e., semantically coherent or semantically incoherent). In the training process, corpora are collected and each corpus is manually annotated as to whether its semantics are coherent. Optionally, the semantic model may also obtain positive and negative training samples by means of data generation and the like.
Illustratively, similar to the classification model, the score_t output by the decoding module may be used to indicate semantic coherence. For example, score_t is optionally a value greater than 0 and less than 1. The text processing module may set a semantic coherence threshold, for example, 0.5, which may be set according to practical requirements; the present application is not limited thereto.
In one example, if the value of score_t is greater than or equal to the semantic coherence threshold (i.e., 0.5), the text processing module may determine that the semantics in the corresponding text content are coherent. That is, the recognition result of the truncated text by the OCR technology is accurate, and accordingly the text processing module may directly output the text content, i.e., display the corresponding text content in the text recognition result.
In another example, if the value of score_t is less than the semantic coherence threshold (i.e., 0.5), the text processing module may determine that the semantics in the corresponding text content are incoherent. That is, there is a semantic error in the recognition result of the truncated text by the OCR technology, and the text processing module continues to perform step (5).
It should be noted that the manner in which the semantic coherence model detects text semantic coherence in the embodiment of the present application is merely illustrative. In other embodiments, the text processing module may also perform semantic coherence detection in other manners; for example, it may be based on a grammar error checking model, where the grammar error checking model may output a candidate set at each position of a grammar error based on the input text content, and a threshold judgment is made on the ratio of the candidate set size to the total number of tokens (minimum semantic units). For another example, the text processing module may obtain the probability of each token through a forward language model and make the judgment by comparing the average probability against a preset threshold. Details may refer to the related art and are not repeated here.
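For the forward-language-model alternative just mentioned, a hedged sketch of thresholding the average token probability could look like this; forward_lm is a hypothetical module returning next-token logits, and the threshold value is only a placeholder to be set per actual requirements:

    import torch
    import torch.nn.functional as F

    def is_coherent(token_ids, forward_lm, threshold=0.5):
        # token_ids: (1, seq_len); the forward model predicts token t from tokens before t
        logits = forward_lm(token_ids[:, :-1])                 # (1, seq_len-1, vocab)
        log_probs = F.log_softmax(logits, dim=-1)
        tgt = token_ids[:, 1:].unsqueeze(-1)
        token_logp = log_probs.gather(-1, tgt).squeeze(-1)     # log-probability of each actual token
        avg_prob = token_logp.mean().exp().item()              # geometric-mean token probability
        return avg_prob >= threshold                           # judged coherent if above the preset threshold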
(5) The text processing module judges whether the text content is correctable.
In the embodiment of the present application, the text processing module may further judge whether the text content can be corrected based on the result output by the semantic model. For example, the text processing module may set a correction threshold, such as 0.2, which may be set according to actual requirements; the present application is not limited.
In one example, if the value of score_t is greater than or equal to the correction threshold (i.e., 0.2), the text processing module may determine that the corresponding text content is correctable, and the text processing module may correct the text content and output the corrected result. For example, the text processing module may take the text content as an input of the correction module, and the correction module corrects the text content; the processing flow of the correction module may refer to the relevant content in fig. 15, 16 and 17, and is not repeated here.
In another example, if the value of score_t is less than the correction threshold (i.e., 0.2), the text processing module may determine that the corresponding text content is uncorrectable, and the text processing module filters the text content, i.e., does not display the corresponding text content in the text recognition result.
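Putting the branches of fig. 19 together, the decision flow might be sketched as follows, reusing the hypothetical helpers from the sketches above (is_truncated, the semantic model and a correction_module placeholder standing in for the correction module of figs. 15-17); the 0.5 and 0.2 thresholds follow the examples in the text:

    def recognize(text_image, text_content, token_ids, semantic_model, correction_module):
        if not is_truncated(text_image, threshold=0.5):      # steps (1)-(2): not truncated
            return text_content                               # step (3): output the OCR text as-is
        score_t = semantic_model(token_ids).item()            # step (4): semantic judgment
        if score_t >= 0.5:
            return text_content                               # semantics coherent: keep the OCR result
        if score_t >= 0.2:                                    # step (5): correctable
            return correction_module(text_content)            # display the corrected text
        return None                                           # uncorrectable: filter out of the result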
It should be noted that the manner of determining whether correction is possible in step (5), i.e., based on the output result of the semantic coherence detection, is merely an illustrative example. In other embodiments, the text processing module may also detect whether the text content is correctable based on other detection means. For example, as described above, the text processing module may process based on the grammar error checking model during the semantic coherence determination, and may further determine the number of grammar errors or the proportion of grammatically erroneous characters based on the output result of the grammar error checking model; the text processing module may then determine whether the text content is correctable based on that proportion. As another example, as described above, the semantic coherence determination may calculate an average probability based on the forward language model, and the text processing module may determine whether the text content can be corrected based on the average probability (e.g., by setting a corresponding correction threshold).
It should be further noted that, in addition to correcting the text content based on the correction method described in the embodiments of the present application (which may also be understood as a neural machine translation method), the text processing module may adopt other correction methods. For example, text correction may be performed on the text content by means of confusion set recall and candidate ranking based on the output result of the grammar error checking model. For another example, based on the output result of the grammar error checking model, the text processing module may obtain a confusion set for the error location by calling a statistical language model, a neural language model or a BERT bi-directional language model, and recall the corrected text through a candidate ranking and error-correction mechanism. The specific implementation may refer to the relevant content in the prior art and is not repeated here.
It should be further noted that, the models involved in the steps in fig. 19 may form a neural network, and the training manner of the neural network may refer to the related description of the training of the neural network in the foregoing embodiments, which is not described herein.
In one example, fig. 22 is a schematic block diagram of an apparatus 2200 according to an embodiment of the present application. The apparatus 2200 may comprise: a processor 2201 and a transceiver/transceiving pin 2202, and optionally a memory 2203.
The various components of device 2200 are coupled together by bus 2204, where bus 2204 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are referred to in the figures as bus 2204.
Optionally, the memory 2203 may be used to store the instructions in the foregoing method embodiments. The processor 2201 is operable to execute the instructions in the memory 2203 and to control the receive pin to receive signals and the transmit pin to transmit signals.
The apparatus 2200 may be the electronic device, or a chip of the electronic device, in the above-described method embodiments.
All relevant contents of each step related to the above method embodiments may refer to the functional description of the corresponding functional module, and are not repeated here.
The present embodiment also provides a computer storage medium having stored therein computer instructions which, when executed on an electronic device, cause the electronic device to perform the above-described related method steps to implement the method in the above-described embodiments.
The present embodiment also provides a computer program product which, when run on a computer, causes the computer to perform the above-mentioned related steps to implement the method in the above-mentioned embodiments.
In addition, embodiments of the present application also provide an apparatus, which may be embodied as a chip, component or module, which may include a processor and a memory coupled to each other; the memory is configured to store computer-executable instructions, and when the device is operated, the processor may execute the computer-executable instructions stored in the memory, so that the chip performs the methods in the above method embodiments.
The electronic device, the computer storage medium, the computer program product, or the chip provided in this embodiment are used to execute the corresponding methods provided above, so that the beneficial effects thereof can be referred to the beneficial effects in the corresponding methods provided above, and will not be described herein.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone.
The terms first and second and the like in the description and in the claims of embodiments of the application, are used for distinguishing between different objects and not necessarily for describing a particular sequential order of objects. For example, the first target object and the second target object, etc., are used to distinguish between different target objects, and are not used to describe a particular order of target objects.
In embodiments of the application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g." in an embodiment should not be taken as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" and the like is intended to present related concepts in a concrete fashion.
In the description of the embodiments of the present application, unless otherwise indicated, the meaning of "a plurality" means two or more. For example, the plurality of processing units refers to two or more processing units; the plurality of systems means two or more systems.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are to be protected by the present application.

Claims (27)

1. A method of text recognition, comprising:
the electronic equipment detects a text region of an object to be identified to obtain an image of a first text region; the first text area comprises text content;
the electronic equipment performs text content identification on the first text region to obtain first text content;
the electronic equipment classifies the first text content based on the image of the first text region to obtain a classification result;
the electronic equipment displays a text recognition result of the first text region based on the classification result;
the electronic device displaying the text recognition result of the first text region based on the classification result includes:
if the classification result is a first classification, the text recognition result filters the first text content; if the classification result is a second classification, the text recognition result comprises the text content corrected by the first text content; and if the classification result is a third classification, the text recognition result comprises the first text content.
2. The method of claim 1, wherein the electronic device classifies the first text content based on the image of the first text region to obtain a classification result, comprising:
the electronic equipment obtains intermediate characterization information based on the image of the first text region and the first text content;
and the electronic equipment classifies the intermediate characterization information to obtain the classification result.
3. The method of claim 2, wherein the electronic device categorizes the intermediate characterization information to obtain the categorization result, comprising:
and the electronic equipment classifies the intermediate characterization information through a classification model to obtain the classification result.
4. The method of claim 3, wherein the electronic device further comprises, prior to displaying the text recognition result for the first text region based on the classification result:
and the electronic equipment corrects the intermediate characterization information to obtain the text content after the first text content is corrected.
5. The method of claim 4, wherein the electronic device modifying the intermediate characterization information to obtain modified target text content comprises:
And the electronic equipment corrects the intermediate characterization information through a correction model to obtain the text content after the first text content is corrected.
6. The method of claim 5, wherein the electronic device obtains intermediate characterization information based on the image of the first text region and the first text content, comprising:
the electronic equipment performs image coding on the image of the first text region to obtain first image coding information;
the electronic equipment performs text coding on the first text content to obtain first text coding information;
and the electronic equipment carries out multi-mode coding on the first image coding information and the first text coding information through a multi-mode coding model to obtain the intermediate characterization information.
7. The method of claim 6, wherein the multimodal coding model, the classification model, and the correction model form a neural network, training data of the neural network including a second text region and second text content corresponding to the second text region, and a third text region and third text content corresponding to the third text region; the second text region comprises partially missing text content, and the text content in the third text region is complete text content.
8. The method of claim 1, wherein the text recognition result of the first text region is displayed in a text recognition region, and wherein the text recognition region further includes text content corresponding to a third text region in the object to be recognized.
9. The method of claim 1, wherein if the first text region includes partially missing text content, the text recognition result is the first classification or the second classification.
10. The method of claim 9, wherein the semantics of the first text content expression are different from the semantics of the text content expression in the first text region.
11. The method according to any one of claims 1 to 10, wherein the object to be identified is a picture, a web page or a document.
12. A method of text recognition, comprising:
the electronic equipment detects a text region of an object to be identified to obtain an image of a first text region; the first text area comprises text content;
the electronic equipment performs text content identification on the first text region to obtain first text content;
The electronic equipment displays a text recognition result of the first text region based on the image of the first text region and the first text content;
the electronic device displaying a text recognition result of the first text region based on the image of the first text region and the first text content, including:
if the image of the first text region represents that the first text region comprises partially missing text content and the first text content is semantically coherent text content, or the image of the first text region represents that the first text region does not comprise partially missing text content, the text recognition result comprises the first text content; and if the image of the first text area represents that the first text area comprises the text content with partial deletion and the first text content comprises the text content with the semantic error, filtering the first text content by the text recognition result or correcting the text content by the text recognition result.
13. The method of claim 12, wherein the electronic device displaying the text recognition result of the first text region based on the image of the first text region and the first text content comprises:
If the image of the first text region characterizes that the first text region comprises partially missing text content and the first text content comprises text content with discontinuous semantics, the electronic equipment detects whether the first text content can be corrected;
if the first text content cannot be corrected, filtering the first text content by the text recognition result;
and if the first text content can be revised, the text recognition result comprises the text content revised by the first text content.
14. The method of claim 13, wherein if the first text content can be modified, the method further comprises:
and the electronic equipment corrects the first text content through a correction model to obtain the text content after the first text content is corrected.
15. The method of claim 14, wherein the electronic device displaying the text recognition result of the first text region based on the image of the first text region and the first text content comprises:
the electronic equipment classifies the image of the first text region through a classification model to obtain a classification result; the classification result is used for indicating whether the first text area comprises the partially missing text content.
16. The method of claim 15, wherein if the image of the first text region characterizes the first text region as including partially missing text content, the electronic device displaying text recognition results for the first text region based on the image of the first text region and the first text content, comprising:
the electronic equipment performs semantic analysis on the first text content through a semantic model to obtain a semantic analysis result; the semantic analysis result is used to indicate whether the first text content includes text content with semantic errors.
17. The method of claim 16, wherein the semantic analysis result is further used to indicate whether the first text content can be modified, wherein the electronic device displaying the text recognition result for the first text region based on the image of the first text region and the first text content comprises:
the electronic device determines whether the first text content can be modified based on the semantic analysis result.
18. The method of claim 17, wherein the correction model, the classification model, and the semantic model comprise a neural network, wherein training data of the neural network comprises a second text region and second text content corresponding to the second text region, and a third text region and third text content corresponding to the third text region; the second text region comprises partially missing text content, and the text content in the third text region is complete text content.
19. The method of claim 12, wherein the text recognition result of the first text region is displayed in a text recognition region, and wherein the text recognition region further includes text content corresponding to a third text region in the object to be recognized.
20. The method of claim 9, wherein the semantics of the semantically erroneous text content expression are different from the semantics of the corresponding text content expression in the first text region.
21. The method according to any one of claims 12 to 20, wherein the object to be identified is a picture, a web page or a document.
22. An electronic device, comprising:
one or more processors;
a memory;
and one or more computer programs, wherein the one or more computer programs are stored on the memory, which when executed by the one or more processors, cause the electronic device to perform the method of any of claims 1-11.
23. An electronic device, comprising:
one or more processors;
a memory;
and one or more computer programs, wherein the one or more computer programs are stored on the memory, which when executed by the one or more processors, cause the electronic device to perform the method of any of claims 12-21.
24. A computer readable storage medium comprising a computer program which, when run on an electronic device, causes the electronic device to perform the method of any one of claims 1 to 11.
25. A computer readable storage medium comprising a computer program which, when run on an electronic device, causes the electronic device to perform the method of any one of claims 12 to 21.
26. A computer program product comprising a computer program which, when executed by an electronic device, causes the electronic device to perform the method of any one of claims 1 to 11.
27. A computer program product comprising a computer program which, when executed by an electronic device, causes the electronic device to perform the method of any of claims 12 to 21.
CN202210597895.6A 2022-05-30 2022-05-30 Text recognition method and electronic equipment Pending CN117197811A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210597895.6A CN117197811A (en) 2022-05-30 2022-05-30 Text recognition method and electronic equipment
PCT/CN2023/096921 WO2023231987A1 (en) 2022-05-30 2023-05-29 Text recognition method and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210597895.6A CN117197811A (en) 2022-05-30 2022-05-30 Text recognition method and electronic equipment

Publications (1)

Publication Number Publication Date
CN117197811A true CN117197811A (en) 2023-12-08

Family

ID=88987403

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210597895.6A Pending CN117197811A (en) 2022-05-30 2022-05-30 Text recognition method and electronic equipment

Country Status (2)

Country Link
CN (1) CN117197811A (en)
WO (1) WO2023231987A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117478435B (en) * 2023-12-28 2024-04-09 中汽智联技术有限公司 Whole vehicle information security attack path generation method and system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150269135A1 (en) * 2014-03-19 2015-09-24 Qualcomm Incorporated Language identification for text in an object image
CN110059694B (en) * 2019-04-19 2020-02-11 山东大学 Intelligent identification method for character data in complex scene of power industry
CN111090991B (en) * 2019-12-25 2023-07-04 北京百度网讯科技有限公司 Scene error correction method, device, electronic equipment and storage medium
CN113128494B (en) * 2019-12-30 2024-06-28 华为技术有限公司 Method, device and system for recognizing text in image
CN111275038A (en) * 2020-01-17 2020-06-12 平安医疗健康管理股份有限公司 Image text recognition method and device, computer equipment and computer storage medium
CN114140782A (en) * 2021-11-26 2022-03-04 北京奇艺世纪科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN114419646B (en) * 2022-01-17 2024-06-28 马上消费金融股份有限公司 Image classification method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2023231987A1 (en) 2023-12-07

Similar Documents

Publication Publication Date Title
CN113453040B (en) Short video generation method and device, related equipment and medium
CN111465918B (en) Method for displaying service information in preview interface and electronic equipment
US11893767B2 (en) Text recognition method and apparatus
CN112346695A (en) Method for controlling equipment through voice and electronic equipment
CN111738122A (en) Image processing method and related device
CN110377204B (en) Method for generating user head portrait and electronic equipment
CN116415594A (en) Question-answer pair generation method and electronic equipment
CN115134646B (en) Video editing method and electronic equipment
CN112269853B (en) Retrieval processing method, device and storage medium
CN113806473A (en) Intention recognition method and electronic equipment
CN113538227B (en) Image processing method based on semantic segmentation and related equipment
WO2023231987A1 (en) Text recognition method and electronic device
CN118103809A (en) Page display method, electronic device and computer readable storage medium
WO2022068522A1 (en) Target tracking method and electronic device
WO2024103775A1 (en) Answer generation method and apparatus, and storage medium
CN114328679A (en) Image processing method, image processing apparatus, computer device, and storage medium
CN116994169A (en) Label prediction method, label prediction device, computer equipment and storage medium
CN114943976A (en) Model generation method and device, electronic equipment and storage medium
CN114186535A (en) Structure diagram reduction method, device, electronic equipment, medium and program product
CN114970576A (en) Identification code identification method, related electronic equipment and computer readable storage medium
CN117710697B (en) Object detection method, electronic device, storage medium, and program product
CN117076702B (en) Image searching method and electronic equipment
CN115879436B (en) Electronic book quality inspection method
CN116050390A (en) Text processing method and electronic equipment
WO2023078221A1 (en) Language translation method and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination