CN110059559A - Processing method for OCR identification files and electronic device therefor - Google Patents

Processing method for OCR identification files and electronic device therefor

Info

Publication number
CN110059559A
Authority
CN
China
Prior art keywords
picture
file
ocr
ocr identification
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910198318.8A
Other languages
Chinese (zh)
Inventor
刘丽珍
吕小立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
Original Assignee
OneConnect Smart Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Smart Technology Co Ltd
Priority to CN201910198318.8A
Publication of CN110059559A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/416 Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Character Discrimination (AREA)

Abstract

The present invention relates to OCR recognition, a subfield of image processing and in particular of image detection, and discloses a processing method for OCR identification files. The method comprises: obtaining pictures of a file to be identified and caching the pictures; performing a validity check on the pictures according to an effectiveness condition; when a picture does not meet the effectiveness condition, reacquiring the corresponding picture; and generating an OCR identification file from the cached pictures and the reacquired picture. The invention correspondingly provides an electronic device and a computer storage medium. The technical solution can determine whether the file to be identified contains a recognition defect that fails the effectiveness condition for OCR recognition. Because all pictures are cached locally or on a server, the pictures do not all have to be discarded when a recognition defect occurs. This improves the efficiency of generating the file subsequently used for OCR recognition and avoids wasting system resources.

Description

Processing method for OCR identification files and electronic device therefor
Technical field
The present invention relates to the field of image recognition, and more particularly to a processing method for OCR identification files and an electronic device therefor.
Background art
OCR (Optical Character Recognition) mainly recognizes the optical characters displayed on a carrier and generates text output. Taking OCR of a paper document as an example, the optical characters obtained by capturing the printed text on the paper document are recognized to obtain data such as text information.
A file to be identified by OCR may contain recognition defects, for example missing pages, blurred images, or program errors. When this happens, prior-art solutions can only discard the pictures already processed and re-capture pictures of the document to be identified as the OCR identification file. For example, when a multi-page contract is scanned or photographed, situations that affect the OCR identification file are quite likely to occur; in that case all pictures of the resulting OCR identification file have to be discarded and the pictures captured again.
Prior-art solutions therefore generate the file to be identified inefficiently: they are time-consuming, involve frequent repeated work, and cannot meet current OCR recognition requirements.
Summary of the invention
In view of the above problems, the invention proposes a processing method for OCR identification files that avoids the above technical defects and improves the efficiency of generating the file to be identified by OCR.
In a first aspect, an embodiment of the present invention provides a processing method for OCR identification files, comprising:
obtaining a picture of a file to be identified, and caching the picture;
performing a validity check on the picture according to an effectiveness condition;
when the picture does not meet the effectiveness condition, reacquiring the corresponding picture;
generating an OCR identification file from the cached picture and the reacquired picture.
With reference to the first aspect, the step of obtaining a picture of a file to be identified comprises:
successively obtaining multiple pictures of the file to be identified;
the step of performing a validity check on the picture according to an effectiveness condition comprises:
before generating the OCR identification file, performing an integrity check on the multiple cached pictures according to a completeness condition;
the step of reacquiring the corresponding picture when the picture does not meet the effectiveness condition comprises:
when the pictures do not meet the completeness condition, obtaining the picture of the corresponding missing part of the file to be identified.
With reference to the first aspect, the step of performing an integrity check on the multiple cached pictures according to a completeness condition comprises:
performing OCR recognition on the page numbers of the cached pictures and determining whether the page numbers are consecutive; when the page numbers are not consecutive, determining that a page is missing from the pictures.
With reference to the first aspect, the step of performing an integrity check on the multiple cached pictures according to a completeness condition comprises:
performing OCR recognition on the text content of the cached pictures, and obtaining keywords from the text content;
verifying the pictures according to the keywords; if the keyword recognized in a picture is inconsistent with the keywords of the other pictures, determining that a page is missing from the pictures.
With reference to the first aspect, the step of performing an integrity check on the multiple cached pictures according to a completeness condition comprises:
performing OCR recognition on the last line of text of the previous picture and on the first line of text of the next picture, respectively, to obtain first text content and second text content;
performing natural-language semantic analysis on the first text content and the second text content; if the first text content and the second text content are not continuous, determining that a page is missing from the pictures.
With reference to the first aspect, the step of reacquiring the corresponding picture comprises:
reacquiring the picture of the missing page according to the position of the missing page;
the step of generating an OCR identification file from the cached picture and the reacquired picture comprises:
inserting the picture of the missing page at the corresponding missing position among the cached pictures according to the page order of the file to be identified, and converting all pictures into the OCR identification file.
With reference to the first aspect, the step of performing a validity check on the picture according to an effectiveness condition comprises:
before generating the OCR identification file, performing an identifiability check on the picture according to an identifiability condition;
the step of reacquiring the corresponding picture when the picture does not meet the effectiveness condition comprises:
when the picture is not identifiable, reacquiring the corresponding picture;
the step of generating an OCR identification file from the cached picture and the reacquired picture comprises:
replacing the picture that is not identifiable with the reacquired picture, and generating the OCR identification file from the cached pictures and the replacement picture.
With reference to the first aspect, the step of performing a validity check on the picture according to an effectiveness condition comprises:
before generating the OCR identification file, determining whether the picture is blurred, contains an unrecognizable region, or is deformed;
the step of reacquiring the corresponding picture when the picture does not meet the effectiveness condition comprises:
reacquiring the picture at the corresponding position according to the position of the picture that is blurred, contains an unrecognizable region, or is deformed.
In a second aspect, the present invention provides an electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the processing method for OCR identification files according to any one of the above.
In a third aspect, the present invention provides a non-transitory computer-readable storage medium; when instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to execute the processing method for OCR identification files according to any one of the above.
Compared with the prior art, the scheme provided by the invention obtains pictures of the file to be identified and caches them; performs a validity check on the pictures according to an effectiveness condition; reacquires the corresponding picture when a picture does not meet the effectiveness condition; and generates the OCR identification file from the cached pictures and the reacquired picture. The invention correspondingly provides an electronic device and a computer storage medium. The technical solution can determine whether the file to be identified contains a recognition defect that fails the effectiveness condition for OCR recognition. Because all pictures are cached locally or on a server, the pictures do not all have to be discarded when a recognition defect occurs. This improves the efficiency of generating the file subsequently used for OCR recognition and avoids wasting system resources.
These and other aspects of the invention will be more readily apparent from the following description.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the processing method for OCR identification files of the invention;
Fig. 2 is a schematic diagram of an application scenario of the processing method for OCR identification files of the invention;
Fig. 3 is a schematic diagram of the picture order of a file to be identified and of a skipped page;
Fig. 4 is a flowchart of the integrity check performed on multiple pictures;
Fig. 5 is a schematic diagram of the structure of one picture in a file to be identified;
Fig. 6 is a flowchart of determining, by keywords, whether a page is missing from the file to be identified;
Fig. 7 is a flowchart of determining, by first-line and last-line text, whether a page is missing;
Fig. 8 is a schematic diagram of the structure of another picture in the file to be identified;
Fig. 9 is a flowchart of reacquiring a picture and inserting it at the missing position;
Fig. 10 is a flowchart of performing an identifiability check on a picture and replacing it;
Fig. 11 is a flowchart of the specific identifiability determination and replacement;
Fig. 12 is a block diagram of part of the structure of a mobile phone related to the terminal provided by an embodiment of the present invention.
Detailed description of the embodiments
To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings in the embodiments.
Some of the processes described in the specification, the claims, and the above drawings contain multiple operations that appear in a particular order. It should be clearly understood, however, that these operations may be executed out of the order in which they appear herein, or in parallel. Operation numbers such as 101 and 102 serve only to distinguish the operations from one another and do not themselves represent any execution order. In addition, these processes may include more or fewer operations, and the operations may be executed sequentially or in parallel. Note that terms such as "first" and "second" herein are used to distinguish different messages, devices, modules, and so on; they do not represent an order, nor do they require that "first" and "second" be of different types.
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Evidently, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without creative effort fall within the protection scope of the invention.
This application relates to OCR recognition, where OCR stands for optical character recognition. Referring to Fig. 1, in order to improve the efficiency of obtaining an OCR identification file and to prevent defects found at the recognition stage from hindering efficient generation of the identification file, the application provides a processing method for OCR identification files, comprising:
Step S11: obtain a picture of the file to be identified, and cache the picture.
Refer to Fig. 2, which shows an application scenario of the application. User A can capture pictures of the corresponding original document through a terminal such as a mobile phone 210, or through a client, and then generate the identification file used for OCR recognition. The user obtains the pictures for generating the file to be identified by taking photos, recording video, or similar means. For caching, the user terminal can save the pictures locally. Alternatively, the user can upload the pictures through the terminal, or a client on the terminal, to a server 220 for caching.
Step S12: perform a validity check on the picture according to an effectiveness condition.
The picture is checked for validity according to the effectiveness condition, and pictures that do not meet the effectiveness condition are flagged. The determination of whether the effectiveness condition is met can be implemented on a terminal such as a mobile phone, or on the server. When implemented on the terminal, the validity check can begin as soon as the corresponding picture is obtained, which saves checking time. When implemented on the server, pictures are not lost if the client or the phone fails, for example when the client crashes or the terminal shuts down unexpectedly.
Step S13: when the picture does not meet the effectiveness condition, reacquire the corresponding picture.
When a picture is determined not to meet the effectiveness condition, the terminal or the server can prompt the user to re-shoot the specified picture.
Step S14: generate the OCR identification file from the cached picture and the reacquired picture.
After a picture has been reacquired, once the original pictures and the reacquired picture all meet the effectiveness condition and pass the validity check, the original pictures and the reacquired picture are used together to generate the OCR identification file. If all of the original pictures meet the effectiveness condition and pass the validity check, they can be used directly to generate the OCR identification file.
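The flow of steps S11 to S14 can be sketched roughly as follows. This is only an illustrative sketch, not the patent's implementation: the function name `build_ocr_input`, the `capture` and `is_valid` callables, the use of strings to stand in for pictures, and the `max_retries` limit are all assumptions introduced here.

```python
from typing import Callable, Dict, List


def build_ocr_input(
    page_count: int,
    capture: Callable[[int], str],
    is_valid: Callable[[str], bool],
    max_retries: int = 3,  # hypothetical safety limit, not from the source
) -> List[str]:
    """Capture pages, cache them, and re-capture only pages that fail the check."""
    cache: Dict[int, str] = {}

    # S11: acquire every page once and cache it.
    for index in range(page_count):
        cache[index] = capture(index)

    # S12 / S13: check each cached picture; reacquire only the failing ones,
    # leaving every other cached picture untouched.
    for index in range(page_count):
        retries = 0
        while not is_valid(cache[index]):
            if retries >= max_retries:
                raise RuntimeError(f"page {index} is still invalid after {max_retries} retries")
            cache[index] = capture(index)
            retries += 1

    # S14: cached plus reacquired pictures, in page order, form the OCR input.
    return [cache[index] for index in range(page_count)]
```

The point of the sketch is that a failed check triggers a new capture for that one page only; the rest of the cache is reused, which is the efficiency gain the description claims over discarding everything.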
To better illustrate the technical solution of the application, refer to Fig. 3, which shows the process, during OCR recognition in this embodiment, of obtaining several pictures of the document to be identified and performing the validity check. Fig. 3 shows, by way of example, pictures 1 to 5 obtained by photographing the document to be identified; the direction of the arrows indicates the reading order or recognition order. The validity check finds that the picture of "page 2" (drawn with a dashed line) has an OCR recognition defect. Such a defect may be a missing page, caused for example by a page being skipped during shooting or by too long an interval between frames captured from a video. It may also be blurring caused by camera shake during shooting, a missing image portion or mosaic region introduced during subsequent image processing, or a light spot covering the text that is caused by lighting or by the paper of the document.
Referring to Fig. 4, in order to obtain a continuous file to be identified more efficiently, the processing method for OCR identification files in this embodiment comprises the following steps:
Step S41: successively obtain multiple pictures of the file to be identified.
The file to be identified, or document to be identified, is the object of the current OCR recognition. When the file to be identified has multiple pages, the corresponding pictures are obtained in reading order, or from a video in which the pages are turned in reading order. Of course, when page numbers or other information marking the reading order are present on the pictures, the pictures of the file to be identified can be obtained out of order.
Step S42: before generating the OCR identification file, perform an integrity check on the multiple cached pictures according to a completeness condition.
The integrity check examines the multiple cached pictures against the completeness condition. Its purpose is to verify the continuity of the pictures that will be used to generate the OCR identification file, avoiding lost pages, mismatched page numbers, pages out of order, and similar problems.
Step S43: when the pictures do not meet the completeness condition, obtain the picture of the corresponding missing part of the file to be identified.
After the integrity check has determined that the pictures do not meet the completeness condition, a prompt is issued on the terminal, or the corresponding terminal is instructed to obtain the missing picture, so as to guarantee the integrity of the file to be identified.
This embodiment also provides a scheme for determining the integrity of the file to be identified: OCR recognition is performed on the page numbers of the cached pictures and the continuity of the page numbers is determined; when the page numbers are not consecutive, a page is determined to be missing. Referring to Fig. 3 together with Fig. 5, by obtaining the number in the page footer of a picture, for example "page number A1", it can be determined whether the picture is the current page and whether the pictures are consecutive. Further, by recognizing page numbers, for example the page number of the last page, it can also be determined whether there are enough pictures and whether the pictures of the file to be identified are complete. In addition, a configured page number can control the number of pages of the subsequently generated OCR identification file. For example, if the last page expected for OCR recognition is page 53, acquisition of the file to be identified stops when the current page number is detected to be 53, or the number of pictures in the file to be identified is reduced, which reduces the size of the file to be identified. Besides this, for files to be identified that carry signature information, such as contracts and letters, the signature information can also be used as an indication of the last page.
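A minimal sketch of the page-number continuity check is shown below. It assumes the page numbers have already been read from the page footers by some OCR step and are available as integers, and that numbering starts at 1; the function name `find_missing_pages` and the optional `last_page` parameter are illustrative choices, not names from the patent.

```python
from typing import List, Optional


def find_missing_pages(page_numbers: List[int], last_page: Optional[int] = None) -> List[int]:
    """Return the page numbers that are absent from the cached pictures.

    page_numbers: numbers recognized from the cached pictures, in any order.
    last_page: expected final page number, if known (e.g. 53); otherwise the
               highest recognized number is taken as the end of the document.
    """
    seen = set(page_numbers)
    if last_page is None:
        if not seen:
            return []
        last_page = max(seen)
    # Assumption: pages are numbered consecutively starting from 1.
    return [n for n in range(1, last_page + 1) if n not in seen]
```

For the situation drawn in Fig. 3, `find_missing_pages([1, 3, 4, 5])` reports page 2 as missing. Passing `last_page` also catches pages missing from the end of the document, which a pure gap check between recognized numbers would not detect.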
Referring to Fig. 6, this embodiment also provides another scheme for determining the integrity of the file to be identified, comprising the following steps:
Step S61: perform OCR recognition on the text content of the cached pictures, and obtain keywords from the text content.
In some files to be identified, keywords can be used to determine whether a page is the expected page, which avoids missing pages. Referring to Fig. 2 together with Fig. 5, a keyword is obtained by OCR from part of a picture of the file to be identified, for example the image region corresponding to the last line of text.
Step S62: verify the pictures according to the keywords; if the keyword recognized in a picture is inconsistent with the keywords of the other pictures, determine that a page is missing from the pictures.
The user can input the keyword for page number A1. This keyword appears in the picture of every page, or appears with very high probability. From the keyword input by the user and the keyword recognized by local OCR, it can be determined whether the picture is the picture of the specified page of the file to be identified. When the keyword is not recognized or does not match, a page is missing from the pictures of the file to be identified. The user can also set keywords page by page to improve the accuracy of the determination.
In addition, when the file to be identified is a contract, for example, whether the pictures belong to the same document can be determined by obtaining the contract bar code B. In other examples, a document reference number or a special character can be used in place of the contract bar code.
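The keyword check can be sketched as a simple membership test over the OCR text of each page. The sketch assumes the text of every cached picture is already available as a string keyed by page number; `pages_missing_keyword` and the sample contract number are hypothetical names used only for illustration.

```python
from typing import Dict, List


def pages_missing_keyword(page_texts: Dict[int, str], keyword: str) -> List[int]:
    """Return the page numbers whose recognized text does not contain the keyword.

    A page listed here either was not recognized correctly or does not belong
    to the document, so it is a candidate for reacquisition.
    """
    return sorted(number for number, text in page_texts.items() if keyword not in text)


# Example: a contract number expected on every page (hypothetical data).
texts = {
    1: "Contract No. HT-001  Party A ...",
    2: "Delivery note ...",
    3: "... Contract No. HT-001",
}
suspect_pages = pages_missing_keyword(texts, "HT-001")
```

An exact substring match is the simplest possible comparison; OCR output is noisy in practice, so a real system might need fuzzy matching, which this sketch does not attempt.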
Referring to Fig. 7, this embodiment also provides another scheme for determining the integrity of the file to be identified, comprising the following steps:
Step S71: perform OCR recognition on the last line of text of the previous picture and on the first line of text of the next picture, respectively, to obtain first text content and second text content.
Referring to Fig. 5 together with Fig. 8, take Fig. 5 as the previous picture and Fig. 8 as the next picture; Fig. 5 and Fig. 8 are consecutive pictures. OCR recognition is performed on the last-line text region E of Fig. 5 to obtain the content of the last line as the first text content, and on the first-line text region F of Fig. 8 to obtain the content of the first line as the second text content.
Step S72: perform natural-language semantic analysis on the first text content and the second text content; if the first text content and the second text content are not continuous, determine that a page is missing from the pictures.
Natural-language semantic analysis determines whether the first text content and the second text content are semantically continuous. For example, if the first text content is "Study hard," and the second text content is "and make progress every day.", the two are continuous. If instead the first text content is "significance, statistical test" and the second text content is "Biology is the most promising discipline of the 21st century.", the two are not continuous. It can then be determined that Fig. 8 is not the next page after Fig. 5 in reading order, so at least one page has been skipped between Fig. 5 and Fig. 8.
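The pairwise comparison in steps S71 and S72 can be sketched as a loop over adjacent pages. The semantic analysis itself is not specified in the description, so the sketch takes it as a caller-supplied function `is_continuous`; that parameter, the `Page` tuple layout, and the name `find_gaps` are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# One cached picture, reduced to the two lines the check needs:
# (text of the first line, text of the last line).
Page = Tuple[str, str]


def find_gaps(pages: List[Page], is_continuous: Callable[[str, str], bool]) -> List[int]:
    """Return each index i where a page seems to be missing between pages[i] and pages[i + 1]."""
    gaps = []
    for i in range(len(pages) - 1):
        last_line_of_previous = pages[i][1]   # "first text content"
        first_line_of_next = pages[i + 1][0]  # "second text content"
        # is_continuous stands in for the natural-language semantic analysis,
        # which the source does not specify.
        if not is_continuous(last_line_of_previous, first_line_of_next):
            gaps.append(i)
    return gaps
```

Keeping the analysis behind a callable means the loop does not depend on any particular language model or scoring method; whatever judges continuity only has to answer yes or no for a pair of lines.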
To address how to obtain the picture of a missing page and generate a complete OCR identification file, this embodiment also provides a technical solution in which the processing method for OCR identification files comprises the steps:
Step S91: reacquire the picture of the missing page according to the position of the missing page.
Referring also to Fig. 3, the dashed box labeled 2 indicates page 2, the picture missing before the solid box labeled 3 for page 3. According to this position, the picture of the missing page 2 is reacquired. The position can be indicated to the user by displaying the image of page 1 or page 3, prompting the user to reacquire the picture of page 2. The user can also be prompted by text, a pop-up, a floating window, or similar forms. When the scheme is used on automated mechanical equipment, the picture of page 2 can be obtained directly by turning to the specified page.
Step S92: insert the picture of the missing page at the corresponding missing position among the cached pictures according to the page order of the file to be identified, and convert all pictures into the OCR identification file.
Following the order of the file to be identified indicated in Fig. 3, the picture of the missing page is inserted at the indicated missing position. Once all pictures are present, they are converted into the OCR identification file. Further, to confirm that all pictures meet the completeness condition, an integrity check can be performed before the OCR identification file is generated.
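Inserting a reacquired page at its missing position can be sketched by keying pictures on their page number and sorting at the end. The function name `merge_pages` and the use of strings as stand-ins for pictures are illustrative assumptions; the conversion of the merged pictures into the OCR identification file is outside the sketch.

```python
from typing import Dict, List


def merge_pages(cached: Dict[int, str], reacquired: Dict[int, str]) -> List[str]:
    """Combine cached and reacquired pictures into one list in page order.

    Both arguments map page number -> picture. A reacquired picture fills the
    gap at its page number; cached pictures are kept as they are.
    """
    merged = dict(cached)
    for number, picture in reacquired.items():
        # setdefault only fills positions that are actually missing.
        merged.setdefault(number, picture)
    return [merged[number] for number in sorted(merged)]
```

With `cached = {1: "p1", 3: "p3"}` and `reacquired = {2: "p2"}` the result is `["p1", "p2", "p3"]`, matching the Fig. 3 case where page 2 is slotted between the pages already captured. To replace a defective cached picture instead of filling a gap, the assignment would overwrite rather than use `setdefault`.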
To confirm the identifiability of the pictures, which is one aspect of their validity for OCR recognition, the application provides a corresponding technical solution; refer to Fig. 10:
Step S101: before generating the OCR identification file, perform an identifiability check on the picture according to an identifiability condition.
Identifiability refers to the degree to which effective information can be obtained from a picture by OCR recognition; this step determines whether the picture is identifiable. The check can be applied to a single picture or to multiple pictures.
Step S102: when the picture is not identifiable, reacquire the corresponding picture.
When a picture is found not to be identifiable, the user is prompted, or the corresponding terminal is instructed, to obtain a replacement for the picture that lacks identifiability.
Step S103: replace the picture that is not identifiable with the reacquired picture, and generate the OCR identification file from the cached pictures and the replacement picture.
The picture detected as not identifiable is replaced or updated with the reacquired picture, and the OCR identification file is generated from all the pictures together with the reacquired picture.
In some applicable scenarios, referring to Fig. 11, the application provides a technical solution for determining identifiability specifically, in which the processing method for OCR identification files comprises:
Step S111: before generating the OCR identification file, determine whether the picture is blurred, contains an unrecognizable region, or is deformed.
Referring to Fig. 11 together with Fig. 5, take the unrecognizable region, that is, the missing image portion C, as an example: when a blank area larger than a preset threshold is detected in the picture, it can be determined that an unrecognizable region exists, and the picture does not meet the identifiability condition. Taking the shadow D as an example: when a large number of black blocks, or concentrated black blocks, are detected in the picture, it can be determined that the shadow D is present, and the picture does not meet the identifiability condition. Besides missing image portions (unrecognizable regions) and shadows, identifiability defects may also include blurring or deformation.
Step S112: reacquire the picture at the corresponding position according to the position of the picture that is blurred, contains an unrecognizable region, or is deformed.
When a picture is detected to have any of the above identifiability defects, the picture at the corresponding position can be reacquired and used to replace or update the original picture.
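The blank-area and shadow checks described above can be approximated very roughly on a grayscale pixel grid, as sketched below. Everything specific here is an assumption: the picture is taken to be a list of rows of 0 to 255 gray values, a run of consecutive all-white rows stands in for the "blank area", the fraction of dark pixels stands in for the "black blocks", and all threshold values are placeholders. Blur and deformation detection are not covered.

```python
from typing import List

Image = List[List[int]]  # grayscale rows; 0 = black, 255 = white


def longest_blank_run(image: Image, white: int = 250) -> int:
    """Length of the longest run of consecutive rows made up only of near-white pixels."""
    longest = current = 0
    for row in image:
        if all(pixel >= white for pixel in row):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def dark_fraction(image: Image, black: int = 40) -> float:
    """Fraction of pixels that are near-black, used as a crude shadow indicator."""
    pixels = [pixel for row in image for pixel in row]
    if not pixels:
        return 0.0
    return sum(pixel <= black for pixel in pixels) / len(pixels)


def is_identifiable(image: Image, max_blank_rows: int = 200, max_dark_fraction: float = 0.3) -> bool:
    """True if neither the blank-area proxy nor the shadow proxy exceeds its threshold."""
    return (
        longest_blank_run(image) <= max_blank_rows
        and dark_fraction(image) <= max_dark_fraction
    )
```

A picture for which `is_identifiable` returns False would be reacquired and swapped in for the original, as in step S112. Suitable thresholds depend on resolution and page layout, since normal margins and line spacing also produce blank rows.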
The application also provides an electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the processing method for OCR identification files according to any one of the above.
To better explain the present invention, an embodiment of the application also provides a terminal device, as shown in Fig. 12. For ease of description, only the parts related to the embodiment of the invention are shown; for specific technical details not disclosed here, refer to the method part of the embodiments. The terminal can be any terminal device, including a mobile phone, a tablet computer, a PDA (Personal Digital Assistant), a POS (Point of Sales) terminal, or a vehicle-mounted computer. Taking a mobile phone as the terminal as an example:
Fig. 12 is a block diagram of part of the structure of a mobile phone related to the terminal provided by an embodiment of the present invention. Referring to Fig. 12, the mobile phone includes components such as a radio frequency (RF) circuit 1210, a memory 1220, an input unit 1230, a display unit 1240, a sensor 1250, an audio circuit 1260, a wireless fidelity (WiFi) module 1270, a processor 1280, and a power supply 1290. Those skilled in the art will understand that the handset structure shown in Fig. 12 does not limit the mobile phone, which may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components.
The components of the mobile phone are described in detail below with reference to Fig. 12:
Although not shown, the mobile phone may also include a camera, a Bluetooth module, and the like, which are not described here. In the embodiment of the present invention, the processor 1280 included in the terminal also has the following functions:
Obtaining pictures of the file to be identified, and caching the pictures;
Performing a validity check on the pictures according to an effectiveness condition;
When a picture does not meet the effectiveness condition, reacquiring the corresponding picture;
Generating an OCR identification file according to the cached pictures and the reacquired picture.
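The four functions listed above can be sketched as one flow. This is only an illustrative outline under stated assumptions: `capture`, `is_valid`, and `recognize` are hypothetical callbacks standing in for the camera, the validity check, and the OCR engine, and the `max_retries` limit is an addition of this sketch that does not appear in the application.

```python
from typing import Callable, List


def process_document(
    page_count: int,
    capture: Callable[[int], str],    # hypothetical: acquires the picture at a position
    is_valid: Callable[[str], bool],  # hypothetical: validity check against the condition
    recognize: Callable[[str], str],  # hypothetical: OCR of a single picture
    max_retries: int = 3,             # assumption of this sketch, not from the source
) -> List[str]:
    """Cache all pictures, reacquire only the invalid ones, then run OCR on the cache."""
    # Obtain the pictures of the file to be identified and cache them.
    cache = [capture(position) for position in range(page_count)]

    # Validity check: only a failing picture is reacquired; the rest of the cache is kept.
    for position in range(page_count):
        retries = 0
        while not is_valid(cache[position]) and retries < max_retries:
            cache[position] = capture(position)
            retries += 1
        if not is_valid(cache[position]):
            raise ValueError(f"picture at position {position} is still invalid")

    # Generate the OCR result from the cached and reacquired pictures.
    return [recognize(picture) for picture in cache]
```

The point of the design is that a single defective picture triggers one new acquisition at its own position, while the pictures already cached are reused.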
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods can be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
Those of ordinary skill in the art will understand that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium may include: read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and the like.
Those of ordinary skill in the art will understand that the methods of the above embodiments can be implemented by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like.
Compared with the prior art, the solution provided by the present invention obtains pictures of the file to be identified and caches them; performs a validity check on the pictures according to an effectiveness condition; reacquires the corresponding picture when a picture does not meet the effectiveness condition; and generates an OCR identification file according to the cached pictures and the reacquired picture. The present invention correspondingly provides an electronic device and a computer storage medium. The technical solution provided by the present invention can determine whether the file to be identified has an identification defect that does not meet the effectiveness condition for OCR identification. Because all pictures are cached locally or on a server, it is not necessary to discard all pictures when an identification defect occurs, which improves the efficiency of generating the file for subsequent OCR identification and avoids wasting system resources. It is worth noting that the subsequent OCR identification can be executed on the terminal or on the server, and the cache location is limited.
The terminal device provided by the present invention has been described in detail above. For those of ordinary skill in the art, based on the ideas of the embodiments of the present invention, there will be changes in the specific implementation and in the scope of application. In summary, the contents of this specification should not be construed as limiting the invention.

Claims (10)

1. A processing method for an OCR identification file, characterized by comprising:
Obtaining pictures of a file to be identified, and caching the pictures;
Performing a validity check on the pictures according to an effectiveness condition;
When a picture does not meet the effectiveness condition, reacquiring the corresponding picture;
Generating an OCR identification file according to the cached pictures and the reacquired picture.
2. The processing method for an OCR identification file according to claim 1, characterized in that the step of obtaining pictures of a file to be identified comprises:
Successively obtaining a plurality of pictures of the file to be identified;
The step of performing a validity check on the pictures according to an effectiveness condition comprises:
Before generating the OCR identification file, performing an integrity check on the plurality of cached pictures according to a completeness condition;
The step of reacquiring the corresponding picture when a picture does not meet the effectiveness condition comprises:
When the pictures do not meet the completeness condition, obtaining a picture of the corresponding missing part of the file to be identified.
3. The processing method for an OCR identification file according to claim 2, characterized in that the step of performing an integrity check on the plurality of cached pictures according to a completeness condition comprises:
Performing OCR identification on the page numbers of the cached pictures and determining the continuity of the page numbers; when the page numbers are discontinuous, determining that a page is missing from the pictures.
4. The processing method for an OCR identification file according to claim 2, characterized in that the step of performing an integrity check on the plurality of cached pictures according to a completeness condition comprises:
Performing OCR identification on the text content of the cached pictures, and obtaining keywords in the text content;
Verifying the pictures according to the keywords; if the keywords identified in one picture are inconsistent with the keywords of the other pictures, determining that a page is missing from the pictures.
5. The processing method for an OCR identification file according to claim 2, characterized in that the step of performing an integrity check on the plurality of cached pictures according to a completeness condition comprises:
Performing OCR identification on the last line of text of a previous picture and on the first line of text of the next picture, respectively, to obtain a first text content and a second text content;
Performing natural semantic analysis on the first text content and the second text content; if the first text content and the second text content do not meet continuity, determining that a page is missing from the pictures.
6. The processing method for an OCR identification file according to any one of claims 3 to 5, characterized in that the step of reacquiring the corresponding picture comprises:
Reacquiring the picture of the missing page according to the position of the missing page;
The step of generating an OCR identification file according to the cached pictures and the reacquired picture comprises:
Inserting the picture of the missing page into the corresponding missing position among the cached pictures according to the order of the file to be identified, and converting all pictures into the OCR identification file.
7. The processing method for an OCR identification file according to claim 1, characterized in that the step of performing a validity check on the pictures according to an effectiveness condition comprises:
Before generating the OCR identification file, performing an identifiability check on the pictures according to an identifiability condition;
The step of reacquiring the corresponding picture when a picture does not meet the effectiveness condition comprises:
When a picture does not have identifiability, reacquiring the corresponding picture;
The step of generating an OCR identification file according to the cached pictures and the reacquired picture comprises:
Replacing the picture that does not have identifiability with the reacquired picture, and generating the OCR identification file according to the cached pictures and the replacement picture.
8. The processing method for an OCR identification file according to claim 1, characterized in that the step of performing a validity check on the pictures according to an effectiveness condition comprises:
Before generating the OCR identification file, determining whether a picture is blurred, contains a non-recognizable region, or is deformed;
The step of reacquiring the corresponding picture when a picture does not meet the effectiveness condition comprises:
Reacquiring the picture at the corresponding position according to the position of the picture that is blurred, contains a non-recognizable region, or is deformed.
9. An electronic device, characterized by comprising:
A processor;
A memory for storing processor-executable instructions;
Wherein the processor is configured to execute the processing method for an OCR identification file according to any one of claims 1 to 8.
10. A non-transitory computer-readable storage medium, characterized in that when the instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to execute the processing method for an OCR identification file according to any one of claims 1 to 8.
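As an illustration of the page-number continuity check of claim 3 and the re-insertion of missing pages of claim 6, a minimal sketch is given here. It assumes the page numbers have already been recognized by OCR and are available as integers; the function names and the `(page_number, picture)` pair representation are hypothetical and are not taken from the application.

```python
from typing import List, Tuple


def missing_page_numbers(page_numbers: List[int]) -> List[int]:
    """Return the page numbers absent between the lowest and highest recognized page."""
    if not page_numbers:
        return []
    present = set(page_numbers)
    return [n for n in range(min(present), max(present) + 1) if n not in present]


def insert_missing_pages(
    cached: List[Tuple[int, str]],      # (page number, picture) pairs already cached
    reacquired: List[Tuple[int, str]],  # (page number, picture) pairs captured again
) -> List[str]:
    """Merge reacquired pictures into the cache in the order of the original file."""
    merged = sorted(cached + reacquired, key=lambda item: item[0])
    return [picture for _, picture in merged]
```

For example, `missing_page_numbers([1, 2, 4, 5, 7])` reports pages 3 and 6 as missing. Note that a check of this kind cannot detect a missing first or last page, since it only looks for gaps between the lowest and highest recognized page numbers.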
CN201910198318.8A 2019-03-15 2019-03-15 The processing method and its electronic equipment of OCR identification file Pending CN110059559A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910198318.8A CN110059559A (en) 2019-03-15 2019-03-15 The processing method and its electronic equipment of OCR identification file

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910198318.8A CN110059559A (en) 2019-03-15 2019-03-15 The processing method and its electronic equipment of OCR identification file

Publications (1)

Publication Number Publication Date
CN110059559A true CN110059559A (en) 2019-07-26

Family

ID=67317002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910198318.8A Pending CN110059559A (en) 2019-03-15 2019-03-15 The processing method and its electronic equipment of OCR identification file

Country Status (1)

Country Link
CN (1) CN110059559A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112200185A (en) * 2020-10-10 2021-01-08 航天科工智慧产业发展有限公司 Method and device for reversely positioning picture by characters and computer storage medium
CN113780121A (en) * 2021-08-30 2021-12-10 国网上海市电力公司 Power system operation instruction ticket automatic identification application method based on artificial intelligence
CN114140778A (en) * 2021-01-14 2022-03-04 北京灵伴即时智能科技有限公司 Page turning abnormality detection method

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6154579A (en) * 1997-08-11 2000-11-28 At&T Corp. Confusion matrix based method and system for correcting misrecognized words appearing in documents generated by an optical character recognition technique
US20020063896A1 (en) * 2000-11-29 2002-05-30 Xerox Corporation In an electronic reprographic system, provide automatic document integrity determination and page organization
CN101533474A (en) * 2008-03-12 2009-09-16 三星电子株式会社 Character and image recognition system based on video image and method thereof
EP2383970A1 (en) * 2010-04-30 2011-11-02 beyo GmbH Camera based method for text input and keyword detection
US8315465B1 (en) * 2009-01-12 2012-11-20 Google Inc. Effective feature classification in images
CN103455786A (en) * 2012-05-28 2013-12-18 北京山海经纬信息技术有限公司 Image recognition method and system
CN104732226A (en) * 2015-03-31 2015-06-24 浪潮集团有限公司 Character recognition method and device
CN106250830A (en) * 2016-07-22 2016-12-21 浙江大学 Digital book structured analysis processing method
CN107437085A (en) * 2017-08-09 2017-12-05 厦门商集企业咨询有限责任公司 A kind of method, apparatus and readable storage medium storing program for executing of lifting OCR discriminations
CN107622266A (en) * 2017-09-21 2018-01-23 平安科技(深圳)有限公司 A kind of processing method, storage medium and the server of OCR identifications
CN108228720A (en) * 2017-12-07 2018-06-29 北京字节跳动网络技术有限公司 Identify method, system, device, terminal and the storage medium of target text content and artwork correlation
CN108304814A (en) * 2018-02-08 2018-07-20 海南云江科技有限公司 A kind of construction method and computing device of literal type detection model
CN108388930A (en) * 2018-01-17 2018-08-10 链家网(北京)科技有限公司 The method and device of verification contract spare part picture correctness and integrality
CN108647262A (en) * 2018-04-27 2018-10-12 平安科技(深圳)有限公司 A kind of picture management method, device, computer equipment and storage medium
CN108874283A (en) * 2018-05-29 2018-11-23 努比亚技术有限公司 Image identification method, mobile terminal and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
马莉;: "复杂背景下基于OCR的变体文本识别技术", 科协论坛(下半月), no. 12, 25 December 2008 (2008-12-25) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination