CN110059559A - Processing method for OCR recognition files and electronic device thereof - Google Patents
- Publication number
- CN110059559A (CN201910198318.8A)
- Authority
- CN
- China
- Prior art keywords
- picture
- file
- ocr
- ocr identification
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/416—Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Character Discrimination (AREA)
Abstract
The present invention relates to image processing, and in particular to OCR recognition within image detection. It discloses a processing method for an OCR recognition file: acquire pictures of the document to be recognized and cache them; perform a validity check on the pictures according to a validity condition; when a picture does not meet the validity condition, reacquire the corresponding picture; and generate the OCR recognition file from the cached pictures and the reacquired picture. The invention correspondingly provides an electronic device and a computer storage medium. The technical solution can determine whether the document to be recognized contains a recognition defect, that is, a picture that fails the validity condition for OCR recognition. Because all pictures are cached locally or on a server, the pictures need not all be discarded when a recognition defect occurs, which improves the efficiency of generating the OCR recognition file and avoids wasting system resources.
Description
Technical field
The present invention relates to the field of image recognition, and more particularly to a processing method for an OCR recognition file and an electronic device thereof.
Background
OCR (Optical Character Recognition) recognizes the optical characters shown on a carrier and produces text output. Taking OCR of a paper document as an example, the optical characters obtained by capturing the printed text on the paper are recognized to yield data such as text information.
Recognition defects can arise in the document to be recognized, for example missing pages, blurred images, or program errors. Prior-art solutions can only discard the pictures already processed and capture the pictures of the manuscript again to form the OCR recognition file. For example, when a multi-page contract is scanned or photographed, situations that affect the OCR recognition file are quite likely to occur; all pictures obtained so far must then be discarded and picture capture started over.
Prior-art solutions therefore generate the file to be recognized inefficiently, take a long time, involve much repeated work, and cannot meet current OCR recognition requirements.
Summary of the invention
In view of the above problems, the invention proposes a processing method for an OCR recognition file that avoids the above technical defects and improves the efficiency of generating the file for OCR recognition.
In a first aspect, an embodiment of the present invention provides a processing method for an OCR recognition file, comprising:
acquiring pictures of the document to be recognized, and caching the pictures;
performing a validity check on the pictures according to a validity condition;
when a picture does not meet the validity condition, reacquiring the corresponding picture;
generating the OCR recognition file from the cached pictures and the reacquired picture.
With reference to the first aspect, the step of acquiring pictures of the document to be recognized comprises:
acquiring a plurality of pictures of the document to be recognized in sequence;
the step of performing a validity check on the pictures according to a validity condition comprises:
before generating the OCR recognition file, performing an integrity check on the cached pictures according to an integrity condition;
the step of reacquiring the corresponding picture when a picture does not meet the validity condition comprises:
when the pictures do not meet the integrity condition, acquiring the pictures of the missing part of the document to be recognized.
With reference to the first aspect, the step of performing an integrity check on the cached pictures according to an integrity condition comprises:
performing OCR on the page numbers of the cached pictures and determining the continuity of the page numbers; when the page numbers are not continuous, judging that the pictures have a missing page.
With reference to the first aspect, the step of performing an integrity check on the cached pictures according to an integrity condition comprises:
performing OCR on the text content of the cached pictures, and obtaining keywords from the text content;
verifying the pictures according to the keywords; if the keyword recognized in a picture is inconsistent with the keywords of the other pictures, judging that the pictures have a missing page.
With reference to the first aspect, the step of performing an integrity check on the cached pictures according to an integrity condition comprises:
performing OCR on the last line of text of the preceding picture and on the first line of text of the following picture, obtaining a first text content and a second text content respectively;
performing natural-language semantic analysis on the first text content and the second text content; if the two are not continuous, judging that the pictures have a missing page.
With reference to first aspect, the step of reacquisition corresponding picture, comprising:
According to the position of the leakage page, the picture of the leakage page is reacquired;
Described the step of OCR identification file is generated according to the picture of the caching and the picture of reacquisition, comprising:
The picture that the picture of the leakage page is inserted into the caching is lacked into position accordingly according to the sequence of file to be identified
It sets, all pictures is converted into OCR identification file.
With reference to first aspect, described the step of validity check is carried out to the picture according to effectiveness condition, comprising:
Before generating OCR identification file, identifiability inspection is carried out to the picture according to identity condition;
It is described when the picture does not meet the effectiveness condition, the step of reacquiring corresponding picture, comprising:
When the picture does not have identifiability, corresponding picture is reacquired;
Described the step of OCR identification file is generated according to the picture of the caching and the picture of reacquisition, comprising:
By the picture of the reacquisition replacement picture for not having identifiability, according to the picture of the caching and
The picture of replacement generates OCR and identifies file.
With reference to first aspect, described the step of validity check is carried out to the picture according to effectiveness condition, comprising:
Generate OCR identification file before, judge the picture with the presence or absence of virtualization, there are non-recognizable region or deformations;
It is described when the picture does not meet the effectiveness condition, the step of reacquiring corresponding picture, comprising:
According to the presence virtualization or there are non-recognizable region or the positions of the picture of deformation, reacquire corresponding position
Picture.
In a second aspect, the present invention provides an electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the processing method for an OCR recognition file according to any one of the above.
In a third aspect, the present invention provides a non-transitory computer-readable storage medium; when instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to execute the processing method for an OCR recognition file according to any one of the above.
Compared with the prior art, the solution provided by the invention acquires pictures of the document to be recognized and caches them; performs a validity check on the pictures according to a validity condition; reacquires the corresponding picture when a picture does not meet the validity condition; and generates the OCR recognition file from the cached pictures and the reacquired picture. The invention correspondingly provides an electronic device and a computer storage medium. The technical solution can determine whether the document to be recognized contains a recognition defect, that is, a picture that fails the validity condition for OCR recognition. Because all pictures are cached locally or on a server, the pictures need not all be discarded when a recognition defect occurs, which improves the efficiency of generating the OCR recognition file and avoids wasting system resources.
These and other aspects of the invention will be more readily apparent from the following description.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below evidently show only some embodiments of the invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the processing method for an OCR recognition file of the invention;
Fig. 2 is a schematic diagram of an application scenario of the processing method for an OCR recognition file of the invention;
Fig. 3 is a schematic diagram of the picture order of the document to be recognized and of a missing page;
Fig. 4 is a flowchart of the integrity check performed on a plurality of pictures;
Fig. 5 is a schematic diagram of the structure of one picture in the document to be recognized;
Fig. 6 is a flowchart of judging by keyword whether the document to be recognized has a missing page;
Fig. 7 is a flowchart of judging by first-line and last-line text whether a page is missing;
Fig. 8 is a schematic diagram of the structure of another picture in the document to be recognized;
Fig. 9 is a flowchart of reacquiring a picture and inserting it at the missing position;
Fig. 10 is a flowchart of performing a recognizability check on pictures and replacing them;
Fig. 11 is a flowchart of the specific recognizability judgments and the replacement;
Fig. 12 is a block diagram of part of the structure of a mobile phone related to the terminal provided by an embodiment of the invention.
Detailed description of embodiments
To enable those skilled in the art to better understand the solution of the invention, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings.
Some of the flows described in the specification, the claims, and the above drawings contain multiple operations that appear in a particular order. It should be clearly understood, however, that these operations may be executed out of the order in which they appear herein, or in parallel. Operation numbers such as 101 and 102 serve only to distinguish the operations and do not themselves imply any execution order. In addition, these flows may include more or fewer operations, and the operations may be executed sequentially or in parallel. Note that terms such as "first" and "second" herein are used to distinguish different messages, devices, modules, and the like; they do not indicate an order, nor do they require that the "first" and the "second" be of different types.
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art from these embodiments without creative effort fall within the protection scope of the invention.
This application concerns OCR, that is, optical character recognition. Referring to Fig. 1, the aim is to obtain the OCR recognition file more efficiently and to prevent defects arising at the recognition stage from blocking efficient generation of the file. To this end the application provides a processing method for an OCR recognition file, comprising:
Step S11: acquire pictures of the document to be recognized, and cache the pictures.
Referring to Fig. 2, which shows an application scenario of this application: user A can use a terminal 210 such as a mobile phone, or a client, to capture pictures of the manuscript and then generate the file used for OCR recognition. The user obtains the pictures by taking photos, recording video, or similar means. For caching, the user terminal can save the pictures locally. Alternatively, the user can upload the pictures to the server 220 for caching through the terminal or a client on the terminal.
Step S12: perform a validity check on the pictures according to a validity condition.
The pictures are checked against the validity condition, and any picture that fails it is flagged. The check can run on a terminal such as a mobile phone or on the server. When it runs on the terminal, the check can start as soon as a picture is acquired, which saves checking time. When it runs on the server, pictures are not lost if the client or the phone fails, for example when the client crashes or the terminal shuts down unexpectedly.
Step S13: when a picture does not meet the validity condition, reacquire the corresponding picture.
When a picture is judged not to meet the validity condition, the terminal or the server can prompt the user to capture the specified picture again.
Step S14: generate the OCR recognition file from the cached pictures and the reacquired picture.
After a picture has been reacquired, once the cached pictures and the reacquired picture all meet the validity condition and pass the validity check, the OCR recognition file is generated from them together. If the cached pictures all meet the validity condition on the first check, the OCR recognition file can be generated from them directly.
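Steps S11 to S14 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described by the application: `capture` and `is_valid` are hypothetical callables standing in for the camera or scanner and for the validity condition, and `max_retries` is an added safeguard that the text does not mention.

```python
from typing import Callable, Dict, List


def build_ocr_input(
    page_ids: List[int],
    capture: Callable[[int], bytes],   # hypothetical: returns the picture of one page
    is_valid: Callable[[bytes], bool], # hypothetical: the validity condition
    max_retries: int = 3,              # assumption: bound on re-capture attempts
) -> Dict[int, bytes]:
    """Cache every page, then re-capture only the pages that fail the check."""
    cache: Dict[int, bytes] = {}

    # S11: acquire every picture once and keep it in the cache.
    for pid in page_ids:
        cache[pid] = capture(pid)

    # S12 and S13: check each cached picture; reacquire only the failing ones.
    for pid in page_ids:
        retries = 0
        while not is_valid(cache[pid]):
            if retries >= max_retries:
                raise RuntimeError(f"page {pid} is still invalid after {max_retries} retries")
            cache[pid] = capture(pid)
            retries += 1

    # S14: the caller generates the OCR recognition file from this cache.
    return cache
```

The point of the design is visible in the second loop: a failing page triggers one more call to `capture` for that page only, while the other cached pictures are left untouched.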
To better illustrate the technical solution, refer to Fig. 3, which shows how several pictures of the manuscript are acquired and checked for validity during OCR recognition in this embodiment. Fig. 3 shows, by way of example, pictures 1 to 5 obtained by photographing the manuscript; the arrows indicate the reading order or recognition order. The validity check finds that the picture of "page 2" (drawn with a dashed line) has an OCR recognition defect. Such a defect may be a missing page, caused for example by a page being skipped during capture or by too long an interval between frames extracted from a video. It may also be blur caused by camera shake during shooting, a missing image region or mosaic introduced by later image processing, or a light spot covering the text caused by lighting or the manuscript paper.
Referring to Fig. 4, to obtain a continuous document to be recognized more efficiently, the processing method for an OCR recognition file in this embodiment comprises:
Step S41: acquire a plurality of pictures of the document to be recognized in sequence.
Here the document or manuscript to be recognized is the object of the current OCR recognition. When the document has multiple pages, the corresponding pictures are obtained in reading order, or from a video in which the pages are turned in reading order. Of course, when page numbers are present or reading-order information is marked on the pictures, the step of acquiring pictures may be performed on the document out of order.
Step S42: before generating the OCR recognition file, perform an integrity check on the cached pictures according to an integrity condition.
The integrity check examines the cached pictures against the integrity condition. Its purpose is to verify the continuity of the pictures from which the OCR recognition file will be generated and to avoid lost pages, mismatched page numbers, pages out of order, and the like.
Step S43: when the pictures do not meet the integrity condition, acquire the pictures of the missing part of the document to be recognized.
After the integrity check identifies which pictures fail the integrity condition, a prompt is issued on the terminal, or the terminal is instructed to acquire the missing pictures, so that the document to be recognized is complete.
This embodiment also provides a scheme for judging the integrity of the document to be recognized: OCR is performed on the page numbers of the cached pictures and the continuity of the page numbers is determined; when the page numbers are not continuous, the pictures are judged to have a missing page. Referring to Fig. 5 together with Fig. 3, reading the number in the page footer, for example "page number A1", makes it possible to judge whether a picture is the current page and whether the pictures are continuous. Further, by recognizing page numbers, for example the page number of the last page, it can be determined whether there are enough pictures and whether the pictures of the document are complete. In addition, setting a page number can control the number of pages in the OCR recognition file generated later. For example, if the last page expected in the OCR recognition file is page 53, acquisition stops when the current page number is detected to be 53, which reduces the number of pictures in the document to be recognized and hence its size. Beyond this, for documents that carry a signature block, such as contracts and letters, information from the signature block can serve as the indication of the last page.
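The page-number continuity check can be sketched as below. The sketch assumes the page numbers have already been read from the footers by an OCR engine and are passed in as integers; `find_missing_pages`, `first_page`, and `last_page` are hypothetical names chosen for illustration.

```python
from typing import Iterable, List, Optional


def find_missing_pages(
    recognized: Iterable[int],
    first_page: int = 1,               # assumption: numbering starts at 1
    last_page: Optional[int] = None,   # e.g. the expected final page, if known
) -> List[int]:
    """Return the page numbers absent from the recognized sequence."""
    present = set(recognized)
    if last_page is None:
        if not present:
            return []
        # Without an expected last page, only gaps up to the highest
        # recognized page number can be detected.
        last_page = max(present)
    return [p for p in range(first_page, last_page + 1) if p not in present]
```

For the situation of Fig. 3, `find_missing_pages([1, 3, 4, 5])` returns `[2]`. Passing `last_page` corresponds to the case in which the final page number is known in advance, so that pages missing from the end are also reported.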
Referring to Fig. 6, this embodiment also provides another scheme for judging the integrity of the document to be recognized, comprising:
Step S61: perform OCR on the text content of the cached pictures, and obtain keywords from the text content.
In some documents, keywords can be used to determine whether a page is the expected page, which avoids missing pages. Referring to Fig. 5 together with Fig. 2, OCR is applied to part of a picture of the document, for example the image region of the last line of text, to obtain the keyword.
Step S62: verify the pictures according to the keyword; if the keyword recognized in a picture is inconsistent with the keyword of the other pictures, judge that the pictures have a missing page.
The user can enter the keyword expected for page number A1; this keyword appears on every page, or does so with very high probability. Comparing the user-entered keyword with the keyword recognized by local OCR shows whether the picture is the specified page of the document. When the keyword is not recognized or does not match, the pictures of the document have a missing page. The user can also set a keyword for each page individually to improve accuracy.
In addition, when the document is a contract, for example, reading the contract barcode B can determine whether the pictures belong to the same document. In other examples, a document number or a special character can take the place of the contract barcode.
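A minimal sketch of the keyword verification in steps S61 and S62 follows. It assumes the OCR text of each picture is already available as a string keyed by its position in the cache; the function name and the sample contract number are hypothetical.

```python
from typing import Dict, List


def pages_failing_keyword(page_texts: Dict[int, str], keyword: str) -> List[int]:
    """Return the positions of pictures whose OCR text lacks the expected keyword.

    page_texts maps a picture's position to its recognized text (assumed to
    come from an OCR engine); keyword is the string expected on every page,
    such as a contract number entered by the user.
    """
    return sorted(pos for pos, text in page_texts.items() if keyword not in text)
```

A picture reported by this function either belongs to a different document or was recognized poorly, so in both cases it is a candidate for reacquisition. Exact substring matching is the simplest choice; a tolerant comparison may be preferable where OCR errors in the keyword itself are likely.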
Referring to Fig. 7, this embodiment also provides a further scheme for judging the integrity of the document to be recognized, comprising:
Step S71: perform OCR on the last line of text of the preceding picture and on the first line of text of the following picture, obtaining a first text content and a second text content respectively.
Referring to Fig. 8 together with Fig. 5, take Fig. 5 as the preceding picture and Fig. 8 as the following picture; the two are consecutive in order. OCR on the last-line text region E of Fig. 5 yields the first text content, and OCR on the first-line text region F of Fig. 8 yields the second text content.
Step S72: perform natural-language semantic analysis on the first text content and the second text content; if the two are not continuous, judge that the pictures have a missing page.
Semantic analysis determines whether the first text content and the second text content are semantically continuous. For example, if the first text content is "study hard," and the second is "make progress every day.", the two are continuous. If the first text content is "significance, statistical test" and the second is "Biology is the most promising discipline of the 21st century.", the two are not continuous; Fig. 8 is then not the page that follows Fig. 5 in reading order, so at least one page is missing between Fig. 5 and Fig. 8.
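The pairwise comparison of steps S71 and S72 can be sketched as a loop over adjacent pictures with a pluggable continuity test. The application does not specify the semantic analysis, so `naive_is_continuous` below is only a crude stand-in based on punctuation and capitalization in English text, not a semantic model; all names are hypothetical.

```python
from typing import Callable, List, Tuple

TERMINATORS = (".", "!", "?", "。", "！", "？")


def naive_is_continuous(last_line: str, first_line: str) -> bool:
    """Crude stand-in for semantic analysis (assumption, English text only)."""
    last_line, first_line = last_line.strip(), first_line.strip()
    if not last_line or not first_line:
        return True  # nothing to compare
    if last_line.endswith(TERMINATORS):
        return True  # sentence finished; this heuristic cannot tell
    # The previous page stops mid-sentence, so the next page should not
    # open with a capitalized new sentence.
    return not first_line[0].isupper()


def find_gaps(
    pages: List[Tuple[str, str]],  # (first line, last line) of each picture, in order
    is_continuous: Callable[[str, str], bool] = naive_is_continuous,
) -> List[int]:
    """Return each index i where a page seems to be missing between i and i+1."""
    return [
        i
        for i in range(len(pages) - 1)
        if not is_continuous(pages[i][1], pages[i + 1][0])
    ]
```

In practice the `is_continuous` argument would be replaced by whatever language model or semantic analyzer is available; the loop itself only fixes which two lines are compared for each pair of adjacent pictures.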
To address how to obtain the missing-page pictures and generate a complete OCR recognition file, this embodiment also provides a technical solution, comprising the steps of:
Step S91: reacquire the picture of the missing page according to the position of the missing page.
Referring to Fig. 3, the dashed box labeled 2 marks the picture of page 2 as missing from the sequence, immediately before the solid box labeled 3 for page 3. According to this position, the picture of the missing page 2 is reacquired. The position can be indicated to the user by displaying the image of page 1 or page 3, prompting the user to reacquire page 2. The user can also be prompted through text, a pop-up, a floating window, or similar forms. When the scheme is used in automated mechanical equipment, the picture of page 2 can be obtained directly by turning to the specified page.
Step S92: insert the picture of the missing page into the corresponding missing position among the cached pictures according to the order of the document to be recognized, and convert all the pictures into the OCR recognition file.
Following the document order shown in Fig. 3, the missing-page picture is inserted at the indicated missing position. Once all pictures are present, they are converted into the OCR recognition file. Further, to confirm that all pictures satisfy the integrity condition, an integrity check can be performed before the OCR recognition file is generated.
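The insertion in step S92 can be sketched as follows, assuming the cache is kept as a list of `(page_number, picture)` pairs sorted by page number. That representation and the function name are assumptions made for illustration.

```python
import bisect
from typing import List, Tuple


def insert_page(cache: List[Tuple[int, bytes]], page_no: int, picture: bytes) -> None:
    """Insert a reacquired page into the sorted cache at its missing position."""
    page_numbers = [n for n, _ in cache]
    idx = bisect.bisect_left(page_numbers, page_no)
    if idx < len(cache) and cache[idx][0] == page_no:
        # The page is already cached; replacing a defective picture is a
        # different operation from filling a gap.
        raise ValueError(f"page {page_no} is already in the cache")
    cache.insert(idx, (page_no, picture))
```

With a cache holding pages 1, 3, 4, and 5, calling `insert_page(cache, 2, picture)` places the new picture between pages 1 and 3 without touching the other entries.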
Recognizability is one aspect of a picture's validity for OCR. To confirm it, the application provides a corresponding technical solution; refer to Fig. 10:
Step S101: before generating the OCR recognition file, perform a recognizability check on the pictures according to a recognizability condition.
Recognizability is the degree to which OCR can extract valid information from a picture; this step judges whether each picture is recognizable. The check can be applied to a single picture or to several.
Step S102: when a picture is not recognizable, reacquire the corresponding picture.
When a picture is found not to be recognizable, the user is prompted, or the terminal is instructed, to reacquire that picture.
Step S103: replace the unrecognizable picture with the reacquired picture, and generate the OCR recognition file from the cached pictures and the replacement picture.
The reacquired picture replaces or updates the picture found to be unrecognizable, and the OCR recognition file is generated from the remaining cached pictures together with the reacquired picture.
In some applicable scenes, Figure 11 is please referred to, in order to solve the problems, such as that specific identifiability, the application provide one kind
Technical solution, the processing method of the OCR identification file, comprising:
Step S111: generate OCR identification file before, judge the picture with the presence or absence of virtualization, there are non-recognizable areas
Domain or deformation.
Figure 11 is referred to incorporated by reference to Fig. 5, by there are for the missing C of non-recognizable region, that is, image, when detecting in picture
It existing one piece when being greater than the blank value of preset threshold, can determine existing in non-recognizable region, picture at this time is not met can
Identity condition.By taking shade D as an example, when detecting occur the black of a large amount of black color lumps or concentration in picture by can recognize
Color color lump, it is possible to determine that occur shade D in picture, picture at this time does not meet identifiability condition.In addition, identifiability lacks
It falls into addition to i.e. there are other than non-recognizable region and shade for image missing, it is also possible to the identifiabilities such as blur or deform occur and lack
It falls into.
Step S112: according to the position of the picture that is blurred, contains a non-recognizable region, or is deformed, reacquire the picture at the corresponding position.
When any of the above identifiability defects is detected in a picture, the picture at the corresponding position can be reacquired and used to replace or update the original picture.
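The replace-by-position step can be sketched as follows. The function `replace_defective` and the `recapture` callback are hypothetical names introduced only for this illustration; the source does not specify how the picture is reacquired (for example by prompting the user to photograph the page again).

```python
from typing import Callable, List, Sequence, TypeVar

P = TypeVar("P")  # picture type: bytes, file path, decoded array, ...


def replace_defective(cache: Sequence[P],
                      defective_positions: Sequence[int],
                      recapture: Callable[[int], P]) -> List[P]:
    """Return a copy of the cache with each defective picture replaced.

    Only the pictures at the listed positions are reacquired; every other
    cached picture is kept, so nothing has to be captured twice.
    """
    updated = list(cache)
    for pos in defective_positions:
        updated[pos] = recapture(pos)
    return updated
```

Returning a new list rather than mutating the cache is a design choice of this sketch; updating the cache in place would match the "replace or update" wording equally well.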
The present application also provides an electronic device, comprising:
A processor;
A memory for storing processor-executable instructions;
Wherein the processor is configured to execute the processing method of the OCR identification file described in any one of the above.
To better explain the present invention, an embodiment of the present application also provides a terminal device, as shown in Figure 12. For ease of description, only the parts related to the embodiment of the present invention are shown; for specific technical details that are not disclosed, please refer to the method part of the embodiments of the present invention. The terminal may be any terminal device, including a mobile phone, a tablet computer, a PDA (Personal Digital Assistant), a POS (Point of Sales) terminal, a vehicle-mounted computer, and the like. Taking a mobile phone as the terminal as an example:
Figure 12 shows a block diagram of part of the structure of a mobile phone related to the terminal provided in an embodiment of the present invention. Referring to Figure 12, the mobile phone includes components such as a radio frequency (RF) circuit 1210, a memory 1220, an input unit 1230, a display unit 1240, a sensor 1250, an audio circuit 1260, a wireless fidelity (WiFi) module 1270, a processor 1280, and a power supply 1290. Those skilled in the art will understand that the mobile phone structure shown in Figure 12 does not limit the mobile phone, which may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components.
The components of the mobile phone are described below with reference to Figure 12:
Although not shown, the mobile phone may also include a camera, a Bluetooth module, and so on, which are not described here. In the embodiment of the present invention, the processor 1280 included in the terminal also has the following functions:
Obtaining the picture of the file to be identified, and caching the picture;
Carrying out a validity check on the picture according to a validity condition;
When the picture does not meet the validity condition, reacquiring the corresponding picture;
Generating the OCR identification file according to the cached picture and the reacquired picture.
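The four processor functions listed above can be sketched as one control flow. This is a minimal illustration under stated assumptions, not the implementation in the source: `capture`, `is_valid`, and `generate_ocr_file` are hypothetical callbacks standing in for the camera, the validity check, and the OCR engine, and the `max_retries` bound is an added assumption so the loop always terminates.

```python
from typing import Callable, List, TypeVar

P = TypeVar("P")  # picture type
R = TypeVar("R")  # type of the generated OCR identification file


def process_document(page_count: int,
                     capture: Callable[[int], P],
                     is_valid: Callable[[P], bool],
                     generate_ocr_file: Callable[[List[P]], R],
                     max_retries: int = 2) -> R:
    """Cache all pictures, recapture only the invalid ones, then run OCR."""
    # 1. Obtain the pictures of the file to be identified and cache them.
    cache = [capture(i) for i in range(page_count)]
    for i in range(page_count):
        retries = 0
        # 2. Validity check; 3. reacquire only the picture that failed it.
        while not is_valid(cache[i]) and retries < max_retries:
            cache[i] = capture(i)
            retries += 1
    # 4. Generate the OCR identification file from cached and reacquired pictures.
    return generate_ocr_file(cache)
```

Because the valid pictures stay in the cache, a single defective picture costs one extra capture rather than a full rescan.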
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the system, device, and units described above can be found in the corresponding processes of the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation. Multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
Those of ordinary skill in the art will understand that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and the like.
Those of ordinary skill in the art will understand that the methods of the above embodiments can be implemented by a program instructing relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
Compared with the prior art, the solution provided by the present invention obtains the picture of the file to be identified and caches the picture; carries out a validity check on the picture according to a validity condition; reacquires the corresponding picture when the picture does not meet the validity condition; and generates the OCR identification file according to the cached picture and the reacquired picture. The present invention correspondingly provides an electronic device and a computer storage medium. The technical solution provided by the present invention can judge whether the file to be identified has identification defects that do not meet the validity condition for OCR identification. Because all pictures are cached locally or on a server, it is not necessary to discard all the pictures when an identification defect occurs, which improves the efficiency of generating the file for subsequent OCR identification and avoids wasting system resources. It is worth noting that the subsequent OCR identification can be executed on the terminal or on the server, and the location of the cache is not limited.
The terminal device provided by the present invention has been described in detail above. For those of ordinary skill in the art, the specific implementation and the scope of application may change according to the ideas of the embodiments of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.
Claims (10)
1. A processing method of an OCR identification file, characterized by comprising:
Obtaining a picture of a file to be identified, and caching the picture;
Carrying out a validity check on the picture according to a validity condition;
When the picture does not meet the validity condition, reacquiring the corresponding picture;
Generating the OCR identification file according to the cached picture and the reacquired picture.
2. The processing method of the OCR identification file according to claim 1, characterized in that the step of obtaining the picture of the file to be identified comprises:
Successively obtaining multiple pictures of the file to be identified;
The step of carrying out a validity check on the picture according to the validity condition comprises:
Before generating the OCR identification file, carrying out an integrity check on the multiple cached pictures according to a completeness condition;
The step of reacquiring the corresponding picture when the picture does not meet the validity condition comprises:
When the pictures do not meet the completeness condition, obtaining the picture of the corresponding missing part of the file to be identified.
3. The processing method of the OCR identification file according to claim 2, characterized in that the step of carrying out an integrity check on the multiple cached pictures according to the completeness condition comprises:
Carrying out OCR identification on the page numbers of the cached pictures and determining the continuity of the page numbers; when the page numbers are discontinuous, judging that a page is missing from the pictures.
4. The processing method of the OCR identification file according to claim 2, characterized in that the step of carrying out an integrity check on the multiple cached pictures according to the completeness condition comprises:
Carrying out OCR identification on the text content of the cached pictures, and obtaining keywords in the text content;
Verifying the pictures according to the keywords; if the keywords identified in a picture are inconsistent with the keywords of the other pictures, determining that a page is missing from the pictures.
5. The processing method of the OCR identification file according to claim 2, characterized in that the step of carrying out an integrity check on the multiple cached pictures according to the completeness condition comprises:
Carrying out OCR identification on the last line of text of the previous picture and the first line of text of the next picture, respectively, to obtain a first text content and a second text content;
Carrying out natural semantic analysis on the first text content and the second text content; if the first text content and the second text content do not meet continuity, determining that a page is missing from the pictures.
6. The processing method of the OCR identification file according to any one of claims 3 to 5, characterized in that the step of reacquiring the corresponding picture comprises:
According to the position of the missing page, reacquiring the picture of the missing page;
The step of generating the OCR identification file according to the cached picture and the reacquired picture comprises:
Inserting the picture of the missing page into the corresponding missing position among the cached pictures according to the order of the file to be identified, and converting all the pictures into the OCR identification file.
7. The processing method of the OCR identification file according to claim 1, characterized in that the step of carrying out a validity check on the picture according to the validity condition comprises:
Before generating the OCR identification file, carrying out an identifiability check on the picture according to an identifiability condition;
The step of reacquiring the corresponding picture when the picture does not meet the validity condition comprises:
When the picture lacks identifiability, reacquiring the corresponding picture;
The step of generating the OCR identification file according to the cached picture and the reacquired picture comprises:
Replacing the picture that lacks identifiability with the reacquired picture, and generating the OCR identification file according to the cached pictures and the replacement picture.
8. The processing method of the OCR identification file according to claim 1, characterized in that the step of carrying out a validity check on the picture according to the validity condition comprises:
Before generating the OCR identification file, judging whether the picture is blurred, contains a non-recognizable region, or is deformed;
The step of reacquiring the corresponding picture when the picture does not meet the validity condition comprises:
According to the position of the picture that is blurred, contains a non-recognizable region, or is deformed, reacquiring the picture at the corresponding position.
9. An electronic device, characterized by comprising:
A processor;
A memory for storing processor-executable instructions;
Wherein the processor is configured to execute the processing method of the OCR identification file according to any one of claims 1 to 8.
10. A non-transitory computer-readable storage medium, characterized in that, when instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to execute the processing method of the OCR identification file according to any one of claims 1 to 8.
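As an illustration of the page-number continuity check of claim 3 and the insertion step of claim 6, the following minimal sketch assumes the page numbers have already been read from the cached pictures by OCR and are available as integers. The names `missing_pages`, `fill_gaps`, and `recapture` are hypothetical and are not taken from the source.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

P = TypeVar("P")  # picture type


def missing_pages(page_numbers: Sequence[int]) -> List[int]:
    """Page numbers skipped between consecutive cached pictures."""
    missing: List[int] = []
    for prev, cur in zip(page_numbers, page_numbers[1:]):
        if cur - prev > 1:  # discontinuity: at least one page was missed
            missing.extend(range(prev + 1, cur))
    return missing


def fill_gaps(cached: Sequence[Tuple[int, P]],
              recapture: Callable[[int], P]) -> List[Tuple[int, P]]:
    """Reacquire each missing page and insert it at its place in document order."""
    numbers = [number for number, _ in cached]
    merged = list(cached) + [(n, recapture(n)) for n in missing_pages(numbers)]
    return sorted(merged, key=lambda item: item[0])
```

This sketch only detects gaps between cached pages; pages missing before the first or after the last cached picture would need a separate signal, such as a known total page count.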
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910198318.8A CN110059559A (en) | 2019-03-15 | 2019-03-15 | The processing method and its electronic equipment of OCR identification file |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110059559A true CN110059559A (en) | 2019-07-26 |
Family
ID=67317002
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910198318.8A Pending CN110059559A (en) | 2019-03-15 | 2019-03-15 | The processing method and its electronic equipment of OCR identification file |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110059559A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112200185A (en) * | 2020-10-10 | 2021-01-08 | 航天科工智慧产业发展有限公司 | Method and device for reversely positioning picture by characters and computer storage medium |
CN113780121A (en) * | 2021-08-30 | 2021-12-10 | 国网上海市电力公司 | Power system operation instruction ticket automatic identification application method based on artificial intelligence |
CN114140778A (en) * | 2021-01-14 | 2022-03-04 | 北京灵伴即时智能科技有限公司 | Page turning abnormality detection method |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6154579A (en) * | 1997-08-11 | 2000-11-28 | At&T Corp. | Confusion matrix based method and system for correcting misrecognized words appearing in documents generated by an optical character recognition technique |
US20020063896A1 (en) * | 2000-11-29 | 2002-05-30 | Xerox Corporation | In an electronic reprographic system, provide automatic document integrity determination and page organization |
CN101533474A (en) * | 2008-03-12 | 2009-09-16 | 三星电子株式会社 | Character and image recognition system based on video image and method thereof |
EP2383970A1 (en) * | 2010-04-30 | 2011-11-02 | beyo GmbH | Camera based method for text input and keyword detection |
US8315465B1 (en) * | 2009-01-12 | 2012-11-20 | Google Inc. | Effective feature classification in images |
CN103455786A (en) * | 2012-05-28 | 2013-12-18 | 北京山海经纬信息技术有限公司 | Image recognition method and system |
CN104732226A (en) * | 2015-03-31 | 2015-06-24 | 浪潮集团有限公司 | Character recognition method and device |
CN106250830A (en) * | 2016-07-22 | 2016-12-21 | 浙江大学 | Digital book structured analysis processing method |
CN107437085A (en) * | 2017-08-09 | 2017-12-05 | 厦门商集企业咨询有限责任公司 | A kind of method, apparatus and readable storage medium storing program for executing of lifting OCR discriminations |
CN107622266A (en) * | 2017-09-21 | 2018-01-23 | 平安科技(深圳)有限公司 | A kind of processing method, storage medium and the server of OCR identifications |
CN108228720A (en) * | 2017-12-07 | 2018-06-29 | 北京字节跳动网络技术有限公司 | Identify method, system, device, terminal and the storage medium of target text content and artwork correlation |
CN108304814A (en) * | 2018-02-08 | 2018-07-20 | 海南云江科技有限公司 | A kind of construction method and computing device of literal type detection model |
CN108388930A (en) * | 2018-01-17 | 2018-08-10 | 链家网(北京)科技有限公司 | The method and device of verification contract spare part picture correctness and integrality |
CN108647262A (en) * | 2018-04-27 | 2018-10-12 | 平安科技(深圳)有限公司 | A kind of picture management method, device, computer equipment and storage medium |
CN108874283A (en) * | 2018-05-29 | 2018-11-23 | 努比亚技术有限公司 | Image identification method, mobile terminal and computer readable storage medium |
Non-Patent Citations (1)
Title |
---|
马莉;: "复杂背景下基于OCR的变体文本识别技术", 科协论坛(下半月), no. 12, 25 December 2008 (2008-12-25) * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||