CN112541461A - Automatic auditing method and device for consumption credentials without fixed format template - Google Patents

Info

Publication number
CN112541461A
Authority
CN
China
Prior art keywords
consumption
fixed format
audit
credentials
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011519262.0A
Other languages
Chinese (zh)
Inventor
卫浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan XW Bank Co Ltd
Original Assignee
Sichuan XW Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan XW Bank Co Ltd
Priority to CN202011519262.0A
Publication of CN112541461A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/103 Workflow collaboration or project management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Evolutionary Computation (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Operations Research (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Medical Informatics (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Character Discrimination (AREA)

Abstract

The invention relates to the field of computer vision recognition and discloses an automatic auditing method and device for consumption credentials that have no fixed format template. The method comprises the following steps: step 1: establishing a sample library; step 2: constructing and training a recognition model; step 3: inputting a consumption credential image and performing feature recognition with the recognition model to determine the type of the consumption credential. If recognition is unsuccessful, the image is output as uncertain and sent for manual audit; if the type is successfully recognized, each type of consumption credential is audited separately, and the image is then output as having passed or failed the audit. The invention thus recognizes and classifies consumption credentials that have no fixed format template before auditing them: recognized credentials are audited by type, and only the credentials that cannot be recognized are audited manually one by one, which greatly reduces the manpower and material resources required.

Description

Automatic auditing method and device for consumption credentials without fixed format template
Technical Field
The invention relates to the field of computer vision recognition, and in particular to an automatic auditing method and device for consumption credentials without a fixed format template.
Background
In recent years, with the rapid development of computer vision technologies such as deep learning, image recognition and OCR technologies have been widely applied in fields such as face recognition and license recognition, and have become increasingly mature for recognizing specific objects such as standardized licenses.
However, current computer vision recognition technology still produces large errors on non-standardized documents, and in some cases cannot recognize them at all. Taking the applicant's working scenario as an example, when a bank issues a loan it must examine the image of the consumption credential provided by the customer and verify that the image shows a genuine consumption credential. Among consumption credentials, only invoices have a relatively standard format; others, such as purchase contracts and receipts, have neither a standard format nor a fixed format template.
Therefore, there is an urgent need for a method and a device for auditing consumption credentials that is suitable for credentials without a fixed format template and saves manpower and material resources.
Disclosure of Invention
In view of the above problems, the present invention provides an automatic auditing method and device for consumption credentials without a fixed format template. The invention recognizes and classifies such credentials before auditing them: recognized credentials are audited by type, and only the credentials that cannot be recognized are audited manually one by one, which greatly reduces the manpower and material resources required.
In order to solve the above technical problems, the technical solution adopted by the invention is as follows:
A unified automatic auditing method for consumption credentials without a fixed format template comprises the following steps:
step 1: establishing a sample library;
step 2: constructing and training a recognition model;
step 3: inputting a consumption credential image and performing feature recognition with the recognition model to determine the type of the consumption credential; if recognition is unsuccessful, outputting the image as uncertain and performing a manual audit; if the type is successfully recognized, auditing each type of consumption credential separately and then outputting the image as having passed or failed the audit.
Preferably, step 1 comprises the following steps:
step 1.1: acquiring imaged consumption credentials;
step 1.2: labeling each sample obtained in step 1.1 with a target label, the labeled content being the credential type;
step 1.3: recognizing and storing the text content of the imaged consumption credentials;
step 1.4: performing word segmentation on the text content acquired in step 1.3 and counting word frequencies, the words and their frequencies forming word-word frequency features and word frequency proportion features;
step 1.5: vectorizing and encoding the text words obtained in step 1.4.
Preferably, in step 1.1 the imaged consumption credentials are acquired either by direct input through a device or by crawling pictures of various consumption credentials from the internet.
Preferably, in step 1.1, while the imaged consumption credentials are acquired from the internet, interference pictures are also randomly crawled and added to the sample library.
Preferably, the credential types in step 1.2 include invoices, contracts, receipts, and other non-consumption credentials.
Preferably, in step 1.3, the imaged consumption credentials are recognized using OCR technology.
Preferably, in step 2, training is performed with a multi-class machine learning model: standard samples are input, the probability of each consumption credential type is output, and the recognition model is obtained.
A unified automatic auditing device for consumption credentials without a fixed format template performs auditing using the above method.
The invention has the following beneficial effects:
(1) The invention recognizes and classifies consumption credentials that have no fixed format template before auditing them: recognized credentials are audited by type, and only the credentials that cannot be recognized are audited manually one by one, which greatly reduces the manpower and material resources required.
(2) The invention transfers consumption credentials whose type is not successfully recognized to manual audit, which ensures audit accuracy and prevents unrecognized credentials from being mixed in with the recognized types, where they would reduce the efficiency of auditing those types.
Drawings
A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings of which:
FIG. 1 is a schematic diagram of the overall workflow according to some embodiments herein.
FIG. 2 is a schematic diagram of the workflow of step 1 according to some embodiments herein.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly described below. It is obvious that the drawings in the following description are only examples or embodiments of the present description, and that for a person skilled in the art, the present description can also be applied to other similar scenarios on the basis of these drawings without inventive effort.
Referring to FIG. 1, a unified automatic auditing method for consumption credentials without a fixed format template includes the following steps:
step 1: establishing a sample library;
step 2: constructing and training a recognition model;
step 3: inputting a consumption credential image and performing feature recognition with the recognition model to determine the type of the consumption credential; if recognition is unsuccessful, outputting the image as uncertain and performing a manual audit; if the type is successfully recognized, auditing each type of consumption credential separately and then outputting the image as having passed or failed the audit.
It should be noted that, in some embodiments, step 1 comprises the following steps:
step 1.1: acquiring imaged consumption credentials;
step 1.2: labeling each sample obtained in step 1.1 with a target label, the labeled content being the credential type;
step 1.3: recognizing and storing the text content of the imaged consumption credentials;
step 1.4: performing word segmentation on the text content acquired in step 1.3 and counting word frequencies, the words and their frequencies forming word-word frequency features and word frequency proportion features;
step 1.5: vectorizing and encoding the text words obtained in step 1.4.
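Steps 1.4 and 1.5 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the sample text, and the vocabulary are hypothetical, and the regular-expression tokenizer merely stands in for a real word segmenter (Chinese OCR text would need a dedicated segmentation tool). The vector encoding shown is a plain count vector, which is only one possible reading of "vectorizing and encoding".

```python
import re
from collections import Counter


def segment(text):
    """Split text into lowercase tokens (placeholder for a real word segmenter)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_frequency_features(text):
    """Return (word -> count, word -> share of all tokens) for one document."""
    tokens = segment(text)
    counts = Counter(tokens)
    total = sum(counts.values())
    proportions = {w: c / total for w, c in counts.items()} if total else {}
    return counts, proportions


def vectorize(counts, vocabulary):
    """Encode a document as a count vector over a fixed, ordered vocabulary."""
    return [counts.get(word, 0) for word in vocabulary]


# Hypothetical OCR output for one imaged credential.
ocr_text = "Receipt: received payment. Payment total 120"
counts, proportions = word_frequency_features(ocr_text)

# Hypothetical vocabulary; in practice it would be built from the sample library.
vocabulary = ["invoice", "payment", "receipt", "total"]
vector = vectorize(counts, vocabulary)
```

Here `counts` plays the role of the word-word frequency feature, `proportions` the word frequency proportion feature, and `vector` the encoded form passed on to training.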
It should be further noted that, in some embodiments, the imaged consumption credentials in step 1.1 may be input directly through a device, such as a storage hard disk, or obtained by crawling pictures of various consumption credentials from the internet. In addition, in step 1.1, interference pictures, including but not limited to landscape and portrait pictures, may be randomly crawled from the internet and added to the sample library.
It is also noted that, in some embodiments, the consumption credential types in step 1.2 include, but are not limited to, conventional invoices, contracts, receipts, and other non-consumption credentials. If necessary, users may also define their own credential types, such as domestic consumption credentials and foreign consumption credentials. If interference pictures were crawled from the internet in step 1.1, their target label is set to "other non-consumption credentials" in step 1.2.
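The labeling rule of step 1.2 can be sketched as follows. This is only an illustration under stated assumptions: the `Sample` record, the `label_sample` helper, the label strings, and the file paths are all hypothetical and do not come from the specification.

```python
from dataclasses import dataclass

# Label set assumed for illustration; the specification allows user-defined types.
CREDENTIAL_TYPES = ("invoice", "contract", "receipt", "other_non_consumption")


@dataclass
class Sample:
    image_path: str
    label: str


def label_sample(image_path, label=None, is_interference=False):
    """Attach a credential-type label to one image.

    Interference pictures always receive the catch-all label, whatever
    label was passed in.
    """
    if is_interference:
        label = "other_non_consumption"
    if label not in CREDENTIAL_TYPES:
        raise ValueError(f"unknown credential type: {label!r}")
    return Sample(image_path, label)


# Hypothetical sample library mixing real credentials and one interference picture.
library = [
    label_sample("samples/0001.png", "invoice"),
    label_sample("samples/0002.png", "receipt"),
    label_sample("crawled/landscape_17.jpg", is_interference=True),
]
```

Labeling interference pictures as a class of their own gives the model explicit negative examples, so that an unrelated image is not forced into one of the credential types.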
It is further noted that, in some embodiments, the imaged consumption credentials are recognized in step 1.3 using OCR technology. OCR is a mature image recognition technology currently available on the market and performs text recognition on pictures well.
It should be noted that, in some embodiments, in step 2, training is performed with a multi-class machine learning model: standard samples are input, the probability of each consumption credential type is output, and the recognition model is obtained. Specifically, the standard samples are built from the word frequency, word-word frequency, and word frequency proportion features of the various imaged consumption credentials together with the vectorized encodings of the text words, obtained in step 1.4 and step 1.5; the classification model is trained on these samples and outputs the probability of each consumption credential type, yielding the recognition model.
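The specification does not name a particular multi-class model. As one possible stand-in, the sketch below uses a Laplace-smoothed multinomial naive Bayes classifier written in plain Python; the choice of algorithm, the function names, and the tiny training set are all assumptions made for illustration, not the method of the invention.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """samples: list of (tokens, label). Returns per-class document counts,
    per-class word counts, and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in samples:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_docs, word_counts, vocabulary


def predict_proba(model, tokens):
    """Return {label: probability} for one tokenized document."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    log_scores = {}
    for label, n_docs in class_docs.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for token in tokens:
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        log_scores[label] = score
    # Normalise in log space to avoid numerical underflow.
    peak = max(log_scores.values())
    exp_scores = {label: math.exp(s - peak) for label, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {label: value / norm for label, value in exp_scores.items()}


# Hypothetical, hand-made training samples (tokens from step 1.4, labels from step 1.2).
training = [
    (["invoice", "tax", "amount"], "invoice"),
    (["invoice", "tax", "number"], "invoice"),
    (["contract", "party", "signature"], "contract"),
    (["receipt", "received", "amount"], "receipt"),
    (["mountain", "lake", "sunset"], "other_non_consumption"),
]
model = train_naive_bayes(training)
probabilities = predict_proba(model, ["invoice", "tax", "amount"])
```

The returned dictionary holds one probability per credential type, which is the form of output that step 3 compares against the recognition threshold.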
It should be noted that, in some embodiments, the threshold for successful recognition in step 3 can be set as needed. For example: if the probability that the recognition model outputs for some consumption credential type is higher than 80%, recognition is deemed successful; if it is lower than 80%, recognition is deemed unsuccessful. The 80% figure is only an example, and the threshold can be adjusted according to the precision the work requires.
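The threshold rule can be sketched in a few lines of Python. The function name and the return values are hypothetical. The text does not say how a probability of exactly 80% is handled; this sketch assumes it counts as unsuccessful.

```python
def route_by_confidence(probabilities, threshold=0.8):
    """Decide whether a credential image is recognized.

    probabilities: {credential_type: probability} from the recognition model.
    Returns (credential_type, "auto_audit") when the highest probability
    exceeds the threshold, otherwise (None, "manual_audit").
    """
    best_type = max(probabilities, key=probabilities.get)
    if probabilities[best_type] > threshold:
        return best_type, "auto_audit"
    return None, "manual_audit"


# Hypothetical model outputs.
confident = route_by_confidence({"invoice": 0.91, "contract": 0.05, "receipt": 0.04})
uncertain = route_by_confidence({"invoice": 0.55, "contract": 0.30, "receipt": 0.15})
```

Raising `threshold` sends more images to manual audit in exchange for fewer misclassified credentials in the automatic path.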
It should be noted that, in some embodiments, after an input consumption credential image is successfully recognized in step 3, it is collected into the consumption credential image database of the corresponding type, and subsequent auditing is then performed as needed. Because each such database contains documents of a single, uniform type, the documents can be machine-audited directly using existing recognition technology, or audited together by a dedicated person; in either case, auditing efficiency is significantly improved and manpower and material resources are saved. In the subsequent audit, whether a consumption credential passes is output according to the audit standard for its type.
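Collecting recognized images by type and then auditing each group can be sketched as below. The specification does not define the audit standards, so the per-type auditor here is a deliberately trivial placeholder; the function names, image identifiers, and result strings are likewise hypothetical.

```python
from collections import defaultdict


def group_by_type(recognized):
    """recognized: iterable of (image_id, credential_type).
    Returns {credential_type: [image_id, ...]}."""
    groups = defaultdict(list)
    for image_id, credential_type in recognized:
        groups[credential_type].append(image_id)
    return dict(groups)


def audit_groups(groups, auditors):
    """Apply the auditor registered for each type.

    auditors: {credential_type: callable(image_id) -> bool}.
    Types with no registered auditor are left for a dedicated person.
    """
    results = {}
    for credential_type, image_ids in groups.items():
        auditor = auditors.get(credential_type)
        for image_id in image_ids:
            if auditor is None:
                results[image_id] = "manual_review"
            else:
                results[image_id] = "passed" if auditor(image_id) else "failed"
    return results


# Placeholder audit standard: an invoice passes if it is in a verified set.
verified_invoices = {"img-001"}
auditors = {"invoice": lambda image_id: image_id in verified_invoices}

groups = group_by_type(
    [("img-001", "invoice"), ("img-002", "invoice"), ("img-003", "contract")]
)
results = audit_groups(groups, auditors)
```

Keeping one auditor per type mirrors the idea that each type-specific collection is uniform enough to be checked by a single rule set or a single reviewer.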
It should be further noted that, in some embodiments, after the consumption credential image input in step 3 is successfully recognized, it may instead be audited directly, by machine or by a dedicated person, without first being collected into a database.
A unified automatic auditing device for consumption credentials without a fixed format template performs auditing using the above method.
Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is to be regarded as illustrative only and not as limiting the present specification. Various modifications, improvements and adaptations to the present description may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present specification and thus fall within the spirit and scope of the exemplary embodiments of the present specification.
Also, the description uses specific words to describe embodiments of the description. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the specification is included. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the specification may be combined as appropriate.
Additionally, the order in which the elements and sequences of the process are recited in the specification, the use of alphanumeric characters, or other designations, is not intended to limit the order in which the processes and methods of the specification occur, unless otherwise specified in the claims. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.
Similarly, it should be noted that in the preceding description of embodiments of the present specification, various features are sometimes grouped together in a single embodiment, figure, or description thereof in order to streamline the disclosure and aid the understanding of one or more of the embodiments. This method of disclosure, however, is not intended to imply that the claimed subject matter requires more features than are expressly recited in a claim. Indeed, an embodiment may have fewer than all of the features of a single embodiment disclosed above.
Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present disclosure. Other variations are also possible within the scope of the present description. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the specification can be considered consistent with the teachings of the specification. Accordingly, the embodiments of the present description are not limited to only those embodiments explicitly described and depicted herein.

Claims (8)

1. A unified automatic auditing method for consumption credentials without a fixed format template, characterized by comprising the following steps:
step 1: establishing a sample library;
step 2: constructing and training a recognition model;
step 3: inputting a consumption credential image and performing feature recognition with the recognition model to determine the type of the consumption credential; if recognition is unsuccessful, outputting the image as uncertain and performing a manual audit; if the type is successfully recognized, auditing each type of consumption credential separately and then outputting the image as having passed or failed the audit.
2. The unified automatic auditing method for consumption credentials without a fixed format template according to claim 1, characterized in that step 1 comprises the following steps:
step 1.1: acquiring imaged consumption credentials;
step 1.2: labeling each sample obtained in step 1.1 with a target label, the labeled content being the credential type;
step 1.3: recognizing and storing the text content of the imaged consumption credentials;
step 1.4: performing word segmentation on the text content acquired in step 1.3 and counting word frequencies, the words and their frequencies forming word-word frequency features and word frequency proportion features;
step 1.5: vectorizing and encoding the text words obtained in step 1.4.
3. The unified automatic auditing method for consumption credentials without a fixed format template according to claim 2, wherein: in step 1.1 the imaged consumption credentials are acquired either by direct input through a device or by crawling pictures of various consumption credentials from the internet.
4. The unified automatic auditing method for consumption credentials without a fixed format template according to claim 2, wherein: in step 1.1, while the imaged consumption credentials are acquired from the internet, interference pictures are also randomly crawled and added to the sample library.
5. The unified automatic auditing method for consumption credentials without a fixed format template according to claim 2, wherein: the credential types in step 1.2 include invoices, contracts, receipts, and other non-consumption credentials.
6. The unified automatic auditing method for consumption credentials without a fixed format template according to claim 2, wherein: in step 1.3, the imaged consumption credentials are recognized using OCR technology.
7. The unified automatic auditing method for consumption credentials without a fixed format template according to claim 2, wherein: in step 2, training is performed with a multi-class machine learning model: standard samples are input, the probability of each consumption credential type is output, and the recognition model is obtained.
8. A unified automatic auditing device for consumption credentials without a fixed format template, characterized in that: the device performs auditing using the unified automatic auditing method for consumption credentials without a fixed format template according to any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011519262.0A CN112541461A (en) 2020-12-21 2020-12-21 Automatic auditing method and device for consumption credentials without fixed format template

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011519262.0A CN112541461A (en) 2020-12-21 2020-12-21 Automatic auditing method and device for consumption credentials without fixed format template

Publications (1)

Publication Number Publication Date
CN112541461A true CN112541461A (en) 2021-03-23

Family

ID=75019356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011519262.0A Pending CN112541461A (en) 2020-12-21 2020-12-21 Automatic auditing method and device for consumption credentials without fixed format template

Country Status (1)

Country Link
CN (1) CN112541461A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108717545A (en) * 2018-05-18 2018-10-30 北京大账房网络科技股份有限公司 A kind of bank slip recognition method and system based on mobile phone photograph
CN109977957A (en) * 2019-03-04 2019-07-05 苏宁易购集团股份有限公司 A kind of invoice recognition methods and system based on deep learning
CN110334640A (en) * 2019-06-28 2019-10-15 苏宁云计算有限公司 A kind of ticket processing method and system
CN110457973A (en) * 2018-05-07 2019-11-15 北京中海汇银财税服务有限公司 A kind of method and system of bank slip recognition
CN111626279A (en) * 2019-10-15 2020-09-04 西安网算数据科技有限公司 Negative sample labeling training method and highly-automated bill identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination