CN112115735A - Identification management method for confidential files - Google Patents

Publication number
CN112115735A
Authority
CN
China
Prior art keywords
document
confidential
file
template
confidential document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910528541.4A
Other languages
Chinese (zh)
Inventor
冯迪
汤丹
支劲超
顾梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Jiangsu Electric Power Co Ltd
Changzhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Jiangsu Electric Power Co Ltd
Changzhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-06-19
Filing date
2019-06-19
Publication date
2020-12-22
Application filed by State Grid Corp of China SGCC, State Grid Jiangsu Electric Power Co Ltd, Changzhou Power Supply Co of State Grid Jiangsu Electric Power Co Ltd
Priority to CN201910528541.4A
Publication of CN112115735A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Character Discrimination (AREA)

Abstract

The invention relates to an identification management method for confidential files, comprising the following steps: first, preprocessing; second, text detection; third, optical character recognition; fourth, extracting keywords from photos and checking whether they are confidential files; fifth, checking whether a file is confidential by means of an OCR template for confidential documents; sixth, using EXIF information as an aid; seventh, setting a suspicion coefficient and uploading highly suspicious pictures to a back-end administrator; eighth, querying documents; and ninth, improving scanning efficiency. For the detection of confidential documents, the invention not only uses existing OCR technology but also generates multiple sets of templates based on the characteristics of confidential documents, thereby improving the recognition rate and analysis speed for confidential documents.

Description

Identification management method for confidential files
Technical Field
The invention relates to the field of file identification management, in particular to an identification management method for a confidential file.
Background
In the past, confidentiality management centered on paper documents, and each company had a strict management system that kept confidentiality work orderly. As technology developed and electronic documents became widespread, dedicated encrypted USB drives came into uniform use to keep documents safely stored: a user must enter a user name and password and can view the documents only after logging in, which largely prevents electronic documents from leaking.
However, as technology continues to develop, confidentiality work in the new era is no longer limited to managing paper and electronic documents. The spread of smartphones with high-resolution cameras brings new problems to document confidentiality.
While a file is in circulation, anyone carrying a smartphone can easily photograph a computer display or a paper document and obtain a high-definition picture of its content. Some internal documents have previously been leaked in exactly this way, photographed with a mobile phone and spread on the Internet, with harmful consequences.
Given this situation, on the one hand the management of confidential documents should be further improved and employee education strengthened, prohibiting employees from storing confidential documents on their mobile phones in any form. On the other hand, emerging technology should be actively used to strengthen the monitoring of mobile phone photos and of documents in specified formats.
Disclosure of Invention
The invention aims to provide an identification management method for confidential files with a high recognition rate and good reliability.
The technical solution that achieves this aim is an identification management method for confidential files, comprising the following steps:
the first step, preprocessing: image data is first received and deskewed so that it is aligned in the horizontal and vertical directions; an algorithm then detects whether the image is a confidential file; finally the image is binarized to facilitate recognition;
three approaches can be used to identify images: 1. high-threshold adaptive binarization; 2. a convolutional neural network (CNN); 3. a Haar feature classifier;
the second step, text detection: two schemes can accomplish text detection: 1. detecting text by connected components; 2. detecting text by using a grid; the two are combined, applying the connected-component algorithm first and then the grid method to optimize the result;
the third step, optical character recognition: a convolutional neural network (CNN) is trained on the relevant fonts, and part of its output is compared to raise the recognition probability: fonts commonly used in confidential documents serve as manually labeled training samples; because many characters in confidential documents have equal width, a non-uniform image segmentation technique is used to obtain the approximate width of each character and give a rough classification, after which the convolutional neural network performs recognition;
the fourth step: extracting keywords from the photos and checking whether they are confidential files;
the fifth step, checking whether a file is confidential by means of an OCR template for confidential documents: a conventional OCR algorithm can find some of the files; however, image processing is a very complex process, so to improve the software's recognition rate, recognition templates are used alongside it, and the picture is then processed by template matching;
the top of a confidential document generally carries a confidentiality marking, so the template is aligned with an equally sized region of the original image and compared, then shifted by one pixel and compared again; after all positions have been compared, a matching-degree value is obtained, which can then be compared against a set threshold;
the sixth step, EXIF information as an aid: the geographic location where a picture was taken is obtained by pre-reading the EXIF information of the picture file; pictures produced during working hours and near the office area are analyzed more intensively, which further improves the accuracy of scan detection;
the seventh step, setting a suspicion coefficient: the likelihood that a picture is confidential is judged jointly from the picture's similarity and the geographic location where it was taken; highly suspicious pictures are directly quarantined and deleted and are uploaded to a back-end administrator; for the others, the user is reminded to check them personally;
the eighth step, document query: for documents stored on the mobile phone, the relevant components are called directly to read the content and a full-text search is performed, using sequential scanning to check each file from beginning to end for keywords;
the ninth step, improving scanning efficiency: SQLite is used to record the scanned files, including file name, size and last modification time; at the next scan, the system automatically compares files against these records and skips any file that has already been scanned.
Further, in the first step, high-threshold adaptive binarization is preferred.
Further, in the fourth step, whether the file is confidential is checked against predefined keywords including confidentiality, secrecy, internal matters, salary and planning.
Further, in the fifth step, when templates are actually set up, templates for confidential documents can be summarized according to the font and language format of the documents; the matching degree can then be obtained by the following algorithms:
[Formula images not reproduced in the source text]
The invention has the following positive effects: (1) For the detection of confidential documents, the invention not only uses existing OCR technology but also generates multiple sets of templates based on the characteristics of confidential documents, thereby improving the recognition rate and analysis speed for confidential documents.
(2) To strengthen the identification of confidential documents, the system adds a geographic location check. If a photo was taken at the workplace, detection is intensified.
(3) The invention optimizes the retrieval algorithm and improves retrieval efficiency.
(4) The invention establishes a grading system for suspicious files so that different reminders can be given. Files with significant suspicion are deleted and uploaded to the administrator.
Detailed Description
(Example 1)
The identification management method for confidential files in this embodiment uses existing image OCR technology and document scanning technology to scan the documents and images on a mobile phone and verify whether they contain keywords.
Documents and picture files stored on the mobile phone are scanned against predefined keywords such as confidentiality, secrecy, internal matters, salary and planning; the final result is fed back, and the user is prompted to handle any documents and picture files that may contain sensitive words.
The most critical part of this function is the scan detection of picture files. Based on an optimized and improved OCR algorithm, each pixel of the picture is analyzed, with combined features such as file format, font and character color used as auxiliary evidence; a template library of confidential files is set up so that the text contained in the picture is obtained more accurately.
The method for identifying and managing the confidential document specifically comprises the following steps:
Step one, preprocessing: image data is first received and deskewed so that it is aligned in the horizontal and vertical directions; an algorithm then detects whether the image is a confidential file; finally the image is binarized to facilitate recognition.
Three schemes can be used to identify images: 1. high-threshold adaptive binarization; 2. a convolutional neural network (CNN); 3. a Haar feature classifier.
High-threshold adaptive binarization is preferred.
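The binarization step can be illustrated with a minimal sketch. The patent does not give the algorithm's details, so this assumes a local-mean adaptive threshold over a grayscale image stored as a list of rows; the function name `adaptive_binarize`, the `window` size and the offset `c` are hypothetical choices for illustration only.

```python
def adaptive_binarize(gray, window=3, c=5):
    """Binarize a grayscale image given as a list of rows (values 0-255).

    Assumption: a pixel counts as ink (1) when it is darker than the mean
    of its window x window neighbourhood minus the offset c, else 0.
    """
    h, w = len(gray), len(gray[0])
    r = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Collect the neighbourhood, clipped at the image border.
            vals = [gray[j][i]
                    for j in range(max(0, y - r), min(h, y + r + 1))
                    for i in range(max(0, x - r), min(w, x + r + 1))]
            local_mean = sum(vals) / len(vals)
            out[y][x] = 1 if gray[y][x] < local_mean - c else 0
    return out
```

Because the threshold follows the local brightness, uneven lighting in a phone photo is less likely to wipe out text than with a single global threshold; a larger `c` makes the test stricter.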
Step two, text detection: two schemes can accomplish text detection: 1. detecting text by connected components; 2. detecting text by using a grid.
Detection by connected components produces a lot of noisy text, so an additional threshold must be set to filter it out. Semantics are recovered mainly by combining the nearest characters into words. After the text has been grouped into lines, height is used to judge whether pieces of text belong to the same line.
Detection by grid avoids much of this noisy text.
Text detection is accomplished by combining the two methods: the connected-component algorithm is applied first, and the grid method then optimizes the result.
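A minimal sketch of the connected-component half of this step is given below; the grid-based refinement is not shown. It assumes a binary image (1 = ink) as a list of rows, 4-connectivity, and a pixel-count threshold for noise filtering. The names `connected_components` and `same_line` and the parameters `min_pixels` and `min_overlap` are hypothetical.

```python
from collections import deque


def connected_components(binary, min_pixels=2):
    """Return bounding boxes (top, left, bottom, right) of 4-connected ink
    regions, dropping regions smaller than min_pixels as noise."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(pixels) >= min_pixels:
                ys = [p[0] for p in pixels]
                xs = [p[1] for p in pixels]
                boxes.append((min(ys), min(xs), max(ys), max(xs)))
    return boxes


def same_line(a, b, min_overlap=0.5):
    """Height-based heuristic: two boxes share a line when their vertical
    overlap covers at least min_overlap of the shorter box."""
    overlap = min(a[2], b[2]) - max(a[0], b[0]) + 1
    shorter = min(a[2] - a[0], b[2] - b[0]) + 1
    return overlap >= min_overlap * shorter
```

The `min_pixels` filter plays the role of the extra noise threshold mentioned above, and `same_line` is one possible reading of judging line membership by height.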
Step three, optical character recognition: a convolutional neural network (CNN) is trained on the relevant fonts, and part of its output is compared to raise the recognition probability. Fonts commonly used in confidential documents serve as manually labeled training samples. Based on the characteristics of confidential documents, a non-uniform image segmentation technique is used to obtain the approximate width of each character and give a rough classification, after which the convolutional neural network performs recognition. Combining the two improves the character recognition rate.
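The character-width estimate can be sketched as follows. The patent does not describe its segmentation technique in detail, so this assumes a simple vertical projection over a binarized text line and takes the median run width as the approximate character width; the CNN recognition stage is not shown. The names `segment_columns` and `estimate_char_width` are hypothetical.

```python
def segment_columns(binary):
    """Return (start, end) column ranges of a binarized text line that
    contain ink, i.e. candidate character cells."""
    width = len(binary[0])
    ink = [any(row[x] for row in binary) for x in range(width)]
    runs, start = [], None
    for x, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = x                      # a new run of inked columns begins
        elif not has_ink and start is not None:
            runs.append((start, x - 1))    # the run ended at the previous column
            start = None
    if start is not None:
        runs.append((start, width - 1))
    return runs


def estimate_char_width(runs):
    """Approximate character width as the median run width (0 if no runs)."""
    widths = sorted(end - start + 1 for start, end in runs)
    return widths[len(widths) // 2] if widths else 0
```

With mostly equal-width characters, runs much wider than the estimate are likely touching characters and could be split before being passed to a classifier.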
Step four: keywords are extracted from the photos to check whether they are confidential files. Whether a file is confidential is checked against predefined keywords such as confidentiality, secrecy, internal matters, salary and planning.
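The keyword check on recognized text can be sketched as below. The keyword list uses English stand-ins for the translated terms above and is purely illustrative; `KEYWORDS`, `find_keywords` and `looks_confidential` are hypothetical names.

```python
# Illustrative English stand-ins for the predefined keywords.
KEYWORDS = ("confidential", "secret", "internal", "salary", "planning")


def find_keywords(text, keywords=KEYWORDS):
    """Return the keywords that occur in the OCR text, case-insensitively."""
    lowered = text.lower()
    return [k for k in keywords if k in lowered]


def looks_confidential(text, keywords=KEYWORDS):
    """A picture is flagged when its text contains at least one keyword."""
    return bool(find_keywords(text, keywords))
```

Plain substring matching is the simplest option; OCR errors would in practice call for some tolerance, which this sketch does not attempt.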
Step five, checking whether a file is confidential by means of an OCR template for confidential documents: a conventional OCR algorithm can indeed find some of the files. However, image processing is a very complex process, so to improve the software's recognition rate, recognition templates are used alongside it. The picture is then processed by template matching.
The top of a confidential document generally carries a confidentiality marking, so the template is aligned with an equally sized region of the original image and compared, then shifted by one pixel and compared again; after all positions have been compared, a matching-degree value is obtained, which can then be compared against a set threshold.
When templates are actually set up, templates for confidential documents can be summarized according to the font, language format and other features of the documents. Several algorithms can then be used to compute the matching degree.
[Formula images not reproduced in the source text]
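The sliding comparison described in step five can be sketched as follows. The patent's matching-degree formulas were published as images and are not available in this text, so the sketch assumes one common measure, a normalized sum of squared differences mapped to a score in [0, 1]; `match_template`, `contains_mark` and the `threshold` value are hypothetical.

```python
def match_template(image, template):
    """Slide the template over a grayscale image (lists of rows, 0-255).

    Returns (best_score, (row, col)), where the assumed score is
    1 - SSD / (n * 255^2): 1.0 means the region equals the template.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    max_ssd = th * tw * 255 ** 2
    best_score, best_pos = -1.0, None
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            # Compare the template with the equally sized region at (y, x).
            ssd = sum((image[y + j][x + i] - template[j][i]) ** 2
                      for j in range(th) for i in range(tw))
            score = 1.0 - ssd / max_ssd
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_score, best_pos


def contains_mark(image, template, threshold=0.9):
    """Compare the best matching degree against a set threshold."""
    score, _ = match_template(image, template)
    return score >= threshold
```

Since the marking usually sits at the top of the page, restricting `image` to the top strip would cut the number of positions compared; that optimization is left out here.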
Step six, EXIF information as an aid: the geographic location where a picture was taken is obtained by pre-reading the EXIF information of the picture file. Pictures produced during working hours and near the office area are analyzed more intensively, which further improves the accuracy of scan detection.
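A minimal sketch of the location and time check follows. It assumes the coordinates and capture time have already been read from the picture's EXIF GPS and date-time tags, and it treats "near the office during working hours" as both conditions holding. The office coordinates, radius and working hours are made-up illustrative values, and `needs_closer_analysis` is a hypothetical name.

```python
import math
from datetime import datetime

OFFICE = (31.81, 119.97)   # hypothetical office latitude and longitude
RADIUS_KM = 1.0            # hypothetical "near the office" radius
WORK_HOURS = (9, 18)       # hypothetical working hours, [start, end)


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def needs_closer_analysis(gps, taken_at, office=OFFICE,
                          radius_km=RADIUS_KM, work_hours=WORK_HOURS):
    """True when the picture was taken near the office during working hours.

    gps is a (lat, lon) tuple or None when the picture has no GPS data.
    """
    near_office = gps is not None and haversine_km(gps, office) <= radius_km
    in_hours = work_hours[0] <= taken_at.hour < work_hours[1]
    return near_office and in_hours
```

Pictures without GPS data fall through as not near the office here; a stricter policy could instead treat missing location data as a reason to analyze them.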
Step seven, setting a suspicion coefficient: the likelihood that a picture is confidential is judged jointly from the picture's similarity and the geographic location where it was taken. Highly suspicious pictures are directly quarantined and deleted and are uploaded to a back-end administrator. For the others, the user is reminded to check them personally.
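One way to combine the two signals is sketched below. The patent does not state how the coefficient is computed, so the weighted sum, the weights and the cut-off are assumptions made for illustration; `suspicion_coefficient` and `decide` are hypothetical names.

```python
def suspicion_coefficient(similarity, near_office,
                          w_similarity=0.8, w_location=0.2):
    """Combine template similarity (0..1) with the location flag.

    Assumption: a weighted sum, with the location contributing a fixed bonus.
    """
    return w_similarity * similarity + (w_location if near_office else 0.0)


def decide(score, high=0.8):
    """Map the coefficient to one of the two actions described in step seven."""
    if score >= high:
        return "quarantine_delete_and_report"   # high suspicion
    return "remind_user"                        # everything else
```

For example, a similarity of 0.9 crosses the assumed cut-off only when the picture was also taken near the office.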
Step eight, document query: for documents stored on the mobile phone, the relevant components are called directly to read the content, and a full-text search is then performed. Search methods generally fall into two kinds, sequential scanning and indexing. Because the files on a mobile phone are small and building an index takes a long time, sequential scanning is adopted, checking each file from beginning to end for keywords.
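Sequential scanning can be sketched as below, assuming the documents are plain-text files that can be read directly; real office formats would need the platform's document components to extract text first. `scan_file` and `scan_directory` are hypothetical names.

```python
import os


def scan_file(path, keywords, encoding="utf-8"):
    """Read one file from beginning to end and return the keywords found."""
    found = set()
    with open(path, "r", encoding=encoding, errors="ignore") as fh:
        for line in fh:
            for keyword in keywords:
                if keyword in line:
                    found.add(keyword)
    return sorted(found)


def scan_directory(root, keywords):
    """Walk a directory tree and return {path: keywords found} for hits only."""
    hits = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            found = scan_file(path, keywords)
            if found:
                hits[path] = found
    return hits
```

Reading line by line keeps memory use low, at the cost of missing a keyword that is split across a line break.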
Step nine, improving scanning efficiency: SQLite is used to record the scanned files, including file name, size, last modification time and similar information. At the next scan, the system automatically compares files against these records and skips any file that has already been scanned. This greatly reduces scanning time and improves efficiency.
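The scan record can be sketched with Python's built-in `sqlite3` module. The table name `scanned`, its column layout and the helper names are assumptions; the source only states that file name, size and last modification time are stored.

```python
import os
import sqlite3


def open_cache(db_path=":memory:"):
    """Open (or create) the scan-record database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scanned ("
        "path TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER)"
    )
    return conn


def already_scanned(conn, path):
    """True when the file's size and modification time match the record."""
    st = os.stat(path)
    row = conn.execute(
        "SELECT size, mtime_ns FROM scanned WHERE path = ?", (path,)
    ).fetchone()
    return row is not None and row == (st.st_size, st.st_mtime_ns)


def mark_scanned(conn, path):
    """Insert or update the record after a file has been scanned."""
    st = os.stat(path)
    conn.execute(
        "INSERT OR REPLACE INTO scanned (path, size, mtime_ns) VALUES (?, ?, ?)",
        (path, st.st_size, st.st_mtime_ns),
    )
    conn.commit()
```

A scanner would call `already_scanned` before processing each file and `mark_scanned` afterwards, so a file is rescanned only when its size or modification time has changed.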
It should be understood that the above example is given only to illustrate the invention clearly and is not intended to limit its embodiments. Other variations and modifications will be apparent to those skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Obvious variations or modifications that fall within the spirit of the invention remain within its scope of protection.

Claims (4)

1. An identification management method for confidential files, characterized by comprising the following steps:
the first step, preprocessing: image data is first received and deskewed so that it is aligned in the horizontal and vertical directions; an algorithm then detects whether the image is a confidential file; finally the image is binarized to facilitate recognition;
three approaches can be used to identify images: 1. high-threshold adaptive binarization; 2. a convolutional neural network (CNN); 3. a Haar feature classifier;
the second step, text detection: two schemes can accomplish text detection: 1. detecting text by connected components; 2. detecting text by using a grid; the two are combined, applying the connected-component algorithm first and then the grid method to optimize the result;
the third step, optical character recognition: a convolutional neural network (CNN) is trained on the relevant fonts, and part of its output is compared to raise the recognition probability: fonts commonly used in confidential documents serve as manually labeled training samples; because many characters in confidential documents have equal width, a non-uniform image segmentation technique is used to obtain the approximate width of each character and give a rough classification, after which the convolutional neural network performs recognition;
the fourth step: extracting keywords from the photos and checking whether they are confidential files;
the fifth step, checking whether a file is confidential by means of an OCR template for confidential documents: a conventional OCR algorithm can find some of the files; however, image processing is a very complex process, so to improve the software's recognition rate, recognition templates are used alongside it, and the picture is then processed by template matching;
the top of a confidential document generally carries a confidentiality marking, so the template is aligned with an equally sized region of the original image and compared, then shifted by one pixel and compared again; after all positions have been compared, a matching-degree value is obtained, which can then be compared against a set threshold;
the sixth step, EXIF information as an aid: the geographic location where a picture was taken is obtained by pre-reading the EXIF information of the picture file; pictures produced during working hours and near the office area are analyzed more intensively, which further improves the accuracy of scan detection;
the seventh step, setting a suspicion coefficient: the likelihood that a picture is confidential is judged jointly from the picture's similarity and the geographic location where it was taken; highly suspicious pictures are directly quarantined and deleted and are uploaded to a back-end administrator; for the others, the user is reminded to check them personally;
the eighth step, document query: for documents stored on the mobile phone, the relevant components are called directly to read the content and a full-text search is performed, using sequential scanning to check each file from beginning to end for keywords;
the ninth step, improving scanning efficiency: SQLite is used to record the scanned files, including file name, size and last modification time; at the next scan, the system automatically compares files against these records and skips any file that has already been scanned.
2. The identification management method for confidential files according to claim 1, characterized in that: in the first step, high-threshold adaptive binarization is preferred.
3. The identification management method for confidential files according to claim 1, characterized in that: in the fourth step, whether the file is confidential is checked against predefined keywords including confidentiality, secrecy, internal matters, salary and planning.
4. The identification management method for confidential files according to claim 1, characterized in that: in the fifth step, when templates are actually set up, templates for confidential documents can be summarized according to the font and language format of the documents; the matching degree can then be obtained by the following algorithms:
[Formula images not reproduced in the source text]
CN201910528541.4A 2019-06-19 2019-06-19 Identification management method for confidential files Pending CN112115735A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910528541.4A CN112115735A (en) 2019-06-19 2019-06-19 Identification management method for confidential files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910528541.4A CN112115735A (en) 2019-06-19 2019-06-19 Identification management method for confidential files

Publications (1)

Publication Number Publication Date
CN112115735A true CN112115735A (en) 2020-12-22

Family

ID=73795135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910528541.4A Pending CN112115735A (en) 2019-06-19 2019-06-19 Identification management method for confidential files

Country Status (1)

Country Link
CN (1) CN112115735A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106326332A (en) * 2015-07-03 2017-01-11 柯尼卡美能达株式会社 Retrieval device, retrieval method
CN105468732A (en) * 2015-11-23 2016-04-06 中国科学院信息工程研究所 Image keyword inspecting method and device
CN107247915A (en) * 2016-08-02 2017-10-13 浙江远望信息股份有限公司 A kind of intelligent identification Method of sensitization picture file
JP2019008697A (en) * 2017-06-28 2019-01-17 コニカミノルタ株式会社 Electronic document creation apparatus, electronic document creation method, and electronic document creation program
CN109284756A (en) * 2018-08-01 2019-01-29 河海大学 A kind of terminal censorship method based on OCR technique
CN109902710A (en) * 2019-01-07 2019-06-18 南京热信软件科技有限公司 A kind of fast matching method and device of text image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20201222