CN112115735A - Identification management method for confidential files - Google Patents
- Publication number
- CN112115735A
- Authority
- CN
- China
- Prior art keywords
- document
- confidential
- file
- template
- confidential document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/413—Classification of content, e.g. text, photographs or tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/153—Segmentation of character regions using recognition of characters or words
Abstract
The invention relates to an identification management method for confidential files, comprising the following steps: first, preprocessing; second, text detection; third, optical character recognition; fourth, extracting keywords from photos and checking whether they show confidential files; fifth, checking whether a file is confidential by means of an OCR template for confidential documents; sixth, using EXIF information as an aid; seventh, setting a suspicion coefficient and reporting to a background administrator; eighth, querying documents; and ninth, improving scanning efficiency. To detect confidential documents, the invention not only uses existing OCR technology but also generates several sets of templates based on the characteristics of confidential documents, thereby improving the recognition rate and analysis speed for confidential documents.
Description
Technical Field
The invention relates to the field of file identification management, and in particular to an identification management method for confidential files.
Background
In the past, when confidential documents were managed on paper, each company had a strict management system that kept confidentiality work orderly. As technology developed and electronic documents became common, documents were typically stored on dedicated encrypted USB drives to keep them safe: a user had to enter a user name and password and could view the documents only after logging in, which largely prevented electronic documents from leaking.
However, as technology has continued to develop, confidentiality work is no longer limited to managing paper and electronic documents. The spread of high-resolution smartphones has created new problems for document confidentiality.
While a file is in circulation, anyone carrying a smartphone can easily photograph a computer display or a paper document and obtain a high-definition picture of its content. Leaks of internal files have already occurred in this way: pictures were taken with a mobile phone and posted to the internet, with harmful consequences.
In view of this, on the one hand the management of confidential documents should be further improved and employee education strengthened, and employees should be prohibited from storing confidential documents on their mobile phones in any form. On the other hand, emerging technology should be actively used to strengthen the monitoring of mobile phone photos and of documents in specified formats.
Disclosure of Invention
The invention aims to provide an identification management method for confidential files that offers a high recognition rate and good reliability.
The technical solution that achieves this aim is an identification management method for confidential files, comprising the following steps:
Step 1, preprocessing: image data is first received and deskewed so that it is aligned horizontally and vertically; an algorithm then detects whether the image shows a confidential file; finally, the image is binarized to make recognition easier;
three approaches can be used to identify images: 1. high-threshold adaptive binarization; 2. a convolutional neural network (CNN); 3. a Haar feature classifier;
Step 2, text detection: two schemes can accomplish text detection: 1. detecting text with connected components; 2. detecting text with a grid; the two are combined, first applying the connected-component algorithm and then refining the result with the grid method;
Step 3, optical character recognition: a convolutional neural network (CNN) is trained on the relevant fonts, and part of its output is used for comparison to raise the recognition probability: fonts commonly used in confidential documents serve as manually labeled training samples; because many characters in confidential documents are of equal width, a non-uniform picture segmentation technique is used to obtain the approximate width of each character and assign an approximate class, after which the convolutional neural network performs recognition;
Step 4: keywords are extracted from the photos to check whether they show confidential files;
Step 5, checking whether a file is confidential using an OCR template for confidential documents: a conventional OCR algorithm can find some of the files, but image processing is a complex process, so to improve the software's recognition rate a recognition template is used alongside it, and the picture is then processed by template matching;
a confidentiality marking generally appears at the top of a confidential document, so the template is aligned with a region of the same size in the original image and compared, then shifted by one pixel and compared again; once all positions have been compared, a matching score is obtained, which can then be compared against a set threshold;
Step 6, EXIF information as an aid: the geographic location where a picture was taken is obtained by reading the EXIF information of the picture file in advance; analysis is intensified for pictures taken during working hours and near the office area, which can further improve scan detection accuracy;
Step 7, setting a suspicion coefficient: the likelihood that a picture involves confidential content is judged jointly from the picture's similarity and the geographic location where it was taken; highly suspicious pictures are directly quarantined and deleted and reported to a background administrator; for the others, the user can be reminded to check them personally;
Step 8, document query: for documents stored on the mobile phone, the relevant components are called directly to read the content, and a full-text search is performed using sequential scanning, checking each file from beginning to end for keywords;
Step 9, improving scanning efficiency: scanned files are recorded in SQLite, including the file name, size and last modification time; on the next scan, the system automatically compares each file against these records and skips it if it has already been scanned.
Further, in Step 1, high-threshold adaptive binarization is preferred.
Further, in Step 4, whether a document is confidential is checked against predefined keywords, including confidentiality, secrecy, internal matters, salary and planning.
Further, in Step 5, when templates are actually set up, templates for confidential documents can be summarized from the fonts and language format of the documents; the matching score can then be computed by several algorithms.
The invention has the following positive effects: (1) To detect confidential documents, the invention not only uses existing OCR technology but also generates several sets of templates based on the characteristics of confidential documents, thereby improving the recognition rate and analysis speed for confidential documents.
(2) To strengthen the identification of confidential documents, the system adds a geographic location check. If a photo was taken at the workplace, detection is intensified.
(3) The invention optimizes the retrieval algorithm and improves retrieval efficiency.
(4) The invention establishes a grading system for suspicious files so that different reminders can be given. Files with significant suspicion are deleted and reported to the administrator.
Detailed Description
(Example 1)
The method for identifying and managing confidential documents in this embodiment uses existing image OCR and document scanning technology to scan and compare the documents and images on a mobile phone and to verify whether they contain keywords.
Documents and picture files stored on the mobile phone are scanned against predefined keywords such as confidentiality, secrecy, internal matters, salary and planning; the final result is reported back, and the user is prompted to deal with documents and pictures that may contain sensitive words.
The most critical part of this function is the scan detection of picture files. Based on an optimized and improved OCR algorithm, each pixel in the picture is analyzed, with composite features such as file format, font and text color used as auxiliary evidence, and a library of confidential-file templates is set up so that the text contained in the picture is obtained more accurately.
The method for identifying and managing confidential documents specifically comprises the following steps:
Step 1, preprocessing: image data is first received and deskewed so that it is aligned horizontally and vertically; an algorithm then detects whether the image shows a confidential file; finally, the image is binarized to make recognition easier.
Three schemes can be used to identify images: 1. High-threshold adaptive binarization. 2. A convolutional neural network (CNN). 3. A Haar feature classifier.
High-threshold adaptive binarization is preferred.
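As an illustration of adaptive binarization, the following minimal sketch marks a pixel as ink when it is clearly darker than the mean of its local neighbourhood. The function name, the window size and the `offset` margin are assumptions made for this example; the patent does not specify how its "high threshold" is chosen.

```python
def adaptive_binarize(gray, window=3, offset=10):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    A pixel becomes 1 (ink) when it is darker than the mean of its
    window x window neighbourhood by more than `offset`; otherwise 0.
    `window` and `offset` are illustrative values, not from the patent.
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Sum the neighbourhood, clipped at the image borders.
            total, count = 0, 0
            for ny in range(max(0, y - half), min(height, y + half + 1)):
                for nx in range(max(0, x - half), min(width, x + half + 1)):
                    total += gray[ny][nx]
                    count += 1
            local_mean = total / count
            out[y][x] = 1 if gray[y][x] < local_mean - offset else 0
    return out
```

Because the threshold follows the local mean, uneven lighting in a photographed page affects the result less than a single global threshold would.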
Step 2, text detection: two schemes can accomplish text detection. 1. Detecting text with connected components. 2. Detecting text with a grid.
Detection with connected components picks up a lot of noise text, so an additional threshold must be set to filter it out. Meaning is recovered mainly by combining the nearest characters into words. Once the text has been arranged into lines, height is used to judge whether pieces of text belong to the same line.
Detecting text with a grid avoids much of the noise text.
Text detection is accomplished by combining the two methods: the connected-component algorithm is applied first, and the result is then refined with the grid method.
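The connected-component part of this step can be sketched as follows. This is a simplified illustration, not the patent's implementation: it labels 4-connected ink regions in a binarized image, discards regions below an assumed `min_pixels` noise threshold, and groups the remaining bounding boxes into lines when their vertical extents overlap. All names and the threshold value are hypothetical, and the grid-based refinement is not shown.

```python
from collections import deque


def connected_components(binary, min_pixels=2):
    """Return bounding boxes (top, left, bottom, right) of 4-connected
    ink regions, dropping regions smaller than `min_pixels` as noise."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(pixels) >= min_pixels:
                ys = [p[0] for p in pixels]
                xs = [p[1] for p in pixels]
                boxes.append((min(ys), min(xs), max(ys), max(xs)))
    return boxes


def group_into_lines(boxes):
    """Group boxes whose vertical extents overlap into text lines,
    each line sorted left to right."""
    lines = []
    for box in sorted(boxes):
        for line in lines:
            if box[0] <= line["bottom"] and box[2] >= line["top"]:
                line["boxes"].append(box)
                line["top"] = min(line["top"], box[0])
                line["bottom"] = max(line["bottom"], box[2])
                break
        else:
            lines.append({"top": box[0], "bottom": box[2], "boxes": [box]})
    return [sorted(line["boxes"], key=lambda b: b[1]) for line in lines]
```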
Step 3, optical character recognition: a convolutional neural network (CNN) is trained on the relevant fonts, and part of its output is used for comparison to raise the recognition probability. Fonts commonly used in confidential documents serve as manually labeled training samples; based on the characteristics of confidential documents, a non-uniform picture segmentation technique is used to obtain the approximate width of each character and assign an approximate class, after which the convolutional neural network performs recognition. Combining the two improves the character recognition rate.
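The character-width estimate mentioned in this step could look roughly like the sketch below. It is an assumption-laden stand-in for the patent's unspecified segmentation technique: a binarized text line is cut wherever a column contains no ink, and the median span width is taken as the approximate character width. The function names are hypothetical, and the CNN recognition itself is not shown.

```python
from statistics import median


def split_characters(line):
    """Split a binarized text line (rows of 0/1) into (start, end)
    column spans, cutting wherever a column contains no ink."""
    width = len(line[0])
    ink = [any(row[x] for row in line) for x in range(width)]
    spans, start = [], None
    for x, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = x                      # a new character begins
        elif not has_ink and start is not None:
            spans.append((start, x - 1))   # the character ended at x - 1
            start = None
    if start is not None:
        spans.append((start, width - 1))   # character touching the right edge
    return spans


def typical_width(spans):
    """Median span width, used as the approximate character width."""
    return median(end - start + 1 for start, end in spans)
```

Spans much wider than the typical width would suggest touching characters that need a further cut before being passed to the recognizer.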
Step 4: keywords are extracted from the photos to check whether they show confidential files. Whether a document is confidential is checked against predefined keywords such as confidentiality, secrecy, internal matters, salary and planning.
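A minimal sketch of the keyword check on recognized text follows. The keyword list uses the English renderings given above; the original system would presumably match the Chinese terms, and the function name is an assumption.

```python
# English renderings of the predefined keywords; illustrative only.
KEYWORDS = ("confidentiality", "secrecy", "internal matters", "salary", "planning")


def find_keywords(text, keywords=KEYWORDS):
    """Return the predefined keywords that occur in the OCR text,
    ignoring letter case."""
    lowered = text.lower()
    return [kw for kw in keywords if kw in lowered]
```

A non-empty result marks the picture as a candidate confidential file for the later steps.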
Step 5, checking whether a file is confidential using an OCR template for confidential documents: a conventional OCR algorithm can indeed find some of the files. Image processing is a complex process, however, so to improve the software's recognition rate a recognition template is used alongside it, and the picture is then processed by template matching.
A confidentiality marking generally appears at the top of a confidential document, so the template is aligned with a region of the same size in the original image and compared, then shifted by one pixel and compared again. Once all positions have been compared, a matching score is obtained, which can then be compared against a set threshold.
When templates are actually set up, templates for confidential documents can be summarized from the fonts, language format and other features of the documents. Several algorithms are then available for computing the matching score.
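The sliding comparison described above can be sketched as follows. The patent does not name a particular matching algorithm, so this example assumes the simplest one: the score at each position is the fraction of pixels that agree between the template and the image window. The function names and the 0.9 threshold are hypothetical.

```python
def match_scores(image, template):
    """Slide `template` over `image` one pixel at a time and return
    {(y, x): fraction of agreeing pixels} for every position."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    scores = {}
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            same = sum(
                image[y + dy][x + dx] == template[dy][dx]
                for dy in range(th)
                for dx in range(tw)
            )
            scores[(y, x)] = same / (th * tw)
    return scores


def best_match(image, template, threshold=0.9):
    """Return ((y, x), score) for the best position, or None when the
    best score stays below the assumed threshold."""
    scores = match_scores(image, template)
    position = max(scores, key=scores.get)
    score = scores[position]
    return (position, score) if score >= threshold else None
```

Since the marking is said to sit at the top of the page, restricting the search to the upper rows of the image would reduce the number of positions to compare.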
Step 6, EXIF information as an aid: the geographic location where a picture was taken is obtained by reading the EXIF information of the picture file in advance. Analysis is intensified for pictures taken during working hours and near the office area, which can further improve scan detection accuracy.
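The check on shooting time and place might be sketched as below. The example assumes the EXIF values have already been read from the file: EXIF stores GPS coordinates as degrees, minutes and seconds with a hemisphere reference, and timestamps as "YYYY:MM:DD HH:MM:SS". The office coordinates, the 1 km radius, the 09:00 to 18:00 weekday working hours, and the choice to require both conditions are assumptions for illustration.

```python
import math
from datetime import datetime


def dms_to_degrees(dms, ref):
    """Convert (degrees, minutes, seconds) plus N/S/E/W to signed degrees."""
    degrees, minutes, seconds = dms
    value = degrees + minutes / 60 + seconds / 3600
    return -value if ref in ("S", "W") else value


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    radius = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def needs_closer_look(lat, lon, taken_at, office, radius_km=1.0, hours=(9, 18)):
    """True when the photo was taken on a weekday within working hours
    and within `radius_km` of the office (all limits are assumptions)."""
    shot = datetime.strptime(taken_at, "%Y:%m:%d %H:%M:%S")
    working = shot.weekday() < 5 and hours[0] <= shot.hour < hours[1]
    near = distance_km(lat, lon, office[0], office[1]) <= radius_km
    return working and near
```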
Step 7, setting a suspicion coefficient: the likelihood that a picture involves confidential content is judged jointly from the picture's similarity and the geographic location where it was taken. Highly suspicious pictures are directly quarantined and deleted and reported to a background administrator. For the others, the user can be reminded to check them personally.
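One way to combine these signals is sketched below. The patent does not give a formula, so the weights, the cap of three keywords, the 0.8 cut-off and the action names are all assumptions chosen only to make the idea concrete.

```python
def suspicion_coefficient(match_score, keyword_count, at_work):
    """Weighted combination of template similarity (0..1), number of
    keyword hits and the location check. Weights are illustrative."""
    keyword_part = min(keyword_count, 3) / 3   # saturate at three hits
    location_part = 1.0 if at_work else 0.0
    return 0.5 * match_score + 0.3 * keyword_part + 0.2 * location_part


def action_for(coefficient, high=0.8):
    """Map the coefficient to one of the two outcomes in the text:
    quarantine, delete and report, or remind the user to check."""
    return "quarantine_and_report" if coefficient >= high else "remind_user"
```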
Step 8, document query: for documents stored on the mobile phone, the relevant components are called directly to read the content, and a full-text search is then performed. Search methods generally fall into two kinds, sequential scanning and indexing. Because the files on a mobile phone are small and building an index takes a long time, sequential scanning is adopted: each file is checked from beginning to end for keywords.
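Sequential scanning can be illustrated with the short sketch below, which reads text line by line and records every keyword hit. It assumes the document content has already been extracted as plain text; the function names are hypothetical, and a keyword split across a line break would be missed by this simplified version.

```python
def scan_lines(lines, keywords):
    """Scan from the first line to the last and return a list of
    (line_number, keyword) hits, ignoring letter case."""
    wanted = [kw.lower() for kw in keywords]
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        for kw in wanted:
            if kw in lowered:
                hits.append((number, kw))
    return hits


def scan_file(path, keywords):
    """Apply the same scan to a UTF-8 text file on disk."""
    with open(path, encoding="utf-8", errors="ignore") as handle:
        return scan_lines(handle, keywords)
```

No index is built or stored, so the cost is one pass over each file per scan.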
Step 9, improving scanning efficiency: scanned files are recorded in SQLite, including the file name, size, last modification time and similar information. On the next scan, the system automatically compares each file against these records and skips it if it has already been scanned. This greatly reduces scanning time and improves efficiency.
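A minimal sketch of such a scan record, using Python's standard `sqlite3` module, is shown below. The table name, column names and helper functions are assumptions; the patent states only that the file name, size and last modification time are stored.

```python
import sqlite3


def open_cache(db_path=":memory:"):
    """Open (or create) the scan-record database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scanned ("
        "name TEXT PRIMARY KEY, size INTEGER, mtime REAL)"
    )
    return conn


def already_scanned(conn, name, size, mtime):
    """True when the file was recorded with the same size and mtime."""
    row = conn.execute(
        "SELECT size, mtime FROM scanned WHERE name = ?", (name,)
    ).fetchone()
    return row is not None and row == (size, mtime)


def mark_scanned(conn, name, size, mtime):
    """Insert or update the record after a file has been scanned."""
    conn.execute(
        "INSERT OR REPLACE INTO scanned (name, size, mtime) VALUES (?, ?, ?)",
        (name, size, mtime),
    )
    conn.commit()
```

A file whose size or modification time has changed no longer matches its record, so it is scanned again and the record is replaced.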
It should be understood that the above examples are given only to illustrate the invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Obvious variations or modifications that fall within the spirit of the invention remain within its scope of protection.
Claims (4)
1. An identification management method for confidential files, characterized by comprising the following steps:
Step 1, preprocessing: image data is first received and deskewed so that it is aligned horizontally and vertically; an algorithm then detects whether the image shows a confidential file; finally, the image is binarized to make recognition easier;
three approaches can be used to identify images: 1. high-threshold adaptive binarization; 2. a convolutional neural network (CNN); 3. a Haar feature classifier;
Step 2, text detection: two schemes can accomplish text detection: 1. detecting text with connected components; 2. detecting text with a grid; the two are combined, first applying the connected-component algorithm and then refining the result with the grid method;
Step 3, optical character recognition: a convolutional neural network (CNN) is trained on the relevant fonts, and part of its output is used for comparison to raise the recognition probability: fonts commonly used in confidential documents serve as manually labeled training samples; because many characters in confidential documents are of equal width, a non-uniform picture segmentation technique is used to obtain the approximate width of each character and assign an approximate class, after which the convolutional neural network performs recognition;
Step 4: keywords are extracted from the photos to check whether they show confidential files;
Step 5, checking whether a file is confidential using an OCR template for confidential documents: a conventional OCR algorithm can find some of the files, but image processing is a complex process, so to improve the software's recognition rate a recognition template is used alongside it, and the picture is then processed by template matching;
a confidentiality marking generally appears at the top of a confidential document, so the template is aligned with a region of the same size in the original image and compared, then shifted by one pixel and compared again; once all positions have been compared, a matching score is obtained, which can then be compared against a set threshold;
Step 6, EXIF information as an aid: the geographic location where a picture was taken is obtained by reading the EXIF information of the picture file in advance; analysis is intensified for pictures taken during working hours and near the office area, which can further improve scan detection accuracy;
Step 7, setting a suspicion coefficient: the likelihood that a picture involves confidential content is judged jointly from the picture's similarity and the geographic location where it was taken; highly suspicious pictures are directly quarantined and deleted and reported to a background administrator; for the others, the user can be reminded to check them personally;
Step 8, document query: for documents stored on the mobile phone, the relevant components are called directly to read the content, and a full-text search is performed using sequential scanning, checking each file from beginning to end for keywords;
Step 9, improving scanning efficiency: scanned files are recorded in SQLite, including the file name, size and last modification time; on the next scan, the system automatically compares each file against these records and skips it if it has already been scanned.
2. The identification management method for confidential files according to claim 1, characterized in that: in Step 1, high-threshold adaptive binarization is preferred.
3. The identification management method for confidential files according to claim 1, characterized in that: in Step 4, whether a file is confidential is checked against predefined keywords, including confidentiality, secrecy, internal matters, salary and planning.
4. The identification management method for confidential files according to claim 1, characterized in that: in Step 5, when templates are actually set up, templates for confidential documents can be summarized from the fonts and language format of the documents; the matching score can then be computed by several algorithms.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910528541.4A CN112115735A (en) | 2019-06-19 | 2019-06-19 | Identification management method for confidential files |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112115735A (en) | 2020-12-22 |
Family
ID=73795135
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910528541.4A (Pending), published as CN112115735A (en) | 2019-06-19 | 2019-06-19 | Identification management method for confidential files |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112115735A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105468732A (en) * | 2015-11-23 | 2016-04-06 | 中国科学院信息工程研究所 | Image keyword inspecting method and device |
CN106326332A (en) * | 2015-07-03 | 2017-01-11 | 柯尼卡美能达株式会社 | Retrieval device, retrieval method |
CN107247915A (en) * | 2016-08-02 | 2017-10-13 | 浙江远望信息股份有限公司 | A kind of intelligent identification Method of sensitization picture file |
JP2019008697A (en) * | 2017-06-28 | 2019-01-17 | コニカミノルタ株式会社 | Electronic document creation apparatus, electronic document creation method, and electronic document creation program |
CN109284756A (en) * | 2018-08-01 | 2019-01-29 | 河海大学 | A kind of terminal censorship method based on OCR technique |
CN109902710A (en) * | 2019-01-07 | 2019-06-18 | 南京热信软件科技有限公司 | A kind of fast matching method and device of text image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20201222 |