CN113420622A - Intelligent scanning, recognizing and filing system based on machine deep learning - Google Patents
- Publication number
- CN113420622A (application CN202110640604.2A)
- Authority
- CN
- China
- Prior art keywords
- module
- recognition
- file
- system based
- deep learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention provides an intelligent scanning, recognizing and filing system based on machine deep learning, which comprises the following modules: an optical character recognition module, a machine learning module, an intelligent file filing module and a full-text retrieval module. The system realizes intelligent filing of files and true full-text retrieval. By digitizing paper archives and applying OCR, archive information resources become available for full-text retrieval and network transmission, and users can conveniently retrieve, copy and quote them remotely. This deepens users' querying and use of archive content and widens the range of its use. Like books and other information sources, archives can then serve as a means of acquiring information, using information and learning in daily life, so that they serve the public in many ways.
Description
Technical Field
The invention relates to an intelligent scanning, recognizing and filing system based on machine deep learning.
Background
In the information age, archive digitization is becoming the focus of archival work for the foreseeable future, and the scanning of paper archives is actively under way in many places. However, the electronic archive produced by scanning is only a document in image form, not a text document in the true sense. That is, the computer stores the appearance of the file but not the text it contains. Through the computer, the user can only view the original appearance of the file and cannot quote or search its content, which causes great inconvenience to the future use of electronic archives.
To meet the needs of archive users, an electronic archive in true text form is required, which makes archive digitization more effective and more thorough. To this end, an intelligent scanning, recognizing and filing system based on machine deep learning is provided, which uses OCR technology to extract the characters from an image containing text and store them as a text file.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing an intelligent scanning, recognizing and filing system based on machine deep learning.
To this end, the technical scheme adopted by the invention is as follows. The intelligent scanning, recognizing and filing system based on machine deep learning comprises the following modules: an optical character recognition module, which cuts an image containing characters into units that can each be recognized independently, character by character, analyzes the morphological features of the character in each image unit by means of an algorithm, determines the standard code of the character in the computer by comparison with the data in a standard feature library, and outputs and stores the result in a text file in a general format; a machine learning module, which improves the recognition accuracy of the optical character recognition module; an intelligent file filing module, which performs word frequency statistics and content analysis on the full text of a file so as to automatically extract identifiers, and then creates corresponding folders for classified storage according to user-defined matching rules; and a full-text retrieval module, which performs word-by-word retrieval directly over the whole archive library.
The intelligent scanning, recognizing and filing system based on machine deep learning has the following advantages:
The system realizes intelligent filing of files and true full-text retrieval. By digitizing paper archives and applying OCR, archive information resources become available for full-text retrieval and network transmission, and users can conveniently retrieve, copy and quote them remotely. This deepens users' querying and use of archive content and widens the range of its use. Like books and other information sources, archives can then serve as a means of acquiring information, using information and learning in daily life, so that they serve the public in many ways.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail with reference to specific embodiments below.
In the following description, references to "one embodiment," "an embodiment," "one example," "an example," etc., indicate that the embodiment or example so described may include a particular feature, structure, characteristic, property, element, or limitation, but not every embodiment or example necessarily includes that feature, structure, characteristic, property, element, or limitation. Moreover, repeated use of the phrase "in accordance with an embodiment of the present application" may, but does not necessarily, refer to the same embodiment.
Certain features that are well known to those skilled in the art have been omitted from the following description for the sake of simplicity.
According to one embodiment of the application, an intelligent scanning, recognizing and filing system based on machine deep learning is provided, which comprises the following modules: an optical character recognition module, which cuts an image containing characters into units that can each be recognized independently, character by character, analyzes the morphological features of the character in each image unit by means of an algorithm, determines the standard code of the character in the computer by comparison with the data in a standard feature library, and outputs and stores the result in a text file in a general format; a machine learning module, which improves the recognition accuracy of the optical character recognition module; an intelligent file filing module, which performs word frequency statistics and content analysis on the full text of a file so as to automatically extract identifiers, and then creates corresponding folders for classified storage according to user-defined matching rules; and a full-text retrieval module, which performs word-by-word retrieval directly over the whole archive library.
According to one embodiment of the application, the workflow of the optical character recognition module of the intelligent scanning, recognizing and filing system based on machine deep learning comprises image input, image preprocessing, character feature extraction, comparison and recognition, manual correction, and output and storage of the recognition result.
According to one embodiment of the application, the intelligent scanning, recognizing and filing system based on machine deep learning further comprises a network transmission module, a remote retrieval module and a copy reference module.
According to one embodiment of the application, the identifiers of the intelligent scanning, recognizing and filing system based on machine deep learning comprise keywords and subject words.
According to one embodiment of the application, OCR software is preset in the optical character recognition module of the intelligent scanning, recognizing and filing system based on machine deep learning.
According to one embodiment of the application, the machine learning module of the intelligent scanning, recognizing and filing system based on machine deep learning applies the following machine learning algorithms: decision tree algorithm, random forest algorithm, logistic regression algorithm, naive Bayes algorithm, K nearest neighbor algorithm and neural network algorithm.
According to one embodiment of the application, the research content, the key problems to be solved and the technical route of the intelligent scanning, recognizing and filing system based on machine deep learning are explained as follows:
OCR stands for optical character recognition; in plain terms, it lets a computer read text. In principle, dedicated OCR software cuts an image containing characters into units that can each be recognized independently, character by character, and then applies various algorithms to analyze the morphological features of the character in each image unit. These features are compared with the data in a standard feature library to determine the standard code of the character in the computer, and the result is output and stored in a text file in a general format. The OCR workflow is: image input, image preprocessing, character feature extraction, comparison and recognition, manual correction, and finally output and storage of the recognition result.
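The segmentation and comparison steps described above can be illustrated with a toy sketch. It is not the patent's implementation: the bitmap format, the three-entry `FEATURE_LIBRARY`, and the function names `segment`, `distance` and `recognize` are all assumptions made for illustration, and raw pixels stand in for morphological features.

```python
from typing import Dict, List

Glyph = List[str]  # rows of '#' (ink) and '.' (background)

# Hypothetical "standard feature library": character code -> reference bitmap.
FEATURE_LIBRARY: Dict[str, Glyph] = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def segment(image: List[str]) -> List[Glyph]:
    """Cut a text-line image into character units at fully blank columns."""
    width = len(image[0])
    units: List[Glyph] = []
    start = None
    for x in range(width + 1):
        # The position just past the right edge counts as blank, closing the last unit.
        blank = x == width or all(row[x] == "." for row in image)
        if not blank and start is None:
            start = x
        elif blank and start is not None:
            units.append([row[start:x] for row in image])
            start = None
    return units


def distance(a: Glyph, b: Glyph) -> int:
    """Count differing pixels; glyphs of different size are treated as far apart."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return 10**9
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(image: List[str]) -> str:
    """Map each character unit to the code of its closest library entry."""
    return "".join(
        min(FEATURE_LIBRARY, key=lambda ch: distance(unit, FEATURE_LIBRARY[ch]))
        for unit in segment(image)
    )


# A hand-drawn line image containing the glyphs L, I and T.
IMAGE = [
    "#...###.###",
    "#....#...#.",
    "#....#...#.",
    "#....#...#.",
    "###.###..#.",
]
TEXT = recognize(IMAGE)
```

Real OCR software adds the preprocessing (binarization, deskewing) and manual correction stages that this sketch omits.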
OCR technology has strong advantages over traditional manual entry. First, OCR character recognition is much faster than manual entry. According to the internationally used typing speed rating standard, even a professional can enter only 150-240 words per minute. With OCR, even when the time spent on pre-processing and post-processing is included, the speed is still several times higher. Second, OCR character recognition is of much higher quality than manual entry.
However, owing to various factors, the recognition rate of OCR rarely reaches 100%, so machine learning and deep learning techniques are introduced to help the system improve its recognition accuracy.
Machine learning is a technique that uses algorithms to analyze data, learn from it continuously, and then make judgments and predictions about events in the world. Rather than writing software with a fixed instruction set to complete a particular task, researchers "train" the machine with large amounts of data and algorithms so that it learns how to perform the task. Machine learning is one path toward simulating, extending and expanding human intelligence, and is therefore a subset of artificial intelligence; it depends on large amounts of data, that is, its "intelligence" is fed by data. Ten commonly used machine learning algorithms are: decision tree, random forest, logistic regression, SVM, naive Bayes, K-nearest neighbor, K-means, AdaBoost, neural network and Markov models.
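As one concrete instance of the algorithms listed, the following is a minimal K-nearest-neighbor sketch that classifies a character from labeled examples instead of hand-written rules. The two features (ink density and width-to-height ratio), the sample values in `TRAIN`, and the name `knn_predict` are hypothetical; the source does not say which features or which algorithm the machine learning module actually uses.

```python
from collections import Counter
from math import dist
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (feature vector, character label)

# Hypothetical training data: (ink density, width/height ratio) -> character.
TRAIN: List[Sample] = [
    ((0.60, 1.00), "O"),
    ((0.55, 0.90), "O"),
    ((0.62, 1.10), "O"),
    ((0.20, 0.30), "l"),
    ((0.25, 0.35), "l"),
    ((0.18, 0.25), "l"),
]


def knn_predict(train: List[Sample], query: Sequence[float], k: int = 3) -> str:
    """Return the majority label among the k training samples closest to query."""
    nearest = sorted(train, key=lambda sample: dist(sample[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


ROUND_GLYPH = knn_predict(TRAIN, (0.58, 0.95))   # close to the "O" samples
NARROW_GLYPH = knn_predict(TRAIN, (0.22, 0.30))  # close to the "l" samples
```

Adding more labeled samples changes the predictions without changing the code, which is the sense in which the "intelligence" is fed by data.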
Deep learning is a machine learning technique that builds neural networks modeled on the way the human brain analyzes and learns, and interprets data by imitating the brain's mechanisms. Its essential characteristic is that it attempts to mimic the way information is passed between neurons in the brain. Its most prominent applications are in computer vision and natural language processing (NLP). Deep learning is closely tied to the neural networks of machine learning, which are its main algorithm and means; deep learning can be regarded as an improved form of neural network algorithm. Deep learning models include convolutional neural networks (CNNs) and deep belief networks (DBNs). The main idea is to simulate human neurons: each neuron receives information, processes it, and passes the result on to all the neurons adjacent to it.
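The idea of neurons that receive, process and pass on information can be sketched as a forward pass through a small fully connected network. This is an illustrative assumption, not the architecture used by the system: the weights in `NETWORK` are arbitrary and untrained, and the sigmoid activation and the names `layer_forward` and `forward` are chosen only for the example.

```python
from math import exp
from typing import List, Sequence, Tuple

# One layer: (one weight row per neuron, one bias per neuron).
Layer = Tuple[List[List[float]], List[float]]


def sigmoid(x: float) -> float:
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-x))


def layer_forward(inputs: Sequence[float], layer: Layer) -> List[float]:
    """Each neuron sums its weighted inputs, adds its bias and applies sigmoid."""
    weights, biases = layer
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs: Sequence[float], network: List[Layer]) -> List[float]:
    """Pass the signal layer by layer; every output feeds all neurons of the next layer."""
    signal = list(inputs)
    for layer in network:
        signal = layer_forward(signal, layer)
    return signal


# Arbitrary, untrained example weights: 2 inputs -> 3 hidden neurons -> 1 output.
NETWORK: List[Layer] = [
    ([[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]], [0.0, -0.5, 0.1]),
    ([[1.0, -1.0, 0.5]], [0.0]),
]
OUTPUT = forward([1.0, 0.0], NETWORK)
```

Training, which adjusts the weights from data, is deliberately left out of this sketch.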
According to one embodiment of the application, the beneficial effects and main functions of the intelligent scanning, recognizing and filing system based on machine deep learning in practical use are as follows:
1. Intelligent filing of files
Establishing file directories for classified storage is a basic task of archive digitization, but creating and entering them manually is time-consuming, laborious and error-prone. Once OCR has produced the full text of a file, word frequency statistics and content analysis can be performed on it to automatically extract identifiers such as keywords and subject words, and corresponding folders are then created according to user-defined matching rules for classified storage.
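A minimal sketch of this filing step, under stated assumptions: the text is English and tokenized with a regular expression (Chinese text would need a word segmenter instead), and the stop-word list, the `RULES` table and the function names are hypothetical placeholders for the user-defined matching rules.

```python
import re
from collections import Counter
from pathlib import Path
from typing import Dict, List

# Hypothetical stop words and user-defined matching rules (keyword -> folder).
STOP_WORDS = {"the", "and", "for", "with", "this", "that", "from"}
RULES: Dict[str, str] = {
    "contract": "Contracts",
    "invoice": "Finance",
    "payment": "Finance",
}


def extract_keywords(text: str, top_n: int = 3) -> List[str]:
    """Word frequency statistics: return the top_n most frequent content words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]


def choose_folder(keywords: List[str], default: str = "Unclassified") -> str:
    """Apply the matching rules in keyword order; fall back to a default folder."""
    for keyword in keywords:
        if keyword in RULES:
            return RULES[keyword]
    return default


def file_document(name: str, text: str, root: Path) -> Path:
    """Create the matching folder if needed and store the recognized text in it."""
    folder = root / choose_folder(extract_keywords(text))
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{name}.txt"
    target.write_text(text, encoding="utf-8")
    return target


SAMPLE = "Invoice 42: payment due. The invoice covers the invoice period."
KEYWORDS = extract_keywords(SAMPLE)
FOLDER = choose_folder(KEYWORDS)
```

Calling `file_document("A-042", SAMPLE, Path("archive"))` would then write the text under `archive/Finance/`.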
2. True full-text retrieval
Full-text retrieval in archival work actually covers two kinds of search. In the first, only the file directory database is searched, and the full text of the corresponding file is opened after the relevant entries are found. The second is true full-text retrieval, which searches the whole archive library directly, word by word over the full text of the archives. The recall of the latter is clearly much higher than that of the former, so users can find more of the information they need in a vast collection of archives, and archive information resources are developed and used in greater depth. True full-text retrieval naturally cannot be achieved without OCR technology, because word-by-word retrieval is possible only after the characters in a scanned image have been converted into text format.
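The difference in recall between the two kinds of search can be shown with a small inverted index. This is a sketch under assumptions: the archive contents, the identifiers `A-001` and `A-002`, and the functions `build_index` and `search` are invented for illustration, and an inverted index is only one common way to implement word-by-word retrieval.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into words."""
    return re.findall(r"\w+", text.lower())


def build_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Inverted index: each word maps to the set of documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical archive: directory entries (titles) versus OCR full text.
TITLES = {"A-001": "Committee minutes", "A-002": "Land lease contract"}
FULL_TEXT = {
    "A-001": "Minutes of the 1998 water supply committee",
    "A-002": "Land lease contract, including a clause on water supply rights",
}

DIRECTORY_HITS = search(build_index(TITLES), "water supply")
FULL_TEXT_HITS = search(build_index(FULL_TEXT), "water supply")
```

Here the directory search finds nothing, while the full-text search finds both files, because the query words appear only in the body text.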
3. Broadening the ways users can use archives
By digitizing paper archives and applying OCR, archive information resources become available for full-text retrieval and network transmission, and users can conveniently retrieve, copy and quote them remotely. This deepens users' querying and use of archive content and widens the range of its use. Like books and other information sources, archives can then serve as a means of acquiring information, using information and learning in daily life, so that they serve the public in many ways.
The above embodiments illustrate only some embodiments of the present invention. Although they are described specifically and in detail, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these all fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention is defined by the claims.
Claims (6)
1. An intelligent scanning, recognizing and filing system based on machine deep learning, characterized in that it comprises the following modules:
an optical character recognition module, which cuts an image containing characters into units that can each be recognized independently, character by character, analyzes the morphological features of the character in each image unit by means of an algorithm, determines the standard code of the character in the computer by comparison with the data in a standard feature library, and outputs and stores the result in a text file in a general format;
a machine learning module, which improves the recognition accuracy of the optical character recognition module;
an intelligent file filing module, which performs word frequency statistics and content analysis on the full text of a file so as to automatically extract identifiers, and then creates corresponding folders for classified storage according to user-defined matching rules;
and a full-text retrieval module, which performs word-by-word retrieval directly over the whole archive library.
2. The intelligent scanning, recognition and archiving system based on machine deep learning of claim 1, wherein: the work flow of the optical character recognition module comprises image input, image preprocessing, character feature extraction, comparison recognition, manual correction and output and storage of recognition results.
3. The intelligent scanning, recognition and archiving system based on machine deep learning of claim 1, wherein: the system also comprises a network transmission module, a remote retrieval module and a copy reference module.
4. The intelligent scanning, recognition and archiving system based on machine deep learning of claim 1, wherein: the identifiers comprise keywords and subject words.
5. The intelligent scanning, recognition and archiving system based on machine deep learning of claim 1, wherein: OCR software is preset in the optical character recognition module.
6. The intelligent scanning, recognition and archiving system based on machine deep learning of claim 1, wherein: the machine learning module applies the following machine learning algorithms: decision tree algorithm, random forest algorithm, logistic regression algorithm, naive Bayes algorithm, K nearest neighbor algorithm and neural network algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110640604.2A CN113420622A (en) | 2021-06-09 | 2021-06-09 | Intelligent scanning, recognizing and filing system based on machine deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113420622A (en) | 2021-09-21 |
Family
ID=77788032
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110640604.2A Pending CN113420622A (en) | 2021-06-09 | 2021-06-09 | Intelligent scanning, recognizing and filing system based on machine deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113420622A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20230119516A1 (en) * | 2021-10-20 | 2023-04-20 | International Business Machines Corporation | Providing text information without reading a file |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109034147A (en) * | 2018-09-11 | 2018-12-18 | 上海唯识律简信息科技有限公司 | Optical character identification optimization method and system based on deep learning and natural language |
CN110909086A (en) * | 2019-11-27 | 2020-03-24 | 珠海格力电器股份有限公司 | Mail archiving method, system, computer device and computer readable storage medium |
CN111666259A (en) * | 2020-06-06 | 2020-09-15 | 智同道合(苏州)信息技术服务有限公司 | Document management method, management system, readable storage medium, and electronic device |
CN112101367A (en) * | 2020-09-15 | 2020-12-18 | 杭州睿琪软件有限公司 | Text recognition method, image recognition and classification method and document recognition processing method |
Non-Patent Citations (2)
Title |
---|
石会鹏等: "空间业务档案数字化与全文检索***的研究", 《数字通信世界》 * |
许呈辰: "档案数字化过程中OCR技术的应用", 《档案管理》 * |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210921 |