CN113420622A

CN113420622A - Intelligent scanning, recognizing and filing system based on machine deep learning

Info

Publication number: CN113420622A
Application number: CN202110640604.2A
Authority: CN
Inventors: 邓永安
Original assignee: Sichuan Baichuan Four Dimensional Information Technology Co ltd
Current assignee: Sichuan Baichuan Four Dimensional Information Technology Co ltd
Priority date: 2021-06-09
Filing date: 2021-06-09
Publication date: 2021-09-21

Abstract

The invention provides an intelligent scanning, recognizing and filing system based on machine deep learning, comprising the following modules: an optical character recognition module, a machine learning module, an intelligent file filing module and a full-text retrieval module. The system realizes intelligent filing of files and true full-text retrieval. By digitizing paper archives and applying OCR recognition, archive information resources become available for full-text retrieval and network transmission, and users can conveniently retrieve, copy and quote them remotely. Users' querying and use of archive content is thereby deepened, and the range of use of archive content is widened. Like books and information materials, archives can also serve as a means of acquiring information, using information and furthering learning in daily life, so that archives can serve the public in many ways.

Description

Intelligent scanning, recognizing and filing system based on machine deep learning

Technical Field

The invention relates in particular to an intelligent scanning, recognizing and filing system based on machine deep learning.

Background

In the information age, archive digitization is becoming the focus of archival work for the foreseeable future, and the scanning of paper archives is actively under way in many places. However, the electronic archive produced by scanning is in fact only a document in image form, not a text document in the true sense. That is, the computer knows only what the document looks like, not the text it contains. Through the computer, the user can only view the original appearance of the archive and cannot quote or search its content, which causes great inconvenience for the future use of electronic archives.

To meet the needs of archive users, obtaining electronic archives in true text form makes archive digitization more effective and more thorough. To this end, an intelligent scanning, recognizing and filing system based on machine deep learning is provided, which uses OCR technology to extract the characters from an image containing characters and store them as a text file.

Disclosure of Invention

The invention aims to provide an intelligent scanning, recognizing and filing system based on machine deep learning in order to overcome the defects of the prior art.

To achieve this aim, the technical scheme adopted by the invention is as follows: the intelligent scanning, recognizing and filing system based on machine deep learning comprises the following modules: the optical character recognition module, which cuts an image containing characters into individually recognizable units, one per character, analyzes the morphological features of the character in each image unit using an algorithm, determines the standard code of the character in the computer by comparison against the data in a standard feature library, and outputs and stores the standard codes in a text file in a general format; the machine learning module, which improves the recognition accuracy of the optical character recognition module; the intelligent file filing module, which performs word frequency statistics and content analysis on the full text of a file so as to automatically extract identifiers, and then creates corresponding folders for classified storage according to user-defined matching rules; and the full-text retrieval module, which performs word-by-word retrieval directly on the whole archive library.

The intelligent scanning, recognizing and filing system based on machine deep learning has the following advantages:

The system realizes intelligent filing of files and true full-text retrieval. By digitizing paper archives and applying OCR recognition, archive information resources become available for full-text retrieval and network transmission, and users can conveniently retrieve, copy and quote them remotely. Users' querying and use of archive content is thereby deepened, and the range of use of archive content is widened. Like books and information materials, archives can also serve as a means of acquiring information, using information and furthering learning in daily life, so that archives can serve the public in many ways.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail with reference to specific embodiments below.

In the following description, references to "one embodiment," "an embodiment," "one example," "an example," etc., indicate that the embodiment or example so described may include a particular feature, structure, characteristic, property, element, or limitation, but every embodiment or example does not necessarily include the particular feature, structure, characteristic, property, element, or limitation. Moreover, repeated use of the phrase "in accordance with an embodiment of the present application" although it may possibly refer to the same embodiment, does not necessarily refer to the same embodiment.

Certain features that are well known to those skilled in the art have been omitted from the following description for the sake of simplicity.

According to one embodiment of the application, an intelligent scanning, recognizing and filing system based on machine deep learning is provided, which comprises the following modules: the optical character recognition module, which cuts an image containing characters into individually recognizable units, one per character, analyzes the morphological features of the character in each image unit using an algorithm, determines the standard code of the character in the computer by comparison against the data in a standard feature library, and outputs and stores the standard codes in a text file in a general format; the machine learning module, which improves the recognition accuracy of the optical character recognition module; the intelligent file filing module, which performs word frequency statistics and content analysis on the full text of a file so as to automatically extract identifiers, and then creates corresponding folders for classified storage according to user-defined matching rules; and the full-text retrieval module, which performs word-by-word retrieval directly on the whole archive library.

According to one embodiment of the application, the workflow of the optical character recognition module of the intelligent scanning, recognizing and filing system based on machine deep learning comprises image input, image preprocessing, character feature extraction, comparison and recognition, manual correction, and output and storage of the recognition result.

According to one embodiment of the application, the intelligent scanning, recognizing and filing system based on machine deep learning further comprises a network transmission module, a remote retrieval module and a copy reference module.

According to one embodiment of the application, the identifiers of the intelligent scanning, recognizing and filing system based on machine deep learning comprise keywords and subject words.

According to one embodiment of the application, OCR software is preset in the optical character recognition module of the intelligent scanning, recognizing and filing system based on machine deep learning.

According to one embodiment of the application, the machine learning module of the intelligent scanning, recognizing and filing system based on machine deep learning applies the following machine learning algorithms: decision tree algorithm, random forest algorithm, logistic regression algorithm, naive Bayes algorithm, K nearest neighbor algorithm and neural network algorithm.

According to one embodiment of the application, the research content, the key problems to be solved and the technical route of the intelligent scanning, recognizing and filing system based on machine deep learning are explained as follows:

OCR stands for optical character recognition. In plain terms, it lets the computer read text. The principle is as follows: dedicated OCR software cuts the image containing characters into individually recognizable units, one per character, and then applies various algorithms to analyze the morphological features of the character in each image unit. These features are compared against the data in a standard feature library to determine the standard code of the character in the computer, and the standard codes are output and stored in a text file in a general format. The OCR workflow is: image input, image preprocessing, character feature extraction, comparison and recognition, manual correction, and finally output and storage of the recognition result.
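The segmentation and comparison steps described above can be illustrated with a minimal sketch. It is not the OCR software of the embodiment: the tiny binary bitmaps, the three-entry `FEATURE_LIBRARY`, the blank-column segmentation and the pixel-difference (Hamming) comparison are all assumptions chosen only to keep the example self-contained.

```python
# Toy illustration only. Glyphs are tiny binary bitmaps, and the
# "standard feature library" is a hypothetical dict of reference bitmaps.
FEATURE_LIBRARY = {
    "I": ["111", "010", "010", "010", "111"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}

def segment_columns(image):
    """Cut a text-line bitmap into per-character units at blank columns."""
    width = len(image[0])
    units, start = [], None
    for x in range(width + 1):
        blank = x == width or all(row[x] == "0" for row in image)
        if not blank and start is None:
            start = x                      # a character begins here
        elif blank and start is not None:
            units.append([row[start:x] for row in image])
            start = None                   # the character has ended
    return units

def hamming(a, b):
    """Count differing pixels between two same-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def recognize(image):
    """Match each unit against the library and return the decoded text."""
    text = []
    for unit in segment_columns(image):
        best = min(FEATURE_LIBRARY,
                   key=lambda ch: hamming(unit, FEATURE_LIBRARY[ch]))
        text.append(best)
    return "".join(text)

# A scanned line reading "T I L"; the L carries one noisy pixel.
LINE_IMAGE = [
    "11101110100",
    "01000100110",
    "01000100100",
    "01000100100",
    "01001110111",
]

print(recognize(LINE_IMAGE))  # TIL
```

In this sketch the noisy pixel does not change the result, because the damaged unit is still closer to the reference "L" than to any other entry. Real scans need the preprocessing and manual correction stages that the workflow also lists.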

OCR technology has strong advantages over traditional manual entry. First, OCR character recognition is much faster than manual entry. According to internationally used typing speed rating standards, even a professional can enter only 150-240 words per minute. With OCR technology, even when the time spent on the preceding and following processing stages is counted, the speed is still several times that of manual entry. Second, OCR character recognition is of much higher quality than manual entry.

However, owing to various factors, the recognition rate of OCR technology can hardly reach 100%, so machine learning and deep learning techniques are introduced to help the system improve recognition accuracy.

Machine learning is a technique that uses algorithms to analyze data, learn from it continuously, and make judgments and predictions about events in the world. Instead of hand-writing software with a specific instruction set so that a program completes a specific task, researchers "train" the machine with large amounts of data and algorithms, letting the machine learn how to perform the task. Machine learning is one path to "simulating, extending and expanding human intelligence" and is therefore a subset of artificial intelligence; it is built on large amounts of data, that is, its "intelligence" is fed by large amounts of data. Ten commonly used machine learning algorithms are: decision tree, random forest, logistic regression, SVM, naive Bayes, K nearest neighbor, K-means, AdaBoost, neural network and Markov models.
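As an illustration of learning from labeled data rather than from hand-written rules, the following minimal sketch implements the K nearest neighbor algorithm named in the list above. The two-dimensional feature vectors, their labels and the choice of k = 3 are hypothetical; the source does not specify which features or parameters the machine learning module uses.

```python
import math
from collections import Counter

def knn_predict(samples, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled samples."""
    nearest = sorted(samples, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (feature vector, character label).
# The two features could be, for example, stroke density and aspect ratio.
TRAINING = [
    ((0.20, 0.90), "l"), ((0.25, 0.85), "l"), ((0.22, 0.95), "l"),
    ((0.60, 0.50), "o"), ((0.65, 0.55), "o"), ((0.58, 0.45), "o"),
]

print(knn_predict(TRAINING, (0.23, 0.88)))  # l
```

Adding more corrected samples to `TRAINING`, for instance from the manual correction stage, changes later predictions without any change to the code, which is the sense in which the "intelligence" is fed by data.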

Deep learning is a machine learning technique that builds neural networks simulating the way the human brain analyzes and learns, and interprets data by imitating the mechanisms of the human brain. Its essential characteristic is that it attempts to imitate the pattern by which information is transferred and processed between neurons in the brain. Its most prominent applications are in computer vision and natural language processing (NLP). Deep learning is strongly related to the neural network in machine learning, and the neural network is its main algorithm and means; deep learning can be regarded as an improved version of the neural network algorithm. Deep learning models include convolutional neural networks (CNNs) and deep belief networks (DBNs). The main idea is to simulate human neurons: each neuron receives information, processes it, and then transmits it to all the neurons adjacent to it.
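The idea that each neuron receives information, processes it and passes it on can be sketched as the forward pass of a small fully connected network. This is a generic illustration, not the network of the embodiment: the sigmoid activation, the layer sizes and the weight values are assumptions, and no training step is shown.

```python
import math

def sigmoid(x):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """Each neuron sums its weighted inputs, adds a bias and applies sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
        for neuron_w, b in zip(weights, biases)
    ]

def network_forward(inputs, layers):
    """Pass the signal through each layer in turn, output feeding the next."""
    signal = inputs
    for weights, biases in layers:
        signal = layer_forward(signal, weights, biases)
    return signal

# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
LAYERS = [
    ([[0.5, -0.4], [0.3, 0.8]], [0.0, -0.1]),
    ([[1.2, -0.7]], [0.05]),
]

output = network_forward([1.0, 0.5], LAYERS)
```

Every hidden neuron here feeds every neuron of the next layer, which corresponds to the description of a neuron transmitting its processed information to all adjacent neurons.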

According to one embodiment of the application, the beneficial effects and main functions of the intelligent scanning, recognizing and filing system based on machine deep learning in practical use are as follows:

1. Intelligent filing of files

Establishing file directories for classified storage is a basic part of archive digitization, yet creating and entering them manually is time-consuming, laborious and error-prone. Using the text obtained by OCR, word frequency statistics and content analysis can be performed on the full text of a file so as to automatically extract identifiers such as keywords and subject words, and corresponding folders are then created according to user-defined matching rules for classified storage.
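The filing step described above can be sketched as follows. The stop-word list, the `MATCHING_RULES` table and the folder names are hypothetical, and the regular-expression tokenizer assumes space-separated English text; Chinese archive text would need a word segmenter in its place.

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical stop-word list and matching rules, for illustration only.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "for", "on", "is"}
MATCHING_RULES = {
    "contract": "Contracts",
    "invoice": "Finance",
    "meeting": "Minutes",
}

def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stop-words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

def choose_folder(keywords, default="Unclassified"):
    """Pick the folder of the first keyword that matches a rule."""
    for word in keywords:
        if word in MATCHING_RULES:
            return MATCHING_RULES[word]
    return default

def file_document(root, name, text):
    """Create the matching folder under root and store the text there."""
    folder = Path(root) / choose_folder(extract_keywords(text))
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / name
    target.write_text(text, encoding="utf-8")
    return target
```

Because keywords are ordered by frequency, the most frequent matching keyword decides the folder; a document with no matching keyword falls into the default folder instead of being left unfiled.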

2. Realizing true full-text retrieval

Full-text retrieval in archival work actually covers two types. In the first, only the file directory database is searched, and the full text of the corresponding file is opened after the relevant entry is found. The second is true full-text retrieval, in which the whole archive library is searched directly and the full text of the archives is retrieved word by word. Clearly, the recall ratio of the latter is much higher than that of the former: users can find more of the information they need in a vast collection of archives, and archive information resources are developed and used in greater depth. OCR technology is indispensable for true full-text retrieval, because word-by-word retrieval is possible only after the characters in a scanned image have been converted into text format.
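The difference in recall between the two retrieval types can be illustrated with a small inverted index. This is a sketch under stated assumptions: the two sample records, their identifiers and the AND-style query semantics are invented for illustration, and the source does not say how the full-text retrieval module is implemented.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def build_index(documents):
    """Map every word to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical archive: directory titles versus OCR full text.
TITLES = {"A-001": "Annual report 1998", "A-002": "Board minutes March"}
FULL_TEXT = {
    "A-001": "Annual report 1998. Flood damage to the warehouse was repaired.",
    "A-002": "Board minutes. The board approved the warehouse flood insurance.",
}

directory_hits = search(build_index(TITLES), "flood")     # finds nothing
fulltext_hits = search(build_index(FULL_TEXT), "flood")   # finds both records
```

A query for "flood" misses both records when only the directory titles are indexed, but finds both once the OCR text is indexed, which is the recall gap described above.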

3. Broadening the range of archive utilization by users

By digitizing paper archives and applying OCR recognition, archive information resources become available for full-text retrieval and network transmission, and users can conveniently retrieve, copy and quote them remotely. Users' querying and use of archive content is thereby deepened, and the range of use of archive content is widened. Like books and information materials, archives can also serve as a means of acquiring information, using information and furthering learning in daily life, so that archives can serve the public in many ways.

The embodiments above illustrate only some implementations of the present invention. Although they are described specifically and in detail, they should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the claims.

Claims

1. An intelligent scanning, recognizing and filing system based on machine deep learning, characterized by comprising the following modules:

the optical character recognition module is used for cutting the image containing the characters into units which can be recognized independently according to the characters, analyzing the morphological characteristics of the characters in each image unit by using an algorithm, judging the standard codes of the characters in a computer by comparing the data in a standard characteristic library, and outputting and storing the standard codes in a text file according to a general format;

the machine learning module is used for improving the recognition accuracy of the optical character recognition module;

the file intelligent filing module is used for carrying out word frequency statistics and content analysis on the full file text so as to automatically extract identifiers, and then creating corresponding folders for classified storage through a self-defined matching rule;

and the full-text retrieval module is used for directly carrying out word-by-word retrieval operation on the whole archive library.

2. The intelligent scanning, recognition and archiving system based on machine deep learning of claim 1, wherein: the work flow of the optical character recognition module comprises image input, image preprocessing, character feature extraction, comparison recognition, manual correction and output and storage of recognition results.

3. The intelligent scanning, recognition and archiving system based on machine deep learning of claim 1, wherein: the system also comprises a network transmission module, a remote retrieval module and a copy reference module.

4. The intelligent scanning, recognition and archiving system based on machine deep learning of claim 1, wherein: the identifier comprises a keyword and a subject word.

5. The intelligent scanning, recognition and archiving system based on machine deep learning of claim 1, wherein: OCR software is preset in the optical character recognition module.

6. The intelligent scanning, recognition and archiving system based on machine deep learning of claim 1, wherein: the machine learning module applies the following machine learning algorithms: decision tree algorithm, random forest algorithm, logistic regression algorithm, naive Bayes algorithm, K nearest neighbor algorithm and neural network algorithm.