CN113420622A - Intelligent scanning, recognizing and filing system based on machine deep learning - Google Patents

Intelligent scanning, recognizing and filing system based on machine deep learning

Info

Publication number
CN113420622A
Authority
CN
China
Prior art keywords
module
recognition
file
system based
deep learning
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110640604.2A
Other languages
Chinese (zh)
Inventor
邓永安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Baichuan Four Dimensional Information Technology Co ltd
Original Assignee
Sichuan Baichuan Four Dimensional Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-06-09
Filing date
2021-06-09
Publication date
2021-09-21

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Character Discrimination (AREA)

Abstract

The invention provides an intelligent scanning, recognizing and filing system based on machine deep learning, comprising the following modules: an optical character recognition module, a machine learning module, an intelligent file filing module and a full-text retrieval module. The system enables intelligent filing of files and true full-text retrieval. By digitizing paper archives and applying OCR, archive information resources become available for full-text retrieval and network transmission, and users can conveniently retrieve, copy and quote them remotely. This deepens users' querying and use of archive content and widens the range of its use. Like books and other information sources, archives can then serve as a means of acquiring information, using information and furthering learning in daily life, so that they serve the public in many ways.

Description

Intelligent scanning, recognizing and filing system based on machine deep learning
Technical Field
The invention relates in particular to an intelligent scanning, recognizing and filing system based on machine deep learning.
Background
In the information age, archive digitization will be the focus of archival work for some time to come, and the scanning of paper archives is being actively carried out in many places. However, the electronic archive produced by scanning is really only a document in image form, not a text document in the true sense. That is, the computer knows only what the file looks like, not the text it contains. The user can view the original appearance of the file on the computer but cannot quote or search its content, which greatly hinders the future use of electronic archives.
To meet the needs of archive users, obtain electronic archives in true text form, and make archive digitization more effective and more thorough, an intelligent scanning, recognizing and filing system based on machine deep learning is proposed, in which OCR technology extracts the characters from an image containing text and stores them as a text file.
Disclosure of Invention
The invention aims to provide an intelligent scanning, recognizing and filing system based on machine deep learning that overcomes the defects of the prior art.
To meet these requirements, the technical scheme adopted by the invention is as follows: the intelligent scanning, recognizing and filing system based on machine deep learning comprises the following modules: an optical character recognition module, which cuts an image containing characters into units that can each be recognized independently, character by character, analyzes the morphological features of the character in each image unit using an algorithm, determines the character's standard code in the computer by comparison with the data in a standard feature library, and outputs and stores the codes in a text file in a common format; a machine learning module, which improves the recognition accuracy of the optical character recognition module; an intelligent file filing module, which performs word frequency statistics and content analysis on the full text of each file to automatically extract identifiers, and then creates corresponding folders for classified storage through user-defined matching rules; and a full-text retrieval module, which performs word-by-word retrieval directly on the entire archive library.
The intelligent scanning, recognizing and filing system based on machine deep learning has the following advantages:
The system enables intelligent filing of files and true full-text retrieval. By digitizing paper archives and applying OCR, archive information resources become available for full-text retrieval and network transmission, and users can conveniently retrieve, copy and quote them remotely. This deepens users' querying and use of archive content and widens the range of its use. Like books and other information sources, archives can then serve as a means of acquiring information, using information and furthering learning in daily life, so that they serve the public in many ways.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail with reference to specific embodiments below.
In the following description, references to "one embodiment," "an embodiment," "one example," "an example," etc., indicate that the embodiment or example so described may include a particular feature, structure, characteristic, property, element, or limitation, but every embodiment or example does not necessarily include the particular feature, structure, characteristic, property, element, or limitation. Moreover, repeated use of the phrase "in accordance with an embodiment of the present application" although it may possibly refer to the same embodiment, does not necessarily refer to the same embodiment.
Certain features that are well known to those skilled in the art have been omitted from the following description for the sake of simplicity.
According to one embodiment of the application, an intelligent scanning, recognizing and filing system based on machine deep learning is provided, comprising the following modules: an optical character recognition module, which cuts an image containing characters into units that can each be recognized independently, character by character, analyzes the morphological features of the character in each image unit using an algorithm, determines the character's standard code in the computer by comparison with the data in a standard feature library, and outputs and stores the codes in a text file in a common format; a machine learning module, which improves the recognition accuracy of the optical character recognition module; an intelligent file filing module, which performs word frequency statistics and content analysis on the full text of each file to automatically extract identifiers, and then creates corresponding folders for classified storage through user-defined matching rules; and a full-text retrieval module, which performs word-by-word retrieval directly on the entire archive library.
According to one embodiment of the application, the workflow of the optical character recognition module of the intelligent scanning, recognizing and filing system based on machine deep learning comprises image input, image preprocessing, character feature extraction, comparison and recognition, manual correction, and output and storage of the recognition result.
According to one embodiment of the application, the intelligent scanning, recognizing and filing system based on machine deep learning further comprises a network transmission module, a remote retrieval module and a copy and reference module.
According to one embodiment of the application, the identifiers of the intelligent scanning, recognizing and filing system based on machine deep learning comprise keywords and subject words.
According to one embodiment of the application, OCR software is preset in the optical character recognition module of the intelligent scanning, recognizing and filing system based on machine deep learning.
According to one embodiment of the application, the machine learning module of the intelligent scanning, recognizing and filing system based on machine deep learning applies the following machine learning algorithms: decision tree, random forest, logistic regression, naive Bayes, K-nearest neighbor and neural network.
According to one embodiment of the application, the research content, the key problems to be solved and the technical route of the intelligent scanning, recognizing and filing system based on machine deep learning are explained as follows:
OCR stands for Optical Character Recognition; in plain terms, it lets a computer read text. The principle is that dedicated OCR software cuts an image containing text into units that can each be recognized independently, character by character, and then applies various algorithms to analyze the morphological features of the character in each image unit. These features are compared with the data in a standard feature library to determine the character's standard code in the computer, and the codes are output and stored in a text file in a common format. The OCR workflow is: image input, image preprocessing, character feature extraction, comparison and recognition, manual correction, and finally output and storage of the recognition result.
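The segmentation and comparison steps can be illustrated with a minimal sketch. It assumes tiny 3x3 binary glyphs, fixed-width character cells, and a three-character feature library; all of these, and every name in the code, are hypothetical simplifications and are not taken from the patent. Production OCR uses far richer preprocessing and features.

```python
from typing import Dict, List

Glyph = List[str]  # each row is a string of '#' (ink) and '.' (blank)

# Hypothetical "standard feature library": 3x3 bitmaps keyed by the character they encode.
FEATURE_LIBRARY: Dict[str, Glyph] = {
    "I": [".#.", ".#.", ".#."],
    "L": ["#..", "#..", "###"],
    "T": ["###", ".#.", ".#."],
}


def segment(line: List[str], width: int = 3) -> List[Glyph]:
    """Cut a text-line image into fixed-width character cells."""
    cells = len(line[0]) // width
    return [[row[i * width:(i + 1) * width] for row in line] for i in range(cells)]


def distance(a: Glyph, b: Glyph) -> int:
    """Count the pixels at which two glyphs differ (Hamming distance)."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(line: List[str]) -> str:
    """Map each segmented cell to the closest character in the library."""
    return "".join(
        min(FEATURE_LIBRARY, key=lambda ch: distance(cell, FEATURE_LIBRARY[ch]))
        for cell in segment(line)
    )


# A three-character line image, laid out as three 3x3 cells side by side.
scanned_line = [
    "####...#.",
    ".#.#...#.",
    ".#.###.#.",
]
recognized_text = recognize(scanned_line)
```

Because the nearest library entry is chosen rather than an exact match, a cell with a few flipped pixels is still mapped to a character, which is also why misrecognition can occur and why the workflow includes a manual correction step.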
OCR technology has strong advantages over traditional manual entry. First, OCR character recognition is much faster than manual entry. According to the internationally used typing-speed rating standard, even a professional can enter only 150-240 words per minute. With OCR, even counting the time spent on pre-processing and post-processing, the speed is many times higher. Second, OCR character recognition is of much higher quality than manual entry.
However, owing to various factors, the recognition rate of OCR technology rarely reaches 100%, so machine learning and deep learning techniques are introduced to help the system improve recognition accuracy.
Machine learning is a technique that uses algorithms to analyze data, learn from it continuously, and make judgments and predictions about real-world events. Rather than hand-writing software with a fixed instruction set for one specific task, researchers "train" the machine with large amounts of data and algorithms so that it learns how to perform the task. Machine learning is one path toward "simulating, extending and expanding human intelligence" and is therefore a subset of artificial intelligence; it depends on large amounts of data, that is, its "intelligence" is fed by data. Ten commonly used machine learning algorithms are: decision tree, random forest, logistic regression, SVM, naive Bayes, K-nearest neighbor, K-means, AdaBoost, neural network and Markov models.
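As one illustration of "training with data" rather than hand-written rules, the following sketch implements K-nearest neighbor, one of the algorithms listed above. The two features (ink density and width-to-height ratio) and the labelled samples are invented for the example; the patent does not say which features or training data its machine learning module uses.

```python
from collections import Counter
from math import dist
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (feature vector, character label)


def knn_predict(training: List[Sample], query: Sequence[float], k: int = 3) -> str:
    """Label the query with the majority label among its k nearest training samples."""
    nearest = sorted(training, key=lambda sample: dist(sample[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled samples: (ink density, width/height ratio) -> character.
TRAINING: List[Sample] = [
    ((0.10, 0.20), "l"),
    ((0.12, 0.25), "l"),
    ((0.15, 0.22), "l"),
    ((0.60, 0.95), "o"),
    ((0.65, 0.90), "o"),
    ((0.58, 1.00), "o"),
]

thin_glyph = knn_predict(TRAINING, (0.11, 0.21))
round_glyph = knn_predict(TRAINING, (0.62, 0.93))
```

No rule for telling "l" from "o" is written anywhere in the code; the decision comes entirely from the labelled samples, so adding corrected examples changes future predictions.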
Deep learning is a machine learning technique that builds neural networks modeled on the way the human brain analyzes and learns, and interprets data by imitating the brain's mechanisms. Its essential feature is that it tries to mimic the way information is passed between neurons in the brain. Its most prominent applications are in computer vision and natural language processing (NLP). Deep learning is closely tied to the "neural network" of machine learning, which is its main algorithm and means; deep learning can be regarded as an "improved neural network" algorithm. Common deep learning architectures include convolutional neural networks (CNNs) and deep belief networks (DBNs). The main idea is to simulate human neurons: each neuron receives information, processes it, and passes the result on to all adjacent neurons.
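The "receive, process, pass on" behaviour of a neuron can be sketched as a forward pass through a small fully connected network. This is a generic textbook illustration with arbitrary weights, assuming sigmoid activations; it is not the network architecture of the patent, which the text does not specify.

```python
from math import exp
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights per neuron, bias per neuron)


def sigmoid(x: float) -> float:
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-x))


def layer_forward(inputs: List[float], weights: List[List[float]], biases: List[float]) -> List[float]:
    """Each neuron sums its weighted inputs, adds a bias, and applies the activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def network_forward(inputs: List[float], layers: List[Layer]) -> List[float]:
    """Pass the activations of each layer on to every neuron of the next layer."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Arbitrary illustrative weights: 2 inputs -> 3 hidden neurons -> 1 output neuron.
LAYERS: List[Layer] = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.7, -0.5, 0.2]], [0.05]),
]
output = network_forward([1.0, 2.0], LAYERS)
```

Training, which adjusts the weights from labelled data, is omitted here; the sketch covers only how information flows from one layer of neurons to the next.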
According to one embodiment of the application, the beneficial effects and main functions of the intelligent scanning, identifying and filing system based on the machine deep learning in practical use are as follows:
1. Intelligent filing of files
Creating file directories for classified storage is basic archive-digitization work; doing it by hand is time-consuming, laborious and error-prone. Once OCR has produced the full text of a file, word frequency statistics and content analysis can be run on it to automatically extract identifiers such as keywords and subject words, and the corresponding folders are then created through user-defined matching rules for classified storage.
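The filing step described above can be sketched as follows. The stop-word list, the matching rules, the folder names and the English tokenizer are all assumptions made for illustration; the patent does not define its rule format, and Chinese text would need a word segmenter in place of the regular expression used here.

```python
import re
from collections import Counter
from pathlib import Path
from typing import Dict, List, Set

STOP_WORDS: Set[str] = {"the", "of", "and", "a", "to", "in", "for", "on", "is"}

# Hypothetical user-defined matching rules: folder name -> keywords that route a file there.
MATCHING_RULES: Dict[str, Set[str]] = {
    "contracts": {"contract", "agreement", "party"},
    "personnel": {"employee", "salary", "appointment"},
}


def extract_keywords(text: str, top_n: int = 5) -> List[str]:
    """Word frequency statistics: return the most frequent non-stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]


def choose_folder(keywords: List[str], default: str = "unclassified") -> str:
    """Pick the folder whose rule shares the most keywords with the document."""
    best, best_hits = default, 0
    for folder, rule_words in MATCHING_RULES.items():
        hits = len(rule_words.intersection(keywords))
        if hits > best_hits:
            best, best_hits = folder, hits
    return best


def file_document(root: Path, name: str, text: str) -> Path:
    """Create the matching folder if needed and store the recognized text in it."""
    folder = root / choose_folder(extract_keywords(text))
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / name
    target.write_text(text, encoding="utf-8")
    return target


sample_text = "The contract binds each party. The contract is final."
sample_keywords = extract_keywords(sample_text)
sample_folder = choose_folder(sample_keywords)
```

Calling `file_document(Path("archive"), "doc-001.txt", sample_text)` would write the text under `archive/contracts/`; documents that match no rule fall into the `unclassified` folder instead of being dropped.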
2. Realizing true full-text retrieval
So-called full-text retrieval in archival work actually comes in two kinds. In the first, only the file directory database is searched, and the full text of a file is opened after the relevant entry is found. The second is true full-text retrieval: the entire archive library is searched directly, word by word, across the full text of every file. The recall of the latter is clearly much higher than that of the former. Users can find more of the information they need in a vast collection of archives, and archive information resources are developed and used in greater depth. True full-text retrieval is impossible without OCR, because the characters in a scanned image can be searched word by word only after they have been converted into text.
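The difference in recall between the two kinds of retrieval can be shown with a small inverted index, a common data structure for word-level search that is used here as an assumption; the patent does not name an indexing method. The archive contents, catalog titles and identifiers below are invented for the example.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Split text into lowercase words."""
    return re.findall(r"\w+", text.lower())


def build_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Inverted index: map every word to the set of documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return the documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


# Hypothetical catalog entries (directory search) and OCR full texts (true full-text search).
CATALOG = {
    "A-001": "Meeting minutes",
    "A-002": "Budget report",
    "A-003": "Appointment notice",
}
FULL_TEXT = {
    "A-001": "Minutes of the 1998 flood relief meeting",
    "A-002": "Budget report mentioning flood damage to the granary",
    "A-003": "Staff appointment notice",
}

catalog_hits = search(build_index(CATALOG), "flood")
full_text_hits = search(build_index(FULL_TEXT), "flood")
```

In this example the catalog search finds nothing for "flood", while the full-text search returns both files that mention it, which is the recall gap the paragraph above describes.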
3. Broadening the scope of archive use
By digitizing paper archives and applying OCR, archive information resources become available for full-text retrieval and network transmission, and users can conveniently retrieve, copy and quote them remotely. This deepens users' querying and use of archive content and widens the range of its use. Like books and other information sources, archives can then serve as a means of acquiring information, using information and furthering learning in daily life, so that they serve the public in many ways.
The above embodiments illustrate only some embodiments of the present invention; although they are described specifically and in detail, they should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present invention is defined by the claims.

Claims (6)

1. An intelligent scanning, recognition and archiving system based on machine deep learning, characterized by comprising the following modules:
the optical character recognition module is used for cutting the image containing the characters into units which can be recognized independently according to the characters, analyzing the morphological characteristics of the characters in each image unit by using an algorithm, judging the standard codes of the characters in a computer by comparing the data in a standard characteristic library, and outputting and storing the standard codes in a text file according to a general format;
the machine learning module is used for improving the recognition accuracy of the optical character recognition module;
the file intelligent filing module is used for carrying out word frequency statistics and content analysis on the full file text so as to automatically extract identifiers, and then creating corresponding folders for classified storage through a self-defined matching rule;
and the full-text retrieval module is used for directly carrying out word-by-word retrieval operation on the whole archive library.
2. The intelligent scanning, recognition and archiving system based on machine deep learning of claim 1, wherein: the work flow of the optical character recognition module comprises image input, image preprocessing, character feature extraction, comparison recognition, manual correction and output and storage of recognition results.
3. The intelligent scanning, recognition and archiving system based on machine deep learning of claim 1, wherein: the system also comprises a network transmission module, a remote retrieval module and a copy reference module.
4. The intelligent scanning, recognition and archiving system based on machine deep learning of claim 1, wherein: the identifier comprises a keyword and a subject word.
5. The intelligent scanning, recognition and archiving system based on machine deep learning of claim 1, wherein: OCR software is preset in the optical character recognition module.
6. The intelligent scanning, recognition and archiving system based on machine deep learning of claim 1, wherein: the machine learning module applies the following machine learning algorithms: decision tree algorithm, random forest algorithm, logistic regression algorithm, naive Bayes algorithm, K nearest neighbor algorithm and neural network algorithm.
CN202110640604.2A 2021-06-09 2021-06-09 Intelligent scanning, recognizing and filing system based on machine deep learning Pending CN113420622A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110640604.2A CN113420622A (en) 2021-06-09 2021-06-09 Intelligent scanning, recognizing and filing system based on machine deep learning

Publications (1)

Publication Number Publication Date
CN113420622A true CN113420622A (en) 2021-09-21

Family

ID=77788032

Country Status (1)

Country Link
CN (1) CN113420622A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230119516A1 (en) * 2021-10-20 2023-04-20 International Business Machines Corporation Providing text information without reading a file

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034147A (en) * 2018-09-11 2018-12-18 上海唯识律简信息科技有限公司 Optical character identification optimization method and system based on deep learning and natural language
CN110909086A (en) * 2019-11-27 2020-03-24 珠海格力电器股份有限公司 Mail archiving method, system, computer device and computer readable storage medium
CN111666259A (en) * 2020-06-06 2020-09-15 智同道合(苏州)信息技术服务有限公司 Document management method, management system, readable storage medium, and electronic device
CN112101367A (en) * 2020-09-15 2020-12-18 杭州睿琪软件有限公司 Text recognition method, image recognition and classification method and document recognition processing method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
石会鹏等: "空间业务档案数字化与全文检索***的研究", 《数字通信世界》 *
许呈辰: "档案数字化过程中OCR技术的应用", 《档案管理》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210921)