CN117112872A - Government affair text archiving method and system based on semi-supervised learning - Google Patents

Government affair text archiving method and system based on semi-supervised learning

Info

Publication number
CN117112872A
Authority
CN
China
Prior art keywords
text
archiving
semi
learning
government affair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311360019.2A
Other languages
Chinese (zh)
Other versions
CN117112872B (en)
Inventor
仇恒坦
陈兆亮
张兆勇
孙贤雯
杨春蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Co Ltd
Original Assignee
Inspur Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Software Co Ltd
Priority to CN202311360019.2A
Publication of CN117112872A
Application granted
Publication of CN117112872B
Legal status: Active

Classifications

    • G06F16/93 Document management systems
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata automatically derived from the content
    • G06F8/20 Software design
    • G06F8/315 Object-oriented languages
    • G06V30/19147 Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V30/19173 Classification techniques
    • G06V30/41 Analysis of document content
    • H04N1/00331 Connection or combination of a still picture apparatus with an apparatus performing optical character recognition
    • H04N1/2166 Intermediate information storage for mass storage, e.g. in document filing systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Library & Information Science (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a government affair text archiving method and system based on semi-supervised learning, belonging to the technical field of machine learning and intelligent government affairs. The method comprises a manual entry stage and an automatic archiving stage. In the manual entry stage, a clerk scans the materials and enters or selects a label while handling business, and a background program stores the scanned materials under a specified path according to the label, completing text archiving. In the automatic archiving stage, the clerk scans the materials, an automatic verification module determines the label to which the materials belong, and the background program stores the scanned materials under the specified path according to the label; the automatic archiving stage also starts a self-learning mechanism. The invention improves the accuracy of government affair text recognition and classification, reduces the recognition and archiving uncertainty caused by excessive manual intervention, and thereby improves the level of business service; it also facilitates the digitization of information materials and the management of government affair data.

Description

Government affair text archiving method and system based on semi-supervised learning
Technical Field
The invention relates to the technical field of machine learning and intelligent government affairs, in particular to a government affair text archiving method and system based on semi-supervised learning.
Background
With the development of e-government, automated processes are increasingly common in government systems, and paperless office work has become an inevitable trend. However, when business is handled, a large volume of paper materials still needs to be digitized and archived. At present, archiving relies mostly on manual entry; excessive manual intervention easily introduces uncertainty into file identification, and manual archiving is inefficient.
Disclosure of Invention
In view of the above defects, the invention provides a government affair text archiving method and system based on semi-supervised learning, which improves the accuracy of government affair text recognition and classification, reduces the recognition and archiving uncertainty caused by excessive manual intervention, and thereby improves the level of business service; it also facilitates the digitization of information materials and the management of government affair data.
The technical scheme adopted for solving the technical problems is as follows:
a government affair text archiving method based on semi-supervised learning comprises a manual entry stage and an automatic archiving stage, wherein in the manual entry stage, a clerk scans the materials and enters or selects a label while handling business, and a background program stores the scanned materials under a specified path according to the label to complete text archiving; in the automatic archiving stage, the clerk scans the materials, an automatic verification module determines the label to which the materials belong, and the background program stores the scanned materials under the specified path according to the label;
a self-learning mechanism is started in the automatic archiving stage: feature vectors of the text materials under each label are first extracted, a relation between the labels and the feature vectors of the text materials is then established, and the relation table is finally updated to the automatic verification module; this cycle repeats continuously to realize dynamic management of the labels;
the text materials come from manually calibrated texts and automatically calibrated texts; a manually calibrated text has an exact correspondence between text material and label, whereas an automatically calibrated text carries uncertainty, so a penalty mechanism is applied to the automatically calibrated texts to control the noisy text information that enters the trainer; the model obtained after training is updated to the automatic verification module, and this cycle repeats continuously to realize uninterrupted model optimization.
The method changes manual entry archiving into automatic archiving, which is a great convenience for government offices; in application, it improves the accuracy of government affair text recognition and classification, reduces the recognition and archiving uncertainty caused by excessive manual intervention, and thereby improves the level of business service; a well-developed automatic archiving method facilitates the digitization of information materials and the management of government affair data.
Preferably, the automatic verification module processes the material information acquired in real time and extracts character strings from the text material as one basis for classifying the text; the set of character strings that describe the text features is then converted into a feature vector.
Further, because the extracted character strings contain many worthless characters, a designed strategy is needed to extract the characters that can describe the features of the text.
Further, a set of labels is created in advance, corresponding to the catalogues to be archived; at the same time, a feature vector corresponding to each label is constructed in advance, forming the feature vector set of the labels;
the feature vector generated from the newly entered material is compared one by one against the feature vector set of the labels by correlation analysis; to achieve higher accuracy, a correlation threshold is applied so that only the label with the strongest correlation is accepted, and if no label meets the condition, a new label can be created or the material can be uniformly assigned to an "other" label;
the archive information repository is queried with the obtained label to obtain the archiving information;
finally, the back-end program performs the archiving operation according to the archiving information.
Preferably, the business implemented by the method mainly comprises data acquisition, information entry, archiving management and data storage management, and archives texts including license-type documents, contracts, commissions, policies and regulations, and supporting materials; wherein,
the data acquisition comprises acquiring government affair texts by means of a high-speed scanner, electronic materials, or photographing with portable equipment;
the information entry comprises a manual audit entry mode and an automatic audit entry mode;
the archiving management comprises creating a new archive catalogue and managing an existing archive catalogue, depending on whether the archive catalogue already exists;
the data storage management comprises data storage, ER index, data query and data deletion, so that the data can be fully utilized by other services.
Preferably, the implementation of the method comprises a task scheduling module, a business processing module, a data management module and an AI service module, wherein,
the task scheduling module serves as the controller (main control) and coordinates operation among the modules, including enabling or disabling the automatic archiving mode and enabling or disabling the self-learning mode;
the business processing module is responsible for business handling, including entering or selecting labels, scanning materials and storage operations;
the data management module is responsible for adding, deleting, modifying and querying data and for coordinating data resources;
the AI service module is responsible for intelligent computing services, including text recognition and strategy decisions.
Preferably, the specific implementation of the method comprises a manual entry stage, a semi-supervised learning stage and an unsupervised learning stage,
in the manual entry stage, material images are entered according to the conventional method, gradually accumulating a large number of valid, labelled government affair text images;
in the semi-supervised learning stage, manual entry continues while a self-learning model is started, which makes full use of the accumulated sample images for classification learning and gradually improves recognition accuracy, that is, manual entry and self-learning proceed simultaneously; once a certain accumulation is reached, the self-learning function is enabled to assist manual entry;
in the unsupervised learning stage, the system has autonomous learning capability and higher accuracy, so manual entry and archiving are no longer needed, and the clerk only needs to submit the materials for the text image materials to be archived automatically.
The invention also claims a government affair text archiving system based on semi-supervised learning, which implements the above government affair text archiving method based on semi-supervised learning; the system comprises interactive clients, an application server cluster, an AI server cluster, various data and database systems, and components that round out its functions;
the interactive clients comprise a business hall client, a mobile client, a Web client and an administrator client, and provide user functions such as information entry and lookup as well as operation and maintenance functions for administrators;
the application server cluster implements the basic functions of the system, and the AI server cluster provides AI computing services for the system; archiving tasks are customized and system operation parameters are controlled through a configuration center;
a database service is deployed to provide data storage and add, delete and query operations, and a message queue service and a cache service are deployed to enhance the stability of the system.
Preferably, the various clients connect to the gateway cluster through Nginx plus a firewall to ensure the security of system information;
mobile devices of personal clients and Web clients access the system from the public cloud through the firewall to complete business handling; the business hall, internal private devices and the operation and maintenance client access the system from the private cloud.
Preferably, the back-end servers comprise an application server (App Server), an AI server (AI Server) and a database server (DB Server), which respectively run the application program interface service (API Service), the AI service (AI Service) and the database (DB) according to task type.
Compared with the prior art, the government affair text archiving method and system based on semi-supervised learning have the following beneficial effects:
1. the existing workflow is fully utilized to realize a semi-supervised government affair text archiving scheme;
2. the self-learning process based on labelled materials improves the accuracy and efficiency of the algorithm;
3. a penalty mechanism and threshold control (correlation control) are added, enhancing the robustness and stability of the algorithm;
4. implementing the technical scheme greatly saves labour and material costs and improves business handling efficiency;
5. the system adopts modular design and development, occupies few computing resources, and is simple to deploy and convenient to apply.
Drawings
FIG. 1 is a flow chart of an implementation of a government text archiving method based on semi-supervised learning provided by an embodiment of the invention;
FIG. 2 is a business logic diagram of a government text archiving method based on semi-supervised learning provided by the embodiment of the invention;
FIG. 3 is a flow chart of the business implementation of the government text archiving method based on semi-supervised learning provided by the embodiment of the invention;
FIG. 4 is a self-learning flow chart provided by an embodiment of the present invention;
FIG. 5 is a diagram of system functional modules provided by an embodiment of the present invention;
FIG. 6 is a diagram of a government text archiving system deployment based on semi-supervised learning provided by an embodiment of the present invention;
FIG. 7 is a network architecture diagram of a government text archiving system based on semi-supervised learning provided by the embodiment of the invention.
Detailed Description
The embodiment of the invention provides a government affair text archiving method based on semi-supervised learning, which comprises a manual entry stage and an automatic archiving stage. Referring to FIG. 3, in the manual entry stage, a clerk scans the materials and enters or selects a label while handling business, and a background program stores the scanned materials under a designated path according to the label to complete text archiving; this is also the existing business processing scheme. In the automatic archiving stage, the clerk scans the materials, the automatic verification module determines the label to which the materials belong, and the background program stores the scanned materials under the designated path according to the label.
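The "store the scanned material under a designated path according to the label" step can be illustrated with a minimal sketch. The label names, the directory mapping `LABEL_DIRS`, and the function name `archive` are hypothetical and are not taken from the patent; the source does not describe its storage layout.

```python
import shutil
from pathlib import Path

# Hypothetical mapping from label to archive sub-directory (an assumption).
LABEL_DIRS = {
    "business_license": "licenses",
    "contract": "contracts",
}


def archive(scan: Path, label: str, root: Path) -> Path:
    """Copy a scanned file into the directory that belongs to its label.

    Labels without a configured directory fall back to "other".
    """
    target_dir = root / LABEL_DIRS.get(label, "other")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / scan.name
    shutil.copy2(scan, target)  # keep the original scan, store a copy
    return target
```

The same function serves both stages: in manual entry the label comes from the clerk, and in automatic archiving it comes from the automatic verification module.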
The automatic archiving stage starts a self-learning mechanism: the feature vectors of the text materials under each label are first extracted, the relation between the labels and the feature vectors of the text materials is then established, and the relation table is finally updated to the automatic verification module. This cycle repeats continuously to realize dynamic management of the labels and thereby achieve self-learning.
As shown in FIG. 4, the text materials come from manually calibrated texts and automatically calibrated texts. A manually calibrated text has an exact correspondence between text material and label, whereas an automatically calibrated text carries uncertainty, so a penalty mechanism is applied to the automatically calibrated texts to control the noisy text information that enters the trainer. The model obtained after training is updated to the automatic verification module, and this cycle repeats continuously to realize uninterrupted model optimization.
The penalty mechanism refers to a mechanism commonly used in machine learning to control the regularization process and adjust errors, and is usually implemented by a penalty function and its coefficients. The specific penalty mechanism is not limited herein, but the objective is the same in every case: to enhance the fitting capability of the machine learning model so that it makes more accurate inferences.
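Since the patent leaves the penalty mechanism open, the following is only one possible reading, sketched under stated assumptions: automatically calibrated samples are down-weighted by a coefficient and dropped below a confidence floor, and the label-to-feature-vector relation table is rebuilt as a weighted mean. The names `AUTO_PENALTY`, `MIN_CONFIDENCE` and `rebuild_relation_table`, and both numeric values, are hypothetical. This is a sample-weighting scheme, not a regularization term in a loss function.

```python
from collections import defaultdict

AUTO_PENALTY = 0.5    # hypothetical down-weighting coefficient for auto-labelled text
MIN_CONFIDENCE = 0.8  # hypothetical floor below which auto-labelled text is discarded


def rebuild_relation_table(samples):
    """Build {label: mean feature vector} from manual and automatic samples.

    Each sample is (label, vector, source, confidence), where source is
    "manual" or "auto". Manual samples count with weight 1.0; automatic
    samples count with weight AUTO_PENALTY * confidence, or are skipped
    when their confidence is below MIN_CONFIDENCE.
    """
    sums = {}
    weights = defaultdict(float)
    for label, vector, source, confidence in samples:
        if source == "manual":
            weight = 1.0
        elif confidence >= MIN_CONFIDENCE:
            weight = AUTO_PENALTY * confidence
        else:
            continue  # treated as noise, never reaches the trainer
        if label not in sums:
            sums[label] = [0.0] * len(vector)
        sums[label] = [s + weight * x for s, x in zip(sums[label], vector)]
        weights[label] += weight
    return {label: [s / weights[label] for s in total]
            for label, total in sums.items()}


SAMPLES = [
    ("contract", [1.0, 0.0], "manual", 1.0),
    ("contract", [0.0, 1.0], "auto", 1.0),   # kept, weight 0.5
    ("contract", [0.0, 1.0], "auto", 0.5),   # dropped, below the floor
]
table = rebuild_relation_table(SAMPLES)
```

Running the rebuild after each batch and handing `table` to the verification step gives the repeated update cycle described above.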
The automatic verification module is the key to realizing semi-supervised learning; its purpose is to obtain the archivable information of the entered material with as little human intervention as possible. The specific implementation process is as follows: the material information acquired in real time is processed, and character strings are extracted from the text material as one basis for classifying the text; because the extracted character strings contain many worthless characters, a designed strategy is needed to extract the characters that can describe the text features; the set of character strings that describe the text features is then converted into a feature vector.
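The extraction and vectorization steps can be sketched as follows. The patent does not specify its filtering strategy or vector representation, so the stop-token list, the fixed vocabulary and the term-frequency encoding below are all assumptions, and English tokens stand in for recognized Chinese text, which would need a word segmenter instead of a regular expression.

```python
import re
from collections import Counter

# Hypothetical low-value tokens to discard (an assumption, not from the patent).
STOP_TOKENS = {"the", "of", "and", "no", "to", "a"}

# Hypothetical fixed vocabulary that defines the vector dimensions.
VOCABULARY = ["license", "business", "contract", "party", "identity", "card"]


def extract_tokens(raw_text: str) -> list[str]:
    """Split recognized text into lowercase words and drop low-value ones."""
    tokens = re.findall(r"[a-z]+", raw_text.lower())
    return [t for t in tokens if t not in STOP_TOKENS and len(t) > 1]


def to_feature_vector(tokens: list[str], vocabulary: list[str]) -> list[float]:
    """Encode tokens as relative term frequencies over the vocabulary."""
    counts = Counter(tokens)
    total = sum(counts[word] for word in vocabulary) or 1
    return [counts[word] / total for word in vocabulary]


scanned = "Business License No. 12345 of the registered business"
tokens = extract_tokens(scanned)
vector = to_feature_vector(tokens, VOCABULARY)
```

Digits and punctuation are discarded by the regular expression, and words outside the vocabulary (here "registered") survive filtering but do not contribute to the vector.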
A set of labels may be created in advance, corresponding to the catalogues to be archived; at the same time, a feature vector corresponding to each label is constructed in advance, forming the feature vector set of the labels;
the feature vector generated from the newly entered material can then be compared one by one against the feature vector set of the labels by correlation analysis; this yields a group of labels with stronger correlation, and to achieve higher accuracy a correlation threshold can be applied so that only the label with the strongest correlation is accepted; if no label meets the condition, a new label can be created or the material can be uniformly assigned to an "other" label;
the archive information repository is queried with the obtained label to obtain the archiving information;
finally, the back-end program performs the archiving operation according to the archiving information.
This process is shown in FIG. 1.
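The label selection described above can be sketched as a nearest-label search with a cut-off. The patent does not name its correlation measure, so cosine similarity is used here as a stand-in; the label vectors, archive paths and the value of `THRESHOLD` are hypothetical.

```python
import math

# Hypothetical per-label reference vectors and archive locations.
LABEL_VECTORS = {
    "business_license": [0.5, 0.5, 0.0, 0.0],
    "contract": [0.0, 0.0, 0.6, 0.4],
}
ARCHIVE_INFO = {
    "business_license": "/archive/licenses",
    "contract": "/archive/contracts",
    "other": "/archive/other",
}
THRESHOLD = 0.8  # hypothetical minimum correlation for accepting a label


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def choose_label(vector, label_vectors=LABEL_VECTORS, threshold=THRESHOLD):
    """Return (label, score) for the best match, or ("other", score) below threshold."""
    best_label, best_score = "other", 0.0
    for label, reference in label_vectors.items():
        score = cosine(vector, reference)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return "other", best_score
    return best_label, best_score


def archive_path(vector):
    """Look up the archive location for the chosen label."""
    label, _ = choose_label(vector)
    return ARCHIVE_INFO[label]
```

A vector close to one reference, such as `[0.4, 0.6, 0.0, 0.0]`, resolves to the license directory, while an evenly spread vector falls below the threshold and is routed to the "other" directory. Creating a new label instead of falling back is the alternative the text allows.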
As shown in FIG. 2, the business implemented by the method mainly comprises four parts: data acquisition, information entry, archiving management and data storage management, and can archive texts such as license-type documents, contracts, commissions, policies and regulations, and supporting materials. Wherein,
the data acquisition can be performed in various ways, for example by a high-speed scanner, electronic materials, or photographing with portable equipment;
the information entry stage is mainly divided into two modes: manual audit entry and automatic audit entry;
the archiving management covers two cases, creating a new archive catalogue and managing an existing archive catalogue, depending on whether the archive catalogue already exists;
the data storage management stage mainly comprises data storage, ER index, data query and data deletion, so that the data can be conveniently and fully utilized by other services.
ER index: ER stands for Entity Relationship. It is commonly expressed as a diagram, the entity-relationship diagram, which is a method of representing entities, attributes and relationships. With this method, relationships between the transacting entity and the various materials, and among the materials themselves, are established, providing an index that describes these complex relationships, referred to as the ER index.
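As an illustration of what such an index might look like in memory, the sketch below stores typed relations between entities and answers lookups in both directions. The class name `ERIndex`, the entity identifiers and the relation names are hypothetical; the patent does not describe the index structure.

```python
from collections import defaultdict


class ERIndex:
    """A minimal entity-relationship index: (entity, relation) -> related entities."""

    def __init__(self):
        self._edges = defaultdict(set)

    def relate(self, source: str, relation: str, target: str) -> None:
        """Record a relation and its inverse so lookups work from either side."""
        self._edges[(source, relation)].add(target)
        self._edges[(target, "inverse:" + relation)].add(source)

    def related(self, entity: str, relation: str) -> list[str]:
        """Return the entities linked to `entity` by `relation`, sorted by name."""
        return sorted(self._edges.get((entity, relation), set()))


index = ERIndex()
index.relate("applicant:001", "submitted", "material:license.pdf")
index.relate("applicant:001", "submitted", "material:contract.pdf")
index.relate("material:contract.pdf", "references", "material:license.pdf")
```

Here `index.related("applicant:001", "submitted")` lists both materials, and the inverse relation finds which materials reference the license.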
As shown in FIG. 5, the implementation of the method includes a task scheduling module, a business processing module, a data management module and an AI service module, wherein,
the task scheduling module serves as the controller (main control) and coordinates operation among the modules, including enabling or disabling the automatic archiving mode, enabling or disabling the self-learning mode, and the like;
the business processing module is responsible for business handling, including entering or selecting labels, scanning materials, storage operations and the like;
the data management module is responsible for adding, deleting, modifying and querying data and for coordinating data resources;
the AI service module mainly provides intelligent computing services such as text recognition and strategy decisions.
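The division of responsibilities among the modules can be sketched as below. This is an illustrative structure only: the class names, the `ai_predict` callable standing in for the AI service, and the in-memory record store are all assumptions rather than the patent's design.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class DataManager:
    """Stand-in for the data management module (add, delete, query)."""
    records: dict = field(default_factory=dict)

    def save(self, name: str, label: str) -> None:
        self.records[name] = label

    def delete(self, name: str) -> None:
        self.records.pop(name, None)


@dataclass
class TaskScheduler:
    """Stand-in for the controller that switches modes and routes work."""
    data: DataManager
    ai_predict: Callable[[str], str]  # hypothetical AI service entry point
    auto_archive: bool = False
    self_learning: bool = False

    def process(self, name: str, text: str, manual_label: Optional[str] = None) -> str:
        """Archive one material, asking the AI service only in automatic mode."""
        if self.auto_archive:
            label = self.ai_predict(text)
        elif manual_label is not None:
            label = manual_label
        else:
            raise ValueError("manual mode requires a label from the clerk")
        self.data.save(name, label)
        return label


def toy_predict(text: str) -> str:
    """Toy classifier used only for this demonstration."""
    return "contract" if "contract" in text.lower() else "other"


scheduler = TaskScheduler(DataManager(), ai_predict=toy_predict)
first = scheduler.process("a.pdf", "Sales Contract", manual_label="license")
scheduler.auto_archive = True
second = scheduler.process("b.pdf", "Sales Contract")
```

In manual mode the clerk's label is stored as given, and after the scheduler switches to automatic mode the same call is answered by the AI service instead.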
The application of the method is described below through the implementation of automatic government affair text archiving in a city's intelligent approval system:
the project requires that texts provided by members of the public handling business be archived. Common text materials include identity cards, business licenses, bank cards, contracts, commissions, policies and regulations, supporting materials and the like, and the materials generally take the form of photographs, scanned copies, electronic materials and the like. The archived materials can be shared as a resource within government systems, reducing the office procedures of other links and improving office efficiency.
Conventionally, the submitted materials are audited and archived one by one by manual audit entry. As business volume grows, the large amount of material auditing work seriously slows the handling of business and even brings the risk of archiving errors, so an intelligent method is needed to audit and archive submitted materials automatically.
With this method, the existing system is fully utilized and is optimized and upgraded. Specifically, the implementation based on the method is divided into three stages: a manual entry stage, a semi-supervised learning stage and an unsupervised learning stage.
In the manual entry stage, material images are entered according to the conventional method, gradually accumulating a large number of valid, labelled government affair text images.
In the semi-supervised learning stage, manual entry continues while a self-learning model is started, which makes full use of the accumulated sample images for classification learning and gradually improves recognition accuracy; that is, manual entry and self-learning proceed simultaneously, and once a certain accumulation is reached, the self-learning function is enabled to assist manual entry.
In the unsupervised learning stage, the system has autonomous learning capability and higher accuracy, so manual entry and archiving are no longer needed, and the clerk only needs to submit the materials for the text image materials to be archived automatically.
When a system is developed with this method, four modules can be considered: task scheduling, business processing, data management and AI services. The framework is designed with decoupling fully in mind. The development tasks of the front-end and back-end engineers use Java, the mainstream language of government affair systems, which keeps the framework compatible with other systems; the AI service is developed by algorithm engineers in Python, the mainstream language of the industry, which gives full play to the algorithmic advantages, provides web services to the whole system, and allows rapid optimization and iterative updates.
Text recognition technology is widely applied in the digital construction of government affair systems, where it can improve administrative efficiency and service levels. It is used in scenarios such as the digital processing of government official documents, the digital entry of form information, and the automatic classification of text and smart city data, increasing the processing speed of government affair systems and reducing their processing cost. Automatic text archiving is a file sorting and management method based on artificial intelligence technology, which helps users quickly and accurately identify and archive various documents, pictures, audio and video files. Automatic text archiving mainly uses techniques such as natural language processing, machine learning and deep learning to classify, label and archive documents.
Common machine learning methods are broadly divided into supervised learning and unsupervised learning. Put simply, the distinction is whether supervision is present, that is, whether the input data carry labels: if the input data are labelled, the learning is supervised; if they are unlabelled, it is unsupervised. Semi-supervised learning lies between the two. In semi-supervised learning, part of the training data is labelled and the other part is not, and the amount of unlabelled data is often significantly larger than the amount of labelled data (which also matches reality). The basic principle underlying semi-supervised learning is that the distribution of the data is not completely random, so acceptable or even very good classification results can be obtained from the local features of some labelled data together with the overall distribution of the more numerous unlabelled data.
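To make the idea concrete, the sketch below shows self-training (pseudo-labelling), one common semi-supervised technique, on one-dimensional toy data with a nearest-centroid rule. The patent does not state which semi-supervised algorithm it uses, so this is a generic illustration; the function name `self_train`, the `margin` parameter and the data are hypothetical.

```python
def centroid(values):
    """Mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def self_train(labeled, unlabeled, margin=1.0, rounds=5):
    """Grow the labelled set by pseudo-labelling confident unlabelled points.

    labeled:   {label: [float, ...]} with at least two labels.
    unlabeled: [float, ...]
    A point is adopted only when it is closer to the best centroid than to
    the runner-up by at least `margin`; the rest stay unlabelled.
    """
    labeled = {label: list(points) for label, points in labeled.items()}
    pool = list(unlabeled)
    for _ in range(rounds):
        centers = {label: centroid(points) for label, points in labeled.items()}
        remaining = []
        for x in pool:
            ranked = sorted(centers, key=lambda label: abs(x - centers[label]))
            best, second = ranked[0], ranked[1]
            gap = abs(x - centers[second]) - abs(x - centers[best])
            if gap >= margin:
                labeled[best].append(x)   # confident: adopt the pseudo-label
            else:
                remaining.append(x)       # ambiguous: keep for a later round
        if len(remaining) == len(pool):
            break                         # nothing new was adopted, so stop
        pool = remaining
    return labeled, pool


seed = {"a": [0.0], "b": [10.0]}          # two labelled points
grown, leftover = self_train(seed, [1.0, 2.0, 9.0, 8.0, 5.0])
```

With only two labelled points, four of the five unlabelled points are absorbed into the nearest class, while the point exactly midway between the classes remains unlabelled, which is the role the confidence margin plays in keeping noise out.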
An embodiment of the invention also provides a government affair text archiving system based on semi-supervised learning, which implements the government affair text archiving method based on semi-supervised learning described in the embodiments above. As shown in Fig. 6, the system includes interactive clients, an application server cluster, an AI server cluster, various types of data and database systems, and components that complete the system's functions.
The interactive clients include a business hall, a mobile client, a Web client, and an administrator client; they provide functions such as user information entry and review, as well as operation and maintenance functions for administrator users.
External clients connect to the gateway cluster through Nginx plus a firewall to ensure the security of system information.
The system provides an application server cluster and an AI server cluster: the application server cluster implements the basic functions of the system, and the AI server cluster provides AI computing services to the system. A configuration center allows archiving tasks to be customized and system operating parameters to be controlled.
PHP application server cluster: PHP (Hypertext Preprocessor) is a general-purpose open-source scripting language; an application server cluster developed in this language can support high concurrency and distributed deployment. K8s: short for Kubernetes, where the 8 stands for the eight letters between the "K" and the "s"; it is a well-known open-source, container-based cluster management platform, used here to build the AI server cluster and to provide Docker (container) management and load balancing. Many other management platforms exist, and a platform suited to the actual services can be chosen during technology selection.
A database service is deployed to provide data storage and insert, delete, and query operations, and a message queue service and a cache service are deployed to improve system stability. Kafka is an open-source distributed streaming platform under the Apache Software Foundation; it is a high-throughput, persistent, distributed publish-subscribe message queue system. Redis is free and open source under the BSD license; it is a high-performance key-value non-relational database, used here to implement the cache service.
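How a cache service can sit in front of the database may be sketched with the cache-aside pattern. This is an assumption for illustration: the source only states that Redis implements the cache service, not which caching strategy is used. The `TTLCache` class is a hypothetical in-memory stand-in for Redis, and a plain dictionary stands in for the database.

```python
import time


class TTLCache:
    """In-memory stand-in for the cache service (the source uses Redis)."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = {}

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]      # entry expired, drop it
            return None
        return value

    def set(self, key, value):
        self._items[key] = (value, self._clock() + self._ttl)


def lookup_archive_path(label, cache, database, stats):
    """Cache-aside read: try the cache first, then fall back to the database."""
    path = cache.get(label)
    if path is not None:
        stats["hits"] += 1
        return path
    stats["misses"] += 1
    path = database.get(label)        # dict stands in for a database query
    if path is not None:
        cache.set(label, path)        # populate the cache for later reads
    return path
```

Repeated lookups of the same label are then served from the cache, which reduces load on the database during peak business hours.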
Fig. 7 shows the network architecture of a system developed with this method. Personal mobile devices and Web clients access the system from the public cloud through the firewall to handle business; the business hall, internal dedicated devices, and operation and maintenance clients access the system from the private cloud.
The back-end servers comprise an application server (App Server), an AI server (AI Server), and a database server (DB Server), which run the application programming interface service (API Service), the AI service (AI Service), and database (DB) operations respectively, according to task type.
Those skilled in the art can readily implement the present invention from the specific embodiments above. It should be understood that the invention is not limited to these particular embodiments. Based on the disclosed embodiments, those skilled in the art may freely combine different technical features to obtain different technical solutions.
Everything other than the technical features described in the specification is known to those skilled in the art.

Claims (10)

1. A government affair text archiving method based on semi-supervised learning, characterized by comprising a manual input stage and an automatic archiving stage, wherein, during business handling, a clerk scans the material and enters or selects a label, and a background program stores the scanned material under a specified path according to the label to complete text archiving; in the automatic archiving stage, the clerk scans the material, an automatic verification module determines the label to which the material belongs, and the background program stores the scanned material under the specified path according to the label;
a self-learning mechanism is started in the automatic archiving stage: feature vectors of the text materials under each label are first extracted, a relation between the labels and the feature vectors of the text materials is then established, and the relation table is finally updated to the automatic verification module; this cycle is repeated to realize dynamic management of the labels;
the text materials come from manually calibrated texts and automatically calibrated texts; a penalty mechanism is applied to the automatically calibrated texts to control the noisy text information that enters the trainer; the model obtained after training is updated to the automatic verification module, and this cycle is repeated to realize continuous model optimization.
2. The government affair text archiving method based on semi-supervised learning as set forth in claim 1, wherein the automatic verification module processes the acquired real-time material information, extracts character strings from the text material, and uses the extracted character strings as one basis for classifying the text; the set of character strings describing the text features is then converted into a feature vector.
3. The government affair text archiving method based on semi-supervised learning as set forth in claim 2, wherein, in processing the acquired real-time material information and extracting character strings from the text material, a plurality of characters describing the features of the text are extracted through a designed strategy.
4. The government affair text archiving method based on semi-supervised learning as set forth in claim 2, wherein a set of labels corresponding to the directories to be archived is created in advance; at the same time, a feature vector corresponding to each label is constructed in advance to form the feature vector set of the labels;
correlation analysis is carried out one by one between the feature vector generated from the newly entered material and the feature vectors in the feature vector set of the labels; by controlling the correlation, the group of labels with the highest correlation is obtained, and if no label meets the condition, a new label is created or the material is uniformly assigned to an "other" label;
the archive information resource library is queried with the obtained label to obtain archiving information;
finally, the back-end program performs the archiving operation according to the archiving information.
5. The government affair text archiving method based on semi-supervised learning as set forth in claim 1, 2, 3 or 4, wherein the business implemented by the method includes data collection, information entry, archiving management and data storage management, and the archived texts include licenses, contracts, commissions, policies and regulations, and supporting materials; wherein,
data collection includes acquiring government affair texts by means of a high-speed scanner, electronic materials, or photographing with a portable device;
information entry includes a manual review entry mode and an automatic review entry mode;
archiving management includes creating new archive directories and managing existing archive directories, depending on whether the archive directory already exists;
data management includes data storage, ER indexing, data querying, and data deletion.
6. The government affair text archiving method based on semi-supervised learning as set forth in claim 5, wherein the implementation of the method includes a task scheduling module, a business processing module, a data management module and an AI service module, wherein,
the task scheduling module acts as the main controller and coordinates operation among the modules, including an automatic archiving mode and a self-learning mode;
the business processing module is responsible for business handling, including label entry or selection, material scanning, and storage operations;
the data management module is responsible for adding, deleting, modifying and querying data and for coordinating data resources;
the AI service module is responsible for intelligent computing services, including text recognition and strategy decisions.
7. The government affair text archiving method based on semi-supervised learning as set forth in claim 6, wherein the method includes a manual input stage, a semi-supervised learning stage and an unsupervised learning stage,
in the manual input stage, material images are entered by the conventional method, gradually accumulating a large number of valid, labeled government affair text images;
in the semi-supervised learning stage, manual input continues while a self-learning model is started; the accumulated sample images are fully used for classification learning and the recognition accuracy is gradually optimized, that is, manual input and self-learning proceed simultaneously; once the required amount has accumulated, the self-learning function is started to assist manual input;
in the unsupervised learning stage, manual input and archiving are no longer needed; the clerk only needs to submit the materials, and the text image materials are archived automatically.
8. A government affair text archiving system based on semi-supervised learning, characterized in that the system implements the government affair text archiving method based on semi-supervised learning as set forth in any one of claims 1 to 7; the system comprises interactive clients, an application server cluster, an AI server cluster, various data and database systems, and components that complete the system's functions;
the interactive clients comprise a business hall, a mobile client, a Web client and an administrator client, and provide user information entry and review functions and operation and maintenance functions for administrator users;
the application server cluster implements the basic functions of the system, and the AI server cluster provides AI computing services to the system; archiving tasks are customized and system operating parameters are controlled through a configuration center;
a database service is deployed to provide data storage and insert, delete and query operations, and a message queue service and a cache service are deployed to improve system stability.
9. The government affair text archiving system based on semi-supervised learning as set forth in claim 8, wherein the clients are connected to the gateway cluster through Nginx plus a firewall;
personal mobile devices and Web clients access the system from the public cloud through the firewall to handle business; the business hall, internal dedicated devices and operation and maintenance clients access the system from the private cloud.
10. The government affair text archiving system based on semi-supervised learning as set forth in claim 9, wherein the back-end servers include an application server, an AI server and a database server, which run the application programming interface service, the AI service and database operation tasks respectively, according to task type.
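The label-matching procedure recited in claims 2 to 4 (extract characters, build a feature vector, compare it with each label's feature vector, fall back to an "other" label) can be sketched as follows. This is an illustrative reading, not the claimed implementation: the character-count features, the use of cosine similarity as the correlation measure, the `min_correlation` threshold, and all names are assumptions.

```python
import math
from collections import Counter

OTHER_LABEL = "other"  # fallback label when no label is correlated enough


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def match_label(material_chars, label_vectors, min_correlation=0.5):
    """Return (label, score) for the best-correlated label, or the fallback."""
    vec = Counter(material_chars)            # feature vector of the new material
    best_label, best_score = OTHER_LABEL, 0.0
    for label, label_vec in label_vectors.items():
        score = cosine(vec, label_vec)       # compare one by one
        if score > best_score:
            best_label, best_score = label, score
    if best_score < min_correlation:
        return OTHER_LABEL, best_score       # no label meets the condition
    return best_label, best_score


def archive_path(label, archive_info):
    """Look up archiving information (here just a path) for a label."""
    return archive_info.get(label, archive_info[OTHER_LABEL])
```

A back-end program would then store the scanned material under the path returned by `archive_path`; creating a new label instead of using the fallback is the alternative branch mentioned in claim 4 and is not shown here.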
CN202311360019.2A 2023-10-20 2023-10-20 Government affair text archiving method and system based on semi-supervised learning Active CN117112872B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311360019.2A CN117112872B (en) 2023-10-20 2023-10-20 Government affair text archiving method and system based on semi-supervised learning

Publications (2)

Publication Number Publication Date
CN117112872A true CN117112872A (en) 2023-11-24
CN117112872B CN117112872B (en) 2024-07-12

Family

ID=88796891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311360019.2A Active CN117112872B (en) 2023-10-20 2023-10-20 Government affair text archiving method and system based on semi-supervised learning

Country Status (1)

Country Link
CN (1) CN117112872B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109658062A (en) * 2018-12-13 2019-04-19 广州华资软件技术有限公司 A kind of electronic record intelligent processing method based on deep learning
CN111461636A (en) * 2019-01-22 2020-07-28 广东鼎义互联科技股份有限公司 Virtual robot-based government affair service platform and application
CN112182326A (en) * 2020-10-16 2021-01-05 山东浪潮商用***有限公司 Efficient electronic archive management method and system
CN113312476A (en) * 2021-02-03 2021-08-27 珠海卓邦科技有限公司 Automatic text labeling method and device and terminal
WO2023019120A2 (en) * 2021-08-13 2023-02-16 Pricewaterhousecoopers Llp Methods and systems for artificial intelligence-assisted document annotation
CN115827939A (en) * 2022-11-28 2023-03-21 华东冶金地质勘查局八一五地质队 Digital archive management system
CN116756395A (en) * 2023-05-12 2023-09-15 严福 Electronic archiving method and system for urban construction archives

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
S. SHASHANK HOLLA 等: "End-to-End Speech Recognition for Low Resource Language Sanskrit using Self-Supervised Learning", 2022 INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS SIGNAL PROCESSING AND NETWORKING (WISPNET), 31 December 2022 (2022-12-31) *
宋华;: "在线政务服务平台电子文件归档管理对策研究", 浙江档案, no. 05, 31 May 2019 (2019-05-31) *
龚炜;: "一套基于人工智能技术的政务服务平台设计", 中国科技信息, no. 12, 15 June 2020 (2020-06-15) *

Also Published As

Publication number Publication date
CN117112872B (en) 2024-07-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant