CN117112872A

CN117112872A - Government affair text archiving method and system based on semi-supervised learning

Info

Publication number: CN117112872A
Application number: CN202311360019.2A
Authority: CN
Inventors: 仇恒坦; 陈兆亮; 张兆勇; 孙贤雯; 杨春蕾
Original assignee: Inspur Software Co Ltd
Current assignee: Inspur Software Co Ltd
Priority date: 2023-10-20
Filing date: 2023-10-20
Publication date: 2023-11-24
Anticipated expiration: 2043-10-20
Also published as: CN117112872B

Abstract

The invention discloses a government affair text archiving method and system based on semi-supervised learning, belonging to the technical field of machine learning and intelligent government affairs. The method comprises a manual input stage and an automatic archiving stage. In the manual input stage, a clerk scans materials and inputs or selects labels during business handling, and a background program stores the scanned materials under a specified path according to the labels to complete text archiving. In the automatic archiving stage, the clerk scans the material, an automatic verification module determines the label to which the material belongs, and the background program stores the scanned material under a specified path according to the label; the automatic archiving stage also starts a self-learning mechanism. The invention improves the accuracy of government affair text identification and classification, reduces the identification and archiving uncertainty caused by excessive manual intervention, and further improves the business service level; it also facilitates the digitization of information materials and the management of government affair data.

Description

Government affair text archiving method and system based on semi-supervised learning

Technical Field

The invention relates to the technical field of machine learning and intelligent government affairs, in particular to a government affair text archiving method and system based on semi-supervised learning.

Background

With the development of e-government, automated processes are increasingly common in government systems, and paperless office work has become an inevitable trend. However, when business is handled, a large amount of paper material still needs to be digitized and archived. At present this archiving is mostly done by manual entry; excessive manual intervention easily introduces uncertainty into document identification, and manual archiving is inefficient.

Disclosure of Invention

In view of the above defects, the invention provides a government affair text archiving method and system based on semi-supervised learning, which improves the accuracy of government affair text identification and classification, reduces the identification and archiving uncertainty caused by excessive manual intervention, and further improves the business service level; it also facilitates the digitization of information materials and the management of government affair data.

The technical scheme adopted for solving the technical problems is as follows:

a government affair text archiving method based on semi-supervised learning comprises a manual input stage and an automatic archiving stage, wherein in the manual input stage a clerk scans materials and inputs or selects labels during business handling, and a background program stores the scanned materials under a specified path according to the labels to complete text archiving; in the automatic archiving stage, the clerk scans the material, an automatic verification module determines the label to which the material belongs, and the background program stores the scanned material under a specified path according to the label;

the automatic archiving stage starts a self-learning mechanism: first the feature vectors of the text materials under each label are extracted, then the relation between the labels and the feature vectors of the text materials is established, and finally the relation table is updated to the automatic verification module; this cycle is repeated continuously to realize dynamic management of the labels;

the text material comes from manually calibrated text and automatically calibrated text; for manually calibrated text the correspondence between the text material and its label is exact, whereas automatically calibrated text carries uncertainty, so a penalty mechanism is applied to the automatically calibrated text to limit the noisy text information that enters training; after training, the resulting model is updated to the automatic verification module, and this cycle is repeated continuously to realize uninterrupted model optimization.

The method replaces manual entry and archiving with automatic archiving, which is a considerable convenience for government offices. In application, it improves the accuracy of government affair text identification and classification, reduces the identification and archiving uncertainty caused by excessive manual intervention, and further improves the business service level; a complete automatic archiving method also facilitates the digitization of information materials and the management of government affair data.

Preferably, the automatic verification module processes the material information acquired in real time and extracts character strings from the text material to serve as one basis for classifying the text; the set of character strings that describe the text features is then converted into a feature vector.

Further, since the extracted character strings contain many characters of no value, the characters that can describe the features of the text need to be selected by means of a designed strategy.

Further, a set of labels is created in advance, and corresponds to the catalogue to be archived; simultaneously, constructing feature vectors corresponding to each label in advance to form a feature vector set of a group of labels;

the feature vector generated from the newly entered material is compared, one by one, with the feature vectors in the label feature vector set by correlation analysis; to achieve higher accuracy, a correlation threshold is applied so that only the label with the strongest correlation is selected, and if no label meets the condition, a new label may be created or the material may be uniformly assigned to an "other" label;

inquiring an archive information resource library through the obtained tag to obtain archive information;

finally, according to the archiving information, the archiving operation is performed by the back-end program.
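The last two steps above (querying archive information by label, then letting the back-end program file the material) can be sketched as follows. This is a minimal illustration under stated assumptions: the `ARCHIVE_INFO` mapping, its label names and directory names, and the function `archive_material` are hypothetical and do not come from the source, which does not specify how the archive information resource library is stored.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical archive information repository: label -> relative archive directory.
ARCHIVE_INFO = {
    "business_license": "licenses/business",
    "contract": "contracts",
    "other": "unclassified",
}


def archive_material(scan: Path, label: str, root: Path) -> Path:
    """Move a scanned file into the directory registered for its label."""
    # Unknown labels fall back to the catch-all "other" entry.
    relative_dir = ARCHIVE_INFO.get(label, ARCHIVE_INFO["other"])
    target_dir = root / relative_dir
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / scan.name
    shutil.move(str(scan), str(target))
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        scan = Path(tmp) / "scan_001.png"
        scan.write_bytes(b"placeholder image bytes")
        print(archive_material(scan, "contract", Path(tmp) / "archive"))
```

The same function would serve both stages: in the manual input stage the label comes from the clerk, and in the automatic archiving stage it comes from the automatic verification module.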

Preferably, the business implemented by the method mainly comprises data acquisition, information input, archiving management and data storage management, and archives texts including licenses, contracts, letters of authorization, policies and regulations, and supporting documents; wherein,

the data acquisition comprises acquiring government affair texts by means of a high-speed document scanner, electronic materials, or photographs taken with portable devices;

the information input comprises a manual audit input mode and an automatic audit input mode;

the archiving management comprises newly-built archiving catalogues and existing archiving catalogue management according to whether the archiving catalogues exist in advance or not;

the data storage management comprises data storage, ER indexing, data query and data deletion, so that the data can be fully used by other services.

Preferably, the implementation of the method comprises a task scheduling module, a business processing module, a data management module and an AI service module, wherein,

the task scheduling module is used as a Controller (main control) to coordinate the operation among the modules, including starting or closing an automatic filing mode and starting or closing a self-learning mode;

the business processing module is responsible for business handling matters, including input/selection of labels, scanning of materials and storage operation;

the data management module is responsible for adding, deleting, modifying and searching data and coordinating data resources;

the AI service module is responsible for intelligent computing services, including text recognition and strategy decisions.

Preferably, the specific implementation of the method comprises a manual recording stage, a semi-supervised learning stage and an unsupervised learning stage,

in the manual recording stage, recording material images according to a conventional method, and gradually accumulating a large number of effective government affair text images with labels;

in the semi-supervised learning stage, manual input continues while a self-learning model is started, so that manual input and self-learning proceed simultaneously; the accumulated sample images are fully used for classification learning and the recognition accuracy is gradually optimized; once a certain amount of samples has accumulated, the self-learning function is used to assist manual input;

in the unsupervised learning stage, the system has autonomous learning capability and higher accuracy, so manual entry and archiving are no longer needed at all; a clerk only has to submit the text image materials and they are archived automatically.
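The progression through the three stages can be sketched as a simple rule. The source only states that self-learning starts "after a certain accumulation" and that the last stage requires "higher accuracy"; the two threshold values and the names `MIN_LABELED_SAMPLES`, `MIN_ACCURACY` and `current_stage` below are therefore illustrative assumptions, not values from the source.

```python
from enum import Enum


class Stage(Enum):
    MANUAL = "manual recording"
    SEMI_SUPERVISED = "semi-supervised learning"
    UNSUPERVISED = "unsupervised learning"


# Hypothetical thresholds; the source gives no concrete numbers.
MIN_LABELED_SAMPLES = 1000
MIN_ACCURACY = 0.98


def current_stage(labeled_samples: int, validated_accuracy: float) -> Stage:
    """Pick the operating stage from sample count and measured accuracy."""
    if labeled_samples < MIN_LABELED_SAMPLES:
        # Not enough labeled images yet: clerks keep entering labels by hand.
        return Stage.MANUAL
    if validated_accuracy < MIN_ACCURACY:
        # Enough samples to learn from, but clerks still confirm the labels.
        return Stage.SEMI_SUPERVISED
    return Stage.UNSUPERVISED
```

In practice the switch between stages could equally be an explicit operator decision made through the task scheduling module rather than an automatic rule.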

The invention also claims a government affair text archiving system based on semi-supervised learning, which realizes the government affair text archiving method based on semi-supervised learning; the system comprises an interactive client, an application server cluster, an AI server cluster, various data and database systems and components for perfecting functions;

the interactive client comprises a business hall client, a mobile client, a Web client and an administrator client, and is used for providing user functions such as information input and review and for providing operation and maintenance functions to administrator users;

the application server cluster is used for realizing the basic function of the system, and the AI server cluster is used for providing AI computing service for the system; customizing archiving tasks and controlling system operation parameters through a configuration center;

a database service is deployed to provide data storage, insertion, deletion and query operations, and a message queue service and a cache service are deployed to enhance the stability of the system.

Preferably, the various clients are connected to the gateway cluster through Nginx plus a firewall so as to ensure the security of system information;

personal mobile devices and Web clients access the system from the public cloud through the firewall to complete business handling; the business hall, internal dedicated devices and the operation and maintenance client access the system from the private cloud.

Preferably, the back-end servers comprise an application server (App Server), an AI server (AI Server) and a database server (DB Server), which respectively run the application program interface service (API Service), the AI service (AI Service) and the database (DB) according to task type.

Compared with the prior art, the government affair text archiving method and system based on semi-supervised learning have the following beneficial effects:

1. the existing working flow is fully utilized, and a government affair text archiving scheme of semi-supervised learning is realized;

2. the accuracy and the efficiency of the algorithm are improved based on the self-learning process of the label material;

3. a punishment mechanism and threshold control (correlation control) are added, so that the robustness and stability of the algorithm are enhanced;

4. by implementing the technical scheme, the cost of manpower and material resources is greatly saved, and the business handling efficiency is improved;

5. the system adopts modularized design and development, and has the advantages of small occupation of computing resources, simple deployment and convenient application.

Drawings

FIG. 1 is a flow chart of an implementation of a government text archiving method based on semi-supervised learning provided by an embodiment of the invention;

FIG. 2 is a business logic diagram of a government text archiving method based on semi-supervised learning provided by the embodiment of the invention;

FIG. 3 is a flow chart of a government text archiving method service implementation based on semi-supervised learning provided by the embodiment of the invention;

FIG. 4 is a self-learning flow chart provided by an embodiment of the present invention;

FIG. 5 is a diagram of system functional modules provided by an embodiment of the present invention;

FIG. 6 is a diagram of a government text archiving system deployment based on semi-supervised learning provided by an embodiment of the present invention;

FIG. 7 is a network architecture diagram of a government text archiving system based on semi-supervised learning provided by the embodiment of the invention.

Detailed Description

The embodiment of the invention provides a government affair text archiving method based on semi-supervised learning, which comprises a manual input stage and an automatic archiving stage. Referring to FIG. 3, in the manual input stage, a clerk scans materials and inputs or selects labels during business handling, and a background program stores the scanned materials under a designated path according to the labels to complete text archiving; this is the existing business process. In the automatic archiving stage, the clerk scans the material, the automatic verification module determines the label to which the material belongs, and the background program stores the scanned material under a specified path according to the label.

The automatic archiving stage starts a self-learning mechanism: first the feature vectors of the text materials under each label are extracted, then the relation between the labels and the feature vectors of the text materials is established, and finally the relation table is updated to the automatic verification module. This cycle is repeated continuously to realize dynamic management of the labels, thereby achieving self-learning.
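One pass of this self-learning cycle can be sketched as follows. This is an illustration under assumptions: the source does not say how feature vectors are computed or how the relation table is represented, so character counts stand in for the features, the per-label average stands in for the relation, and the names `text_vector`, `build_relation_table` and `VerificationModule` are hypothetical.

```python
from collections import Counter


def text_vector(text: str) -> Counter:
    """Toy feature vector: character counts, ignoring whitespace."""
    return Counter(ch for ch in text if not ch.isspace())


def build_relation_table(archive: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Map each label to the average feature vector of the texts stored under it."""
    table = {}
    for label, texts in archive.items():
        if not texts:
            continue  # a label with no archived text contributes no entry
        total = Counter()
        for text in texts:
            total.update(text_vector(text))
        table[label] = {feature: count / len(texts) for feature, count in total.items()}
    return table


class VerificationModule:
    """Holds the relation table that is consulted when new material arrives."""

    def __init__(self) -> None:
        self.relation_table: dict[str, dict[str, float]] = {}

    def update(self, table: dict[str, dict[str, float]]) -> None:
        self.relation_table = table
```

Rebuilding the table from the archive and calling `update` again after each batch of newly archived material gives the repeated cycle described above; labels that appear in or disappear from the archive are picked up automatically.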

The text material comes from manually calibrated text and automatically calibrated text. For manually calibrated text the correspondence between the text material and its label is exact, whereas automatically calibrated text carries uncertainty, so a penalty mechanism is applied to the automatically calibrated text to limit the noisy text information that enters training. After training, the resulting model is updated to the automatic verification module, and this cycle is repeated continuously to realize uninterrupted model optimization. This self-learning flow is shown in FIG. 4.

The penalty mechanism refers to a mechanism commonly used in machine learning to control regularization and thereby adjust errors; it is usually implemented by a penalty function and its coefficients. The specific penalty mechanism is not limited here, but the objective is the same in every case: to improve how well the machine learning model fits the data so that it makes more accurate inferences.
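Since the source leaves the penalty mechanism open, the following objective is only one possible instance, given as an assumption rather than as the method of the invention. Here D_m is the manually calibrated set, D_a the automatically calibrated set with predicted labels, f_θ the model, ℓ a per-sample loss, R(θ) a regularization term, and λ and μ the penalty coefficients; all of these symbols are introduced for illustration.

```latex
\mathcal{L}(\theta)
  = \sum_{(x,\,y)\in D_m} \ell\bigl(f_\theta(x),\, y\bigr)
  + \lambda \sum_{(x,\,\hat{y})\in D_a} \ell\bigl(f_\theta(x),\, \hat{y}\bigr)
  + \mu\, R(\theta),
  \qquad 0 \le \lambda \le 1,\ \mu \ge 0
```

With λ below 1, a wrongly labeled automatic sample influences training less than a manually confirmed one, which is one way to limit the noise entering training; μ controls ordinary regularization of the model parameters.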

The automatic verification module is the key to realizing semi-supervised learning; its purpose is to obtain the archiving information of the entered material with as little human intervention as possible. It is implemented as follows: the material information acquired in real time is processed, and character strings are extracted from the text material to serve as one basis for classifying the text; since the extracted character strings contain many characters of no value, the characters that can describe the text features need to be selected by means of a designed strategy; the set of character strings that describe the text features is then converted into a feature vector.
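The extraction and conversion steps can be sketched as follows. The source does not specify the "designed strategy", so this sketch assumes the simplest one: keep only tokens found in a fixed vocabulary of descriptive terms and count them. The vocabulary contents and the names `VOCABULARY`, `extract_terms` and `to_feature_vector` are hypothetical, and English tokens are used only to keep the example readable.

```python
import re
from collections import Counter

# Hypothetical vocabulary of terms considered descriptive of a document type.
VOCABULARY = ["license", "contract", "party", "registration", "bank", "card"]


def extract_terms(raw: str) -> list[str]:
    """Keep alphabetic tokens that appear in the vocabulary; drop everything else."""
    tokens = re.findall(r"[a-z]+", raw.lower())
    return [token for token in tokens if token in VOCABULARY]


def to_feature_vector(terms: list[str]) -> list[int]:
    """Count each vocabulary term, in vocabulary order."""
    counts = Counter(terms)
    return [counts[term] for term in VOCABULARY]


if __name__ == "__main__":
    recognized = "Business LICENSE No. 12345 ## registration: 2020; license"
    print(to_feature_vector(extract_terms(recognized)))
```

Digits, punctuation and out-of-vocabulary words play the role of the "characters of no value" here; a production strategy would more likely use weighting or a learned embedding.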

A set of labels may be created in advance that corresponds to the catalog to be archived; simultaneously, constructing feature vectors corresponding to each label in advance to form a feature vector set of a group of labels;

the feature vector generated from the newly entered material can then be compared, one by one, with the feature vectors in the label feature vector set by correlation analysis; this yields a group of labels with stronger correlation, and to achieve higher accuracy a correlation threshold is applied so that only the label with the strongest correlation is selected; if no label meets the condition, a new label may be created or the material may be uniformly assigned to an "other" label;

This process is shown in FIG. 1.
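The correlation analysis with threshold control can be sketched as follows. The source does not name a correlation measure, so cosine similarity is assumed here, and the threshold value, the fallback label name and the function names `cosine` and `choose_label` are illustrative.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def choose_label(vector, label_vectors, threshold=0.8, fallback="other"):
    """Return (label, score) for the most correlated label, or the fallback
    label when even the best score stays below the threshold."""
    best_label, best_score = fallback, 0.0
    for label, reference in label_vectors.items():
        score = cosine(vector, reference)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return fallback, best_score
    return best_label, best_score
```

Raising the threshold trades coverage for accuracy: more materials fall back to the "other" label (or to manual review), but fewer are filed under a wrong label.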

As shown in FIG. 2, the business implemented by the method mainly comprises four parts: data acquisition, information input, archiving management and data storage management. It can be used for archiving texts such as licenses, contracts, letters of authorization, policies and regulations, supporting documents and the like. Wherein,

the data acquisition can be performed in various ways, for example, government affair texts are acquired by means of a high-speed scanner, an electronic material, portable equipment and the like;

the information input stage is mainly divided into two modes of manual audit input and automatic audit input;

the archiving management comprises two conditions of newly-built archiving catalogues and existing archiving catalogue management according to whether the archiving catalogues exist in advance or not;

the data storage management stage mainly comprises data storage, ER indexing, data query and data deletion, so that the data can be conveniently and fully used by other services.

ER index: ER stands for Entity Relationship. It is usually expressed as an entity relationship diagram, which is a way of describing entities, their attributes, and the connections between them. With this approach, relationships are established between the entity handling the business and the various materials, and between the materials themselves; the index that describes these complex relationships is referred to as the ER index.
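A minimal in-memory sketch of such an index is given below. The source does not describe the storage structure, so the class name `ERIndex`, its methods, the `inverse:` prefix and the identifier format in the tests are all assumptions made for illustration.

```python
from collections import defaultdict


class ERIndex:
    """Stores (entity, relation) -> related entities, in both directions."""

    def __init__(self) -> None:
        self._edges = defaultdict(set)

    def relate(self, source: str, relation: str, target: str) -> None:
        self._edges[(source, relation)].add(target)
        # Store the reverse edge so lookups also work from the target side.
        self._edges[(target, f"inverse:{relation}")].add(source)

    def related(self, entity: str, relation: str) -> list[str]:
        """Return the related entities in sorted order (empty if none)."""
        return sorted(self._edges.get((entity, relation), set()))
```

With such an index, a query like "all materials submitted by this applicant" or "who submitted this license" is a single lookup, which is what allows archived data to be reused by other services.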

As shown in FIG. 5, the implementation of the method includes a task scheduling module, a business processing module, a data management module, and an AI service module, wherein,

the task scheduling module is used as a Controller (main control) to coordinate the operation among the modules, including opening or closing an automatic filing mode, opening or closing a self-learning mode and the like;

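the business processing module is responsible for business handling matters, including inputting or selecting labels, scanning materials, storage operations and the like;

the data management module is responsible for adding, deleting, modifying and querying data and coordinating data resources;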

the AI service module mainly provides intelligent computing services such as character recognition, strategy judgment and the like.
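How the task scheduling module might coordinate the other modules can be sketched as follows. The class `TaskScheduler`, its method names and the step strings are hypothetical; the source only states that the controller opens or closes the automatic archiving mode and the self-learning mode.

```python
class TaskScheduler:
    """Controller holding the two mode switches named in the description."""

    def __init__(self) -> None:
        self.auto_archive = False
        self.self_learning = False

    def set_auto_archive(self, enabled: bool) -> None:
        self.auto_archive = enabled

    def set_self_learning(self, enabled: bool) -> None:
        self.self_learning = enabled

    def plan(self) -> list[str]:
        """Return the ordered module calls for one scanned material."""
        steps = ["business.scan"]
        # The label comes from the AI service or from the clerk, depending on mode.
        steps.append("ai.classify" if self.auto_archive else "business.input_label")
        steps.append("data.store")
        if self.self_learning:
            steps.append("ai.self_learn")
        return steps
```

Keeping the mode switches in one controller means the business processing, data management and AI service modules do not need to know about each other, which matches the decoupled design the description aims for.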

The application of the method is specifically described by the implementation process of automatic archiving of government affair texts in a city intelligent approval system as follows:

the project requires that the texts provided by members of the public handling business be archived. Common text materials include identity cards, business licenses, bank cards, contracts, letters of authorization, policies and regulations, supporting documents and the like, and the materials generally arrive as photographs, scanned copies, or electronic materials. The archived materials can be shared as a resource within government systems, which shortens the office workflow of other steps and improves office efficiency.

Generally, the submitted materials are audited and archived one by one through manual audit and entry. As business volume grows, the large amount of material auditing work seriously slows business handling and even introduces a risk of archiving errors, so an intelligent method is needed to audit and archive the submitted materials automatically.

With this method, the existing system is fully reused and then optimized and upgraded. Specifically, the implementation based on the method is divided into three stages: a manual recording stage, a semi-supervised learning stage and an unsupervised learning stage.

In the manual recording stage, material images are recorded according to a conventional method, and a large number of effective government affair text images with labels are gradually accumulated.

In the semi-supervised learning stage, manual input continues while a self-learning model is started, so that manual input and self-learning proceed simultaneously; the accumulated sample images are fully used for classification learning and the recognition accuracy is gradually optimized. Once a certain amount of samples has accumulated, the self-learning function is used to assist manual input.

When a system is developed with this method, four modules can be considered: task scheduling, business processing, data management and AI services. The architecture is designed with decoupling in mind. The tasks of the front-end and back-end engineers are developed in Java, the mainstream language of government affair systems, which keeps the system compatible with other systems. The AI service is developed by algorithm engineers in Python, the mainstream language in that field, to take full advantage of its algorithm ecosystem; it is provided as a web service to the whole system and can be optimized and iterated quickly.

Text recognition technology is widely used in the digital construction of government affair systems, where it improves administrative efficiency and service level. Typical application scenarios include digital processing of government official documents, digital entry of form information, and automatic classification of text and smart city data; in these scenarios it increases the processing speed of government affair systems and reduces their processing cost. Automatic text archiving is a file sorting and management method based on artificial intelligence technology that helps users quickly and accurately identify and archive various documents, pictures, audio and video files. It mainly uses natural language processing, machine learning, deep learning and similar techniques to classify, label and archive documents.

Common machine learning methods are broadly divided into supervised learning and unsupervised learning. Put simply, the distinction is whether the input data carries labels: learning from labeled input data is supervised learning, and learning from unlabeled data is unsupervised learning. Between the two lies semi-supervised learning, in which part of the training data is labeled and the rest is not, and the amount of unlabeled data is often significantly larger than the amount of labeled data (which also matches reality). The basic principle underlying semi-supervised learning is that the distribution of the data is not completely random, so acceptable or even very good classification results can be obtained from the local features of a small amount of labeled data together with the overall distribution of a larger amount of unlabeled data.
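Self-training is one well-known semi-supervised technique and is used here only to illustrate the principle; the source does not state which algorithm the invention uses. In this sketch, one-dimensional features, a nearest-centroid rule, and the names `centroid`, `self_train` and `margin` are all assumptions, and at least two labels are required.

```python
def centroid(values: list[float]) -> float:
    return sum(values) / len(values)


def self_train(labeled, unlabeled, margin=1.0, rounds=5):
    """labeled: dict mapping label -> list of 1-D features (two or more labels).
    unlabeled: list of 1-D features.
    Returns the grown labeled sets and the samples left unlabeled."""
    labeled = {label: list(values) for label, values in labeled.items()}
    pool = list(unlabeled)
    for _ in range(rounds):
        centers = {label: centroid(values) for label, values in labeled.items()}
        remaining = []
        for x in pool:
            ranked = sorted(centers, key=lambda label: abs(x - centers[label]))
            best, second = ranked[0], ranked[1]
            # Accept a pseudo-label only when x is clearly closer to one centroid.
            if abs(x - centers[second]) - abs(x - centers[best]) >= margin:
                labeled[best].append(x)
            else:
                remaining.append(x)
        if len(remaining) == len(pool):
            break  # no new pseudo-labels were accepted in this round
        pool = remaining
    return labeled, pool
```

The `margin` check plays the role of noise control: ambiguous samples stay unlabeled instead of being added to the training data with a possibly wrong label.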

The embodiment of the invention also provides a government affair text archiving system based on semi-supervised learning, which implements the government affair text archiving method based on semi-supervised learning described in the above embodiment. As shown in FIG. 6, the system includes an interactive client, an application server cluster, an AI server cluster, various types of data and database systems, supplementary functional components, and the like.

The interactive client comprises a business hall, a mobile client, a Web client, an administrator client and the like, and provides functions of user information input, reference and the like and an administrator user operation and maintenance function.

The external clients are connected to the gateway cluster through Nginx plus a firewall so as to ensure the security of system information.

An application server cluster and an AI server cluster are provided in the system, the application server cluster is used for realizing the basic function of the system, and the AI server cluster is used for providing AI computing service for the system; the configuration center can realize customization of archiving tasks and control of system operation parameters;

PHP application server cluster: PHP (Hypertext Preprocessor) is a general-purpose open-source scripting language; an application server cluster developed in this language can be highly concurrent and distributed. K8s: short for Kubernetes, so written because the eight letters in the middle of the name are replaced by the digit 8. It is a well-known open-source container-based cluster management platform, used here to build the AI server cluster and to provide Docker (container) management and load balancing. Many other management platforms exist, and a platform suited to the actual services may be chosen during technology selection.

A database service is deployed to provide data storage, insertion, deletion and query operations, and a message queue service and a cache service are deployed to enhance the stability of the system. Kafka is an open-source distributed streaming platform from the Apache Software Foundation; it is a high-throughput, durable, distributed publish-subscribe message queue system. Redis, a free and open-source project distributed under the BSD license at the time of filing, is a high-performance key-value non-relational database and is used here to implement the cache service.

FIG. 7 shows the network architecture of a system developed based on this method. Personal mobile devices and Web clients access the system from the public cloud through the firewall to complete business handling; the business hall, internal dedicated devices and the operation and maintenance client access the system from the private cloud.

The back-end servers comprise an application server (App Server), an AI server (AI Server) and a database server (DB Server), which respectively run the application program interface service (API Service), the AI service (AI Service) and the database (DB) operations according to task type.

The present invention can be easily implemented by those skilled in the art through the above specific embodiments. It should be understood that the invention is not limited to the particular embodiments described above. Based on the disclosed embodiments, a person skilled in the art may combine different technical features at will, so as to implement different technical solutions.

Technical features not described in this specification are known to those skilled in the art.

Claims

1. A government affair text archiving method based on semi-supervised learning is characterized by comprising a manual input stage and an automatic archiving stage, wherein during business handling, a handling person scans materials and inputs or selects labels, and a background program stores the scanned materials under a specified path according to the labels to finish text archiving; in the automatic filing stage, a clerk scans the material, the label to which the material belongs is judged through an automatic verification module, and a background program stores the scanned material under a specified path according to the label;

the text material comes from manually calibrated text and automatically calibrated text; a penalty mechanism is applied to the automatically calibrated text to limit the noisy text information that enters training; after training, the resulting model is updated to the automatic verification module, and this cycle is repeated continuously to realize uninterrupted model optimization.

2. The government affair text archiving method based on semi-supervised learning as set forth in claim 1, wherein the automatic verification module processes the acquired real-time material information, extracts character strings for text materials, and uses the extracted character strings as one of the classifying bases of texts; the set of strings with descriptive text features is then converted into feature vectors.

3. The government affair text filing method based on semi-supervised learning as set forth in claim 2, wherein the processing of the acquired real-time material information extracts a character string for the literal material, and extracts a plurality of characters describing the characteristics of the text through a design strategy.

4. The government affair text filing method based on semi-supervised learning as claimed in claim 2, wherein a set of labels corresponding to the catalogue to be filed is created in advance; simultaneously, constructing feature vectors corresponding to each label in advance to form a feature vector set of a group of labels;

carrying out correlation analysis on feature vectors generated by the just-recorded materials one by one in a feature vector set of the tag; obtaining a group of labels with the highest correlation by controlling the correlation, and if no label meeting the condition exists, selecting to create the label or uniformly define the label as other labels;

5. The government affair text filing method based on semi-supervised learning as set forth in claim 1, 2, 3 or 4, wherein the implementation business of the method includes data collection, information input, filing management and data storage management, and files texts including license class, contract, commission, policy regulation and proving materials; wherein,

data management includes data storage, ER indexing, data querying, data deletion.

6. The government affair text filing method based on semi-supervised learning as set forth in claim 5, wherein the implementation of the method includes a task scheduling module, a business processing module, a data management module and an AI service module, wherein,

the task scheduling module is used as a main control and coordinates the operation among the modules, including an automatic filing mode and a self-learning mode;

7. The government affair text filing method based on semi-supervised learning as set forth in claim 6, wherein the method includes a manual recording stage, a semi-supervised learning stage and an unsupervised learning stage,

in the semi-supervised learning stage, on one hand, manual input is continuously implemented, on the other hand, a self-learning model is started, the accumulated sample pictures are fully utilized for classification learning, and the recognition accuracy is gradually optimized, namely, the manual input and the self-learning are simultaneously carried out; after the accumulation amount is reached, starting a self-learning function to assist manual input;

in the unsupervised learning stage, manual input and archiving are not needed, and a clerk can automatically archive text image materials only by submitting the materials.

8. A government affair text filing system based on semi-supervised learning, characterized in that the system implements the government affair text filing method based on semi-supervised learning as set forth in any one of claims 1 to 7; the system comprises an interactive client, an application server cluster, an AI server cluster, various data and database systems and components for perfecting functions;

the interactive client comprises a service hall, a mobile client, a Web client and an administrator client, and provides a user information input and review function and an administrator user operation and maintenance function;

9. The government affair text filing system based on semi-supervised learning as set forth in claim 8, wherein the clients are connected to the gateway cluster through Nginx plus a firewall;

10. The government affair text filing system based on semi-supervised learning as claimed in claim 9, wherein the back-end server includes an application server, an AI server and a database server, and the application program interface service, the AI service and the database operation task are respectively operated according to different task types.