CN114117215A - Hybrid-mode personalized recommendation system for government affairs data

Info

Publication number: CN114117215A
Application number: CN202111383044.3A
Authority: CN
Inventors: 冉从敬; 黄海瑛; 宋凯; 何梦婷; 李旺; 张逸人; 王福新; 杜娟娟
Original assignee: Shenzhen Research Institute of Wuhan University
Current assignee: Shenzhen Research Institute of Wuhan University
Priority date: 2021-11-22
Filing date: 2021-11-22
Publication date: 2022-03-01

Abstract

The invention discloses a hybrid-mode personalized recommendation system for government affairs data, which comprises a data retrieval and preprocessing module, a topic extraction and text clustering module, a content-based government affairs data recommendation module, a collaborative-filtering-based government affairs data recommendation module, and a hybrid-method-based government affairs data recommendation module.

Description

Hybrid-mode personalized recommendation system for government affairs data

Technical Field

The invention relates to government affairs big data analysis technology, and in particular to a hybrid-mode personalized recommendation system for government affairs data.

Background

A government open data platform is an official platform built by government departments to publish the data they hold to the public. Such a platform works like a one-stop market where the public can obtain the data they need from different government departments. However, faced with a huge and heterogeneous body of government data, data users find it difficult to locate the data that meets their own needs, so the problem of low data utilization efficiency is becoming more severe.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention provides a hybrid-mode personalized recommendation system for government affairs data.

To achieve this purpose, the invention adopts the following technical solution:

a hybrid-mode personalized recommendation system for government affairs data comprises: a data retrieval and preprocessing module, a topic extraction and text clustering module, a content-based government affairs data recommendation module, a collaborative-filtering-based government affairs data recommendation module, and a hybrid-method-based government affairs data recommendation module, wherein:

the data retrieval and preprocessing module performs natural language processing on the data retrieved for a specific industry field and converts the text into word vectors; it then sorts, updates and iterates over the word segmentation data sets to obtain an optimal set of word segmentation results;

the topic extraction and text clustering module extracts topics from the government affairs data with an LDA (Latent Dirichlet Allocation) model to obtain a document-topic probability matrix, and summarizes each topic by its most relevant semantic words;

it calculates the initial cluster centers for the K-means algorithm from the document-topic probability matrix, then performs text clustering with K-means after setting the number of clusters, the initial cluster centers and the number of iterations, thereby partitioning the government affairs data into clusters;

the content-based government affairs data recommendation module identifies the target user's key topics of interest, performs a calculation based on average topic similarity, and aggregates the government affairs data into a content-based recommendation list;

the collaborative-filtering-based government affairs data recommendation module identifies the target user's potential topics of interest, collects the set of other users interested in that field, and recommends the government affairs data those users follow to the target user, forming a collaborative-filtering-based recommendation list;

and the hybrid-method-based government affairs data recommendation module combines the content-based and collaborative filtering recommendation methods, using combination modes such as weighting, mixing and feature combination, to present the optimal recommendation result.

Preferably, in the data retrieval and preprocessing module, the jiebaR package of the R language is used to perform natural language processing such as word segmentation, stop-word removal and word screening on the government affairs data set; the word segmentation result set is then refined through steps such as dictionary updating and multi-round iteration.
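The preprocessing loop described above can be sketched as follows. This is a minimal illustration only: the disclosure names the jiebaR package in R, whereas this sketch swaps in a simple forward-maximum-matching segmenter in Python, and the dictionary, stop-word list and sample title are hypothetical toy data rather than anything taken from the patent.

```python
def segment(text, dictionary, max_len=6):
    """Forward maximum matching: take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def preprocess(text, dictionary, stop_words, min_len=2):
    """Segment, drop stop words, and keep only words of at least min_len characters."""
    return [t for t in segment(text, dictionary)
            if t not in stop_words and len(t) >= min_len]


# Hypothetical toy data, not taken from the patent.
dictionary = {"数据", "开放", "平台"}
stop_words = {"的"}
title = "区块链数据的开放平台"

first_pass = preprocess(title, dictionary, stop_words)
dictionary.add("区块链")  # dictionary update after reviewing the first pass
second_pass = preprocess(title, dictionary, stop_words)
```

In the first pass the domain term "区块链" (blockchain) is split into single characters and screened out; after the dictionary update, the second pass keeps it as one word. Repeating this review-and-update cycle is one plausible reading of the multi-round iteration mentioned in the text.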

Preferably, in the topic extraction and text clustering module, the lda package of the R language is used to extract topics from the experimental corpus; the topic results are visualized with the LDAvis package, the number of topics and the alpha and beta values are adjusted, the optimal number of topics is determined by multidimensional scaling, and the quality of the topic model's extraction results is assessed; the LDA model is then fused with the K-means algorithm: the initial cluster centers over the K topic dimensions are determined from the document-topic probability matrix extracted by LDA, the number of clusters and the number of iterations are set, and the patent text is partitioned into clusters.
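A minimal sketch of seeding K-means from a document-topic probability matrix is given below. The text does not spell out how the initial centers are derived, so the seeding rule here (one center per topic, equal to the mean row of the documents dominated by that topic) is an assumption, and the small matrix is hypothetical. The sketch uses plain Python rather than the R packages or SPSS named in the disclosure.

```python
import math


def seed_centers(doc_topic):
    """One seed per topic: the mean row of the documents dominated by that topic."""
    k = len(doc_topic[0])
    groups = [[] for _ in range(k)]
    for row in doc_topic:
        groups[row.index(max(row))].append(row)
    # Topics that dominate no document contribute no seed.
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups if g]


def kmeans(points, centers, iterations=10):
    """Plain K-means with Euclidean distance and a fixed number of iterations."""
    labels = []
    for _ in range(iterations):
        labels = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical document-topic probabilities (rows: documents, columns: topics).
doc_topic = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
]
labels, centers = kmeans(doc_topic, seed_centers(doc_topic))
```

Because each document receives exactly one label, this matches the hard assignment of each text to a single topic cluster.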

Preferably, in the content-based government affairs data recommendation module, the key topics of interest, i.e. the topics under which the target user has browsed and downloaded the most, are identified, and the government affairs data sets under those topics are aggregated; cosine similarity is calculated over the data sets of each such technical topic, and the results are ranked to form a government affairs data list in descending order of average cosine similarity.
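The content-based step can be sketched as below. The exact scoring is not specified in the text, so this sketch assumes that each unseen data set under the key topic is scored by its average cosine similarity to the data sets the user has already accessed under that topic. All identifiers, topic labels and word-frequency vectors are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine of the angle between two word-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def content_based(history, topic_of, vectors, top_n=3):
    """history: data set ids the user browsed or downloaded (repeats count)."""
    key_topic = Counter(topic_of[d] for d in history).most_common(1)[0][0]
    seen = set(history)
    profile = [vectors[d] for d in seen if topic_of[d] == key_topic]
    scores = {}
    for d, topic in topic_of.items():
        if topic == key_topic and d not in seen:
            scores[d] = sum(cosine(vectors[d], p) for p in profile) / len(profile)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical toy data.
topic_of = {"d1": "smart contracts", "d2": "smart contracts", "d3": "smart contracts",
            "d4": "encryption", "d5": "smart contracts"}
vectors = {"d1": [2, 1, 0], "d2": [1, 1, 0], "d3": [2, 1, 1],
           "d4": [0, 0, 3], "d5": [0, 1, 2]}
history = ["d1", "d1", "d2", "d4"]

content_list = content_based(history, topic_of, vectors)
```

Here "smart contracts" is the key topic because three of the four accesses fall under it, and the two unseen data sets under that topic are returned in descending order of average similarity.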

Preferably, in the collaborative-filtering-based government affairs data recommendation module, the potential topics of interest of the target user are identified, the other interested users under those topics are found, and cosine similarity is calculated and ranked to form a government affairs data list in descending order of average cosine similarity.
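One possible reading of the collaborative filtering step is sketched below. It assumes a user-based scheme: users are compared by the cosine similarity of their per-topic activity counts, and each candidate data set is scored by the average similarity of the interested users who accessed it. The text does not confirm this exact scoring, and the access logs and topic labels are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two activity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def collaborative(target, logs, topic_of, potential_topic, topics):
    """logs: user -> list of accessed data set ids (repeats count)."""
    def activity(user):
        return [sum(1 for d in logs[user] if topic_of[d] == t) for t in topics]

    seen = set(logs[target])
    votes = {}
    for user, items in logs.items():
        if user == target:
            continue
        in_topic = {d for d in items if topic_of[d] == potential_topic}
        if not in_topic:
            continue  # not an interested user for this topic
        sim = cosine(activity(target), activity(user))
        for d in in_topic - seen:
            votes.setdefault(d, []).append(sim)
    scores = {d: sum(s) / len(s) for d, s in votes.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data.
topics = ["A", "B"]
topic_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "b3": "B"}
logs = {
    "target": ["a1", "a2", "b1"],
    "u1": ["a1", "b1", "b2"],
    "u2": ["b2", "b3", "b3"],
    "u3": ["a2"],
}
cf_list = collaborative("target", logs, topic_of, "B", topics)
```

User u3 never touches topic B and is ignored; b2 ranks above b3 because it is backed by the more similar user u1 as well as by u2.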

Preferably, the hybrid-method-based government affairs data recommendation module combines the content-based and collaborative-filtering-based recommendation methods, so that it both reflects the target user's emphasis on the key topics of interest and takes into account the target user's needs regarding the potential topics of interest, thereby forming the optimal recommendation result for the government affairs data.
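Weighting is one of the combination modes the disclosure lists for the hybrid module. A minimal sketch of a weighted merge follows; the weight of 0.6, the function name and the two input lists are illustrative assumptions, since the text does not give concrete weights.

```python
def hybrid(content_list, cf_list, content_weight=0.6, top_n=5):
    """Weighted merge of two (dataset_id, score) lists; a missing score counts as 0."""
    combined = {}
    for weight, ranked in ((content_weight, content_list),
                           (1 - content_weight, cf_list)):
        for dataset_id, score in ranked:
            combined[dataset_id] = combined.get(dataset_id, 0.0) + weight * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical scored lists from the two recommenders.
content_scores = [("d3", 0.9), ("d5", 0.3)]
cf_scores = [("b2", 0.6), ("d5", 0.5)]
final_list = hybrid(content_scores, cf_scores)
```

A data set that appears in both lists, such as d5 here, accumulates both weighted scores, so agreement between the two recommenders raises its rank.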

Compared with the prior art, the invention has the following advantages and beneficial effects:

1. The LDA model is an unsupervised machine learning technique. The invention uses the LDA model to extract the topics of the patent text. The model assumes that words are generated from a mixture of topics, that each topic is a multinomial distribution over a fixed vocabulary, that the topics are shared by all documents in the collection, and that each document has its own topic proportions, sampled from a Dirichlet distribution. As a generative model, it has a complete and clear structure and uses efficient probabilistic inference algorithms to handle large-scale data, which makes it a widely studied and widely used topic identification model.
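The generative assumptions described above are commonly written as follows. The symbols are standard textbook notation rather than notation from the patent: K topics, D documents, N_d words in document d, and Dirichlet parameters alpha and beta. Placing a Dirichlet prior on the topic-word distributions (the smoothed variant of LDA) is an assumption here.

```latex
\begin{align*}
\varphi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d), & n &= 1,\dots,N_d \\
w_{d,n} \mid z_{d,n}, \varphi &\sim \mathrm{Multinomial}(\varphi_{z_{d,n}})
\end{align*}
```

Stacking the fitted vectors \theta_d row by row gives the document-topic probability matrix used for clustering.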

2. The K-means clustering algorithm is an unsupervised learning algorithm and one of the ten classic data mining algorithms. The invention uses K-means to partition the data texts. Because the technical discussion in a data text tends to be specific and in-depth within its technical topic, each data text is assigned to exactly one topic cluster during clustering. Cluster analysis is an important research area in knowledge discovery; it aims to divide a data set into several classes such that differences within a class are small and differences between classes are large. As a partition-based algorithm, K-means is conceptually simple, easy to implement and close to linear in time complexity; it is efficient and scalable for large-scale data mining and is widely applied in text clustering research.

3. Cosine similarity is one of the most widely used similarity measures and is suitable for calculating the similarity between patent texts. Mathematically, it measures the difference between two individuals by the cosine of the angle between their vectors in a vector space; text vectors are built from word-frequency vectors, and text similarity is compared on that basis. Cosine similarity emphasizes the difference in direction between two samples, whereas Euclidean distance is computed from the absolute values of the features in each dimension and therefore requires all dimensions to be on the same scale. For this reason, cosine similarity is used to calculate the data text similarity between enterprises and universities.
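The contrast between the two measures can be made concrete with the standard definitions below, where a and b are word-frequency vectors (the symbols are illustrative and not taken from the patent).

```latex
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
           = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}},
\qquad
d(a, b) = \sqrt{\sum_i (a_i - b_i)^2}
```

As a worked example, take a = (1, 2) and b = (10, 20), i.e. a short and a long text with the same word proportions. Then a \cdot b = 50 and \lVert a \rVert \lVert b \rVert = \sqrt{5}\,\sqrt{500} = 50, so \cos(a, b) = 1, while d(a, b) = \sqrt{9^2 + 18^2} = \sqrt{405} \approx 20.12. The cosine treats the two texts as identical in direction, whereas the Euclidean distance is dominated by the difference in length.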

The core technology of the invention spans the whole process of 'data retrieval - data processing - data storage - data analysis - data application' and covers natural language processing, topic modeling, text clustering, data visualization and more. It is significant for promoting the opening of government affairs data to society, resolving the difficulties in the open utilization of government affairs data, and supporting the construction of public digital culture.

Drawings

FIG. 1 is a block diagram of an embodiment of the present invention.

Detailed Description

To facilitate understanding and practice of the invention by those of ordinary skill in the art, the invention is described in further detail below with reference to the accompanying drawings, it being understood that the present examples are set forth merely to illustrate and explain the invention and are not intended to limit the invention.

As shown in FIG. 1, a hybrid-mode personalized recommendation system for government affairs data includes a data retrieval and preprocessing module, a topic extraction and text clustering module, a content-based government affairs data recommendation module, a collaborative-filtering-based government affairs data recommendation module, and a hybrid-method-based government affairs data recommendation module, wherein:

the data retrieval and preprocessing module takes the open government affairs data platform of a certain province or city as the data source, retrieves the government affairs data of a specific industry field, and cleans the data. It extracts the names of the government affairs data to form an analysis corpus, builds a professional dictionary for the technical field, and performs natural language processing such as word segmentation, stop-word removal and word screening with the jiebaR package of the R language;

the topic extraction and text clustering module performs topic modeling with the lda package of the R language; it visualizes the topic results with the LDAvis package and assesses the quality of the topic model's extraction results based on multidimensional scaling. To keep the topics relatively independent and the similarity between topics small, the number of topics is set to 10, and the alpha and beta values are fixed at 0.02 and 0.7, respectively;

topics are extracted with the LDA model to obtain the most relevant semantic words under each topic, and the topics are summarized; the document-topic probability matrix is trained, and the number of clusters and the initial cluster centers for the K-means algorithm are calculated; the document-topic probability matrix is imported into SPSS, and the number of clusters and the initial cluster centers are set to obtain the text clustering result for the government affairs data;

the government affairs data recommendation module analyzes the number of times the target user has browsed and downloaded government affairs data under each technical topic, determines the target user's key topics of interest and potential topics of interest, and identifies the other interested users under the potential topics of interest; based on the analysis results, it forms a government affairs data recommendation list for the target user using a hybrid of content-based and collaborative filtering recommendation.

One example of use is:

the specific implementation process is as follows:

(1) Data retrieval and preprocessing: blockchain is taken as the industry field under study, and "blockchain" is used as the search term on the Shanghai public data open platform to obtain the related texts, policies and data; Shen technology (Shenzhen) Limited company is selected as the target user;

(2) Topic modeling and text clustering: when extracting the industry field topics, the 10 most relevant words are extracted for each of the 10 topics and used to summarize them; when clustering the government affairs data, topic 3 is eliminated, leaving 9 important topics; the initial cluster centers (0.721157151, 0.724248556, 0.713041588, 0.733758854, 0.72711089, 0.736014371, 0.703095687, 0.702814238, 0.69800075 and 0.734391872) of the 9 topics (blockchain deployment, medical industry application, smart contracts, identity authentication, encryption technology, consensus mechanism, data traceability, Token, and blockchain finance) are calculated for the K-means algorithm; the document-topic probability matrix is trained and imported into SPSS, and the number of clusters and the initial cluster centers are set to obtain the distribution of government affairs data across the topics;

(3) Government affairs data recommendation and display: based on the text clustering result for the government affairs data, the 4 topics with the most downloads and browses are determined to be the key topics of interest of the target user selected in step (1); the other topics, over which 6 patents are sparsely distributed, are determined to be potential topics of interest, and the other interested users with the most browses and downloads under the potential topics of interest are identified; the content-based recommendation built on the key topics of interest is then fused with the collaborative filtering recommendation built on the other interested users to form the final government affairs data recommendation list.
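The split into key topics of interest and potential topics of interest in step (3) can be sketched as below. The topic names follow the nine topics listed in step (2); the activity counts are hypothetical, and the simple top-N rule is an assumption about how the four key topics are chosen.

```python
def split_topics(activity, n_key=4):
    """activity: topic -> browse + download count. Returns (key, potential) lists."""
    ranked = sorted(activity, key=activity.get, reverse=True)
    return ranked[:n_key], ranked[n_key:]


# Hypothetical browse + download counts for the target user.
activity = {
    "blockchain deployment": 12,
    "medical industry application": 3,
    "smart contracts": 20,
    "identity authentication": 15,
    "encryption technology": 9,
    "consensus mechanism": 2,
    "data traceability": 1,
    "Token": 4,
    "blockchain finance": 5,
}
key_topics, potential_topics = split_topics(activity)
```

The key topics would then feed the content-based recommender and the potential topics the collaborative filtering recommender, before the two lists are fused.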

It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A hybrid-mode personalized recommendation system for government affairs data, characterized by comprising:

a data retrieval and preprocessing module, a topic extraction and text clustering module, a content-based government affairs data recommendation module, a collaborative-filtering-based government affairs data recommendation module, and a hybrid-method-based government affairs data recommendation module, wherein:

the content-based government affairs data recommendation module identifies the target user's key topics of interest, performs a calculation based on average topic similarity, and aggregates the government affairs data into a content-based recommendation list; the collaborative-filtering-based government affairs data recommendation module identifies the target user's potential topics of interest, collects the set of other users interested in that field, and recommends the government affairs data those users follow to the target user, forming a collaborative-filtering-based recommendation list;

2. The hybrid-mode personalized recommendation system for government affairs data according to claim 1, wherein in the data retrieval and preprocessing module, the jiebaR package of the R language is used to perform natural language processing such as word segmentation, stop-word removal and word screening on the government affairs data set; and the word segmentation result set is refined through steps such as dictionary updating and multi-round iteration.

3. The hybrid-mode personalized recommendation system for government affairs data according to claim 1, wherein in the topic extraction and text clustering module, the lda package of the R language is used to extract topics from the experimental corpus; the topic results are visualized with the LDAvis package, the number of topics and the alpha and beta values are adjusted, the optimal number of topics is determined by multidimensional scaling, and the quality of the topic model's extraction results is assessed; and the LDA model is fused with the K-means algorithm: the initial cluster centers over the K topic dimensions are determined from the document-topic probability matrix extracted by LDA, the number of clusters and the number of iterations are set, and the patent text is partitioned into clusters.

4. The hybrid-mode personalized recommendation system for government affairs data according to claim 1, wherein in the content-based government affairs data recommendation module, the key topics of interest, i.e. the topics under which the target user has browsed and downloaded the most, are identified, and the government affairs data sets under those topics are aggregated; and cosine similarity is calculated over the data sets of each such technical topic, and the results are ranked to form a government affairs data list in descending order of average cosine similarity.

5. The hybrid-mode personalized recommendation system for government affairs data according to claim 1, wherein in the collaborative-filtering-based government affairs data recommendation module, the potential topics of interest of the target user are identified, the other interested users under those topics are found, and cosine similarity is calculated and ranked to form a government affairs data list in descending order of average cosine similarity.

6. The hybrid-mode personalized recommendation system for government affairs data according to claim 1, wherein the hybrid-method-based government affairs data recommendation module combines the content-based and collaborative-filtering-based recommendation methods, so that it both reflects the target user's emphasis on the key topics of interest and takes into account the target user's needs regarding the potential topics of interest, thereby forming the optimal recommendation result for the government affairs data.