CN114117215A - Government affair data personalized recommendation system based on mixed mode - Google Patents

Info

Publication number
CN114117215A
Authority
CN
China
Prior art keywords
government affair
affair data
data
recommendation
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111383044.3A
Other languages
Chinese (zh)
Inventor
冉从敬
黄海瑛
宋凯
何梦婷
李旺
张逸人
王福新
杜娟娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Research Institute of Wuhan University
Original Assignee
Shenzhen Research Institute of Wuhan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-11-22
Filing date
2021-11-22
Publication date
2022-03-01
Application filed by Shenzhen Research Institute of Wuhan University
Priority to CN202111383044.3A
Publication of CN114117215A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Tourism & Hospitality (AREA)
  • Software Systems (AREA)
  • Economics (AREA)
  • Educational Administration (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Development Economics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a mixed-mode-based personalized recommendation system for government affair data, which comprises a data retrieval and preprocessing module, a topic extraction and text clustering module, a content-based government affair data recommendation module, a collaborative-filtering-based government affair data recommendation module and a hybrid-method-based government affair data recommendation module.

Description

Government affair data personalized recommendation system based on mixed mode
Technical Field
The invention relates to a government affair big data analysis technology, in particular to a government affair data personalized recommendation system based on a mixed mode.
Background
A government open data platform is an official platform built by government departments to publish the data they hold to the public. Such a platform works like a one-stop market where the public can obtain the data they need from different government departments. Faced with massive and heterogeneous government data, however, data users find it difficult to locate the data that meets their own needs, so the problem of low data utilization efficiency becomes more severe.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a government affair data personalized recommendation system based on a mixed mode.
In order to achieve the purpose, the invention adopts the following technical scheme:
A mixed-mode-based government affair data personalized recommendation system comprises a data retrieval and preprocessing module, a topic extraction and text clustering module, a content-based government affair data recommendation module, a collaborative-filtering-based government affair data recommendation module and a hybrid-method-based government affair data recommendation module, wherein:
the data retrieval and preprocessing module performs natural language processing on the retrieved data of a specific industry field and converts the text into word vectors; the word segmentation data set is sorted, updated and iterated to obtain an optimal word segmentation result set;
the topic extraction and text clustering module extracts topics from the government affair data with an LDA (latent Dirichlet allocation) model to obtain a document-topic probability matrix, and summarizes each topic by its most relevant semantic words;
it then calculates the initial cluster centers of the K-means algorithm from the document-topic probability matrix and performs text clustering with the K-means algorithm, setting the number of clusters, the initial cluster centers and the number of iterations, to partition the government affair data into clusters;
the content-based government affair data recommendation module identifies the target user's key topics of interest, performs a calculation based on average topic similarity, and aggregates the government affair data into a content-based government affair data recommendation list;
the collaborative-filtering-based government affair data recommendation module identifies the target user's potential topics of interest, collects the other users interested in that field, and recommends the government affair data those users follow to the target user, forming a collaborative-filtering-based government affair data recommendation list;
and the hybrid-method-based government affair data recommendation module combines the content-based and collaborative-filtering-based recommendation methods, using combination modes such as weighting, mixing and feature combination, to present the optimal recommendation result.
Preferably, in the data retrieval and preprocessing module, the jiebaR package of the R language is used to perform natural language processing on the government affair data set, such as text word segmentation, stop-word removal and word screening; the word segmentation result set is then optimized and sorted through dictionary updating and multiple rounds of iteration.
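As a rough illustration of the preprocessing steps just described (segmentation against a domain dictionary, stop-word removal, word screening, and re-segmentation after a dictionary update), the following Python sketch may help. It does not use jiebaR: a simple forward-maximum-matching segmenter stands in for it, and the dictionary, stop-word list, sample titles and function names are all hypothetical.

```python
def segment(text, dictionary, max_len=6):
    """Forward maximum matching: at each position take the longest dictionary word.

    This is a stand-in for a real segmenter such as jiebaR, not its algorithm.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def preprocess(titles, dictionary, stop_words, min_len=2):
    """Segment each title, drop stop words, and screen out very short tokens."""
    corpus = []
    for title in titles:
        tokens = segment(title, dictionary)
        corpus.append([t for t in tokens if t not in stop_words and len(t) >= min_len])
    return corpus


# Hypothetical domain dictionary, stop words and dataset titles.
dictionary = {"区块链", "政务", "数据", "开放", "平台"}
stop_words = {"的", "和"}
titles = ["区块链政务数据", "政务数据的开放平台"]

corpus = preprocess(titles, dictionary, stop_words)

# One round of "dictionary updating": add a compound term and segment again.
updated = preprocess(titles, dictionary | {"政务数据"}, stop_words)
```

Adding the compound term to the dictionary changes the segmentation of both titles, which is the kind of effect the repeated update-and-iterate step is presumably meant to exploit.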
Preferably, in the topic extraction and text clustering module, the lda package of the R language is used to extract topics from the experimental corpus; the LDAvis package is used to visualize the topics, the number of topics and the alpha and beta values are adjusted, the optimal number of topics is determined by multidimensional scaling, and the quality of the topic model output is assessed; the LDA model is then combined with the K-means algorithm: initial cluster centers over the K topic dimensions are determined from the document-topic probability matrix extracted by LDA, the number of clusters and the number of iterations are set, and the patent text is partitioned into clusters.
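The text does not state how the initial cluster centers are derived from the document-topic probability matrix. The Python sketch below shows one plausible reading, under the assumption that each topic's starting center is the mean row of the documents dominated by that topic; the toy matrix and all function names are invented for illustration, and plain K-means is written out by hand rather than taken from SPSS or any library.

```python
def initial_centers(doc_topic):
    """Derive one starting centre per topic from the document-topic matrix.

    Assumption: a topic's centre is the mean row of the documents whose
    highest probability falls on that topic.
    """
    n_topics = len(doc_topic[0])
    groups = [[] for _ in range(n_topics)]
    for row in doc_topic:
        groups[row.index(max(row))].append(row)
    # Topics that dominate no document yield no centre.
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups if g]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(rows, centers, iterations=10):
    """Plain K-means with caller-supplied initial centres; returns (labels, centres)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each document goes to exactly one nearest centre.
        labels = [
            min(range(len(centers)), key=lambda c: squared_distance(row, centers[c]))
            for row in rows
        ]
        # Update step: move each centre to the mean of its members.
        updated = []
        for c, old in enumerate(centers):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            updated.append([sum(col) / len(members) for col in zip(*members)] if members else old)
        if updated == centers:
            break
        centers = updated
    return labels, centers


# Hypothetical document-topic probabilities (5 documents, 3 topics).
doc_topic = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
]
labels, centers = kmeans(doc_topic, initial_centers(doc_topic))
```

Seeding the centers from the topic model rather than at random makes the clustering deterministic and keeps each cluster aligned with one topic, which seems to be the point of fusing the two methods.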
Preferably, in the content-based government affair data recommendation module, the key topics of interest, i.e. those with the largest numbers of views and downloads by the target user, are counted and the government affair data sets under these topics are aggregated; cosine similarity is then calculated over the government affair data sets of each technical topic, and the results are ranked to form a government affair data list in descending order of average cosine similarity.
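A minimal Python sketch of this ranking step follows. It assumes that "average cosine similarity" means the mean similarity between a candidate data set and the data sets the target user has already viewed under a key topic; that interpretation, the word-frequency vectors and the names used are assumptions, not details given in the text.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two word-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def content_based(candidates, viewed):
    """Rank candidates by mean cosine similarity to the user's viewed data sets."""
    scores = {
        name: sum(cosine(vec, seen) for seen in viewed) / len(viewed)
        for name, vec in candidates.items()
    }
    # Descending score; names break ties so the order is reproducible.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical word-frequency vectors over a three-word vocabulary.
viewed = [[2, 1, 0], [1, 1, 0]]
candidates = {"A": [1, 1, 0], "B": [0, 0, 3], "C": [1, 0, 1]}

ranking = content_based(candidates, viewed)
```

Here data set A shares most of its vocabulary with the viewed items and ranks first, while B shares none and scores zero.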
Preferably, in the collaborative-filtering-based government affair data recommendation module, the potential topics of interest of the target user are counted, the other interested users under those topics are found, and cosine similarity is calculated and ranked to form a government affair data list in descending order of average cosine similarity.
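The collaborative filtering step could be sketched in Python as below. The scoring rule is an assumption: each data set is scored by the average cosine similarity between the target user and the interested users who accessed it, where user vectors hold per-topic activity counts. The users, counts and data set identifiers are invented.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two per-topic activity vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def collaborative(target, topic_counts, accessed, potential_topic):
    """Recommend data sets used by other users active under the potential topic."""
    # Other users with any activity under the target's potential-interest topic.
    peers = [
        user for user, counts in topic_counts.items()
        if user != target and counts[potential_topic] > 0
    ]
    sims = {user: cosine(topic_counts[target], topic_counts[user]) for user in peers}
    per_item = {}
    for user in peers:
        # Skip data sets the target user has already accessed.
        for item in accessed[user] - accessed[target]:
            per_item.setdefault(item, []).append(sims[user])
    scores = {item: sum(vals) / len(vals) for item, vals in per_item.items()}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical views-plus-downloads counts per topic (3 topics) for each user.
topic_counts = {
    "target": [5, 1, 0],
    "u1": [4, 2, 0],
    "u2": [0, 3, 4],
    "u3": [6, 0, 0],
}
accessed = {
    "target": {"d1"},
    "u1": {"d1", "d2", "d3"},
    "u2": {"d3", "d4"},
    "u3": {"d5"},
}

recommended = collaborative("target", topic_counts, accessed, potential_topic=1)
```

User u3 has no activity under topic 1 and is therefore ignored, so d5 never enters the list.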
Preferably, the hybrid-method-based government affair data recommendation module combines the content-based and collaborative-filtering-based recommendation methods, so that it both emphasizes the target user's key topics of interest and accommodates the user's needs on potential topics of interest, forming the optimal government affair data recommendation result.
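Of the combination modes the text names (weighting, mixing, feature combination), the weighted one is the simplest to sketch. The Python below assumes a linear blend of the two score lists; the weight value, the treatment of missing items as zero, and the sample scores are illustrative assumptions only.

```python
def hybrid(content_scores, cf_scores, weight=0.5, top_n=5):
    """Blend two score tables: weight * content + (1 - weight) * collaborative."""
    items = set(content_scores) | set(cf_scores)
    merged = {
        # An item missing from one list contributes 0 from that list.
        item: weight * content_scores.get(item, 0.0)
        + (1 - weight) * cf_scores.get(item, 0.0)
        for item in items
    }
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical scores from the content-based and collaborative modules.
content_scores = {"d1": 0.9, "d2": 0.5}
cf_scores = {"d2": 0.8, "d3": 0.6}

final_list = hybrid(content_scores, cf_scores)
```

Raising `weight` favors the key topics of interest, while lowering it gives more room to items surfaced through potential topics of interest.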
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The LDA model is an unsupervised machine learning technique. The invention adopts the LDA model to extract the topics of the patent text. The model assumes that words are generated from a mixture of topics and that each topic is a multinomial distribution over a fixed vocabulary; the topics are shared by all documents in the collection, and each document has its own topic proportions, sampled from a Dirichlet distribution. As a generative model, LDA has a complete and clear structure and uses efficient probabilistic inference algorithms to process large-scale data, which makes it a widely studied and widely used topic identification model.
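For reference, the generative assumptions in the paragraph above are commonly written as follows. This is the standard smoothed form of LDA, not a formula taken from the source; the symbols K (number of topics), D (number of documents), N_d (words in document d), φ, θ, z and w are notation introduced here, and it is assumed that the alpha and beta values mentioned elsewhere in the text play the roles of the two Dirichlet parameters.

```latex
\begin{aligned}
\varphi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d), & n &= 1,\dots,N_d \\
w_{d,n} \mid z_{d,n}, \varphi &\sim \mathrm{Multinomial}(\varphi_{z_{d,n}})
\end{aligned}
```

Under this notation, stacking the vectors θ_d row by row gives the document-topic probability matrix that the clustering step consumes.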
2. The K-means clustering algorithm is an unsupervised learning algorithm and one of the ten classic data mining algorithms. The invention adopts the K-means algorithm to partition the data texts. Because the technical discussion in a data text is specific and in-depth, each data text is assigned to exactly one topic cluster during clustering. Cluster analysis is an important research topic in knowledge discovery; it aims to divide a data set into several classes so that intra-class differences are small and inter-class differences are large. As a partition-based algorithm, K-means is simple in concept, easy to implement and close to linear in time complexity; it is efficient and scalable for large-scale data mining and is widely used in text clustering research.
3. Cosine similarity is one of the most widely applied similarity measures and is suitable for calculating the similarity between patent texts. Mathematically, it measures the difference between two individuals by the cosine of the angle between two vectors in a vector space; text vectors are constructed from word frequency vectors and then compared. Cosine similarity emphasizes the difference in direction between two samples, whereas Euclidean distance is computed from the absolute values of each dimension and therefore requires all dimensions to be on the same scale. For this reason, cosine similarity is adopted to calculate the data text similarity between enterprises and colleges and universities.
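The contrast drawn above can be made concrete with the two textbook definitions, written here for n-dimensional word-frequency vectors u and v (the symbols are introduced for illustration and do not appear in the source):

```latex
\cos(u, v) = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2}\,\sqrt{\sum_{i=1}^{n} v_i^2}},
\qquad
d(u, v) = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}
```

As a worked example, take u = (1, 2, 3) and v = (2, 4, 6), i.e. the same word proportions in a text twice as long. Then cos(u, v) = 28 / (√14 · √56) = 1, while d(u, v) = √14 ≈ 3.74: the cosine treats the two texts as identical in direction, whereas the Euclidean distance is driven by the difference in length.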
The core technology of the invention runs through the whole process of 'data retrieval, data processing, data storage, data analysis, data application', covering natural language processing, topic modeling, text clustering, data visualization and so on. It is significant for promoting the opening of government affair data to society, resolving the difficulties in the open utilization of government affair data, and supporting the construction of public digital culture.
Drawings
FIG. 1 is a block diagram of an embodiment of the present invention.
Detailed Description
To facilitate understanding and practice of the invention by those of ordinary skill in the art, the invention is described in further detail below with reference to the accompanying drawings, it being understood that the present examples are set forth merely to illustrate and explain the invention and are not intended to limit the invention.
As shown in FIG. 1, a mixed-mode-based government affair data personalized recommendation system includes a data retrieval and preprocessing module, a topic extraction and text clustering module, a content-based government affair data recommendation module, a collaborative-filtering-based government affair data recommendation module, and a hybrid-method-based government affair data recommendation module, wherein:
the data retrieval and preprocessing module takes the government affair data open platform of a certain province or city as the data source, retrieves government affair data in a specific industry field and cleans the data; it extracts the names of the government affair data to form an analysis corpus, constructs a professional dictionary for the technical field, and performs natural language processing such as word segmentation, stop-word removal and word screening with the jiebaR package of the R language;
the topic extraction and text clustering module performs topic modeling with the lda package of the R language; it visualizes the topic results with the LDAvis package and assesses the quality of the topic model output based on multidimensional scaling. To keep the topics relatively independent and the similarity between topics small, the number of topics is set to 10, and the alpha and beta values are fixed at 0.02 and 0.7;
topics are extracted with the LDA model to obtain the most relevant semantic words under each topic, and the topics are summarized; the document-topic probability matrix is trained, and the number of clusters and the initial cluster centers of the K-means algorithm are calculated; the document-topic probability matrix is imported into SPSS, and the number of clusters and the initial cluster centers are set to obtain the text clustering result of the government affair data;
the content-based government affair data recommendation module identifies the target user's key topics of interest, performs a calculation based on average topic similarity, and aggregates the government affair data into a content-based government affair data recommendation list;
the collaborative-filtering-based government affair data recommendation module identifies the target user's potential topics of interest, collects the other users interested in that field, and recommends the government affair data those users follow to the target user, forming a collaborative-filtering-based government affair data recommendation list;
and the hybrid-method-based government affair data recommendation module combines the content-based and collaborative-filtering-based recommendation methods, using combination modes such as weighting, mixing and feature combination, to present the optimal recommendation result.
The government affair data recommendation modules analyze the number of views and downloads of government affair data by the target user under each technical topic, determine the target user's key topics of interest and potential topics of interest, and count the other interested users under the potential topics of interest; based on the analysis result, a government affair data recommendation list is formed for the target user using the hybrid recommendation mode based on content and collaborative filtering.
One example of use is as follows.
The specific implementation process is:
(1) Data retrieval and preprocessing: the block chain is taken as the industry field under study, 'block chain' is used as the search term on the public data open platform of Shanghai to obtain related texts, policies and data, and the Shen technology (Shenzhen) Limited company is selected as the target user;
(2) Topic modeling and text clustering: when extracting the industry-field topics, the 10 most relevant words are extracted for each of the 10 topics to summarize them; when clustering the government affair data, topic 3 is eliminated, leaving 9 important topics; the initial cluster centers (0.721157151, 0.724248556, 0.713041588, 0.733758854, 0.72711089, 0.736014371, 0.703095687, 0.702814238, 0.69800075 and 0.734391872) of the 9 topics (block chain deployment, medical industry application, intelligent contracts, identity authentication, encryption technology, consensus mechanism, data tracing, Token and block chain finance) are calculated for the K-means algorithm, the document-topic probability matrix is trained and imported into SPSS, and the number of clusters and the initial cluster centers are set to obtain the distribution of government affair data under each topic;
(3) Government affair data recommendation and display: according to the text clustering result of the government affair data, the 4 topics with the most downloads and views are determined to be the key topics of interest of the safe science and technology, the other topics, across which 6 patents are sparsely distributed, are determined to be potential topics of interest, and the other interested users with the most views and downloads under the potential topics of interest are counted; the content-based recommendation built on the key topics of interest and the collaborative filtering recommendation built on the other interested users are then fused to form the final government affair data recommendation list.
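The split into key and potential topics in step (3) can be sketched in Python as below. The topic names are taken from the example, but the view and download counts are invented purely for illustration, and ranking by the simple sum of views and downloads is an assumption about how "a large number of downloads and browses" is measured.

```python
def split_topics(activity, n_key=4):
    """Rank topics by views + downloads; the top n_key are key topics, the rest potential."""
    totals = {topic: views + downloads for topic, (views, downloads) in activity.items()}
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    return ranked[:n_key], ranked[n_key:]


# Invented (views, downloads) counts for the target user; not data from the source.
activity = {
    "intelligent contracts": (40, 12),
    "identity authentication": (35, 9),
    "encryption technology": (5, 1),
    "consensus mechanism": (22, 7),
    "data tracing": (3, 0),
    "block chain finance": (30, 10),
}

key_topics, potential_topics = split_topics(activity)
```

The key topics would then feed the content-based module and the potential topics the collaborative filtering module, before the two lists are fused.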
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A government affair data personalized recommendation system based on a mixed mode, characterized by comprising:
a data retrieval and preprocessing module, a topic extraction and text clustering module, a content-based government affair data recommendation module, a collaborative-filtering-based government affair data recommendation module and a hybrid-method-based government affair data recommendation module, wherein:
the data retrieval and preprocessing module performs natural language processing on the retrieved data of a specific industry field and converts the text into word vectors; the word segmentation data set is sorted, updated and iterated to obtain an optimal word segmentation result set;
the topic extraction and text clustering module extracts topics from the government affair data with an LDA (latent Dirichlet allocation) model to obtain a document-topic probability matrix, and summarizes each topic by its most relevant semantic words;
it then calculates the initial cluster centers of the K-means algorithm from the document-topic probability matrix and performs text clustering with the K-means algorithm, setting the number of clusters, the initial cluster centers and the number of iterations, to partition the government affair data into clusters;
the content-based government affair data recommendation module identifies the target user's key topics of interest, performs a calculation based on average topic similarity, and aggregates the government affair data into a content-based government affair data recommendation list; the collaborative-filtering-based government affair data recommendation module identifies the target user's potential topics of interest, collects the other users interested in that field, and recommends the government affair data those users follow to the target user, forming a collaborative-filtering-based government affair data recommendation list;
and the hybrid-method-based government affair data recommendation module combines the content-based and collaborative-filtering-based recommendation methods, using combination modes such as weighting, mixing and feature combination, to present the optimal recommendation result.
2. The government affair data personalized recommendation system based on the mixed mode according to claim 1, wherein in the data retrieval and preprocessing module, the jiebaR package of the R language is used to perform natural language processing on the government affair data set, such as text word segmentation, stop-word removal and word screening; and the word segmentation result set is optimized and sorted through dictionary updating and multiple rounds of iteration.
3. The government affair data personalized recommendation system based on the mixed mode according to claim 1, wherein in the topic extraction and text clustering module, the lda package of the R language is used to extract topics from the experimental corpus; the LDAvis package is used to visualize the topics, the number of topics and the alpha and beta values are adjusted, the optimal number of topics is determined by multidimensional scaling, and the quality of the topic model output is assessed; and the LDA model is combined with the K-means algorithm: initial cluster centers over the K topic dimensions are determined from the document-topic probability matrix extracted by LDA, the number of clusters and the number of iterations are set, and the patent text is partitioned into clusters.
4. The government affair data personalized recommendation system based on the mixed mode according to claim 1, wherein in the content-based government affair data recommendation module, the key topics of interest with the largest numbers of views and downloads by the target user are counted, and the government affair data sets under the key topics of interest are aggregated; and cosine similarity is calculated over the government affair data sets of each technical topic and ranked to form a government affair data list in descending order of average cosine similarity.
5. The government affair data personalized recommendation system based on the mixed mode according to claim 1, wherein in the collaborative-filtering-based government affair data recommendation module, the potential topics of interest of the target user are counted, the other interested users under those topics are found, and cosine similarity is calculated and ranked to form a government affair data list in descending order of average cosine similarity.
6. The government affair data personalized recommendation system based on the mixed mode according to claim 1, wherein the hybrid-method-based government affair data recommendation module combines the content-based and collaborative-filtering-based government affair data recommendation methods, so as to both emphasize the target user's key topics of interest and accommodate the target user's needs on potential topics of interest, thereby forming the optimal government affair data recommendation result.
CN202111383044.3A 2021-11-22 2021-11-22 Government affair data personalized recommendation system based on mixed mode Pending CN114117215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111383044.3A CN114117215A (en) 2021-11-22 2021-11-22 Government affair data personalized recommendation system based on mixed mode

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111383044.3A CN114117215A (en) 2021-11-22 2021-11-22 Government affair data personalized recommendation system based on mixed mode

Publications (1)

Publication Number Publication Date
CN114117215A true CN114117215A (en) 2022-03-01

Family

ID=80439033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111383044.3A Pending CN114117215A (en) 2021-11-22 2021-11-22 Government affair data personalized recommendation system based on mixed mode

Country Status (1)

Country Link
CN (1) CN114117215A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115098596A (en) * 2022-05-25 2022-09-23 开普数智科技(广东)有限公司 Government affair related data combing method, device and equipment and readable storage medium

Similar Documents

Publication Publication Date Title
Negara et al. Topic modelling twitter data with latent dirichlet allocation method
CN110297988B (en) Hot topic detection method based on weighted LDA and improved Single-Pass clustering algorithm
Shi et al. Learning-to-rank for real-time high-precision hashtag recommendation for streaming news
CN111797214A (en) FAQ database-based problem screening method and device, computer equipment and medium
CN113434636B (en) Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium
CN111078852A (en) College leading-edge scientific research team detection system based on machine learning
CN110532378B (en) Short text aspect extraction method based on topic model
CN111125297B (en) Massive offline text real-time recommendation method based on search engine
Jiang et al. A unified neural network approach to e-commerce relevance learning
Henderi et al. Unsupervised Learning Methods for Topic Extraction and Modeling in Large-scale Text Corpora using LSA and LDA
CN114117215A (en) Government affair data personalized recommendation system based on mixed mode
CN111259110A (en) College patent personalized recommendation system
Wawrzinek et al. Semantic facettation in pharmaceutical collections using deep learning for active substance contextualization
CN106951511A (en) A kind of Text Clustering Method and device
Parida et al. Ranking of Odia text document relevant to user query using vector space model
Xu et al. Research on Tibetan hot words, sensitive words tracking and public opinion classification
Mahdipour et al. Automatic Persian text summarizer using simulated annealing and genetic algorithm
CN113934910A (en) Automatic optimization and updating theme library construction method and hot event real-time updating method
Quemy European court of human right open data project
Pu et al. A semantic-based short-text fast clustering method on hotline records in Chengdu
CN111259150A (en) Document representation method based on word frequency co-occurrence analysis
Ahmed et al. Clustering technique on search engine dataset using data mining tool
TWI290684B (en) Incremental thesaurus construction method
Edi Topic modelling Twitter data with latent Dirichlet allocation method
Nikitinsky et al. An information retrieval system for technology analysis and forecasting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination