Big data platform for realizing tax public opinion analysis and full text retrieval
Technical Field
The invention relates to the technical field of big data, and in particular to a highly practical big data platform that realizes tax public opinion analysis and full-text retrieval.
Background
To make full use of Internet resources on tax public opinion and to build an "Internet Plus" big data platform for tax public opinion, users need a complete platform framework design that covers collecting, processing, analyzing, storing, indexing and searching public opinion big data.
Disclosure of Invention
The technical task of the invention is to address the above shortcomings by providing a highly practical big data platform that realizes tax public opinion analysis and full-text retrieval.
A big data platform for realizing tax public opinion analysis and full-text retrieval comprises, from bottom to top, a basic data layer, a big data acquisition layer, a big data storage layer, a big data analysis layer, a big data index layer, a big data search layer and a big data application layer, wherein: the basic data layer holds the basic data sources related to public opinion in the tax industry; the big data acquisition layer provides the user with public opinion acquisition modes to complete information acquisition for the basic data layer; the big data analysis layer processes, analyzes and mines the big data collected by the big data acquisition layer and then sends it to the big data storage layer; the big data storage layer stores the public opinion webpage big data and the basic data; the big data index layer builds an inverted index over the data stored in the big data storage layer and provides the big data search layer with a data source for fast searching; the big data search layer provides a full-text search engine for the big data; and the big data application layer puts the retrieved tax public opinion big data to effective use.
The basic data sources in the basic data layer comprise a taxpayer list library, a tax authority unit library, a policy and regulation library and a tax type library; these sources provide the basis on which the whole platform performs targeted collection of online tax public opinion information.
The acquisition modes provided by the big data acquisition layer comprise targeted acquisition, a web crawler and an acquisition rule configuration module, wherein targeted acquisition refers to the targeted collection of basic data; the web crawler captures information from public opinion websites and is a deep web crawler that crawls seed websites; and the acquisition rule configuration module provides the function of configuring basic webpage acquisition rules.
The big data analysis layer comprises three parts: preprocessing, word segmentation and sentiment analysis. Preprocessing removes tags from, extracts text from and eliminates noise in the collected webpages; word segmentation performs word segmentation, stop-word removal, entity recognition and word vector construction on the text; and sentiment analysis analyzes the text through sentiment information extraction and sentiment information classification and judges the sentiment tendency of the text, namely positive, negative or neutral.
The big data storage layer comprises a public opinion webpage information base and a basic resource base. The public opinion webpage information base stores unstructured public opinion information, comprising the collected original webpages, processed webpages, pictures, videos, style files and extracted webpage content. The basic resource base supplements the information of the four basic data sources; the supplementary information comprises taxpayers' official websites and news, tax authority units' official websites and news, the original-text webpages and file information of policies and regulations, and network resources such as tax encyclopedias.
The big data index layer comprises incremental indexing, full indexing, index addition and deletion, and index updating, wherein incremental indexing builds an index incrementally for a data source; full indexing rebuilds the entire index for a data source; index addition and deletion adds entries to and deletes entries from the index; and index updating updates an index that has already been built.
The big data search layer comprises a full-text search module, a synonym search module, a grouping statistics module and a fuzzy matching module, wherein the full-text search module matches a query against every character and every word of the indexed content and returns the query results; the synonym search module, when a word is queried, simultaneously queries the synonyms of that word; the grouping statistics module provides grouped statistics of the search results by type or by another specified classification; and the fuzzy matching module segments the search phrase into words and then performs the query.
The big data application layer comprises classified public opinion, negative public opinion, public opinion search and public opinion reports, wherein classified public opinion is based on the four basic types of public opinion information (taxpayers, authority units, policies and regulations, and tax types), each subdivided into positive, negative and neutral subclasses; negative public opinion provides real-time monitoring and tracking of negative public opinion; public opinion search provides users with multi-condition search modes, including single search, combined search and advanced search; and the public opinion report provides monthly, weekly and daily statistical analysis reports on public opinion.
The big data platform for realizing tax public opinion analysis and full text retrieval has the following advantages:
The platform fully integrates internal tax data with Internet tax public opinion big data and enables tax departments to grasp in advance the public opinion information on the Internet about taxpayers, tax authority units, policies and regulations, the various tax types and so on, thereby improving the ability of tax authorities to monitor and respond to public opinion; it is highly practical, widely applicable and easy to popularize.
Drawings
FIG. 1 is a schematic diagram of the overall structure of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific examples.
As shown in FIG. 1, the invention provides a big data platform for realizing tax public opinion analysis and full-text retrieval, which comprises, from bottom to top, a basic data layer, a big data acquisition layer, a big data storage layer, a big data analysis layer, a big data index layer, a big data search layer and a big data application layer, wherein: the basic data layer holds the basic data sources related to public opinion in the tax industry; the big data acquisition layer provides the user with public opinion acquisition modes to complete information acquisition for the basic data layer; the big data analysis layer processes, analyzes and mines the big data collected by the big data acquisition layer and then sends it to the big data storage layer; the big data storage layer stores the public opinion webpage big data and the basic data; the big data index layer builds an inverted index over the data stored in the big data storage layer and provides the big data search layer with a data source for fast searching; the big data search layer provides a full-text search engine for the big data; and the big data application layer puts the retrieved tax public opinion big data to effective use.
The basic data sources in the basic data layer comprise a taxpayer list library, a tax authority unit library, a policy and regulation library and a tax type library; these sources provide the basis on which the whole platform performs targeted collection of online tax public opinion information.
The acquisition modes provided by the big data acquisition layer comprise targeted acquisition, a web crawler and an acquisition rule configuration module, wherein targeted acquisition refers to the targeted collection of basic data; the web crawler captures information from public opinion websites and is a deep web crawler that crawls seed websites; and the acquisition rule configuration module provides the function of configuring basic webpage acquisition rules.
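The interplay of seed websites, crawler and acquisition rules can be illustrated with a minimal sketch. It is not the platform's crawler: it performs a plain depth-limited, breadth-first crawl rather than deep-web crawling, and the names CollectionRule, crawl and the in-memory SITE table are hypothetical, chosen so that the example runs without network access.

```python
from collections import deque
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


@dataclass
class CollectionRule:
    """Hypothetical acquisition rule: which hosts to follow and how deep."""
    allowed_hosts: frozenset
    max_depth: int = 2


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, rule, fetch):
    """Breadth-first crawl from the seeds; fetch(url) returns HTML or None."""
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    pages = {}
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        if depth >= rule.max_depth:
            continue  # the rule forbids following links any deeper
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)
            if urlparse(target).hostname in rule.allowed_hosts and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return pages


# In-memory stand-in for the web (assumption, for illustration only).
SITE = {
    "http://tax.example/": '<a href="/news/1">n1</a> <a href="http://other.example/x">x</a>',
    "http://tax.example/news/1": '<a href="/news/2">n2</a>',
    "http://tax.example/news/2": "<p>deep page</p>",
}
rule = CollectionRule(allowed_hosts=frozenset({"tax.example"}), max_depth=1)
pages = crawl(["http://tax.example/"], rule, SITE.get)
print(sorted(pages))
```

With max_depth set to 1, the seed page and the pages it links to on the allowed host are collected, while the off-site link and the deeper page are skipped.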
The big data analysis layer comprises three parts: preprocessing, word segmentation and sentiment analysis. Preprocessing removes tags from, extracts text from and eliminates noise in the collected webpages; word segmentation performs word segmentation, stop-word removal, entity recognition and word vector construction on the text; and sentiment analysis analyzes the text through sentiment information extraction and sentiment information classification and judges the sentiment tendency of the text, namely positive, negative or neutral.
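The three stages of the analysis layer can be sketched as a small pipeline. This is a simplified illustration under stated assumptions: English whitespace-style tokenization stands in for Chinese word segmentation, a tiny hand-written word list stands in for a trained classifier, and entity recognition and word vectors are omitted. All names and word lists below are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


# Illustrative word lists (assumptions, not taken from the platform).
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "was", "about"}
POSITIVE = {"praised", "efficient", "helpful"}
NEGATIVE = {"evasion", "complaint", "fraud"}


def preprocess(html):
    """Remove tags, drop script/style noise and normalize whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def tokenize(text):
    """Split into lowercase word tokens and remove stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tendency(tokens):
    """Lexicon-based three-way label: positive, negative or neutral."""
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


page = "<html><style>p{}</style><p>The tax office was praised</p><script>var x;</script></html>"
text = preprocess(page)
tokens = tokenize(text)
print(text, tokens, tendency(tokens))
```

In a real deployment each stage would be replaced by a dedicated tool; the sketch only shows how the output of one stage feeds the next.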
The big data storage layer comprises a public opinion webpage information base and a basic resource base. The public opinion webpage information base stores unstructured public opinion information, comprising the collected original webpages, processed webpages, pictures, videos, style files and extracted webpage content. The basic resource base supplements the information of the four basic data sources; the supplementary information comprises taxpayers' official websites and news, tax authority units' official websites and news, the original-text webpages and file information of policies and regulations, and network resources such as tax encyclopedias.
The big data index layer comprises incremental indexing, full indexing, index addition and deletion, and index updating, wherein incremental indexing builds an index incrementally for a data source; full indexing rebuilds the entire index for a data source; index addition and deletion adds entries to and deletes entries from the index; and index updating updates an index that has already been built.
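The four index operations can be illustrated with a toy in-memory inverted index. This is a sketch of the data structure only, assuming whitespace tokenization; the class name InvertedIndex and its methods are hypothetical and do not correspond to any particular search engine's API.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: maps each term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.docs = {}                    # document id -> original text

    @staticmethod
    def _terms(text):
        return set(text.lower().split())

    def add(self, doc_id, text):
        """Incremental indexing: index one document, replacing any old version."""
        if doc_id in self.docs:
            self.delete(doc_id)
        self.docs[doc_id] = text
        for term in self._terms(text):
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        """Remove a document and clean up postings lists that become empty."""
        text = self.docs.pop(doc_id, None)
        if text is None:
            return
        for term in self._terms(text):
            ids = self.postings[term]
            ids.discard(doc_id)
            if not ids:
                del self.postings[term]

    def update(self, doc_id, text):
        """Update an already indexed document (add() replaces the old version)."""
        self.add(doc_id, text)

    def rebuild(self, source):
        """Full indexing: discard the index and rebuild it from the data source."""
        self.postings.clear()
        self.docs.clear()
        for doc_id, text in source.items():
            self.add(doc_id, text)

    def lookup(self, term):
        """Return the ids of all documents containing the term."""
        return set(self.postings.get(term.lower(), ()))


idx = InvertedIndex()
idx.rebuild({1: "vat refund delayed", 2: "vat policy"})  # full index
idx.add(3, "income tax policy")                          # incremental index
idx.update(1, "refund paid")                             # index update
print(idx.lookup("vat"), idx.lookup("policy"))
```

Incremental indexing touches only the postings of the new document, whereas a full rebuild reprocesses the whole data source, which is why the two are offered as separate operations.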
The big data search layer comprises a full-text search module, a synonym search module, a grouping statistics module and a fuzzy matching module, wherein the full-text search module matches a query against every character and every word of the indexed content and returns the query results; the synonym search module, when a word is queried, simultaneously queries the synonyms of that word; the grouping statistics module provides grouped statistics of the search results by type or by another specified classification; and the fuzzy matching module segments the search phrase into words and then performs the query.
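The behaviour of the four search modules can be sketched over a handful of in-memory documents. The sketch makes several assumptions: the documents, the synonym table and the function names are hypothetical, and fuzzy matching is interpreted here as segmenting the phrase and accepting a document that matches any of the resulting words.

```python
from collections import Counter

# Illustrative documents and synonym table (assumptions).
DOCS = [
    {"id": 1, "type": "taxpayer", "text": "company fined for tax evasion"},
    {"id": 2, "type": "policy", "text": "new vat policy announced"},
    {"id": 3, "type": "policy", "text": "value-added tax refund rules"},
    {"id": 4, "type": "tax_type", "text": "income tax evasion case"},
]
SYNONYMS = {"vat": {"value-added"}}


def terms_of(text):
    return set(text.lower().split())


def search(query, expand_synonyms=False, match_all=True):
    """Segment the query into words and match them against each document.

    expand_synonyms: also accept the synonyms of each query word.
    match_all: True requires every query word; False accepts any of them,
               which is how fuzzy matching is modelled in this sketch.
    """
    groups = []
    for word in query.lower().split():
        alternatives = {word}
        if expand_synonyms:
            alternatives |= SYNONYMS.get(word, set())
        groups.append(alternatives)
    combine = all if match_all else any
    return [d for d in DOCS if combine(g & terms_of(d["text"]) for g in groups)]


def group_counts(results, field="type"):
    """Grouped statistics: number of hits per value of the given field."""
    return Counter(d[field] for d in results)


def ids(results):
    return [d["id"] for d in results]


print(ids(search("vat")))
print(ids(search("vat", expand_synonyms=True)))
print(ids(search("tax evasion")))
print(ids(search("tax evasion", match_all=False)))
print(group_counts(search("vat", expand_synonyms=True)))
```

A production search engine would run these queries against the inverted index rather than scanning every document; the linear scan keeps the example short.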
The big data application layer comprises classified public opinion, negative public opinion, public opinion search and public opinion reports, wherein classified public opinion is based on the four basic types of public opinion information (taxpayers, authority units, policies and regulations, and tax types), each subdivided into positive, negative and neutral subclasses; negative public opinion provides real-time monitoring and tracking of negative public opinion; public opinion search provides users with multi-condition search modes, including single search, combined search and advanced search; and the public opinion report provides monthly, weekly and daily statistical analysis reports on public opinion.
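Combining the four basic types with the three tendency labels yields twelve categories, and the periodic reports amount to counting items per category and period. The following sketch illustrates both under stated assumptions: the record layout, the category names and the use of ISO week numbers for weekly reports are hypothetical choices, not details given by the platform.

```python
from collections import Counter
from datetime import date
from itertools import product

BASE_TYPES = ("taxpayer", "authority", "policy", "tax_type")
TENDENCIES = ("positive", "negative", "neutral")
CATEGORIES = list(product(BASE_TYPES, TENDENCIES))  # 4 x 3 = 12 categories


def period_key(day, period):
    """Map a date to its daily, weekly (ISO week) or monthly bucket."""
    if period == "daily":
        return day.isoformat()
    if period == "weekly":
        year, week, _ = day.isocalendar()
        return f"{year}-W{week:02d}"
    if period == "monthly":
        return f"{day.year}-{day.month:02d}"
    raise ValueError(f"unknown period: {period}")


def report(items, period):
    """Count items per (period bucket, basic type, tendency)."""
    return Counter((period_key(i["date"], period), i["type"], i["tendency"]) for i in items)


def negative_items(items):
    """Items that negative public opinion monitoring would surface."""
    return [i for i in items if i["tendency"] == "negative"]


# Illustrative records (assumptions).
ITEMS = [
    {"date": date(2024, 1, 1), "type": "taxpayer", "tendency": "negative"},
    {"date": date(2024, 1, 3), "type": "taxpayer", "tendency": "negative"},
    {"date": date(2024, 1, 8), "type": "policy", "tendency": "positive"},
]
weekly = report(ITEMS, "weekly")
monthly = report(ITEMS, "monthly")
print(len(CATEGORIES), weekly, monthly)
```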
In an actual design, the distributed Internet data collection platform Nutch is used as the basis for secondary development and integration. Tax public opinion webpage information is collected in two modes: targeted collection associated with the taxpayer list library, the policy and regulation library and the tax type library, and crawler-based whole-network searching and matching. The comprehensive tax public opinion big data collected in this way is stored in an HBase distributed database cluster. The Hadoop platform then performs batch text preprocessing on the webpage document data set, after which open-source natural language processing tools integrated in Java, such as OpenNLP, FudanNLP, LingPipe, IKAnalyzer and word2vec, are combined with the Mahout algorithm library to complete text word segmentation and sentiment analysis, and each piece of public opinion webpage information is labeled as positive, neutral or negative. Finally, the SolrCloud distributed full-text search engine is used to build the index and provide a full-text search engine with which users perform full-text retrieval.
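As an illustration of how a client might query such a full-text search engine, the sketch below only builds the URL of a request to Solr's standard select handler, using the common q, fq, rows, wt, facet and facet.field parameters. The host, the collection name "opinions" and the field names content, tendency and type are assumptions; the embodiment does not specify a schema, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(base_url, collection, keywords, tendency=None, rows=10):
    """Build a Solr select URL with a keyword query and per-type facet counts.

    The field names "content", "tendency" and "type" are hypothetical.
    """
    params = [
        ("q", f"content:({keywords})"),  # full-text query on the page content
        ("rows", str(rows)),
        ("wt", "json"),
        ("facet", "true"),               # grouped statistics via faceting
        ("facet.field", "type"),
    ]
    if tendency:
        params.append(("fq", f"tendency:{tendency}"))  # filter by label
    return f"{base_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


url = build_select_url("http://localhost:8983/solr", "opinions", "vat refund", tendency="negative")
print(url)
```

Faceting on the type field is one way to obtain the grouped statistics described for the search layer, and the filter query restricts results to a single tendency label.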
The above embodiment is only a specific case of the present invention, and the protection scope of the present invention includes but is not limited to it. Any suitable change or substitution made by a person skilled in the art that is consistent with the claims of the present invention for a big data platform for realizing tax public opinion analysis and full-text retrieval shall fall within the protection scope of the present invention.